
Adaptive random Fourier features with Metropolis sampling
1. Department of Mathematics, Royal Institute of Technology, 100 44 Stockholm, Sweden
2. H-Ai AB, Grevgatan 23, 114 53 Stockholm, Sweden
3. Department of Mathematical Sciences, University of Delaware, Newark, DE 19716, USA
The supervised learning problem of determining a one-hidden-layer neural network approximation $ \mathbb{R}^d\ni x\mapsto\sum_{k = 1}^K\hat\beta_k e^{{{\mathrm{i}}}\omega_k\cdot x} $ is studied as a random Fourier features algorithm. The Fourier features, i.e., the frequencies $ \omega_k\in\mathbb{R}^d $, are sampled using an adaptive Metropolis sampler. The Metropolis test accepts a proposal frequency $ \omega_k' $, with corresponding amplitude $ \hat\beta_k' $, with probability $ \min\big\{1, (|\hat\beta_k'|/|\hat\beta_k|)^{\gamma}\big\} $, where the positive parameter $ {\gamma} $ is determined by minimizing the approximation error for a given computational work. This adaptive, non-parametric stochastic method leads asymptotically, as $ K\to\infty $, to equidistributed amplitudes $ |\hat\beta_k| $, analogous to deterministic adaptive algorithms for differential equations. The equidistributed amplitudes are shown to correspond asymptotically to the optimal density for independent samples in random Fourier features methods. Numerical evidence is provided to demonstrate the approximation properties and efficiency of the proposed algorithm, which is tested both on synthetic data and on a real-world high-dimensional benchmark.
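To make the sampling step concrete, the following is a minimal Python sketch of one possible implementation of the adaptive random Fourier features loop described above: the amplitudes are fitted by regularised least squares for the current frequencies, a Gaussian random-walk proposal is made for every frequency, and each proposal is accepted with probability $ \min\{1, (|\hat\beta_k'|/|\hat\beta_k|)^{\gamma}\} $. The function names, the regularisation scaling, and the joint refit of all proposed amplitudes are assumptions made for this illustration; they are not the paper's Algorithm 1, which includes further details (e.g. adaptation of the proposal covariance) omitted here.

```python
import numpy as np


def fit_amplitudes(omega, x, y, lam):
    """Regularised least-squares fit of the amplitudes beta for fixed frequencies omega.

    Solves min_beta ||S beta - y||^2 + lam * N * ||beta||^2 with S[n, k] = exp(i omega_k . x_n).
    (The scaling of the regularisation term is an assumption made for this sketch.)
    """
    S = np.exp(1j * x @ omega.T)                      # (N, K) random Fourier feature matrix
    N, K = S.shape
    A = S.conj().T @ S + lam * N * np.eye(K)
    b = S.conj().T @ y
    return np.linalg.solve(A, b)


def adaptive_rff_metropolis(x, y, K, M, delta, gamma, lam=0.1, rng=None):
    """Sketch of adaptive random Fourier features with Metropolis-sampled frequencies.

    Each frequency receives a Gaussian random-walk proposal and is accepted with
    probability min{1, (|beta'_k| / |beta_k|)^gamma}, as in the abstract above.
    Parameter names (K, M, delta, gamma, lam) mirror the tables below.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, d = x.shape
    omega = rng.standard_normal((K, d))               # initial frequencies
    beta = fit_amplitudes(omega, x, y, lam)

    for _ in range(M):
        # Random-walk proposal for all K frequencies at once.
        omega_prop = omega + delta * rng.standard_normal((K, d))
        beta_prop = fit_amplitudes(omega_prop, x, y, lam)

        # Frequency-wise Metropolis test on the amplitude ratio.
        ratio = (np.abs(beta_prop) / np.maximum(np.abs(beta), 1e-300)) ** gamma
        accept = rng.random(K) < np.minimum(1.0, ratio)
        omega[accept] = omega_prop[accept]

        # Refit the amplitudes for the updated set of frequencies.
        beta = fit_amplitudes(omega, x, y, lam)

    return omega, beta


# Example use on synthetic 1-d data, f(x) = exp(-x^2 / 2):
# rng = np.random.default_rng(0)
# x = rng.standard_normal((1000, 1)); y = np.exp(-x[:, 0] ** 2 / 2)
# omega, beta = adaptive_rff_metropolis(x, y, K=64, M=200, delta=2.4 ** 2, gamma=1.0)  # gamma = 3d - 2 for d = 1
# y_hat = np.real(np.exp(1j * x @ omega.T) @ beta)
```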






Case 1: a regularised step function (regression)
- Purpose: find $p$ such that the constant $\mathbb{E}_\omega\big[\frac{|\hat f(\omega)|^2}{(2\pi)^{d}p^2(\omega)}\big]$ does not become large (see the note below the table)
- Target: $f(x) = \text{Si}\left(\frac{x}{a}\right)e^{-\frac{x^2}{2}}$, $a = 10^{-3}$
- $d = 1$, $K = 2^i$, $i = 1, 2, \ldots, 11$

Case 2: a high dimensional function (regression)
- Purpose: study the ability of Alg. 1 to approximate a high dimensional function
- Target: $f(x) = \text{Si}\left(\frac{x_1}{a}\right)e^{-\frac{|x|^2}{2}}$, $a = 10^{-1}$
- $d = 5$, $K = 2^i$, $i = 1, 2, \ldots, 10$

|  | Exp. 1 (Case 1) | Exp. 2 (Case 1) | Exp. 3 (Case 1) | Exp. 4 (Case 1) | Exp. 5 (Case 1) | Exp. 1 (Case 2) | Exp. 4 (Case 2) |
|---|---|---|---|---|---|---|---|
| Method | Alg. 1 | Alg. 2 | RFF$^{1}$ | SGM | Alg. 1 tabular | Alg. 1 | SGM |
| $N$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ |
| $\tilde{N}$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ |
| $\gamma$ | $3d-2$ | $3d-2$ | | | $3d-2$ | $3d-2$ | |
| $\lambda$ | $0.1$ | $0.1$ | $0.1$ | $0$ | $0.1$ | $0.1$ | $0$ |
| $M$ | $10^3$ | $5000$ | N/A | $10^7$ | $10^4$ | $2.5\times 10^3$ | $10^7$ |
| $\bar{M}$ | $10$ | $10$ | $10$ | $10$ | $10$ | $10$ | $10$ |
| $\delta$ | $2.4^2/d$ | $0.1$ | | | $2.4^2/d$ | $\frac{2.4^2}{10d}$ | |
| $\Delta t$ | | | | $\mathtt{1.5e-4}$ | | | $\mathtt{3.0e-4}$ |
| $t_0$ | $M/10$ | | | | | | |
| $\omega_{\text{max}}$ | $\infty$ | | | | | | |
| $m$ | $10$ | $50$ | | | $100$ | $25$ | |
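A note on the Case 1 purpose above. Assuming the expectation is taken over $\omega\sim p$, the constant is bounded below by a standard importance-sampling argument (written out here for reference; it is not quoted verbatim from the paper):

$$ \mathbb{E}_\omega\Big[\frac{|\hat f(\omega)|^2}{(2\pi)^{d}\,p^2(\omega)}\Big] = \frac{1}{(2\pi)^{d}}\int_{\mathbb{R}^d}\frac{|\hat f(\omega)|^2}{p(\omega)}\,\mathrm{d}\omega \;\ge\; \frac{1}{(2\pi)^{d}}\left(\int_{\mathbb{R}^d}|\hat f(\omega)|\,\mathrm{d}\omega\right)^{2}, $$

by the Cauchy–Schwarz inequality applied to $|\hat f|/\sqrt{p}$ and $\sqrt{p}$, with equality exactly when

$$ p(\omega) = p^*(\omega) = \frac{|\hat f(\omega)|}{\int_{\mathbb{R}^d}|\hat f(\omega')|\,\mathrm{d}\omega'}. $$

This $p^*$ is the optimal density for independent samples referred to in the abstract, to which the equidistributed amplitudes $|\hat\beta_k|$ correspond asymptotically.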
Case 3: anisotropic Gaussian function (regression)
- Purpose: computational complexity comparison
- Target: $f(x) = e^{-(32 x_1)^2/2}e^{-(32^{-1} x_2)^2/2}$
- $d = 2$, $K = 256$

Case 4: the MNIST data set (classification)
- Purpose: study the ability of Alg. 1 to work with non-synthetic data
- $d = 784$, $K = 2^i$, $i = 1, 2, \ldots, 13$

|  | Exp. 1 (Case 3) | Exp. 2 (Case 3) | Exp. 4 (Case 3) | Exp. 1 (Case 4) | Exp. 3 (Case 4) |
|---|---|---|---|---|---|
| Method | Alg. 1 | Alg. 2 | SGM | Alg. 1 | RFF$^{2}$ |
| $N$ | $10^4$ | $10^4$ | $3\times 10^7$ | $6\times 10^4$ | $6 \times 10^4$ |
| $\tilde{N}$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ | $10^4$ |
| $\gamma$ | $3d-2$ | $3d-2$ | | $3d-2$ | |
| $\lambda$ | $0.1$ | $0.1$ | $0$ | $0.1$ | $0.1$ |
| $M$ | $10^4$ | $10^4$ | $3\times 10^7$ | $10^2$ | |
| $\bar{M}$ | $1$ | $1$ | $1$ | $1$ | $1$ |
| $\delta$ | $0.5$ | $0.1$ | | $0.1$ | |
| $\Delta t$ | | | $\mathtt{1.5e-3}$ | | |
| $t_0$ | $M/10$ | | | | |
| $\omega_{\text{max}}$ | $\infty$ | | | | |
| $m$ | $100$ | $100$ | | $M+1$ | |
| $K$ | Fixed | Fixed | Adaptive |
|---|---|---|---|
| 2 | 89.97% | 81.65% | 80.03% |
| 4 | 89.2% | 70.65% | 65.25% |
| 8 | 88.98% | 55.35% | 53.44% |
| 16 | 88.69% | 46.77% | 36.42% |
| 32 | 88.9% | 30.43% | 23.52% |
| 64 | 88.59% | 19.73% | 16.98% |
| 128 | 88.7% | 13.6% | 11.13% |
| 256 | 88.09% | 10.12% | 7.99% |
| 512 | 88.01% | 8.16% | 5.93% |
| 1024 | 87.34% | 6.29% | 4.57% |
| 2048 | 86.5% | 4.94% | 3.5% |
| 4096 | 85.21% | 3.76% | 2.74% |
| 8192 | 83.98% | 3.16% | 1.98% |