
Global convergence and geometric characterization of slow to fast weight evolution in neural network training for classifying linearly non-separable data
1. Department of Mathematics, University of California, Irvine, CA 92697, USA
2. Department of Mathematics and Statistics, State University of New York at Albany, Albany, NY 12222, USA
3. Department of Mathematics, University of California, Irvine, CA 92697, USA
In this paper, we study the dynamics of gradient descent in learning neural networks for classification problems. Unlike existing works, we consider the linearly non-separable case where the training data of different classes lie in orthogonal subspaces. We show that when the network has a sufficient (but not exceedingly large) number of neurons, (1) the corresponding minimization problem has a desirable landscape in which all critical points are global minima with perfect classification; and (2) gradient descent is guaranteed to converge to a global minimum. Moreover, we discover a geometric condition on the network weights under which the weight evolution transitions from a slow phase of weight-direction spreading to a fast phase of weight convergence. The geometric condition is that the convex hull of the weights, projected onto the unit sphere, contains the origin.
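To make the geometric condition concrete, the sketch below tests whether the origin lies in the convex hull of the unit-normalized weight vectors by solving a small feasibility linear program (convex coefficients that sum to one and average the points to zero). The helper name `origin_in_hull` and the use of `scipy.optimize.linprog` are illustrative choices for this page, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def origin_in_hull(W):
    """Return True iff 0 is in the convex hull of the rows of W
    projected onto the unit sphere (illustrative helper, not from the paper).

    W : (m, d) array of m weight vectors in R^d.
    Feasibility test: exist lambda_i >= 0, sum lambda_i = 1,
    with sum_i lambda_i * w_i / ||w_i|| = 0.
    """
    U = W / np.linalg.norm(W, axis=1, keepdims=True)   # project weights onto the sphere
    m, d = U.shape
    # Equality constraints: d rows force the convex combination to equal 0,
    # the last row forces the coefficients to sum to 1.
    A_eq = np.vstack([U.T, np.ones((1, m))])
    b_eq = np.append(np.zeros(d), 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs")
    return res.status == 0  # 0: a feasible (hence optimal) point was found

# Four spread-out directions in R^2 surround the origin; two nearby ones do not.
print(origin_in_hull(np.array([[1., 0.], [0., 1.], [-1., 0.], [0., -1.]])))  # True
print(origin_in_hull(np.array([[1., 0.1], [1., -0.1]])))                     # False
```

In this reading, the slow phase corresponds to the weight directions spreading out until this check first succeeds, after which convergence is fast.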
# of Neurons | Random Init. | Half Space Init.
6 | 578.90 | 672.41
8 | 423.96 | 582.16
10 | 313.29 | 550.19
12 | 242.72 | 517.26
14 | 183.53 | 500.42
16 | 141.00 | 487.42
18 | 126.52 | 478.25
20 | 102.09 | 412.46
22 | 90.65 | 454.08
24 | 82.93 | 416.82
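The setting behind this table (two classes supported on orthogonal subspaces, a one-hidden-layer ReLU network trained by plain gradient descent) can be reproduced in a few lines. The sketch below is an assumption-laden reconstruction, not the paper's exact protocol: the data dimensions, fixed ±1 outer weights, logistic loss, learning rate, and initialization scale are illustrative, and only random initialization is shown (the paper's "half space" initialization is not specified on this page).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: class +1 in span(e1, e2), class -1 in span(e3, e4) of R^4
# (illustrative construction of the orthogonal-subspace setting).
d, n = 4, 200
Xp = np.zeros((n, d)); Xp[:, :2] = rng.normal(size=(n, 2))
Xm = np.zeros((n, d)); Xm[:, 2:] = rng.normal(size=(n, 2))
X = np.vstack([Xp, Xm])
y = np.concatenate([np.ones(n), -np.ones(n)])

# One-hidden-layer ReLU network f(x) = sum_j v_j * relu(w_j . x),
# with fixed +/-1 outer weights and randomly initialized hidden weights.
m = 16
W = 0.1 * rng.normal(size=(m, d))
v = np.where(np.arange(m) % 2 == 0, 1.0, -1.0)

lr = 0.05
for step in range(2001):
    pre = X @ W.T                        # (N, m) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU
    f = act @ v                          # network output
    margin = y * f
    loss = np.mean(np.log1p(np.exp(-margin)))   # logistic loss
    g = -y / (1.0 + np.exp(margin))      # d(loss)/d(f) per sample
    # Backpropagate through the ReLU to the hidden weights.
    gW = ((g[:, None] * (pre > 0)) * v).T @ X / len(y)
    W -= lr * gW
    if step % 500 == 0:
        acc = np.mean(np.sign(f) == y)
        print(f"step {step:4d}  loss {loss:.4f}  acc {acc:.3f}")
```

With enough neurons the run drives the training accuracy to 1, consistent with the global-convergence result; tracking when `origin_in_hull(W)` from the sketch above first returns True locates the slow-to-fast transition on a given run.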