doi: 10.3934/mcrf.2022021
Online First


A geometric approach of gradient descent algorithms in linear neural networks

1. Laboratoire des Signaux et Systèmes, CentraleSupélec, Université Paris-Saclay, France
2. EIC, Huazhong University of Science and Technology, Wuhan, China
3. University Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, 38000 Grenoble, France

* Corresponding author: Yacine Chitour

Received: April 2021. Revised: March 2022. Early access: April 2022.

In this paper, we propose a geometric framework to analyze the convergence properties of gradient descent trajectories in the context of linear neural networks. We translate a well-known empirical observation about linear neural networks into a conjecture, which we call the overfitting conjecture, stating that, for almost all training data and initial conditions, the trajectory of the corresponding gradient descent system converges to a global minimum. This would imply that, for linear neural networks with an arbitrary number of hidden layers, the solution reached by vanilla gradient descent is equivalent to the least-squares estimate. Building on a key invariance property induced by the network structure, we first establish the convergence of gradient descent trajectories to critical points of the square loss function for linear networks of arbitrary depth. Our second result is a proof of the overfitting conjecture for single-hidden-layer linear networks, based on the notion of normal hyperbolicity and valid under a generic condition on the training data (i.e., one holding for almost all training data).
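
To fix ideas, the underlying optimization problem can be summarized schematically as follows (a sketch in notation chosen here for illustration, which may differ from the paper's conventions): the weights $W_1,\dots,W_{H+1}$ of an $H$-hidden-layer linear network are trained on data $(X,Y)$ by following the negative gradient of the square loss, and the "balancedness" quantities in the last line, well known in the deep linear network literature, are conserved along gradient flow trajectories and are of the same flavor as the invariance property invoked above.

```latex
% Schematic setup: square loss of an H-hidden-layer linear network on data (X, Y),
% its gradient flow, and the balancedness quantities conserved along trajectories.
\begin{align}
  L(W_1,\dots,W_{H+1}) &= \tfrac{1}{2}\,\bigl\| W_{H+1} W_H \cdots W_1 X - Y \bigr\|_F^2, \\
  \dot W_j &= -\,\nabla_{W_j} L(W_1,\dots,W_{H+1}), \qquad j = 1,\dots,H+1, \\
  \frac{d}{dt}\Bigl( W_{j+1}^{\top} W_{j+1} - W_j W_j^{\top} \Bigr) &= 0, \qquad j = 1,\dots,H.
\end{align}
```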

Citation: Yacine Chitour, Zhenyu Liao, Romain Couillet. A geometric approach of gradient descent algorithms in linear neural networks. Mathematical Control and Related Fields, doi: 10.3934/mcrf.2022021

Figure 1.  Illustration of an $ H $-hidden-layer linear neural network
Figure 2.  A geometric "vision" of the loss landscape
Figure 3.  Visual representation of normal hyperbolicity
Figure 4.  Visual representation of normal hyperbolicity and sample trajectories (in green)
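
As a purely numerical illustration of the single-hidden-layer case discussed above (a minimal sketch with arbitrary dimensions, initialization scale, step size, and iteration count; it is not the authors' code or experiment), the following script runs vanilla gradient descent on the square loss of a one-hidden-layer linear network and compares the learned end-to-end map with the least-squares estimator, in the spirit of the overfitting conjecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: n samples, input dimension d_in, output dimension d_out.
n, d_in, d_hid, d_out = 50, 5, 8, 3
X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, n))

# Single-hidden-layer linear network: x -> W2 @ W1 @ x, small random initialization.
W1 = 0.1 * rng.standard_normal((d_hid, d_in))
W2 = 0.1 * rng.standard_normal((d_out, d_hid))

lr = 1e-3  # arbitrary step size, small enough for stability in this example
for _ in range(100_000):
    E = W2 @ W1 @ X - Y        # residual of the end-to-end map
    grad_W2 = E @ X.T @ W1.T   # dL/dW2 for L = 0.5 * ||W2 W1 X - Y||_F^2
    grad_W1 = W2.T @ E @ X.T   # dL/dW1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

# Least-squares estimator of the end-to-end linear map (X X^T is invertible a.s.).
W_ls = Y @ X.T @ np.linalg.inv(X @ X.T)

print("||W2 W1 - W_ls||_F =", np.linalg.norm(W2 @ W1 - W_ls))
```

Since the hidden layer here is wider than both the input and output dimensions, the end-to-end product can represent any linear map, so the printed Frobenius distance should be numerically small when the trajectory indeed reaches a global minimum of the square loss.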