# American Institute of Mathematical Sciences

doi: 10.3934/mcrf.2022021
Online First

## A geometric approach of gradient descent algorithms in linear neural networks

1 Laboratoire des Signaux et Systèmes, CentraleSupélec, Université Paris-Saclay, France
2 EIC, Huazhong University of Science and Technology, Wuhan, China
3 University Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, 38000 Grenoble, France

* Corresponding author: Yacine Chitour

Received April 2021; Revised March 2022; Early access April 2022

In this paper, we propose a geometric framework to analyze the convergence properties of gradient descent trajectories in linear neural networks. We translate a well-known empirical observation about linear neural networks into a conjecture, which we call the overfitting conjecture: for almost all training data and initial conditions, the trajectory of the corresponding gradient descent system converges to a global minimum. This would imply that, for linear neural networks with an arbitrary number of hidden layers, the solution reached by vanilla gradient descent is equivalent to the least-squares estimate. Building on a key invariance property induced by the network structure, we first establish that gradient descent trajectories converge to critical points of the square loss function for linear networks of arbitrary depth. Our second result is a proof of the overfitting conjecture for single-hidden-layer linear networks. The argument relies on the notion of normal hyperbolicity and holds under a generic property of the training data (i.e., one that holds for almost all training data).
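The claim that gradient descent on a single-hidden-layer linear network reaches the least-squares solution can be checked numerically on a toy problem. The sketch below is only an illustration, not the authors' construction: the function names, dimensions, step size, initialization scale, and synthetic data are all assumptions chosen for the example. It uses discrete gradient steps rather than a continuous-time gradient system, and the hidden width is set to at least min(input dim, output dim) so that the product of the two weight matrices can represent any linear map.

```python
import numpy as np


def train_linear_network(X, Y, hidden_dim=4, lr=0.02, steps=20000, seed=1):
    """Plain gradient descent on L(W1, W2) = ||W2 W1 X - Y||_F^2 / (2 m).

    X has shape (d_x, m), Y has shape (d_y, m). All hyperparameters are
    illustrative assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    d_x, m = X.shape
    d_y = Y.shape[0]
    # Small random initialization (a generic initial condition).
    W1 = 0.1 * rng.standard_normal((hidden_dim, d_x))
    W2 = 0.1 * rng.standard_normal((d_y, hidden_dim))
    for _ in range(steps):
        residual = (W2 @ W1 @ X - Y) / m
        # Compute both gradients before updating either factor.
        grad_W2 = residual @ (W1 @ X).T
        grad_W1 = W2.T @ residual @ X.T
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2


def least_squares_map(X, Y):
    """Minimizer of ||W X - Y||_F over all matrices W, via the pseudoinverse."""
    return Y @ np.linalg.pinv(X)


# Synthetic training data: a fixed linear map plus small noise (assumed setup).
data_rng = np.random.default_rng(0)
X = data_rng.standard_normal((3, 50))
A = np.array([[2.0, 0.0, 1.0],
              [0.0, -1.5, 0.5]])
Y = A @ X + 0.1 * data_rng.standard_normal((2, 50))

W1, W2 = train_linear_network(X, Y)
W_ls = least_squares_map(X, Y)

# Largest entrywise difference between the learned end-to-end map and
# the least-squares estimate.
gap = np.max(np.abs(W2 @ W1 - W_ls))
print("max |W2 W1 - W_ls| =", gap)
```

On this toy problem the end-to-end map `W2 @ W1` should agree with `W_ls` up to numerical precision. With a hidden width below min(input dim, output dim), the network can only represent rank-constrained maps, so the comparison with unconstrained least squares would no longer apply directly.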

Citation: Yacine Chitour, Zhenyu Liao, Romain Couillet. A geometric approach of gradient descent algorithms in linear neural networks. Mathematical Control and Related Fields, doi: 10.3934/mcrf.2022021

Figures:
Illustration of an $H$-hidden-layer linear neural network
A geometric "vision" of the loss landscape
Visual representation of normal hyperbolicity
Visual representation of normal hyperbolicity and sample trajectories (in green)
