# American Institute of Mathematical Sciences

February 2021, 15(1): 41-62. doi: 10.3934/ipi.2020077

## Global convergence and geometric characterization of slow to fast weight evolution in neural network training for classifying linearly non-separable data

1. Department of Mathematics, University of California, Irvine, CA 92697, USA
2. Department of Mathematics and Statistics, State University of New York at Albany, Albany, NY 12222, USA
3. Department of Mathematics, University of California, Irvine, Irvine, CA 92697, USA

* Corresponding author: Ziang Long

Received November 2019; Revised October 2020; Published December 2020

In this paper, we study the dynamics of gradient descent in learning neural networks for classification problems. Unlike existing works, we consider the linearly non-separable case where the training data of different classes lie in orthogonal subspaces. We show that when the network has a sufficient (but not exceedingly large) number of neurons, (1) the corresponding minimization problem has a desirable landscape where all critical points are global minima with perfect classification; (2) gradient descent is guaranteed to converge to the global minima. Moreover, we identify a geometric condition on the network weights such that, once it is satisfied, the weight evolution transitions from a slow phase of weight direction spreading to a fast phase of weight convergence. The condition states that the convex hull of the weights projected onto the unit sphere contains the origin.
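The geometric condition in the abstract can be illustrated with a small check. The sketch below is not the authors' code: it covers only the planar case $d = 2$, and the function name `hull_contains_origin_2d` and the tolerance `tol` are assumptions introduced for illustration. It relies on the fact that points on the unit circle contain the origin in their convex hull exactly when no angular gap between consecutive directions exceeds $\pi$.

```python
import math


def hull_contains_origin_2d(weights, tol=1e-12):
    """Return True if the origin lies in the convex hull of the weights
    after each one is projected onto the unit circle (d = 2 only).

    Hypothetical helper for illustration; not taken from the paper.
    """
    angles = []
    for x, y in weights:
        if math.hypot(x, y) == 0.0:
            raise ValueError("a zero weight vector has no direction")
        # The angle of w is also the angle of its projection w / |w|.
        angles.append(math.atan2(y, x))
    if not angles:
        return False
    angles.sort()
    # Gaps between consecutive directions, plus the wrap-around gap.
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(2.0 * math.pi - (angles[-1] - angles[0]))
    # If some gap exceeds pi, all directions fit in an open half-plane,
    # so the origin is outside the hull; otherwise it is inside.
    return max(gaps) <= math.pi + tol


# Directions spread around the circle: the condition holds.
print(hull_contains_origin_2d([(1, 0), (-1, 2), (-1, -2)]))
# All weights in the upper half-plane: the condition fails.
print(hull_contains_origin_2d([(1, 1), (0, 3), (-2, 1)]))
```

Under this reading, weights that all lie in an open half-plane fail the check, while weights whose directions have spread around the circle pass it. For $d > 2$ the same question becomes a linear feasibility problem, which this sketch does not attempt.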

Citation: Ziang Long, Penghang Yin, Jack Xin. Global convergence and geometric characterization of slow to fast weight evolution in neural network training for classifying linearly non-separable data. Inverse Problems & Imaging, 2021, 15 (1) : 41-62. doi: 10.3934/ipi.2020077
Geometric Condition in Lemma 5.2 ($d = 3$)
2-dim section of $\mathbb{R}^d$ spanned by $\tilde{{\mathit{\boldsymbol{w}}}}_1$ and ${\mathit{\boldsymbol{n}}}$
Number of iterations to convergence vs. $\theta$, the angle between subspaces $V_1$ and $V_2$
Left: convergent iterations vs. number of neurons ($d = 2$). Right: histogram of norm of weights: $\max\limits_{t}\left| {{\mathit{\boldsymbol{W}}}}^{t}\right|$ ($d = 2$ and $k = 4$)
Dynamics of weights: $\tilde{{\mathit{\boldsymbol{w}}}}_j$ and ${\mathit{\boldsymbol{u}}}_j$
Left: Slow-to-fast transition during LeNet [16] training on the MNIST dataset. Right: 2D projections of MNIST features from a trained convolutional neural network [18]. The 10 classes are color coded; the feature points cluster near linearly independent subspaces
Top row: Projections onto $\mathcal{S}^2$ (inside randomly selected 3D subspaces) of weight vectors in the first fully connected layer of a trained LeNet. Bottom row: Projections onto $\mathcal{S}^2$ (inside randomly selected 3D subspaces) of weight vectors and their convex hull in the second fully connected layer of a trained LeNet
Iterations taken ($\text{mean}\pm\text{std}$) to convergence with random and half space initializations

| # of Neurons ($2k$) | Random Init. | Half Space Init. |
|---|---|---|
| 6 | 578.90$\pm$205.43 | 672.41$\pm$226.53 |
| 8 | 423.96$\pm$190.91 | 582.16$\pm$200.81 |
| 10 | 313.29$\pm$178.67 | 550.19$\pm$180.59 |
| 12 | 242.72$\pm$178.94 | 517.26$\pm$172.46 |
| 14 | 183.53$\pm$108.60 | 500.42$\pm$215.87 |
| 16 | 141.00$\pm$80.66 | 487.42$\pm$220.48 |
| 18 | 126.52$\pm$62.07 | 478.25$\pm$202.71 |
| 20 | 102.09$\pm$32.32 | 412.46$\pm$195.92 |
| 22 | 90.65$\pm$28.01 | 454.08$\pm$203.00 |
| 24 | 82.93$\pm$26.76 | 416.82$\pm$216.58 |
