# American Institute of Mathematical Sciences

doi: 10.3934/mcrf.2022036
Online First

Online First articles are published articles that have not yet been assigned to a formal issue; they have no volume number, issue number, or page numbers, but they can be found and cited via their DOI (Digital Object Identifier). Online First publication benefits the research community by making new scientific results available as quickly as possible.


## Deep Learning approximation of diffeomorphisms via linear-control systems

Alessandro Scagliotti, Scuola Internazionale Superiore di Studi Avanzati, Trieste, Italy

*Corresponding author: Alessandro Scagliotti

Received: November 2021. Revised: July 2022. Early access: August 2022.

Fund Project: The first author is partially supported by INdAM–GNAMPA

In this paper we propose a Deep Learning architecture to approximate diffeomorphisms diffeotopic to the identity. We consider a control system of the form $\dot x = \sum_{i = 1}^l F_i(x)u_i$, with linear dependence on the controls, and we use the corresponding flow to approximate the action of a diffeomorphism on a compact ensemble of points. Despite the simplicity of the control system, it has recently been shown that a Universal Approximation Property holds. The problem of minimizing the sum of the training error and a regularizing term induces a gradient flow in the space of admissible controls. A possible training procedure for the discrete-time neural network consists of projecting the gradient flow onto a finite-dimensional subspace of the admissible controls. An alternative approach relies on an iterative method, based on the Pontryagin Maximum Principle, for the numerical resolution of Optimal Control problems. Here the maximization of the Hamiltonian can be carried out with extremely low computational effort, owing to the linear dependence of the system on the control variables. Finally, we use tools from $\Gamma$-convergence to provide an estimate of the expected generalization error.
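The connection between the control system and a residual network can be made concrete by discretizing the flow of $\dot x = \sum_{i=1}^l F_i(x)u_i$ with piecewise-constant controls: each explicit-Euler step then plays the role of one residual layer. The sketch below is illustrative only; the function names and the choice of explicit Euler are assumptions, not the paper's exact scheme.

```python
import numpy as np

def flow_layers(x0, fields, controls, h):
    """Explicit-Euler discretization of dx/dt = sum_i F_i(x) u_i(t).

    Each step  x_{k+1} = x_k + h * sum_i u_i[k] * F_i(x_k)
    acts as one residual layer; the number of rows of `controls`
    is the network depth.

    x0:       (d,) initial point
    fields:   list of callables F_i mapping R^d -> R^d
    controls: (n_layers, l) piecewise-constant control values
    h:        time step, so the time horizon is n_layers * h
    """
    x = np.asarray(x0, dtype=float)
    for u_k in controls:
        drift = sum(u * F(x) for u, F in zip(u_k, fields))
        x = x + h * drift
    return x
```

For a single linear field $F_1(x) = x$ with unit controls, each step multiplies the state by $(1+h)$, which gives a quick sanity check of the scheme.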

Citation: Alessandro Scagliotti. Deep Learning approximation of diffeomorphisms via linear-control systems. Mathematical Control and Related Fields, doi: 10.3934/mcrf.2022036
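The abstract's remark that maximizing the Hamiltonian is computationally cheap follows from the linear-in-control dynamics: with a quadratic control penalty, the Hamiltonian is a concave quadratic in $u$ and its maximizer is closed-form. The sketch below assumes the (hypothetical) normalization $H(x,p,u) = \langle p, \sum_i u_i F_i(x)\rangle - \tfrac{\beta}{2}|u|^2$; the paper's exact constants may differ.

```python
import numpy as np

def maximize_hamiltonian(x, p, fields, beta):
    """Pointwise maximizer of H(x, p, u) = <p, sum_i u_i F_i(x)> - (beta/2)|u|^2.

    Since H is a concave quadratic in u, setting dH/du_i = 0 gives the
    closed-form argmax  u_i* = <p, F_i(x)> / beta  -- no inner optimization
    loop is needed.

    x, p:   (d,) state and costate vectors
    fields: list of callables F_i mapping R^d -> R^d
    beta:   regularization weight (> 0)
    """
    return np.array([p @ F(x) for F in fields]) / beta
```

This closed form is what makes each sweep of a PMP-based iterative method inexpensive compared with a generic nonlinear-in-control system, where the maximization would require a numerical solver at every time step.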
On the left: the grid of points $\{ x^1_0,\ldots,x^M_0 \}$ at which we evaluate the diffeomorphism $\Psi: \mathbb{R}^2\to \mathbb{R}^2$ defined in (53). On the right: the image of the training dataset under $\Psi$.
ResNet 52, 16 layers, Algorithm 1, $\beta = 10^{-4}$. Top-left: the transformation of the initial grid under the approximating diffeomorphism (red circles) and under the original one (blue circles). Top-right: the prediction on the testing dataset produced by the approximating diffeomorphism (red crosses) and the correct values obtained through the original transformation (blue crosses). In both cases the approximation obtained is unsatisfactory. At the bottom we plot the decrease of the training and testing errors versus the number of iterations; the magenta curve is the estimate of the generalization error provided by (35).
ResNet 55, 16 layers, Algorithm 1, $\beta = 10^{-3}$. Top-left: the transformation of the initial grid under the approximating diffeomorphism (red circles) and under the original one (blue circles). Top-right: the prediction on the testing dataset produced by the approximating diffeomorphism (red crosses) and the correct values obtained through the original transformation (blue crosses). In both cases the approximation obtained is good, and it is better where the data density is higher. At the bottom we plot the decrease of the training and testing errors versus the number of iterations; the magenta curve is the estimate of the generalization error provided by (35).
ResNet 52, $16$ layers, $128$ parameters, Algorithm 1. Running time $\sim 160$ s
| $\beta$ | $L_{\Phi_u}$ | Training error | Testing error |
| --- | --- | --- | --- |
| $10^0$ | $1.19$ | $3.8785$ | $3.8173$ |
| $10^{-1}$ | $8.40$ | $1.3143$ | $1.2476$ |
| $10^{-2}$ | $9.32$ | $1.1991$ | $1.1451$ |
| $10^{-3}$ | $9.37$ | $1.1852$ | $1.1330$ |
| $10^{-4}$ | $9.37$ | $1.1839$ | $1.1318$ |
ResNet 52, $16$ layers, $128$ parameters, Algorithm 2. Running time $\sim 130$ s
| $\beta$ | $L_{\Phi_u}$ | Training error | Testing error |
| --- | --- | --- | --- |
| $10^0$ | $1.19$ | $3.8749$ | $3.8157$ |
| $10^{-1}$ | $8.40$ | $1.3084$ | $1.2455$ |
| $10^{-2}$ | $9.32$ | $1.2014$ | $1.1486$ |
| $10^{-3}$ | $9.33$ | $1.1898$ | $1.1387$ |
| $10^{-4}$ | $9.33$ | $1.1898$ | $1.1379$ |
ResNet 52, $32$ layers, $256$ parameters, Algorithm 1. Running time $\sim 320$ s
| $\beta$ | $L_{\Phi_u}$ | Training error | Testing error |
| --- | --- | --- | --- |
| $10^0$ | $1.19$ | $3.8779$ | $3.8168$ |
| $10^{-1}$ | $8.40$ | $1.3074$ | $1.2425$ |
| $10^{-2}$ | $9.26$ | $1.2015$ | $1.1477$ |
| $10^{-3}$ | $9.34$ | $1.1860$ | $1.1352$ |
| $10^{-4}$ | $9.34$ | $1.1842$ | $1.1332$ |
ResNet 52, $32$ layers, $256$ parameters, Algorithm 2. Running time $\sim 260$ s
| $\beta$ | $L_{\Phi_u}$ | Training error | Testing error |
| --- | --- | --- | --- |
| $10^0$ | $1.19$ | $3.8739$ | $3.8148$ |
| $10^{-1}$ | $8.35$ | $1.3085$ | $1.2449$ |
| $10^{-2}$ | $9.23$ | $1.2075$ | $1.1538$ |
| $10^{-3}$ | $9.26$ | $1.1931$ | $1.1416$ |
| $10^{-4}$ | $9.26$ | $1.1918$ | $1.1404$ |
ResNet 55, $16$ layers, $224$ parameters, Algorithm 1. Running time $\sim 320$ s
| $\beta$ | $L_{\Phi_u}$ | Training error | Testing error |
| --- | --- | --- | --- |
| $10^0$ | $10.14$ | $2.3791$ | $2.3036$ |
| $10^{-1}$ | $13.84$ | $0.1809$ | $0.2314$ |
| $10^{-2}$ | $15.64$ | $0.1290$ | $0.1784$ |
| $10^{-3}$ | $15.83$ | $0.1254$ | $0.1747$ |
| $10^{-4}$ | $15.86$ | $0.1257$ | $0.1751$ |
ResNet 55, $16$ layers, $224$ parameters, Algorithm 2. Running time $\sim 310$ s
| $\beta$ | $L_{\Phi_u}$ | Training error | Testing error |
| --- | --- | --- | --- |
| $10^0$ | $10.78$ | $2.3638$ | $2.3910$ |
| $10^{-1}$ | $14.32$ | $0.1921$ | $0.2422$ |
| $10^{-2}$ | $15.43$ | $0.1887$ | $0.2347$ |
| $10^{-3}$ | $15.56$ | $0.2260$ | $0.2719$ |
| $10^{-4}$ | $15.59$ | $0.2127$ | $0.2564$ |
