Journal of Dynamics and Games, January 2020, 7(1): 79-96. doi: 10.3934/jdg.2020005

A regularization interpretation of the proximal point method for weakly convex functions

Tim Hoheisel, Maxime Laborde and Adam Oberman
Department of Mathematics and Statistics, McGill University, Montreal, Canada

* Corresponding author: tim.hoheisel@mcgill.ca

Received: January 2019. Published: February 2020.

Fund Project: The third author is supported by AFOSR FA9550-18-1-0167.

Abstract: Empirical evidence and theoretical results suggest that the proximal point method can be computed approximately and still converge faster than the corresponding gradient descent method, in both the stochastic and exact gradient cases. In this article we provide a perspective on this result by interpreting the method as gradient descent on a regularized function. This perspective applies in the case of weakly convex functions, where proofs of the faster rates are not available. Using this analysis, we find the optimal value of the regularization parameter in terms of the weak convexity.
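For orientation, the interpretation can be stated in standard notation (a sketch using the usual definitions of the Moreau envelope and proximal operator; the notation below is not taken verbatim from the paper). For a regularization parameter $ \lambda > 0 $,
$$ f_\lambda(x) = \min_{y} \Big\{ f(y) + \frac{1}{2\lambda}\|y - x\|^2 \Big\}, \qquad x_{k+1} = \operatorname{prox}_{\lambda f}(x_k) = \operatorname{argmin}_{y} \Big\{ f(y) + \frac{1}{2\lambda}\|y - x_k\|^2 \Big\}, $$
and the proximal point step coincides with an exact gradient descent step of size $ \lambda $ on the regularized function, $ x_{k+1} = x_k - \lambda \nabla f_\lambda(x_k) $. When $ f $ is $ \rho $-weakly convex, i.e. $ f + \frac{\rho}{2}\|\cdot\|^2 $ is convex, the inner problem is strongly convex and the envelope is well defined and differentiable provided $ \lambda < 1/\rho $, which is why the admissible regularization parameter is tied to the weak convexity constant.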

Citation: Tim Hoheisel, Maxime Laborde, Adam Oberman. A regularization interpretation of the proximal point method for weakly convex functions. Journal of Dynamics and Games, 2020, 7 (1) : 79-96. doi: 10.3934/jdg.2020005


Figure 1.  Illustration of Example 1
Figure 2.  Top: iterations of gradient descent and the proximal point method. The proximal point method gets closer to the global minimum with the same total number of gradient evaluations. Bottom: function values at each iteration of gradient descent and at each outer proximal point iteration for the Rosenbrock function (for a fair comparison, total gradient evaluations are counted, with 10 or 20 inner gradient evaluations per approximate proximal point step). After 2000 gradient evaluations the function value for gradient descent is still of order 1, while for the proximal point method it is of order $ 10^{-2} $; moreover, the function values appear to decrease at a first-order rate
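To make the comparison in Figure 2 concrete, here is a minimal sketch in Python of gradient descent versus an approximate proximal point method on the Rosenbrock function. This is not the authors' code: the starting point, step sizes, regularization parameter, and the choice of plain gradient descent with 20 inner steps as the inner solver are illustrative assumptions; only the convention of comparing methods by total gradient evaluations is taken from the caption.

```python
import numpy as np

def rosenbrock_grad(x):
    # Gradient of f(x) = (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2.
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2),
        200.0 * (x[1] - x[0] ** 2),
    ])

def gradient_descent(x, n_evals, h=1e-3):
    # Plain gradient descent: one gradient evaluation per iteration.
    for _ in range(n_evals):
        x = x - h * rosenbrock_grad(x)
    return x

def approx_prox_point(x, n_outer, n_inner=20, lam=0.1, h=1e-3):
    # Approximate proximal point: each outer step approximately solves
    #   min_y  f(y) + ||y - x||^2 / (2 * lam)
    # with n_inner inner gradient steps, so the total budget is
    # n_outer * n_inner gradient evaluations (the counting used in Figure 2).
    for _ in range(n_outer):
        y = x.copy()
        for _ in range(n_inner):
            y = y - h * (rosenbrock_grad(y) + (y - x) / lam)
        x = y
    return x

# Same total budget of gradient evaluations for both methods.
x0 = np.array([-1.5, 1.5])
budget = 2000
print("gradient descent:      ", gradient_descent(x0, budget))
print("approx proximal point: ", approx_prox_point(x0, budget // 20))
```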
