Advanced Search
Article Contents
Article Contents

# General risk measures for robust machine learning

• * Corresponding author: Henri Gérard

The work of second author was supported by ENPC and Labex Bézout. The work of third author was supported by Institut Universitaire de France

• A wide array of machine learning problems are formulated as the minimization of the expectation of a convex loss function on some parameter space. Since the probability distribution of the data of interest is usually unknown, it is is often estimated from training sets, which may lead to poor out-of-sample performance. In this work, we bring new insights in this problem by using the framework which has been developed in quantitative finance for risk measures. We show that the original min-max problem can be recast as a convex minimization problem under suitable assumptions. We discuss several important examples of robust formulations, in particular by defining ambiguity sets based on $\varphi$-divergences and the Wasserstein metric. We also propose an efficient algorithm for solving the corresponding convex optimization problems involving complex convex constraints. Through simulation examples, we demonstrate that this algorithm scales well on real data sets.

Mathematics Subject Classification: 46N10, 65K10, 62F35, 68Q32.

 Citation:

• Figure 1.  $\mathtt{ionosphere}$ dataset: Log of the difference between current loss and final loss, with respect to the iteration number for various values of $\epsilon$

Figure 2.  $\mathtt{ionosphere}$ dataset: Log of the difference between current loss and final loss, with respect to the CPU time for vaious values of $\epsilon$ over the first 100 iterations

Figure 3.  $\mathtt{ionosphere}$ dataset: AUC metric as a function of $\epsilon$

Figure 4.  $\mathtt{ionosphere}$ dataset (altered): ROC curve for different values of $\epsilon$

Figure 5.  $\mathtt{ionosphere}$ dataset: AUC histogram for 1000 random realizations using 10% of data for the training set. Robust model is used with $\epsilon = 0.001$

Figure 6.  $\mathtt{ionosphere}$ dataset: AUC histogram for 1000 random realizations using 60% of data for the training set. Robust model is used with $\epsilon = 0.001$

Table 1.  Common perspective functions and their conjugate used to define $\varphi$ -divergences

 Divergence $\varphi\left( t \right)$ $\varphi\left( t \right), t \geq 0$ ${D_\varphi }\left( {p,q} \right)$ $\varphi^{*}\ \left( s \right)$ $\tilde \varphi \left( t \right)$ Kullback-Leibler $\varphi_{kl}\left( t \right)$ $t\log\left( t \right) -t +1$ $\sum_{i = 1}^{N}p_{i}\log\left( {\frac{{{p_i}}}{{{q_i}}}} \right)$ $e^{s}-1$ $\varphi_{b}\left( t \right)$ Burg entropy $\varphi_{b}\left( t \right)$ $-\log\left( t \right)+t-1$ $\sum_{i = 1}^{N}q_{i}\log\left( {\frac{{{q_i}}}{{{p_i}}}} \right)$ $-\log[\left( {1 - s} \right), s < 1$ $\varphi_{kl}\left( t \right)$ J-divergence $\varphi_{j}\left( t \right)$ $\left( {t - {\rm{1}}} \right)\log\left( t \right)$ $\sum_{i=1}^{N}\left( {{p_i} - {q_i}} \right)\log\left( {\frac{{{p_i}}}{{{q_i}}}} \right)$ no closed form $\varphi_{j}\left( t \right)$ $\chi^{2}$-distance $\varphi_{c}\left( t \right)$ $\frac{1}{t}\left( {t - {\rm{1}}} \right)^{2}$ $\sum_{i=1}^{N}\frac{p_{i}-q_{i}}{p_{i}}$ $2-2\sqrt{1-s}, s <1$ $\varphi_{mc}\left( t \right)$ Modified $\chi^{2}$-distance $\varphi_{mc}\left( t \right)$ $\left( {t - {\rm{1}}} \right)^{2}$ $\sum_{i=1}^{N}\frac{q_{i}-p_{i}}{q_{i}}$ $\left \{ \begin{array}{ll} -1, &s <-2 \\ s+s^{2}/4, &s\geq-2 \end{array} \right .$ $\varphi_{c}\left( t \right)$ Hellinger distance $\varphi_{h}\left( t \right)$ $\left( {\sqrt t - 1} \right)^{2}$ $\sum_{i=1}^{N}\left( {\sqrt {{p_i}} - \sqrt {{q_i}} } \right)$ $\frac{s}{1-s},s <1$ $\varphi_{h}\left( t \right)$ $\chi$-divergence of order $\theta$>1 $\varphi_{ca}^{\theta}\left( t \right)$ $|{t-1}|^{\theta}$ $\sum_{i=1}^{N}q_{i}{\rm{|}}1 - \frac{{{p_i}}}{{{q_i}}}|^{\theta}$ $s+\left( {\theta - 1} \right){\left( {\frac{{|s|}}{\theta }} \right)^{\frac{\theta }{{\theta - 1}}}}$ $t^{1-\theta}\varphi_{ca}^{\theta}\left( t \right)$ Variation distance $\varphi_{v}\left( t \right)$ $|{t-1}|$ $\sum_{i=1}^{N}|{p_i} - {q_i}|$ $\left \{ \begin{array}{ll} -1, &s\leq-1 \\ s, &-1 \leq s \leq 1 \end{array} \right .$ $\varphi_{v}\left( t \right)$ Cressie and Read $\varphi_{cr}^{\theta}\left( t \right)$ $\frac{1-\theta+\theta t-t^{\theta}}{\theta\left( {1 - \theta } \right)}, \theta \notin {\rm{\{ 0,1\} }}$ $\frac{1}{\theta\left( {1 - \theta } \right)}\left( {1 - \sum _{i = 1}^N {p_i^\theta } q_i^{1 - \theta }} \right)$ $\left \{ \begin{array}{l} \frac{1}{\theta}\left( {1 - s\left( {1 - \theta } \right)} \right)^{\frac{\theta}{\theta-1}}-\frac{1}{\theta} \\ s < \frac{1}{\theta-1} \end{array} \right .$ $\varphi_{cr}^{1-\theta}\left( t \right)$ Average Value at Risk of level $\beta$ $\varphi_{\textrm{avar}}^{\beta}\left( t \right)$ $\iota_{\left[ {0,\frac{1}{{1 - \beta }}} \right]}, \beta \in [0,1]$ $\sum_{i=1}^{N}\iota_{\left[ {0,\frac{1}{{1 - \beta }}} \right]}(\frac{p_{i}}{q_{i}})$ $\sigma_{\left[ {0,\frac{1}{{1 - \beta }}} \right]} = \left \{ \begin{array}{l} \frac{1}{1-\beta} , s\geq 0 \\ 0 , s < 0 \end{array} \right .$ $\iota_{[1-\beta,+\infty[}$

Table 2.  Parameters of the datasets

 Name of dataset $\mathtt{ionosphere}$ $\mathtt{colon-cancer}$ Number of observations ($N$) 351 64 Number of features ($d$) 34 2000

Table 3.  $\mathtt{colon-cancer}$ dataset: Values of the AUC for different values of $\epsilon$

 Value of $\epsilon$ AUC with KL AUC with Wasserstein $\epsilon = 0$ (LR) 0.832 0.832 $\epsilon = 0.001$ 0.757 0.787 $\epsilon = 0.002$ 0.750 0.770 $\epsilon = 0.003$ 0.779 0.706 $\epsilon = 0.004$ 0.698 0.691 $\epsilon = 0.005$ 0.868 0.831 $\epsilon = 0.006$ 0.890 0.860 $\epsilon = 0.007$ 0.728 0.838 $\epsilon = 0.008$ 0.809 0.768 $\epsilon = 0.009$ 0.875 0.890 $\epsilon = 0.01$ 0.801 0.853 $\epsilon = 0.05$ 0.786 0.794 $\epsilon = 0.1$ 0.801 0.816

Table 4.  $\mathtt{ionosphere}$ dataset (altered): Values of the area under ROC curve for different values of $\epsilon$

 Value of $\epsilon$ AUC with KL AUC with Wasserstein $\epsilon = 0$ (LR) 0.514 0.514 $\epsilon = 0.001$ 0.816 0.840 $\epsilon = 0.002$ 0.804 0.835 $\epsilon = 0.003$ 0.840 0.814 $\epsilon = 0.004$ 0.824 0.830 $\epsilon = 0.005$ 0.815 0.829 $\epsilon = 0.006$ 0.834 0.829 $\epsilon = 0.007$ 0.821 0.815 $\epsilon = 0.008$ 0.835 0.815 $\epsilon = 0.009$ 0.823 0.822 $\epsilon = 0.01$ 0.828 0.835 $\epsilon = 0.05$ 0.815 0.826 $\epsilon = 0.1$ 0.824 0.823
•  [1] S. M. Ali and S. D. Silvey, A general class of coefficients of divergence of one distribution from another, Journal of the Royal Statistical Society. Series B (Methodological), 28 (1966), 131-142.  doi: 10.1111/j.2517-6161.1966.tb00626.x. [2] P. Artzner, F. Delbaen, J.-M. Eber and D. Heath, Coherent measures of risk, Mathematical Finance, 9 (1999), 203-228.  doi: 10.1111/1467-9965.00068. [3] M. Basseville, Divergence measures for statistical data processing–an annotated bibliography, Signal Processing, 93 (2013), 621-633.  doi: 10.1016/j.sigpro.2012.09.003. [4] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, New York, 2011. doi: 10.1007/978-3-319-48311-5. [5] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), 183-202.  doi: 10.1137/080716542. [6] A. Ben-Tal and A. Nemirovski, Robust solutions of linear programming problems contaminated with uncertain data, Mathematical Programming, 88 (2000), 411-424.  doi: 10.1007/PL00011380. [7] A. Ben-Tal,  L. El Ghaoui and  A. Nemirovski,  Robust Optimization, Princeton University Press, 2009. [8] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg and G. Rennen, Robust solutions of optimization problems affected by uncertain probabilities, Management Science, 59 (2013), 341-357.  doi: 10.1287/mnsc.1120.1641. [9] A. P. Bradley, The use of the area under the roc curve in the evaluation of machine learning algorithms, Pattern Recognition, 30 (1997), 1145-1159.  doi: 10.1016/S0031-3203(96)00142-2. [10] L. M. Briceno-Arias, G. Chierchia, E. Chouzenoux and J.-C. Pesquet, A random block-coordinate douglas-rachford splitting method with low computational complexity for binary logistic regression, Computational Optimization and Applications, 72 (2019), 707-726.  doi: 10.1007/s10589-019-00060-6. [11] A. Chambolle and C. Dossal, On the convergence of the iterates of "FISTA", Journal of Optimization Theory and Applications, 166 (2015), 968-982.  doi: 10.1007/s10957-015-0746-4. [12] P. L. Combettes, Strong convergence of block-iterative outer approximation methods for convex optimization, SIAM Journal on Control and Optimization, 38 (2000), 538-565.  doi: 10.1137/S036301299732626X. [13] P. L. Combettes, A block-iterative surrogate constraint splitting method for quadratic signal recovery, IEEE Transactions on Signal Processing, 51 (2003), 1771-1782.  doi: 10.1109/TSP.2003.812846. [14] P. L. Combettes and C. L. Müller, Perspective functions: Proximal calculus and applications in high-dimensional statistics, Journal of Mathematical Analysis and Applications, 457 (2018), 1283-1306.  doi: 10.1016/j.jmaa.2016.12.021. [15] P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, 185–212, Springer Optim. Appl., 49, Springer, New York, 2011. doi: 10.1007/978-1-4419-9569-8_10. [16] P. L. Combettes, D. Dung and B. C. Vũ, Dualization of signal recovery problems, Set-Valued and Variational Analysis, 18 (2010), 373-404.  doi: 10.1007/s11228-010-0147-7. [17] I. Csiszár, Eine informationstheoretische ungleichung und ihre anwendung auf beweis der ergodizitaet von markoffschen ketten, Magyer Tud. Akad. Mat. Kutato Int. Koezl., 8 (1963), 85-108. [18] J. Duchi, P. Glynn and H. Namkoong, Statistics of robust optimization: A generalized empirical likelihood approach, preprint, arXiv: 1610.03425, 2016. [19] P. M. Esfahani and D. Kuhn, Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations, Mathematical Programming, 171 (2018), 115-166.  doi: 10.1007/s10107-017-1172-1. [20] P. M. Esfahani, S. Shafieezadeh-Abadeh, G. A. Hanasusanto and D. Kuhn, Data-driven inverse optimization with imperfect information, Mathematical Programming, 167 (2018), 191-234.  doi: 10.1007/s10107-017-1216-6. [21] J. Feng, H. Xu, S. Mannor and S. Yan, Robust logistic regression and classification, In Advances in Neural Information Processing Systems, 2014,253–261. [22] H. Föllmer and A. Schied, Stochastic Finance: An Introduction in Discrete Time (4th edition), Walter de Gruyter, 2016. [23] J.-y. Gotoh, M. J. Kim and A. E. Lim, Robust empirical optimization is almost the same as mean–variance optimization, Operations Research Letters, 46 (2018), 448-452.  doi: 10.1016/j.orl.2018.05.005. [24] Y. Haugazeau, Sur les inéquations variationnelles et la minimisation de fonctionnelles convexes, These, Universite de Paris, 1968. [25] Z. Hu and L. J. Hong, Kullback-leibler divergence constrained distributionally robust optimization, Available at Optimization Online, 2013. [26] A. Kurakin, I. Goodfellow and S. Bengio, Adversarial examples in the physical world, preprint, arXiv: 1607.02533, 2016. [27] S. Moghaddam and M. Mahlooji, Robust simulation optimization using $\varphi$-divergence, International Journal of Industrial Engineering Computations, 7 (2016), 517-534.  doi: 10.5267/j.ijiec.2016.5.003. [28] T. Morimoto, Markov processes and the h-theorem, Journal of the Physical Society of Japan, 18 (1963), 328-331.  doi: 10.1143/JPSJ.18.328. [29] H. Namkoong and J. C. Duchi, Stochastic gradient methods for distributionally robust optimization with f-divergences, In Advances in Neural Information Processing Systems, 2016, 2208–2216. [30] N. Papernot, P. McDaniel and I. Goodfellow, Transferability in machine learning: From phenomena to black-box attacks using adversarial samples, preprint, arXiv: 1605.07277, 2016. [31] Y. Plan and R. Vershynin, Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach, IEEE Transactions on Information Theory, 59 (2013), 482-494.  doi: 10.1109/TIT.2012.2207945. [32] R. T. Rockafellar and S. Uryasev, Optimization of conditional value-at-risk, Journal of Risk, 2 (2000), 21-42.  doi: 10.21314/JOR.2000.038. [33] A. Ruszczyński and A. Shapiro, Conditional risk mappings, Mathematics of Operations Research, 31 (2006), 544-561.  doi: 10.1287/moor.1060.0204. [34] A. Ruszczynski and A. Shapiro, Optimization of convex risk functions, Mathematics of Operations Research, 31 (2006), 433-452.  doi: 10.1287/moor.1050.0186. [35] S. Shafieezadeh-Abadeh, P. M. Esfahani and D. Kuhn, Distributionally robust logistic regression, In Advances in Neural Information Processing Systems, (2015), 1576–1584.

Figures(6)

Tables(4)

## Article Metrics

HTML views(1457) PDF downloads(352) Cited by(0)

## Other Articles By Authors

• on this site
• on Google Scholar

### Catalog

/

DownLoad:  Full-Size Img  PowerPoint