
# Modal additive models with data-driven structure identification

Corresponding author: Hong Chen
Additive models, owing to their flexibility, have received a great deal of attention in high-dimensional regression analysis. Much effort has gone into capturing interactions between predictive variables within additive models. However, typical approaches rely on conditional mean assumptions, which may fail to reveal the structure when the data are contaminated by heavy-tailed noise. In this paper, we propose a penalized modal regression method, Modal Additive Models (MAM), based on a conditional mode assumption for simultaneous function estimation and structure identification. MAM approximates the non-parametric function through forward neural networks and maximizes the modal risk subject to constraints on the function space and group structure. The proposed approach can be implemented by the half-quadratic (HQ) optimization technique, and its asymptotic estimation and selection consistency are established. MAM is shown to achieve a satisfactory learning rate and to identify the target group structure with high probability. The effectiveness of MAM is also supported by simulated examples.

Mathematics Subject Classification: Primary: 68T05; Secondary: 62J02.


Figure 1.  Estimated transformation functions for selected groups. Top-left: group $(1, 6)$; top-right: group $(8, 12)$; bottom-left: group $(3, 7)$; bottom-right: group $(10, 13)$.

Algorithm 1.  Half-quadratic Optimization for MAM

1: Require: Input data $(\mathbf{x}_i, y_i)_{i=1}^n$, kernel-induced representing function $\phi$, activating function $\psi$, weight parameter $\mathbf{w}$ and bias term $\mathbf{b}$.
2: Ensure: $\mathbf{a}_{\mathbf{z}}$;
3: Define function $f$ such that $f(\mathbf{x}^2) = \phi(\mathbf{x})$;
4: Initialize $\sigma$, coefficient $\mathbf{a}$;
5: while not converged do
6:    Update $e_i$ by $e_i = f^\prime \Big( \big(\frac{y_i - f(\mathbf{x}_i)}{\sigma} \big)^2 \Big)$;
7:    Update $\mathbf{a}$ by $\mathbf{a} = \arg \max_{\mathbf{a} \in \mathbb{R}^h} \frac{1}{n \sigma}\sum_{i=1}^{n} \Big( e_i \big(\frac{y_i - f(\mathbf{x}_i)}{\sigma} \big)^2 - g(e_i) \Big) - \lambda \|\mathbf{a}\|_2^2$;
8:    Update $\sigma$;
9: end while
10: Output: $\mathbf{a}_{\mathbf{z}} = \mathbf{a}$.
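The loop in Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel is taken to be Gaussian, $\phi(u) = \exp(-u^2/2)$, so that $f(t) = \exp(-t/2)$ and the auxiliary variable is $e_i = -\tfrac{1}{2}\exp(-r_i^2/2)$ with $r_i$ the scaled residual; the activation $\psi$ is taken to be `tanh`; the hidden weights and biases are random and fixed; and $\sigma$ is held constant because the update rule for $\sigma$ is not given here. With these choices the $\mathbf{a}$-update reduces to a weighted ridge regression. The names `hidden_features`, `modal_objective` and `hq_modal_fit` are hypothetical.

```python
import numpy as np


def hidden_features(X, W, b):
    """Hidden-layer outputs psi(X @ W + b); psi = tanh is an assumption."""
    return np.tanh(X @ W + b)


def modal_objective(a, H, y, sigma, lam):
    """Empirical modal risk (Gaussian kernel, assumed) minus the ridge penalty."""
    r = (y - H @ a) / sigma
    return np.mean(np.exp(-0.5 * r**2)) / sigma - lam * (a @ a)


def hq_modal_fit(H, y, sigma=0.5, lam=1e-3, max_iter=100, tol=1e-8):
    """Half-quadratic iterations with sigma held fixed (a simplification).

    Returns the coefficient vector and the objective value after each step.
    """
    n, h = H.shape
    a = np.zeros(h)
    history = [modal_objective(a, H, y, sigma, lam)]
    for _ in range(max_iter):
        # Step 6: auxiliary weights; w_i = -2 * e_i >= 0 for the Gaussian kernel.
        r = (y - H @ a) / sigma
        w = np.exp(-0.5 * r**2)
        # Step 7: the arg max is a weighted ridge problem with a closed form.
        HW = H * w[:, None]
        A = HW.T @ H + 2.0 * lam * n * sigma**3 * np.eye(h)
        a_new = np.linalg.solve(A, HW.T @ y)
        history.append(modal_objective(a_new, H, y, sigma, lam))
        done = np.linalg.norm(a_new - a) <= tol * (1.0 + np.linalg.norm(a))
        a = a_new
        if done:
            break
    return a, history


# Toy data with heavy-tailed (Student-t) noise; all settings are illustrative.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.standard_t(df=2, size=200)
W, b = rng.normal(size=(2, 20)), rng.normal(size=20)
H = hidden_features(X, W, b)
a_hat, hist = hq_modal_fit(H, y)
```

For a fixed $\sigma$, each iteration maximizes a lower bound of the objective that is tight at the current iterate, so the values in `hist` should not decrease. Observations with large residuals receive weights close to zero, which is what makes the fit insensitive to outliers.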

Algorithm 2.  Backward Stepwise Selection for MAM

1: Start with the variable pool $G = \{(1,2,\cdots, d)\}$;
2: Solve (13) to obtain the maximum value $\mathscr{R}_{\lambda, G}$;
3: for each variable $j$ in $G$ do
4:    $\hat{G} \longleftarrow$ either divide $j$ into subgroups or add to an existing group;
5:    Solve (13) to obtain the maximum value $\mathscr{R}_{\lambda, \hat{G}}$;
6:    if $\mathscr{R}_{\lambda, \hat{G}} > \mathscr{R}_{\lambda, G}$ then
7:        Preserve $\hat{G}$ as the new group structure;
8:    end if
9: end for
10: Return $\hat{G}$.
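The greedy search in Algorithm 2 can be sketched as follows. This is one possible reading of step 4, in which variable $j$ is either split off into its own group or moved into another existing group; it is not necessarily the authors' exact procedure. The `score` callable stands in for the maximized value $\mathscr{R}_{\lambda, G}$ obtained by solving (13), which is not reproduced here, so the toy `pair_agreement` score below is purely hypothetical and serves only to make the example runnable.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

Partition = List[FrozenSet[int]]


def candidate_moves(groups: Partition, j: int) -> List[Partition]:
    """Partitions reachable by splitting j off or moving it to another group."""
    src = next(g for g in groups if j in g)
    rest = [g for g in groups if g != src]
    remainder = src - {j}
    keep = [remainder] if remainder else []
    cands = []
    if remainder:  # splitting only changes the partition if j is not alone
        cands.append(rest + keep + [frozenset({j})])
    for k, g in enumerate(rest):
        cands.append(rest[:k] + [g | {j}] + rest[k + 1:] + keep)
    return cands


def stepwise_structure(
    d: int, score: Callable[[Partition], float]
) -> Tuple[Partition, float]:
    """One pass over variables 1..d, keeping any move that raises the score."""
    groups: Partition = [frozenset(range(1, d + 1))]
    best = score(groups)
    for j in range(1, d + 1):
        for cand in candidate_moves(groups, j):
            s = score(cand)
            if s > best:
                groups, best = cand, s
    return sorted(groups, key=min), best


def pair_agreement(target: Partition) -> Callable[[Partition], int]:
    """Hypothetical score: number of variable pairs grouped as in `target`."""
    variables = sorted(set().union(*target))

    def same(groups: Partition, i: int, j: int) -> bool:
        return any(i in g and j in g for g in groups)

    def score(groups: Partition) -> int:
        return sum(
            same(groups, i, j) == same(target, i, j)
            for i, j in combinations(variables, 2)
        )

    return score


target = [frozenset({1, 2}), frozenset({3, 4})]
groups, best = stepwise_structure(4, pair_agreement(target))
```

In an actual run each call to `score` would refit the model for the candidate structure, so the cost is dominated by the number of candidate partitions evaluated per variable.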

Table 1.  Selected models for simulation study and the corresponding intrinsic group structures

| ID | Model | Intrinsic group structure |
|---|---|---|
| M1 | $y = x_1 + x_2^2 + \frac{1}{1+ x_3^2} + \sin(\pi x_4) +\log(x_5+5) + \sqrt{\lvert x_6 \rvert} + \epsilon$ | $\{(1),(2),(3),(4),(5),(6)\}$ |
| M2 | $y = \frac{\sin(x_1)}{x_1} + \cos((x_2 +x_3)\cdot \pi ) + \arctan((x_4 + x_5 + x_6)^2)+ \epsilon$ | $\{(1),(2, 3),(4, 5, 6)\}$ |
| M3 | $y = \sin(x_1 + x_2) + 2\log(x_3 + 5) +x_4 + x_5\cdot x_6 + \epsilon$ | $\{(1, 2), (3), (4), (5, 6)\}$ |
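The three models in Table 1 are simple to simulate. The sketch below encodes the regression functions and the intrinsic group structures exactly as listed; the input distribution (uniform on $[-1, 1]$) and the noise parameters (standard Gaussian, or Gamma with shape 2 and scale 1) are assumptions made for illustration, since they are not specified in this section. The names `MODELS`, `GROUPS` and `simulate` are hypothetical.

```python
import numpy as np


def m1(X):
    return (X[:, 0] + X[:, 1] ** 2 + 1.0 / (1.0 + X[:, 2] ** 2)
            + np.sin(np.pi * X[:, 3]) + np.log(X[:, 4] + 5.0)
            + np.sqrt(np.abs(X[:, 5])))


def m2(X):
    # np.sinc(t) = sin(pi t) / (pi t), so sinc(x / pi) = sin(x) / x, safe at x = 0.
    return (np.sinc(X[:, 0] / np.pi)
            + np.cos((X[:, 1] + X[:, 2]) * np.pi)
            + np.arctan((X[:, 3] + X[:, 4] + X[:, 5]) ** 2))


def m3(X):
    return (np.sin(X[:, 0] + X[:, 1]) + 2.0 * np.log(X[:, 2] + 5.0)
            + X[:, 3] + X[:, 4] * X[:, 5])


MODELS = {"M1": m1, "M2": m2, "M3": m3}

# Intrinsic group structures from Table 1 (1-based variable indices).
GROUPS = {
    "M1": [(1,), (2,), (3,), (4,), (5,), (6,)],
    "M2": [(1,), (2, 3), (4, 5, 6)],
    "M3": [(1, 2), (3,), (4,), (5, 6)],
}


def simulate(model_id, n, rng, noise="gaussian"):
    """Draw (X, y); input and noise distributions are illustrative assumptions."""
    X = rng.uniform(-1.0, 1.0, size=(n, 6))
    if noise == "gaussian":
        eps = rng.normal(0.0, 1.0, size=n)
    elif noise == "gamma":
        eps = rng.gamma(shape=2.0, scale=1.0, size=n)
    else:
        raise ValueError(f"unknown noise type: {noise}")
    return X, MODELS[model_id](X) + eps


X_demo, y_demo = simulate("M2", 100, np.random.default_rng(0), noise="gamma")
```

Gamma noise is skewed, so its mean and mode differ; this is the kind of setting in which a conditional-mode target and a conditional-mean target lead to different regression functions.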

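Table 3.  Average performance that intrinsic group structures are identified for $(\mu, \beta)$ pair (Gaussian noise)

| $\mu$ | $\beta$ | M1 MF | M1 Size | M1 TP | M1 U | M1 O | M2 MF | M2 Size | M2 TP | M2 U | M2 O | M3 MF | M3 Size | M3 TP | M3 U | M3 O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $1\mathrm{e}{-6}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.66 | 1 | 0 | 0 | 2 | 1 | 0 | 1 |
| $1\mathrm{e}{-5}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.84 | 1 | 0 | 0 | 2 | 1 | 0 | 1 |
| $1\mathrm{e}{-4}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.68 | 1 | 0 | 0 | 2 | 0.1 | 1 | 0 |
| $1\mathrm{e}{-3}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.46 | 0.46 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-2}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.62 | 0.62 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-1}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.78 | 0.78 | 0 | 0 | 2 | 1 | 0 | 0 |
| $1\mathrm{e}{-6}$ | $3$ | 0 | 3 | 2 | 1 | 0 | 0 | 2 | 0.42 | 0.42 | 0 | 0 | 2 | 0.66 | 0.66 | 0 |
| $1\mathrm{e}{-5}$ | $3$ | 0 | 2.84 | 1.78 | 0.94 | 0 | 0 | 2 | 0.54 | 0.54 | 0 | 0 | 2 | 0 | 1 | 0 |
| $1\mathrm{e}{-4}$ | $3$ | 0 | 3.36 | 2.32 | 1 | 0 | 0 | 2 | 0.58 | 0.58 | 0 | 0 | 2.2 | 1.6 | 1 | 0 |
| $1\mathrm{e}{-3}$ | $3$ | 0 | 4.9 | 3.9 | 1 | 0 | 0 | 2 | 0.78 | 0.78 | 0 | 50 | 4 | 4 | 0 | 0 |
| $1\mathrm{e}{-2}$ | $3$ | 50 | 6 | 6 | 0 | 0 | 29 | 3.62 | 1.9 | 0 | 0.22 | 50 | 4 | 4 | 0 | 0 |
| $1\mathrm{e}{-1}$ | $3$ | 50 | 6 | 6 | 0 | 0 | 0 | 5.38 | 1.62 | 0 | 1 | 0 | 6 | 2 | 0 | 1 |
| $1\mathrm{e}{-6}$ | $5$ | 0 | 2.72 | 1.64 | 0.92 | 0 | 0 | 2 | 0.5 | 0.5 | 0 | 0 | 2.3 | 0.6 | 1 | 0 |
| $1\mathrm{e}{-5}$ | $5$ | 0 | 3.4 | 1.6 | 0.8 | 0 | 0 | 2 | 0.58 | 0.58 | 0 | 0 | 3 | 2 | 1 | 0 |
| $1\mathrm{e}{-4}$ | $5$ | 0 | 4.82 | 3.82 | 1 | 0 | 0 | 2.01 | 0.38 | 0.38 | 0 | 50 | 4 | 4 | 0 | 0 |
| $1\mathrm{e}{-3}$ | $5$ | 27 | 5.54 | 5.08 | 0.46 | 0 | 28 | 3.44 | 1.76 | 0 | 0 | 50 | 4 | 4 | 0 | 0 |
| $1\mathrm{e}{-2}$ | $5$ | 50 | 6 | 6 | 0 | 0 | 0 | 5 | 2 | 0 | 1 | 0 | 6 | 2 | 0 | 1 |
| $1\mathrm{e}{-1}$ | $5$ | 50 | 6 | 6 | 0 | 0 | 0 | 6 | 1 | 0 | 1 | 0 | 6 | 2 | 0 | 1 |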

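Table 4.  Average performance that intrinsic group structures are identified for $(\mu, \beta)$ pair (Gamma noise)

| $\mu$ | $\beta$ | M1 MF | M1 Size | M1 TP | M1 U | M1 O | M2 MF | M2 Size | M2 TP | M2 U | M2 O | M3 MF | M3 Size | M3 TP | M3 U | M3 O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $1\mathrm{e}{-6}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.6 | 0.6 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-5}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.7 | 0.7 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-4}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-3}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.92 | 0.92 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-2}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.58 | 0.58 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-1}$ | $1$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.76 | 0.76 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-6}$ | $3$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0.52 | 0.52 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-5}$ | $3$ | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 2.42 | 0.66 | 1 | 0 |
| $1\mathrm{e}{-4}$ | $3$ | 0 | 3.8 | 2.6 | 1 | 0 | 0 | 2 | 0.8 | 0.8 | 0 | 0 | 2 | 1 | 1 | 0 |
| $1\mathrm{e}{-3}$ | $3$ | 0 | 4 | 3 | 1 | 0 | 5 | 2.26 | 0.92 | 0.62 | 0 | 50 | 4 | 4 | 0 | 0 |
| $1\mathrm{e}{-2}$ | $3$ | 42 | 5.84 | 5.88 | 0.16 | 0 | 27 | 3.66 | 1.82 | 0 | 0.2 | 50 | 4 | 4 | 0 | 0 |
| $1\mathrm{e}{-1}$ | $3$ | 50 | 6 | 6 | 0 | 0 | 0 | 6 | 1 | 0 | 1 | 0 | 6 | 2 | 0 | 1 |
| $1\mathrm{e}{-6}$ | $5$ | 0 | 2.56 | 1.48 | 1 | 0 | 0 | 2 | 0.62 | 0.62 | 0 | 0 | 2 | 0.92 | 0.92 | 0 |
| $1\mathrm{e}{-5}$ | $5$ | 0 | 3.5 | 2.5 | 1 | 0 | 0 | 2 | 0.66 | 0.66 | 0 | 0 | 3 | 2 | 1 | 0 |
| $1\mathrm{e}{-4}$ | $5$ | 7 | 4.88 | 3.76 | 0.86 | 0 | 24 | 3.08 | 1.8 | 0 | 0.08 | 0 | 2.2 | 0.52 | 1 | 0 |
| $1\mathrm{e}{-3}$ | $5$ | 8 | 4.94 | 3.84 | 0.84 | 0 | 27 | 3.4 | 1.6 | 0 | 0 | 50 | 4 | 4 | 0 | 0 |
| $1\mathrm{e}{-2}$ | $5$ | 50 | 6 | 6 | 0 | 0 | 0 | 5 | 2 | 0 | 1 | 0 | 5.14 | 2.86 | 0 | 1 |
| $1\mathrm{e}{-1}$ | $5$ | 50 | 6 | 6 | 0 | 0 | 0 | 6 | 1 | 0 | 1 | 0 | 6 | 2 | 0 | 1 |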

Table 2.  Mean absolute error comparisons (mean $\pm$ std.) for Gaussian and Gamma noise

| Model | GASI (Gaussian) | GASI (Gamma) | MAM (Gaussian) | MAM (Gamma) |
|---|---|---|---|---|
| M1 | $186.3.92. \pm 437.8$ | $458.8 \pm 988.8$ | $\mathbf{109.92 \pm 257.2}$ | $\mathbf{272.8 \pm 536.2}$ |
| M2 | $1.088 \pm 0.025$ | $0.774 \pm 0.032$ | $\mathbf{0.839 \pm 0.023}$ | $\mathbf{0.751 \pm 0.028}$ |
| M3 | $\mathbf{0.857 \pm 0.025}$ | $\mathbf{0.873 \pm 0.019}$ | $0.901 \pm 0.028$ | $0.917 \pm 0.021$ |
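Entries of the form mean $\pm$ std. in Table 2 summarize the mean absolute error over repeated runs. A minimal sketch of that bookkeeping is given below; whether the reported spread is the sample standard deviation (`ddof=1`, used here) or the population one is an assumption, and the helper names `mae` and `summarize` are hypothetical.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error of one run."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def summarize(errors):
    """Format per-run errors as 'mean ± std' (sample std, an assumption)."""
    e = np.asarray(errors, dtype=float)
    return f"{e.mean():.3f} ± {e.std(ddof=1):.3f}"


# Two illustrative runs with made-up predictions.
runs = [mae([1, 2, 3], [1, 1, 5]), mae([0, 0], [3, 3])]
report = summarize(runs)
```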
