Article Contents

High-dimensional linear regression with hard thresholding regularization: Theory and algorithm

• Corresponding author: Jing Zhang
• Variable selection and parameter estimation are fundamental problems in high-dimensional data analysis. In this paper, we employ the hard thresholding regularization method [1] to address these problems under a high-dimensional sparse linear regression model. Theoretically, we establish a sharp non-asymptotic estimation error bound for the global solution and further show that the support of the global solution coincides with the target support with high probability. Motivated by the KKT condition, we propose a primal dual active set (PDAS) algorithm to solve the minimization problem, and show that PDAS is essentially a generalized Newton method, which guarantees fast convergence when a good initial value is provided. Furthermore, we propose a sequential version of the PDAS algorithm (SPDAS) that uses a warm-start strategy to choose the initial value adaptively. The most significant advantage of the proposed procedure is its computational speed. Extensive numerical studies demonstrate that the proposed method performs well in variable selection and estimation accuracy, and that it compares favorably with existing methods in computational speed. As an illustration, we apply the proposed method to a breast cancer gene expression data set.
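The two ingredients of the procedure described above — the hard-thresholding operator and a primal-dual active-set iteration driven by the KKT condition — can be illustrated with a small sketch. The following is a minimal, illustrative implementation, not the authors' code: it uses the SDAR-style formulation with an explicit sparsity level `T` rather than the paper's regularization-parameter formulation, and the function names, the fixed iteration cap, and the simple stopping rule (active set unchanged) are all assumptions made for the example.

```python
import numpy as np

def hard_threshold(z, lam):
    """Hard-thresholding operator: keep entries with |z_i| > lam, zero the rest."""
    return np.where(np.abs(z) > lam, z, 0.0)

def pdas_sparse_ls(X, y, T, max_iter=50):
    """Illustrative primal-dual active-set iteration for T-sparse least squares.

    Alternates between (i) choosing the active set from the T largest entries
    of the primal-plus-dual vector beta + d, where d = X^T (y - X beta) / n,
    and (ii) solving the least-squares problem restricted to the active set.
    Stops when the active set no longer changes.
    """
    n, p = X.shape
    beta = np.zeros(p)
    d = X.T @ y / n          # dual variable at beta = 0
    active = set()
    for _ in range(max_iter):
        # Active set: indices of the T largest |beta + d|.
        new_active = set(np.argsort(-np.abs(beta + d))[:T].tolist())
        if new_active == active:
            break
        active = new_active
        A = sorted(active)
        # Primal update: least squares on the active columns, zero elsewhere.
        beta = np.zeros(p)
        beta[A], *_ = np.linalg.lstsq(X[:, A], y, rcond=None)
        # Dual update: scaled correlations of the residual; zero on the active set.
        d = X.T @ (y - X @ beta) / n
        d[A] = 0.0
    return beta
```

Because each step fits an exact least-squares problem on the current active set, the iteration behaves like a Newton-type method and typically stabilizes in a handful of passes when the initial active set is close to the target support, which is what the warm-start (SPDAS) strategy exploits.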

Mathematics Subject Classification: Primary: 62J05, 62H12; Secondary: 68Q25.


• Figure 1.  Plot of $P_{3}(\cdot)$

Figure 2.  Time versus $n, p, \rho$ and $\sigma$

Figure 3.  RP versus $n, p, \rho$ and $\sigma$

Table 1.  Numerical results with $n = 300$, $p = 5000$, $T = 15$, $R = 10$, $\sigma = 0.2\; \mbox{and}\; 1$, $\rho = 0.2:0.2:0.8$ and ${\bf{X}}$ follows (I)

| $\rho$ | $\sigma$ | Method | AE | RE ($10^{-2}$) | Time (s) | RP | MSES |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.2 | 0.2 | Lasso | 0.635 | 10.71 | 3.36 | 0.88 | 15.13 |
| | | MCP | 0.125 | 0.94 | 3.49 | 1 | 15 |
| | | SCAD | 0.315 | 2.56 | 3.56 | 1 | 15 |
| | | GraHTP | 0.025 | 0.28 | 0.10 | 1 | 15 |
| | | SDAR | 0.024 | 0.27 | 0.61 | 1 | 15 |
| | | SPDAS | 0.024 | 0.27 | 0.53 | 1 | 15 |
| | 1 | Lasso | 0.662 | 10.80 | 4.12 | 0.52 | 16.27 |
| | | MCP | 0.184 | 1.79 | 3.93 | 0.99 | 15.01 |
| | | SCAD | 0.362 | 3.13 | 4.42 | 1 | 15 |
| | | GraHTP | 0.124 | 1.37 | 0.15 | 1 | 15 |
| | | SDAR | 0.122 | 1.36 | 0.56 | 1 | 15 |
| | | SPDAS | 0.122 | 1.36 | 0.45 | 1 | 15 |
| 0.4 | 0.2 | Lasso | 0.648 | 10.88 | 3.39 | 0.9 | 15.1 |
| | | MCP | 0.115 | 0.87 | 3.60 | 1 | 15 |
| | | SCAD | 0.305 | 2.44 | 3.47 | 1 | 15 |
| | | GraHTP | 0.023 | 0.27 | 0.11 | 1 | 15 |
| | | SDAR | 0.024 | 0.27 | 0.61 | 1 | 15 |
| | | SPDAS | 0.023 | 0.27 | 0.81 | 1 | 15 |
| | 1 | Lasso | 0.683 | 11.04 | 3.61 | 0.32 | 16.27 |
| | | MCP | 0.187 | 1.78 | 4.04 | 1 | 15 |
| | | SCAD | 0.348 | 3.04 | 4.44 | 0.99 | 14.99 |
| | | GraHTP | 0.118 | 1.35 | 0.17 | 1 | 15 |
| | | SDAR | 0.121 | 1.36 | 0.57 | 1 | 15 |
| | | SPDAS | 0.118 | 1.35 | 0.73 | 1 | 15 |
| 0.6 | 0.2 | Lasso | 0.665 | 10.88 | 3.51 | 0.5 | 15.6 |
| | | MCP | 0.130 | 1.01 | 4.34 | 0.99 | 14.98 |
| | | SCAD | 0.314 | 2.60 | 3.92 | 0.99 | 14.99 |
| | | GraHTP | 0.088 | 0.80 | 0.15 | 0.96 | 15 |
| | | SDAR | 0.061 | 0.57 | 0.62 | 0.98 | 15 |
| | | SPDAS | 0.024 | 0.27 | 0.85 | 1 | 15 |
| | 1 | Lasso | 0.695 | 11.06 | 3.55 | 0.13 | 17.18 |
| | | MCP | 0.198 | 1.91 | 4.09 | 0.99 | 14.98 |
| | | SCAD | 0.362 | 3.25 | 5.17 | 0.99 | 14.99 |
| | | GraHTP | 0.166 | 1.75 | 0.20 | 0.97 | 15 |
| | | SDAR | 0.136 | 1.48 | 0.60 | 0.99 | 15 |
| | | SPDAS | 0.121 | 1.36 | 0.76 | 1 | 15 |
| 0.8 | 0.2 | Lasso | 0.738 | 11.75 | 3.50 | 0.1 | 17.95 |
| | | MCP | 0.133 | 1.04 | 4.13 | 0.99 | 14.98 |
| | | SCAD | 0.343 | 2.86 | 3.58 | 0.99 | 14.98 |
| | | GraHTP | 0.648 | 5.48 | 0.22 | 0.75 | 15 |
| | | SDAR | 0.265 | 2.27 | 0.67 | 0.89 | 15 |
| | | SPDAS | 0.037 | 0.38 | 0.85 | 0.99 | 14.98 |
| | 1 | Lasso | 0.784 | 12.17 | 3.87 | 0.02 | 20.29 |
| | | MCP | 0.192 | 1.92 | 3.97 | 0.99 | 14.98 |
| | | SCAD | 0.383 | 3.41 | 4.62 | 0.98 | 14.99 |
| | | GraHTP | 0.781 | 6.57 | 0.25 | 0.67 | 15 |
| | | SDAR | 0.368 | 3.41 | 0.62 | 0.86 | 15 |
| | | SPDAS | 0.138 | 1.49 | 0.72 | 0.99 | 14.98 |

Table 2.  Numerical results with $n = 300$, $p = 5000$, $T = 15$, $R = 10$, $\sigma = 0.2~ \mbox{and} ~ 1$, $C = 2:2:8$ and ${\bf{X}}$ follows (II)

| $C$ | $\sigma$ | Method | AE | RE ($10^{-2}$) | Time (s) | RP | MSES |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 0.2 | Lasso | 0.655 | 10.87 | 4.31 | 0.84 | 15.24 |
| | | MCP | 0.119 | 0.94 | 4.02 | 1 | 15 |
| | | SCAD | 0.310 | 2.57 | 5.45 | 1 | 15 |
| | | GraHTP | 0.025 | 0.28 | 0.12 | 1 | 15 |
| | | SDAR | 0.025 | 0.27 | 0.64 | 1 | 15 |
| | | SPDAS | 0.024 | 0.27 | 0.67 | 1 | 15 |
| | 1 | Lasso | 0.681 | 11.02 | 4.87 | 0.4 | 16.65 |
| | | MCP | 0.191 | 1.88 | 5.34 | 1 | 15 |
| | | SCAD | 0.359 | 3.21 | 3.05 | 1 | 15 |
| | | GraHTP | 0.123 | 1.38 | 0.14 | 1 | 15 |
| | | SDAR | 0.121 | 1.36 | 0.65 | 1 | 15 |
| | | SPDAS | 0.122 | 1.38 | 0.50 | 1 | 15 |
| 4 | 0.2 | Lasso | 0.648 | 10.86 | 4.89 | 0.88 | 15.12 |
| | | MCP | 0.115 | 0.90 | 5.19 | 1 | 15 |
| | | SCAD | 0.307 | 2.54 | 5.43 | 1 | 15 |
| | | GraHTP | 0.024 | 0.27 | 0.13 | 1 | 15 |
| | | SDAR | 0.024 | 0.27 | 0.60 | 1 | 15 |
| | | SPDAS | 0.024 | 0.26 | 0.70 | 1 | 15 |
| | 1 | Lasso | 0.674 | 10.97 | 4.68 | 0.41 | 16.48 |
| | | MCP | 0.174 | 1.72 | 5.39 | 1 | 15 |
| | | SCAD | 0.333 | 3.03 | 3.17 | 1 | 15 |
| | | GraHTP | 0.122 | 1.35 | 0.13 | 1 | 15 |
| | | SDAR | 0.120 | 1.34 | 0.58 | 1 | 15 |
| | | SPDAS | 0.121 | 1.34 | 0.50 | 1 | 15 |
| 6 | 0.2 | Lasso | 0.646 | 10.73 | 5.30 | 0.89 | 15.11 |
| | | MCP | 0.121 | 0.95 | 4.96 | 1 | 15 |
| | | SCAD | 0.308 | 2.58 | 5.74 | 1 | 15 |
| | | GraHTP | 0.024 | 0.27 | 0.14 | 1 | 15 |
| | | SDAR | 0.024 | 0.27 | 0.61 | 1 | 15 |
| | | SPDAS | 0.024 | 0.27 | 0.74 | 1 | 15 |
| | 1 | Lasso | 0.668 | 10.85 | 4.60 | 0.38 | 16.35 |
| | | MCP | 0.183 | 1.80 | 4.77 | 1 | 15 |
| | | SCAD | 0.349 | 3.15 | 3.44 | 1 | 15 |
| | | GraHTP | 0.120 | 1.34 | 0.15 | 1 | 15 |
| | | SDAR | 0.120 | 1.35 | 0.55 | 1 | 15 |
| | | SPDAS | 0.119 | 1.33 | 0.55 | 1 | 15 |
| 8 | 0.2 | Lasso | 0.645 | 10.76 | 4.63 | 0.95 | 15.05 |
| | | MCP | 0.124 | 0.96 | 3.47 | 1 | 15 |
| | | SCAD | 0.308 | 2.57 | 4.92 | 1 | 15 |
| | | GraHTP | 0.025 | 0.27 | 0.14 | 1 | 15 |
| | | SDAR | 0.024 | 0.27 | 0.61 | 1 | 15 |
| | | SPDAS | 0.024 | 0.27 | 0.89 | 1 | 15 |
| | 1 | Lasso | 0.663 | 10.82 | 4.51 | 0.48 | 16.23 |
| | | MCP | 0.186 | 1.81 | 4.83 | 0.99 | 15.01 |
| | | SCAD | 0.343 | 3.09 | 3.13 | 0.99 | 15.01 |
| | | GraHTP | 0.124 | 1.35 | 0.13 | 1 | 15 |
| | | SDAR | 0.121 | 1.33 | 0.62 | 1 | 15 |
| | | SPDAS | 0.124 | 1.35 | 0.59 | 1 | 15 |

Table 3.  Numerical results with $n = 600$, $p = 10000$, $T = 30$, $R = 10$, $\sigma = 0.2~ \mbox{and}~ 1$, $\rho = 0.2:0.2:0.8$ and ${\bf{X}}$ follows (III)

| $\rho$ | $\sigma$ | Method | AE | RE ($10^{-2}$) | Time (s) | RP | MSES |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.2 | 0.2 | Lasso | 0.763 | 11.49 | 13.04 | 0.81 | 30.21 |
| | | MCP | 0.222 | 1.39 | 12.89 | 1 | 30 |
| | | SCAD | 0.485 | 3.62 | 13.11 | 1 | 30 |
| | | GraHTP | 0.018 | 0.17 | 0.46 | 1 | 30 |
| | | SDAR | 0.019 | 0.17 | 3.37 | 1 | 30 |
| | | SPDAS | 0.018 | 0.17 | 4.84 | 1 | 30 |
| | 1 | Lasso | 0.769 | 11.51 | 12.25 | 0.64 | 30.45 |
| | | MCP | 0.246 | 1.73 | 18.84 | 1 | 30 |
| | | SCAD | 0.506 | 3.77 | 12.37 | 1 | 30 |
| | | GraHTP | 0.092 | 0.87 | 0.42 | 1 | 30 |
| | | SDAR | 0.094 | 0.87 | 3.51 | 1 | 30 |
| | | SPDAS | 0.092 | 0.87 | 3.69 | 1 | 30 |
| 0.4 | 0.2 | Lasso | 0.794 | 11.61 | 14.01 | 0.42 | 31.13 |
| | | MCP | 0.241 | 1.49 | 13.49 | 0.99 | 30.01 |
| | | SCAD | 0.508 | 3.76 | 13.96 | 0.99 | 30.01 |
| | | GraHTP | 0.057 | 0.38 | 0.55 | 0.98 | 30 |
| | | SDAR | 0.033 | 0.24 | 3.58 | 0.99 | 30 |
| | | SPDAS | 0.017 | 0.15 | 4.96 | 1 | 30 |
| | 1 | Lasso | 0.802 | 11.64 | 12.23 | 0.24 | 31.44 |
| | | MCP | 0.262 | 1.78 | 18.58 | 0.99 | 30 |
| | | SCAD | 0.525 | 3.89 | 12.18 | 0.99 | 30.01 |
| | | GraHTP | 0.135 | 1.05 | 0.52 | 0.97 | 30 |
| | | SDAR | 0.098 | 0.85 | 3.54 | 0.99 | 30 |
| | | SPDAS | 0.084 | 0.78 | 3.60 | 1 | 30 |
| 0.6 | 0.2 | Lasso | 0.817 | 11.75 | 13.31 | 0.19 | 32.03 |
| | | MCP | 0.461 | 2.93 | 18.34 | 0.92 | 30.05 |
| | | SCAD | 0.612 | 4.45 | 13.06 | 0.94 | 30.06 |
| | | GraHTP | 0.304 | 1.63 | 0.62 | 0.84 | 30 |
| | | SDAR | 1.364 | 8.87 | 4.06 | 0.06 | 30 |
| | | SPDAS | 0.015 | 0.14 | 4.96 | 1 | 30 |
| | 1 | Lasso | 0.825 | 11.79 | 12.08 | 0.11 | 32.51 |
| | | MCP | 0.475 | 3.15 | 15.58 | 0.91 | 30.05 |
| | | SCAD | 0.626 | 4.55 | 12.33 | 0.93 | 30.05 |
| | | GraHTP | 0.282 | 1.73 | 0.58 | 0.88 | 30 |
| | | SDAR | 1.379 | 9.08 | 3.99 | 0.05 | 30 |
| | | SPDAS | 0.089 | 0.76 | 3.26 | 0.99 | 30 |
| 0.8 | 0.2 | Lasso | 0.814 | 11.76 | 13.15 | 0.14 | 32.13 |
| | | MCP | 0.266 | 1.60 | 18.20 | 0.98 | 29.99 |
| | | SCAD | 0.535 | 3.92 | 12.63 | 0.97 | 30.01 |
| | | GraHTP | 0.178 | 0.95 | 0.56 | 0.89 | 30 |
| | | SDAR | 2.496 | 20.08 | 4.18 | 0 | 30 |
| | | SPDAS | 0.105 | 0.85 | 4.30 | 0.98 | 30.47 |
| | 1 | Lasso | 0.822 | 11.80 | 12.37 | 0.12 | 32.55 |
| | | MCP | 0.285 | 1.81 | 18.87 | 0.96 | 29.99 |
| | | SCAD | 0.545 | 4.00 | 12.69 | 0.95 | 30.01 |
| | | GraHTP | 0.257 | 1.51 | 0.55 | 0.89 | 30 |
| | | SDAR | 2.501 | 19.98 | 4.03 | 0 | 30 |
| | | SPDAS | 0.179 | 1.32 | 3.27 | 0.97 | 31.36 |

Table 4.  Coefficient estimates for the bcTCGA data set

| Gene name | Number | Lasso | MCP | SCAD | SPDAS |
| --- | --- | --- | --- | --- | --- |
| ABHD13 | 82 | - | -0.022 | - | - |
| C17orf53 | 1743 | 0.082 | - | 0.091 | - |
| CCDC56 | 2739 | 0.056 | - | 0.039 | - |
| CDC25C | 2964 | 0.028 | - | 0.027 | - |
| CDC6 | 2987 | 0.011 | - | 0.005 | 0.069 |
| CEACAM6 | 3076 | - | - | - | 0.023 |
| CENPK | 3105 | 0.018 | - | 0.011 | - |
| CRBN | 3676 | - | -0.057 | - | - |
| DTL | 4543 | 0.091 | 0.355 | 0.089 | - |
| FABP1 | 5081 | - | - | - | -0.126 |
| FAM77C | 5261 | - | - | - | 0.017 |
| FGFRL1 | 5481 | - | -0.022 | - | - |
| HBG1 | 6616 | 0.069 | - | - | - |
| HIST2H2BE | 6811 | - | -0.012 | - | - |
| KHDRBS1 | 7709 | - | 0.112 | - | - |
| KIAA0101 | 7719 | 0.007 | - | - | - |
| KLHL13 | 8002 | - | -0.013 | - | - |
| LSM12 | 8782 | 0.006 | - | - | - |
| MFGE8 | 9230 | -0.005 | - | - | - |
| MIA | 9359 | -0.006 | - | - | - |
| NBR2 | 9941 | 0.273 | 0.504 | 0.235 | 0.555 |
| NPY1R | 10311 | 0.008 | - | - | - |
| PSME3 | 12146 | 0.085 | - | 0.074 | - |
| RDM1 | 12615 | 0.058 | - | - | - |
| SETMAR | 13518 | - | -0.063 | - | - |
| SLC25A22 | 13833 | - | 0.017 | - | - |
| SLC6A4 | 14021 | - | - | - | 0.013 |
| SPAG5 | 14296 | 0.024 | 0.048 | 0.013 | 0.180 |
| SPRY2 | 14397 | -0.012 | - | -0.005 | - |
| TIMELESS | 15122 | 0.033 | - | 0.036 | - |
| TMPRSS4 | 15432 | - | - | - | 0.031 |
| TOP2A | 15535 | 0.035 | - | 0.035 | 0.128 |
| TUBA1B | 15882 | 0.021 | - | - | - |
| TYR | 15953 | - | - | - | 0.132 |
| UHRF1 | 16087 | 0.003 | - | - | - |
| VPS25 | 16315 | 0.106 | 0.307 | 0.108 | - |
[1] A. Antoniadis, Wavelet methods in statistics: Some recent developments and their applications, Statistics Surveys, 1 (2007), 16-55. doi: 10.1214/07-SS014.
[2] L. Breiman, Heuristics of instability and stabilization in model selection, The Annals of Statistics, 24 (1996), 2350-2383. doi: 10.1214/aos/1032181158.
[3] P. Breheny and J. Huang, Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection, The Annals of Applied Statistics, 5 (2011), 232-253. doi: 10.1214/10-AOAS388.
[4] E. Candes and T. Tao, The Dantzig selector: Statistical estimation when $p$ is much larger than $n$, The Annals of Statistics, 35 (2007), 2313-2351. doi: 10.1214/009053606000001523.
[5] J. Fan and R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96 (2001), 1348-1360. doi: 10.1198/016214501753382273.
[6] Q. Fan, Y. Jiao and X. Lu, A primal dual active set algorithm with continuation for compressed sensing, IEEE Transactions on Signal Processing, 62 (2014), 6276-6285. doi: 10.1109/TSP.2014.2362880.
[7] L. E. Frank and J. H. Friedman, A statistical view of some chemometrics regression tools, Technometrics, 35 (1993), 109-135.
[8] W. J. Fu, Penalized regressions: The bridge versus the lasso, Journal of Computational and Graphical Statistics, 7 (1998), 397-416. doi: 10.2307/1390712.
[9] J. Huang, Y. Jiao, L. Kang, J. Liu, Y. Liu, X. Lu and Y. Yang, On Newton screening, preprint, arXiv:2001.10616.
[10] J. Huang, Y. Jiao, B. Jin, J. Liu, X. Lu and C. Yang, A unified primal dual active set algorithm for nonconvex sparse recovery, Statistical Science, 36 (2021), 215-238. doi: 10.1214/19-sts758.
[11] J. Huang, Y. Jiao, Y. Liu and X. Lu, A constructive approach to $L_0$ penalized regression, Journal of Machine Learning Research, 19 (2018), 1-37.
[12] K. Ito and K. Kunisch, Lagrange Multiplier Approach to Variational Problems and Applications, Vol. 15, SIAM, 2008. doi: 10.1137/1.9780898718614.
[13] Y. Jiao, B. Jin and X. Lu, A primal dual active set with continuation algorithm for the $\ell_0$-regularized optimization problem, Applied and Computational Harmonic Analysis, 39 (2015), 400-426. doi: 10.1016/j.acha.2014.10.001.
[14] Y. Kim, S. Kwon and H. Choi, Consistent model selection criteria on high dimensions, The Journal of Machine Learning Research, 13 (2012), 1037-1057.
[15] B. Kummer, Newton's method for non-differentiable functions, Advances in Mathematical Optimization, 45 (1988), 114-125.
[16] P. L. Loh and M. J. Wainwright, Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima, The Journal of Machine Learning Research, 16 (2015), 559-616.
[17] S. Lv, H. Lin, H. Lian and J. Huang, Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space, The Annals of Statistics, 46 (2018), 781-813. doi: 10.1214/17-AOS1567.
[18] L. Qi and J. Sun, A nonsmooth version of Newton's method, Mathematical Programming, 58 (1993), 353-367. doi: 10.1007/BF01581275.
[19] Y. Shi, J. Huang, Y. Jiao and Q. Yang, A semismooth Newton algorithm for high-dimensional nonconvex sparse learning, IEEE Transactions on Neural Networks and Learning Systems, 2019. doi: 10.1109/TNNLS.2019.2935001.
[20] A. Tan and J. Huang, Bayesian inference for high-dimensional linear regression under mnet priors, The Canadian Journal of Statistics, 44 (2016), 180-197. doi: 10.1002/cjs.11283.
[21] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58 (1996), 267-288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
[22] P. Tseng, Convergence of a block coordinate descent method for nondifferentiable minimization, Journal of Optimization Theory and Applications, 109 (2001), 475-494. doi: 10.1023/A:1017501703105.
[23] M. J. Wainwright, Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso), IEEE Transactions on Information Theory, 55 (2009), 2183-2202. doi: 10.1109/TIT.2009.2016018.
[24] L. Wang, Y. Kim and R. Li, Calibrating non-convex penalized regression in ultra-high dimension, The Annals of Statistics, 41 (2013), 2505-2536. doi: 10.1214/13-AOS1159.
[25] C. Yi and J. Huang, Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression, Journal of Computational and Graphical Statistics, 26 (2017), 547-557. doi: 10.1080/10618600.2016.1256816.
[26] X. Yuan, P. Li and T. Zhang, Gradient hard thresholding pursuit for sparsity-constrained optimization, Proceedings of the 31st International Conference on Machine Learning, PMLR, 32 (2014), 127-135.
[27] C. H. Zhang, Nearly unbiased variable selection under minimax concave penalty, The Annals of Statistics, 38 (2010), 894-942. doi: 10.1214/09-AOS729.
[28] C. H. Zhang and J. Huang, The sparsity and bias of the lasso selection in high-dimensional linear regression, The Annals of Statistics, 36 (2008), 1567-1594. doi: 10.1214/07-AOS520.
[29] C. H. Zhang and T. Zhang, A general theory of concave regularization for high-dimensional sparse estimation problems, Statistical Science, 27 (2012), 576-593. doi: 10.1214/12-STS399.
[30] P. Zhao and B. Yu, On model selection consistency of lasso, Journal of Machine Learning Research, 7 (2006), 2541-2563.
[31] H. Zou, The adaptive lasso and its oracle properties, Journal of the American Statistical Association, 101 (2006), 1418-1429. doi: 10.1198/016214506000000735.
[32] H. Zou and H. H. Zhang, On the adaptive elastic-net with a diverging number of parameters, The Annals of Statistics, 37 (2009), 1733-1751. doi: 10.1214/08-AOS625.
