December 2019, 1(4): 389-417. doi: 10.3934/fods.2019016

Issues using logistic regression with class imbalance, with a case study from credit risk modelling

Department of Mathematics, Imperial College London, London, SW7 2AZ, UK

* Corresponding author: Yazhe Li

Published December 2019

The class imbalance problem arises in two-class classification problems when the less frequent (minority) class is observed much less often than the majority class. This characteristic is endemic in many problems, such as modelling default or detecting fraud. Recent work by Owen [19] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves in such a way that all data in the rare class can be replaced by their mean vector to achieve the same coefficient estimates. We build on Owen's results to show that the phenomenon remains true for both weighted and penalized likelihood methods. Such results suggest that problems may occur if there is structure within the rare class that is not captured by the mean vector. We demonstrate this problem and suggest a relabelling solution based on clustering the minority class. On simulated data and a real mortgage dataset, we show that logistic regression does not provide the best out-of-sample predictive performance and that an approach able to model underlying structure in the minority class is often superior.
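The mean-replacement behaviour described in the abstract can be illustrated numerically. The following is a minimal sketch, not the authors' code: it fits a one-covariate logistic regression by Newton's method on a large standard normal majority sample and 100 minority points, once with the minority split into two sub-groups and once with every minority point replaced by the minority mean. The sample sizes, sub-group locations, seed, and the function names `fit_logistic_1d` and `compare_mean_replacement` are all illustrative assumptions.

```python
import math
import random


def fit_logistic_1d(xs, ys, max_iter=100, tol=1e-10):
    """Maximum-likelihood fit of P(Y=1|x) = 1 / (1 + exp(-(b0 + b*x))) by Newton's method."""
    n_pos = sum(ys)
    b0 = math.log(n_pos / (len(ys) - n_pos))  # start at the log-odds of the base rate
    b = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b * x)))
            v = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += v               # entries of the 2x2 information matrix
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b


def compare_mean_replacement(n_major=20000, seed=0):
    """Fit with structured minority data and with the minority collapsed to its mean."""
    rng = random.Random(seed)
    majority = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    minority = [0.5] * 50 + [1.5] * 50            # hypothetical two-group minority
    mean_x = sum(minority) / len(minority)
    collapsed = [mean_x] * len(minority)          # every minority point set to the mean
    ys = [0] * n_major + [1] * len(minority)
    full = fit_logistic_1d(majority + minority, ys)
    mean_only = fit_logistic_1d(majority + collapsed, ys)
    return full, mean_only


(full_b0, full_b), (mean_b0, mean_b) = compare_mean_replacement()
print("structured minority:", full_b0, full_b)
print("mean-replaced minority:", mean_b0, mean_b)
```

Under strong imbalance the two fits should be close, which is the practical concern raised in the abstract: the fitted model carries almost no information about structure within the minority class beyond its mean.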

Citation: Yazhe Li, Tony Bellotti, Niall Adams. Issues using logistic regression with class imbalance, with a case study from credit risk modelling. Foundations of Data Science, 2019, 1 (4) : 389-417. doi: 10.3934/fods.2019016
References:
[1] E. I. Altman and G. Sabato, Modelling credit risk for SMEs: Evidence from the US market, Abacus, 43 (2007), 332-357. doi: 10.1111/j.1467-6281.2007.00234.x.
[2] G. E. Batista, R. C. Prati and M. C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter, 6 (2004), 20-29. doi: 10.1145/1007730.1007735.
[3] C. Bravo, L. C. Thomas and R. Weber, Improving credit scoring by differentiating defaulter behaviour, Journal of the Operational Research Society, 66 (2015), 771-781. doi: 10.1057/jors.2014.50.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16 (2002), 321-357. doi: 10.1613/jair.953.
[5] T. M. Clauretie, A note on mortgage risk: Default vs. loss rates, Real Estate Economics, 18 (1990), 202-206. doi: 10.1111/1540-6229.00517.
[6] Cornell Law School, Definition of default, date of default, and requirement of notice of default, URL https://www.law.cornell.edu/cfr/text/24/203.467.
[7] E. R. DeLong and D. L. Clarke-Pearson, Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach, Biometrics, 44 (1988), 837-845. doi: 10.2307/2531595.
[8] B. Efron and T. Hastie, Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Institute of Mathematical Statistics (IMS) Monographs, 5. Cambridge University Press, New York, 2016. doi: 10.1017/CBO9781316576533.
[9] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters, 27 (2006), 861-874. doi: 10.1016/j.patrec.2005.10.010.
[10] D. J. Hand, Reject inference in credit operations, Credit Risk Modeling: Design and Application, 181–190.
[11] A. E. Hoerl and R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics, 12 (1970), 55-67.
[12] G. King and L. Zeng, Logistic regression in rare events data, Political Analysis, 9 (2001), 137-163.
[13] G. Krempl and V. Hofer, Classification in presence of drift and latency, in Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on, IEEE, 2011, 596–603. doi: 10.1109/ICDMW.2011.47.
[14] J. Laurikkala, Improving identification of difficult small classes by balancing class distribution, Artificial Intelligence in Medicine, 2101 (2001), 63-66. doi: 10.1007/3-540-48229-6_9.
[15] X.-Y. Liu, J. Wu and Z.-H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39 (2009), 539-550.
[16] F. J. Massey Jr, The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association, 46 (1951), 68-78.
[17] F. Murtagh and P. Contreras, Algorithms for hierarchical clustering: An overview, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2 (2012), 86-97.
[18] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Applied Optimization, 87. Kluwer Academic Publishers, Boston, MA, 2004. doi: 10.1007/978-1-4419-8853-9.
[19] A. B. Owen, Infinitely imbalanced logistic regression, Journal of Machine Learning Research, 8 (2007), 761-773.
[20] O. Pons, Bootstrap of means under stratified sampling, Electronic Journal of Statistics, 1 (2007), 381-391. doi: 10.1214/07-EJS033.
[21] R. Rockafellar, Convex Analysis, Princeton University Press, Princeton, N.J. 1970.
[22] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse and A. Napolitano, Resampling or reweighting: A comparison of boosting implementations, in 2008 20th IEEE International Conference on Tools with Artificial Intelligence, IEEE, 1 (2008), 445–451. doi: 10.1109/ICTAI.2008.59.
[23] M. J. Silvapulle, On the existence of maximum likelihood estimators for the binomial response models, Journal of the Royal Statistical Society. Series B (Methodological), 43 (1981), 310-313. doi: 10.1111/j.2517-6161.1981.tb01676.x.
[24] Student, The probable error of a mean, Biometrika, 6 (1908), 1-25.
[25] L. C. Thomas, Consumer Credit Models: Pricing, Profit and Portfolios, Oxford, 2009.
[26] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), 58 (1996), 267-288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
[27] R. Tibshirani, The lasso problem and uniqueness, Electronic Journal of Statistics, 7 (2013), 1456-1490. doi: 10.1214/13-EJS815.
[28] H. Wang, Q. Xu and L. Zhou, Large unbalanced credit scoring using lasso-logistic regression ensemble, PLoS ONE, 10 (2015), e0117844. doi: 10.1371/journal.pone.0117844.
[29] W. N. van Wieringen, Lecture notes on ridge regression, arXiv preprint, arXiv: 1509.09169.
[30] G. Zeng, On the existence of maximum likelihood estimates for weighted logistic regression, Communications in Statistics-Theory and Methods, 46 (2017), 11194-11203. doi: 10.1080/03610926.2016.1260742.
[31] M. Zhu, W. Su and H. A. Chipman, Lago: A computationally efficient approach for statistical detection, Technometrics, 48 (2006), 193-205. doi: 10.1198/004017005000000643.

Figure 1.  Sample size and default rate from 2003 to 2013 in the Freddie Mac data set
Figure 2.  Scatter plot of simulation samples. Green points represent the majority class and red points represent the minority class
Figure 3.  AUC plot of different methods from test year 2003 to 2013
Figure 4.  Scatter plot of mean AUC difference in test year vs. default rate in training year
Figure 5.  Boxplots for Score, DTI, UPB, LTV and OIR in the year 2002. The $ p $-values in the plot are calculated through Student's $ t $-test [24] between "Default 1" and "Default 2" in each variable
Figure 6.  Boxplots for Score, DTI, UPB, LTV and OIR in the year 2004. The $ p $-values in the plot are calculated through Student's $ t $-test [24] between "Default 1" and "Default 2" in each variable
Figure 7.  Density plot of AUC on the training year and the four test quarters, respectively. The left side is the "with relabelling" method and the right side is "without relabelling"
Table 1.  Simulation A for infinitely imbalanced penalized logistic regression. $ N $ observations in majority class ($ Y = 0 $) following $ N(0,1) $ and 100 observations in minority class with $ Y = 1, X = 1 $
| $ N $ | Logistic: $ \beta $ | Logistic: $ Ne^{\beta_0} $ | Ridge: $ \beta_0 $ | Ridge: $ e^{\beta_0} $ | Ridge: $ Ne^{\beta_0} $ | Ridge: $ \beta $ | Ridge: $ e^{\beta} $ |
|---|---|---|---|---|---|---|---|
| 100 | 1.1215 | 41.7805 | -0.5247 | 0.5917 | 59.1750 | 0.6879 | 1.9896 |
| 1000 | 0.5656 | 65.3495 | -2.4591 | 0.0855 | 85.5127 | 0.2454 | 1.2782 |
| 10000 | 0.5013 | 68.3830 | -4.6289 | 0.0098 | 97.6581 | 0.0450 | 1.0460 |
| 100000 | 0.5007 | 68.6940 | -6.9102 | 0.0010 | 99.7516 | 0.0049 | 1.0050 |
| 1000000 | 0.5001 | 68.7254 | -9.2106 | 0.0001 | 99.9750 | 0.0005 | 1.0005 |
Table 2.  Simulation B for infinitely imbalanced penalized logistic regression. $ N $ observations in majority class ($ Y = 0 $) following $ \mathrm{Uniform}(0,1) $ and 100 observations in minority class (half of them with $ Y = 1, X = 0.5 $, the others with $ Y = 1, X = 2 $)
| $ N $ | Logistic: $ \beta $ | Logistic: $ Ne^{\beta_0} $ | Ridge: $ \beta_0 $ | Ridge: $ e^{\beta_0} $ | Ridge: $ Ne^{\beta_0} $ | Ridge: $ \beta $ | Ridge: $ e^{\beta} $ |
|---|---|---|---|---|---|---|---|
| 100 | 2.2347 | 16.2756 | -1.0602 | 0.3464 | 34.6374 | 1.2598 | 3.5246 |
| 1000 | 3.2033 | 8.4214 | -3.4516 | 0.0317 | 31.6947 | 1.6478 | 5.1958 |
| 10000 | 4.6591 | 2.8035 | -4.9902 | 0.0068 | 68.0441 | 0.7112 | 2.0364 |
| 100000 | 6.3475 | 0.7238 | -6.9521 | 0.0010 | 95.6659 | 0.0878 | 1.0918 |
| 1000000 | 8.1866 | 0.1524 | -9.2148 | 0.0001 | 99.5517 | 0.0090 | 1.0090 |
Table 3.  Infinitely imbalanced logistic regression shrinkage law
| Quantity | Logistic Regression | Ridge | Lasso |
|---|---|---|---|
| $ \beta_0 $ | $ -\infty $ | $ -\infty $ | $ -\infty $ |
| $ N e^{\beta_0} $ | some constant $ k_1 $ | $ n $ | $ n $ |
| $ \beta $ | some constant $ k_2 $ | 0 | 0 |
Table 4.  Coefficient estimates of lasso penalized logistic regression with different penalty parameter $ \lambda $
| $ \lambda $ | $ \beta_{\cdot 1} $ | $ \beta_{\cdot 2} $ | $ \beta_{\cdot 3} $ | $ \beta_{\cdot 4} $ | $ \beta_{\cdot 5} $ |
|---|---|---|---|---|---|
| 0.0190 | 0 | 0 | 0 | 0 | 0 |
| 0.0168 | 0.1650 | 0 | 0 | 0 | 0 |
| 0.0153 | 0.3106 | 0.1148 | 0 | 0 | 0 |
| 0.0139 | 0.4388 | 0.2416 | 0.0377 | 0 | 0 |
| 0.0116 | 0.6435 | 0.4445 | 0.2392 | 0.0471 | 0 |
| 0.0087 | 0.8621 | 0.6581 | 0.4525 | 0.2547 | 0.0516 |
Table 5.  Coefficients of logistic regression and two-cluster multinomial logistic regression. The left three columns are for logistic regression and the right four columns are for multinomial logistic regression
| Logistic: Coefficients | Logistic: Estimate | Logistic: $ \text{Pr}(>\lvert z\rvert) $ | Multinomial: Cluster | Multinomial: Coefficients | Multinomial: Estimate | Multinomial: $ \text{Pr}(>\lvert t\rvert) $ |
|---|---|---|---|---|---|---|
| Intercept | -5.7705 | $< 2\times 10^{-16} $ | $ c2 $ | Intercept | -7.6828 | $< 2\times 10^{-16} $ |
| $ x_1 $ | 1.1384 | $< 2\times 10^{-16} $ | $ c3 $ | Intercept | -7.6828 | $< 2\times 10^{-16} $ |
| $ x_2 $ | 1.1287 | $< 2\times 10^{-16} $ | $ c2 $ | $ x_1 $ | 0.0818 | 0.6045 |
| | | | $ c3 $ | $ x_1 $ | 2.2775 | $< 2\times 10^{-16} $ |
| | | | $ c2 $ | $ x_2 $ | 2.3532 | $< 2\times 10^{-16} $ |
| | | | $ c3 $ | $ x_2 $ | 0.0953 | 0.5412 |
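To see how the two fitted models in Table 5 score an observation differently, the estimates can be plugged into the usual logistic and multinomial-logit formulas. This is a sketch under one explicit assumption: the majority class is taken as the reference category of the multinomial model (linear predictor fixed at zero), so the minority probability is the sum of the probabilities of clusters c2 and c3. The function names are illustrative; the coefficients are copied from Table 5.

```python
import math

# Coefficients copied from Table 5.
LOGIT = {"intercept": -5.7705, "x1": 1.1384, "x2": 1.1287}
MULTINOM = {
    "c2": {"intercept": -7.6828, "x1": 0.0818, "x2": 2.3532},
    "c3": {"intercept": -7.6828, "x1": 2.2775, "x2": 0.0953},
}


def linear_predictor(coef, x1, x2):
    """Intercept plus the weighted sum of the two covariates."""
    return coef["intercept"] + coef["x1"] * x1 + coef["x2"] * x2


def p_minority_logistic(x1, x2):
    """Minority-class probability under the single logistic regression."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(LOGIT, x1, x2)))


def p_minority_multinomial(x1, x2):
    """Minority-class probability as the sum over the two minority clusters.

    Assumption: the majority class is the reference category, so its
    unnormalised weight is exp(0) = 1.
    """
    weights = [math.exp(linear_predictor(c, x1, x2)) for c in MULTINOM.values()]
    return sum(weights) / (1.0 + sum(weights))


for point in [(3.0, 0.0), (0.0, 3.0), (1.5, 1.5)]:
    print(point, p_minority_logistic(*point), p_minority_multinomial(*point))
```

With these estimates the multinomial model gives a higher minority probability than the logistic model along either single axis and a lower one between the two clusters, which is the kind of within-class structure a single mean vector cannot represent.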
Table 8.  Description of variables in the Freddie Mac data set
| Variable | Type | Description |
|---|---|---|
| Default | Categorical | Dependent variable: 1 if the borrower is more than 180 days past due on monthly installments; 0 otherwise. |
| Score | Continuous | A number, prepared by third parties, summarizing the borrower's creditworthiness, which may be indicative of the likelihood that the borrower will repay future obligations on time. |
| DTI | Continuous | Original Debt-To-Income Ratio. |
| UPB | Continuous | Unpaid Principal Balance. |
| LTV | Continuous | Original Loan-To-Value. |
| OIR | Continuous | Original Interest Rate. |
| Number of Borrowers | Categorical | The number of borrower(s) who are obligated to repay the mortgage note secured by the mortgaged property. 1 = one borrower; 2 = more than one borrower. |
| Seller | Categorical | The entity acting in its capacity as a seller of mortgages to Freddie Mac at the time of acquisition. |
| Servicer | Categorical | The entity acting in its capacity as the servicer of mortgages to Freddie Mac as of the last period for which loan activity is reported in the Dataset. |
| First Time Homebuyer | Categorical | Y = Yes; N = No. |
| Number of Units | Categorical | Denotes whether the mortgage is a one-, two-, three-, or four-unit property. |
| Occupancy Status | Categorical | O = Owner Occupied; I = Investment Property; S = Second Home; Space = Unknown. |
| Channel | Categorical | R = Retail; B = Broker; C = Correspondent; T = TPO Not Specified; Space = Unknown. |
| PPM | Categorical | Denotes whether the mortgage is a Prepayment Penalty Mortgage. Y = PPM; N = Not PPM. |
| Property Type | Categorical | CO = Condo; LH = Leasehold; PU = PUD; MH = Manufactured Housing; SF = 1-4 Fee Simple; CP = Co-op; Space = Unknown. |
| Loan Purpose | Categorical | P = Purchase; C = Cash-out Refinance; N = No Cash-out Refinance; Space = Unknown. |
Table 6.  Experiment Procedure Time Table
| Training set year | 2000 | 2001 | $ \cdots $ |
|---|---|---|---|
| Default collection years | 2001, 2002 | 2002, 2003 | $ \cdots $ |
| Testing set year | 2003 | 2004 | $ \cdots $ |
Table 9.  AUC and standard deviation in Freddie Mac experiment
| Time | With relabelling: AUC | With relabelling: DeLong | With relabelling: Bootstrap | With relabelling: Stratified | Without relabelling: AUC | Without relabelling: DeLong | Without relabelling: Bootstrap | Without relabelling: Stratified |
|---|---|---|---|---|---|---|---|---|
| 2003 Q1 | 0.879 | 0.033 | 0.035 | 0.032 | 0.873 | 0.032 | 0.028 | 0.033 |
| 2003 Q2 | 0.880 | 0.025 | 0.024 | 0.024 | 0.878 | 0.026 | 0.026 | 0.025 |
| 2003 Q3 | 0.839 | 0.035 | 0.033 | 0.031 | 0.824 | 0.039 | 0.037 | 0.038 |
| 2003 Q4 | 0.872 | 0.025 | 0.025 | 0.025 | 0.872 | 0.026 | 0.028 | 0.026 |
| 2004 Q1 | 0.808 | 0.042 | 0.041 | 0.041 | 0.804 | 0.043 | 0.041 | 0.039 |
| 2004 Q2 | 0.804 | 0.053 | 0.056 | 0.053 | 0.795 | 0.052 | 0.046 | 0.050 |
| 2004 Q3 | 0.636 | 0.067 | 0.063 | 0.067 | 0.634 | 0.075 | 0.067 | 0.073 |
| 2004 Q4 | 0.806 | 0.046 | 0.045 | 0.046 | 0.796 | 0.054 | 0.056 | 0.051 |
| 2005 Q1 | 0.865 | 0.025 | 0.024 | 0.027 | 0.805 | 0.042 | 0.045 | 0.043 |
| 2005 Q2 | 0.841 | 0.026 | 0.025 | 0.026 | 0.758 | 0.038 | 0.037 | 0.036 |
| 2005 Q3 | 0.849 | 0.021 | 0.020 | 0.022 | 0.799 | 0.033 | 0.032 | 0.033 |
| 2005 Q4 | 0.814 | 0.022 | 0.022 | 0.021 | 0.776 | 0.027 | 0.028 | 0.029 |
| 2006 Q1 | 0.817 | 0.017 | 0.016 | 0.016 | 0.797 | 0.020 | 0.021 | 0.019 |
| 2006 Q2 | 0.803 | 0.015 | 0.016 | 0.016 | 0.795 | 0.017 | 0.017 | 0.017 |
| 2006 Q3 | 0.789 | 0.016 | 0.015 | 0.015 | 0.776 | 0.018 | 0.018 | 0.018 |
| 2006 Q4 | 0.776 | 0.012 | 0.012 | 0.012 | 0.769 | 0.013 | 0.013 | 0.013 |
| 2007 Q1 | 0.697 | 0.013 | 0.013 | 0.014 | 0.713 | 0.013 | 0.012 | 0.012 |
| 2007 Q2 | 0.704 | 0.010 | 0.010 | 0.010 | 0.720 | 0.009 | 0.009 | 0.009 |
| 2007 Q3 | 0.725 | 0.008 | 0.008 | 0.008 | 0.727 | 0.008 | 0.008 | 0.008 |
| 2007 Q4 | 0.720 | 0.006 | 0.006 | 0.007 | 0.738 | 0.006 | 0.006 | 0.005 |
| 2008 Q1 | 0.837 | 0.004 | 0.004 | 0.004 | 0.838 | 0.004 | 0.005 | 0.005 |
| 2008 Q2 | 0.832 | 0.005 | 0.005 | 0.005 | 0.833 | 0.005 | 0.006 | 0.005 |
| 2008 Q3 | 0.830 | 0.006 | 0.006 | 0.007 | 0.831 | 0.006 | 0.006 | 0.007 |
| 2008 Q4 | 0.857 | 0.008 | 0.008 | 0.008 | 0.856 | 0.008 | 0.008 | 0.008 |
| 2009 Q1 | 0.804 | 0.024 | 0.023 | 0.022 | 0.805 | 0.024 | 0.023 | 0.023 |
| 2009 Q2 | 0.811 | 0.018 | 0.019 | 0.017 | 0.807 | 0.018 | 0.017 | 0.018 |
| 2009 Q3 | 0.757 | 0.013 | 0.013 | 0.013 | 0.758 | 0.013 | 0.012 | 0.013 |
| 2009 Q4 | 0.738 | 0.023 | 0.025 | 0.022 | 0.742 | 0.023 | 0.022 | 0.023 |
| 2010 Q1 | 0.825 | 0.033 | 0.034 | 0.032 | 0.829 | 0.032 | 0.029 | 0.031 |
| 2010 Q2 | 0.793 | 0.038 | 0.039 | 0.037 | 0.798 | 0.037 | 0.034 | 0.039 |
| 2010 Q3 | 0.826 | 0.034 | 0.031 | 0.034 | 0.830 | 0.033 | 0.029 | 0.033 |
| 2010 Q4 | 0.769 | 0.036 | 0.038 | 0.034 | 0.779 | 0.037 | 0.035 | 0.037 |
| 2011 Q1 | 0.789 | 0.039 | 0.037 | 0.035 | 0.780 | 0.039 | 0.043 | 0.039 |
| 2011 Q2 | 0.780 | 0.042 | 0.041 | 0.039 | 0.773 | 0.043 | 0.041 | 0.042 |
| 2011 Q3 | 0.740 | 0.048 | 0.048 | 0.044 | 0.733 | 0.049 | 0.048 | 0.046 |
| 2011 Q4 | 0.782 | 0.050 | 0.043 | 0.047 | 0.783 | 0.049 | 0.050 | 0.046 |
| 2012 Q1 | 0.861 | 0.034 | 0.032 | 0.033 | 0.868 | 0.031 | 0.031 | 0.031 |
| 2012 Q2 | 0.776 | 0.043 | 0.045 | 0.038 | 0.778 | 0.042 | 0.046 | 0.039 |
| 2012 Q3 | 0.771 | 0.045 | 0.043 | 0.045 | 0.784 | 0.045 | 0.046 | 0.043 |
| 2012 Q4 | 0.771 | 0.038 | 0.036 | 0.034 | 0.766 | 0.039 | 0.038 | 0.040 |
| 2013 Q1 | 0.769 | 0.039 | 0.037 | 0.039 | 0.772 | 0.040 | 0.039 | 0.041 |
| 2013 Q2 | 0.738 | 0.029 | 0.028 | 0.029 | 0.739 | 0.030 | 0.028 | 0.026 |
| 2013 Q3 | 0.730 | 0.040 | 0.039 | 0.041 | 0.735 | 0.042 | 0.043 | 0.041 |
| 2013 Q4 | 0.754 | 0.033 | 0.031 | 0.032 | 0.750 | 0.033 | 0.032 | 0.032 |
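For readers who want to reproduce this kind of summary on their own scores, the sketch below computes the AUC as the Mann-Whitney statistic and estimates its standard deviation with a bootstrap that resamples within each class. It assumes that the "Stratified" column refers to such a class-wise (stratified) bootstrap in the spirit of [20]; the function names, the number of replicates, and the toy scores are illustrative, and the DeLong estimator [7] is not implemented here.

```python
import random
import statistics


def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney statistic: P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def stratified_bootstrap_sd(scores_pos, scores_neg, n_boot=200, seed=0):
    """Standard deviation of the AUC over class-wise bootstrap resamples.

    Each replicate resamples defaulters and non-defaulters separately,
    so the class sizes (and hence the default rate) stay fixed.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        boot_pos = rng.choices(scores_pos, k=len(scores_pos))
        boot_neg = rng.choices(scores_neg, k=len(scores_neg))
        replicates.append(auc(boot_pos, boot_neg))
    return statistics.stdev(replicates)


# Toy scores, purely for illustration.
pos = [0.9, 0.8, 0.35]
neg = [0.1, 0.4, 0.3, 0.2]
print("AUC:", auc(pos, neg))
print("stratified bootstrap SD:", stratified_bootstrap_sd(pos, neg))
```

Keeping the class sizes fixed in each replicate matters when defaults are rare, because an ordinary bootstrap can draw replicates with very few minority cases. The pairwise loop is quadratic, so a rank-based AUC would be preferable on data sets of realistic size.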
Table 7.  $ D $ statistics from KS-test between training and test bootstrapped AUC
| Train year | Test year | Without relabelling | With relabelling |
|---|---|---|---|
| 2000 | 2003 | 0.435 | 0.420 |
| 2001 | 2004 | 0.885 | 0.679 |
| 2002 | 2005 | 0.980 | 0.398 |
| 2003 | 2006 | 1.000 | 0.842 |
| 2004 | 2007 | 1.000 | 0.855 |
| 2005 | 2008 | 0.800 | 0.289 |
| 2006 | 2009 | 0.993 | 0.930 |
| 2007 | 2010 | 0.900 | 0.930 |
| 2008 | 2011 | 0.985 | 0.983 |
| 2009 | 2012 | 0.890 | 0.827 |
| 2010 | 2013 | 0.990 | 0.795 |
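The $ D $ statistic compares two samples through their empirical distribution functions. The sketch below assumes the two-sample form of the Kolmogorov-Smirnov statistic, applied to two lists of bootstrapped AUC values; the function name `ks_statistic` is an illustrative choice, and the inputs shown are toy numbers rather than values from the experiment.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Toy bootstrapped AUC values, purely for illustration.
train_auc = [0.80, 0.82, 0.81, 0.83]
test_auc = [0.78, 0.81, 0.79, 0.80]
print("D =", ks_statistic(train_auc, test_auc))
```

A value of $ D $ near 1 means the two AUC distributions barely overlap, while a value near 0 means they are almost indistinguishable.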
Table 10.  Mean AUC difference (Hierarchical - Logistic) in each year
| Train year | Training default rate | Test year | Test default rate | AUC difference |
|---|---|---|---|---|
| 2000 | 0.41% | 2003 | 0.06% | 0.0057 |
| 2001 | 0.20% | 2004 | 0.07% | 0.0063 |
| 2002 | 0.10% | 2005 | 0.18% | 0.0578 |
| 2003 | 0.06% | 2006 | 0.89% | 0.0119 |
| 2004 | 0.07% | 2007 | 4.26% | -0.0133 |
| 2005 | 0.18% | 2008 | 3.15% | -0.0005 |
| 2006 | 0.89% | 2009 | 0.30% | -0.0003 |
| 2007 | 4.26% | 2010 | 0.09% | -0.0055 |
| 2008 | 3.15% | 2011 | 0.08% | 0.0055 |
| 2009 | 0.30% | 2012 | 0.06% | -0.0041 |
| 2010 | 0.09% | 2013 | 0.10% | -0.0011 |