October  2016, 1(4): 341-347. doi: 10.3934/bdia.2016014

Increase statistical reliability without losing predictive power by merging classes and adding variables

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou, 510006, China

2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

* Corresponding authors: Wenxue Huang and Xiaofeng Li

Revised  April 2017 Published  April 2017

Adding explanatory variables to a probability model usually increases the degree of association but risks losing statistical reliability. In this article, we propose an approach that merges classes within the categorical explanatory variables before each addition, so as to maintain statistical reliability while increasing predictive power step by step.
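The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes that the $\tau$ and $\lambda$ columns reported in the article's tables are the standard Goodman-Kruskal measures computed from a contingency table of $X$ against $Y$; the counts, the function names, and the choice of which classes to merge are hypothetical and are not taken from the paper. Merging two classes of $X$ amounts to summing their rows, which can never raise $\tau$ but leaves more observations in each class.

```python
import numpy as np

def gk_lambda(table):
    """Goodman-Kruskal lambda(Y|X). Rows index X classes, columns index Y classes."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()                          # joint probabilities p(x, y)
    best_marginal = p.sum(axis=0).max()      # max_y p(y)
    best_conditional = p.max(axis=1).sum()   # sum_x max_y p(x, y)
    return (best_conditional - best_marginal) / (1.0 - best_marginal)

def gk_tau(table):
    """Goodman-Kruskal tau(Y|X). Rows index X classes, columns index Y classes."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1)                       # marginal p(x)
    py = p.sum(axis=0)                       # marginal p(y)
    keep = px > 0                            # skip empty X classes
    within = (p[keep] ** 2 / px[keep, None]).sum()   # sum_x sum_y p(x,y)^2 / p(x)
    base = (py ** 2).sum()                   # sum_y p(y)^2
    return (within - base) / (1.0 - base)

def merge_classes(table, i, j):
    """Merge X classes i and j by summing their rows (hypothetical helper)."""
    t = np.asarray(table, dtype=float)
    merged_row = t[i] + t[j]
    rest = np.delete(t, [i, j], axis=0)
    return np.vstack([rest, merged_row])

# Hypothetical counts: three X classes (rows) against two Y classes (columns).
counts = np.array([[30, 10],
                   [28, 12],
                   [5, 35]])

# Rows 0 and 1 have similar Y distributions, so they are natural merge candidates.
merged = merge_classes(counts, 0, 1)

print("tau    before / after merge:", gk_tau(counts), gk_tau(merged))
print("lambda before / after merge:", gk_lambda(counts), gk_lambda(merged))
```

In this toy table the two merged classes have nearly the same conditional distribution of $Y$, so the association measures change very little while the number of classes drops from three to two. How the article chooses which classes to merge, and how it measures reliability, is not reproduced here.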

Citation: Wenxue Huang, Xiaofeng Li, Yuanyi Pan. Increase statistical reliability without losing predictive power by merging classes and adding variables. Big Data & Information Analytics, 2016, 1 (4) : 341-347. doi: 10.3934/bdia.2016014
Table 1.  Feature selection without merging: Occupation

| $X$ | $\tau^{Y \mid X}$ | $\lambda^{Y \mid X}$ | $E(\mbox{Gini}(X \mid Y))$ |
|---|---|---|---|
| Age group | 0.1344 | 0.0311 | 0.8773 |
| Age group + Sex | 0.1511 | 0.0476 | 0.9228 |

Table 2.  Feature selection with merging: Occupation

| $X$ | $\tau_b^{(Y \mid X)}$ | $\lambda^{(Y \mid X)}$ | $E(\mbox{Gini}(X \mid Y))$ |
|---|---|---|---|
| Age group' + Sex | 0.1484 | 0.0375 | 0.6688 |
| (Age group' + Sex)' + Education' | 0.1542 | 0.0447 | 0.6620 |

Table 3.  Comparison of different merging thresholds: Occupation

| $X$ | $\phi^{st}(Y \mid X)$ | $\lambda^{(Y \mid X)}$ | $\tau^{(Y \mid X)}$ | $E(\mbox{Gini}(X \mid Y))$ |
|---|---|---|---|---|
| Age group | - | 0.0311 | 0.1344 | 0.8773 |
| Age group' + Sex | 0.0005 | 0.0414 | 0.1493 | 0.9222 |
| Age group' + Sex | 0.0030 | 0.0375 | 0.1484 | 0.6688 |
| Age group' + Sex | 0.0100 | 0.0000 | 0.0209 | 0.2710 |

Table 4.  Comparison of different merging thresholds

| $X$ | $\lambda^{(Y \mid X)}$ | $\tau^{(Y \mid X)}$ | $E(\mbox{Gini}(X \mid Y))$ |
|---|---|---|---|
| Rooms | 0.3443598 | 0.3004656 | 0.8200656 |
| Rooms' + Tenure' | 0.4255117 | 0.3583277 | 0.7911177 |
| (Rooms' + Tenure')' + bedroom' | 0.4381247 | 0.3901767 | 0.7165204 |