doi: 10.3934/bdia.2017020
Online First

Online First articles are published articles within a journal that have not yet been assigned to a formal issue. This means they do not yet have a volume number, issue number, or page numbers assigned to them; however, they can still be found and cited using their DOI (Digital Object Identifier). Online First publication benefits the research community by making new scientific discoveries known as quickly as possible.


A category-based probabilistic approach to feature selection

1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St.W., Toronto, Ontario M5H 2K1, Canada

Early access August 2018

A high-dimensional, large-sample categorical data set with a response variable may contain many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also yields significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supportive experiments are presented.

Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A category-based probabilistic approach to feature selection. Big Data & Information Analytics, doi: 10.3934/bdia.2017020
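
The tables below report the association measures $\tau(Y|X_2)$ and $\lambda(Y|X_2)$ together with the size of the joint domain $|\mbox{Domain}(X_2, Y)|$. These appear to be the Goodman-Kruskal tau and lambda for predicting the categorical response $Y$ from a (possibly composite) categorical feature. As a minimal, illustrative sketch rather than the authors' implementation, both measures can be computed from a contingency table as follows; the function names and the pandas-based interface are assumptions.

```python
import numpy as np
import pandas as pd

def goodman_kruskal_tau(x, y):
    """tau(Y|X): proportional reduction in expected prediction error when
    Y is predicted proportionally within each category of X."""
    ct = pd.crosstab(x, y).to_numpy(dtype=float)
    n = ct.sum()
    p_y = ct.sum(axis=0) / n                 # marginal distribution of Y
    err_marginal = 1.0 - np.sum(p_y ** 2)    # expected error ignoring X
    row_tot = ct.sum(axis=1, keepdims=True)
    cond = np.divide(ct, row_tot, out=np.zeros_like(ct), where=row_tot > 0)
    err_cond = 1.0 - np.sum((row_tot.ravel() / n) * np.sum(cond ** 2, axis=1))
    return (err_marginal - err_cond) / err_marginal

def goodman_kruskal_lambda(x, y):
    """lambda(Y|X): proportional reduction in prediction error when the modal
    category of Y is predicted within each category of X."""
    ct = pd.crosstab(x, y).to_numpy(dtype=float)
    n = ct.sum()
    err_marginal = 1.0 - ct.sum(axis=0).max() / n  # always predict the overall mode
    err_cond = 1.0 - ct.max(axis=1).sum() / n      # predict the mode within each X category
    return (err_marginal - err_cond) / err_marginal

# Toy example with made-up categorical data.
x = pd.Series(["a", "a", "b", "b", "c", "c"])
y = pd.Series(["p", "p", "e", "e", "p", "e"])
print(goodman_kruskal_tau(x, y), goodman_kruskal_lambda(x, y))
```

Both measures equal 1 exactly when the categories of the explanatory variable determine $Y$ without error, which matches the rows of Tables 1 and 2 where the fully selected feature sets reach $\tau = \lambda = 1$.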

Table 1.  Feature selection by the original variables

Original Features | $|\mbox{Domain}(X_2, Y)|$ | $\tau(Y|X_2)$ | $\lambda(Y|X_2)$ | $EG$
1 | 18 | 0.9429 | 0.9693 | 0.4797
2 | 46 | 0.9782 | 0.9877 | 0.7718
3 | 108 | 0.9907 | 0.9939 | 0.9076
4 | 192 | 1 | 1 | 0.9490

Table 2.  Feature selection by the dummy variables

Merged Features | $|\mbox{Domain}(X'_2, Y)|$ | $\tau(Y|X'_2)$ | $\lambda(Y|X'_2)$ | $EG$
4 | 16 | 0.9445 | 0.9693 | 0.2098
4 | 24 | 0.9908 | 0.9939 | 0.2143
5 | 30 | 0.9962 | 0.9979 | 0.4669
6 | 38 | 1 | 1 | 0.6638

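Tables 2 and 4 are labelled "feature selection by the dummy variables" and list "merged features", which suggests that the categories of the explanatory variables are dummy-coded and then kept, dropped, or merged at the category level rather than at the whole-variable level. The snippet below is only a hedged illustration of that kind of category-level manipulation; the column names, the grouping dictionary, and the helper functions are hypothetical and not the paper's notation.

```python
import pandas as pd

def merge_categories(s, groups):
    """Merge categories of one categorical column. `groups` maps a new label
    to the old categories it absorbs; unlisted categories stay unchanged."""
    mapping = {old: new for new, olds in groups.items() for old in olds}
    return s.map(lambda v: mapping.get(v, v))

def dummy_code(df, columns):
    """Dummy-code (one-hot encode) the given categorical columns so that each
    remaining category becomes a binary feature that can be selected on its own."""
    return pd.get_dummies(df, columns=columns, prefix=columns)

# Hypothetical data: X1 has sparse categories "b" and "c" that are merged first.
df = pd.DataFrame({"X1": ["a", "a", "b", "c", "d", "d"],
                   "Y":  ["p", "p", "e", "e", "p", "e"]})
df["X1"] = merge_categories(df["X1"], {"other": ["b", "c"]})
print(dummy_code(df[["X1"]], ["X1"]))
```

Merging noninformative categories shrinks the joint domain $|\mbox{Domain}(X'_2, Y)|$, which is the effect visible when comparing Table 2 with Table 1: 38 versus 192 joint categories while both selections reach $\tau = \lambda = 1$.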
Table 3.  Feature selection by the original variables

Original Features | $|\mbox{Domain}(X_2, Y)|$ | $\tau(Y|X_2)$ | $\lambda(Y|X_2)$ | $EG$
1 | 66 | 0.3005 | 0.3444 | 0.8201
2 | 252 | 0.3948 | 0.4391 | 0.9046
3 | 1830 | 0.4383 | 0.4648 | 0.9833

Table 4.  Feature selection by the dummy variables

Merged Features | $|\mbox{Domain}(X'_2, Y)|$ | $\tau(Y|X'_2)$ | $\lambda(Y|X'_2)$ | $EG$
2 | 24 | 0.3242 | 0.3934 | 0.5491
2 | 36 | 0.3573 | 0.4165 | 0.6242
2 | 48 | 0.3751 | 0.4234 | 0.6388
3 | 96 | 0.3901 | 0.4234 | 0.7035
4 | 186 | 0.4017 | 0.4269 | 0.7774
4 | 282 | 0.4121 | 0.4317 | 0.8066
5 | 558 | 0.4221 | 0.4548 | 0.8782
6 | 966 | 0.4314 | 0.4768 | 0.8968
7 | 1716 | 0.4436 | 0.4856 | 0.9135

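In both experiments the rows add one feature at a time while $\tau(Y|X'_2)$, $\lambda(Y|X'_2)$, and the joint domain size grow, which is consistent with a greedy forward search over the (original or merged) categorical features. A minimal sketch of such a search is given below, scoring candidate feature sets with the Goodman-Kruskal tau function from the earlier sketch; the composite-variable construction, the scoring choice, and the stopping rule `k` are assumptions rather than the authors' exact procedure.

```python
def forward_select(df, target, candidates, score=goodman_kruskal_tau, k=4):
    """Greedy forward selection: at each step add the candidate feature whose
    inclusion gives the largest association score with the target."""
    selected = []
    y = df[target]
    for _ in range(min(k, len(candidates))):
        best_feat, best_score = None, -1.0
        for f in candidates:
            if f in selected:
                continue
            # Treat the selected features plus the candidate as one composite
            # categorical variable whose categories are joint category combinations.
            joint = df[selected + [f]].astype(str).apply("|".join, axis=1)
            s = score(joint, y)
            if s > best_score:
                best_feat, best_score = f, s
        selected.append(best_feat)
        print(len(selected), best_feat, round(best_score, 4))
    return selected
```

Called as, say, `forward_select(data, "Y", ["X1", "X2", "X3", "X4"])` on a hypothetical data frame `data`, this prints one line per added feature with its $\tau$ value, mirroring the row-by-row layout of Tables 1 through 4.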