A category-based probabilistic approach to feature selection
1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada
A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association with the response but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supporting experiments are presented.
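To make the idea of a redundant category concrete, the following is a minimal sketch and not the procedure proposed in the paper. It assumes the Goodman-Kruskal tau as the association measure, and the function name `gk_tau` and the toy data are hypothetical. Two categories whose conditional response distributions coincide are merged; the association is unchanged while the explanatory variable ends up with fewer categories.

```python
from collections import Counter


def gk_tau(xs, ys):
    """Goodman-Kruskal tau of the response ys given the explanatory xs."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # sum over (x, y) of p(x, y)^2 / p(x)
    cond = sum(c * c / px[x] for (x, _), c in joint.items()) / n
    # sum over y of p(y)^2
    marg = sum(c * c for c in py.values()) / (n * n)
    return (cond - marg) / (1 - marg)


# Toy data (assumed): categories "a" and "b" both have P(y = 0 | x) = 3/4,
# so keeping them apart adds no information about the response.
xs = ["a"] * 4 + ["b"] * 8 + ["c"] * 4
ys = [0, 0, 0, 1] + [0] * 6 + [1] * 2 + [1] * 4

# Merge the two redundant categories into one.
merged = ["ab" if x in ("a", "b") else x for x in xs]

print(len(set(xs)), gk_tau(xs, ys))
print(len(set(merged)), gk_tau(merged, ys))
```

In this toy case the merged variable has two categories instead of three and the same tau, so each remaining category is supported by more samples. With real data, conditional distributions rarely coincide exactly, and deciding which categories to merge or drop is the question the paper addresses.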
Original Features | ||||
1 | 18 | 0.9429 | 0.9693 | 0.4797 |
2 | 46 | 0.9782 | 0.9877 | 0.7718 |
3 | 108 | 0.9907 | 0.9939 | 0.9076 |
4 | 192 | 1 | 1 | 0.9490 |
Merged Features | ||||
4 | 16 | 0.9445 | 0.9693 | 0.2098 |
4 | 24 | 0.9908 | 0.9939 | 0.2143 |
5 | 30 | 0.9962 | 0.9979 | 0.4669 |
6 | 38 | 1 | 1 | 0.6638 |
OrigVarFeatures | ||||
1 | 66 | 0.3005 | 0.3444 | 0.8201 |
2 | 252 | 0.3948 | 0.4391 | 0.9046 |
3 | 1830 | 0.4383 | 0.4648 | 0.9833 |
Merged Features | ||||
2 | 24 | 0.3242 | 0.3934 | 0.5491 |
2 | 36 | 0.3573 | 0.4165 | 0.6242 |
2 | 48 | 0.3751 | 0.4234 | 0.6388 |
3 | 96 | 0.3901 | 0.4234 | 0.7035 |
4 | 186 | 0.4017 | 0.4269 | 0.7774 |
4 | 282 | 0.4121 | 0.4317 | 0.8066 |
5 | 558 | 0.4221 | 0.4548 | 0.8782 |
6 | 966 | 0.4314 | 0.4768 | 0.8968 |
7 | 1716 | 0.4436 | 0.4856 | 0.9135 |