
A novel approach for solving skewed classification problem using cluster based ensemble method

  • Corresponding author: Gillala Rekha

    The first author is supported by KL University
  • In numerous real-world applications, the class imbalance problem is prevalent. When training samples of one class immensely outnumber those of the other classes, traditional machine learning algorithms are biased towards the majority class (the class with more samples), leading to a significant loss of model performance. Several techniques have been proposed to handle class imbalance, including data sampling and boosting. In this paper, we present a cluster-based oversampling with boosting algorithm (Cluster+Boost) for learning from imbalanced data. We compare the proposed approach with state-of-the-art ensemble methods such as AdaBoost, RUSBoost and SMOTEBoost. We conducted experiments on 22 data sets with various imbalance ratios. The experimental results are promising and provide an alternative approach for improving classifier performance on highly imbalanced data sets.
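The abstract specifies the method only as clustering the minority class, oversampling within clusters, and boosting. A minimal pure-Python sketch of that general idea (toy 1-D k-means, interpolation-based oversampling; all names and the 1-D setting are hypothetical illustrations, not the authors' implementation):

```python
import random

def kmeans_1d(xs, k, iters=20, seed=0):
    """Toy 1-D k-means used to group minority samples before oversampling."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    assign = [0] * len(xs)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, a in zip(xs, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign

def cluster_oversample(minority, n_needed, k=2, seed=0):
    """Generate n_needed synthetic minority points by interpolating
    between random pairs drawn from the same cluster."""
    rng = random.Random(seed)
    assign = kmeans_1d(minority, k, seed=seed)
    clusters = [[x for x, a in zip(minority, assign) if a == c] for c in range(k)]
    clusters = [c for c in clusters if c]          # drop empty clusters
    synthetic = []
    for _ in range(n_needed):
        cl = rng.choice(clusters)
        a, b = rng.choice(cl), rng.choice(cl)
        synthetic.append(a + rng.random() * (b - a))
    return synthetic

majority = [float(i) for i in range(90)]           # 90 majority samples
minority = [1.0, 1.2, 1.1, 8.0, 8.3, 8.1]          # 6 minority samples, two clumps
new_pts = cluster_oversample(minority, len(majority) - len(minority))
balanced_minority = minority + new_pts
print(len(balanced_minority))  # 90, matching the majority count
```

In the full method, each boosting round (e.g. AdaBoost) would then be trained on such a rebalanced sample; interpolating within a cluster avoids generating synthetic points in the empty region between minority clumps.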

    Mathematics Subject Classification: Primary: 58F15, 58F17; Secondary: 53C35.

  • Figure 1.  Framework of Cluster-based Oversampling with Boosting

    Table 1.  Dataset Characteristics

    Dataset  Size  # Attr.  IR
    ecoli-0_vs_1 220 7 1.82
    ecoli1 336 7 3.36
    ecoli2 336 7 5.46
    ecoli3 336 7 8.6
    glass0 214 9 2.06
    glass-0-1-2-3_vs_4-5-6 214 9 3.2
    glass1 214 9 1.82
    glass6 214 9 6.38
    haberman 306 3 2.78
    iris0 150 4 2
    new-thyroid1 215 5 5.14
    new-thyroid2 215 5 5.14
    page-blocks0 5472 10 8.79
    pima 768 8 1.87
    segment0 2308 19 6.02
    vehicle0 846 18 3.25
    vehicle1 846 18 2.9
    vehicle2 846 18 2.88
    vehicle3 846 18 2.99
    wisconsin 683 9 1.86
    yeast1 1484 8 2.46
    yeast3 1484 8 8.1
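The IR column in Table 1 is the imbalance ratio, i.e. majority-class size divided by minority-class size. A small self-contained check (the 142/78 split is an illustrative reconstruction consistent with the reported size and IR, not taken from the data files):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# ecoli-0_vs_1 has 220 samples and IR 1.82, i.e. roughly 142 vs 78:
labels = [0] * 142 + [1] * 78
print(round(imbalance_ratio(labels), 2))  # 1.82
```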

    Table 2.  Performance of the sampling techniques across all datasets using the AUC metric

    Dataset AdaBoost RUSBoost SMOTEBoost Cluster+Boost
    ecoli-0_vs_1 0.6354 0.794 0.799 0.992
    ecoli1 0.778 0.883 0.899 0.985
    ecoli2 0.703 0.899 0.967 0.97
    ecoli3 0.681 0.856 0.955 0.986
    glass0 0.74 0.813 0.912 0.974
    glass-0-1-2-3_vs_4-5-6 0.703 0.91 0.987 0.987
    glass1 0.952 0.763 0.985 0.987
    glass6 0.947 0.918 0.991 0.997
    haberman 0.947 0.656 0.947 0.942
    iris0 0.949 0.98 0.978 0.981
    new-thyroid1 0.947 0.975 0.947 0.986
    new-thyroid2 0.687 0.961 0.987 0.994
    page-blocks0 0.637 0.953 0.967 0.996
    pima 0.6223 0.751 0.897 0.899
    segment0 0.996 0.994 0.998 0.998
    vehicle0 0.943 0.965 0.968 0.978
    vehicle1 0.754 0.768 0.897 0.899
    vehicle2 0.854 0.966 0.967 0.978
    vehicle3 0.745 0.763 0.894 0.894
    wisconsin 0.9 0.96 0.994 0.894
    yeast1 0.7589 0.7382 0.741 0.996
    yeast3 0.93 0.944 0.944 0.994
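The AUC values in Table 2 can be computed without any library via the Mann-Whitney formulation; a minimal sketch on hypothetical toy scores:

```python
def auc(scores, labels):
    """Probability that a random positive is scored above a random
    negative (ties count half): the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0: perfect ranking
print(auc([0.9, 0.2, 0.8, 0.1], [1, 0, 0, 1]))  # 0.5: no better than chance
```

Unlike accuracy, this ranking-based metric is insensitive to the class proportions, which is why it is the standard choice for imbalanced benchmarks.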

    Table 3.  Performance of the sampling techniques across all datasets using the F-measure metric

    Dataset AdaBoost RUSBoost SMOTEBoost Cluster+Boost
    ecoli-0_vs_1 0.632 0.795 0.799 0.995
    ecoli1 0.778 0.89 0.899 0.992
    ecoli2 0.71 0.899 0.967 0.986
    ecoli3 0.681 0.856 0.955 0.964
    glass0 0.74 0.813 0.912 0.982
    glass-0-1-2-3_vs_4-5-6 0.703 0.91 0.987 0.995
    glass1 0.952 0.763 0.985 0.99
    glass6 0.947 0.918 0.991 0.994
    haberman 0.947 0.656 0.947 0.942
    iris0 0.949 0.98 0.894 0.993
    new-thyroid1 0.947 0.975 0.947 0.983
    new-thyroid2 0.687 0.961 0.987 0.983
    page-blocks0 0.637 0.953 0.967 0.998
    pima 0.6223 0.751 0.897 0.894
    segment0 0.996 0.994 0.998 0.998
    vehicle0 0.943 0.965 0.988 0.984
    vehicle1 0.754 0.768 0.897 0.894
    vehicle2 0.854 0.966 0.967 0.941
    vehicle3 0.745 0.763 0.894 0.894
    wisconsin 0.9 0.96 0.994 0.997
    yeast1 0.7589 0.7382 0.741 0.979
    yeast3 0.93 0.944 0.944 0.974
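The F-measure reported in Table 3 is the harmonic mean of precision and recall on the minority (positive) class; a minimal sketch on a hypothetical prediction vector:

```python
def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# One of two positives found, no false alarms: precision 1.0, recall 0.5
print(round(f_measure([1, 1, 0, 0], [1, 0, 0, 0]), 3))  # 0.667
```

Because both precision and recall are defined with respect to the minority class, a classifier that simply predicts the majority class everywhere scores 0, again making the metric suitable for imbalanced data.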
  • [1] A. Ali, S. M. Shamsuddin and A. L. Ralescu, Classification with class imbalance problem: A review, Int. J. Adv. Soft Comput. Appl., 7 (2015), 176-204.
    [2] J. Alcalá-Fdez, L. Sánchez, S. Garcia, M. J. del Jesus, S. Ventura, J. M. Garrell, J. Otero, C. Romero, J. Bacardit and V. M. Rivas, et al., KEEL: A software tool to assess evolutionary algorithms for data mining problems, Soft Computing, 13 (2009), 307-318.
    [3] C. Bunkhumpornpat, K. Sinapiromsaran and C. Lursinsap, Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem, in Proceedings of the IEEE Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 5476 (2009), 475-482. doi: 10.1007/978-3-642-01307-2_43.
    [4] S. Barua, M. M. Islam, X. Yao and K. Murase, MWMOTE: Majority weighted minority oversampling technique for imbalanced data set learning, IEEE Transactions on Knowledge and Data Engineering, 26 (2014), 405-425. doi: 10.1109/TKDE.2012.232.
    [5] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30 (1997), 1145-1159. doi: 10.1016/S0031-3203(96)00142-2.
    [6] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16 (2002), 321-357. doi: 10.1613/jair.953.
    [7] N. V. Chawla, A. Lazarevic, L. O. Hall and K. W. Bowyer, SMOTEBoost: Improving prediction of the minority class in boosting, in European Conference on Principles of Data Mining and Knowledge Discovery, Springer, 2003, 107-119. doi: 10.1007/978-3-540-39804-2_12.
    [8] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince and F. Herrera, A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42 (2012), 463-484. doi: 10.1109/TSMCC.2011.2161285.
    [9] V. García, R. A. Mollineda and J. S. Sánchez, On the k-NN performance in a challenging scenario of imbalance and overlapping, Pattern Analysis and Applications, 11 (2008), 269-280. doi: 10.1007/s10044-007-0087-5.
    [10] H. He, Y. Bai, E. A. Garcia and S. Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in Proceedings of the IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008, 1322-1328.
    [11] S. Hu, Y. Liang, L. Ma and Y. He, MSMOTE: Improving classification performance when training data is imbalanced, in Proceedings of the Second International Workshop on Computer Science and Engineering, IEEE, 2 (2009), 13-17. doi: 10.1109/WCSE.2009.756.
    [12] H. Han, W.-Y. Wang and B.-H. Mao, Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in Proceedings of the International Conference on Intelligent Computing, Springer, 2005, 878-887. doi: 10.1007/11538059_91.
    [13] M. Krstic and M. Bjelica, Impact of class imbalance on personalized program guide performance, IEEE Transactions on Consumer Electronics, 61 (2015), 90-95. doi: 10.1109/TCE.2015.7064115.
    [14] M. Lin, K. Tang and X. Yao, Dynamic sampling approach to training neural networks for multiclass imbalance classification, IEEE Transactions on Neural Networks and Learning Systems, 24 (2013), 647-660.
    [15] W.-Z. Lu and D. Wang, Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme, Science of the Total Environment, 395 (2008), 109-116. doi: 10.1016/j.scitotenv.2008.01.035.
    [16] W.-C. Lin, C.-F. Tsai, Y.-H. Hu and J.-S. Jhang, Clustering-based undersampling in class-imbalanced data, Information Sciences, 409/410 (2017), 17-26. doi: 10.1016/j.ins.2017.05.008.
    [17] G. Rekha, A. K. Tyagi and V. Krishna Reddy, A wide scale classification of class imbalance problem and its solutions: A systematic literature review, Journal of Computer Science, 15 (2019), 886-929. doi: 10.3844/jcssp.2019.886.929.
    [18] G. Rekha, A. K. Tyagi and V. Krishna Reddy, Solving class imbalance problem using bagging, boosting techniques, with and without using noise filtering method, International Journal of Hybrid Intelligent Systems, 15 (2019), 67-76. doi: 10.3233/HIS-190261.
    [19] F. Rayhan, S. Ahmed, A. Mahbub, M. Jani, S. Shatabda and D. M. Farid, et al., CUSBoost: Cluster-based under-sampling with boosting for imbalanced classification, in 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), 2017, arXiv:1712.04356. doi: 10.1109/CSITSS.2017.8447534.
    [20] S. Ruggieri, Efficient C4.5 [classification algorithm], IEEE Transactions on Knowledge and Data Engineering, 14 (2002), 438-444.
    [21] Y. Sun, A. K. Wong and M. S. Kamel, Classification of imbalanced data: A review, International Journal of Pattern Recognition and Artificial Intelligence, 23 (2009), 687-719. doi: 10.1142/S0218001409007326.
    [22] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse and A. Napolitano, RUSBoost: A hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 40 (2010), 185-197. doi: 10.1109/TSMCA.2009.2029559.
    [23] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2012.
    [24] S. Wang and X. Yao, Diversity analysis on imbalanced data sets by using ensemble models, in Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining, IEEE, 2009, 324-331. doi: 10.1109/CIDM.2009.4938667.
