
Privacy preserving feature selection and Multiclass Classification for horizontally distributed data

  • * Corresponding author: Meng Han

  • Over the last two decades, many scientific fields have seen rapid growth in data volume and complexity, which brings data miners both opportunities and challenges. In the era of big data, applying data mining techniques to data assembled from multiple parties (or sources) has become a leading trend. However, such data mining tasks may disclose individuals' private information, which has increased concern about privacy preservation. In this work, a privacy-preserving feature selection method (PPFS-IFW) and a multiclass classification method (PPM2C) are proposed. Experiments were conducted to validate the performance of the proposed approaches, and both PPFS-IFW and PPM2C were tested on six benchmark datasets. The results demonstrate that PPFS-IFW improves classification accuracy by selecting informative features. PPFS-IFW not only preserves private information but also outperforms some other state-of-the-art feature selection approaches. The experimental results also show that the proposed PPM2C method is workable and stable. In particular, it reduces the risk of over-fitting compared with the regular Support Vector Machine. Meanwhile, the Secure Sum Protocol is employed to encrypt data at the bottom layer, which preserves users' privacy.
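
    The abstract names the Secure Sum Protocol as the privacy layer but does not spell it out here. As a rough illustration only, the following is a minimal single-process simulation of the textbook ring-based secure sum, assuming each party holds one non-negative integer smaller than a public modulus and that the true total is also below that modulus. The function name `secure_sum`, the parameter names, and the sample values are hypothetical and are not taken from the paper; the authors' actual protocol may differ.

```python
import secrets


def secure_sum(local_values, modulus):
    """Simulate a ring-based secure sum over the parties' private integers.

    Assumption (not from the paper): every value lies in [0, modulus) and
    the true total is below `modulus`, so the result is the exact sum.
    """
    if not local_values:
        raise ValueError("need at least one party")
    if any(v < 0 or v >= modulus for v in local_values):
        raise ValueError("each value must lie in [0, modulus)")

    # The initiating party draws a uniformly random mask and adds its value.
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus

    # Each remaining party sees only the masked running total, adds its own
    # value, and forwards the result to the next party on the ring.
    for value in local_values[1:]:
        running = (running + value) % modulus

    # The initiator removes its mask to recover the aggregate.
    return (running - mask) % modulus


# Hypothetical per-site counts held by three data owners.
site_counts = [12, 7, 30]
total = secure_sum(site_counts, modulus=2**32)
```

    In this sketch no party other than the initiator ever sees an unmasked partial sum, which is the property the protocol relies on. The basic ring variant does not protect a party whose two neighbours collude, so a deployment would likely need additional safeguards.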

    Mathematics Subject Classification: Primary: 58F15, 58F17; Secondary: 53C35.


  • Figure 1.  Workflow of PPM2C

    Figure 2.  Classification accuracy improved by PPFS-IFW under CV1 scenario

    Figure 3.  Classification accuracy improved by PPFS-IFW under CV2 scenario

    Figure 4.  Classification Accuracy comparison before and after feature selection (PPFS-IFW)

    Figure 5.  Comparison of classification accuracy for PPM2C when using PAN-SVM and LIBSVM

    Figure 6.  Classification accuracy of PrivacySVM under CV1 and CV2

    Figure 7.  Classification accuracy of LIBSVM under CV1 and CV2

    Figure 8.  Classification accuracy of PrivacySVM under CV1

    Figure 9.  Classification accuracy of PrivacySVM under CV2

    Table 1.  Details of Datasets used in Evaluation of PPFS-IFW

    Dataset               num. samples   num. features   C       $\gamma$
    Diabetes (DIA)        768            8               512.0   0.0078125
    Ionosphere            351            34              8.0     0.5
    Colon                 62             2000            32.0    0.0078125
    Leukemia              72             7129            128.0   0.0001221
    Lymphoma (DLBCL)      47             4026            2.0     0.0078125
    Breast Cancer (WBC)   569            30              128.0   8.0

    Table 2.  Accuracy improved under CV1 and CV2

    Dataset      CV2       CV1       CV1 num. of features   CV2 num. of features
    DIA          3.39%     2.10%     4                      4
    Ionosphere   0.35%     3.42%     2                      8
    Colon        3.08%     8.00%     34                     157
    WBC          2.47%     1.12%     10                     4
    DLBCL        5.57%     10.95%    394                    444
    Leukemia     8.57%     3.45%     537                    631
    Sum          23.43%    29.04%    981                    1248

    Table 3.  Accuracy comparison with other methods

    Dataset   Fisher SVM   FSV     RFE SVM   KP SVM   Ours (CV2)   Ours (CV1)
    DIA       76.42        76.58   76.56     76.74    79.87        78.86
    WBC       94.7         95.23   95.25     97.55    99.11        97.81
    Colon     87.46        92.03   92.52     96.57    85.00        90.00

    Table 4.  Details of Datasets

    Dataset       num. of samples   num. of features   num. of classes
    Leukemia_3c   72                7129               3
    Leukemia_4a   72                7129               4
    DNA           2000              180                3
    Vowel         528               10                 11
    Lung          32                56                 3
    Letter        15000             16                 26