Advanced Search
Article Contents
Article Contents

Using distribution analysis for parameter selection in repstream

  • * Corresponding author: Ross Callister

    * Corresponding author: Ross Callister 
Abstract Full Text(HTML) Figure(28) / Table(3) Related Papers Cited by
  • One of the most significant challenges in data clustering is the evolution of the data distributions over time. Many clustering algorithms have been introduced to deal specifically with streaming data, but common amongst them is that they require users to set input parameters. These inform the algorithm about the criteria under which data points may be clustered together. Setting the initial parameters for a clustering algorithm is itself a non-trivial task, but the evolution of the data distribution over time could mean even optimally set parameters could become non-optimal as the stream evolves. In this paper we extend the RepStream algorithm, a combination graph and density-based clustering algorithm, in a way which allows the primary input parameter, the $ K $ value, to be automatically adjusted over time. We introduce a feature called the edge distribution score which we compute for data in memory, as well as introducing an incremental method for adjusting the $ K $ parameter over time based on this score. We evaluate our methods against RepStream itself, and other contemporary stream clustering algorithms, and show how our method of automatically adjusting the $ K $ value over time leads to higher quality clustering output even when the initial parameters are set poorly.

    Mathematics Subject Classification: Primary: 62H30.


    \begin{equation} \\ \end{equation}
  • 加载中
  • Figure 1.  An illustration of the representative and point-level sparse graphs in the RepStream algorithm

    Figure 2.  The representative vertex $ R_1 $ and $ R_2 $ share a reciprocal connection at the representative level, and are also density related. The density radius of the vertices is shown as $ DR_1 $ and $ DR_2 $ respectively, which is the average distance to neighbours at the point level, multiplied by the alpha scaling factor

    Figure 3.  Intra and Inter-class edges. The Edge $ E_1 $ is considered an inter-class edge as it connects two vertices $ R_1 $ and $ R_2 $ that belong to different ground-truth classes. Edge $ E_2 $ connects two vertices belonging to the same class and thus is considered an intra-class edge

    Figure 4.  An illustration of relative edge lengths of nearest neighbours in the middle of a cluster versus near the edge of a cluster

    Figure 5.  Different cases for the distribution of edge lengths

    Figure 6.  Visualisation of the DS1 dataset

    Figure 7.  Visualisation of the DS2 dataset

    Figure 8.  Two dimensional representations of the 5 classes in the SynTest dataset. The main class is always present and steadily changes shape, the smaller classes appear at various points through the dataset, as shown in Figure 9

    Figure 9.  The class presence of the classes in the SynTest dataset. A marker indicates that the class is present in the dataset during the given time window

    Figure 10.  The evolution of the Closer dataset, showing slices of its 3 sections

    Figure 11.  Comparative purity for Tree Cover dataset

    Figure 12.  Comparative purity for Tree Cover dataset

    Figure 13.  Comparative purity for KDD 99' Cup dataset

    Figure 14.  Comparative purity for KDD 99' Cup dataset

    Figure 15.  Comparative purity for DS1 dataset

    Figure 16.  The $ K $ value selected by our dynamic $ K $ method on the DS1 dataset

    Figure 17.  Comparative purity for DS2 dataset

    Figure 18.  Comparative purity for SynTest dataset

    Figure 19.  Comparative purity for Closer dataset

    Figure 20.  Comparative purity for Tree Cover dataset

    Figure 21.  Comparative purity for KDD 99' Cup dataset

    Figure 22.  The $ K $ value selected by our dynamic $ K $ method on the KDD Dataset

    Figure 23.  F-Measure comparison vs RepStream using optimal parameters on DS1 dataset

    Figure 24.  F-Measure comparison vs RepStream using optimal parameters on DS2 dataset

    Figure 25.  F-Measure comparison vs RepStream using optimal parameters on SynTest dataset

    Figure 26.  F-Measure comparison vs RepStream using optimal parameters on Closer dataset

    Figure 27.  F-Measure comparison vs RepStream using optimal parameters on Tree Cover dataset

    Figure 28.  F-Measure comparison vs RepStream using optimal parameters on KDD 99' Cup dataset

    Table 1.  Table showing the input parameters used for each algorithm

    Algorithm name Density Parameters Grid Granularity Reclustering Method Decay Parameter Distance Parameter Other Parameters
    CluStream $ k $-means $ \delta $ $ \alpha $, $ l $ Snapshot parameters
    SWClustering $ \beta $ $ k $-means $ \epsilon $, $ N $ Window parameters
    DenStream $ \mu $, $ \beta $ $ \epsilon $
    HPStream $ \lambda $ $ r $ $ l $ Projected Dimensions
    D-Stream $ C_m $, $ C_l $, $ \beta $ $ len $ $ \lambda $
    ExCC Grid Granularity in all dimensions
    MR-Stream $C_H$, $C_L$, $\beta$, $\mu$ Hierarchical $\lambda$ $\epsilon$ $H$ Heirarchical limit
    FlockStream $\beta$, $\mu$ $\lambda$ $\epsilon$
    DeBaRa $dPts$ $f$
    DBStream $\alpha$, $w\_min$ $\lambda$, $t\_gap$ $r$
    SNCStream $\psi$, $\beta$ $\lambda$ $\epsilon$ $K$ Graph connectivity
    PASCAL Evolutionary Evolutionary EDA hyper-parameters
    HASTREAM $minPts$ Hierarchical $\lambda$
    BEStream $\Delta$, $\tau$ $\lambda$ $\xi$ $\theta$ Direction parameter
    ADStream $\xi$, $\epsilon$ $\lambda$
    PatchWork $ratio$, $minPoints$, $minCells$ $\epsilon$
    RepStream $\lambda$ $\alpha$ $K$ Graph connectivity
     | Show Table
    DownLoad: CSV

    Table 2.  Best and Worst $ K $ values for RepStream, according to $ F $ score

    Dataset Best K Best $ F $ score Worst K Worst $ F $ score
    DS1 7 0.7208 18 0.2767
    DS2 7 0.6371 21 0.2594
    SynTest 9 0.8614 5 0.5435
    Closer 9 0.7989 5 0.4345
    TreeCov 29 0.6108 5 0.2978
    KDD99 30 0.7898 5 0.2636
     | Show Table
    DownLoad: CSV

    Table 3.  Comparison of our dynamic $ K $ method versus RepStream with optimal and worst $ K $ values

    Dataset Best F-Measure Worst F-Measure Dynamic $ K $
    DS1 0.7208 0.2767 0.5153
    DS2 0.6371 0.2594 0.4137
    SynTest 0.7989 0.5435 0.7091
    Closer 0.8614 0.4345 0.8410
    TreeCov 0.6108 0.2978 0.5920
    KDD99 0.7898 0.2636 0.7882
     | Show Table
    DownLoad: CSV
  • [1] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen and C. Sohler, Streamkm++: A clustering algorithm for data streams, Journal of Experimental Algorithmics, 17 (2012), Article 2.4, 30 pp. doi: 10.1145/2133803.2184450.
    [2] C. Aggarwal, J. Han, J. Wang And P. Yu, A framework for projected clustering of high dimensional data streams, in Proceedings 2004 VLDB Conference, Elsevier, 2004, 852–863. doi: 10.1016/B978-012088469-8.50075-9.
    [3] C. C. Aggarwal, P. S. Yu, J. Han and J. Wang, A framework for clustering evolving data streams, in Proceedings 2003 VLDB Conference, Elsevier, 2003, 81–92. doi: 10.1016/B978-012722442-8/50016-1.
    [4] J. P. Barddal, H. M. Gomes and F. Enembreck, SNCStream: A social network-based data stream clustering algorithm, in Proceedings of the 30th Annual ACM Symposium on Applied Computing - SAC'15, SAC'15, ACM Press, New York, New York, USA, 2015, 935–940. doi: 10.1145/2695664.2695674.
    [5] V. BhatnagarS. Kaur and S. Chakravarthy, Clustering data streams using grid-based synopsis, Knowledge and Information Systems, 41 (2014), 127-152.  doi: 10.1007/s10115-013-0659-1.
    [6] J. A. Blackard and D. J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture, 24 (1999), 131–151. doi: 10.1016/S0168-1699(99)00046-0.
    [7] F. Cao, M. Estert, W. Qian and A. Zhou, Density-based clustering over an evolving data stream with noise, in Proceedings of the 2006 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2006, 328–339. doi: 10.1137/1.9781611972764.29.
    [8] Y. Chen and L. Tu, Density-based clustering for real-time stream data, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD'07, ACM, ACM Press, New York, New York, USA, 2007, 133–142. doi: 10.1145/1281192.1281210.
    [9] A. ForestieroC. Pizzuti and G. Spezzano, A single pass algorithm for clustering evolving data streams based on swarm intelligence, Data Mining and Knowledge Discovery, 26 (2013), 1-26.  doi: 10.1007/s10618-011-0242-x.
    [10] J. Forrest, Stream: A framework for data stream modeling in r, Bachelor Thesis, Department of Computer Science and Engineering, SMU.
    [11] M. Hahsler and M. Bolaos, Clustering data streams based on shared density between micro-clusters, IEEE Transactions on Knowledge and Data Engineering, 28 (2016), 1449–1461, http://ieeexplore.ieee.org/document/7393836/. doi: 10.1109/TKDE.2016.2522412.
    [12] S. Hettich and S. D. Bay, The UCI KDD Archive, http://kdd.ics.uci.edu, 1999.
    [13] A. K. JainM. N. Murty and P. J. Flynn, Data clustering: A review, ACM computing surveys (CSUR), 31 (1999), 264-323.  doi: 10.1145/331499.331504.
    [14] S. Kaur, V. Bhatnagar and S. Chakravarthy, Stream clustering algorithms: A primer, in Big Data in Complex Systems, Springer, 9 (2015), 105–145, http://link.springer.com/10.1007/978-3-319-11056-1. doi: 10.1007/978-3-319-11056-1_4.
    [15] H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes and B. Pfahringer, An effective evaluation measure for clustering on evolving data streams, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2011, 868–876. doi: 10.1145/2020408.2020555.
    [16] S. Lühr and M. Lazarescu, Incremental clustering of dynamic data streams using connectivity based representative points, Data & Knowledge Engineering, 68 (2009), 1–27, http://linkinghub.elsevier.com/retrieve/pii/S0169023X08001092.
    [17] J. Schneider and M. Vlachos, Fast parameterless density-based clustering via random projections, in Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management - CIKM'13, ACM, ACM Press, New York, New York, USA, 2013, 861–866. doi: 10.1145/2505515.2505590.
    [18] N. Wattanakitrungroj, S. Maneeroj and C. Lursinsap, Bestream: Batch capturing with elliptic function for one-pass data stream clustering, Data & Knowledge Engineering, 117 (2018), 53–70, http://www.sciencedirect.com/science/article/pii/S0169023X17301027. doi: 10.1016/j.datak.2018.07.002.
    [19] G. Widmer and M. Kubat, Learning in the presence of concept drift and hidden contexts, Machine Learning, 23 (1996), 69-101.  doi: 10.1007/BF00116900.
    [20] X. Zhang, C. Furtlehner and M. Sebag, Data streaming with affinity propagation, in Proc. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2008, 628–643. doi: 10.1007/978-3-540-87481-2_41.
    [21] A. ZhouF. CaoW. Qian and C. Jin, Tracking clusters in evolving data streams over sliding windows, Knowledge and Information Systems, 15 (2008), 181-214.  doi: 10.1007/s10115-007-0070-x.
  • 加载中




Article Metrics

HTML views(931) PDF downloads(229) Cited by(0)

Access History

Other Articles By Authors



    DownLoad:  Full-Size Img  PowerPoint