# American Institute of Mathematical Sciences

January  2022, 18(1): 411-426. doi: 10.3934/jimo.2020160

## The approximation algorithm based on seeding method for functional $k$-means problem†

1. School of Mathematics and Statistics, Shandong Normal University, Jinan 250014, China
2. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen 518055, China
3. Department of Operations Research and Information Engineering, Beijing University of Technology, Beijing 100124, China
4. School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, China

* Corresponding author: Dongmei Zhang

A preliminary version of this article appeared in Proceedings of COCOON 2019, pp. 387-396.

Received March 2020 · Revised July 2020 · Early access November 2020 · Published January 2022

Unlike the classical $k$-means problem, the functional $k$-means problem clusters dynamic data generated by continuous processes. In this paper, we design an $O(\ln k)$-approximation algorithm based on the seeding method for the functional $k$-means problem. Moreover, numerical experiments show that this algorithm is more efficient than the functional $k$-means clustering algorithm.
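The seeding method referred to here is, in spirit, $k$-means++-style $D^2$-sampling adapted to functional samples: pick the first center uniformly at random, then pick each subsequent center with probability proportional to its squared distance to the nearest center chosen so far. The following is a minimal sketch under simplifying assumptions — each $d$-dimensional functional sample is discretized on a common time grid, and the distance is the plain squared $L_2$ distance (the distance actually used in the paper also incorporates derivative information, following [15]):

```python
import numpy as np

rng = np.random.default_rng(0)

def dist_sq(x, y, t):
    """Squared L2 distance between two discretized functional samples
    x, y of shape (d, len(t)), integrated over t by the trapezoid rule."""
    integrand = np.sum((x - y) ** 2, axis=0)
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))

def seed(samples, k, t):
    """D^2-sampling: the first center is uniform at random; each further
    center is drawn with probability proportional to its squared distance
    to the closest already-chosen center."""
    centers = [samples[rng.integers(len(samples))]]
    for _ in range(k - 1):
        d2 = np.array([min(dist_sq(x, c, t) for c in centers)
                       for x in samples])
        centers.append(samples[rng.choice(len(samples), p=d2 / d2.sum())])
    return centers
```

The $O(\ln k)$ guarantee of the paper concerns the expected potential of the centers returned by this sampling step, before any Lloyd-type refinement.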

Citation: Min Li, Yishui Wang, Dachuan Xu, Dongmei Zhang. The approximation algorithm based on seeding method for functional $k$-means problem. Journal of Industrial and Management Optimization, 2022, 18 (1) : 411-426. doi: 10.3934/jimo.2020160
##### References:
[1] C. Abraham, P. A. Cornillon, E. Matzner-Løber and N. Molinari, Unsupervised curve clustering using B-splines, Scandinavian Journal of Statistics, 30 (2003), 581-595. doi: 10.1111/1467-9469.00350.
[2] S. Ahmadian, A. Norouzi-Fard, O. Svensson and J. Ward, Better guarantees for $k$-means and Euclidean $k$-median by primal-dual algorithms, SIAM Journal on Computing, (2019), FOCS17-97–FOCS17-156. doi: 10.1137/18M1171321.
[3] D. Aloise, A. Deshpande, P. Hansen and P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Machine Learning, 75 (2009), 245-248. doi: 10.1007/s10994-009-5103-0.
[4] D. Arthur and S. Vassilvitskii, $K$-means++: The advantages of careful seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, (2007), 1027-1035.
[5] M. Boullé, Functional data clustering via piecewise constant nonparametric density estimation, Pattern Recognition, 45 (2012), 4389-4401.
[6] C. Bouveyron and C. Brunet-Saumard, Model-based clustering of high-dimensional data: A review, Computational Statistics & Data Analysis, 71 (2014), 52-78. doi: 10.1016/j.csda.2012.12.008.
[7] R. Gamasaee and M. Zarandi, A new Dirichlet process for mining dynamic patterns in functional data, Information Sciences, 405 (2017), 55-80. doi: 10.1016/j.ins.2017.04.008.
[8] S. Har-Peled and B. Sadri, How fast is the $k$-means method?, Algorithmica, 41 (2005), 185-202. doi: 10.1007/s00453-004-1127-9.
[9] J. Jacques and C. Preda, Functional data clustering: A survey, Advances in Data Analysis and Classification, 8 (2014), 231-255. doi: 10.1007/s11634-013-0158-y.
[10] S. Ji, D. Xu, L. Guo, M. Li and D. Zhang, The seeding algorithm for spherical $k$-means clustering with penalties, Journal of Combinatorial Optimization, 2020. doi: 10.1007/s10878-020-00569-1.
[11] M. Kayano, K. Dozono and S. Konishi, Functional cluster analysis via orthonormalized Gaussian basis expansions and its application, Journal of Classification, 27 (2010), 211-230. doi: 10.1007/s00357-010-9054-8.
[12] M. Li, The bi-criteria seeding algorithms for two variants of $k$-means problem, Journal of Combinatorial Optimization, 2020. doi: 10.1007/s10878-020-00537-9.
[13] M. Li, D. Xu, J. Yue, D. Zhang and P. Zhang, The seeding algorithm for $k$-means problem with penalties, Journal of Combinatorial Optimization, 39 (2020), 15-32. doi: 10.1007/s10878-019-00450-w.
[14] S. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory, 28 (1982), 129-137. doi: 10.1109/TIT.1982.1056489.
[15] Y. Meng, J. Liang, F. Cao and Y. He, A new distance with derivative information for functional $k$-means clustering algorithm, Information Sciences, 463/464 (2018), 166-185. doi: 10.1016/j.ins.2018.06.035.
[16] R. Ostrovsky, Y. Rabani, L. Schulman and C. Swamy, The effectiveness of Lloyd-type methods for the $k$-means problem, Journal of the ACM, 59 (2012), 28:1–28:22. doi: 10.1145/2395116.2395117.
[17] G. Ozturk and M. Ciftci, Clustering based polyhedral conic functions algorithm in classification, Journal of Industrial & Management Optimization, 11 (2015), 921-932. doi: 10.3934/jimo.2015.11.921.
[18] J. Park and J. Ahn, Clustering multivariate functional data with phase variation, Biometrics, 73 (2017), 324-333. doi: 10.1111/biom.12546.
[19] J. Peng and H. G. Müller, Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions, The Annals of Applied Statistics, 2 (2008), 1056-1077. doi: 10.1214/08-AOAS172.
[20] C. Preda, G. Saporta and C. Lévéder, PLS classification of functional data, Computational Statistics, 22 (2007), 223-235. doi: 10.1007/s00180-007-0041-4.
[21] T. Tarpey and K. K. Kinateder, Clustering functional data, Journal of Classification, 20 (2003), 93-114. doi: 10.1007/s00357-003-0007-3.
[22] D. Wei, A constant-factor bi-criteria approximation guarantee for $k$-means++, Proceedings of the Thirtieth International Conference on Neural Information Processing Systems, (2016), 604-612.
[23] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z. H. Zhou, M. Steinbach, D. J. Hand and D. Steinberg, Top 10 algorithms in data mining, Knowledge and Information Systems, 14 (2008), 1-37. doi: 10.1007/s10115-007-0114-2.

##### Notation Index

| Notation | Meaning |
| --- | --- |
| $x_i(t)$ or $x_i^j(t)$ | functional curve |
| $X(t)$, $X^i(t)$, or $C^i(t)$ | $d$-dimensional functional sample with functional curves as its components |
| $\mathfrak{F}^d(t)$ | set of $d$-dimensional functional samples |
| $\Gamma(t)$, $\Delta(t)$, or $C(t)$ | subset of $\mathfrak{F}^d(t)$ |
| $\mu(\Gamma(t))$ | center of mass of the functional samples in $\Gamma(t)$ |
| $d(X^i(t), X^j(t))$ | distance between the functional samples $X^i(t)$ and $X^j(t)$ |
| $d(X^i(t), \Gamma(t))$ | distance from the functional sample $X^i(t)$ to the subset of functional samples $\Gamma(t)$ |
| $X(t)_{\Gamma(t)}$ | the functional sample in $\Gamma(t)$ closest to $X(t)$ |
| $\Phi(\Gamma(t), C(t))$ | potential function of $\Gamma(t)$ over $C(t)$ |
| $\Gamma_{C(t)}^i(t)$ | the $i$-th cluster of $\Gamma(t)$ with respect to $C(t)$ |
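Two of these notations, the center of mass $\mu(\Gamma(t))$ and the potential $\Phi(\Gamma(t), C(t))$, can be made concrete in code. The sketch below assumes each functional sample is discretized on a common time grid and measures distance by the squared $L_2$ norm via the trapezoid rule; the paper's actual distance (from [15]) also involves derivative terms:

```python
import numpy as np

def potential(samples, centers, t):
    """Phi(Gamma(t), C(t)): sum over all samples of the squared distance
    to the closest center in C(t)."""
    def dist_sq(x, y):
        g = np.sum((x - y) ** 2, axis=0)          # pointwise squared gap
        return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(t)))
    return sum(min(dist_sq(x, c) for c in centers) for x in samples)

def center_of_mass(cluster):
    """mu(Gamma(t)): the pointwise mean of the functional samples,
    which minimizes the within-cluster potential for this distance."""
    return np.mean(cluster, axis=0)
```

With these two operations, a Lloyd-type refinement alternates between assigning each sample to its closest center and replacing each center by the center of mass of its cluster $\Gamma_{C(t)}^i(t)$.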
Comparison of Algorithm 1 and the functional $k$-means algorithm in [15]
| Data Set | Method | ARI | DBI | Initial Cost | Returned Cost | Time (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Simudata | SeedAlg | 0.7919 | 0.8652 | 3338 | 1689 | 58 |
| Simudata | FuncAlg | 0.7975 | 0.8656 | 5465 | 1689 | 61 |
| Sdata | SeedAlg | 0.6651 | 1.1385 | 1667 | 712 | 192 |
| Sdata | FuncAlg | 0.6561 | 1.1384 | 2485 | 720 | 217 |

