Article Contents
Article Contents

Support vector machines and Radon's theorem

• *Corresponding author: Brittany Story

• The foundational work for this paper was completed while Elin Farnell was a research scientist in the Department of Mathematics at Colorado State University.
° The foundational work for this paper was completed while Brittany Story was a graduate student in the Department of Mathematics at Colorado State University.

• A support vector machine (SVM) is an algorithm that finds a hyperplane which optimally separates labeled data points in $\mathbb{R}^n$ into positive and negative classes. The data points on the margin of this separating hyperplane are called support vectors. We connect the possible configurations of support vectors to Radon's theorem, which provides guarantees for when a set of points can be divided into two classes (positive and negative) whose convex hulls intersect. If the convex hulls of the positive and negative support vectors are projected onto a separating hyperplane, then the projections intersect if and only if the hyperplane is optimal. Further, with a particular type of general position, we show that (a) the projected convex hulls of the support vectors intersect in exactly one point, (b) the support vectors are stable under perturbation, (c) there are at most $n+1$ support vectors, and (d) every number of support vectors from 2 up to $n+1$ is possible. Finally, we perform computer simulations studying the expected number of support vectors, and their configurations, for randomly generated data. We observe that as the distance between classes of points increases for this type of randomly generated data, configurations with fewer support vectors become more likely.

Mathematics Subject Classification: Primary: 52C35, 62-07; Secondary: 62R40.

 Citation:

• Figure 1.  Linearly separable two-class data, along with a linear classifier (a separating hyperplane) with the maximal margin of separation

Figure 2.  Decision boundary hyperplane $\mathcal{P}$ along with a vector $\mathbf{x}_i$ and its decomposition with respect to $\mathcal{P}$, where $\mathbf{w}$ and $\mathbf{x}_i$ are drawn with tails at the same base point in the hyperplane for context

Figure 3.  Examples of two sets in $\mathbb{R}^2$ of size 4 that cannot be shattered by affine separators. Consider the labeled classes drawn as red squares and blue circles. In each example, the convex hulls of the two classes intersect at the origin. This implies that no affine separator can correctly classify according to these labels

Figure 4.  Two Radon configurations in $\mathbb{R}^2$

Figure 5.  Four possible support vector configurations in $\mathbb{R}^3$ (the figures with planes), each drawn above its corresponding projected Radon configuration in the separating hyperplane (the figures with just lines and points). As we will see in Section 5, when points are in strong general position, there can be at most $n+1$ support vectors. Thus in $\mathbb{R}^3$, we can have configurations with 2, 3, or 4 support vectors. See Table 2 in Section 6 for relative frequencies of these configurations

Figure 6.  The same dataset with two different separating hyperplanes. Note that (A) does not contain a Radon point in the projection of the convex hulls whereas (B) does. The margin of separation is larger in (B) than in (A)

Figure 7.  Two examples of point sets $X\subseteq \mathbb{R}^n$ that are not in strong general position. (The point sets are unlabeled, and the colors are for illustration.) On the left, in $\mathbb{R}^2$, the vector $\overrightarrow{\mathbf{p}_0\mathbf{p}_1}$ is parallel to $\overrightarrow{\mathbf{n}_0\mathbf{n}_1}$, violating condition $(ii)$ of strong general position. On the right, we provide an example in $\mathbb{R}^3$. If $\mathbf{w}$ is the shortest vector between $\mathbf{p}_0$ and the line between $\mathbf{n}_0$ and $\mathbf{n}_1$, then the hyperplane perpendicular to $\mathbf{w}$ that contains $\mathbf{n}_0, \mathbf{n}_1$ also contains $\mathbf{n}_2$, violating condition $(iii)$. In other words, the orthogonal projection of $\mathbf{p}_0$ onto the affine span of $\mathbf{n}_0, \mathbf{n}_1, \mathbf{n}_2$ lies on the line segment between $\mathbf{n}_0$ and $\mathbf{n}_1$, which differs from the top left in Figure 5 where instead the projection of $\mathbf{p}_0$ is in the interior of the convex hull of $\mathbf{n}_0$, $\mathbf{n}_1$, and $\mathbf{n}_2$

Figure 8.  Example images of randomly generated linearly seperable groups of points with (A) two and (B) three support vectors, both drawn from trials where $a = 10$

Figure 9.  A 2-simplex embedded in the 2-dimensional subspace $x_1+x_2+x_3 = 1$ of $\mathbb{R}^3$ with support vectors assigned to the vertices

Table 1.  The number of support vectors in $\mathbb{R}^2$ from 1, 000 trials, as a function of a parameter $a$ controlling the expected distance between the center of the distributions

 $a =$ 5 10 20 # configurations with 2 support vectors 417 632 809 # configurations with 3 support vectors 583 368 191

Table 2.  The number of support vectors in $\mathbb{R}^3$ from 1, 000 trials, as a function of a parameter $a$ controlling the expected distance between the center of the distributions

 $a =$ 5 10 20 # configurations with 2 support vectors 279 554 758 # configurations with 3 support vectors 458 367 221 # configurations with 4 support vectors (2–2 split) 145 47 9 # configurations with 4 support vectors (3–1 split) 118 32 12
•  [1] E. G. Bajmóczy and I. Bárány, On a common generalization of Borsuk's and Radon's theorem, Acta Mathematica Academiae Scientiarum Hungarica, 34 (1979), 347-350.  doi: 10.1007/BF01896131. [2] I. Bárány and P. Soberón, Tverberg's theorem is 50 years old: A survey, Bulletin of the American Mathematical Society, 55 (2018), 459-492.  doi: 10.1090/bull/1634. [3] L. Blum, F. Cucker, M. Shub and S. Smale, Complexity and Real Computation, Springer Science & Business Media, 1998. doi: 10.1007/978-1-4612-0701-6. [4] J. Cervantes, F. Garcia-Lamont, L. Rodríguez-Mazahua and A. Lopez, A comprehensive survey on support vector machine classification: Applications, challenges and trends, Neurocomputing, 408 (2020), 189-215.  doi: 10.1016/j.neucom.2019.10.118. [5] K. L. Clarkson, D. Eppstein, G. L. Miller, C. Sturtivant and S.-H. Teng, Approximating center points with iterative Radon points, International Journal of Computational Geometry & Applications, 6 (1996), 357-377.  doi: 10.1142/S021819599600023X. [6] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000. doi: 10.1017/CBO9780511801389. [7] J. A. De Loera and T. Hogan, Stochastic Tverberg theorems with applications in multiclass logistic regression, separability, and centerpoints of data, SIAM Journal on Mathematics of Data Science, 2 (2020), 1151-1166.  doi: 10.1137/19M1277102. [8] K.-B. Duan and S. S. Keerthi, Which is the best multiclass SVM method? An empirical study, In International Workshop on Multiple Classifier Systems, Springer, 2005,278-285. doi: 10.1007/11494683_28. [9] D. S. Dummit and R. M. Foote, Abstract Algebra, Wiley, 2004. [10] M. Fink, J. Hershberger, N. Kumar and S. Suri, Separability and convexity of probabilistic point sets. [11] W. J. Firey, An integral-geometric meaning for lower order area functions of convex bodies, Mathematika, 19 (1972), 205-212.  doi: 10.1112/S0025579300005660. [12] W. J. Firey, Kinematic measures for sets of support figures, Mathematika, 21 (1974), 270-281.  doi: 10.1112/S0025579300008664. [13] T. Hofmann, B. Schölkopf and A. J. Smola, Kernel methods in machine learning, The Annals of Statistics, 2008, 1171-1220. doi: 10.1214/009053607000000677. [14] C.-W. Hsu and C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks, 13 (2002), 415-425.  doi: 10.1109/72.991427. [15] V. Kecman, Support vector machines–An introduction, In Support Vector Machines: Theory and Applications, Springer, 2005, 1-47. doi: 10.1007/10984697_1. [16] Y. Ma and G. Guo, Support Vector Machines Applications, Springer Publishing Company, Incorporated, 2014. doi: 10.1007/978-3-319-02300-7. [17] J. Matoušek, Lectures on Discrete Geometry, Springer-Verlag, Berlin, Heidelberg, 2002. doi: 10.1007/978-1-4613-0039-7. [18] P. McMullen, A dice probability problem, Mathematika, 21 (1974), 193-198.  doi: 10.1112/S0025579300008573. [19] K.-R. Muller, S. Mika, G. Ratsch, K. Tsuda and B. Scholkopf, An introduction to kernel-based learning algorithms, IEEE Transactions on Neural Networks, 12 (2001), 181-201.  doi: 10.1109/72.914517. [20] B. B. Peterson, The geometry of Radon's theorem, American Mathematical Monthly, 79 (1972), 949-963.  doi: 10.1080/00029890.1972.11993165. [21] J. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, Technical Report, Microsoft, April 1998. [22] J. Radon, Mengen konvexer körper, die einen gemeinsamen punkt enthalten, Mathematische Annalen, 83 (1921), 113-115.  doi: 10.1007/BF01464231. [23] G. E. Sakr and I. H. Elhajj, VC-based confidence and credibility for support vector machines, Soft Computing, 20 (2016), 133-147.  doi: 10.1007/s00500-014-1485-4. [24] R. Schneider and W. Weil, Stochastic and Integral Geometry, Springer, 2008. doi: 10.1007/978-3-540-78859-1. [25] A. Smola and S. Vishwanathan, Introduction to machine learning, Cambridge University, UK, 32 (2008), 2008. [26] A. J. Smola and B. Schölkopf, A tutorial on support vector regression, Statistics and Computing, 14 (2004), 199-222.  doi: 10.1023/B:STCO.0000035301.49549.88. [27] H. Tverberg, A generalization of Radon's theorem, Journal of the London Mathematical Society, 1 (1966), 123-128.  doi: 10.1112/jlms/s1-41.1.123. [28] V. Vapnik and A. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability & Its Applications, 16 (1971), 264-280. [29] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer Science & Business Media, 2013. doi: 10.1007/978-1-4757-2440-0. [30] C.-C. Yao, Utilizing ellipsoid on support vector machines, In 2008 International Conference on Machine Learning and Cybernetics, IEEE, 6 (2008), 3373-3378.

Figures(9)

Tables(2)