# American Institute of Mathematical Sciences

eISSN:
2639-8001

All Issues

## Foundations of Data Science

March 2021 , Volume 3 , Issue 1

Select all articles

Export/Reference:

2021, 3(1): 1-26 doi: 10.3934/fods.2021002 +[Abstract](1474) +[HTML](612) +[PDF](3603.92KB)
Abstract:

In prior work [4], Anderson et al. introduced a new problem, the rankability problem, which refers to a dataset's inherent ability to produce a meaningful ranking of its items. Ranking is a fundamental data science task with numerous applications that include web search, data mining, cybersecurity, machine learning, and statistical learning theory. Yet little attention has been paid to the question of whether a dataset is suitable for ranking. As a result, when a ranking method is applied to a dataset with low rankability, the resulting ranking may not be reliable.

Rankability paper [4] and its methods studied unweighted data for which the dominance relations are binary, i.e., an item either dominates or is dominated by another item. In this paper, we extend rankability methods to weighted data for which an item may dominate another by any finite amount. We present combinatorial approaches to a weighted rankability measure and apply our new measure to several weighted datasets.

2021, 3(1): 27-47 doi: 10.3934/fods.2021004 +[Abstract](1329) +[HTML](436) +[PDF](656.57KB)
Abstract:

This paper considers a new approach to using Markov chain Monte Carlo (MCMC) in contexts where one may adopt multilevel (ML) Monte Carlo. The underlying problem is to approximate expectations w.r.t. an underlying probability measure that is associated to a continuum problem, such as a continuous-time stochastic process. It is then assumed that the associated probability measure can only be used (e.g. sampled) under a discretized approximation. In such scenarios, it is known that to achieve a target error, the computational effort can be reduced when using MLMC relative to i.i.d. sampling from the most accurate discretized probability. The ideas rely upon introducing hierarchies of the discretizations where less accurate approximations cost less to compute, and using an appropriate collapsing sum expression for the target expectation. If a suitable coupling of the probability measures in the hierarchy is achieved, then a reduction in cost is possible. This article focused on the case where exact sampling from such coupling is not possible. We show that one can construct suitably coupled MCMC kernels when given only access to MCMC kernels which are invariant with respect to each discretized probability measure. We prove, under verifiable assumptions, that this coupled MCMC approach in a ML context can reduce the cost to achieve a given error, relative to exact sampling. Our approach is illustrated on a numerical example.

2021, 3(1): 49-66 doi: 10.3934/fods.2021005 +[Abstract](1144) +[HTML](436) +[PDF](1102.91KB)
Abstract:

We propose two related unsupervised clustering algorithms which, for input, take data assumed to be sampled from a uniform distribution supported on a metric space \begin{document}$X$\end{document}, and output a clustering of the data based on the selection of a topological model for the connected components of \begin{document}$X$\end{document}. Both algorithms work by selecting a graph on the samples from a natural one-parameter family of graphs, using a geometric criterion in the first case and an information theoretic criterion in the second. The estimated connected components of \begin{document}$X$\end{document} are identified with the kernel of the associated graph Laplacian, which allows the algorithm to work without requiring the number of expected clusters or other auxiliary data as input.

2021, 3(1): 67-97 doi: 10.3934/fods.2021006 +[Abstract](1360) +[HTML](441) +[PDF](4748.55KB)
Abstract:

Persistent homology (PH) is one of the most popular tools in topological data analysis (TDA), while graph theory has had a significant impact on data science. Our earlier work introduced the persistent spectral graph (PSG) theory as a unified multiscale paradigm to encompass TDA and geometric analysis. In PSG theory, families of persistent Laplacian matrices (PLMs) corresponding to various topological dimensions are constructed via a filtration to sample a given dataset at multiple scales. The harmonic spectra from the null spaces of PLMs offer the same topological invariants, namely persistent Betti numbers, at various dimensions as those provided by PH, while the non-harmonic spectra of PLMs give rise to additional geometric analysis of the shape of the data. In this work, we develop an open-source software package, called highly efficient robust multidimensional evolutionary spectra (HERMES), to enable broad applications of PSGs in science, engineering, and technology. To ensure the reliability and robustness of HERMES, we have validated the software with simple geometric shapes and complex datasets from three-dimensional (3D) protein structures. We found that the smallest non-zero eigenvalues are very sensitive to data abnormality.