# American Institute of Mathematical Sciences

eISSN: 2639-8001

## Foundations of Data Science

June 2022 , Volume 4 , Issue 2


2022, 4(2): 165-216 doi: 10.3934/fods.2022002
Abstract:

We establish a new theory which unifies various aspects of topological approaches for data science, by being applicable both to point cloud data and to graph data, including networks beyond pairwise interactions. We generalize simplicial complexes and hypergraphs to super-hypergraphs and establish super-hypergraph homology as an extension of simplicial homology. Driven by applications, we also introduce super-persistent homology.

2022, 4(2): 217-242 doi: 10.3934/fods.2022004
Abstract:

We study the deformation of the input space by a trained autoencoder via the Jacobians of the trained weight matrices. In doing so, we prove bounds for the mean squared errors for points in the input space, under assumptions regarding the orthogonality of the eigenvectors. We also show that the trace and the product of the eigenvalues of the Jacobian matrices are good predictors of the mean squared errors on test points. This is a dataset-independent means of testing an autoencoder's ability to generalize to new inputs. Namely, no knowledge of the dataset on which the network was trained is needed, only the parameters of the trained model.

2022, 4(2): 243-269 doi: 10.3934/fods.2022005
Abstract:

We introduce a novel method for Additive Noise Analysis for Persistence Thresholding (ANAPT) which separates significant features in the sublevel set persistence diagram of a time series based on a statistical analysis of the persistence of a noise distribution. Specifically, we consider an additive noise model and leverage the statistical analysis to provide a noise cutoff or confidence interval in the persistence diagram for the observed time series. This analysis is done for several common noise models including Gaussian, uniform, exponential, and Rayleigh distributions. ANAPT is computationally efficient, does not require any signal pre-filtering, is widely applicable, and has open-source software available. We demonstrate the functionality of ANAPT with both numerically simulated examples and an experimental data set. Additionally, we provide an efficient $\Theta(n\log(n))$ algorithm for calculating the zero-dimensional sublevel set persistent homology.
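To make the last point concrete, here is a minimal sketch of one standard way to compute zero-dimensional sublevel set persistence for a 1-D time series: sort the samples by value, then merge neighboring components with a union-find structure, letting the younger component die at each merge (the elder rule). This is not taken from the paper and may differ from the authors' algorithm; the function name `sublevel_persistence_0d` is hypothetical.

```python
import math


def sublevel_persistence_0d(values):
    """Return sorted (birth, death) pairs for a 1-D series (illustrative sketch)."""
    n = len(values)
    if n == 0:
        return []
    # Sorting dominates the cost: O(n log n).
    order = sorted(range(n), key=lambda i: values[i])
    parent = [-1] * n   # -1 marks a sample not yet in the sublevel set
    birth = [0.0] * n   # birth value, stored at each component root

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in order:
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at this merge.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:
                    # Skip zero-persistence pairs.
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    # The component of the global minimum never dies.
    pairs.append((values[order[0]], math.inf))
    return sorted(pairs)


diagram = sublevel_persistence_0d([0, 2, 1, 3, 0.5])
```

In this sketch, each local minimum gives birth to a component and each local maximum merges two of them, so short-lived pairs correspond to small wiggles that a noise cutoff would aim to discard.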

2022, 4(2): 271-298 doi: 10.3934/fods.2022007
Abstract:

Nowadays, neural networks are widely used in many applications as artificial intelligence models for learning tasks. Since neural networks typically process a very large amount of data, it is convenient to formulate them within mean-field and kinetic theory. In this work we focus on a particular class of neural networks, the residual neural networks, assuming that each layer is characterized by the same number of neurons $N$, which is fixed by the dimension of the data. This assumption allows one to interpret the residual neural network as a time-discretized ordinary differential equation, in analogy with neural differential equations. The mean-field description is then obtained in the limit of infinitely many input data. This leads to a Vlasov-type partial differential equation which describes the evolution of the distribution of the input data. We analyze steady states and sensitivity with respect to the parameters of the network, namely the weights and the bias. In the simple setting of a linear activation function and one-dimensional input data, the study of the moments provides insights into the choice of the parameters of the network. Furthermore, a modification of the microscopic dynamics, inspired by stochastic residual neural networks, leads to a Fokker-Planck formulation of the network, in which the concept of network training is replaced by the task of fitting distributions. The performed analysis is validated by artificial numerical simulations. In particular, results on classification and regression problems are presented.
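The reading of a residual network as a time-discretized ordinary differential equation can be illustrated with a small sketch. It assumes the commonly used update $x_{k+1} = x_k + h\,\sigma(w x_k + b)$ with step size $h$, scalar weight $w$ and bias $b$, which is a forward Euler step for $\dot{x} = \sigma(w x + b)$; the exact form used in the paper may differ, and the names `resnet_forward` and `mean` are hypothetical.

```python
def resnet_forward(samples, weights, biases, h, activation=lambda z: z):
    """Push 1-D samples through residual layers x <- x + h * activation(w * x + b)."""
    for w, b in zip(weights, biases):
        # One residual layer equals one forward Euler step of size h.
        samples = [x + h * activation(w * x + b) for x in samples]
    return samples


def mean(samples):
    """First moment of the empirical distribution of the samples."""
    return sum(samples) / len(samples)


# Linear (identity) activation, one-dimensional data, 10 identical layers.
inputs = [0.0, 2.0]
outputs = resnet_forward(inputs, weights=[-1.0] * 10, biases=[0.0] * 10, h=0.1)
output_mean = mean(outputs)
```

Under these assumptions the first moment obeys the same recursion as each sample, $m_{k+1} = (1 + h w)\,m_k + h b$, so with identity activation the choice of $w$ and $b$ alone determines whether the mean of the data grows, decays, or settles.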

2022, 4(2): 299-322 doi: 10.3934/fods.2022008
Abstract:

We consider the problem of static Bayesian inference for partially observed Lévy-process models. We develop a methodology which allows one to infer static parameters and some states of the process, without a bias from the time-discretization of the aforementioned Lévy process. The unbiased method is exceptionally amenable to parallel implementation and can be computationally efficient relative to competing approaches. We implement the method on S&P 500 daily log-return data and compare it to a Markov chain Monte Carlo (MCMC) algorithm.