Agenda

23 Nov 2022 14:30

Antonietta Mira - Bayesian identifications of data intrinsic dimensions

Meeting Room 1, Campus Economico San Giobbe

Antonietta Mira (Professor, Università della Svizzera italiana) - Bayesian identifications of data intrinsic dimensions

Abstract

With the advent of Big Data, it is increasingly common to deal with cases where data are defined in a high-dimensional space and little is known a priori about their distribution. Quite often, however, the data distribution has support on a subspace (manifold) whose dimension, called the intrinsic dimension (ID) of the data, is much lower than that of the embedding space. Under very weak assumptions on the data distribution, the k-nearest-neighbor distances in the data follow distributions which depend parametrically on the ID. This fact was leveraged by Facco et al. (Scientific Reports, 2017) to provide an ID estimate (TWO-NN) based on the ratio between the distances to the first and second nearest neighbors of each point in the data. We extended TWO-NN to the case where the ID is not constant within the data, i.e., the distribution has support on the union of several manifolds with different IDs. This situation may trivially occur if data sets with different IDs are merged but, as we show, it also occurs quite naturally in several data sets from diverse disciplines. In this case, the nearest-neighbor distances follow a simple mixture distribution, and within a Bayesian framework we can robustly estimate the IDs of the manifolds and assign each data point to one of the manifolds. In many real-world data sets we find widely heterogeneous dimensions, corresponding to variation in core properties: folded vs unfolded configurations in a protein molecular dynamics trajectory, active vs non-active regions in brain imaging data, and firms with different financial risk in company balance sheets.
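
As a rough illustration of the single-ID case mentioned in the abstract, the sketch below estimates the intrinsic dimension from the ratio mu = r2/r1 of each point's second and first nearest-neighbor distances. It assumes the standard TWO-NN model in which mu follows a Pareto distribution with shape parameter equal to the ID, and uses the resulting maximum-likelihood estimate n / sum(log mu). The function name two_nn_id and the toy data are hypothetical, the fitting procedure of the original paper is not reproduced here, and the Bayesian mixture extension presented in the talk is not covered.

```python
import heapq
import math
import random


def two_nn_id(points):
    """Maximum-likelihood TWO-NN estimate of the intrinsic dimension.

    Illustrative sketch only. Assumes at least three points and no
    exact duplicates, so that the nearest-neighbor distance r1 > 0.
    """
    log_mu_sum = 0.0
    for i, p in enumerate(points):
        # Two smallest distances from p to any other point: r1 <= r2.
        r1, r2 = heapq.nsmallest(
            2, (math.dist(p, q) for j, q in enumerate(points) if j != i)
        )
        log_mu_sum += math.log(r2 / r1)
    # If mu = r2 / r1 is Pareto with shape d, the MLE of d is n / sum(log mu).
    return len(points) / log_mu_sum


# Toy data (an assumption for illustration): a 2-D sheet embedded in 5-D
# space, with three coordinates held constant.
random.seed(0)
sheet = [(random.random(), random.random(), 0.0, 0.0, 0.0) for _ in range(500)]
print(two_nn_id(sheet))  # should come out near 2, not near 5
```

The brute-force neighbor search costs O(n^2) distance evaluations, which is acceptable for a few hundred points; larger data sets would call for a spatial index.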
 

Language

The event will be held in English

Organized by

Dipartimento di Economia
