Description
In any large dataset, most of the features defining a data point are redundant, irrelevant, or dominated by noise, and have to be discarded. Doing so requires answering several questions: What is the optimal dimensionality of a reduced feature space for retaining maximum information? How can one correct for different units of measure? What is the optimal relative scaling of feature importance? We use a statistical method, the Information Imbalance, to select the most informative feature sets among many candidates. In an example from the Amazon rainforest, we identify sets of biotic and abiotic features that predict tree biodiversity and species richness, and we compare common biodiversity estimators by their information content. The differentiable version of this statistic can automatically weight features relative to each other, accounting both for units of measure and for relative importance. Other use cases include variable selection in molecular dynamics simulations, in clinical data sets, and for neural network potentials.
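
To make the core statistic concrete: given two representations A and B of the same N points, the Information Imbalance Delta(A -> B) is, up to normalization, the average rank in B of each point's nearest neighbor in A. It approaches 0 when A fully predicts the neighborhoods of B and 1 when A carries no information about B. The following is a minimal NumPy sketch following the published definition of the statistic; the function name and the toy data are illustrative assumptions, not material from the talk:

    import numpy as np
    from scipy.spatial.distance import cdist

    def information_imbalance(X_a, X_b):
        """Delta(A -> B): how well nearest neighbors in space A predict
        neighborhoods in space B (close to 0: fully; close to 1: not at all)."""
        n = X_a.shape[0]
        d_a = cdist(X_a, X_a)
        d_b = cdist(X_b, X_b)
        np.fill_diagonal(d_a, np.inf)  # exclude self-matches
        np.fill_diagonal(d_b, np.inf)
        nn_a = np.argmin(d_a, axis=1)  # each point's nearest neighbor in A
        # rank of every point in every row's distance ordering in B (1 = nearest)
        ranks_b = d_b.argsort(axis=1).argsort(axis=1) + 1
        # average, over all points, of the rank in B of the A-neighbor
        return 2.0 * np.mean(ranks_b[np.arange(n), nn_a]) / n

    # toy check: a feature set vs. the same set buried among noise features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 3))
    X_noisy = np.hstack([X, rng.normal(size=(300, 7))])
    print(information_imbalance(X, X_noisy), information_imbalance(X_noisy, X))

Computing the imbalance in both directions is what makes the comparison informative: two feature sets that yield two small values are equivalent, while a small value in one direction and a large one in the other indicates that the first set predicts the second but not vice versa.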