
27 Apr 2026 12:15

Data Thinning and Beyond

Aula EPSILON 1 - Edificio EPSILON | Campus Scientifico

Speaker:
Daniela M. Witten, University of Washington

Abstract:
Contemporary data analysis pipelines often involve the use and reuse of data. For instance, a scientist may explore a dataset to select an interesting hypothesis, and then wish to test this hypothesis with the same data. From a statistical perspective, this double use of data is highly problematic: it induces dependence between the hypothesis generation and testing stages, which complicates inference. Failure to account for this dependence renders classical inference techniques invalid. I will present "data thinning", a set of strategies for obtaining independent training and test sets so that the former can be used to select a hypothesis, and the latter to test it. Data thinning enables valid selective inference in settings for which no solutions were previously available. However, it is also restrictive, in the sense that it requires strong distributional assumptions. Therefore, I will also present two strategies inspired by data thinning that enable valid post-selection inference without such assumptions. One strategy considers thinning summary statistics of the data, rather than the data itself, in order to take advantage of asymptotic properties of the summary statistics. The second strategy involves generating training and test sets that are not independent, and then orthogonalizing the latter with respect to the former in order to conduct valid inference.
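As a concrete illustration of the idea in the abstract, here is a minimal sketch of one well-known special case: thinning a Poisson count. It assumes the observation X follows a Poisson distribution with mean lambda; drawing X_train from Binomial(X, eps) and setting X_test = X - X_train then yields independent Poisson variables with means eps * lambda and (1 - eps) * lambda. The function names, the choice eps = 0.5, and the simulated data are illustrative assumptions, not taken from the talk.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def thin_poisson(x, eps, rng):
    """Split a count x into (x_train, x_test) by binomial thinning.

    Each of the x units goes to the training part with probability eps.
    If x ~ Poisson(lam), the two parts are independent Poisson counts.
    """
    x_train = sum(rng.random() < eps for _ in range(x))
    return x_train, x - x_train


def correlation(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for reproducibility only
    counts = [sample_poisson(10.0, rng) for _ in range(20000)]
    parts = [thin_poisson(x, 0.5, rng) for x in counts]
    train = [p[0] for p in parts]
    test = [p[1] for p in parts]
    print("mean of training part:", sum(train) / len(train))
    print("train/test correlation:", correlation(train, test))
```

In a simulation like this, the training and test parts should each average about half the original mean and show a sample correlation close to zero, which is what allows one part to be used for selecting a hypothesis and the other for testing it. This relies on the Poisson assumption, in line with the abstract's remark that data thinning requires strong distributional assumptions.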

Bio sketch:
Daniela Witten is a professor of Statistics and Biostatistics at the University of Washington, and the Dorothy Gilford Endowed Chair in Mathematical Statistics. She develops statistical machine learning methods for high-dimensional data, with a focus on unsupervised learning. Daniela is the recipient of an NIH Director's Early Independence Award, a Sloan Research Fellowship, an NSF CAREER Award, and a Simons Investigator Award in Mathematical Modeling of Living Systems. She received the Presidents’ Award from the Committee of Presidents of Statistical Societies (COPSS), awarded annually to a statistician under age 41 in recognition of outstanding contributions to the field of statistics. She also received the Spiegelman Award from the American Public Health Association, given to a statistician under age 40 who has made outstanding contributions to statistics for public health, and the Leo Breiman Award for contributions to the field of statistical machine learning. She is a Fellow of the American Statistical Association and the Institute of Mathematical Statistics, and an Elected Member of the International Statistical Institute. Daniela is a co-author (with Gareth James, Trevor Hastie, and Rob Tibshirani) of the widely used textbook "An Introduction to Statistical Learning". She has served as an Associate Editor for Biometrika, Journal of Computational and Graphical Statistics, and Journal of the American Statistical Association, and as an Action Editor for Journal of Machine Learning Research. Since 2023, she has served as Joint Editor of Journal of the Royal Statistical Society, Series B. Daniela completed a BS in Math and Biology with Honors and Distinction at Stanford University in 2005, and a PhD in Statistics at Stanford University in 2010.

Language

The event will be held in English

Organizer

Gruppo Statistica

