Methodology and applications of statistics (STREAM)
STatistics REsearch: Applications and Methods

The research group works on the development of statistical methods and on their application across a variety of disciplines.
Its methodological research focuses on Bayesian and frequentist inference, in both parametric and non-parametric settings. Specific methodological contributions have concerned categorical data, control charts, meta-analysis, extreme values, time series, spatial data, and sensitivity analysis.
The group also works on computational statistics and on the development of statistical software, in particular with the R language.
The group is particularly active in applying statistical methods within multidisciplinary collaborations with researchers and professors at Università Ca’ Foscari and at other research institutions. Recent applications have concerned, for example, climatology, economics, environmental sciences, epidemiology, hydrology, medicine, sociology, and sport.

Seminars

All seminars can be followed via the Zoom platform:
https://unive.zoom.us/j/85153268624?pwd=MzBhdlA2M1B2dThJQ2Y5T0EwUE5PZz09
Meeting ID: 851 5326 8624
Passcode: SanMarco2

4/11/2021 at 12:15
Philippe Naveau (Laboratoire des Sciences du Climat et de l’Environnement)
Title: Evaluation of binary classifiers for environmental extremes

Machine learning classification methods usually assume that all possible classes are sufficiently present within the training set. Due to their inherent rarity, extreme events are always under-represented, and classifiers tailored for predicting extremes need to be carefully designed to handle this under-representation. In this talk, we address the question of how to assess and compare classifiers with respect to their capacity to capture extreme occurrences. This is also related to the topic of scoring rules used in the forecasting literature. In this context, we propose and study different risk functions adapted to extremal classifiers. The inferential properties of our empirical risk estimator are derived under the framework of multivariate regular variation and hidden regular variation. As an example, we study in detail the special class of linear classifiers and show that the optimisation of our risk function leads to a consistent solution. A simulation study compares different classifiers and indicates their performance with respect to our risk functions. To conclude, we apply our framework to the analysis of extreme river discharges in the Danube river basin. The application compares different predictive algorithms and tests their capacity to forecast river discharges from other river stations. As a by-product, we identify the explanatory variables that contribute the most to extremal behaviour. If time allows, we will also discuss other climate datasets.

Joint work with Juliette Legrand (LSCE, Rennes University) and Marco Oesting (Siegen University)
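
As a toy illustration in R of the idea of judging classifiers on the tail region only (a sketch under simplified assumptions, not the risk functions of the talk): two linear classifiers are compared by their empirical misclassification risk restricted to the observations with the largest covariate norm.

set.seed(1)
n <- 1e4
x <- matrix(rt(2 * n, df = 3), ncol = 2)               # heavy-tailed covariates
y <- as.integer(x[, 1] + 0.5 * x[, 2] + rnorm(n) > 3)  # rare positive class
r <- sqrt(rowSums(x^2))
tail_idx <- r > quantile(r, 0.95)                      # keep only extreme points
risk_tail <- function(pred) mean(pred[tail_idx] != y[tail_idx])
pred_a <- as.integer(x[, 1] > 2)                       # linear classifier A
pred_b <- as.integer(x[, 1] + 0.5 * x[, 2] > 2.5)      # linear classifier B
c(A = risk_tail(pred_a), B = risk_tail(pred_b))        # tail-restricted risks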


18/11/2021 at 15:00
Philippe Naveau (Laboratoire des Sciences du Climat et de l’Environnement)
Title: Combining ensemble climate simulations to take into account multi-model error

Global climate model outputs, like any ensemble of numerical model simulations, are approximations of the true system under study, here the climate system. Different sources of variability, uncertainty and model error have to be taken into account to provide reliable estimates of climate change. In this talk, we will study basic atmospheric variables (temperatures and precipitation) simulated from more than 10 different climate models of the Coupled Model Intercomparison Project (CMIP, versions 5 and 6). As each climate model provides an ensemble of similar but still different climate trajectories, we propose and study a statistical model to efficiently combine these runs. As climate databases can be large, the proposed statistical procedure needs to be fast and easy to implement.
Under these constraints, we propose a simple statistic that, under precisely defined conditions, has the advantage of averaging imperfect model outputs without the need for model bias corrections. We also check the validity of our conditions by contrasting recorded measurements and simulations.
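
A minimal R sketch of one simple combination rule in this spirit (an illustration under an equal-weight assumption, not the statistic proposed in the talk): each model's runs are averaged first, so that every model counts once regardless of its ensemble size.

set.seed(1)
n_time <- 100
models <- lapply(1:10, function(m) {
  n_runs <- sample(3:8, 1)                        # ensemble size varies by model
  matrix(rnorm(n_runs * n_time, mean = m / 10), n_runs, n_time)  # runs x time
})
within_model_means <- t(sapply(models, colMeans))  # one mean trajectory per model
multi_model_mean <- colMeans(within_model_means)   # equal weight across models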


Past seminars

23/03/2021 at 14:00
Giada Adelfio
(Università degli Studi di Palermo)
Title: Some properties of local weighted second-order statistics for spatio-temporal point processes

Spatial, temporal, and spatio-temporal point processes, and in particular Poisson processes, are stochastic processes that are largely used to describe and model the distribution of a wealth of real phenomena.
When a model is fitted to a set of random points, observed in a given multidimensional space, diagnostic measures are necessary to assess the goodness-of-fit and to evaluate the ability of that model to describe the random point pattern behaviour. The main problem when dealing with residual analysis for point processes is to find a correct definition of residuals. Diagnostics of goodness-of-fit in the theory of point processes are often considered through the transformation of data into residuals as a result of a thinning or a rescaling procedure. We alternatively consider here second-order statistics coming from weighted measures. Motivated by Adelfio and Schoenberg (2010) for the spatial case, we consider here an extension to the spatio-temporal context in addition to focussing on local characteristics.
Then, rather than using global characteristics, we introduce local tools, considering individual contributions of a global estimator as a measure of clustering. Generally, the individual contributions to a global statistic can be used to identify outlying components measuring the influence of each contribution to the global statistic.
In particular, our proposed method assesses goodness-of-fit of spatio-temporal models by using local weighted second-order statistics, computed after weighting the contribution of each observed point by the inverse of the conditional intensity function that identifies the process.
Weighted second-order statistics apply directly to the data without assuming homogeneity or transforming the data into residuals, thus eliminating the sampling variability due to the use of a transformation procedure. We provide some characterisations and present a number of simulation studies.
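
For the purely spatial case, this inverse-intensity weighting is implemented by the Kinhom function of the spatstat R package; a minimal sketch (assuming a recent spatstat release) is:

library(spatstat)                               # assumption: spatstat 2.x or later
X <- rpoispp(function(x, y) 200 * exp(-3 * x))  # inhomogeneous Poisson pattern
fit <- ppm(X ~ x)                               # log-linear model for the intensity
Kw <- Kinhom(X, lambda = fit)                   # pairs weighted by reciprocal intensity
plot(Kw)                                        # compare with the theoretical Poisson K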


7/04/2021 at 15:00
Stefano Castruccio
(University of Notre Dame)
Title: Model- and Data-Driven Approximation of Space-Time Systems. A Tale of Two Approaches

In this talk I will discuss two different approaches to approximating space-time systems. The first one is model-driven and loosely inspired by physics: it assumes that the system is locally diffusive, described through a stochastic partial differential equation, which can be efficiently approximated with a Gaussian Markov random field. This approximation will be used to produce a stochastic generator of simulated multi-decadal global temperature, thereby offering a fast alternative to the generation of large climate model ensembles.
The second approach is instead data-driven and relies on (deep) neural networks in time. Instead of traditional machine learning methods aimed at inferring an extremely large parameter space, we rely on fast, sparse and computationally efficient echo state network dynamics on an appropriately dimension-reduced spatial field. The computational time saved is then used to produce an ensemble and probabilistically calibrate the forecast. The approach will be used to produce air pollution forecasts from a citizen science network in San Francisco and to forecast wind energy in Saudi Arabia.
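
A minimal echo state network in base R (an illustration of the general mechanism, not the speaker's implementation): the recurrent reservoir is random and fixed, and only the linear readout is trained, here by ridge regression.

set.seed(1)
n_t <- 500; d_res <- 100
u <- sin(seq(0, 20, length.out = n_t)) + rnorm(n_t, sd = 0.1)   # input series
W_in <- matrix(runif(d_res, -0.5, 0.5), d_res, 1)               # input weights
W <- matrix(rnorm(d_res^2), d_res, d_res)                       # reservoir weights
W <- W * (0.9 / max(abs(eigen(W, only.values = TRUE)$values)))  # spectral radius < 1
X <- matrix(0, n_t, d_res); state <- numeric(d_res)
for (t in 1:n_t) {
  state <- tanh(W %*% state + W_in * u[t])     # fixed, untrained recurrence
  X[t, ] <- state
}
Phi <- X[1:(n_t - 1), ]; y <- u[2:n_t]         # one-step-ahead target
beta <- solve(crossprod(Phi) + 0.01 * diag(d_res), crossprod(Phi, y))  # ridge readout
pred <- Phi %*% beta                           # point forecasts of the series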


13/04/2021 at 14:00
Alan Agresti
(University of Florida)
Title: Simple Ways to Interpret Effects in Modeling Binary and Ordinal Data

Probability-based effect measures for models for binary and ordinal response variables can be simpler to interpret than logistic (and probit) regression model parameters and their corresponding effect measures, such as odds ratios.
For describing the effect of an explanatory variable while adjusting for others in modeling a binary response, it is sometimes possible to employ the identity and log link functions to generate simple effect measures.
When such link functions are inappropriate, one can still construct analogous effect measures from a logistic regression model fit, based on average differences or ratios of the probability modeled or on average instantaneous rates of change for the probability.
Simple measures are also proposed for interpreting effects in models for ordinal responses based on applying a link function to cumulative probabilities.
The measures are also sometimes applicable with nonlinear predictors, such as in generalized additive models.
The methods are illustrated with examples and implemented with R software.

Parts of this work are joint with Maria Kateri, Claudia Tarantola, and Roberta Varriale.
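
As a hedged R sketch of one of these measures (simulated data, not taken from the talk): the average instantaneous rate of change of P(Y = 1) with respect to an explanatory variable, computed from an ordinary logistic regression fit.

set.seed(1)
n <- 500
x <- rnorm(n); z <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x + 0.4 * z))
fit <- glm(y ~ x + z, family = binomial)
p_hat <- fitted(fit)
# average of beta_x * p * (1 - p): a probability-scale summary of the x effect
ame_x <- mean(coef(fit)["x"] * p_hat * (1 - p_hat))
ame_x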


20/04/2021 at 14:00
David Firth (University of Warwick)
Title: Schedule-adjusted league tables during the football season

In this talk I will show how to construct a better football league table than the official ranking based on accumulated points to date.  The aim of this work is (only) to produce a more informative representation of how teams currently stand, based on their match results to date in the current season; it is emphatically not about prediction.  A more informative league table is one that takes proper account of "schedule strength" differences, i.e., differing numbers of matches played by each team (home and away), and differing average standings of the opponents that each team has faced.

This work extends previous "retrodictive" use of Bradley-Terry models and their generalizations, specifically to handle 3 points for a win, and also to incorporate home/away effects coherently without assuming homogeneity across teams.  Playing records that are 100% or 0%, which can be problematic in standard Bradley-Terry approaches, are incorporated in a simple way without the need for a regularizing penalty on the likelihood. A maximum-entropy argument shows how the method developed here is the mathematically "best" way to account for schedule strength in a football league table.

Illustrations will be from the English Premier League, and the Italian Serie A.
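
A simplified R sketch of the retrodictive idea (win/loss Bradley-Terry with a common home-advantage intercept, fitted as a logistic GLM on simulated results; the talk's method additionally handles draws, 3 points for a win, and team-specific home effects):

set.seed(1)
teams <- paste0("T", 1:6)
strength <- c(0, rnorm(5))                    # true strengths, T1 as baseline
home_adv <- 0.3
games <- expand.grid(home = 1:6, away = 1:6)
games <- games[games$home != games$away, ]    # full double round-robin
eta <- home_adv + strength[games$home] - strength[games$away]
games$home_win <- rbinom(nrow(games), 1, plogis(eta))
X <- matrix(0, nrow(games), 6, dimnames = list(NULL, teams))
X[cbind(seq_len(nrow(games)), games$home)] <- 1   # +1 for the home team
X[cbind(seq_len(nrow(games)), games$away)] <- -1  # -1 for the away team
fit <- glm(home_win ~ X[, -1], family = binomial, data = games)
coef(fit)   # intercept ~ home advantage; others ~ strengths relative to T1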


4/05/2021 at 14:00
Manuele Leonelli (IE University Madrid)
Title: Untangling complex dependencies in categorical data using staged trees

The dependence structure of a categorical random vector is often studied by means of a probabilistic graphical model. The most commonly used model is the so-called Bayesian network, which provides an intuitive and efficient framework to assess (causal) dependencies. One of the major drawbacks of these models is that they can only explicitly represent symmetric dependencies, which, in practice, may not give a complete description of the data dependence structure. Staged trees are a flexible class of graphical models which can explicitly represent and model a wide array of non-symmetric dependence structures.
In this talk, I will provide an overview of this model class and their application to a wide array of datasets. I will also discuss a number of ongoing developments for staged trees, including efficient structural learning, causal discovery, manipulations of the graphs and the new stagedtrees R package.
The talk is based on joint work with Gherardo Varando (University of Valencia), Federico Carli and Eva Riccomagno (University of Genova).
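
A minimal sketch with the stagedtrees package mentioned in the talk (assuming its CRAN interface, where full() builds the saturated staged tree and stages_bhc() runs a backward hill-climbing search that merges stages):

library(stagedtrees)               # assumption: CRAN stagedtrees API
mod_full <- full(Titanic)          # saturated staged tree on the Titanic table
mod <- stages_bhc(mod_full)        # backward hill-climbing merges stages (BIC)
summary(mod)
plot(mod)                          # colours display the learned asymmetric stages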


18/05/2021 at 14:00
Laura Sangalli (Politecnico di Milano)
Title: Functional and complex data - new methods merging statistics, scientific computing and engineering

Recent years have seen an explosive growth in the recording of increasingly complex and high-dimensional data. Classical statistical methods are often unfit to handle such data, whose analysis calls for the definition of new methods merging ideas and approaches from statistics, applied mathematics and engineering. This seminar focuses in particular on functional and spatial data defined over complex multidimensional domains, including curved bi-dimensional domains and non-convex three-dimensional domains. I will present an innovative class of methods based on regularizing terms involving partial differential equations. The proposed methods make use of advanced numerical techniques, such as finite element analysis and isogeometric analysis. An application to the analysis of neuroimaging data is provided. In this application domain, the proposed methods offer important advantages with respect to the best state-of-the-art techniques, allowing the complex anatomy of the brain to be correctly taken into account.
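
A canonical estimation problem in this class (a generic sketch; the exact differential operator and penalty vary across the papers) minimizes a penalized least-squares functional of the form

\hat{f} = \arg\min_{f} \sum_{i=1}^{n} \big( z_i - f(\mathbf{p}_i) \big)^2 + \lambda \int_{\Omega} \big( L f(\mathbf{p}) \big)^2 \, d\mathbf{p},

where the z_i are observations at locations \mathbf{p}_i on the domain \Omega, L is a partial differential operator (for instance the Laplacian), and the penalty integral is discretized with finite elements or isogeometric analysis.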


25/05/2021 at 14:00
John Aston
(University of Cambridge)
Title: Functional Data on Constrained Spaces

Functional Data Analysis is concerned with the statistical analysis of data which are curves or surfaces. There has been considerable progress made in this area over the last 20-30 years, but most of this work has been focused on 1-dimensional curves living in a standard space such as the space of square integrable functions. However, many real data applications, such as those from linguistics and neuroimaging, involve data that are subject to considerable constraints and are not simple curves. In this talk, we will look at several different types of constrained functional data. We will examine the role of positive definiteness in linguistics and show that this can be used to study ancient languages. We will also look at 2-d manifolds embedded in 3 dimensions, such as the cortical surface of the brain. We’ll see that some current applications, such as functional connectivity, require both properties simultaneously and we’ll suggest methods for understanding the data in such cases.


1/06/2021 at 14:00
Michael Fop (University College Dublin)
Title: A composite likelihood approach for model-based clustering of high-dimensional data

The use of finite Gaussian mixture models (GMMs) is a well established approach to performing model-based clustering. Despite the popularity of GMMs, their widespread use is hindered by their inability to scale to high-dimensional data settings. Difficulties related to dealing with high-dimensional covariance matrices and highly correlated data often make the use of GMMs impractical. The composite likelihood (CL) approach uses smaller-dimensional marginal and/or conditional pseudo-likelihoods to estimate the parameters of a model, avoiding the need to fully specify the underlying joint distribution. Such an approximation is very helpful when the full model is difficult to specify or manipulate, overcoming the computational problems often arising when dealing with a multi-dimensional joint distribution. In addition, the specification of appropriate conditional likelihoods allows the modelling of the dependence structure by means of lower-dimensional terms.
This talk presents a framework that exploits the idea of embedding CL in the area of GMMs for clustering high-dimensional data. The framework explores the use of approximations to the likelihood of a GMM by means of block-pairwise and block-conditional composite likelihoods, which allow the decomposition of the potentially high-dimensional density into terms of smaller dimensions. Estimation is based on a computationally efficient expectation-maximization algorithm, enabling the use of GMMs for clustering high-dimensional data. The approach is demonstrated through simulated and real data examples.
Talk based on joint work with Claire Gormley (University College Dublin), Adrian O'Hagan (University College Dublin), Ioannis Kosmidis (University of Warwick), Dimitris Karlis (Athens University of Economics and Business), and Caitriona Ryan (Maynooth University).
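
A toy R sketch of a pairwise composite log-likelihood for a two-component Gaussian mixture (an illustration assuming the mvtnorm package; the block-pairwise and block-conditional versions of the talk generalise this by grouping variables into blocks):

library(mvtnorm)
pairwise_cl <- function(X, pi1, mu1, mu2, S1, S2) {
  d <- ncol(X); cl <- 0
  for (j in 1:(d - 1)) for (k in (j + 1):d) {
    idx <- c(j, k)   # bivariate margin of the mixture, a small-dimensional term
    dens <- pi1       * dmvnorm(X[, idx], mu1[idx], S1[idx, idx]) +
            (1 - pi1) * dmvnorm(X[, idx], mu2[idx], S2[idx, idx])
    cl <- cl + sum(log(dens))
  }
  cl
}
set.seed(1)
X <- rbind(rmvnorm(100, rep(0, 5)), rmvnorm(100, rep(3, 5)))
pairwise_cl(X, 0.5, rep(0, 5), rep(3, 5), diag(5), diag(5))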


8/06/2021 at 14:00
Riccardo Rastelli
(University College Dublin)
Title: A time-continuous extension of the latent position network model for instantaneous interactions

We create a framework to analyse the timing and frequency of instantaneous interactions between pairs of entities. This type of interaction data is especially common nowadays, and easy to collect. Examples include email networks, phone call networks, and proximity networks. The framework relies on a latent position network model: the entities are embedded in a latent Euclidean space, and they move along individual trajectories that are continuous over time. These trajectories are used to characterise the timing and frequency of the pairwise interactions. We discuss an inferential framework where we estimate the trajectories from the observed interaction data, and propose applications on artificial and real data.
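
A toy R sketch of the generative idea (an assumption-laden illustration, not the authors' model): a pair of entities moves along continuous latent trajectories, and their instantaneous interactions arrive as an inhomogeneous Poisson process, simulated by thinning, whose rate decays with the latent distance.

set.seed(1)
z1 <- function(t) c(cos(t), sin(t))            # latent trajectory of entity i
z2 <- function(t) c(cos(2 * t), sin(2 * t))    # latent trajectory of entity j
rate <- function(t) exp(1 - sqrt(sum((z1(t) - z2(t))^2)))  # decays with distance
lambda_max <- exp(1)                           # valid bound, since distance >= 0
cand <- cumsum(rexp(300, lambda_max))          # candidate times
cand <- cand[cand < 10]                        # observation window [0, 10]
keep <- runif(length(cand)) < sapply(cand, rate) / lambda_max
events <- cand[keep]                           # interaction timestamps for the pair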


Last update: 01/12/2021