Statistics research: application and methods (STREAM)

The research group contributes to the development and application of statistical methods for the solution of real-world problems.
Current methodological research themes include Bayesian methods, Bayesian nonparametrics, categorical data analysis, computational methods, extreme values, likelihood theory, multivariate methods, nonparametric methods, quality control, sensitivity analysis, spatial processes, and time series modeling.
The group also has an interest in computational statistics and in the development of statistical software, with particular focus on the R statistical language.
The group members promote the application of advanced statistical methods through interdisciplinary research, jointly with scientists at Ca’ Foscari University of Venice and other research institutions. Current applied research fields include, among others, climatology, economics, environmental sciences, epidemiology, hydrology, medicine, sociology, and sports.

Collaborations

Publications

  • Kosmidis, Ioannis, and Nicola Lunardon. "Empirical bias-reducing adjustments to estimating functions." Journal of the Royal Statistical Society Series B: Statistical Methodology (2023)
  • Caserotti, Marta, Paolo Girardi, Enrico Rubaltelli, Alessandra Tasso, Lorella Lotto, and Teresa Gavaruzzi. "Associations of COVID-19 risk perception with vaccine hesitancy over time for Italian residents." Social science & medicine 272 (2021): 113688
  • Bacro, Jean-Noël, Carlo Gaetan, Thomas Opitz, and Gwladys Toulemonde. "Hierarchical space-time modeling of asymptotically independent exceedances with an application to precipitation data." Journal of the American Statistical Association 115, no. 530 (2020): 555-569
  • Giummolè, Federica, Valentina Mameli, Erlis Ruli, and Laura Ventura. "Objective Bayesian inference with proper scoring rules." Test 28, no. 3 (2019): 728-755
  • Prosdocimi, Ilaria, Emiko Dupont, Nicole H. Augustin, Thomas R. Kjeldsen, Dan P. Simpson, and Theresa R. Smith. "Areal models for spatially coherent trend detection: the case of British peak river flows." Geophysical Research Letters 46, no. 22 (2019): 13054-13061
  • Marcon, Giulia, Simone A. Padoan, and Isadora Antoniano-Villalobos. "Bayesian inference for the extremal dependence." Electronic Journal of Statistics 10, no. 2 (2016): 3310-3337
  • Varin, Cristiano, Manuela Cattelan, and David Firth. "Statistical modelling of citation exchange between statistics journals." Journal of the Royal Statistical Society: Series A (Statistics in Society) 179, no. 1 (2016): 1-63

Research projects

Seminars

7/05/2024 at 12:00
Paolo Maranzano (Università Milano-Bicocca - Fondazione Eni Enrico Mattei)
Title: Retrieving and managing air quality data at the European level: the EEAaq package and the need to manage missing data

In this talk we discuss the EEAaq software, an R package developed to download, manage and analyze air quality data at the European level from the European Environment Agency (EEA) dataflows. The software (release 0.0.3) has been freely available on CRAN since August 2023. EEAaq addresses several issues: (1) the EEA air quality download system and its metadata retrieval lack practicality and flexibility for non-professional users; (2) direct collection of data from the agency’s portal requires heavy data manipulation; (3) air quality conditions in Europe continue to attract considerable interest from researchers and technicians involved in policy evaluation. The EEAaq package provides a set of functions that can be grouped into three categories according to their goal: 1) download data, 2) summarize and aggregate data, and 3) build static and dynamic maps. The download functions allow users to specify either LAU- or NUTS-level zone information, a specific shapefile, or a list of coordinates representing the area for which to retrieve the respective air quality data. The summary functions allow for the computation of descriptive statistics, data information, and time aggregation. The mapping functions represent the monitoring stations and build spatial interpolation maps.
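A minimal usage sketch in R following the three categories above (the function and argument names are illustrative assumptions based on the abstract, not checked against the package documentation):

    # Sketch only: names below are assumptions, not the verified EEAaq API.
    library(EEAaq)

    # 1) Download: PM10 data for a hypothetical LAU-level zone.
    dat <- EEAaq_get_data(zone_name = "Milano", NUTS_level = "LAU",
                          pollutant = "PM10", from = 2021, to = 2022)

    # 2) Summarize and aggregate: descriptive statistics and monthly means.
    summ    <- EEAaq_summary(dat)
    monthly <- EEAaq_time_aggregate(dat, frequency = "monthly")

    # 3) Maps: monitoring stations and a spatial interpolation map.
    EEAaq_map_stations(data = dat)
    EEAaq_idw_map(data = monthly, pollutant = "PM10")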

Data provided by the EEA suffer from poor comparability due to the heterogeneity of national and regional agencies: depending on the country, pollutants may be measured at different frequencies or not measured at all. Another serious problem is the high prevalence of missing values and gaps in the collected time series. To address this issue, a variety of algorithms are being developed in the EEAaq package that provide estimates and imputations of missing values by exploiting the multiple seasonality properties (intraday, weekly, and annual) of the pollutants. In particular, we propose variants of the Site-Dependent Effect method (Plaia & Bondì, 2006), which takes into account only the temporal dynamics of the data, that 1) explicitly model the spatial correlation between monitoring stations; 2) model the potential spatial heterogeneity between time series; and 3) model the positive asymmetry that typically characterizes pollutant concentrations. The imputation algorithms are evaluated through a simulation study based on the actual atmospheric monitoring network installed in Europe in 2023.

Joint work with: Riccardo Borgoni, Agostino Tassan Mazzocco (University of Milano-Bicocca)


23/04/2024 at 12:15
Lorenzo Schiavon (Università Ca' Foscari Venezia - DEC)
Title: Structured infinite factorization

A challenging task in learning low-dimensional representations of high-dimensional objects is the lack of knowledge about the dimension of the latent space and the relative impact of its components. To address this, overfitted factor models with increasing shrinkage priors have been proposed, enabling the adaptive removal of unnecessary components by truncating latent matrices. In this talk, we discuss the limitations of current specifications and gradually compose a framework incorporating novel methods to address these deficiencies. We introduce a general class of infinite factor models with sparse patterns, which can be structured based on exogenous information, thus filling gaps in existing infinite factor models by accommodating grouped variables and non-exchangeable structures. This framework is further extended to general matrix decomposition models, presenting a more scalable computational strategy inspired by gradient boosting. Practical benefits of sparse modeling in infinite factor models are demonstrated through diverse applications to football tracking data.
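For context, a common baseline that this line of work generalizes (notation assumed here, not taken from the talk) is the infinite factor model with a multiplicative increasing shrinkage prior on the loadings:

$$ y_i = \Lambda \eta_i + \epsilon_i, \qquad \lambda_{jh} \mid \phi_{jh}, \tau_h \sim N\left(0, \phi_{jh}^{-1} \tau_h^{-1}\right), \qquad \tau_h = \prod_{l=1}^{h} \delta_l, $$

where, for suitable hyperparameters, the cumulative products $\tau_h$ stochastically increase with the column index $h$, shrinking later columns of $\Lambda$ towards zero so that the latent matrix can be truncated adaptively.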


16/04/2024 at 12:15
Francesca Romana Crucinio (King's College London)
Title: A connection between Tempering and Entropic Mirror Descent

This talk explores the connections between tempering (for Sequential Monte Carlo; SMC) and entropic mirror descent to sample from a target probability distribution whose unnormalized density is known. We establish that tempering SMC corresponds to entropic mirror descent applied to the reverse Kullback-Leibler (KL) divergence and obtain convergence rates for the tempering iterates. Our result motivates the tempering iterates from an optimization point of view, showing that tempering can be seen as a descent scheme of the KL divergence with respect to the Fisher-Rao geometry, in contrast to Langevin dynamics that perform descent of the KL with respect to the Wasserstein-2 geometry. We exploit the connection between tempering and mirror descent iterates to justify common practices in SMC and derive adaptive tempering rules that improve over alternative benchmarks in the literature.
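A sketch of the correspondence (standard notation, assumed): the tempering recursion

$$ \mu_{k+1}(x) \propto \mu_k(x)^{1-\gamma_k}\, \pi(x)^{\gamma_k}, \qquad \gamma_k \in (0,1], $$

which SMC implements by reweighting and moving particles, coincides with an entropic mirror descent step of size $\gamma_k$ on the objective $\mu \mapsto \mathrm{KL}(\mu \,\|\, \pi)$; the convergence rates mentioned above concern iterates of this form.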


9/04/2024 at 12:15
Laura D'Angelo (Università degli Studi di Milano-Bicocca)
Title: Analyzing the activation patterns and heterogeneity of the neuronal response via Bayesian mixture models

The modern technique of calcium imaging is revolutionizing the understanding of the nervous system, thanks to its ability to image the activity of individual neurons over time in freely moving animals. This technology has led to remarkable insights into how neurons process and encode information, both individually and collectively. In this talk, we examine how Bayesian mixture models can help uncover the complex functioning of neurons in different experimental studies. First, we discuss a nested mixture model to analyze how the activity of an individual cell is affected by visual stimuli that vary over time. Then, we move to a multivariate model to identify groups of co-activating neurons, where information on anatomical proximity is used to inform the clustering procedure. At the basis of both models is a simple but effective state-space formulation that describes the calcium dynamics and relates them to the latent firing events. Through these examples, we show how the Bayesian nonparametric framework is ideal for flexibly modeling the unobserved neuronal activity, identifying patterns of activity, borrowing information across experimental conditions, and including additional knowledge.
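A minimal example of such a state-space formulation (a standard calcium-dynamics sketch; the talk's exact specification may differ) is

$$ y_t = b + c_t + \varepsilon_t, \qquad c_t = \gamma\, c_{t-1} + A_t, $$

where $y_t$ is the observed fluorescence trace, $c_t$ the latent calcium concentration decaying at rate $\gamma \in (0,1)$, and $A_t$ a spike amplitude that is nonzero only at the latent firing events, which are in turn modeled through the Bayesian mixture priors.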


26/03/2024 at 12:15
Philipp Sterzinger (University of Warwick)
Title: Diaconis-Ylvisaker prior penalized likelihood in high-dimensional logistic regression

In recent years, there has been a surge of interest in estimators and inferential procedures that exhibit optimal asymptotic properties in high-dimensional logistic regression when the number of covariates grows proportionally as a fraction ($\kappa \in (0,1)$) of the number of observations. In this seminar, we focus on the behaviour of a class of maximum penalized likelihood estimators, employing the Diaconis-Ylvisaker prior as the penalty.
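As a hedged sketch of the estimator class (one convenient representation of a Diaconis-Ylvisaker-type penalty in logistic regression; the talk's exact parameterization may differ), the penalized estimate can be written as maximum likelihood with responses shrunk towards $1/2$:

$$ \hat\beta = \arg\max_{\beta} \sum_{i=1}^{n} \left\{ \tilde y_i\, x_i^\top \beta - \log\left(1 + e^{x_i^\top \beta}\right) \right\}, \qquad \tilde y_i = \alpha y_i + \frac{1-\alpha}{2}, $$

with $\alpha \in (0,1]$ acting as the prior hypertuning parameter ($\alpha = 1$ recovers maximum likelihood).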

Building on advancements in approximate message passing, we analyze the aggregate asymptotic behaviour of these estimators when covariates are normal random variables with arbitrary covariance. This analysis enables us to eliminate the persistent asymptotic bias of the estimators through straightforward rescaling for any value of the prior hypertuning parameter. Moreover, we derive asymptotic pivots for constructing inferences, including adjusted Z-statistics and penalized likelihood ratio statistics.

Unlike the maximum likelihood estimate, which asymptotically exists only in a limited region of the plane of $\kappa$ versus signal strength, the maximum penalized likelihood estimate always exists and is directly computable via maximum likelihood routines. As a result, our asymptotic results remain valid even in regions where existing maximum likelihood results are not obtainable, with no overhead in implementation or computation.

The dependency of the estimators on the prior hyper-parameter facilitates the derivation of estimators with zero asymptotic bias and minimal mean squared error. We will explore these estimators' shrinkage properties and substantiate our theoretical findings with simulations and applications.


19/03/2024 at 12:15
Andrea Cappozzo (Università degli Studi di Milano)
Title: Penalized mixed-effects multitask learning: a framework for regularizing multilevel models with applications to DNA methylation biomarker creation

Linear mixed modeling is a well-established technique widely employed when observations possess a grouping structure. Nevertheless, this standard methodology is no longer applicable when the learning framework encompasses a multivariate response and high-dimensional predictors. To overcome these issues, a penalized estimation scheme based on an expectation-maximization (EM) algorithm is proposed, in which any penalty criteria previously devised for fixed-effects models can be conveniently incorporated into the fitting process. We employ the novel methodology for creating surrogate biomarkers of cardiovascular risk factors, such as lipids and blood pressure, from whole-genome DNA methylation data in a multi-center study. The proposed approach shows promising results in both predictive accuracy and bio-molecular interpretation compared to state-of-the-art alternatives.
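Schematically (a generic sketch of how a fixed-effects penalty can enter the EM scheme; notation assumed, not taken from the talk), each M-step may update the fixed-effects matrix by a penalized least-squares problem of the form

$$ B^{(k+1)} = \arg\min_{B} \left\{ \sum_{i} \left\| \tilde y_i^{(k)} - X_i B \right\|^2 + \lambda\, P(B) \right\}, $$

where $\tilde y_i^{(k)}$ are responses adjusted for the current random-effects predictions and $P$ is any penalty devised for fixed-effects models (e.g., lasso or group-lasso).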
Based on joint work with Francesca Ieva and Giovanni Fiorito.


12/03/2024 at 12:15
Alessandra Menafoglio (Politecnico di Milano)
Title: The Bayes space approach to functional data analysis for probability density functions

In the presence of increasingly massive and heterogeneous data, the statistical modeling of distributional observations plays a key role. Choosing the ‘right’ embedding space for these data is of paramount importance for their statistical processing, to account for their nature and inherent constraints. The Bayes space theory provides a natural embedding space for (spatial) distributional data, and has been successfully applied in varied settings. In this presentation, I will discuss state-of-the-art methods for the modelling, analysis, and prediction of distributional data, with particular attention to cases where their spatial dependence cannot be neglected. I will embrace the viewpoint of object-oriented spatial statistics (O2S2), a system of ideas for the analysis of complex data with spatial dependence. All the theoretical developments will be illustrated through their application to real data, highlighting the intrinsic challenges of a statistical analysis that follows the Bayes space approach.
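As a pointer to the underlying geometry (standard Bayes space machinery, assumed here), a density $f$ on a domain $I$ is typically mapped into an $L^2$ space through the centred log-ratio transform

$$ \mathrm{clr}(f)(x) = \log f(x) - \frac{1}{|I|} \int_I \log f(t)\, dt, $$

so that linear methods of functional data analysis and spatial prediction can be applied while respecting the positivity and unit-integral constraints of densities.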


27/02/2024 at 12:15
Federico Ferraccioli [ITA] (Università degli Studi di Padova)
Title: Detecting stable modes using stochastic gradients

In the framework of density-based clustering, modes represent a crucial concept, as they identify high-density regions that can be associated with different groups of observations. The possibility of conducting inference on the resulting clusters nonetheless remains an intriguing yet quite intricate problem. To tackle this challenge, we propose a procedure that leverages the connection between mean-shift clustering and stochastic gradient methods. This enables the definition of a sampling procedure, which can be used to construct confidence regions for the modes of a density. We investigate the asymptotic properties of the proposed method and evaluate its performance across different scenarios.
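The connection can be sketched as follows (standard notation, assumed): the mean-shift update is a fixed-point iteration that ascends a kernel density estimate,

$$ x^{(t+1)} = \frac{\sum_{i=1}^{n} x_i\, K\!\left( (x^{(t)} - x_i)/h \right)}{\sum_{i=1}^{n} K\!\left( (x^{(t)} - x_i)/h \right)}, $$

so evaluating the sums on random subsamples turns it into a stochastic gradient scheme, whose sampling variability can be exploited to build confidence regions for the modes.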

Past seminars

12/12/2023 at 10:30
Dáire Healy (Maynooth University)
Title: Extreme spatio-temporal events in a changing climate; an analysis of extreme temperatures in Ireland

Characterisation of extreme temperature events is crucial for societal development. We are thus motivated to investigate the changing nature of the frequency, magnitude and spatial extent of extreme temperatures in Ireland. We develop an extreme value model that captures spatial and temporal non-stationarity in extreme daily maximum temperature data. We model the tails of the marginal variables using the generalised Pareto distribution and the spatial dependence of extreme events using an r-Pareto process, with the parameters of each model allowed to change over time. We use weather station observations for modelling extreme events, since data from climate models have trends determined by the specific climate model configuration. However, climate models do provide valuable information about the detailed physiography of Ireland and the associated climate response. We propose novel methods which exploit the climate model data to overcome issues linked to the sparse and biased sampling of the observations. In this talk we will give details of our data and modelling procedure, and illustrate how our analysis identifies a temporal change in the behaviour of extreme temperature events over Ireland.
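For reference (standard extreme value notation, assumed), the marginal tail model treats exceedances of a high threshold $u$ with the generalised Pareto distribution

$$ P(X - u \le y \mid X > u) = 1 - \left( 1 + \xi y / \sigma_u \right)_+^{-1/\xi}, $$

with the scale $\sigma_u$ and shape $\xi$ here allowed to vary over space and time.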


13/06/2023 at 14:00
Cecilia Viscardi [ITA] (Dipartimento di Statistica, Informatica, Applicazioni 'G. Parenti' (DiSIA), Università degli Studi di Firenze)
Title: Likelihood-free Transport Monte Carlo -- Joint work with Dr Dennis Prangle (University of Bristol)

Approximate Bayesian computation (ABC) is a class of methods for drawing inferences when the likelihood function is unavailable or computationally demanding to evaluate. Importance sampling and other algorithms using sequential importance sampling steps are state-of-the-art methods in ABC. Most of them draw samples from tempered approximate posterior distributions defined by a decreasing sequence of ABC tolerance thresholds. Their efficiency is sensitive to the choice of an adequate proposal distribution and/or forward kernel function. We present a novel ABC method that addresses this problem by combining importance sampling steps with optimization procedures. We resort to normalising flows (NFs) to optimize proposal distributions over a family of densities, so as to transport particles drawn at each step towards the next tempered target. The combination of sampling and optimization steps thus allows the tempered distributions to approach the target posterior efficiently.
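In standard ABC notation (assumed), the tempered targets are indexed by a decreasing sequence of tolerances $\epsilon_1 > \epsilon_2 > \cdots$,

$$ \pi_{\epsilon_t}(\theta \mid y) \propto \pi(\theta) \int \mathbb{1}\left\{ d\big(s(x), s(y)\big) \le \epsilon_t \right\} f(x \mid \theta)\, dx, $$

and the normalising flows are trained so that the transported particles from step $t$ form a good proposal for the target at tolerance $\epsilon_{t+1}$.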

Results presented during this talk are from ongoing research that builds on the paper:
Dennis Prangle & Cecilia Viscardi (2023) Distilling Importance Sampling for Likelihood Free Inference, Journal of Computational and Graphical Statistics, DOI: https://doi.org/10.1080/10618600.2023.2175688


30/05/2023 at 14:00
Ioannis Kosmidis (University of Warwick)
Title: Maximum softly-penalized likelihood for mixed effects logistic regression

Maximum likelihood estimation in logistic regression with mixed effects is known to often result in estimates on the boundary of the parameter space. Such estimates, which include infinite values for fixed effects and singular or infinite variance components, can wreak havoc on numerical estimation procedures and inference. We introduce an appropriately scaled additive penalty to the log-likelihood function, or an approximation thereof, which penalizes the fixed effects by the Jeffreys invariant prior for the model with no random effects, and the variance components by a composition of negative Huber loss functions. The resulting maximum penalized likelihood estimates are shown to lie in the interior of the parameter space. Appropriate scaling of the penalty guarantees that the penalization is soft enough to preserve the optimal asymptotic properties expected of the maximum likelihood estimator, namely consistency, asymptotic normality, and Cramér-Rao efficiency. Our choice of penalties and scaling factor preserves equivariance of the fixed effects estimates under linear transformations of the model parameters, such as contrasts. Maximum softly-penalized likelihood is compared to competing approaches on real-data examples and through comprehensive simulation studies that illustrate its superior finite-sample performance.
Joint work with: Philipp Sterzinger, University of Warwick
Relevant paper: https://doi.org/10.1007/s11222-023-10217-3


16/05/2023 at 14:00
Omiros Papaspiliopoulos (Università Bocconi Milano)
Title: Accurate and scalable large-scale inference for mixed models

Generalized linear mixed models are the workhorse of applied statistics. In modern applications, from political science to electronic marketing, it is common to have categorical factors with a large number of levels. This arises naturally when considering interaction terms in survey-type data, or in recommender-system-type applications. In such contexts it is important to have a scalable computational framework, that is, one whose complexity scales linearly with the number of observations $n$ and parameters $p$ in the model. Popular implementations, such as those in lmer, although highly optimized, involve costs that scale polynomially with $n$ and $p$. We adopt a Bayesian approach (although the essence of our arguments applies more generally) for inference in such contexts and design families of variational approximations for approximate Bayesian inference with provable scalability. We also provide guarantees for the resultant approximation error and, in fact, link it to the rate of convergence of the numerical schemes used to obtain the variational approximation.
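Schematically (generic variational inference, with notation assumed), the variational approximation $q$ maximizes the evidence lower bound

$$ \mathcal{L}(q) = \mathbb{E}_q\left[ \log p(y, \theta) \right] - \mathbb{E}_q\left[ \log q(\theta) \right] \le \log p(y), $$

and the families discussed in the talk are designed so that each optimization sweep costs $O(n + p)$ operations while keeping the approximation error under control.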
This is joint work with Giacomo Zanella (Bocconi) and Max Goplerud (Pittsburgh)


9/05/2023 at 14:00
Roberta Pappadà (University of Trieste)
Title: Time Series Clustering based on Multivariate Comonotonicity via Copulas

In recent years, copula-based measures of association have been exploited to develop clustering methods that can identify the co-movements of random variables representing, e.g., a set of physical variables describing the phenomenon of interest (such as flood peak and volume). When the phenomenon under consideration is described by multiple time series collected at given geographical sites, such clustering methods may allow the identification of sub-regions characterized by a similar stochastic behavior. While many studies have focused on a single variable of interest, the copula approach represents a natural way to develop a multivariate framework, which can account for the role of compound events in extremes. Hence, the study of compound events and the associated risk may benefit from a copula-based spatial clustering of time series. In this regard, we propose a dissimilarity-based clustering procedure to identify spatial clusters of gauge stations, each characterized by multiple time series. In particular, the procedure tends to cluster sites that exhibit a weak form of comonotonic behavior, which is more apt for some applications, thus allowing for a much more flexible notion of comonotonicity. Different dissimilarity indices, which depend only on the copula of the involved random variables, are proposed and compared in a simulation study. The proposed method is illustrated via an application to the analysis of flood risks.
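One generic example of such an index (an illustration, not necessarily the talk's exact proposal) measures the departure of a pairwise copula $C_{ij}$ from the comonotone copula $M(u,v) = \min(u,v)$:

$$ d(i,j) = \kappa \int_{[0,1]^2} \left\{ M(u,v) - C_{ij}(u,v) \right\} du\, dv, $$

with $\kappa$ a normalizing constant; this quantity is non-negative since $C \le M$ for every copula, vanishes exactly under comonotonicity, and depends on the data only through the copula.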
The talk will present some results from ongoing research based on the collaboration with Fabrizio Durante (Università del Salento) and Sebastian Fuchs (University of Salzburg).


2/05/2023 at 14:00
Fabrizio Laurini (Università di Parma)
Title: Extremal features of GARCH models and their numerical evaluation

Generalized autoregressive conditionally heteroskedastic (GARCH) processes, which are widely used for risk management when modelling the conditional variance of financial returns, have peculiar extremal properties, as extreme values tend to cluster according to a non-trivial scheme. Marginal and dependence features of GARCH processes are determined by a multivariate regular variation property and tail processes. For high-order processes new results are presented and a set of new algorithms is analysed. These algorithms exploit a mixture of new limit theory and particle filtering results for fixed point distributions, so that a novel method is now available. Special cases including ARCH and IGARCH processes are investigated, even when the innovation term has a skew-t distribution. In some of these special cases the marginal variance does not even exist. With our results it is possible to evaluate the marginal tail index and other measures of temporal extremal dependence, such as the extremogram and the extremal index.
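For concreteness (standard notation, assumed), the GARCH(1,1) case is

$$ X_t = \sigma_t Z_t, \qquad \sigma_t^2 = \omega + \alpha X_{t-1}^2 + \beta \sigma_{t-1}^2, $$

whose stationary solution has regularly varying (power-law) marginal tails; the tail index and clustering measures such as the extremal index are not available in closed form, which is why numerical algorithms of the kind discussed in the talk are needed.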

The presentation is based on the paper: Laurini, F., Fearnhead, P. & Tawn, J. "Limit theory and robust evaluation methods for the extremal properties of GARCH(p, q) processes". Stat Comput 32, 104 (2022). https://doi.org/10.1007/s11222-022-10164-5 (available open access).


18/04/2023 at 14:30
Michele Peruzzi (Duke University)
Title: Bayesian multi-species N-mixture models for large scale spatial data in community ecology

Community ecologists seek to model the local abundance of multiple animal species while taking into account that observed counts represent only a portion of the underlying population size. Analogously, modeling spatial correlations in species' latent abundances is important when attempting to explain how species compete for scarce resources. We develop a Bayesian multi-species N-mixture model with spatial latent effects to address both issues. On one hand, our model accounts for imperfect detection by modeling local abundance via a Poisson log-linear model; conditional on the local abundance, the observed counts have a binomial distribution. On the other hand, we let a directed acyclic graph restrict spatial dependence in order to speed up computations, and we use recently developed gradient-based Markov chain Monte Carlo methods to sample a posteriori in the multivariate non-Gaussian data scenarios in which we are interested.
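The hierarchy described above can be sketched as (notation assumed)

$$ y(s) \mid N(s) \sim \mathrm{Binomial}\{N(s), p\}, \qquad N(s) \sim \mathrm{Poisson}\{\lambda(s)\}, \qquad \log \lambda(s) = x(s)^\top \beta + w(s), $$

with $w(s)$ the spatial latent effect whose dependence structure is restricted by the directed acyclic graph.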


21/03/2023 at 14:00
Tommaso Rigon (Università degli Studi di Milano-Bicocca)
Title: Bayesian nonparametric prediction of the taxonomic affiliation of DNA sequences

Predicting the taxonomic affiliation of DNA sequences collected from biological samples is a fundamental step in biodiversity assessment. This task is performed by leveraging existing databases containing reference DNA sequences endowed with a taxonomic identification. However, environmental sequences can be from organisms that are either unknown to science or for which there are no reference sequences available. Thus, the taxonomic novelty of a sequence needs to be accounted for when doing classification. We propose Bayesian nonparametric taxonomic classifiers, BayesANT, which use species sampling model priors to allow unobserved taxa to be discovered at each taxonomic rank. Using a simple product multinomial likelihood with conjugate Dirichlet priors at the lowest rank, a highly flexible supervised algorithm is developed to provide a probabilistic prediction of the taxa placement of each sequence at each rank. We run our algorithm on a carefully annotated library of Finnish arthropods (FinBOL). To assess the ability of BayesANT to recognize novelty and to predict known taxonomic affiliations correctly, we test it on two training-test splitting scenarios, each with a different proportion of taxa unobserved in training. Our algorithm attains accurate predictions and reliably quantifies classification uncertainty, especially when many sequences in the test set are affiliated with taxa unknown in training.
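For instance (an illustrative species-sampling prediction rule, assumed rather than quoted from the talk), under a Pitman-Yor prior the probability that the next sequence belongs to a taxon unobserved among the $n$ training sequences, with $K_n$ distinct taxa seen so far, is

$$ P(\text{new taxon} \mid \text{data}) = \frac{\theta + \sigma K_n}{\theta + n}, \qquad 0 \le \sigma < 1, \ \theta > -\sigma, $$

which is the kind of mechanism that lets BayesANT assign positive probability to taxonomic novelty at each rank.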


28/02/2023 at 14:00
Nicola Lunardon (Università degli Studi di Milano-Bicocca)
Title: On bias prevention and incidental parameters

Firth (1993) introduced a method for reducing the bias of the maximum likelihood estimator. The approach enjoys some welcome side effects, such as yielding finite estimates when maximum likelihood ones are infinite. An additional property regards the effectiveness of the method in reducing the sensitivity of likelihood-based inferential procedures to incidental parameters. The usefulness of the above properties is demonstrated through simulations in the analysis of binary matched data.
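For reference, Firth's method maximizes the Jeffreys-prior-penalized log-likelihood

$$ \ell^*(\theta) = \ell(\theta) + \tfrac{1}{2} \log \det i(\theta), $$

where $i(\theta)$ is the Fisher information matrix; the adjustment removes the leading $O(n^{-1})$ term of the bias of the maximum likelihood estimator.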

30/06/2022 at 10:00
Thomas Kjeldsen (University of Bath)
Title: Modelling extreme rainfall and flood events; statistical challenges for a hydrological engineer

Statistical models of extreme rainfall and flood events have been used for more than a century to aid the design and operation of water infrastructure. In particular, extreme value models are routinely used to establish a relationship between event magnitude and exceedance probability, often expressed as a return period. This talk will present examples of recent progress in the application of extreme value models to practical problems and opportunities facing hydrological engineers. First, the impact of land-use and climate change on future flood risk will be considered, focussing in particular on the use of change-permitting extreme value models. Secondly, the talk will discuss the use of mixture models to account for different event types, and the new opportunities emerging through the use of large-scale open-access meteorological data.
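In the standard notation of this setting (assumed), the return period of a level $x$ is

$$ T(x) = \frac{1}{1 - F(x)}, $$

where $F$ is the distribution function of the annual maximum; a "100-year" flood is thus the level with annual exceedance probability $0.01$.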


14/06/2022 at 14:30
Emanuele Aliverti (Università Ca' Foscari Venezia - DEC)
Title: Stratified stochastic variational inference for networks

There has been considerable interest in Bayesian modeling of high-dimensional networks via latent space approaches. These methods make it possible to characterize the dependence structure of a network with simplicity, often demonstrating remarkable empirical performance. Unfortunately, as the number of nodes increases, estimation based on Markov chain Monte Carlo becomes very slow and can lead to inadequate mixing.
In this talk, I will illustrate scalable algorithms to conduct approximate posterior inference for latent factor models, relying on a novel stratified variational inference approach. I will illustrate the benefits of the proposed methods with different examples, focusing on high-resolution brain imaging and on the relationships among the contrade (districts) of Venice during the 18th century.


13/01/2022 at 10:00
Thomas Kjeldsen (University of Bath)
Title: Reconstructing the peak flow of historical flood events: the city of Bath, United Kingdom

This talk will discuss practical and statistical challenges in the use of reconstructed historical flood events in contemporary flood frequency analysis. Focussing on the River Avon, which runs through the City of Bath, work was undertaken to reconstruct the magnitude of large historical events based on documentary evidence (e.g. flood marks, photographs, newspaper articles) combined with a hydraulic river model. The reconstructed events cover the period 1866-1960 and include 16 distinct events. They were used to augment contemporary instrumental flow measurements, creating a censored annual maximum series of peak flow covering the period 1866-2016. Finally, the augmented annual maximum series was modelled using a censored generalised logistic distribution to enable quantile-based estimates of design floods. The results show that including the historical events increases the 1-in-100-year design flood by 20% compared to using the modern instrumental record only.
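A sketch of the censored likelihood used in such analyses (standard historical-flood formulation, assumed here): if, over $h$ historical years, only the $k$ events exceeding a perception threshold $x_0$ are reconstructed, the annual-maximum parameters $\theta$ are estimated from

$$ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) \times \prod_{j=1}^{k} f(x_j^{H}; \theta) \times F(x_0; \theta)^{\,h-k}, $$

where the $x_i$ are the instrumental annual maxima, the $x_j^{H}$ the reconstructed historical peaks, and the factor $F(x_0;\theta)^{h-k}$ accounts for the historical years known only to have stayed below the threshold.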

18/11/2021 at 15:00
Philippe Naveau (Laboratoire des Sciences du Climat et de l’Environnement)
Title: Combining ensemble climate simulations to take into account multi-model error

Global climate model outputs, like any ensemble of numerical model simulations, are an approximation of the true system under study, here the climate system. Different sources of variability, uncertainty and model error have to be taken into account to provide reliable estimates of climate change. In this talk, we will study basic atmospheric variables (temperature and precipitation) simulated from more than 10 different climate models of the Coupled Model Intercomparison Project (CMIP, versions 5 and 6). As each climate model provides an ensemble of similar but still different climate trajectories, we propose and study a statistical model to efficiently combine these runs. As climate databases can be large, the proposed statistical procedure needs to be fast and easy to implement.
Under these constraints, we propose a simple statistic that, under precisely defined conditions, has the advantage of averaging imperfect model outputs without the need for model bias corrections. We also check the validity of our conditions by contrasting recorded measurements and simulations.


4/11/2021 at 12:15
Philippe Naveau (Laboratoire des Sciences du Climat et de l’Environnement)
Title: Evaluation of binary classifiers for environmental extremes

Machine learning classification methods usually assume that all possible classes are sufficiently present within the training set. Due to their inherent rarity, extreme events are always under-represented, and classifiers tailored to predicting extremes need to be carefully designed to handle this under-representation. In this talk, we address the question of how to assess and compare classifiers with respect to their capacity to capture extreme occurrences. This is also related to the topic of scoring rules used in the forecasting literature. In this context, we propose and study different risk functions adapted to extremal classifiers. The inferential properties of our empirical risk estimator are derived under the framework of multivariate regular variation and hidden regular variation. As an example, we study in detail the special class of linear classifiers and show that the optimisation of our risk function leads to a consistent solution. A simulation study compares different classifiers and indicates their performance with respect to our risk functions. To conclude, we apply our framework to the analysis of extreme river discharges in the Danube river basin. The application compares different predictive algorithms and tests their capacity to forecast river discharges from other river stations. As a by-product, we identify the explanatory variables that contribute the most to extremal behaviour. If time allows, we will also discuss other climate datasets.
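One natural form for such a risk (an illustrative sketch, with notation assumed) restricts the misclassification error to the extreme region:

$$ R_u(g) = P\left\{ g(X) \ne Y \,\middle|\, \| X \| > u \right\}, $$

studied in the limit of increasingly high thresholds $u$, which is where multivariate regular variation provides the asymptotic theory for the empirical risk estimator.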

Joint work with Juliette Legrand (LSCE, Rennes University) and Marco Oesting (Siegen University)


8/06/2021 at 14:00
Riccardo Rastelli (University College Dublin)
Title: A time-continuous extension of the latent position network model for instantaneous interactions

We create a framework to analyse the timing and frequency of instantaneous interactions between pairs of entities. This type of interaction data is especially common nowadays, and easy to collect. Examples include email networks, phone call networks, and proximity networks. The framework relies on a latent position network model: the entities are embedded in a latent Euclidean space, and they move along individual trajectories that are continuous over time. These trajectories are used to characterise the timing and frequency of the pairwise interactions. We discuss an inferential framework in which we estimate the trajectories from the observed interaction data, and we propose applications on artificial and real data.
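One natural specification of this kind (an assumption for illustration, not necessarily the exact model of the talk) treats each pair as a Poisson process whose rate decays with the latent distance:

$$ \lambda_{ij}(t) = \exp\left\{ \beta - \left\| z_i(t) - z_j(t) \right\|^2 \right\}, $$

so that intervals in which two trajectories pass close to each other generate bursts of instantaneous interactions.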


1/06/2021 at 14:00
Michael Fop (University College Dublin)
Title: A composite likelihood approach for model-based clustering of high-dimensional data

The use of finite Gaussian mixture models (GMMs) is a well-established approach to performing model-based clustering. Despite the popularity of GMMs, their widespread use is hindered by their inability to transfer to high-dimensional data settings. Difficulties related to dealing with high-dimensional covariance matrices and highly correlated data often make the use of GMMs impractical. The composite likelihood (CL) approach uses smaller-dimensional marginal and/or conditional pseudo-likelihoods to estimate the parameters of a model, avoiding the need to fully specify the underlying joint distribution. Such an approximation is very helpful when the full model is difficult to specify or manipulate, overcoming the computational problems often arising when dealing with a multi-dimensional joint distribution. In addition, the specification of appropriate conditional likelihoods allows the modelling of the dependence structure by means of lower-dimensional terms.
This talk presents a framework that exploits the idea of embedding CL in the area of GMMs for clustering high-dimensional data. The framework explores the use of approximations to the likelihood of a GMM by means of block-pairwise and block-conditional composite likelihoods, which allow the decomposition of the potentially high-dimensional density into terms of smaller dimensions. Estimation is based on a computationally efficient expectation-maximization algorithm, enabling the use of GMMs for clustering high-dimensional data. The approach is demonstrated through simulated and real data examples.
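In symbols (standard composite likelihood notation, assumed), a block-pairwise approximation replaces the full $p$-dimensional mixture log-likelihood with sums of low-dimensional contributions,

$$ c\ell(\theta; y) = \sum_{i=1}^{n} \sum_{b < b'} \log f\left( y_i^{(b)}, y_i^{(b')}; \theta \right), $$

where $y_i^{(b)}$ is the $b$-th block of variables of observation $i$, so that only small covariance blocks ever need to be manipulated.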
Talk based on joint work with Claire Gormley (University College Dublin), Adrian O'Hagan (University College Dublin), Ioannis Kosmidis (University of Warwick), Dimitris Karlis (Athens University of Economics and Business), and Caitriona Ryan (Maynooth University).


25/05/2021 at 14:00
John Aston (University of Cambridge)
Title: Functional Data on Constrained Spaces

Functional Data Analysis is concerned with the statistical analysis of data which are curves or surfaces. There has been considerable progress made in this area over the last 20-30 years, but most of this work has focused on 1-dimensional curves living in a standard space such as the space of square integrable functions. However, many real data applications, such as those from linguistics and neuroimaging, involve considerable constraints on data which are not simple curves. In this talk, we will look at several different types of constrained functional data. We will examine the role of positive definiteness in linguistics and show that this can be used to study ancient languages. We will also look at 2-d manifolds embedded in 3 dimensions, such as the cortical surface of the brain. We’ll see that some current applications, such as functional connectivity, require both properties simultaneously, and we’ll suggest methods for understanding the data in such cases.


18/05/2021 at 14:00
Laura Sangalli (Politecnico di Milano)
Title: Functional and complex data - new methods merging statistics, scientific computing and engineering

Recent years have seen an explosive growth in the recording of increasingly complex and high-dimensional data. Classical statistical methods are often unfit to handle such data, whose analysis calls for the definition of new methods merging ideas and approaches from statistics, applied mathematics and engineering. This seminar focuses in particular on functional and spatial data defined over complex multidimensional domains, including curved bi-dimensional domains and non-convex three-dimensional domains. I will present an innovative class of methods based on regularizing terms involving partial differential equations. The proposed methods make use of advanced numerical techniques, such as finite element analysis and isogeometric analysis. An application to the analysis of neuroimaging data is provided. In this applicative domain, the proposed methods offer important advantages with respect to the best state-of-the-art techniques, allowing the complex anatomy of the brain to be correctly taken into account.
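The class of estimators referred to can be sketched (notation assumed) as penalized least squares over the domain $\Omega$,

$$ \min_{f} \sum_{i=1}^{n} \left\{ z_i - f(p_i) \right\}^2 + \lambda \int_{\Omega} \left( L f - u \right)^2 d\Omega, $$

where $L$ is a partial differential operator encoding problem-specific knowledge and the minimization is carried out numerically by finite element or isogeometric discretizations.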


4/05/2021 at 14:00
Manuele Leonelli (IE University Madrid)
Title: Untangling complex dependencies in categorical data using staged trees

The dependence structure of a categorical random vector is often studied by means of a probabilistic graphical model. The most commonly used model is the so-called Bayesian network, which provides an intuitive and efficient framework to assess (causal) dependencies. One of the major drawbacks of these models is that they can only explicitly represent symmetric dependencies, which, in practice, may not give a complete description of the data dependence structure. Staged trees are a flexible class of graphical models which can explicitly represent and model a wide array of non-symmetric dependencies.
In this talk, I will provide an overview of this model class and its application to a wide array of datasets. I will also discuss a number of ongoing developments for staged trees, including efficient structural learning, causal discovery, manipulation of the graphs, and the new stagedtrees R package; a minimal usage sketch follows below.
The talk is based on joint work with Gherardo Varando (University of Valencia), Federico Carli and Eva Riccomagno (University of Genova).
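A minimal usage sketch of the stagedtrees package (function names follow its documented interface as best recalled; treat them as indicative rather than verified):

    # Sketch: fit and refine a staged tree on a categorical data frame `dat`
    # (`dat` is hypothetical; all columns are assumed to be factors).
    library(stagedtrees)

    mod_full <- full(dat)          # saturated staged event tree (one stage per situation)
    mod <- stages_bhc(mod_full)    # backward hill-climbing search over stage structures
    summary(mod)                   # inspect the learned (non-symmetric) dependencies
    plot(mod)                      # staged tree with stages coloured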


20/04/2021 at 14:00
David Firth (University of Warwick)
Title: Schedule-adjusted league tables during the football season

In this talk I will show how to construct a better football league table than the official ranking based on accumulated points to date.  The aim of this work is (only) to produce a more informative representation of how teams currently stand, based on their match results to date in the current season; it is emphatically not about prediction.  A more informative league table is one that takes proper account of "schedule strength" differences, i.e., differing numbers of matches played by each team (home and away), and differing average standings of the opponents that each team has faced.

This work extends previous "retrodictive" use of Bradley-Terry models and their generalizations, specifically to handle 3 points for a win, and also to incorporate home/away effects coherently without assuming homogeneity across teams.  Playing records that are 100% or 0%, which can be problematic in standard Bradley-Terry approaches, are incorporated in a simple way without the need for a regularizing penalty on the likelihood. A maximum-entropy argument shows how the method developed here is the mathematically "best" way to account for schedule strength in a football league table.
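For reference, the basic Bradley-Terry model that these generalizations start from assigns each team $i$ a strength $\pi_i > 0$ and sets

$$ P(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j}, $$

with the extensions in the talk covering the 3-points-for-a-win scoring and team-specific home/away effects.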

Illustrations will be from the English Premier League, and the Italian Serie A.


13/04/2021 at 14:00
Alan Agresti (University of Florida)
Title: Simple Ways to Interpret Effects in Modeling Binary and Ordinal Data

Probability-based effect measures for models for binary and ordinal response variables can be simpler to interpret than logistic (and probit) regression model parameters and their corresponding effect measures, such as odds ratios. For describing the effect of an explanatory variable while adjusting for others in modeling a binary response, it is sometimes possible to employ the identity and log link functions to generate simple effect measures. When such link functions are inappropriate, one can still construct analogous effect measures from a logistic regression model fit, based on average differences or ratios of the probability modeled, or on average instantaneous rates of change for the probability. Simple measures are also proposed for interpreting effects in models for ordinal responses based on applying a link function to cumulative probabilities. The measures are also sometimes applicable with nonlinear predictors, such as in generalized additive models. The methods are illustrated with examples and implemented with R software.
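As a hedged illustration of one such measure (an average adjusted difference in probabilities computed from a logistic fit; the data frame and variable names below are hypothetical):

    # Average difference in P(y = 1) comparing x = 1 vs x = 0,
    # adjusting for z by averaging over its observed distribution in `dat`.
    fit <- glm(y ~ x + z, family = binomial, data = dat)
    p1 <- predict(fit, newdata = transform(dat, x = 1), type = "response")
    p0 <- predict(fit, newdata = transform(dat, x = 0), type = "response")
    mean(p1 - p0)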

Parts of this work are joint with Maria Kateri, Claudia Tarantola, and Roberta Varriale.


7/04/2021 at 15:00
Stefano Castruccio (University of Notre Dame)
Title: Model- and Data-Driven Approximation of Space-Time Systems. A Tale of Two Approaches

In this talk I will discuss two different approaches to approximating space-time systems. The first one is model-driven and loosely inspired by physics: it assumes that the system is locally diffusive, as expressed by a stochastic partial differential equation, and can be efficiently approximated with a Gaussian Markov random field. This approximation will be used to produce a stochastic generator of simulated multi-decadal global temperature, thereby offering a fast alternative to the generation of large climate model ensembles.
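The model-driven approximation rests on the standard link (notation assumed) between the stochastic partial differential equation

$$ \left( \kappa^2 - \Delta \right)^{\alpha/2} x(s) = \mathcal{W}(s), $$

with $\mathcal{W}$ Gaussian white noise, and Gaussian random fields of the Matérn class, whose solutions admit sparse Gaussian Markov random field representations that make simulation of large ensembles cheap.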
The second approach is instead data-driven and relies on (deep) neural networks in time. Instead of traditional machine learning methods aimed at inferring an extremely large parameter space, we rely on a fast, sparse and computationally efficient echo state network acting on an appropriately dimensionally reduced spatial field. The computational time saved is then used to produce an ensemble and probabilistically calibrate the forecast. The approach will be used to produce air pollution forecasts from a citizen science network in San Francisco and to forecast wind energy in Saudi Arabia.


23/03/2021 at 14:00
Giada Adelfio (Università degli Studi di Palermo)
Title: Some properties of local weighted second-order statistics for spatio-temporal point processes

Spatial, temporal, and spatio-temporal point processes, and in particular Poisson processes, are stochastic processes that are widely used to describe and model the distribution of a wealth of real phenomena.
When a model is fitted to a set of random points, observed in a given multidimensional space, diagnostic measures are necessary to assess the goodness-of-fit and to evaluate the ability of that model to describe the random point pattern behaviour. The main problem when dealing with residual analysis for point processes is to find a correct definition of residuals. Diagnostics of goodness-of-fit in the theory of point processes are often considered through the transformation of data into residuals as a result of a thinning or a rescaling procedure. We alternatively consider here second-order statistics coming from weighted measures. Motivated by Adelfio and Schoenberg (2010) for the spatial case, we consider here an extension to the spatio-temporal context in addition to focussing on local characteristics.
Then, rather than using global characteristics, we introduce local tools, considering the individual contributions of a global estimator as a measure of clustering. Generally, the individual contributions to a global statistic can be used to identify outlying components, measuring the influence of each contribution on the global statistic.
In particular, our proposed method assesses goodness-of-fit of spatio-temporal models by using local weighted second-order statistics, computed after weighting the contribution of each observed point by the inverse of the conditional intensity function that identifies the process.
Weighted second-order statistics apply directly to the data without assuming homogeneity or transforming the data into residuals, thus eliminating the sampling variability due to the use of a transformation procedure. We provide some characterisations and present a number of simulation studies.
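As an example of the statistics in question (a sketch with assumed notation and normalization), a weighted spatio-temporal $K$-function takes the form

$$ \hat K_w(r, t) = \frac{1}{|W \times T|} \sum_{i} \sum_{j \ne i} \frac{ \mathbb{1}\left\{ \| u_i - u_j \| \le r,\ |t_i - t_j| \le t \right\} }{ \hat\lambda(u_i, t_i)\, \hat\lambda(u_j, t_j) }, $$

so weighting each pair by the inverse of the fitted conditional intensity removes inhomogeneity directly, without transforming the points into residuals; local versions retain the individual contribution of each point $i$.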

Last update: 17/04/2024