Past Talks
Gaël varoquaux

Embeddings to learn on messy relational data
07 Mar 2023 At Inria  Salle Gilles Kahn
07 Mar 2023 At Inria  Salle Gilles Kahn
Statistical learning builds upon the regularities of the data at hand, leveraging similarities across observations or smoothness in the underlying process. But these regularities, similarities, or smoothness are difficult to capture in relational data. Observations come with attributes of different nature: age, size, address. The observations themselves may be of different nature, reflecting different granularity of information, for instance studying...
Statistical learning builds upon the regularities of the data at hand, leveraging similarities across observations or smoothness in the underlying process. But these regularities, similarities, or smoothness are difficult to capture in relational data. Observations come with attributes of different nature: age, size, address. The observations themselves may be of different nature, reflecting different granularity of information, for instance studying the housing market housing market may require assembling information across sales, properties, buyers, and various administrative divisions of cities and states.
Given such complex relational data, the standard practice is to manual transform it to a vector space, with much manual labor to make the data as regular as possible: SQL joins and aggregates across tables, entity normalization (fixing typos), imputation of the missing values. I will present progress on rethinking the datascience process to avoid these manual operations. Using flexible learners rather than parametric models removes the need for fancy imputation [1]. Machinelearning at the character level removes the need for normalizing entities, though the analytical question must be reformulated with a nonparametric model [2,3]. And finally, the information of a full database, with objects of different nature and varying attributes, can cast in a vector space that captures this information by expressing the relational model as a graph and adapting knowledgegraph embedding techniques [4]. As a result we provide vectors summarizing the full numerical and relational information in wikipedia for millions of entities: cities, people, compagnies, books: https://sodainria.github.io/ken_embeddings/
[1] Marine Le Morvan, Julie Josse, Erwan Scornet, & Gaël Varoquaux, (2021). What’s a good imputation to predict with missing values?. Advances in Neural Information Processing Systems, 34, 1153011540.
[2] Patricio Cerda, and Gaël Varoquaux. Encoding highcardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering (2020).
[3] Alexis CvetkovIliev, Alexandre Allauzen, and Gaël Varoquaux. Analytics on NonNormalized Data Sources: more Learning, rather than more Cleaning. IEEE Access 10 (2022): 4242042431.
[4] Alexis CvetkovIliev, Alexandre Allauzen, and Gaël Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning (2023): 134.
Given such complex relational data, the standard practice is to manual transform it to a vector space, with much manual labor to make the data as regular as possible: SQL joins and aggregates across tables, entity normalization (fixing typos), imputation of the missing values. I will present progress on rethinking the datascience process to avoid these manual operations. Using flexible learners rather than parametric models removes the need for fancy imputation [1]. Machinelearning at the character level removes the need for normalizing entities, though the analytical question must be reformulated with a nonparametric model [2,3]. And finally, the information of a full database, with objects of different nature and varying attributes, can cast in a vector space that captures this information by expressing the relational model as a graph and adapting knowledgegraph embedding techniques [4]. As a result we provide vectors summarizing the full numerical and relational information in wikipedia for millions of entities: cities, people, compagnies, books: https://sodainria.github.io/ken_embeddings/
[1] Marine Le Morvan, Julie Josse, Erwan Scornet, & Gaël Varoquaux, (2021). What’s a good imputation to predict with missing values?. Advances in Neural Information Processing Systems, 34, 1153011540.
[2] Patricio Cerda, and Gaël Varoquaux. Encoding highcardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering (2020).
[3] Alexis CvetkovIliev, Alexandre Allauzen, and Gaël Varoquaux. Analytics on NonNormalized Data Sources: more Learning, rather than more Cleaning. IEEE Access 10 (2022): 4242042431.
[4] Alexis CvetkovIliev, Alexandre Allauzen, and Gaël Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning (2023): 134.
Marylou Gabrié

Opportunities and Challenges in Enhancing Sampling with Learning
07 Mar 2023 At Inria  Salle Gilles Kahn
07 Mar 2023 At Inria  Salle Gilles Kahn
Deep generative models parametrize very flexible families of distributions able to fit complicated datasets of images or text. These models provide independent samples from complex highdistributions at negligible costs. On the other hand, sampling exactly a target distribution, such a Bayesian posterior, is typically challenging: either because of dimensionality, multimodality, illconditioning or a combination of the previous. In this talk, I will review recent...
Deep generative models parametrize very flexible families of distributions able to fit complicated datasets of images or text. These models provide independent samples from complex highdistributions at negligible costs. On the other hand, sampling exactly a target distribution, such a Bayesian posterior, is typically challenging: either because of dimensionality, multimodality, illconditioning or a combination of the previous. In this talk, I will review recent works trying to enhance traditional inference and sampling algorithms with learning. I will present in particular flowMC, an adaptive MCMC with Normalizing Flows along with first applications and remaining challenges.
Nabil Mustapha

Hitting Convex Sets: Combinatorics, Geometry and Sampling
07 Feb 2023 At ENSAE  Salle 1002
07 Feb 2023 At ENSAE  Salle 1002
Let P be a finite set of points in R^d, and epsilon > 0 a given parameter. Then our goal is to compute a set Q in R^d, of small size, such that *any* convex set containing an epsilonth fraction of the points of P contains at least one point of Q. This problem, proposed more than 40 years ago, remains wide open, with a large gap between the bestknown lower and upper bounds on the size of Q. In this talk, I will explain some of the approaches towards resolving this question, which ...
Let P be a finite set of points in R^d, and epsilon > 0 a given parameter. Then our goal is to compute a set Q in R^d, of small size, such that *any* convex set containing an epsilonth fraction of the points of P contains at least one point of Q.
This problem, proposed more than 40 years ago, remains wide open, with a large gap between the bestknown lower and upper bounds on the size of Q. In this talk, I will explain some of the approaches towards resolving this question, which use a variety of techniques from Ramsey theory, combinatorial geometry, theory of convex polytopes and datadepth. Joint work with Saurabh Ray and Imre Barany.
This problem, proposed more than 40 years ago, remains wide open, with a large gap between the bestknown lower and upper bounds on the size of Q. In this talk, I will explain some of the approaches towards resolving this question, which use a variety of techniques from Ramsey theory, combinatorial geometry, theory of convex polytopes and datadepth. Joint work with Saurabh Ray and Imre Barany.
Johannes Lutzeyer

Advances in Graph Representation Learning: The Graph Ordering Attention Networks
07 Feb 2023 At ENSAE  Salle 1002
07 Feb 2023 At ENSAE  Salle 1002
The research and development of Graph Neural Networks (GNNs) is currently one of the fastestgrowing topics in the area of Artificial Intelligence. GNNs have celebrated many academic and industrial successes in the past years; providing a rich ground for theoretical analysis and achieving stateoftheart results in several learning tasks. In this talk, I will give an accessible introduction to the area of Graph Representation Learning with a focus on GNNs. I will then discuss some...
The research and development of Graph Neural Networks (GNNs) is currently one of the fastestgrowing topics in the area of Artificial Intelligence. GNNs have celebrated many academic and industrial successes in the past years; providing a rich ground for theoretical analysis and achieving stateoftheart results in several learning tasks. In this talk, I will give an accessible introduction to the area of Graph Representation Learning with a focus on GNNs. I will then discuss some of our recent work in the area, in particular our recent contribution the Graph Ordering Attention (GOAT) Networks. Most standard GNN layers are unable to capture statistical interaction effects of nodes in a neighbourhood. Our GOAT networks address this shortcoming and capture all possible interactions in a given neighbourhood. This allows us to demonstrably improve GNN performance in both synthetic tasks and on realworld data. I will end the talk by providing an overview of our other recent contributions to the Graph Representation Learning literature.
Marco Cuturi

OT at scale: Careful Initialization, Lowrank considerations and Neural Networks
06 Dec 2022 At Inria  Amphi Sophie Germain
06 Dec 2022 At Inria  Amphi Sophie Germain
I will present in this talk a series of efforts targeted at increasing the scalability and applicability of OT computations. I will present efforts on two fronts: In the first part, I will discuss speeding up the discrete resolution of the Kantorovich problem, using either the Sinkhorn approach, and, in that case, focusing on efficient heuristics to initialize Sinkhorn potentials, or, alternatively, by parameterizing OT couplings as a product of lowrank nonnegative matrices. In the second...
I will present in this talk a series of efforts targeted at increasing the scalability and applicability of OT computations. I will present efforts on two fronts: In the first part, I will discuss speeding up the discrete resolution of the Kantorovich problem, using either the Sinkhorn approach, and, in that case, focusing on efficient heuristics to initialize Sinkhorn potentials, or, alternatively, by parameterizing OT couplings as a product of lowrank nonnegative matrices. In the second part I will explain how a parameterization, in the 2Wasserstein setting, of dual potentials as inputconvex neural networks has opened several research avenues, and demonstrate this by illustrating a recent application to a multitask, joint estimation of several Monge maps linked by a common set of parameters.
Alexandre Perez

Beyond calibration: estimating the grouping loss of modern neural networks
06 Dec 2022 At Inria  Amphi Sophie Germain
06 Dec 2022 At Inria  Amphi Sophie Germain
Good decision making requires machinelearning models to provide trustworthy confidence scores. To this end, recent work has focused on miscalibration, i.e, the over or under confidence of model scores. Yet, contrary to widespread belief, calibration is not enough: even a classifier with the best possible accuracy and perfect calibration can have confidence scores far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores ..
Good decision making requires machinelearning models to provide trustworthy confidence scores. To this end, recent work has focused on miscalibration, i.e, the over or under confidence of model scores. Yet, contrary to widespread belief, calibration is not enough: even a classifier with the best possible accuracy and perfect calibration can have confidence scores far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We use it to study modern neural network architectures in vision and NLP. We find that the grouping loss varies markedly across architectures, and that it is a key modelcomparison factor across the most accurate, calibrated, models. We also show that distribution shifts lead to high grouping loss.
Lukas Steinberger

Statistical Efficiency in Local Differential Privacy
08 Nov 2022 At ENSAE  Salle 1004
08 Nov 2022 At ENSAE  Salle 1004
In light of big data and powerful machine learning algorithms that can quickly search and analyze large amounts of data, issues of data privacy protection are becoming more pressing than ever before. An increasingly popular approach towards data privacy protection, which is robust against any kind of adversary, is the notion of `differential privacy’. Although nowadays differential privacy receives a lot of attention also from within the statistics community, still even some of the most basic...
In light of big data and powerful machine learning algorithms that can quickly search and analyze large amounts of data, issues of data privacy protection are becoming more pressing than ever before. An increasingly popular approach towards data privacy protection, which is robust against any kind of adversary, is the notion of `differential privacy’. Although nowadays differential privacy receives a lot of attention also from within the statistics community, still even some of the most basic questions appear to be open.
In this talk we begin with a very general introduction to privacy aware data analysis via differential privacy. We then turn to the local paradigm of differential privacy and focus on the classical statistical problem of efficient estimation in regular parametric models. We highlight some of the intricacies that appear when the classical theory is adapted to procedures that guarantee local differential privacy and propose a general solution for the single parameter case. At the core of the new theory lies a challenging optimization problem for which efficient algorithms are sorely needed. Most of this talk is work in progress.
In this talk we begin with a very general introduction to privacy aware data analysis via differential privacy. We then turn to the local paradigm of differential privacy and focus on the classical statistical problem of efficient estimation in regular parametric models. We highlight some of the intricacies that appear when the classical theory is adapted to procedures that guarantee local differential privacy and propose a general solution for the single parameter case. At the core of the new theory lies a challenging optimization problem for which efficient algorithms are sorely needed. Most of this talk is work in progress.
When a decision, such as the approval or denial of bank loan, is delegated to a computer, an explanation of that decision ought to be given with it. This ethical need to explain the decisions leads to the search for a formal definition of the notion of explanation. This question meets older questions in logic regarding the explanatory nature of proof.
When a decision, such as the approval or denial of bank loan, is delegated to a computer, an explanation of that decision ought to be given with it. This ethical need to explain the decisions leads to the search for a formal definition of the notion of explanation. This question meets older questions in logic regarding the explanatory nature of proof.
Valentin de Bortoli

An introduction to ScoreBased Generative Modeling and Schrodinger Bridges
05 Jul 2022 At ENSAE  Salle 1004
05 Jul 2022 At ENSAE  Salle 1004
ScoreBased Generative Modeling (SGM) is a recently developed approach to probabilistic generative modeling that exhibits stateoftheart performance on several audio and image synthesis tasks. In this talk, we will introduce the basics of SGM and present some connections with stochastic control through the concept of Schrodinger bridges.
ScoreBased Generative Modeling (SGM) is a recently developed approach to probabilistic generative modeling that exhibits stateoftheart performance on several audio and image synthesis tasks. Existing SGMs generally consist of two parts. Firstly, noise is incrementally added to the data in order to obtain a perturbed data distribution approximating an easytosample prior distribution e.g. Gaussian. Secondly, a neural network is used to learn the reversetime denoising dynamics, which when initialized at this prior distribution, defines a generative model. Song et al. (2021) have shown that one could fruitfully view the noising process as a Stochastic Differential Equation (SDE) that progressively perturbs the initial data distribution into an approximately Gaussian one. In this talk, we will introduce the basics of SGM and present some connections with stochastic control through the concept of Schrodinger bridges.
YANN ISSARTEL

Seriation and 1Dlocalization in latent space models.
05 Jul 2022 At ENSAE  Salle 1004
05 Jul 2022 At ENSAE  Salle 1004
Motivated by applications in archeology for relative dating of objects, or in 2Dtomography for angular synchronization, we consider the problem of statistical seriation where one seeks to reorder a noisy disordered matrix of pairwise affinities. This reformulation naturally leads to the problem of estimating the positions in the latent space. Under nonparametric assumptions, we introduce a procedure that provably localizes the latent positions with minimax optimal rate.
Motivated by applications in archeology for relative dating of objects, or in 2Dtomography for angular synchronization, we consider the problem of statistical seriation where one seeks to reorder a noisy disordered matrix of pairwise affinities. This problem can be recast in the powerful latent space terminology, where the affinity between a pair of items is modeled as a noisy observation of a function f(x_i,x_j) of the latent positions x_i, x_j of the two items on a onedimensional space. This reformulation naturally leads to the problem of estimating the positions in the latent space. Under nonparametric assumptions on the affinity function f, we introduce a procedure that provably localizes all the latent positions with a maximum error of the order of the square root of log(n)/n. This rate is proven to be minimax optimal. Computationally efficient procedures are also analyzed, under some more restrictive assumptions. Our general results can be instantiated to the original problem of statistical seriation, leading to new bounds for the maximum error in the ordering.
Matthieu Jonckheere

New exploration principles for sparse and rare rewards
07 Jun 2022 At CentralSupelec  Batiment Eiffel  Amphi IV
07 Jun 2022 At CentralSupelec  Batiment Eiffel  Amphi IV
We consider reinforcement learning control problems under the average reward criterion in which nonzero rewards are both sparse and rare, that is, they occur in very few states and the steadystate probability of observing them is extremely small. For such problems, we propose a new approach to exploit prior knowledge on the sparse structure of the environment. The core idea is to use Renewal Theory and FlemingViot particle systems to construct estimators of quantities relevant to ...
We consider reinforcement learning control problems under the average reward criterion in which nonzero rewards are both sparse and rare, that is, they occur in very few states and the steadystate probability of observing them is extremely small. For such problems, we propose a new approach to exploit prior knowledge on the sparse structure of the environment. The core idea is to use Renewal Theory and FlemingViot particle systems to construct estimators of quantities relevant to reinforcement learning, for which we provide some theoretical guarantees for the estimation error.
Joint work with U. Ayesta, S. Majewski, D. Mastropietro,
Joint work with U. Ayesta, S. Majewski, D. Mastropietro,
Ismail Ben Ayed

Learning with limited supervision
07 Jun 2022 At CentralSupelec  Batiment Eiffel  Amphi IV
07 Jun 2022 At CentralSupelec  Batiment Eiffel  Amphi IV
Despite their unprecedented performances when trained on largescale labeled data, deep learning models are seriously challenged when dealing with novel (unseen) classes and limited labeled instances. This generalization challenge occurs in a breadth of real scenarios and applications. In contrast, humans can learn new tasks easily from a handful of examples, by leveraging prior experience and context. Fewshot learning attempts to bridge this gap, and has recently triggered substantial efforts.
Despite their unprecedented performances when trained on largescale labeled data, deep learning models are seriously challenged when dealing with novel (unseen) classes and limited labeled instances. This generalization challenge occurs in a breadth of real scenarios and applications. In contrast, humans can learn new tasks easily from a handful of examples, by leveraging prior experience and context. Fewshot learning attempts to bridge this gap, and has recently triggered substantial research efforts. This talk discusses recent developments in the general, wideinterest subject of learning with very limited supervision. Specifically, I will discuss stateoftheart models, which leverage unlabeled data with priors, and point to current challenges and interesting future research avenues.
I consider the deconvolution problem in the case where no information is known about the noise distribution. More precisely, no assumption is made on the noise distribution and no samples are available to estimate it: the deconvolution problem is solved based only on observations of the corrupted signal. I will present an identifiability result, and its use for statistical inference in various applications.
I consider the deconvolution problem in the case where no information is known about the noise distribution. More precisely, no assumption is made on the noise distribution and no samples are available to estimate it: the deconvolution problem is solved based only on observations of the corrupted signal. I will present an identifiability result, and its use for statistical inference in various applications.
The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes.
(joint work with B. Cadre, N. Klutchnikoff, A. Stéphanovitch, and U. Tanielian)
(joint work with B. Cadre, N. Klutchnikoff, A. Stéphanovitch, and U. Tanielian)
The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semidiscrete setting.
(joint work with B. Cadre, N. Klutchnikoff, A. Stéphanovitch, and U. Tanielian)
(joint work with B. Cadre, N. Klutchnikoff, A. Stéphanovitch, and U. Tanielian)
Arnak Dalalyan

Robust Estimation of the Gaussian Mean by iterative reweighting
01 Feb 2022 At Inria Saclay  Amphi Sophie Germain
01 Feb 2022 At Inria Saclay  Amphi Sophie Germain
The goal of this talk is to show that a single robust estimator of the mean of a multivariate Gaussian distribution can enjoy five desirable properties.
The goal of this talk is to show that a single robust estimator of the mean of a multivariate Gaussian distribution can enjoy five desirable properties. First, it is computationally tractable in the sense that it can be computed in a time which is at most polynomial in dimension, sample size and the logarithm of the inverse of the contamination rate. Second, it is equivariant by scaling, translations and orthogonal transformations. Third, it has a nearlyminimaxratebreakdown point approximately equal to 0.28. Fourth, it is minimax rate optimal when data consist of independent observations corrupted by adversarially chosen outliers. Fifth, it is asymptotically optimal when the rate of contamination tends to zero. The estimator is obtained by an iterative reweighting approach. Each sample point is assigned a weight that is iteratively updated using a convex optimization problem. We also establish a dimensionfree nonasymptotic risk bound for the expected error of the proposed estimator. It is the first of this kind results in the literature and involves only the effective rank of the covariance matrix.
(Joint work with Arshak Minasyan)
(Joint work with Arshak Minasyan)
Frederic Pascal

A new robust clustering algorithm  Application to images segmentation.
01 Feb 2022 At Inria Saclay  Amphi Sophie Germain
01 Feb 2022 At Inria Saclay  Amphi Sophie Germain
We introduce a new robust clustering algorithm that can efficiently deal with noise and outliers in diverse data sets. As an EMlike algorithm, it is based on both estimations of clusters centers and covariances.
Though very popular, it is well known that the EM algorithm suffers from nonGaussian distribution shapes, outliers and highdimensionality. In this talk, we introduce a new robust clustering algorithm that can efficiently deal with noise and outliers in diverse data sets. As an EMlike algorithm, it is based on both estimations of clusters centers and covariances. In addition, using a semiparametric paradigm, the method estimates an unknown scale parameter per datapoint. This allows the algorithm to accommodate for heavier tails distributions and outliers without significantly loosing efficiency in various classical scenarios. Then, we show that the proposed algorithm outperforms other classical unsupervised methods of the literature such as kmeans, the EM algorithm and its recent modications or spectral clustering when applied to real data sets as MNIST, NORB and 20newsgroups. An application to Radar (SAR) image segmentation is proposed, highlighting the interest of the proposed approach.
Frederic Barbaresco

Statistique et apprentissage sur les groupes de Lie : géométrie de l’Information de JeanLouis Koszul et modèle symplectique de la physique statistique de JeanMarie Souriau
06 Jul 2021 At BlueJean
06 Jul 2021 At BlueJean
Les modèles géométriques de l’information, de la physique statistique et de l’inférence en apprentissage machine partagent des structures communes. La géométrie de l’information utilise des gradients naturels d’apprentissage qui ont la propriété remarquable d’être invariants aux systèmes de coordonnées codant l’information par l’intermédiaire de la métrique dite de Fisher. Ces schémas intrinsèques préservent les symétries et peuvent s’étendre à des espaces plus abstraits.
Les modèles géométriques de l’information, de la physique statistique et de l’inférence en apprentissage machine partagent des structures communes comme cela a été illustré à l’Ecole de Physique des Houches de Juillet 2020 (https://franknielsen.github.io/SPIGLesHouches2020/) et dans l’ouvrage SPRINGER récent sur le sujet [1]. La géométrie de l’information [2] utilise des gradients naturels d’apprentissage qui ont la propriété remarquable d’être invariants aux systèmes de coordonnées codant l’information par l’intermédiaire de la métrique dite de Fisher. Ces schémas intrinsèques préservent les symétries et peuvent s’étendre à des espaces plus abstraits lorsque les données sont codées sur des variétés différentielles ou des variétés symplectiques (cas du codage de l’information par des groupes de Lie comme en robotique ou Guidance Navigation & Control). Dans ces espaces de codage plus structurés, la notion de métrique de Fisher s’étend à celle de métrique de KoszulSouriau, la définition de l’Entropie à celle de fonction de Casimir invariante en représentation coadjointe, en établissant les liens avec les systèmes intégrables au sens de Liouville [3,4,5,6].
Ces thématiques seront développées à la 5ème édition de la conférence SEE GSI’21 (Geometric Science of Information) en Juillet 2021 (www.gsi2021.org) à Paris Sorbonne Université, coorganisée avec SCAI Sorbonne (https://scai.sorbonneuniversite.fr/) et ELLIS Paris Unit (https://ellisparis.github.io/), et qui aura pour thème « Learning Geometric Structures ».
Références bibliographiques :
[1] Frédéric Barbaresco & Frank Nielsen, Geometric Structures of statistical Physics, Information Geometry and Learning, Actes de la Summer Week de l’Ecole de Physique des Houches SPIGL’20, SPRINGER Proceedings in Mathematics & Statistics, 2021; https://www.springer.com/jp/book/9783030779566
[2] Frédéric Barbaresco & Frank Nielsen, La Géométrie de l’Information: horssérie n°73 du magazine TANGENTE ; https://www.decitre.fr/revues/tangentehorsserien73mathematiquesetemploi9782848842387.html
[3] CharlesMichel Marle, On Gibbs states of mechanical systems with symmetries, arXiv:2012.00582v1 [math.DG], https://arxiv.org/abs/2012.00582
[4] Frédéric Barbaresco, Lie Group Statistics and Lie Group Machine Learning Based on Souriau Lie Groups Thermodynamics & KoszulSouriauFisher Metric: New Entropy Definition as Generalized Casimir Invariant Function in Coadjoint Representation. Entropy 2020, 22, 642.
[5] Frédéric Barbaresco, François GayBalmaz, Lie Group Cohomology and (Multi)Symplectic Integrators: New Geometric Tools for Lie Group Machine Learning Based on Souriau Geometric Statistical Mechanics. Entropy 2020, 22, 498.
[6] Frédéric Barbaresco, Koszul lecture related to geometric and analytic mechanics, Souriau’s Lie group thermodynamics and information geometry. SPRINGER Information Geometry, 2021
Ces thématiques seront développées à la 5ème édition de la conférence SEE GSI’21 (Geometric Science of Information) en Juillet 2021 (www.gsi2021.org) à Paris Sorbonne Université, coorganisée avec SCAI Sorbonne (https://scai.sorbonneuniversite.fr/) et ELLIS Paris Unit (https://ellisparis.github.io/), et qui aura pour thème « Learning Geometric Structures ».
Références bibliographiques :
[1] Frédéric Barbaresco & Frank Nielsen, Geometric Structures of statistical Physics, Information Geometry and Learning, Actes de la Summer Week de l’Ecole de Physique des Houches SPIGL’20, SPRINGER Proceedings in Mathematics & Statistics, 2021; https://www.springer.com/jp/book/9783030779566
[2] Frédéric Barbaresco & Frank Nielsen, La Géométrie de l’Information: horssérie n°73 du magazine TANGENTE ; https://www.decitre.fr/revues/tangentehorsserien73mathematiquesetemploi9782848842387.html
[3] CharlesMichel Marle, On Gibbs states of mechanical systems with symmetries, arXiv:2012.00582v1 [math.DG], https://arxiv.org/abs/2012.00582
[4] Frédéric Barbaresco, Lie Group Statistics and Lie Group Machine Learning Based on Souriau Lie Groups Thermodynamics & KoszulSouriauFisher Metric: New Entropy Definition as Generalized Casimir Invariant Function in Coadjoint Representation. Entropy 2020, 22, 642.
[5] Frédéric Barbaresco, François GayBalmaz, Lie Group Cohomology and (Multi)Symplectic Integrators: New Geometric Tools for Lie Group Machine Learning Based on Souriau Geometric Statistical Mechanics. Entropy 2020, 22, 498.
[6] Frédéric Barbaresco, Koszul lecture related to geometric and analytic mechanics, Souriau’s Lie group thermodynamics and information geometry. SPRINGER Information Geometry, 2021
Lénaïc Chizat

Analysis of Gradient Descent on Wide TwoLayer ReLU Neural Networks
06 Jul 2021 At BlueJean
06 Jul 2021 At BlueJean
In this talk, we propose an analysis of gradient descent on wide twolayer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow.
In this talk, we propose an analysis of gradient descent on wide twolayer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a nonconvex landscape, we show that for appropriate initializations, its limit, when it exists, is a global minimizer. We also study the implicit regularization of this algorithm when the objective is the unregularized logistic loss, which leads to a maxmargin classifier in a certain functional space. We finally discuss what these results tell us about the generalization performance, and in particular how these models compare to kernel methods. This is based on joint work with Francis Bach.
Benjamain Riu

Muddling Label Regularization: Deep Learning for Tabular Datasets
01 Jun 2021 At BlueJean
01 Jun 2021 At BlueJean
Deep Learning is considered the stateoftheart in computer vision, speech recognition and natural language processing. Until recently, it was also widely accepted that DL is irrelevant for learning tasks on tabular data, especially in the small sample regime where ensemble methods are acknowledged as the gold standard. We present a new endtoend differentiable method to train a standard FFNN.
Deep Learning is considered the stateoftheart in computer vision, speech recognition and natural language processing. Until recently, it was also widely accepted that DL is irrelevant for learning tasks on tabular data, especially in the small sample regime where ensemble methods are acknowledged as the gold standard.
We present a new endtoend differentiable method to train a standard FFNN. Our method, Muddling labels for Regularization (MLR), penalizes memorization through the generation of uninformative labels and the application of a differentiable closeform regularization scheme on the last hidden layer during training. MLR outperforms classical FFNN and the gold standard GBDT, RF for regression and classification tasks on several datasets from the UCI database and Kaggle covering a large range of sample sizes and feature to sample ratios. Researchers and practitioners can use MLR on its own as an offtheshelf DL solution or integrate it into the most advanced ML pipelines.
We present a new endtoend differentiable method to train a standard FFNN. Our method, Muddling labels for Regularization (MLR), penalizes memorization through the generation of uninformative labels and the application of a differentiable closeform regularization scheme on the last hidden layer during training. MLR outperforms classical FFNN and the gold standard GBDT, RF for regression and classification tasks on several datasets from the UCI database and Kaggle covering a large range of sample sizes and feature to sample ratios. Researchers and practitioners can use MLR on its own as an offtheshelf DL solution or integrate it into the most advanced ML pipelines.
Stefano Fortunati

Robust Semiparametric Efficient Estimators in Complex Elliptically Symmetric Distributions
01 Jun 2021 At BlueJean
01 Jun 2021 At BlueJean
In this seminar, we firstly recall the building blocks underlying the derivation of robust and efficient finitedimensional parameter vector realvalued Restimator, then its extension to complexvalued data is proposed. Moreover, through numerical simulations, its estimation performance and robustness to outliers are discussed in a finitesample regime.
Covariance matrices play a major role in statistics, signal processing and machine
learning. This seminar focuses on the semiparametric covariance/scatter matrix estimation problem
in elliptical distributions. The class of elliptical distributions can be seen as a semiparametric model
where the finitedimensional vector of interest is given by the location vector and by the
(vectorized) covariance/scatter matrix, while the density generator represents an infinite
dimensional nuisance function. The main aim of the statistical inference in elliptically distributed
data is then to provide possible estimators of the finitedimensional parameter vector able to
reconcile the two dichotomic concepts of robustness and (semiparametric) efficiency. An R
estimator satisfying these requirements has been recently proposed by Hallin, Oja and Paindaveine
for realvalued elliptical data by exploiting the Le Cam's theory of onestep efficient estimators and the rankbased statistics. In this seminar, we firstly recall the building blocks underlying the
derivation of such realvalued Restimator, then its extension to complexvalued data is proposed.
Moreover, through numerical simulations, its estimation performance and robustness to outliers are
discussed in a finitesample regime.
learning. This seminar focuses on the semiparametric covariance/scatter matrix estimation problem
in elliptical distributions. The class of elliptical distributions can be seen as a semiparametric model
where the finitedimensional vector of interest is given by the location vector and by the
(vectorized) covariance/scatter matrix, while the density generator represents an infinite
dimensional nuisance function. The main aim of the statistical inference in elliptically distributed
data is then to provide possible estimators of the finitedimensional parameter vector able to
reconcile the two dichotomic concepts of robustness and (semiparametric) efficiency. An R
estimator satisfying these requirements has been recently proposed by Hallin, Oja and Paindaveine
for realvalued elliptical data by exploiting the Le Cam's theory of onestep efficient estimators and the rankbased statistics. In this seminar, we firstly recall the building blocks underlying the
derivation of such realvalued Restimator, then its extension to complexvalued data is proposed.
Moreover, through numerical simulations, its estimation performance and robustness to outliers are
discussed in a finitesample regime.
Timothée Matthieu

Robust Machine Learnings with Median of Means and Mestimators
04 May 2021 At BlueJean
04 May 2021 At BlueJean
Most machine learning algorithms are not robust as their predictive power can be highly disturbed by a small amount of abnormal data. In this talk I will show how to use robust estimators of the mean to construct a robust risk minimization framework for machine learning tasks, in particular for classification and regression.
Most machine learning algorithms are not robust as their predictive power can be highly disturbed by a small amount of abnormal data. In this talk I will show how to use robust estimators of the mean to construct a robust risk minimization framework for machine learning tasks, in particular for classification and regression. I will first present some finite sample results (i.e. concentration inequalities) on Mestimators and Median of Means when estimating the mean, then I will use these estimators to construct robust Machine Learning algorithms and finally I will show illustrations of these algorithms in several practical applications.
Solenne Gaucher

Continuumarmed bandits : from the classical setting to the finite setting
06 Apr 2021 At BlueJean
06 Apr 2021 At BlueJean
In this talk, we start by introducing the continuumarmed bandit, and present classical algorithms and results. In a second time, we focus on the finite continuumarmed bandit setting, where the set of actions is finite, and each action can only be taken once.
Bandits are used to model the following sequential decision problem : at each time step, an agent takes an action and receives a reward drawn i.i.d. from a distribution depending on the action she has selected. Her aim is to maximize her cumulative reward. In this talk, we start by introducing the continuumarmed bandit, and present classical algorithms and results. In this setting, actions are indexed by covariates in a continuous set. The expected reward for taking an action is modeled as an (unknown) function of the covariate describing this action. In a second time, we focus on the finite continuumarmed bandit setting, where the set of actions is finite, and each action can only be taken once.
Behavioral neurosciences are about to undergo a major shift due to a) the dissemination of wearable and ambient sensors in routine clinical assessment or in complex HumanMachine interaction, b) the progressive abandonment of animal experimentation. This raises several issues in terms of observational data, but also questions how Machine Learning techniques may contribute to the datadriven quantification and prediction of Human behavior. In this talk, I will introduce the topic of Machine Learn
Behavioral neurosciences are about to undergo a major shift due to a) the dissemination of wearable and ambient sensors in routine clinical assessment or in complex HumanMachine interaction, b) the progressive abandonment of animal experimentation. This raises several issues in terms of observational data, but also questions how Machine Learning techniques may contribute to the datadriven quantification and prediction of Human behavior. In this talk, I will introduce the topic of Machine Learning applications to behavioral neurosciences. I will also give an overview of recent works and questions for the future of Machine Learning research in this field.
Florent Bouchard

Riemannian geometry for data analysis: illustration on blind source separation and lowrank structured covariance matrices
02 Mar 2021 At BlueJean
02 Mar 2021 At BlueJean
In this presentation, Riemannian geometry for data analysis is introduced. In particular, it is applied on two specific statistical signal processing problems: blind source separation and lowrank structured covariance matrices.
In this presentation, Riemannian geometry for data analysis is introduced. In particular, it is applied on two specific statistical signal processing problems: blind source separation and lowrank structured covariance matrices. Blind source separation can be solved by jointly diagonalizing some covariance matrices. We show here how geometry can be exploited in order to provide original joint diagonalization criteria and ways to compare them theoretically. These results are illustrated with numerical experiments both on simulated and real data. Concerning lowrank structured covariance matrices, an intrinsic CramérRao bound of the corresponding estimation problem is presented, illustrating the interest of geometry for performance analysis. These results are validated with simulations.
We consider the problem of randomdesign linear regression, in a distributionfree setting where no assumption is made on the distribution of the predictive/input variables. After surveying existing approaches and indicating some improvements, we explain why they fall short in our setting. We then identify the minimal assumption on the target/output under which guarantees are possible, and describe a nonlinear prediction procedure achieving the optimal error bound with high probability.
We consider the problem of randomdesign linear regression, in a distributionfree setting where no assumption is made on the distribution of the predictive/input variables. After surveying existing approaches and indicating some improvements, we explain why they fall short in our setting. We then identify the minimal assumption on the target/output under which guarantees are possible, and describe a nonlinear prediction procedure achieving the optimal error bound with high probability.
Emilie chouzenoux

Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. Application to Computed Tomography.
02 Feb 2021 At BlueJean
02 Feb 2021 At BlueJean
In this talk, we focus on the proximal gradient algorithm stability properties when such an adjoint mismatch arises. By making use of tools from convex analysis and fixed point theory, we establish conditions under which the algorithm can still converge to a fixed point. We provide bounds on the error between this point and the solution to the minimization problem. We illustrate the applicability of our theoretical results through numerical examples in the context of computed tomography.
The proximal gradient algorithm is a popular iterative algorithm to deal with penalized leastsquares minimization problems. Its simplicity and versatility allow one to embed nonsmooth penalties efficiently. In the context of inverse problems arising in signal and image processing, a major concern lies in the computational burden when implementing minimization algorithms. For instance, in tomographic image reconstruction, a bottleneck is the cost for applying the forward linear operator and its adjoint. Consequently, it often happens that these operators are approximated numerically, so that the adjoint property is no longer fulfilled.
In this talk, we focus on the proximal gradient algorithm stability properties when such an adjoint mismatch arises. By making use of tools from convex analysis and fixed point theory, we establish conditions under which the algorithm can still converge to a fixed point. We provide bounds on the error between this point and the solution to the minimization problem. We illustrate the applicability of our theoretical results through numerical examples in the context of computed tomography.
This is joint work with M. Savanier, J.C. Pesquet, C. Riddell and Y. Trousset
References :
[1] E. Chouzenoux, J.C. Pesquet, C. Riddell, M. Savanier and Y. Trousset. Convergence of Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. To appear in Inverse Problems, 2020. http://www.optimizationonline.org/DB_HTML/2020/10/8055.html
[2] M. Savanier, E. Chouzenoux, J.C. Pesquet, C. Riddell and Y. Trousset. Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. In Proceedings of the 28th European Signal Processing Conference (EUSIPCO 2020), January 1822 2021
In this talk, we focus on the proximal gradient algorithm stability properties when such an adjoint mismatch arises. By making use of tools from convex analysis and fixed point theory, we establish conditions under which the algorithm can still converge to a fixed point. We provide bounds on the error between this point and the solution to the minimization problem. We illustrate the applicability of our theoretical results through numerical examples in the context of computed tomography.
This is joint work with M. Savanier, J.C. Pesquet, C. Riddell and Y. Trousset
References :
[1] E. Chouzenoux, J.C. Pesquet, C. Riddell, M. Savanier and Y. Trousset. Convergence of Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. To appear in Inverse Problems, 2020. http://www.optimizationonline.org/DB_HTML/2020/10/8055.html
[2] M. Savanier, E. Chouzenoux, J.C. Pesquet, C. Riddell and Y. Trousset. Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. In Proceedings of the 28th European Signal Processing Conference (EUSIPCO 2020), January 1822 2021
Zaccharie Ramzi

XPDNet for MRI Reconstruction: an Application to the fastMRI 2020 Brain Challenge
02 Feb 2021 At BlueJean
02 Feb 2021 At BlueJean
In this talk we will see how Deep Learning enables us to: (i) learn an optimal optimization scheme, (ii) learn a prior from the data and (iii) learn how to refine our knowledge of the measurements operator. We show the results of this approach on the fastMRI 2020 brain reconstruction challenge where we secured the 2nd spot in both the 4x and 8x acceleration tracks.
In classical Magnetic Resonance Imaging reconstruction, slow iterative nonlinear algorithms using manually crafted priors are applied to obtain the anatomical image from undersampled Fourier measurements. In addition they have to deal with an incomplete knowledge of the exact measurement operator.
Deep Learning methods, and in particular, unrolled networks, have allowed to alleviate those issues. In this talk we will see how Deep Learning enables us to:
Deep Learning methods, and in particular, unrolled networks, have allowed to alleviate those issues. In this talk we will see how Deep Learning enables us to:
 Learn an optimal optimization scheme,
 Learn a prior from the data
 Learn how to refine our knowledge of the measurements operator.
Vincent Divol

Reconstructing measures on manifolds: an optimal transport approach
05 Jan 2021 At BlueJean
05 Jan 2021 At BlueJean
The problem of density estimation is known to become untractable in high dimension D >> 1. We propose to overcome this issue by assuming that the distribution mu is actually supported around a low dimensional unknown shape M, of dimension d << D. After showing that this problem is degenerate for a large class of standard losses, we focus on the Wasserstein loss.
Density estimation is one of the most classical problem in nonparametric statistics: given i.i.d. samples from a distribution mu on R^D, the goal is to reconstruct the underlying density (say for instance for the L_p norm). This problem is known to become untractable in high dimension D >> 1. We propose to overcome this issue by assuming that the distribution mu is actually supported around a low dimensional unknown shape M, of dimension d << D. After showing that this problem is degenerate for a large class of standard losses (L_p, total variation, etc.), we focus on the Wasserstein loss, for which we build a minimax estimator, based on kernel density estimation, whose rate of convergence depends on d, and on the regularity of the underlying density, but not on the ambient dimension D.
Alain Durmus

Nonreversible MCMC: a complete recipe with convergence guarantees
05 Jan 2021 At BlueJean
05 Jan 2021 At BlueJean
In this talk, we will introduce develop general tools to ensure that a class of nonreversible Markov kernels, possibly relying on complex transforms, has the desired invariance property and lead to convergent algorithms. This leads to a set of simple and practically verifiable conditions.
Markov Chain Monte Carlo (MCMC) is a class of algorithms to sample complex and highdimensional probability distributions. The MetropolisHastings (MH) algorithm, the workhorse of MCMC, provides a simple recipe to construct reversible Markov kernels. Reversibility is a tractable property which implies a less tractable but essential property here, invariance. Reversibility is however not necessarily desirable when considering performance. This has prompted recent interest in designing kernels breaking this property. At the same time, an active stream of research has focused on the design of novel versions of the MH kernel, some nonreversible, relying on the use of complex invertible deterministic transforms. While standard implementations of the MH kernel are well understood, aforementioned developments have not received the same systematic treatment to ensure their validity. In this talk, we will introduce develop general tools to ensure that a class of nonreversible Markov kernels, possibly relying on complex transforms, has the desired invariance property and lead to convergent algorithms. This leads to a set of simple and practically verifiable conditions.
In this talk, we show how the BuresWasserstein distance can be used in machine learning applications, by presenting scalable algorithms for computing and differentiating the Bures metric. Then, we show that the BuresWasserstein geometry can seamlessly incorporate other methods for approximating OT.
Optimal transport (OT) has recently gained popularity in the machine learning community, both as a way to measure the discrepancy between two probability measures and as a principled method to transform a distribution into another. Yet, in most cases OT does not admit a closedform expression and either has to be evaluated through a costly optimization problem, or approximated by regularizing this problem.
Alternatively, a powerful approach consists in keeping the problem as such, and regularizing the data instead to fall back to cases that can be solved efficiently. In particular, representing the data using elliptical distributions, which are fully described by their mean vector and covariance matrix, leads to one of the very few cases of closedform expressions for OT. Indeed, for such distributions, the Wasserstein distance can be decomposed as the sum of the Euclidean distance between means and the Bures distance between covariance matrices, which defines a Riemannian metric on the set of positive semidefinite matrices.
In this talk, we show how the BuresWasserstein distance can be used in machine learning applications, by presenting scalable algorithms for computing and differentiating the Bures metric. In particular, we show that a suitable reparameterization allows to emulate Riemannian gradient descent in a projectionfree Euclidean setting. Finally, we show that the BuresWasserstein geometry can seamlessly incorporate other methods for approximating OT, such as lowdimensional projections or entropic regularization, and propose applications to probabilistic word embeddings.
Alternatively, a powerful approach consists in keeping the problem as such, and regularizing the data instead to fall back to cases that can be solved efficiently. In particular, representing the data using elliptical distributions, which are fully described by their mean vector and covariance matrix, leads to one of the very few cases of closedform expressions for OT. Indeed, for such distributions, the Wasserstein distance can be decomposed as the sum of the Euclidean distance between means and the Bures distance between covariance matrices, which defines a Riemannian metric on the set of positive semidefinite matrices.
In this talk, we show how the BuresWasserstein distance can be used in machine learning applications, by presenting scalable algorithms for computing and differentiating the Bures metric. In particular, we show that a suitable reparameterization allows to emulate Riemannian gradient descent in a projectionfree Euclidean setting. Finally, we show that the BuresWasserstein geometry can seamlessly incorporate other methods for approximating OT, such as lowdimensional projections or entropic regularization, and propose applications to probabilistic word embeddings.
Sophie Donnet

Block models for multipartite networks.Applications in ecology and ethnobiology.
01 Dec 2020 At BlueJeans
01 Dec 2020 At BlueJeans
In this contribution, we propose a stochastic block model able to handle multipartite networks, thus supplying a clustering of the individuals based on their connection behavior in more than one network. Our model is an extension of the latent block models (LBM) and stochastic block model (SBM).
Modeling relations between individuals is a classical question in social sciences, ecology, etc. In order to uncover a latent structure in the data, a popular approach consists in clustering individuals according to the observed patterns of interactions. To do so, Stochastic block models (SBM) and Latent Block models (LBM) are standard tools for clustering the individuals with respect to their comportment in a unique network. However, when adopting an integrative point of view, individuals are not involved in a unique network but are part of several networks, resulting into a potentially complex multipartite network. In this contribution, we propose a stochastic block model able to handle multipartite networks, thus supplying a clustering of the individuals based on their connection behavior in more than one network. Our model is an extension of the latent block models (LBM) and stochastic block model (SBM). The parameters  such as the marginal probabilities of assignment to blocks and the matrix of probabilities of connections between blocks  are estimated through a variational ExpectationMaximization procedure. The numbers of blocks are chosen with the Integrated Completed Likelihood criterion, a penalized likelihood criterion. The pertinence of our methodology is illustrated on two datasets issued from ecology and ethnobiology.
In this talk I will cover some recent works that aim to learn nonlinear representations of neural time series in order to estimate good predictive models.
Good representations are the building blocks of signal processing. While the Fourier representation reveal sinusoids buried in noise, wavelets better capture timelocalized transient phenomena. Processing neuroimaging data is also based on representations. Morlet wavelets or short time Fourier transform (STFT) are common for electrophysiology (M/EEG) and spherical harmonics are used for diffusion MRI signals. In this talk I will cover some recent works that aim to learn nonlinear representations of neural time series in order to estimate good predictive models. Works that I will present will cover recent works such as [1, 2, 3].
[1] Learning the Morphology of Brain Signals Using AlphaStable Convolutional Sparse Coding
Jas, M., Dupré la Tour, T., Simsekli, U. and Gramfort, A. (2017)
Advances in Neural Information Processing Systems (NIPS) 30
[2] Multivariate Convolutional Sparse Coding for Electromagnetic Brain Signals
Dupré la Tour, T., Moreau, T., Jas, M. and Gramfort, A. (2018)
Advances in Neural Information Processing Systems (NeurIPS)
[3] Selfsupervised representation learning from electroencephalography signals
Banville, H., Albuquerque, I., Moffat, G., Engemann, D. and Gramfort, A. (2019)
Proc. Machine Learning for Signal Processing (MLSP).
[1] Learning the Morphology of Brain Signals Using AlphaStable Convolutional Sparse Coding
Jas, M., Dupré la Tour, T., Simsekli, U. and Gramfort, A. (2017)
Advances in Neural Information Processing Systems (NIPS) 30
[2] Multivariate Convolutional Sparse Coding for Electromagnetic Brain Signals
Dupré la Tour, T., Moreau, T., Jas, M. and Gramfort, A. (2018)
Advances in Neural Information Processing Systems (NeurIPS)
[3] Selfsupervised representation learning from electroencephalography signals
Banville, H., Albuquerque, I., Moffat, G., Engemann, D. and Gramfort, A. (2019)
Proc. Machine Learning for Signal Processing (MLSP).
Marine Le Morvan

Neumann networks: differential programming for supervised learning with missing values
03 Nov 2020 At BlueJeans
03 Nov 2020 At BlueJeans
In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms. Based on a Neumann series approximation of the optimal predictor, we propose a new principled architecture, named Neumann networks. Their originality and strength comes from the use of a new type of nonlinearity: the multiplication by the missingness indicator.
The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and selfmasking (Missing Not At Random). Based on a Neumann series approximation of the optimal predictor, we propose a new principled architecture, named Neumann networks. Their originality and strength comes from the use of a new type of nonlinearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of Neumann networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for mediumsized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as selfmasking.
We introduce and analyse a new family of algorithms which generalizes and unifies both the mirror descent and the dual averaging algorithms.
We introduce and analyse a new family of algorithms which generalizes and unifies both the mirror descent and the dual averaging algorithms. In the framework of this family, we define a new algorithm for constrained optimization with the aim of combining the advantages of mirror descent and dual averaging. In practice, this new algorithm converges as fast as mirror descent and dual averaging, and in some situations greatly outperforms them. Besides, we demonstrate how our algorithms can also be applied to solving variational inequalities.
This is joint work with Anatoli Juditsky and Eric Moulines.
This is joint work with Anatoli Juditsky and Eric Moulines.
The goal of this talk is to introduce the audience to the problem of algorithmic fairness. I will provide a general overview on the topic, describe various available notions of fairness in classification and regression, and present main approaches to tackle this problem. I will also present some recent theoretical results both in classification and regression. This talk is based on joint works with C. Denis, M. Hebiri, L. Oneto, and M. Pontil.
The goal of this talk is to introduce the audience to the problem of algorithmic fairness. I will provide a general overview on the topic, describe various available notions of fairness in classification and regression, and present main approaches to tackle this problem. I will also present some recent theoretical results both in classification and regression. This talk is based on joint works with C. Denis, M. Hebiri, L. Oneto, and M. Pontil.
Alexei Grinbaum

Chance as a value for artificial intelligence
03 Mar 2020 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
03 Mar 2020 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
Deep learning techniques lead to fundamentally noninterpretable decisions made by the machine. Although such choices do not have an explanation, they impact the users in significant ways. If the ultimate innovator is a machine, what is the meaning of responsible conduct?
Deep learning techniques lead to fundamentally noninterpretable decisions made by the machine. Although such choices do not have an explanation, they impact the users in significant ways. If the ultimate innovator is a machine, what is the meaning of responsible conduct? I argue in a recent book that the capacity to extract an AI system from human judgment, by reducing transparency in favor of opacity, is an essential value in machine ethics. This can be achieved through the use of randomness, illustrated on several examples including the trolley dilemma. Methodologically, a comparison of common motives between technological setups and mythological narratives is used to achieve ethical insights.
Yannig Goude

Machine Learning Methods for Electricity Load Forecasting: contributions and perspectives.
03 Mar 2020 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
03 Mar 2020 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
To maintain the electricity quality, energy stakeholders are developing smart grids, the next generation power grid including advance communication networks and associated optimisation and forecasting tools. We will present recent development in the field of online learning and probabilistic forecasting done at EDF in this context.
Energy systems are facing a revolution and many challenges. On the one hand, electricity production is moving to more intermittency and complexity with the increase of renewable energy and the development of small distributed production units such as photovoltaic panels or wind farms. On the other hand, consumption is also changing with e.g. plugin (hybrid) electric vehicles, heat pumps, the development of new technologies such as smart phones, computers, storage devices. To maintain the electricity quality, energy stakeholders are developing smart grids, the next generation power grid including advance communication networks and associated optimisation and forecasting tools. Exploiting the smart grid efficiently requires advanced data analytics to improve forecasting at different geographical scale. We will present recent development in the field of online learning and probabilistic forecasting done at EDF in this context.
Rainer Dyckerhoff

Convergence of depths and central regions
11 Feb 2020 At Telecom Paris  Amphi 2
11 Feb 2020 At Telecom Paris  Amphi 2
After a short introduction in the general concept of depth we first discuss the continuity properties of depths and central regions. The main part of the talk is concerned with the connections between different types of convergence. We give conditions under which the pointwise (resp. uniform) convergence of the data depth implies the pointwise (resp. uniform) convergence of the central regions in the Hausdorff metric as well as conditions under which the reverse implications hold.
Depth is a concept that measures the centrality of a point in a given data cloud x_1, x_2,..., x_n in ℝ^d or in a given probability distribution on ℝ^d. Every depth defines a family of socalled central regions. The αcentral region is the set of all points that have a depth of at least α. For statistical applications it is desirable that with increasing sample size the empirical depth as well as the empirical central regions converge almost surely to their population counterparts.
After a short introduction in the general concept of depth we first discuss the continuity properties of depths and central regions. The main part of the talk is concerned with the connections between different types of convergence. We give conditions under which the pointwise (resp. uniform) convergence of the data depth implies the pointwise (resp. uniform) convergence of the central regions in the Hausdorff metric as well as conditions under which the reverse implications hold. Further, we demonstrate that under relative weak conditions the pointwise convergence of the data depth (resp. central regions) is equivalent to the uniform convergence of the data depth (resp. central regions). Finally, we illustrate these results by applying them to special notions of data depth that have been proposed in the literature.
After a short introduction in the general concept of depth we first discuss the continuity properties of depths and central regions. The main part of the talk is concerned with the connections between different types of convergence. We give conditions under which the pointwise (resp. uniform) convergence of the data depth implies the pointwise (resp. uniform) convergence of the central regions in the Hausdorff metric as well as conditions under which the reverse implications hold. Further, we demonstrate that under relative weak conditions the pointwise convergence of the data depth (resp. central regions) is equivalent to the uniform convergence of the data depth (resp. central regions). Finally, we illustrate these results by applying them to special notions of data depth that have been proposed in the literature.
Hicham Janati

Spatiotemporal Optimal transport metric for brain imaging data
11 Feb 2020 At Telecom Paris  Amphi 2
11 Feb 2020 At Telecom Paris  Amphi 2
Evaluating how two different MEG time series are close to each other is far from trivial. In this paper, we propose SpatioTemporal Alignments (STA), a new differentiable formulation of DTW, in which spatial differences between time samples are accounted for using regularized optimal transport.
Comparing data defined over space and time is notoriously hard, because it involves quantifying both spatial and temporal variability, while at the same time taking into account the chronological structure of data. For instance, MEG and EEG source estimates provide brain activity maps in a millimetre / millisecond resolution with significant spatiotemporal differences. Evaluating how two different MEG time series are close to each other is thus far from trivial. This work is a first step towards that goal. Dynamic Time Warping (DTW) computes an optimal alignment between time series in agreement with the chronological order, but is inherently blind to spatial shifts. In this paper, we propose SpatioTemporal Alignments (STA), a new differentiable formulation of DTW, in which spatial differences between time samples are accounted for using regularized optimal transport (OT). Our temporal alignments are handled through a smooth variant of DTW called softDTW, for which we prove a new property: softDTW increases quadratically with time shifts. The cost matrix within softDTW that we use is computed using unbalanced Optimal transport. Experiments on brain imaging data confirm our theoretical findings and illustrate the effectiveness of STA as a dissimilarity for spatiotemporal data.
Vianney Perchet

Optimization vs Privacy in Machine Learning
07 Jan 2020 At CentralSupelec  Batiment Eiffel  Amphi I
07 Jan 2020 At CentralSupelec  Batiment Eiffel  Amphi I
After, a quick review of some concepts in privacy for machine learning, we will formalize the tradeoff utility vs privacy as an optimization problem where the objective mapping is regularized by the amount of information revealed to the adversary.
Information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to optimize some target loss functions. These two objectives are antagonistic and leaking this information might be more rewarding than concealing it. Unlike classical solutions that focus on the first point, we consider instead agents that maximize a natural tradeoff between both objectives.
We will in a first step quickly review some concepts of privacy in machine learning, before formalizing the tradeoff utility vs privacy as an optimization problem where the objective mapping is regularized by the amount of information revealed to the adversary (measured as a divergence between the prior and posterior on the private knowledge). Quite surprisingly, when combined with the entropic regularization, the Sinkhorn loss naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions.
We will in a first step quickly review some concepts of privacy in machine learning, before formalizing the tradeoff utility vs privacy as an optimization problem where the objective mapping is regularized by the amount of information revealed to the adversary (measured as a divergence between the prior and posterior on the private knowledge). Quite surprisingly, when combined with the entropic regularization, the Sinkhorn loss naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions.
Imke Mayer

Treatment effect estimation with missing attributes
07 Jan 2020 At CentralSupelec  Batiment Eiffel  Amphi I
07 Jan 2020 At CentralSupelec  Batiment Eiffel  Amphi I
In this talk, I will first provide a classification of existing methods for causal inference with missing values. Then, I will present two recent contributions on this topic.
Inferring causal effects of a treatment or policy from observational data is central to many applications. However, stateoftheart methods for causal inference seldom consider the possibility that covariates have missing values, which is ubiquitous in many realworld cases.
This work is motivated by assessing several medical questions about different treatments based on a large prospective database counting over 20,000 major trauma patients. This database is complex in the sense that it presents a multilevel and heterogeneous structure and precisely contains large fractions of missing values.
Missing data greatly complicate causal analyses as they either require strong assumptions about the missing data generating mechanism or an adapted unconfoundedness hypothesis. In this talk, I will first provide a classification of existing methods according to the main underlying assumptions, which are based either on variants of the classical unconfoundedness assumption or relying on assumptions about the mechanism that generates the missing values. Then, I will present two recent contributions on this topic: (1) an extension of doubly robust estimators that allows handling of missing attributes, and (2) an approach to causal inference based on variational autoencoders adapted to incomplete data.
This work is motivated by assessing several medical questions about different treatments based on a large prospective database counting over 20,000 major trauma patients. This database is complex in the sense that it presents a multilevel and heterogeneous structure and precisely contains large fractions of missing values.
Missing data greatly complicate causal analyses as they either require strong assumptions about the missing data generating mechanism or an adapted unconfoundedness hypothesis. In this talk, I will first provide a classification of existing methods according to the main underlying assumptions, which are based either on variants of the classical unconfoundedness assumption or relying on assumptions about the mechanism that generates the missing values. Then, I will present two recent contributions on this topic: (1) an extension of doubly robust estimators that allows handling of missing attributes, and (2) an approach to causal inference based on variational autoencoders adapted to incomplete data.
In this talk, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the size of each tree and of the number of trees in the forest.
Random forests (Breiman, 2001) are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem. As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of Breiman's random forests.
In the regression framework, the quadratic risk of a purely random forest can be written as the sum of two terms, which can be understood as an approximation error and an estimation error. Robin Genuer (2010) studied how the estimation error decreases when the number of trees increases for some specific model. In this talk, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the size of each tree and of the number of trees in the forest.
Under some regularity assumptions on the regression function, we show that the bias of an infinite forest decreases at a faster rate (with respect to the size of each tree) than a single tree. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees.
This talk is based on joint works with Robin Genuer.
arxiv.org/abs/1407.3939
arxiv.org/abs/1604.01515
In the regression framework, the quadratic risk of a purely random forest can be written as the sum of two terms, which can be understood as an approximation error and an estimation error. Robin Genuer (2010) studied how the estimation error decreases when the number of trees increases for some specific model. In this talk, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the size of each tree and of the number of trees in the forest.
Under some regularity assumptions on the regression function, we show that the bias of an infinite forest decreases at a faster rate (with respect to the size of each tree) than a single tree. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees.
This talk is based on joint works with Robin Genuer.
arxiv.org/abs/1407.3939
arxiv.org/abs/1604.01515
Pierre Laforgue

On the Dualization of OperatorValued Kernel Machines
02 Dec 2019 At ENSAE  Amphi 200
02 Dec 2019 At ENSAE  Amphi 200
In this talk, we study a dualization approach to OperatorValued Kernel problems in infinite dimensional output spaces. Under mild assumptions, the dual problem is shown to be computable, allowing for the resolution of a wide range of OVK problems.
OperatorValued Kernels (OVKs) provide an elegant way to extend scalar kernel methods when the output space is a Hilbert space. If the output space if finite dimensional, this framework naturally allows to tackle multiclass classification or multitask regression problems. But its ability to deal with infinite dimensional output spaces opens the door to many more applications, such as structured output prediction, structured representation learning, or functional regression. This work investigates how to use the duality principle to handle different families of loss functions, yet unexplored within OVK machines. The difficulty of having infinite dimensional dual variables is overcome by means of a Double Representer Theorem, that will be explicited. This allows for instance to handle Ɛinsensitive and Huber losses, which are of particular interest in the context of surrogate approaches.
This is a joint work with Alex Lambert, Luc BrogatMotte and Florence d'AlchéBuc from Télécom Paris. The preprint is available at https://arxiv.org/abs/1910.04621.
This is a joint work with Alex Lambert, Luc BrogatMotte and Florence d'AlchéBuc from Télécom Paris. The preprint is available at https://arxiv.org/abs/1910.04621.
FrançoisPierre Paty

Regularizing Optimal Transport Using Regularity Theory
05 Nov 2019 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
05 Nov 2019 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
In this talk, I will present a new regularization of OT leveraging regularity of the Brenier map. Instead of considering regularity as a a property that can be proved under suitable assumptions, we consider regularity as a condition that must be enforced when estimating OT.
Optimal transport (OT) suffers from the curse of dimensionality. In this talk, I will present a new regularization of OT leveraging regularity of the Brenier map. Instead of considering regularity as a a property that can be proved under suitable assumptions, we consider regularity as a condition that must be enforced when estimating OT. From an algorithmic point of view, this leads to an infinitedimensional optimization problem, which, when dealing with discrete measures, can be rewritten as a finitedimensional separately convex problem. From a statistical point of view, this defines new estimators of the OT map and 2Wasserstein distance between arbitrary measures, for which I will show numerical evidence of their performance. (Based on a joint work with Alexandre d'Aspremont and Marco Cuturi)
Pierre Ablin

Deep learning for inverse problems: solving the Lasso with neural networks
05 Nov 2019 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
05 Nov 2019 At INRIA Saclay  Batiment Alan Turing  Amphi sophie Germain
Deep learning architectures are becoming ubiquitous for inverse problem resolution. We start by reviewing one of the earliest examples, called LISTA. In a second step, we study the weights that are learned by such networks.
Deep learning architectures are becoming ubiquitous for inverse problem resolution. We start by reviewing one of the earliest examples. It consists in using recurrent neural networks to solve the Lasso problem: one of the most popular algorithm to solve the Lasso, the Iterative ShrinkageThresholding Algorithm (ISTA), iterates matrix multiplications and nonlinearities until convergence. Therefore, T iterations of ISTA are equivalent to a T layers neural network. The weights of the neural network can then be learned to accelerate resolution of the Lasso.
In a second step, we study the weights that are learned by such networks. In particular, we show that the last layers of such deep networks can only learn a better stepsize for the ISTA iterations.
In a second step, we study the weights that are learned by such networks. In particular, we show that the last layers of such deep networks can only learn a better stepsize for the ISTA iterations.