Past Talks

Arshak Minasyan - Statistically optimal robust mean and covariance estimation for anisotropic Gaussians
02 Apr 2024 At ENSAE - Salle 1001
We propose two estimators of the unknown mean and covariance matrix of anisotropic Gaussians that achieve the minimax optimal rate in the presence of outliers. The contamination model considered is the adversarial/strong contamination model, which contains all well-known contamination models, e.g., Huber's contamination model. Despite the recent significant interest in robust statistics, achieving dimension-free bounds for both problems in the canonical Gaussian case remained open. In fact, several previously known results were either dimension-dependent and required the covariance matrix to be close to the identity, or had a sub-optimal dependence on the contamination level.

As part of the analysis, we derive sharp concentration inequalities for central order statistics of Gaussian, folded normal, and chi-squared distributions. This is joint work with N. Zhivotovskiy.
Ulugbek Kamilov - Restoration Deep Network as Implicit Priors for Imaging Inverse Problems
02 Apr 2024 At ENSAE - Salle 1001
Many interesting computational imaging problems can be formulated as imaging inverse problems. Since these problems are often ill-posed, one needs to integrate all the available prior knowledge to obtain high-quality solutions. This talk focuses on the class of methods that use "image restoration" deep neural networks as data-driven implicit priors on images. The roots of the methods discussed in this talk can be traced to the popular Plug-and-Play Priors (PnP) family of methods for solving inverse problems. The talk will present applications of learned implicit priors to image generation in limited-angle computed tomography, recovery of continuously represented microscopy images, and solving blind inverse problems in magnetic resonance imaging.
Badr-Eddine Cherief Abdellatif - A PAC-Bayes perspective on learning and generalization
05 Mar 2024 At Inria - Salle Gilles Kahn
Born in the late 20th century, PAC-Bayes has recently re-emerged as a powerful framework for learning with guarantees. Its bounds offer a principled way to understand the generalization ability of randomized learning algorithms, even guiding the design of new ones. This introduction dives into the foundations of PAC-Bayes, explores its recent advancements, and tries to offer some insights into promising future research directions.
Austin Stromme - Minimum intrinsic dimension scaling for entropic optimal transport
05 Mar 2024 At Inria - Salle Gilles Kahn
Entropic optimal transport (entropic OT) is a regularized variant of the optimal transport problem, widely used in practice for its computational benefits. A key statistical question for both entropic and un-regularized OT is the extent to which low-dimensional structure, of the type conjectured by the well-known manifold hypothesis, affects the statistical rates of convergence. In this talk, we will present statistical results for entropic OT which clarify the statistical role of intrinsic dimension, and indeed entropic regularization itself, in the form of a novel type of intrinsic dimension dependence we term Minimum Intrinsic Dimension Scaling (MID scaling).
Quentin Bouniot - Understanding Deep Neural Networks Through the Lens of their Non-Linearity
06 Feb 2024 At ENSAE - Salle 1001
The remarkable success of deep neural networks (DNNs) is often attributed to their high expressive power and their ability to approximate functions of arbitrary complexity. Indeed, DNNs are highly non-linear models, and activation functions introduced into them are largely responsible for this. While many works have studied the expressive power of DNNs through the lens of their approximation capabilities, quantifying the non-linearity of DNNs or of individual activation functions remains an open problem. In this work, we propose the first theoretically sound solution to track non-linearity propagation in deep neural networks with a specific focus on computer vision applications. Our proposed affinity score allows us to gain insights into the inner workings of a wide range of different architectures and learning paradigms. We provide extensive experimental results that highlight the practical utility of the proposed affinity score and its potential for long-reaching applications. (https://arxiv.org/abs/2310.11439)
Anna Korba - Sampling through optimization of discrepancies
06 Feb 2024 At ENSAE - Salle 1001
Sampling from a target measure when only partial information is available (e.g. an unnormalized density, as in Bayesian inference, or true samples, as in generative modeling) is a fundamental problem in computational statistics and machine learning. The sampling problem can be formulated as the optimization of a well-chosen discrepancy (e.g. a divergence or distance) over the space of probability distributions.
In this talk, we'll discuss several properties of sampling algorithms for some choices of discrepancies (well-known ones, or novel proxies), both regarding their optimization and quantization aspects.
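As a minimal illustration of this "sampling as optimization" viewpoint (and not of the specific algorithms discussed in the talk), the unadjusted Langevin algorithm below can be read as a discretized gradient flow of the KL divergence to the target over the space of distributions; the target density, step size and number of iterations are illustrative choices.

```python
import numpy as np

def unadjusted_langevin(grad_log_target, x0, step=1e-2, n_steps=5000, rng=None):
    """Unadjusted Langevin: x_{k+1} = x_k + step * grad log pi(x_k) + sqrt(2 * step) * noise."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x = x + step * grad_log_target(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: standard 2D Gaussian, for which grad log pi(x) = -x.
samples = unadjusted_langevin(lambda x: -x, x0=np.zeros(2))
print(samples[1000:].mean(axis=0), samples[1000:].std(axis=0))  # roughly (0, 0) and (1, 1)
```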
Pierre Ablin - Adaptive Training Distributions with Scalable Online Bilevel Optimization
09 Jan 2024 At Inria - Amphi Sophie Germain
Large neural networks pretrained on web-scale corpora are central to modern machine learning. In this paradigm, the distribution of the large, heterogeneous pretraining data rarely matches that of the application domain. This work considers modifying the pretraining distribution in the case where one has a small sample of data reflecting the targeted test conditions. We propose an algorithm motivated by a recent formulation of this setting as an online, bilevel optimization problem. With scalability in mind, our algorithm prioritizes computing gradients at training points which are likely to most improve the loss on the targeted distribution. Empirically, we show that in some cases this approach is beneficial over existing strategies from the domain adaptation literature but may not succeed in other cases. We propose a simple test to evaluate when our approach can be expected to work well and point towards further research to address current limitations.

This talk is based on the following paper: https://arxiv.org/abs/2311.11973
Aymeric Dieuleveut - Provable non-accelerations of the heavy-ball method
09 Jan 2024 At Inria - Amphi Sophie Germain
We show that the heavy-ball (HB) method provably does not reach an accelerated convergence rate on smooth strongly convex problems. More specifically, we show that for any condition number and any choice of algorithmic parameters, either the worst-case convergence rate of HB on the class of L-smooth and μ-strongly convex quadratic functions is not accelerated (that is, slower than 1 − O(κ)), or there exists an L-smooth μ-strongly convex function and an initialization such that the method does not converge.

To the best of our knowledge, this result closes a simple yet open question on one of the most used and iconic first-order optimization techniques. Our approach builds on finding functions for which HB fails to converge and instead cycles over finitely many iterates. We analytically describe all parametrizations of HB that exhibit this cycling behavior on a particular cycle shape, whose choice is supported by a systematic and constructive approach to the study of cycling behaviors of first-order methods. We show the robustness of our results to perturbations of the cycle, and extend them to classes of functions that also satisfy higher-order regularity conditions.
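For readers less familiar with the method under discussion, here is a minimal sketch of the heavy-ball iteration on a strongly convex quadratic, with a textbook parameter tuning for quadratics; it is only background for the abstract above, not an implementation of the paper's cycling constructions.

```python
import numpy as np

def heavy_ball(grad, x0, alpha, beta, n_iter=500):
    """Heavy-ball iteration: x_{k+1} = x_k - alpha * grad(x_k) + beta * (x_k - x_{k-1})."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(n_iter):
        x_next = x - alpha * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Illustrative L-smooth, mu-strongly convex quadratic f(x) = 0.5 * x^T diag(mu, L) x.
mu, L = 1.0, 100.0
grad = lambda x: np.array([mu, L]) * x
# Classical tuning on quadratics: alpha = 4/(sqrt(L)+sqrt(mu))^2, beta = ((sqrt(L)-sqrt(mu))/(sqrt(L)+sqrt(mu)))^2.
alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
print(np.linalg.norm(heavy_ball(grad, x0=np.ones(2), alpha=alpha, beta=beta)))  # close to 0
```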
Jordan Serres - Concentration of empirical barycenters in Non Positively Curved metric spaces
03 Oct 2023 At Inria - Amphi Sophie Germain
Barycenters in non-Euclidean geometries are the most natural extension of linear averaging. They are widely used in shape statistics, optimal transport and matrix analysis. In this talk, I will present their asymptotic properties, i.e. law of large numbers. I will also present some more recent results about their non-asymptotic concentration rate.
David Degras - Sparse estimation and model segmentation via the Sparse Group Fused Lasso
03 Oct 2023 At Inria - Amphi Sophie Germain
This talk introduces the sparse group fused lasso (SGFL), a statistical framework for segmenting sparse regression models of multivariate time series. To compute solutions of the SGFL, a nonsmooth and nonseparable convex program, we develop a hybrid optimization method that is fast, requires no internal tuning, and is guaranteed to converge to a global minimizer. In numerical experiments, the proposed algorithm compares favorably to the state of the art with respect to computation time and numerical accuracy; benefits are particularly substantial in high dimension. Statistical performance is satisfactory in recovering nonzero regression coefficients and excellent in change point detection. Applications to air quality and neuroimaging data will be presented. The hybrid algorithm is implemented in an R package available at https://github.com/ddegras/sparseGFL
Alexei Grinbaum - Generative AI: from statistical physics to ethics
05 Sep 2023 At ENSAE - Salle 1001
This talk intertwines a quick reminder of the most salient open problems in the transformer architecture with a questioning of ethical and societal impacts of large language models. We’ll discuss emergent behaviours in LLMs and their scientific understanding (or lack thereof) as critical phenomena. We’ll dwell on the example of ‘digital whales’ to get a feeling of how much room is available down there in the LLM vector space for non-human use of language. And we’ll conclude by comparing machines that speak our language with non-human entities endowed with the same capacity in myth. What lessons can we draw for generative AI from angels, demons, gods, and oracles?
Francis Bach - Chain of Log-Concave Markov Chains
05 Sep 2023 At ENSAE - Salle 1001
Markov chain Monte Carlo (MCMC) is a class of general-purpose algorithms for sampling from unnormalized densities. There are two well-known problems facing MCMC in high dimensions: (i) The distributions of interest are concentrated in pockets separated by large regions with small probability mass, and (ii) The log-concave pockets themselves are typically ill-conditioned. We introduce a framework to tackle these problems using isotropic Gaussian smoothing. We prove one can always decompose sampling from a density (minimal assumptions made on the density) into a sequence of sampling from log-concave conditional densities via accumulation of noisy measurements with equal noise levels. This construction keeps track of a history of samples, making it non-Markovian as a whole, but the history only shows up in the form of an empirical mean, making the memory footprint minimal. We study our sampling algorithm quantitatively using the 2-Wasserstein metric and compare it with various Langevin MCMC algorithms. Joint work with Saeed Saremi and Ji Won Park (https://arxiv.org/abs/2305.19473).
Xavier d’Haultfoeuille - Partially Linear Models under Data Combination
06 Jun 2023 At Inria - Amphi Sophie Germain
We study partially linear models when the outcome of interest and some of the covariates are observed in two different datasets that cannot be linked. This type of data combination problem arises very frequently in empirical microeconomics. Using recent tools from optimal transport theory, we derive a constructive characterization of the sharp identified set. We then build on this result and develop a novel inference method that exploits the specific geometric properties of the identified set. Our method exhibits good performance in finite samples, while remaining very tractable. We apply our approach to study intergenerational income mobility over the period 1850-1930 in the United States. Our method allows us to relax the exclusion restrictions used in earlier work, while delivering confidence regions that are informative.
Alejandro de la Concha - Collaborative likelihood-ratio estimation over graphs
06 Jun 2023 At Inria - Amphi Sophie Germain
Assuming we have i.i.d. observations from two unknown probability density functions (pdfs), p and p′, likelihood-ratio estimation (LRE) is an approach to comparing the two pdfs without knowing their functional form explicitly. In this talk, we will introduce a graph-based extension of the problem. We will suppose each node v of a fixed graph has access to observations coming from two unknown node-specific pdfs, p_v and p′_v; the goal is then to compare the respective p_v and p′_v of each node by integrating information provided by the graph structure. This setting is interesting when the graph conveys some sort of 'similarity' between the node-wise estimation tasks, which suggests that the nodes can collaborate to solve their individual tasks more efficiently. Our main contribution is a distributed non-parametric framework for graph-based LRE, called GRULSIF, that incorporates in a novel way elements from f-divergence functionals, kernel methods, and multitask learning. Among the applications of LRE, we choose two-sample hypothesis testing to develop a proof of concept for our graph-based learning framework. In our experiments, our approach compares favorably against state-of-the-art non-parametric statistical tests that are applied at each node independently and thus disregard the graph structure.
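As background on LRE for readers new to the topic, a common single-node baseline (distinct from the GRULSIF framework presented in the talk) estimates the ratio p/p′ by training a probabilistic classifier to distinguish samples from the two densities; the sketch below uses logistic regression on illustrative Gaussian samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(2000, 1))   # samples from p
x_q = rng.normal(0.5, 1.0, size=(2000, 1))   # samples from p'

# Label samples from p as 1 and samples from p' as 0, then fit P(label = 1 | x).
X = np.vstack([x_p, x_q])
y = np.concatenate([np.ones(len(x_p)), np.zeros(len(x_q))])
clf = LogisticRegression().fit(X, y)

def likelihood_ratio(x):
    """Estimate p(x)/p'(x) via the classifier odds (the two classes are balanced here)."""
    prob = clf.predict_proba(x)[:, 1]
    return prob / (1.0 - prob)

print(likelihood_ratio(np.array([[0.0], [2.0]])))  # > 1 where p dominates, < 1 where p' does
```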
François Lanusse - Score-Based Generative Modeling on SO(3) and Application to Emulating Cosmological Simulations
02 May 2023 At ENSAE - Salle 1002
Modeling probabilities over 3D orientations is an important area of directional statistics which finds a wide range of applications from computer vision and robotics to physics. In this talk I will present a new approach for modeling conditional distributions on SO(3) (the Lie group of 3D rotations) based on extending Euclidean Score-Based Generative models (e.g. Song et al. 2021) to the SO(3) manifold. From a methodological point of view, our implementation relies on specific properties of heat diffusion on SO(3) that allow for closed-form solutions of the diffusion Stochastic Differential Equation (SDE) on that manifold, as well as making use of geometric Ordinary Differential Equation (ODE) solvers to efficiently sample and evaluate the log probability of the model.
I will then illustrate a concrete application of these conditional generative models for emulating some of the complex physics that are present in cosmological hydrodynamical simulations but too expensive to simulate over large simulation volumes. Specifically, I will show how our SO(3) model can correctly capture and reproduce the complex dependence of 3D galaxy orientations with their environment, an effect that is very important to model correctly to analyze cosmological observations in practice.
Rémi Flamary - Optimal transport for graph modeling in Graph Neural Networks
02 May 2023 At ENSAE - Salle 1002
In recent years, the Optimal Transport (OT) based Gromov-Wasserstein (GW) divergence has been investigated as a similarity measure between structured data expressed as distributions typically lying in different metric spaces, such as graphs with arbitrary sizes. In this talk, we will address the optimization problem inherent in the computation of GW and some of its recent extensions, namely the Entropic and the Fused GW divergences. Next, we will illustrate how these OT problems can be used to model graph data in learning scenarios such as graph compression, clustering and classification. Finally, we will present a recent application of the GW distance as a novel pooling layer in Graph Neural Networks.
Matthieu Terris - Plug-and-play algorithms through the lens of monotone operators
04 Apr 2023 At ENSAE - Salle 1001
The lack of convergence guarantees for PnP algorithms can be problematic in some imaging applications. In this talk, we show that monotone operator theory provides a convenient theoretical framework for studying PnP algorithms. After a gentle introduction to PnP methods for inverse problems, we show that this theory allows us to derive theoretical Lipschitz constraints on the denoiser that ensure convergence and a characterization of the limit point. These constraints can be relaxed and incorporated as a differentiable Jacobian (l2) norm regularisation term in the training loss function. This strategy can be applied to any feedforward architecture and is effective at enforcing convergence despite being approximate. We show an application of the proposed method to (radio) astronomical imaging, where the accuracy of the reconstruction is key.
Azadeh Khaleghi - Oblivious Data for Fairness with Kernels
04 Apr 2023 At ENSAE - Salle 1001
We investigate the problem of algorithmic fairness in the case where sensitive and non-sensitive features are available and one aims to generate new, ‘oblivious’, features that closely approximate the non-sensitive features, and are only minimally dependent on the sensitive ones. We study this question in the context of kernel methods. We analyze a relaxed version of the Maximum Mean Discrepancy criterion which does not guarantee full independence but makes the optimization problem tractable. We derive a closed-form solution for this relaxed optimization problem and complement the result with a study of the dependencies between the newly generated features and the sensitive ones. Our key ingredient for generating such oblivious features is a Hilbert-space-valued conditional expectation, which needs to be estimated from data. We propose a plug-in approach and demonstrate how the estimation errors can be controlled. While our techniques help reduce the bias, we would like to point out that no post-processing of any dataset could possibly serve as an alternative to well-designed experiments.


Reference: S. Grünewälder, A. Khaleghi, Oblivious Data for Fairness with Kernels, Journal of Machine Learning Research, (208): 1-36, 2021.
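To fix ideas about the Maximum Mean Discrepancy criterion mentioned in the abstract, here is a minimal sketch of the standard unbiased MMD² estimator with a Gaussian kernel; it illustrates the criterion itself only, not the relaxed optimization or the conditional-expectation estimators of the paper, and the kernel and bandwidth are illustrative choices.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_squared(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = gaussian_kernel(X, X, sigma), gaussian_kernel(Y, Y, sigma), gaussian_kernel(X, Y, sigma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))   # off-diagonal averages
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.3, 1.0, size=(500, 2))
print(mmd_squared(X, Y))   # positive for shifted distributions, near zero if they match
```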
Gaël Varoquaux - Embeddings to learn on messy relational data
07 Mar 2023 At Inria - Salle Gilles Kahn
Statistical learning builds upon the regularities of the data at hand, leveraging similarities across observations or smoothness in the underlying process. But these regularities, similarities, or smoothness are difficult to capture in relational data. Observations come with attributes of different nature: age, size, address. The observations themselves may be of different nature, reflecting different granularity of information; for instance, studying the housing market may require assembling information across sales, properties, buyers, and various administrative divisions of cities and states.

Given such complex relational data, the standard practice is to manually transform it into a vector space, with much manual labor to make the data as regular as possible: SQL joins and aggregates across tables, entity normalization (fixing typos), imputation of the missing values. I will present progress on rethinking the data-science process to avoid these manual operations. Using flexible learners rather than parametric models removes the need for fancy imputation [1]. Machine learning at the character level removes the need for normalizing entities, though the analytical question must be reformulated with a non-parametric model [2,3]. And finally, the information of a full database, with objects of different nature and varying attributes, can be cast into a vector space that captures this information by expressing the relational model as a graph and adapting knowledge-graph embedding techniques [4]. As a result, we provide vectors summarizing the full numerical and relational information in Wikipedia for millions of entities: cities, people, companies, books: https://soda-inria.github.io/ken_embeddings/

[1] Marine Le Morvan, Julie Josse, Erwan Scornet, & Gaël Varoquaux, (2021). What’s a good imputation to predict with missing values?. Advances in Neural Information Processing Systems, 34, 11530-11540.

[2] Patricio Cerda, and Gaël Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering (2020).

[3] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux. Analytics on Non-Normalized Data Sources: more Learning, rather than more Cleaning. IEEE Access 10 (2022): 42420-42431.

[4] Alexis Cvetkov-Iliev, Alexandre Allauzen, and Gaël Varoquaux. Relational data embeddings for feature enrichment with background information. Machine Learning (2023): 1-34.
Marylou Gabrié - Opportunities and Challenges in Enhancing Sampling with Learning
07 Mar 2023 At Inria - Salle Gilles Kahn
Deep generative models parametrize very flexible families of distributions able to fit complicated datasets of images or text. These models provide independent samples from complex high-dimensional distributions at negligible cost. On the other hand, sampling exactly from a target distribution, such as a Bayesian posterior, is typically challenging: because of dimensionality, multi-modality, ill-conditioning, or a combination of these. In this talk, I will review recent works trying to enhance traditional inference and sampling algorithms with learning. I will present in particular flowMC, an adaptive MCMC method with normalizing flows, along with first applications and remaining challenges.
Nabil Mustapha - Hitting Convex Sets: Combinatorics, Geometry and Sampling
07 Feb 2023 At ENSAE - Salle 1002
Let P be a finite set of points in R^d, and epsilon > 0 a given parameter. Then our goal is to compute a set Q in R^d, of small size, such that *any* convex set containing an epsilon-th fraction of the points of P contains at least one point of Q.
This problem, proposed more than 40 years ago, remains wide open, with a large gap between the best-known lower and upper bounds on the size of Q. In this talk, I will explain some of the approaches towards resolving this question, which use a variety of techniques from Ramsey theory, combinatorial geometry, theory of convex polytopes and data-depth. Joint work with Saurabh Ray and Imre Barany.
Johannes Lutzeyer - Advances in Graph Representation Learning: The Graph Ordering Attention Networks
07 Feb 2023 At ENSAE - Salle 1002
The research and development of Graph Neural Networks (GNNs) is currently one of the fastest-growing topics in the area of Artificial Intelligence. GNNs have celebrated many academic and industrial successes in the past years, providing a rich ground for theoretical analysis and achieving state-of-the-art results in several learning tasks. In this talk, I will give an accessible introduction to the area of Graph Representation Learning with a focus on GNNs. I will then discuss some of our recent work in the area, in particular our recent contribution, the Graph Ordering Attention (GOAT) Networks. Most standard GNN layers are unable to capture statistical interaction effects of nodes in a neighbourhood. Our GOAT networks address this shortcoming and capture all possible interactions in a given neighbourhood. This allows us to demonstrably improve GNN performance in both synthetic tasks and on real-world data. I will end the talk by providing an overview of our other recent contributions to the Graph Representation Learning literature.
Alexandre Perez - Beyond calibration: estimating the grouping loss of modern neural networks
06 Dec 2022 At Inria - Amphi Sophie Germain
Good decision making requires machine-learning models to provide trustworthy confidence scores. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet, contrary to widespread belief, calibration is not enough: even a classifier with the best possible accuracy and perfect calibration can have confidence scores far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We use it to study modern neural network architectures in vision and NLP. We find that the grouping loss varies markedly across architectures, and that it is a key model-comparison factor among the most accurate, calibrated models. We also show that distribution shifts lead to high grouping loss.
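As a point of reference for the calibration part of the discussion, the standard binned estimator of top-label calibration error is sketched below on synthetic scores; as the abstract stresses, such an estimator says nothing about the grouping loss, whose estimation is the contribution of the paper, and the binning scheme here is an illustrative choice.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted average of |accuracy - mean confidence| over equal-width bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10000)
correct = (rng.random(10000) < conf).astype(float)   # scores calibrated by construction
print(expected_calibration_error(conf, correct))     # close to 0
```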
Marco Cuturi - OT at scale: Careful Initialization, Low-rank considerations and Neural Networks
06 Dec 2022 At Inria - Amphi Sophie Germain
I will present in this talk a series of efforts targeted at increasing the scalability and applicability of OT computations. I will present efforts on two fronts: In the first part, I will discuss speeding up the discrete resolution of the Kantorovich problem, using either the Sinkhorn approach, and, in that case, focusing on efficient heuristics to initialize Sinkhorn potentials, or, alternatively, by parameterizing OT couplings as a product of low-rank non-negative matrices. In the second part I will explain how a parameterization, in the 2-Wasserstein setting, of dual potentials as input-convex neural networks has opened several research avenues, and demonstrate this by illustrating a recent application to a multitask, joint estimation of several Monge maps linked by a common set of parameters.
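For context, the Sinkhorn iterations mentioned in the first part are sketched below in their most basic form (uniform marginals, fixed entropic regularization); the initialization heuristics and low-rank parameterizations from the talk are not reproduced here, and the point clouds and regularization strength are illustrative.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.05, n_iter=500):
    """Basic Sinkhorn iterations for entropic OT with uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / epsilon)             # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)                     # alternating scaling updates
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan with marginals a and b

rng = np.random.default_rng(0)
x, y = rng.normal(size=(50, 2)), rng.normal(loc=1.0, size=(60, 2))
cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
cost /= cost.max()                          # rescale to avoid underflow in exp(-cost / epsilon)
plan = sinkhorn(cost)
print(plan.sum(), (plan * cost).sum())      # total mass ~1 and the (regularized) transport cost
```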
Lukas Steinberger - Statistical Efficiency in Local Differential Privacy
08 Nov 2022 At ENSAE - Salle 1004
In light of big data and powerful machine learning algorithms that can quickly search and analyze large amounts of data, issues of data privacy protection are becoming more pressing than ever before. An increasingly popular approach towards data privacy protection, which is robust against any kind of adversary, is the notion of 'differential privacy'. Although nowadays differential privacy receives a lot of attention also from within the statistics community, still even some of the most basic questions appear to be open.

In this talk we begin with a very general introduction to privacy aware data analysis via differential privacy. We then turn to the local paradigm of differential privacy and focus on the classical statistical problem of efficient estimation in regular parametric models. We highlight some of the intricacies that appear when the classical theory is adapted to procedures that guarantee local differential privacy and propose a general solution for the single parameter case. At the core of the new theory lies a challenging optimization problem for which efficient algorithms are sorely needed. Most of this talk is work in progress.
Gilles Dowek - Explanation: From ethics to logic
08 Nov 2022 At ENSAE - Salle 1004
When a decision, such as the approval or denial of a bank loan, is delegated to a computer, an explanation of that decision ought to be given with it. This ethical need to explain decisions leads to the search for a formal definition of the notion of explanation. This question meets older questions in logic regarding the explanatory nature of proof.
Valentin de Bortoli - An introduction to Score-Based Generative Modeling and Schrodinger Bridges
05 Jul 2022 At ENSAE - Salle 1004
Score-Based Generative Modeling (SGM) is a recently developed approach to probabilistic generative modeling that exhibits state-of-the-art performance on several audio and image synthesis tasks. Existing SGMs generally consist of two parts. Firstly, noise is incrementally added to the data in order to obtain a perturbed data distribution approximating an easy-to-sample prior distribution e.g. Gaussian. Secondly, a neural network is used to learn the reverse-time denoising dynamics, which when initialized at this prior distribution, defines a generative model. Song et al. (2021) have shown that one could fruitfully view the noising process as a Stochastic Differential Equation (SDE) that progressively perturbs the initial data distribution into an approximately Gaussian one. In this talk, we will introduce the basics of SGM and present some connections with stochastic control through the concept of Schrodinger bridges.
Yann Issartel - Seriation and 1D-localization in latent space models
05 Jul 2022 At ENSAE - Salle 1004
Motivated by applications in archeology for relative dating of objects, or in 2D-tomography for angular synchronization, we consider the problem of statistical seriation where one seeks to reorder a noisy disordered matrix of pairwise affinities. This problem can be recast in the powerful latent space terminology, where the affinity between a pair of items is modeled as a noisy observation of a function f(x_i,x_j) of the latent positions x_i, x_j of the two items on a one-dimensional space. This reformulation naturally leads to the problem of estimating the positions in the latent space. Under non-parametric assumptions on the affinity function f, we introduce a procedure that provably localizes all the latent positions with a maximum error of the order of the square root of log(n)/n. This rate is proven to be minimax optimal. Computationally efficient procedures are also analyzed, under some more restrictive assumptions. Our general results can be instantiated to the original problem of statistical seriation, leading to new bounds for the maximum error in the ordering.
Matthieu Jonckheere - New exploration principles for sparse and rare rewards
07 Jun 2022 At CentralSupelec - Batiment Eiffel - Amphi IV
We consider reinforcement learning control problems under the average reward criterion in which non-zero rewards are both sparse and rare, that is, they occur in very few states and the steady-state probability of observing them is extremely small. For such problems, we propose a new approach to exploit prior knowledge on the sparse structure of the environment. The core idea is to use Renewal Theory and Fleming-Viot particle systems to construct estimators of quantities relevant to reinforcement learning, for which we provide some theoretical guarantees for the estimation error.

Joint work with U. Ayesta, S. Majewski, and D. Mastropietro.
Ismail Ben Ayed - Learning with limited supervision
07 Jun 2022 At CentralSupelec - Batiment Eiffel - Amphi IV
Despite their unprecedented performances when trained on large-scale labeled data, deep learning models are seriously challenged when dealing with novel (unseen) classes and limited labeled instances. This generalization challenge occurs in a breadth of real scenarios and applications. In contrast, humans can learn new tasks easily from a handful of examples, by leveraging prior experience and context. Few-shot learning attempts to bridge this gap, and has recently triggered substantial research efforts. This talk discusses recent developments in the general, wide-interest subject of learning with very limited supervision. Specifically, I will discuss state-of-the-art models, which leverage unlabeled data with priors, and point to current challenges and interesting future research avenues.
Elisabeth Gassiat - Deconvolution with unknown noise distribution
05 Apr 2022 At ENSAE - Salle 1001
I consider the deconvolution problem in the case where no information is known about the noise distribution. More precisely, no assumption is made on the noise distribution and no samples are available to estimate it: the deconvolution problem is solved based only on observations of the corrupted signal. I will present an identifiability result, and its use for statistical inference in various applications.
Gérard Biau - A primer on Generative Adversarial Networks
05 Apr 2022 At ENSAE - Salle 1001
The mathematical forces at work behind Generative Adversarial Networks raise challenging theoretical issues. Motivated by the important question of characterizing the geometrical properties of the generated distributions, we provide a thorough analysis of Wasserstein GANs (WGANs) in both the finite sample and asymptotic regimes. We study the specific case where the latent space is univariate and derive results valid regardless of the dimension of the output space. We show in particular that for a fixed sample size, the optimal WGANs are closely linked with connected paths minimizing the sum of the squared Euclidean distances between the sample points. We also highlight the fact that WGANs are able to approach (for the 1-Wasserstein distance) the target distribution as the sample size tends to infinity, at a given convergence rate and provided the family of generative Lipschitz functions grows appropriately. We derive in passing new results on optimal transport theory in the semi-discrete setting.

(joint work with B. Cadre, N. Klutchnikoff, A. Stéphanovitch, and U. Tanielian)
Frederic Pascal - A new robust clustering algorithm - Application to image segmentation
01 Feb 2022 At Inria Saclay - Amphi Sophie Germain
Though very popular, the EM algorithm is well known to suffer from non-Gaussian distribution shapes, outliers and high dimensionality. In this talk, we introduce a new robust clustering algorithm that can efficiently deal with noise and outliers in diverse data sets. As an EM-like algorithm, it is based on estimating both the cluster centers and covariances. In addition, using a semi-parametric paradigm, the method estimates an unknown scale parameter per datapoint. This allows the algorithm to accommodate heavier-tailed distributions and outliers without significantly losing efficiency in various classical scenarios. We then show that the proposed algorithm outperforms other classical unsupervised methods of the literature, such as k-means, the EM algorithm and its recent modifications, or spectral clustering, when applied to real data sets such as MNIST, NORB and 20newsgroups. An application to radar (SAR) image segmentation is proposed, highlighting the interest of the proposed approach.
Arnak Dalalyan - Robust Estimation of the Gaussian Mean by iterative reweighting
01 Feb 2022 At Inria Saclay - Amphi Sophie Germain
The goal of this talk is to show that a single robust estimator of the mean of a multivariate Gaussian distribution can enjoy five desirable properties. First, it is computationally tractable in the sense that it can be computed in a time which is at most polynomial in dimension, sample size and the logarithm of the inverse of the contamination rate. Second, it is equivariant by scaling, translations and orthogonal transformations. Third, it has a nearly-minimax-rate-breakdown point approximately equal to 0.28. Fourth, it is minimax rate optimal when data consist of independent observations corrupted by adversarially chosen outliers. Fifth, it is asymptotically optimal when the rate of contamination tends to zero. The estimator is obtained by an iterative reweighting approach. Each sample point is assigned a weight that is iteratively updated using a convex optimization problem. We also establish a dimension-free non-asymptotic risk bound for the expected error of the proposed estimator. It is the first result of this kind in the literature and involves only the effective rank of the covariance matrix.
(Joint work with Arshak Minasyan)
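As a rough illustration of the iterative-reweighting idea only, the sketch below down-weights points far from the current mean estimate and recomputes a weighted mean; it is a simplified heuristic on synthetic data, not the convex weight-update scheme of the paper, and the weighting rule and constants are assumptions made for the example.

```python
import numpy as np

def reweighted_mean(X, n_iter=20, c=2.0):
    """Heuristic iteratively reweighted mean: cap the influence of points far from the estimate."""
    mu = np.median(X, axis=0)                                  # robust initialization
    for _ in range(n_iter):
        dists = np.linalg.norm(X - mu, axis=1) + 1e-12
        scale = np.median(dists)
        weights = np.minimum(1.0, (c * scale / dists) ** 2)    # down-weight far points
        weights /= weights.sum()
        mu = weights @ X
    return mu

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(450, 5))
outliers = rng.normal(10.0, 1.0, size=(50, 5))                 # 10% gross contamination
X = np.vstack([inliers, outliers])
# The plain mean is pulled far from 0, the reweighted mean stays much closer.
print(np.linalg.norm(X.mean(axis=0)), np.linalg.norm(reweighted_mean(X)))
```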
Frederic Barbaresco - Statistics and learning on Lie groups: Jean-Louis Koszul's information geometry and Jean-Marie Souriau's symplectic model of statistical physics
06 Jul 2021 At BlueJean
Geometric models of information, statistical physics and inference in machine learning share common structures, as illustrated at the Les Houches School of Physics in July 2020 (https://franknielsen.github.io/SPIG-LesHouches2020/) and in the recent Springer volume on the subject [1]. Information geometry [2] uses natural learning gradients, which have the remarkable property of being invariant to the coordinate systems used to encode the information, through the so-called Fisher metric. These intrinsic schemes preserve symmetries and can be extended to more abstract spaces when the data are encoded on differentiable or symplectic manifolds (as when information is encoded by Lie groups, e.g. in robotics or Guidance, Navigation & Control). In these more structured encoding spaces, the notion of Fisher metric extends to the Koszul-Souriau metric, and the definition of entropy to that of an invariant Casimir function in the coadjoint representation, establishing links with integrable systems in the sense of Liouville [3,4,5,6].

These topics will be developed at the 5th edition of the SEE GSI'21 conference (Geometric Science of Information) in July 2021 (www.gsi2021.org) at Sorbonne Université in Paris, co-organized with SCAI Sorbonne (https://scai.sorbonne-universite.fr/) and the ELLIS Paris Unit (https://ellis-paris.github.io/), whose theme will be "Learning Geometric Structures".

References:

[1] Frédéric Barbaresco & Frank Nielsen, Geometric Structures of Statistical Physics, Information Geometry and Learning, proceedings of the Les Houches School of Physics Summer Week SPIGL'20, Springer Proceedings in Mathematics & Statistics, 2021; https://www.springer.com/jp/book/9783030779566
[2] Frédéric Barbaresco & Frank Nielsen, La Géométrie de l'Information, special issue no. 73 of the magazine TANGENTE; https://www.decitre.fr/revues/tangente-hors-serie-n-73-mathematiques-et-emploi-9782848842387.html
[3] Charles-Michel Marle, On Gibbs states of mechanical systems with symmetries, arXiv:2012.00582v1 [math.DG], https://arxiv.org/abs/2012.00582
[4] Frédéric Barbaresco, Lie Group Statistics and Lie Group Machine Learning Based on Souriau Lie Groups Thermodynamics & Koszul-Souriau-Fisher Metric: New Entropy Definition as Generalized Casimir Invariant Function in Coadjoint Representation. Entropy 2020, 22, 642.
[5] Frédéric Barbaresco, François Gay-Balmaz, Lie Group Cohomology and (Multi)Symplectic Integrators: New Geometric Tools for Lie Group Machine Learning Based on Souriau Geometric Statistical Mechanics. Entropy 2020, 22, 498.
[6] Frédéric Barbaresco, Koszul lecture related to geometric and analytic mechanics, Souriau’s Lie group thermodynamics and information geometry. SPRINGER Information Geometry, 2021
Lénaïc Chizat - Analysis of Gradient Descent on Wide Two-Layer ReLU Neural Networks
06 Jul 2021 At BlueJean
In this talk, we propose an analysis of gradient descent on wide two-layer ReLU neural networks that leads to sharp characterizations of the learned predictor. The main idea is to study the dynamics when the width of the hidden layer goes to infinity, which is a Wasserstein gradient flow. While this dynamics evolves on a non-convex landscape, we show that for appropriate initializations, its limit, when it exists, is a global minimizer. We also study the implicit regularization of this algorithm when the objective is the unregularized logistic loss, which leads to a max-margin classifier in a certain functional space. We finally discuss what these results tell us about the generalization performance, and in particular how these models compare to kernel methods. This is based on joint work with Francis Bach.
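For readers who want a concrete picture of the object being analyzed, here is a minimal NumPy sketch of a wide two-layer ReLU network in the mean-field parameterization, trained by plain gradient descent on a toy regression task; the width, learning rate and data are illustrative choices, and the infinite-width Wasserstein-gradient-flow analysis of the talk is of course not reproduced by such a finite simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 200, 2, 1000
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)      # toy regression target

# Mean-field parameterization: f(x) = (1/width) * sum_j a_j * relu(w_j . x)
W = rng.normal(size=(width, d))
a = rng.normal(size=width)
lr = 0.5 * width                                     # step size scaled with the width

for _ in range(2000):
    pre = X @ W.T                                    # (n, width) pre-activations
    h = np.maximum(pre, 0.0)                         # ReLU features
    pred = h @ a / width
    resid = pred - y
    grad_a = h.T @ resid / (n * width)               # gradients of 0.5 * mean squared error
    grad_W = ((resid[:, None] * (pre > 0) * a[None, :]).T @ X) / (n * width)
    a -= lr * grad_a
    W -= lr * grad_W

# Training MSE should end up well below the variance of y (the error of predicting 0).
print(np.mean((pred - y) ** 2), np.var(y))
```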
Benjamin Riu - Muddling Label Regularization: Deep Learning for Tabular Datasets
01 Jun 2021 At BlueJean
Deep Learning is considered the state-of-the-art in computer vision, speech recognition and natural language processing. Until recently, it was also widely accepted that DL is irrelevant for learning tasks on tabular data, especially in the small sample regime where ensemble methods are acknowledged as the gold standard.
We present a new end-to-end differentiable method to train a standard FFNN. Our method, Muddling labels for Regularization (MLR), penalizes memorization through the generation of uninformative labels and the application of a differentiable closed-form regularization scheme on the last hidden layer during training. MLR outperforms classical FFNNs and the gold-standard GBDT and RF methods for regression and classification tasks on several datasets from the UCI database and Kaggle, covering a large range of sample sizes and feature-to-sample ratios. Researchers and practitioners can use MLR on its own as an off-the-shelf DL solution or integrate it into the most advanced ML pipelines.
Stefano Fortunati - Robust Semiparametric Efficient Estimators in Complex Elliptically Symmetric Distributions
01 Jun 2021 At BlueJeans
In this seminar, we first recall the building blocks underlying the derivation of a robust and efficient real-valued R-estimator of the finite-dimensional parameter vector, and then propose its extension to complex-valued data. Moreover, through numerical simulations, its estimation performance and robustness to outliers are discussed in a finite-sample regime.
Covariance matrices play a major role in statistics, signal processing and machine learning. This seminar focuses on the semiparametric covariance/scatter matrix estimation problem in elliptical distributions. The class of elliptical distributions can be seen as a semiparametric model where the finite-dimensional vector of interest is given by the location vector and by the (vectorized) covariance/scatter matrix, while the density generator represents an infinite-dimensional nuisance function. The main aim of statistical inference in elliptically distributed data is then to provide estimators of the finite-dimensional parameter vector able to reconcile the two dichotomic concepts of robustness and (semiparametric) efficiency. An R-estimator satisfying these requirements has recently been proposed by Hallin, Oja and Paindaveine for real-valued elliptical data by exploiting Le Cam's theory of one-step efficient estimators and rank-based statistics. In this seminar, we first recall the building blocks underlying the derivation of such a real-valued R-estimator, and then propose its extension to complex-valued data. Moreover, through numerical simulations, its estimation performance and robustness to outliers are discussed in a finite-sample regime.
Timothée Matthieu - Robust Machine Learning with Median of Means and M-estimators slides
04 May 2021 At BlueJeans
Most machine learning algorithms are not robust as their predictive power can be highly disturbed by a small amount of abnormal data. In this talk I will show how to use robust estimators of the mean to construct a robust risk minimization framework for machine learning tasks, in particular for classification and regression.
Most machine learning algorithms are not robust as their predictive power can be highly disturbed by a small amount of abnormal data. In this talk I will show how to use robust estimators of the mean to construct a robust risk minimization framework for machine learning tasks, in particular for classification and regression. I will first present some finite sample results (i.e. concentration inequalities) on M-estimators and Median of Means when estimating the mean, then I will use these estimators to construct robust Machine Learning algorithms and finally I will show illustrations of these algorithms in several practical applications.
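As a pointer to the kind of estimator discussed, here is a small numpy sketch of the median-of-means mean estimator; the block count and toy data are illustrative, not the speaker's code.

    import numpy as np

    def median_of_means(x, n_blocks=10, rng=None):
        """Median-of-means estimate of the mean of a 1-D sample.

        Illustrative sketch: split the (shuffled) sample into blocks,
        average each block, and return the median of the block means.
        """
        rng = np.random.default_rng(rng)
        x = rng.permutation(np.asarray(x, dtype=float))
        blocks = np.array_split(x, n_blocks)
        return np.median([b.mean() for b in blocks])

    # Example: a heavy-tailed sample with a few gross outliers.
    rng = np.random.default_rng(0)
    sample = np.concatenate([rng.standard_t(df=2, size=500), [1e4, -1e4, 1e4]])
    print("empirical mean:", sample.mean())
    print("median of means:", median_of_means(sample, n_blocks=15))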
Nicolas Vayatis - Can Machine Learning predict Human Behavior?
06 Apr 2021 At BlueJeans
Behavioral neurosciences are about to undergo a major shift due to a) the dissemination of wearable and ambient sensors in routine clinical assessment or in complex Human-Machine interaction, b) the progressive abandonment of animal experimentation. This raises several issues in terms of observational data, but also questions how Machine Learning techniques may contribute to the data-driven quantification and prediction of Human behavior. In this talk, I will introduce the topic of Machine Learning applications to behavioral neurosciences...
Behavioral neurosciences are about to undergo a major shift due to a) the dissemination of wearable and ambient sensors in routine clinical assessment or in complex Human-Machine interaction, b) the progressive abandonment of animal experimentation. This raises several issues in terms of observational data, but also questions how Machine Learning techniques may contribute to the data-driven quantification and prediction of Human behavior. In this talk, I will introduce the topic of Machine Learning applications to behavioral neurosciences. I will also give an overview of recent works and questions for the future of Machine Learning research in this field.
Solenne Gaucher - Continuum-armed bandits : from the classical setting to the finite setting
06 Apr 2021 At BlueJeans
In this talk, we start by introducing the continuum-armed bandit and present classical algorithms and results. We then focus on the finite continuum-armed bandit setting, where the set of actions is finite and each action can only be taken once.
Bandits are used to model the following sequential decision problem: at each time step, an agent takes an action and receives a reward drawn i.i.d. from a distribution depending on the action she has selected. Her aim is to maximize her cumulative reward. In this talk, we start by introducing the continuum-armed bandit, and present classical algorithms and results. In this setting, actions are indexed by covariates in a continuous set. The expected reward for taking an action is modeled as an (unknown) function of the covariate describing this action. We then focus on the finite continuum-armed bandit setting, where the set of actions is finite and each action can only be taken once.
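For the classical (continuous) setting, a standard textbook recipe is to discretize the action space and run a UCB-type algorithm on the resulting arms; the numpy sketch below illustrates this idea. The reward function, noise level, horizon and discretization are arbitrary, and this is not the finite setting studied in the second part of the talk.

    import numpy as np

    # Toy sketch of the discretize-then-UCB recipe for the continuum-armed bandit:
    # cut the action space [0, 1] into K arms and run UCB1 on them.
    rng = np.random.default_rng(0)
    f = lambda a: 1.0 - (a - 0.7) ** 2         # unknown expected reward, maximized at a = 0.7
    T, K = 5000, 20
    arms = np.linspace(0.0, 1.0, K)
    counts = np.zeros(K)
    means = np.zeros(K)

    for t in range(T):
        if t < K:
            k = t                              # play each arm once to initialize
        else:
            ucb = means + np.sqrt(2 * np.log(t) / counts)
            k = int(np.argmax(ucb))
        reward = f(arms[k]) + 0.1 * rng.normal()
        counts[k] += 1
        means[k] += (reward - means[k]) / counts[k]

    print("most played action:", arms[int(np.argmax(counts))])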
Jaouad Mourtada - Distribution-free robust linear regression slides
02 Mar 2021 At BlueJeans
We consider the problem of random-design linear regression, in a distribution-free setting where no assumption is made on the distribution of the predictive/input variables. After surveying existing approaches and indicating some improvements, we explain why they fall short in our setting. We then identify the minimal assumption on the target/output under which guarantees are possible, and describe a nonlinear prediction procedure achieving the optimal error bound with high probability.
We consider the problem of random-design linear regression, in a distribution-free setting where no assumption is made on the distribution of the predictive/input variables. After surveying existing approaches and indicating some improvements, we explain why they fall short in our setting. We then identify the minimal assumption on the target/output under which guarantees are possible, and describe a nonlinear prediction procedure achieving the optimal error bound with high probability.
Florent Bouchard - Riemannian geometry for data analysis: illustration on blind source separation and low-rank structured covariance matrices slides
02 Mar 2021 At BlueJeans
In this presentation, Riemannian geometry for data analysis is introduced. In particular, it is applied to two specific statistical signal processing problems: blind source separation and low-rank structured covariance matrices.
In this presentation, Riemannian geometry for data analysis is introduced. In particular, it is applied to two specific statistical signal processing problems: blind source separation and low-rank structured covariance matrices. Blind source separation can be solved by jointly diagonalizing some covariance matrices. We show here how geometry can be exploited in order to provide original joint diagonalization criteria and ways to compare them theoretically. These results are illustrated with numerical experiments on both simulated and real data. Concerning low-rank structured covariance matrices, an intrinsic Cramér-Rao bound of the corresponding estimation problem is presented, illustrating the usefulness of geometry for performance analysis. These results are validated with simulations.
Emilie Chouzenoux - Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. Application to Computed Tomography. slides
02 Feb 2021 At BlueJeans
In this talk, we focus on the stability properties of the proximal gradient algorithm when such an adjoint mismatch arises. By making use of tools from convex analysis and fixed point theory, we establish conditions under which the algorithm can still converge to a fixed point. We provide bounds on the error between this point and the solution to the minimization problem. We illustrate the applicability of our theoretical results through numerical examples in the context of computed tomography.
The proximal gradient algorithm is a popular iterative algorithm to deal with penalized least-squares minimization problems. Its simplicity and versatility allow one to embed nonsmooth penalties efficiently. In the context of inverse problems arising in signal and image processing, a major concern lies in the computational burden of implementing minimization algorithms. For instance, in tomographic image reconstruction, a bottleneck is the cost of applying the forward linear operator and its adjoint. Consequently, it often happens that these operators are approximated numerically, so that the adjoint property is no longer fulfilled.

In this talk, we focus on the stability properties of the proximal gradient algorithm when such an adjoint mismatch arises. By making use of tools from convex analysis and fixed point theory, we establish conditions under which the algorithm can still converge to a fixed point. We provide bounds on the error between this point and the solution to the minimization problem. We illustrate the applicability of our theoretical results through numerical examples in the context of computed tomography.
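To fix ideas, the kind of iteration under study can be mimicked in a few lines of numpy: proximal gradient for a sparse least-squares problem in which the true adjoint A.T is replaced by a slightly perturbed backprojector B. All sizes and constants below are illustrative, and this is not the authors' code.

    import numpy as np

    # Schematic illustration of a proximal gradient iteration with an adjoint mismatch:
    # the backprojector B only approximates the true adjoint A.T, as often happens in
    # practical tomography implementations.
    rng = np.random.default_rng(0)
    n, d = 60, 40
    A = rng.normal(size=(n, d))
    x_true = np.zeros(d)
    x_true[:5] = 1.0
    y = A @ x_true + 0.01 * rng.normal(size=n)
    B = A.T + 0.01 * rng.normal(size=(d, n))   # mismatched "adjoint"
    lam, gamma = 0.1, 1.0 / np.linalg.norm(A, 2) ** 2

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    x = np.zeros(d)
    for _ in range(500):
        x = soft_threshold(x - gamma * B @ (A @ x - y), gamma * lam)

    print("distance to true signal:", np.linalg.norm(x - x_true))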

This is joint work with M. Savanier, J.C. Pesquet, C. Riddell and Y. Trousset

References:
[1] E. Chouzenoux, J.C. Pesquet, C. Riddell, M. Savanier and Y. Trousset. Convergence of Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. To appear in Inverse Problems, 2020. http://www.optimization-online.org/DB_HTML/2020/10/8055.html
[2] M. Savanier, E. Chouzenoux, J.C. Pesquet, C. Riddell and Y. Trousset. Proximal Gradient Algorithm in the Presence of Adjoint Mismatch. In Proceedings of the 28th European Signal Processing Conference (EUSIPCO 2020), January 18-22 2021
Zaccharie Ramzi - XPDNet for MRI Reconstruction: an Application to the fastMRI 2020 Brain Challenge slides
02 Feb 2021 At BlueJeans
In this talk, we will see how Deep Learning enables us to: (i) learn an optimal optimization scheme, (ii) learn a prior from the data, and (iii) learn how to refine our knowledge of the measurement operator. We show the results of this approach on the fastMRI 2020 brain reconstruction challenge, where we secured the 2nd spot in both the 4x and 8x acceleration tracks.
In classical Magnetic Resonance Imaging reconstruction, slow iterative non-linear algorithms using manually crafted priors are applied to obtain the anatomical image from under-sampled Fourier measurements. In addition, they have to deal with an incomplete knowledge of the exact measurement operator.

Deep Learning methods, and in particular unrolled networks, have made it possible to alleviate those issues. In this talk, we will see how Deep Learning enables us to:
  1. Learn an optimal optimization scheme,
  2. Learn a prior from the data,
  3. Learn how to refine our knowledge of the measurement operator.
We show the results of this approach on the fastMRI 2020 brain reconstruction challenge where we secured the 2nd spot in both the 4x and 8x acceleration tracks.
Alain Durmus - Nonreversible MCMC: a complete recipe with convergence guarantees slides
05 Jan 2021 At BlueJeans
In this talk, we will develop general tools to ensure that a class of nonreversible Markov kernels, possibly relying on complex transforms, has the desired invariance property and leads to convergent algorithms. This results in a set of simple and practically verifiable conditions.
Markov Chain Monte Carlo (MCMC) is a class of algorithms to sample from complex and high-dimensional probability distributions. The Metropolis-Hastings (MH) algorithm, the workhorse of MCMC, provides a simple recipe to construct reversible Markov kernels. Reversibility is a tractable property which implies a less tractable but essential property here, invariance. Reversibility is, however, not necessarily desirable when considering performance. This has prompted recent interest in designing kernels breaking this property. At the same time, an active stream of research has focused on the design of novel versions of the MH kernel, some nonreversible, relying on the use of complex invertible deterministic transforms. While standard implementations of the MH kernel are well understood, the aforementioned developments have not received the same systematic treatment to ensure their validity. In this talk, we will develop general tools to ensure that a class of nonreversible Markov kernels, possibly relying on complex transforms, has the desired invariance property and leads to convergent algorithms. This results in a set of simple and practically verifiable conditions.
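For context, the reversible baseline mentioned here is the random-walk Metropolis-Hastings kernel; a minimal numpy sketch with an arbitrary one-dimensional target is given below (illustrative, not part of the talk's nonreversible constructions).

    import numpy as np

    # Random-walk Metropolis-Hastings for a 1-D target density (illustrative values).
    rng = np.random.default_rng(0)
    log_target = lambda x: -0.5 * x ** 2          # standard Gaussian, up to a constant
    x, step = 0.0, 1.0
    samples = []
    for _ in range(10000):
        prop = x + step * rng.normal()            # symmetric proposal
        # accept with probability min(1, pi(prop)/pi(x)); symmetry cancels the proposal ratio
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
        samples.append(x)
    print("sample mean/var:", np.mean(samples), np.var(samples))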
Vincent Divol - Reconstructing measures on manifolds: an optimal transport approach slides
05 Jan 2021 At BlueJeans
The problem of density estimation is known to become intractable in high dimension D >> 1. We propose to overcome this issue by assuming that the distribution mu is actually supported around a low-dimensional unknown shape M, of dimension d << D. After showing that this problem is degenerate for a large class of standard losses, we focus on the Wasserstein loss.
Density estimation is one of the most classical problems in nonparametric statistics: given i.i.d. samples from a distribution mu on R^D, the goal is to reconstruct the underlying density (say, for instance, with respect to the L_p norm). This problem is known to become intractable in high dimension D >> 1. We propose to overcome this issue by assuming that the distribution mu is actually supported around a low-dimensional unknown shape M, of dimension d << D. After showing that this problem is degenerate for a large class of standard losses (L_p, total variation, etc.), we focus on the Wasserstein loss, for which we build a minimax estimator, based on kernel density estimation, whose rate of convergence depends on d and on the regularity of the underlying density, but not on the ambient dimension D.
Sophie Donnet - Block models for multipartite networks. Applications in ecology and ethnobiology. slides
01 Dec 2020 At BlueJeans
In this contribution, we propose a stochastic block model able to handle multipartite networks, thus supplying a clustering of the individuals based on their connection behavior in more than one network. Our model is an extension of the latent block model (LBM) and the stochastic block model (SBM).
Modeling relations between individuals is a classical question in the social sciences, ecology, etc. In order to uncover a latent structure in the data, a popular approach consists in clustering individuals according to the observed patterns of interactions. To do so, stochastic block models (SBM) and latent block models (LBM) are standard tools for clustering individuals with respect to their behavior in a single network. However, when adopting an integrative point of view, individuals are not involved in a single network but are part of several networks, resulting in a potentially complex multipartite network. In this contribution, we propose a stochastic block model able to handle multipartite networks, thus supplying a clustering of the individuals based on their connection behavior in more than one network. Our model is an extension of the latent block model (LBM) and the stochastic block model (SBM). The parameters -- such as the marginal probabilities of assignment to blocks and the matrix of probabilities of connections between blocks -- are estimated through a variational Expectation-Maximization procedure. The numbers of blocks are chosen with the Integrated Completed Likelihood criterion, a penalized likelihood criterion. The relevance of our methodology is illustrated on two datasets from ecology and ethnobiology.
Boris Muzellec - The Bures-Wasserstein Geometry for Machine Learning slides
01 Dec 2020 At BlueJeans
In this talk, we show how the Bures-Wasserstein distance can be used in machine learning applications, by presenting scalable algorithms for computing and differentiating the Bures metric. Then, we show that the Bures-Wasserstein geometry can seamlessly incorporate other methods for approximating OT.
Optimal transport (OT) has recently gained popularity in the machine learning community, both as a way to measure the discrepancy between two probability measures and as a principled method to transform a distribution into another. Yet, in most cases OT does not admit a closed-form expression and either has to be evaluated through a costly optimization problem, or approximated by regularizing this problem.


Alternatively, a powerful approach consists in keeping the problem as such, and regularizing the data instead to fall back to cases that can be solved efficiently. In particular, representing the data using elliptical distributions, which are fully described by their mean vector and covariance matrix, leads to one of the very few cases of closed-form expressions for OT. Indeed, for such distributions, the Wasserstein distance can be decomposed as the sum of the Euclidean distance between means and the Bures distance between covariance matrices, which defines a Riemannian metric on the set of positive semi-definite matrices.


In this talk, we show how the Bures-Wasserstein distance can be used in machine learning applications, by presenting scalable algorithms for computing and differentiating the Bures metric. In particular, we show that a suitable reparameterization allows one to emulate Riemannian gradient descent in a projection-free Euclidean setting. Finally, we show that the Bures-Wasserstein geometry can seamlessly incorporate other methods for approximating OT, such as low-dimensional projections or entropic regularization, and propose applications to probabilistic word embeddings.
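For reference, the closed-form expression alluded to above is easy to evaluate; the following numpy/scipy sketch (with arbitrary example Gaussians, not the speaker's code) computes the squared 2-Wasserstein distance between two Gaussian distributions.

    import numpy as np
    from scipy.linalg import sqrtm

    def bures_wasserstein_sq(m1, S1, m2, S2):
        """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
        ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
        r1 = sqrtm(S1)
        cross = sqrtm(r1 @ S2 @ r1)
        return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross)))

    # Small sanity check on two 2-D Gaussians (illustrative values).
    m1, m2 = np.zeros(2), np.ones(2)
    S1 = np.array([[2.0, 0.3], [0.3, 1.0]])
    S2 = np.eye(2)
    print(bures_wasserstein_sq(m1, S1, m2, S2))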
Alexandre Gramfort - Learning representations from neural signals slides
03 Nov 2020 At BlueJeans
In this talk I will cover some recent works that aim to learn non-linear representations of neural time series in order to estimate good predictive models.
Good representations are the building blocks of signal processing. While the Fourier representation reveals sinusoids buried in noise, wavelets better capture time-localized transient phenomena. Processing neuroimaging data is also based on representations. Morlet wavelets or the short-time Fourier transform (STFT) are common for electrophysiology (M/EEG), and spherical harmonics are used for diffusion MRI signals. In this talk I will cover some recent works that aim to learn non-linear representations of neural time series in order to estimate good predictive models, such as [1, 2, 3].

[1] M. Jas, T. Dupré la Tour, U. Simsekli and A. Gramfort. Learning the Morphology of Brain Signals Using Alpha-Stable Convolutional Sparse Coding. Advances in Neural Information Processing Systems (NIPS) 30, 2017.
[2] T. Dupré la Tour, T. Moreau, M. Jas and A. Gramfort. Multivariate Convolutional Sparse Coding for Electromagnetic Brain Signals. Advances in Neural Information Processing Systems (NeurIPS), 2018.
[3] H. Banville, I. Albuquerque, G. Moffat, D. Engemann and A. Gramfort. Self-supervised representation learning from electroencephalography signals. Proc. Machine Learning for Signal Processing (MLSP), 2019.
Marine Le Morvan - Neumann networks: differential programming for supervised learning with missing values slides
03 Nov 2020 At BlueJeans
In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms. Based on a Neumann series approximation of the optimal predictor, we propose a new principled architecture, named Neumann networks. Their originality and strength come from the use of a new type of non-linearity: the multiplication by the missingness indicator.
The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and self-masking (Missing Not At Random). Based on a Neumann series approximation of the optimal predictor, we propose a new principled architecture, named Neumann networks. Their originality and strength comes from the use of a new type of non-linearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of Neumann networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for medium-sized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as self-masking.
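To illustrate the tool behind the architecture's name, the sketch below shows the generic truncated Neumann series sum_{k<=K} M^k b approximating (I - M)^{-1} b when the spectral radius of M is below one; the actual Neumann networks apply this idea to the optimal predictor with missing values, which is not reproduced here. Matrix, vector and depth are illustrative.

    import numpy as np

    # Truncated Neumann series: sum_{k=0}^{K} M^k b approximates (I - M)^{-1} b.
    rng = np.random.default_rng(0)
    d = 5
    M = 0.1 * rng.normal(size=(d, d))          # small enough for the series to converge
    b = rng.normal(size=d)

    x = np.zeros(d)
    term = b.copy()
    for _ in range(20):                        # 20 "layers" of the truncated series
        x += term
        term = M @ term

    print("truncated series:", x)
    print("exact solve:     ", np.linalg.solve(np.eye(d) - M, b))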
Joon Kwon - Unifying mirror descent and dual averaging slides
06 Oct 2020 At BlueJeans
We introduce and analyse a new family of algorithms which generalizes and unifies both the mirror descent and the dual averaging algorithms.
We introduce and analyse a new family of algorithms which generalizes and unifies both the mirror descent and the dual averaging algorithms. In the framework of this family, we define a new algorithm for constrained optimization with the aim of combining the advantages of mirror descent and dual averaging. In practice, this new algorithm converges as fast as mirror descent and dual averaging, and in some situations greatly outperforms them. Besides, we demonstrate how our algorithms can also be applied to solving variational inequalities.
This is joint work with Anatoli Juditsky and Eric Moulines.
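As a concrete point of comparison, here are the two updates being unified, written on the probability simplex with the entropic mirror map (exponentiated gradient). The objective, step size and horizon below are illustrative, not the speakers' code.

    import numpy as np

    # Mirror descent vs. dual averaging on the simplex with the entropic mirror map.
    rng = np.random.default_rng(0)
    d = 5
    c = rng.normal(size=d)                     # minimize the linear function <c, x> over the simplex
    eta, T = 0.1, 200

    x_md = np.ones(d) / d                      # mirror descent iterate
    g_sum = np.zeros(d)                        # accumulated gradients for dual averaging
    for _ in range(T):
        g = c                                  # gradient of <c, x>
        x_md = x_md * np.exp(-eta * g)         # mirror descent: multiplicative update of the last iterate
        x_md /= x_md.sum()
        g_sum += g
    x_da = np.exp(-eta * g_sum)                # dual averaging: update driven by the gradient sum
    x_da /= x_da.sum()

    print("mirror descent:", np.round(x_md, 3))
    print("dual averaging:", np.round(x_da, 3))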
Evgenii Chzhen - Algorithmic Fairness in Classification and Regression slides
06 Oct 2020 At BlueJeans
The goal of this talk is to introduce the audience to the problem of algorithmic fairness. I will provide a general overview of the topic, describe various available notions of fairness in classification and regression, and present the main approaches to tackle this problem. I will also present some recent theoretical results both in classification and regression. This talk is based on joint works with C. Denis, M. Hebiri, L. Oneto, and M. Pontil.
The goal of this talk is to introduce the audience to the problem of algorithmic fairness. I will provide a general overview of the topic, describe various available notions of fairness in classification and regression, and present the main approaches to tackle this problem. I will also present some recent theoretical results both in classification and regression. This talk is based on joint works with C. Denis, M. Hebiri, L. Oneto, and M. Pontil.
Alexei Grinbaum - Chance as a value for artificial intelligence
03 Mar 2020 At INRIA Saclay - Bâtiment Alan Turing - Amphi Sophie Germain
Deep learning techniques lead to fundamentally non-interpretable decisions made by the machine. Although such choices do not have an explanation, they impact the users in significant ways. If the ultimate innovator is a machine, what is the meaning of responsible conduct?
Deep learning techniques lead to fundamentally non-interpretable decisions made by the machine. Although such choices do not have an explanation, they impact the users in significant ways. If the ultimate innovator is a machine, what is the meaning of responsible conduct? I argue in a recent book that the capacity to extract an AI system from human judgment, by reducing transparency in favor of opacity, is an essential value in machine ethics. This can be achieved through the use of randomness, illustrated on several examples including the trolley dilemma. Methodologically, a comparison of common motives between technological setups and mythological narratives is used to achieve ethical insights.
Yannig Goude - Machine Learning Methods for Electricity Load Forecasting: contributions and perspectives.
03 Mar 2020 At INRIA Saclay - Bâtiment Alan Turing - Amphi Sophie Germain
To maintain electricity quality, energy stakeholders are developing smart grids, the next-generation power grid including advanced communication networks and associated optimisation and forecasting tools. We will present recent developments in online learning and probabilistic forecasting carried out at EDF in this context.
Energy systems are facing a revolution and many challenges. On the one hand, electricity production is moving towards more intermittency and complexity with the increase of renewable energy and the development of small distributed production units such as photovoltaic panels or wind farms. On the other hand, consumption is also changing with, e.g., plug-in (hybrid) electric vehicles, heat pumps, and the development of new technologies such as smartphones, computers and storage devices. To maintain electricity quality, energy stakeholders are developing smart grids, the next-generation power grid including advanced communication networks and associated optimisation and forecasting tools. Exploiting the smart grid efficiently requires advanced data analytics to improve forecasting at different geographical scales. We will present recent developments in online learning and probabilistic forecasting carried out at EDF in this context.
Rainer Dyckerhoff - Convergence of depths and central regions
11 Feb 2020 At Telecom Paris - Amphi 2
After a short introduction to the general concept of depth, we first discuss the continuity properties of depths and central regions. The main part of the talk is concerned with the connections between different types of convergence. We give conditions under which the pointwise (resp. uniform) convergence of the data depth implies the pointwise (resp. uniform) convergence of the central regions in the Hausdorff metric, as well as conditions under which the reverse implications hold.
Depth is a concept that measures the centrality of a point in a given data cloud x_1, x_2,..., x_n in ℝ^d or in a given probability distribution on ℝ^d. Every depth defines a family of so-called central regions. The α-central region is the set of all points that have a depth of at least α. For statistical applications it is desirable that with increasing sample size the empirical depth as well as the empirical central regions converge almost surely to their population counterparts.

After a short introduction to the general concept of depth, we first discuss the continuity properties of depths and central regions. The main part of the talk is concerned with the connections between different types of convergence. We give conditions under which the pointwise (resp. uniform) convergence of the data depth implies the pointwise (resp. uniform) convergence of the central regions in the Hausdorff metric, as well as conditions under which the reverse implications hold. Further, we demonstrate that under relatively weak conditions the pointwise convergence of the data depth (resp. central regions) is equivalent to the uniform convergence of the data depth (resp. central regions). Finally, we illustrate these results by applying them to specific notions of data depth that have been proposed in the literature.
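As a toy illustration of these definitions, the sketch below computes a simple Mahalanobis-type depth on a Gaussian sample and extracts an empirical alpha-central region; the depth, the sample and the level alpha are illustrative stand-ins for the depths discussed in the talk.

    import numpy as np

    # Mahalanobis depth D(x) = 1 / (1 + (x - mean)' Sigma^{-1} (x - mean)) and the
    # empirical alpha-central region {x_i : D(x_i) >= alpha}.
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 2]], size=500)

    mu = X.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(X, rowvar=False))
    depth = 1.0 / (1.0 + np.einsum('ij,jk,ik->i', X - mu, Sinv, X - mu))

    alpha = 0.4
    central_region = X[depth >= alpha]         # sample points in the alpha-central region
    print("points in the 0.4-central region:", len(central_region))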
Hicham Janati - Spatio-temporal Optimal transport metric for brain imaging data
11 Feb 2020 At Telecom Paris - Amphi 2
Evaluating how close two different MEG time series are to each other is far from trivial. In this paper, we propose Spatio-Temporal Alignments (STA), a new differentiable formulation of DTW in which spatial differences between time samples are accounted for using regularized optimal transport.
Comparing data defined over space and time is notoriously hard, because it involves quantifying both spatial and temporal variability while taking into account the chronological structure of the data. For instance, MEG and EEG source estimates provide brain activity maps at millimetre/millisecond resolution with significant spatio-temporal differences. Evaluating how close two different MEG time series are to each other is thus far from trivial. This work is a first step towards that goal. Dynamic Time Warping (DTW) computes an optimal alignment between time series in agreement with the chronological order, but is inherently blind to spatial shifts. In this paper, we propose Spatio-Temporal Alignments (STA), a new differentiable formulation of DTW in which spatial differences between time samples are accounted for using regularized optimal transport (OT). Our temporal alignments are handled through a smooth variant of DTW called soft-DTW, for which we prove a new property: soft-DTW increases quadratically with time shifts. The cost matrix within soft-DTW is computed using unbalanced optimal transport. Experiments on brain imaging data confirm our theoretical findings and illustrate the effectiveness of STA as a dissimilarity for spatio-temporal data.
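To make the temporal part concrete, here is a small numpy re-implementation of the soft-DTW recursion, in which the hard minimum of DTW is replaced by a smooth soft-min. In STA the pairwise cost matrix would come from regularized optimal transport rather than the squared distances used in this illustrative example.

    import numpy as np

    def soft_dtw(cost, gamma=1.0):
        """Soft-DTW value for a pairwise cost matrix (illustrative re-implementation).

        Replaces the hard min of the DTW recursion by the smooth
        soft-min_gamma(a) = -gamma * log(sum_i exp(-a_i / gamma)).
        """
        n, m = cost.shape
        R = np.full((n + 1, m + 1), np.inf)
        R[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
                soft_min = prev.min() - gamma * np.log(np.sum(np.exp(-(prev - prev.min()) / gamma)))
                R[i, j] = cost[i - 1, j - 1] + soft_min
        return R[n, m]

    # Toy example: squared Euclidean costs between two short 1-D time series.
    a = np.array([0.0, 1.0, 2.0, 1.0])
    b = np.array([0.0, 1.5, 1.0])
    cost = (a[:, None] - b[None, :]) ** 2
    print(soft_dtw(cost, gamma=0.1))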
Vianney Perchet - Optimization vs Privacy in Machine Learning slides
07 Jan 2020 At CentraleSupélec - Bâtiment Eiffel - Amphi I
After a quick review of some concepts of privacy in machine learning, we will formalize the utility vs. privacy tradeoff as an optimization problem in which the objective mapping is regularized by the amount of information revealed to the adversary.
Information is valuable either by remaining private (for instance if it is sensitive) or, on the other hand, by being used publicly to optimize some target loss functions. These two objectives are antagonistic and leaking this information might be more rewarding than concealing it. Unlike classical solutions that focus on the first point, we consider instead agents that maximize a natural trade-off between both objectives.

We will first quickly review some concepts of privacy in machine learning, before formalizing the utility vs. privacy tradeoff as an optimization problem in which the objective mapping is regularized by the amount of information revealed to the adversary (measured as a divergence between the prior and posterior on the private knowledge). Quite surprisingly, when combined with entropic regularization, the Sinkhorn loss naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions.
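For completeness, the Sinkhorn iterations associated with entropic regularization amount to alternating scalings of a Gibbs kernel; a minimal numpy sketch with arbitrary discrete measures follows (illustrative sizes and regularization strength, not the speaker's code).

    import numpy as np

    # Sinkhorn iterations for entropy-regularized optimal transport between two discrete measures.
    rng = np.random.default_rng(0)
    n, m, eps = 5, 7, 0.1
    a = np.ones(n) / n                          # source weights
    b = np.ones(m) / m                          # target weights
    C = (rng.uniform(size=(n, 1)) - rng.uniform(size=(1, m))) ** 2   # cost matrix
    K = np.exp(-C / eps)                        # Gibbs kernel

    u, v = np.ones(n), np.ones(m)
    for _ in range(500):
        u = a / (K @ v)                         # alternate scaling of the two marginals
        v = b / (K.T @ u)

    P = u[:, None] * K * v[None, :]             # regularized transport plan
    print("marginals ok:", np.allclose(P.sum(1), a), np.allclose(P.sum(0), b))
    print("transport cost:", np.sum(P * C))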
Imke Mayer - Treatment effect estimation with missing attributes slides
07 Jan 2020 At CentraleSupélec - Bâtiment Eiffel - Amphi I
In this talk, I will first provide a classification of existing methods for causal inference with missing values. Then, I will present two recent contributions on this topic.
Inferring causal effects of a treatment or policy from observational data is central to many applications. However, state-of-the-art methods for causal inference seldom consider the possibility that covariates have missing values, which is ubiquitous in many real-world cases.

This work is motivated by assessing several medical questions about different treatments based on a large prospective database of over 20,000 major trauma patients. This database is complex in the sense that it has a multi-level and heterogeneous structure and, in particular, contains large fractions of missing values.

Missing data greatly complicate causal analyses, as they require either strong assumptions about the missing data generating mechanism or an adapted unconfoundedness hypothesis. In this talk, I will first provide a classification of existing methods according to their main underlying assumptions, which rely either on variants of the classical unconfoundedness assumption or on assumptions about the mechanism that generates the missing values. Then, I will present two recent contributions on this topic: (1) an extension of doubly robust estimators that allows handling of missing attributes, and (2) an approach to causal inference based on variational autoencoders adapted to incomplete data.
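As background for contribution (1), the sketch below shows the standard complete-data doubly robust (AIPW) estimator of an average treatment effect, with fitted propensity and outcome models; the talk's extension to missing attributes is not reproduced here, and the data-generating process is purely illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Complete-data AIPW (doubly robust) estimate of the average treatment effect.
    rng = np.random.default_rng(0)
    n = 5000
    X = rng.normal(size=(n, 3))
    propensity = 1 / (1 + np.exp(-X[:, 0]))             # true treatment probability
    W = rng.uniform(size=n) < propensity                 # treatment assignment
    Y = X @ np.array([1.0, -1.0, 0.5]) + 2.0 * W + rng.normal(size=n)   # true effect = 2

    e_hat = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]       # propensity model
    mu1 = LinearRegression().fit(X[W], Y[W]).predict(X)                 # outcome model under treatment
    mu0 = LinearRegression().fit(X[~W], Y[~W]).predict(X)               # outcome model under control

    ate = np.mean(mu1 - mu0
                  + W * (Y - mu1) / e_hat
                  - (~W) * (Y - mu0) / (1 - e_hat))
    print("AIPW estimate of the treatment effect:", ate)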
Sylvain Arlot - Analysis of some Purely Random Forests slides
02 Dec 2019 At ENSAE - Amphi 200
In this talk, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the size of each tree and of the number of trees in the forest.
Random forests (Breiman, 2001) are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem. As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of Breiman's random forests.

In the regression framework, the quadratic risk of a purely random forest can be written as the sum of two terms, which can be understood as an approximation error and an estimation error. Robin Genuer (2010) studied how the estimation error decreases when the number of trees increases for some specific model. In this talk, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the size of each tree and of the number of trees in the forest.

Under some regularity assumptions on the regression function, we show that the bias of an infinite forest decreases at a faster rate (with respect to the size of each tree) than that of a single tree. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees.

This talk is based on joint works with Robin Genuer.
arxiv.org/abs/1407.3939
arxiv.org/abs/1604.01515
Pierre Laforgue - On the Dualization of Operator-Valued Kernel Machines slides
02 Dec 2019 At ENSAE - Amphi 200
In this talk, we study a dualization approach to Operator-Valued Kernel problems in infinite dimensional output spaces. Under mild assumptions, the dual problem is shown to be computable, allowing for the resolution of a wide range of OVK problems.
Operator-Valued Kernels (OVKs) provide an elegant way to extend scalar kernel methods when the output space is a Hilbert space. If the output space is finite-dimensional, this framework naturally allows one to tackle multi-class classification or multi-task regression problems. But its ability to deal with infinite-dimensional output spaces opens the door to many more applications, such as structured output prediction, structured representation learning, or functional regression. This work investigates how to use the duality principle to handle different families of loss functions, as yet unexplored within OVK machines. The difficulty of having infinite-dimensional dual variables is overcome by means of a Double Representer Theorem, which will be made explicit. This allows, for instance, handling ε-insensitive and Huber losses, which are of particular interest in the context of surrogate approaches.

This is a joint work with Alex Lambert, Luc Brogat-Motte and Florence d'Alché-Buc from Télécom Paris. The preprint is available at https://arxiv.org/abs/1910.04621.
François-Pierre Paty - Regularizing Optimal Transport Using Regularity Theory slides
05 Nov 2019 At INRIA Saclay - Bâtiment Alan Turing - Amphi Sophie Germain
In this talk, I will present a new regularization of OT leveraging the regularity of the Brenier map. Instead of considering regularity as a property that can be proved under suitable assumptions, we consider regularity as a condition that must be enforced when estimating OT.
Optimal transport (OT) suffers from the curse of dimensionality. In this talk, I will present a new regularization of OT leveraging the regularity of the Brenier map. Instead of considering regularity as a property that can be proved under suitable assumptions, we consider regularity as a condition that must be enforced when estimating OT. From an algorithmic point of view, this leads to an infinite-dimensional optimization problem, which, when dealing with discrete measures, can be rewritten as a finite-dimensional separately convex problem. From a statistical point of view, this defines new estimators of the OT map and 2-Wasserstein distance between arbitrary measures, for which I will show numerical evidence of their performance. (Based on a joint work with Alexandre d'Aspremont and Marco Cuturi)
Pierre Ablin - Deep learning for inverse problems: solving the Lasso with neural networks slides
05 Nov 2019 At INRIA Saclay - Bâtiment Alan Turing - Amphi Sophie Germain
Deep learning architectures are becoming ubiquitous for solving inverse problems. We start by reviewing one of the earliest examples, called LISTA. In a second step, we study the weights that are learned by such networks.
Deep learning architectures are becoming ubiquitous for solving inverse problems. We start by reviewing one of the earliest examples. It consists in using recurrent neural networks to solve the Lasso problem: one of the most popular algorithms for solving the Lasso, the Iterative Shrinkage-Thresholding Algorithm (ISTA), iterates matrix multiplications and non-linearities until convergence. Therefore, T iterations of ISTA are equivalent to a T-layer neural network. The weights of the neural network can then be learned to accelerate the resolution of the Lasso.
In a second step, we study the weights that are learned by such networks. In particular, we show that the last layers of such deep networks can only learn a better step-size for the ISTA iterations.
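To see the ISTA-as-network correspondence concretely, here is a short numpy sketch in which T ISTA iterations are written as T identical "layers"; LISTA keeps this architecture but learns the layer weights and thresholds from data. Sizes and constants are illustrative, not the paper's code.

    import numpy as np

    # ISTA for the Lasso, written as a T-layer recurrent network: each "layer" applies
    # the same affine map followed by soft-thresholding.
    rng = np.random.default_rng(0)
    n, d, lam, T = 30, 50, 0.1, 100
    A = rng.normal(size=(n, d)) / np.sqrt(n)
    z_true = np.zeros(d)
    z_true[:3] = [1.0, -2.0, 0.5]
    y = A @ z_true + 0.01 * rng.normal(size=n)

    gamma = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1 / L
    W1 = gamma * A.T                             # "input" weights shared by every layer
    W2 = np.eye(d) - gamma * A.T @ A             # "recurrent" weights shared by every layer
    theta = gamma * lam                          # soft-threshold level

    z = np.zeros(d)
    for _ in range(T):                           # T ISTA iterations == a T-layer network
        pre = W1 @ y + W2 @ z
        z = np.sign(pre) * np.maximum(np.abs(pre) - theta, 0.0)

    print("estimate on the true support:", np.round(z[:3], 2))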