Marie Perrot-Dockès

Selected Publications

Motivated by an application in molecular biology, we propose a novel, efficient and fully data-driven approach for estimating large, block-structured, sparse covariance matrices when the number of variables is much larger than the number of samples, without limiting ourselves to block diagonal matrices. Our approach consists of approximating such a covariance matrix by the sum of a low-rank sparse matrix and a diagonal matrix. Our methodology can also deal with matrices for which the block structure only appears once the columns and rows are permuted according to an unknown permutation. Our technique is implemented in the R package BlockCov, which is available from the Comprehensive R Archive Network and from GitHub. To illustrate the statistical and numerical performance of our package, we provide numerical experiments as well as a thorough comparison with alternative methods. Finally, our approach is applied to gene expression data in order to better understand the toxicity of acetaminophen on the liver of rats.
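To make the "low-rank plus diagonal" idea concrete, here is a minimal NumPy sketch on a toy block-structured correlation matrix. It uses a generic alternating scheme (principal-axis factoring), which is an assumption made for illustration and is not the algorithm implemented in BlockCov; the function names `toy_block_corr` and `low_rank_plus_diagonal` are hypothetical, and the sparsity and permutation steps described above are not covered.

```python
import numpy as np


def toy_block_corr(sizes=(3, 2), rhos=(0.7, 0.5)):
    """Build a block diagonal correlation matrix with constant within-block correlation."""
    q = sum(sizes)
    sigma = np.zeros((q, q))
    start = 0
    for size, rho in zip(sizes, rhos):
        block = slice(start, start + size)
        sigma[block, block] = rho
        start += size
    np.fill_diagonal(sigma, 1.0)
    return sigma


def low_rank_plus_diagonal(sigma, rank, n_iter=100):
    """Approximate a symmetric matrix by (rank-`rank` matrix) + (diagonal matrix).

    Generic alternating scheme, assumed here for illustration only:
    1. remove the current diagonal part,
    2. keep the `rank` leading eigenpairs of what remains,
    3. reset the diagonal part so that the diagonal of `sigma` is matched.
    """
    diag = np.zeros(sigma.shape[0])
    for _ in range(n_iter):
        # eigh returns eigenvalues in ascending order, so the last ones are the largest.
        eigvals, eigvecs = np.linalg.eigh(sigma - np.diag(diag))
        vecs, vals = eigvecs[:, -rank:], eigvals[-rank:]
        low_rank = (vecs * vals) @ vecs.T
        diag = np.diag(sigma) - np.diag(low_rank)
    return low_rank, np.diag(diag)


sigma = toy_block_corr()
low_rank, diagonal = low_rank_plus_diagonal(sigma, rank=2)
approx = low_rank + diagonal
print(np.round(approx, 3))
```

In this toy case each block contributes one rank-one component, so a rank equal to the number of blocks is enough; with real data the rank is unknown and has to be chosen from the data.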

Omic data are characterized by the presence of strong dependence structures that result either from data acquisition or from some underlying biological processes. In metabolomics, for instance, data resulting from Liquid Chromatography-Mass Spectrometry (LC-MS), a technique which gives access to a large coverage of metabolites, exhibit such patterns. These data sets are typically used to find the metabolites characterizing a phenotype of interest associated with the samples. However, applying statistical procedures that do not adjust the variable selection step to the dependence pattern may result in a loss of power and the selection of spurious variables. The goal of this paper is to propose a variable selection procedure in the multivariate linear model that accounts for the dependence structure of the multiple outputs, which may lead, in the LC-MS framework, to the selection of more relevant metabolites. We propose a novel Lasso-based approach in the multivariate framework of the general linear model that takes the dependence structure into account by using various models of the covariance matrix of the residuals. Our numerical experiments show that including the estimation of the covariance matrix of the residuals in the Lasso criterion dramatically improves the variable selection performance. Our approach is also successfully applied to an LC-MS data set made of African copal samples, for which it is able to provide a small list of metabolites without altering the phenotype discrimination. Our methodology is implemented in the R package MultiVarSel, which is available from the CRAN (Comprehensive R Archive Network).

A novel variable selection approach in the framework of multivariate linear models, taking into account the dependence that may exist between the responses, is presented. It consists of first estimating the covariance matrix Sigma of the responses and then plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. Its properties are investigated both from a theoretical and a numerical point of view. General conditions on the estimators of the covariance matrix and its inverse are given in order to recover the positions of the null and non-null entries of the coefficient matrix when the size of Sigma is not fixed and can tend to infinity.
In JMVA, 2018
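The plug-in idea can be sketched in a few lines of NumPy: whiten the responses with an estimate of Sigma, stack the model into a single univariate regression, and solve a Lasso problem on it. This is only an illustrative sketch under stated assumptions: the helper names are hypothetical, the Lasso is solved with a basic proximal-gradient (ISTA) loop rather than any particular package, the penalty `lam=5.0` is arbitrary, and Sigma is taken as known, whereas in practice it would be estimated beforehand.

```python
import numpy as np


def inverse_sqrt(sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T


def whiten_and_vectorize(X, Y, sigma_hat):
    """Rewrite Y = X B + E as a univariate regression with decorrelated noise."""
    s = inverse_sqrt(sigma_hat)
    y_vec = (Y @ s).flatten(order="F")  # column-stacked vec(Y S)
    x_big = np.kron(s.T, X)             # vec(X B S) = (S' kron X) vec(B)
    return x_big, y_vec


def lasso_objective(A, b, w, lam):
    """Value of 0.5 * ||b - A w||^2 + lam * ||w||_1."""
    return 0.5 * np.sum((b - A @ w) ** 2) + lam * np.sum(np.abs(w))


def lasso_ista(A, b, lam, n_iter=5000):
    """Minimise the Lasso objective by proximal gradient descent (ISTA)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = w - step * (A.T @ (A @ w - b))
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return w


# Simulated example: n samples, p predictors, q correlated responses.
rng = np.random.default_rng(0)
n, p, q = 60, 3, 4
X = rng.normal(size=(n, p))
B = np.zeros((p, q))
B[0, 1], B[2, 3] = 2.0, -1.5  # sparse coefficient matrix
idx = np.arange(q)
sigma = 0.6 ** np.abs(np.subtract.outer(idx, idx))  # AR(1)-type covariance
Y = X @ B + rng.multivariate_normal(np.zeros(q), sigma, size=n)

sigma_hat = sigma  # assumption: Sigma treated as known for this sketch
x_big, y_vec = whiten_and_vectorize(X, Y, sigma_hat)
w_hat = lasso_ista(x_big, y_vec, lam=5.0)
B_hat = w_hat.reshape((p, q), order="F")
print(np.round(B_hat, 2))
```

Because the whitening matrix multiplies the responses on the right, the stacked coefficient vector is still vec(B), so zeros in the Lasso solution correspond directly to zeros in the coefficient matrix.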

Recent Publications

A multivariate variable selection approach for analyzing LC-MS metabolomics data. 2018.


Variable selection in multivariate linear models with high-dimensional covariance matrix estimation. In JMVA, 2018.


Recent & Upcoming Talks


  • Tutorial on inferential statistics for first-year students at AgroParisTech
  • Introduction to the R language for second-year students at AgroParisTech