Statistics for Omics Data Analysis


Omics studies provide information on many cellular entities (e.g. DNA, mRNA, microRNA), whose interrelatedness must be uncovered to enhance our understanding of the cell. The cohesion among the entities may be extracted from omics studies using graphical models. A graphical model combines a statistical model with a graph. The model specifies the details of the relations among the cellular entities, possibly over time. The accompanying graph is a schema (network) of the relations among the entities as implied by the model. The graph provides a first accessible handle on the model and eases the communication of results to our medical collaborators.

Learning graphical models from omics studies, which are typically under-sampled (few samples compared to the large number of entities), requires the development of penalized estimation techniques. Our research in this direction concentrates on ridge-type (shrinkage) estimation of graphical models for the various types of studies conducted in our hospital. We also develop methodology to analyze and exploit an estimated graphical model in order to enhance its practical value to our medical collaborators. We thus aim to translate graphical models into tangible, practical consequences in terms of the cellular entities and their relations.
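
To make the idea of ridge-type estimation concrete, the sketch below uses the simplest ridge variant, inverting the sample covariance matrix S after adding a penalty lam to its diagonal, and then rescales the resulting precision matrix to partial correlations, whose non-negligible entries would correspond to edges in the graph. This is an illustration under stated assumptions, not the estimator used in our research: the function names, the simulated data and the choice lam = 0.5 are all hypothetical, and in practice the penalty would be chosen by, for example, cross-validation.

```python
import numpy as np


def ridge_precision(X, lam):
    """Simple ridge-type precision estimate (S + lam * I)^{-1}.

    X is an n x p data matrix (samples in rows); lam > 0 is the penalty.
    Assumption: this is the most basic ridge variant, used for illustration.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    Xc = X - X.mean(axis=0)              # centre each variable
    S = Xc.T @ Xc / X.shape[0]           # sample covariance (singular if p > n)
    return np.linalg.inv(S + lam * np.eye(X.shape[1]))


def partial_correlations(omega):
    """Rescale a precision matrix to partial correlations."""
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Simulated under-sampled study: 20 samples, 50 entities (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))

omega = ridge_precision(X, lam=0.5)
pcor = partial_correlations(omega)
```

Because the penalty makes S + lam * I invertible even when there are more entities than samples, the estimate exists in exactly the under-sampled setting described above.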


Integrated analysis of multiple omics data sets

While graphical models can help us understand how genes in a chosen set (or pathway) are related, there is often also interest in the bigger picture of association between molecular types such as RNA expression, DNA copy number and methylation, giving an overview for a genomic region or even the entire genome. Such a picture may suggest ways in which gene expression is regulated, which, in the context of disease studies, may help us further understand pathogenesis. Building this bigger picture is challenging, however, as it involves tens of thousands of measurements per sample for each molecular type.

We develop powerful methods to unravel associations between molecular profiles. These make use of many molecular features at the same time, both to separate signal from noise efficiently and to reduce the number of statistical tests needed. These less structured models are efficiently implemented, making them well suited to large sets of genes, including genome-wide studies.


Co-data: Enhancing estimation, prediction and selection

Genomic studies can often benefit from co-data: auxiliary information available in addition to the genomic data collected. Examples of co-data are genomic annotation, data from repositories like GEO and TCGA, p-values from similar studies, published gene signatures, correlations with other bio-molecules (e.g. DNA-mRNA), and response-independent summary statistics like a gene's total count.

We approach omics data here as 'double Big' data: 'Big' in terms of the number of variables, and 'Big' in terms of the available sources of auxiliary co-data. The high-dimensional character of omics data hampers accurate selection of markers or combinations thereof. Variable-specific weights can greatly alleviate this problem. But how can such weights be estimated without bias? Much of our research is dedicated to this question. We employ and develop empirical Bayes techniques that implement such estimation by jointly using co-data and the main study data. We have applied these techniques to differential expression testing, network estimation, and clinical prediction. Results are very promising in terms of accuracy and reproducibility.
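
The sketch below illustrates only how variable-specific weights enter a penalized estimator, using ridge regression in which each variable has its own penalty multiplier. It does not show the empirical Bayes estimation of the weights: here the weights are simply fixed by hand from a hypothetical co-data grouping (an "annotated" gene set), and the function name, the simulated data and all numerical values are assumptions made for illustration.

```python
import numpy as np


def weighted_ridge(X, y, lam, weights):
    """Minimise ||y - X b||^2 + lam * sum_j weights[j] * b[j]^2.

    A smaller weight means a weaker penalty on that variable.
    """
    weights = np.asarray(weights, dtype=float)
    if lam <= 0 or np.any(weights <= 0):
        raise ValueError("lam and all weights must be positive")
    A = X.T @ X + lam * np.diag(weights)
    return np.linalg.solve(A, X.T @ y)


# Simulated high-dimensional study: 30 samples, 100 variables (hypothetical).
rng = np.random.default_rng(1)
n, p = 30, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:10] = 1.0                      # only the first 10 variables matter
y = X @ beta_true + rng.normal(size=n)

# Hypothetical co-data: the first 20 variables belong to an annotated gene set.
in_gene_set = np.arange(p) < 20

# Assumed weights: annotated variables are penalised ten times less.
weights = np.where(in_gene_set, 0.1, 1.0)

lam = 5.0
beta_weighted = weighted_ridge(X, y, lam, weights)
beta_plain = weighted_ridge(X, y, lam, np.ones(p))   # ordinary ridge
```

Setting all weights to one recovers ordinary ridge regression, so the weights act purely as a co-data-driven adjustment of a common penalty.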

For example, for molecular classification based on self-sampled tissue of healthy controls and cases with pre-cancerous lesions in the cervix, this strategy boosted the predictive performance when using external molecular data from professionally obtained tissue. Another example is the network estimate for genes in the P53 pathway for pancreatic and lung tumors. Here, reproducibility of the estimated conditional independence network was greatly enhanced when incorporating the external network obtained from corresponding healthy tissues.