See also the ZONMW website

Summary

Biomarker selection from omics studies is usually performed by applying a feature selection algorithm to a single data set of interest. Only after this initial selection are other sources of information used to reduce the selected set, e.g. by intersecting it with sets found in similar studies, or by searching the literature. This is useful, but the damage may already have been done: in particular for heterogeneous patient populations or data sets with a weak differential signal, important markers are missed and/or many false positives are reported, leading to the much-heard frustration about the poor reproducibility of published molecular signatures.

A large amount and variety of data on the potential markers is (publicly) available. We propose to use Big Data when it is needed most: during the construction of predictive signatures, thus ensuring robustness and reproducibility. We develop a novel framework that automatically uses such additional data, the co-data, during the selection of potential biomarkers from molecular profiles. Here, co-data can be any qualitative or quantitative information on the markers, as long as it does not use the response labels of the primary data. Examples of co-data are: data reflecting therapy response markers in a model system such as cell lines; genomic annotation; p-values from a related study; and other molecular data types. Our approach consists of two major components: first, a methodology for optimal and automatic use of multiple sources of co-data to render accurate and reproducible predictors and clinically relevant biomarkers; second, an inventory and evaluation of co-data for molecular cancer diagnostics and prognostics.

Our methodology is based on the principles of Weighted Learning and Transfer Learning, plus the combination thereof. First, Weighted Learning estimates marker-specific weights from the co-data (e.g. external p-values) and the primary data. It automatically tunes the weights to the information content of the co-data with respect to the primary data. Preliminary results show that accounting for these weights can reduce prediction error rates by tens of percent. In addition, such weights can greatly alleviate the 'find-the-needle-in-the-haystack' problem and make marker selection more robust. Second, Transfer Learning jointly learns from labeled cell line (co-)data and unlabeled tumour data. Here, the setting is therapy response: for the cell lines, both molecular data and therapy response are available, whereas therapy response is absent for the tumour data. The challenge is then to transfer the labels of the cell lines to (predicted) labels of the tumours while taking into account the molecular resemblance between the cell lines and the tumours (same mutations, same gene expression patterns, etc.).
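To make the Weighted Learning idea concrete, the following is a minimal sketch and not the project's actual algorithm. It assumes a ridge regression in which each marker receives its own penalty, derived from external p-values; the mapping `codata_penalties`, the exponent `gamma` and the grid of candidate values are hypothetical choices made for illustration. Cross-validation on the primary data picks `gamma`, so that uninformative co-data can be ignored (`gamma = 0` reduces to ordinary ridge regression).

```python
import numpy as np


def weighted_ridge_fit(X, y, penalties):
    """Minimise ||y - X b||^2 + sum_j penalties[j] * b_j^2 in closed form."""
    return np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)


def codata_penalties(pvals, gamma, lam=1.0):
    """Hypothetical mapping from external p-values to marker-specific penalties.

    gamma = 0 ignores the co-data (every penalty equals lam); a larger gamma
    penalises markers with large external p-values more heavily.
    """
    p = np.clip(np.asarray(pvals, dtype=float), 1e-12, 1 - 1e-12)
    weights = (-np.log(p)) ** gamma
    weights = weights / weights.mean()  # keep the average weight at 1
    return lam / weights


def cv_error(X, y, penalties, n_folds=5):
    """Mean squared prediction error from simple interleaved K-fold CV."""
    folds = np.arange(len(y)) % n_folds
    total = 0.0
    for k in range(n_folds):
        train, test = folds != k, folds == k
        beta = weighted_ridge_fit(X[train], y[train], penalties)
        total += np.sum((y[test] - X[test] @ beta) ** 2)
    return total / len(y)


def select_gamma(X, y, pvals, gammas=(0.0, 0.5, 1.0, 2.0), lam=1.0):
    """Choose how strongly the co-data shapes the penalties, using primary data only."""
    errors = {g: cv_error(X, y, codata_penalties(pvals, g, lam)) for g in gammas}
    best = min(errors, key=errors.get)
    return best, errors


if __name__ == "__main__":
    # Simulated example: 5 of 50 markers carry signal, and the (simulated)
    # external p-values are small for exactly those markers.
    rng = np.random.default_rng(0)
    n, p = 40, 50
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 1.0
    y = X @ beta_true + rng.normal(size=n)
    pvals = np.where(beta_true != 0, 1e-4, rng.uniform(0.05, 1.0, size=p))
    best, errors = select_gamma(X, y, pvals, lam=10.0)
    print("selected gamma:", best)
    print("CV errors:", errors)
```

The co-data enter only through the penalties and never use the response labels of the primary data, in line with the definition of co-data given above. The labeled-to-unlabeled transfer step is not covered by this sketch.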

These two methodologies, Weighted Learning and Transfer Learning, will be combined to render a novel, powerful algorithm that makes optimal use of co-data. A broad oncological collaboration team [dr. Renske Steenbergen, prof.dr. Ruud Brakenhoff (VUmc); dr. Gabe Sonke, prof.dr. Gerrit Meijer (NKI)] will support us with data, biomedical know-how and means for validation of markers. In close collaboration with this team, we will apply our approach to several tumour types for the purpose of (i) cancer diagnostics based on body fluids and specimens, in particular for oral and cervical (pre-)cancer; and (ii) prediction of therapy response, in particular response to chemotherapy in neo-adjuvant breast cancer and metastatic colorectal cancer.