Summary of the proposal "Co-data random forest learning for rare tumors", granted by the Hanarth Foundation under the call "AI for Oncology".

Rare cancers pose a major problem for machine learning algorithms: most genomics studies on rare cancers contain data on a relatively small number of patients and a large number of genomic features. Such a setting is challenging for machine learners, which may easily overfit or, when penalized naively, fail to find the relevant signal.

Our aim is to steer the machine learners in the right direction. To do so, we make use of the vast amounts of complementary data (co-data) on the features that are available in online repositories. We propose to build a well-interpretable tree-based learner that unites strong elements of machine learning and biology: it accounts for complex molecular interactions to improve predictions, and it makes results more robust by weighting features during learning, using biological information as co-data. These weights are learnt from the data using empirical Bayes techniques. In terms of application, we focus on prognosis for three rare lymphoma entities, using a variety of genomics data. The project is a collaboration between Prof. Mark van de Wiel (PI), Dr. Thomas Klausch (both Dept. of Epidemiology & Biostatistics) and Prof. Daphne de Jong (Dept. of Pathology). It funds a PhD student and a postdoc.
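
To make the idea of co-data-weighted learning more concrete, the following is a minimal sketch in Python of one ingredient: drawing the candidate features for a tree so that features with larger co-data weights are chosen more often. It is an illustration under stated assumptions, not the project's implementation. The function name sample_candidate_features and the weight values are hypothetical, the weights are simply given here rather than estimated by empirical Bayes, and the sampling scheme (weighted sampling without replacement via random keys) is one possible choice.

```python
import random


def sample_candidate_features(weights, k, rng):
    """Draw k distinct feature indices, favouring features with larger weights.

    Weighted sampling without replacement using random keys u ** (1 / w):
    the k features with the largest keys are selected. Features with a
    weight of zero are never selected.
    """
    positive = [(j, w) for j, w in enumerate(weights) if w > 0]
    if k > len(positive):
        raise ValueError("k exceeds the number of features with positive weight")
    keys = [(rng.random() ** (1.0 / w), j) for j, w in positive]
    keys.sort(reverse=True)
    return [j for _, j in keys[:k]]


# Hypothetical co-data weights: the first two features are assumed to be
# supported by external biological information, the others are not.
codata_weights = [5.0, 5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

rng = random.Random(0)
counts = [0] * len(codata_weights)
for _ in range(2000):
    for j in sample_candidate_features(codata_weights, k=2, rng=rng):
        counts[j] += 1

# How often each feature was offered as a candidate over 2000 draws.
print(counts)
```

In a random forest, such a draw could replace the usual uniform choice of candidate features, so that trees are built mostly, but not exclusively, from features that the co-data marks as promising.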