New machine learning tools enable precise cancer subtype classification

A multi-institutional team has developed a comprehensive suite of machine learning models that can accurately classify cancer samples into molecular subtypes, bridging a critical gap between genomic research discoveries and clinical implementation.

Prostate cancer cells

Researchers have created a new computational resource that enables rapid and accurate classification of tumour samples based on their molecular characteristics. The resource, detailed in Cancer Cell, uses machine learning to analyse complex genomic data and match patient samples to cancer subtypes previously defined by The Cancer Genome Atlas (TCGA) Network.

The team, led by researchers from Van Andel Institute, Broad Institute of MIT and Harvard, and Oregon Health & Science University, developed 737 ready-to-use prediction models covering 26 cancer types and 106 distinct molecular subtypes. These models can process data from various genomic platforms, including gene expression, DNA methylation, mutations, and other molecular features.

The researchers employed five different machine learning methods to develop their classification models, training them on data from 8,791 TCGA cancer samples. They found that for many cancer types, gene expression data alone was sufficient for accurate subtype classification.

“Since many TCGA molecular subtypes were generated using hundreds or thousands of features from multiple data types, scientists and physicians have asked us for help subtyping their samples,” said Dr Andrew D. Cherniack from the Broad Institute. “Our resource greatly simplifies this process.”

Validation with external datasets

The team validated their models using independent datasets, including the METABRIC breast cancer cohort and AURORA study samples. The models demonstrated robust performance even when analysing data generated using different technological platforms or from formalin-fixed paraffin-embedded tissue samples.

The authors note: “For most research applications, the classifier models can be applied directly to samples from other studies, even when only one data type is available, after appropriate data transformation to match the range and distribution in the TCGA cohort.”
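To illustrate what "matching the range and distribution" can mean in practice, here is a minimal sketch that rescales each feature of a new cohort to the mean and standard deviation of a reference cohort. It is an assumption-laden illustration, not the transformation used by the authors: the function name `match_to_reference` and the toy arrays are hypothetical, and real pipelines may use other approaches such as quantile normalisation.

```python
import numpy as np

def match_to_reference(new_data, ref_mean, ref_std):
    """Rescale each feature (column) of new_data to a reference mean and std.

    Hypothetical helper for illustration only; not part of the published resource.
    """
    new_data = np.asarray(new_data, dtype=float)
    mean = new_data.mean(axis=0)
    std = new_data.std(axis=0)
    # Avoid division by zero for features that are constant in the new cohort.
    std = np.where(std == 0, 1.0, std)
    z = (new_data - mean) / std          # standardise within the new cohort
    return z * ref_std + ref_mean        # map onto the reference scale

# Toy data standing in for a reference cohort and a new cohort
# measured on a different scale (values are synthetic).
rng = np.random.default_rng(0)
reference = rng.normal(loc=8.0, scale=2.0, size=(200, 3))
new_cohort = rng.normal(loc=100.0, scale=30.0, size=(50, 3))

adjusted = match_to_reference(
    new_cohort, reference.mean(axis=0), reference.std(axis=0)
)
```

After this step, each column of `adjusted` has the same mean and standard deviation as the corresponding reference column, so a model trained on the reference scale sees inputs in a familiar range.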

Sample size requirements

One crucial finding was that approximately 150 samples are typically adequate for achieving maximum model performance in most cancer types. The researchers developed a mathematical framework to predict classification accuracy based on sample size, helping inform future study designs.
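The article does not describe the form of that framework. As a rough illustration of the general idea, the sketch below fits a saturating learning curve, accuracy(n) = a - b * n^(-c), to accuracy measured at several training-set sizes and then predicts accuracy at a new size. The inverse power law, the function name `fit_learning_curve`, and the synthetic data points are all assumptions made for this example, not the authors' method or results.

```python
import numpy as np

def fit_learning_curve(n, acc, c_grid=None):
    """Fit acc(n) = a - b * n**(-c).

    Grid search over the exponent c; for each c, a and b are found by
    ordinary least squares. Hypothetical helper for illustration only.
    """
    n = np.asarray(n, dtype=float)
    acc = np.asarray(acc, dtype=float)
    if c_grid is None:
        c_grid = np.linspace(0.1, 2.0, 191)  # step of 0.01
    best = None
    for c in c_grid:
        # Design matrix for the linear part: acc = a * 1 + b * (-n**-c)
        X = np.column_stack([np.ones_like(n), -n ** (-c)])
        coef, *_ = np.linalg.lstsq(X, acc, rcond=None)
        sse = float(np.sum((X @ coef - acc) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], c)
    _, a, b, c = best
    return a, b, c

# Synthetic, noise-free points generated from a known curve (not real results).
sizes = np.array([20, 40, 80, 150, 300], dtype=float)
accuracy = 0.9 - 1.5 * sizes ** (-0.7)

a, b, c = fit_learning_curve(sizes, accuracy)
predicted_at_150 = a - b * 150.0 ** (-c)
```

In a curve of this shape, `a` is the accuracy ceiling, and the sample size at which the curve comes within a chosen tolerance of `a` gives a planning estimate for how many samples a study needs.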

Clinical implications

The work represents a significant step toward translating complex genomic findings into practical clinical tools. The authors explain that “these tumour subtype classifiers will be useful in prospective clinical cancer research and practice, possibly helping to realize the as-yet unfulfilled promise of translating genomic findings to the clinic.”

Resource availability

All models and associated tools have been made freely available to the research community through an online repository. The resource includes containerized versions of the prediction models and detailed documentation for implementation.

“A major element of this effort was working to ensure that these models could be deployed by other groups onto new datasets,” said Kyle Ellrott, Ph.D., of the Knight Cancer Institute at Oregon Health & Science University. “All too often this type of work is difficult to replicate or apply to new samples.”

The research team emphasises that while these tools provide a foundation for clinical assay development, further validation and optimization may be needed for specific clinical applications.

Reference:

Ellrott, K., Wong, C. K., Yau, C., et al. (2 January 2025). Classification of non-TCGA cancer samples to TCGA molecular subtypes using compact feature sets. Cancer Cell, 43(1), 1-18. https://doi.org/10.1016/j.ccell.2024.12.002