PAMM: pathway-aware masked representation learning for interpretable multi-cancer prediction

Abstract

This thesis introduces PAMM, a new paradigm for interpretable multi-cancer prediction based on Pathway-Aware Masked Representation Learning. To address the ‘small n, large p’ problem of transcriptomics, we apply a rigorous seven-stage preprocessing pipeline (including Log2 transformation, ANOVA filtering, Lasso regularization, and Recursive Feature Elimination) that reduces the original, noisy set of 57,750 genes in Breast, Lung, GBM, and HC samples to a high-signal feature set. The architecture goes beyond conventional black-box deep learning by incorporating biologically relevant priors from the KEGG 2021 Human library into a self-supervised masking scheme. In contrast to random masking, the pretraining phase of PAMM uses Pathway-Aware Masking, in which complete functional gene sets are zeroed out, requiring the model to reconstruct the missing biological units and learn complex inter-pathway relationships. The model's latent representations are optimized with Optuna, statistical robustness is verified over twenty independent runs, and the latent representation is further interpreted with single-sample Gene Set Enrichment Analysis (ssGSEA). The resulting visualizations of mean pathway activity indicate that PAMM captures distinct, clinically relevant biological signatures for each cancer type. By bridging the gap between high-dimensional self-supervised learning and functional biology, PAMM provides a transparent and precise diagnostic platform for precision oncology.
In addition to closed-set multi-cancer prediction, PAMM is explicitly designed for open-set recognition. By combining softmax confidence with latent-space distances from class centroids, the framework can reject samples that do not lie on any known cancer manifold. This enables reliable identification of unknown or non-cancerous gene expression patterns, which is essential for real-world clinical deployment, where previously unseen conditions are the norm. This dual capability sets PAMM apart from traditional classifiers and supports both diagnostic accuracy and safety.
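The two mechanisms named in the abstract, pathway-aware masking and open-set rejection, can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names, the per-sample choice of masked pathways, the use of Euclidean distance to the predicted class centroid, and the thresholds `conf_min` and `dist_max` are all assumptions made for illustration.

```python
import numpy as np


def pathway_aware_mask(X, pathways, n_masked, rng):
    """Zero every gene of `n_masked` randomly chosen pathways, per sample.

    X        : (n_samples, n_genes) expression matrix
    pathways : list of integer index arrays, one per gene set (hypothetical input)
    Returns the masked copy and a boolean array marking the zeroed entries.
    """
    X_masked = X.copy()
    mask = np.zeros(X.shape, dtype=bool)
    for i in range(X.shape[0]):
        # Assumption: each sample gets its own random subset of pathways.
        chosen = rng.choice(len(pathways), size=n_masked, replace=False)
        for p in chosen:
            mask[i, pathways[p]] = True
    X_masked[mask] = 0.0  # whole gene sets are zeroed, not single genes
    return X_masked, mask


def open_set_predict(probs, latents, centroids, conf_min, dist_max):
    """Return the predicted class per sample, or -1 for rejected samples.

    Assumption: a sample is rejected when its top softmax probability is
    below `conf_min` OR its Euclidean distance to the centroid of the
    predicted class exceeds `dist_max`.
    """
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    dist = np.linalg.norm(latents - centroids[pred], axis=1)
    reject = (conf < conf_min) | (dist > dist_max)
    return np.where(reject, -1, pred)


# Toy usage with made-up data: 6 genes grouped into 3 gene sets.
rng = np.random.default_rng(0)
X = rng.random((4, 6)) + 1.0  # strictly positive, so zeros come only from masking
pathways = [np.array([0, 1]), np.array([2, 3, 4]), np.array([5])]
X_masked, mask = pathway_aware_mask(X, pathways, n_masked=1, rng=rng)

probs = np.array([[0.90, 0.10], [0.55, 0.45], [0.95, 0.05]])
latents = np.array([[0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = open_set_predict(probs, latents, centroids, conf_min=0.7, dist_max=1.0)
```

In this toy example the second sample is rejected for low confidence and the third for lying far from its predicted class centroid, so only the first sample keeps a class label. In practice the thresholds would have to be calibrated on held-out data.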

Description

Cataloged from PDF version of thesis.
Includes bibliographical references (pages 66-69).
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2026.

Type

Thesis