BLUE RIBBON POSTER MULTI-DISCIPLINARY: Improving Survival Prediction of Head-and-Neck Cancer with Medical Image Foundation Models and Multi-Modal Fusion
Abstract
Purpose
Accurate survival prediction in head and neck squamous cell carcinoma (HNSCC) is clinically important for risk stratification but remains challenging due to small cohort sizes and the difficulty of integrating high-dimensional PET/CT imaging with low-dimensional clinical covariates. We investigate whether pretrained, modality-specific medical image foundation models (FMs) can improve multimodal survival prediction performance and robustness under limited data.
Methods
We build on a strong PET/CT survival modeling baseline: a DeepMTS-style multi-task architecture with a Dynamic Affine Feature Map Transform (DAFT) block for image–tabular fusion. We augment this framework with pretrained CT and PET foundation model encoders, which are kept frozen and used to extract fixed image embeddings without task-specific fine-tuning. The resulting embeddings (512-D for CT and 768-D for PET) are optionally compressed by lightweight linear projection layers and then combined by late fusion, via concatenation, immediately before the survival prediction head. Experiments use the public HECKTOR 2022 training split (524 PET/CT cases from 7 centers) with recurrence-free survival labels and 9 clinical covariates, under 5-fold stratified cross-validation. Performance is reported as the mean±SD across folds of the concordance index (C-index) and the 1-year time-dependent AUROC.
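The late-fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the 64-D width of the baseline feature vector, the random (untrained) projection weights, and the choice to concatenate the projected embeddings with the baseline features are all assumptions. Only the 512-D and 768-D embedding sizes and the 256-D projection come from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_projection(in_dim, out_dim, rng):
    """Hypothetical linear projection head; in practice W and b would be learned."""
    W = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return W, b


def project(x, params):
    """Apply a linear projection to a batch of embeddings."""
    W, b = params
    return x @ W + b


def late_fusion(backbone_feat, ct_emb, pet_emb, ct_proj, pet_proj):
    """Project each frozen FM embedding, then concatenate along the feature axis."""
    return np.concatenate(
        [backbone_feat, project(ct_emb, ct_proj), project(pet_emb, pet_proj)],
        axis=1,
    )


batch = 4
ct_emb = rng.normal(size=(batch, 512))        # stand-in for frozen CT-FM output
pet_emb = rng.normal(size=(batch, 768))       # stand-in for frozen PET-FM output
backbone_feat = rng.normal(size=(batch, 64))  # assumed baseline feature width

ct_proj = make_projection(512, 256, rng)
pet_proj = make_projection(768, 256, rng)

fused = late_fusion(backbone_feat, ct_emb, pet_emb, ct_proj, pet_proj)
print(fused.shape)  # (4, 576): 64 + 256 + 256 features per case
```

Because the encoders are frozen, the embeddings could be computed once and cached, so that only the projection layers and the survival head need gradients during training.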
Results
Relative to the baseline DeepMTS+DAFT model (C-index 0.616±0.082; 1-year AUROC 0.671±0.158), adding frozen FM embeddings improved both metrics and lowered fold-to-fold variability in every reported configuration. CT-FM achieved C-index 0.715±0.075 and 1-year AUROC 0.710±0.083, while PET-FM (best with 128-D projection) achieved C-index 0.736±0.060 and 1-year AUROC 0.712±0.100. Late fusion of CT+PET FM embeddings (best with 256-D projection) gave the highest mean scores on both metrics (C-index 0.743±0.067; 1-year AUROC 0.744±0.094).
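For readers unfamiliar with the metric, the sketch below shows one common way to compute a C-index (Harrell's estimator) and to summarize per-fold scores as mean±SD. It is illustrative only: the abstract does not state which C-index estimator or SD convention was used, the function names are hypothetical, and the toy inputs are invented for the example rather than taken from the study.

```python
import numpy as np


def concordance_index(time, event, risk):
    """Harrell's C-index: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when case i had an observed event and
    time[i] < time[j]. Tied risk scores count as half-concordant.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # censored cases cannot anchor a comparable pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def summarize_folds(scores, ddof=1):
    """Mean and SD over cross-validation folds (sample SD, ddof=1, is an assumption)."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=ddof)


# Toy data, invented for illustration: four cases, the third is censored.
toy_time = [2, 4, 6, 8]
toy_event = [1, 1, 0, 1]
toy_risk = [0.9, 0.3, 0.5, 0.1]
print(concordance_index(toy_time, toy_event, toy_risk))  # 0.8 (4 of 5 comparable pairs)

# Toy per-fold scores, not values from the study.
mean, sd = summarize_folds([0.70, 0.80, 0.75])
print(f"{mean:.3f}±{sd:.3f}")  # 0.750±0.050
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values should be read.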
Conclusion
Frozen, modality-specific PET and CT foundation model embeddings provide stable and complementary prognostic signals for HNSCC survival modeling. Lightweight late fusion with simple projection heads improves both predictive performance and robustness without increasing training complexity, supporting the potential utility of FM-augmented multimodal models for clinically relevant risk stratification in limited-data settings.