A Hybrid Vision Transformer-Based Framework for Multi-Endpoint Survival Prediction in Oropharyngeal Cancer
Abstract
Purpose
This study presents a hybrid framework that combines Vision Transformer (ViT) image embeddings with handcrafted radiomic features to improve multi-endpoint survival prediction in oropharyngeal carcinoma (OPC).
Methods
A cohort of 512 OPC cases from The Cancer Imaging Archive (TCIA) was used, comprising CT images, primary gross tumor volume (GTVp) segmentations, and sixteen clinical features. From the expert GTVp contours, 873 handcrafted radiomic features were extracted with 3D Slicer, while a ViT used self-attention to capture global tumor context from the volumetric CT images. Multi-endpoint survival models were developed and evaluated using three feature combinations: (i) clinical variables combined with handcrafted radiomics; (ii) clinical variables combined with ViT-based image embeddings; and (iii) clinical variables combined with both handcrafted radiomics and ViT embeddings. In the multimodal model, radiomic and ViT features were fused through feature-level concatenation into a unified representation. Each feature set was input into Multi-Task Logistic Regression models to predict endpoint-specific continuous risk scores for overall survival (OS), local failure-free survival (LFFS), regional failure-free survival (RFFS), and distant failure-free survival (DFFS). Performance was assessed using five-fold cross-validation, the time-dependent area under the ROC curve (AUC), and Kaplan–Meier-based risk stratification.
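As an illustration of the feature-level concatenation described above, the following is a minimal sketch rather than the study's implementation. The function name `fuse_features`, the per-column z-scoring, and the ViT embedding width of 768 are assumptions made for the example; only the cohort size (512), the sixteen clinical features, and the 873 radiomic features come from the text.

```python
import numpy as np

def fuse_features(clinical, radiomic, vit):
    """Concatenate per-patient feature blocks along the feature axis.

    Hypothetical helper: each input is an array of shape (n_patients, n_features).
    """
    blocks = [np.asarray(b, dtype=float) for b in (clinical, radiomic, vit)]
    n_patients = blocks[0].shape[0]
    if any(b.ndim != 2 or b.shape[0] != n_patients for b in blocks):
        raise ValueError("each block must be 2-D with one row per patient")

    scaled = []
    for b in blocks:
        # Assumption: z-score every column so that no block dominates by scale.
        std = b.std(axis=0)
        std[std == 0] = 1.0  # leave constant columns at zero after centering
        scaled.append((b - b.mean(axis=0)) / std)

    # Feature-level fusion: one row per patient, all features side by side.
    return np.concatenate(scaled, axis=1)

# Synthetic stand-in data with the dimensions reported in the abstract.
rng = np.random.default_rng(0)
clinical = rng.normal(size=(512, 16))
radiomic = rng.normal(size=(512, 873))
vit = rng.normal(size=(512, 768))  # 768 is an assumed embedding width

fused = fuse_features(clinical, radiomic, vit)
```

In a cross-validated setting, the scaling statistics would normally be estimated on the training folds only and then applied to the held-out fold; the sketch scales the full matrix purely for brevity.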
Results
For overall survival, time-dependent ROC analysis was performed to compare the three feature combinations. The model using clinical variables combined with radiomic features achieved an AUC of 0.75, while the model using clinical variables combined with ViT-derived embeddings reached 0.82 (p = 0.003 versus the clinical plus radiomics model). The integrated model, which combined clinical variables, radiomic features, and ViT-derived embeddings, further improved performance, achieving an AUC of 0.844 (p = 0.001 versus the clinical plus radiomics model; p = 0.04 versus the clinical plus ViT model). Kaplan–Meier risk stratification based on the median risk score demonstrated clear separation between high- and low-risk groups (log-rank p < 0.001).
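The median-based risk stratification can be sketched as follows. This is an illustrative example on toy data, not the study's code: the helper names `median_split` and `kaplan_meier` are hypothetical, and the Kaplan–Meier estimator is written out directly in NumPy instead of relying on a survival analysis library.

```python
import numpy as np

def median_split(risk):
    """Return True for patients whose risk score is above the cohort median."""
    risk = np.asarray(risk, dtype=float)
    return risk > np.median(risk)

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate at each distinct event time.

    time  : follow-up time per patient
    event : 1/True if the event was observed, 0/False if censored
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])  # sorted distinct event times
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)            # still under observation at t
        n_events = np.sum((time == t) & event)  # events occurring exactly at t
        s *= 1.0 - n_events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Toy data (made up for illustration only).
risk = np.array([0.1, 0.9, 0.4, 0.7])
time = np.array([1.0, 2.0, 3.0, 4.0])
event = np.array([1, 0, 1, 1])

high = median_split(risk)               # high-risk group mask
km_times, km_surv = kaplan_meier(time, event)
```

Applying `kaplan_meier` separately to `time[high]` and `time[~high]` gives the two curves that a log-rank test would then compare.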
Conclusion
ViT-derived global context complements handcrafted radiomic features. The integrated model supports robust multi-endpoint survival prediction and improved risk stratification, offering a promising pathway toward personalized radiotherapy decision-making in OPC.