A Self-Supervised Transformer Framework for Interpretable Extranodal Extension Detection in Head and Neck Cancer
Abstract
Purpose
Extranodal extension of metastatic lymph nodes (also termed extracapsular extension, ECE) is a major adverse prognostic factor in head and neck cancer and strongly influences treatment intensity. However, pre-operative identification of ECE on CT is challenging, subjective, and inconsistent across observers. This study aims to develop an interpretable deep-learning framework that automatically detects ECE from routine CT scans and provides localized visual evidence to support clinical decision-making.
Methods
We curated a dataset of 3D head-and-neck CT scans from 150 patients with pathology-confirmed ECE status. Clinical radiotherapy planning contours (RTSTRUCT) were exported and converted into paired CT–mask volumes defining nodal regions of interest. We designed a 3D dual-output SwinUNETR architecture that jointly performs voxel-level lymph-node region-of-interest (ROI) segmentation and case-level ECE classification, with the segmentation output used to constrain and interpret the classification. To address data scarcity and heterogeneity, we applied masked-autoencoder (MAE) pretraining on 215 unlabeled CT volumes before fine-tuning on labeled ECE data. Additional design elements included anatomically informed soft-tissue mask dilation to reduce sensitivity to contour variability, class-imbalance–aware optimization, and mask-guided pooling that explicitly links classification decisions to predicted nodal regions. Performance was evaluated using the Dice score for segmentation and AUC, accuracy, sensitivity, specificity, and F1-score for classification.
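As a rough illustration of the mask-guided pooling idea mentioned above, the sketch below averages per-voxel feature vectors weighted by the predicted nodal-mask probability, so that the case-level feature is driven by voxels the model believes are nodal. This is a minimal sketch under stated assumptions, not the implementation used in this study: the function name `mask_guided_pool`, the flattened-list data layout, the `eps` stabilizer, and the toy values are all hypothetical.

```python
from typing import List, Sequence


def mask_guided_pool(features: Sequence[Sequence[float]],
                     mask_probs: Sequence[float],
                     eps: float = 1e-6) -> List[float]:
    """Mask-weighted average of per-voxel feature vectors (hypothetical helper).

    features:   one feature vector per voxel (volume flattened to a list)
    mask_probs: predicted probability in [0, 1] that each voxel is nodal ROI
    eps:        small constant (assumed) to avoid division by zero on empty masks
    """
    if not features:
        raise ValueError("features must contain at least one voxel")
    if len(features) != len(mask_probs):
        raise ValueError("features and mask_probs must cover the same voxels")

    n_channels = len(features[0])
    total_weight = sum(mask_probs) + eps
    pooled = [0.0] * n_channels
    for vec, weight in zip(features, mask_probs):
        for c in range(n_channels):
            pooled[c] += weight * vec[c]
    return [p / total_weight for p in pooled]


# Toy example (illustrative values only): 4 voxels, 2 feature channels.
# Only the first two voxels are marked as nodal, so the last two are ignored.
toy_features = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0], [-50.0, 7.0]]
toy_mask = [1.0, 1.0, 0.0, 0.0]
pooled = mask_guided_pool(toy_features, toy_mask)
```

In a full model, the pooled vector would typically feed a small classification head; because voxels outside the predicted mask receive near-zero weight, the case-level prediction can be traced back to the segmented nodal region.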
Results
The dual-output SwinUNETR without pretraining achieved a Dice score of 0.808 for nodal ROI segmentation and a classification AUC of 0.681. Incorporating MAE pretraining improved performance to Dice = 0.822 and AUC = 0.886, with accuracy of 0.782, sensitivity of 0.833, specificity of 0.727, and F1-score of 0.800, demonstrating a substantial gain in ECE classification while maintaining localization within nodal regions.
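For readers less familiar with the reported metrics, the sketch below shows how Dice, accuracy, sensitivity, specificity, and F1-score are conventionally computed from binary predictions and confusion-matrix counts. It is a minimal illustration using standard definitions; the function names and all example values are hypothetical and are not taken from this study's data. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
from typing import Dict, Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 2.0 * intersection / denom if denom else 1.0


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    def safe_div(num: float, den: float) -> float:
        return num / den if den else 0.0

    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": safe_div(tp, tp + fn),   # true-positive rate
        "specificity": safe_div(tn, tn + fp),   # true-negative rate
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical values for illustration only (not the study's results).
toy_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
toy_metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)
```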
Conclusion
A self-supervised dual-output SwinUNETR enables accurate and interpretable CT-based detection of ECE under limited labeled data. By providing nodal-region localization alongside case-level ECE risk estimates, this framework has the potential to improve pre-operative staging, reduce unnecessary neck dissections, and support imaging-driven treatment personalization in head-and-neck cancer.