Multi-Scale Deep Learning with Multimodality Data for Postoperative Recurrence Prediction for Cervical Cancer Using MRI: A Multicenter Study
Abstract
Purpose
Cervical cancer (CC) remains one of the most common malignancies in women worldwide, and postoperative recurrence continues to limit long-term survival. Because clinical decision-making relies on multimodal information, integrating imaging, textual, and clinical data has the potential to improve predictive performance. We therefore aimed to develop a multimodal, multi-scale deep learning (DL) model for more precise prognosis prediction in operable CC.
Methods
This multicenter retrospective study included 445 patients with operable CC from three institutions, all of whom had preoperative MRI and corresponding radiology reports. A multi-scale model (MSM) combining ConvNeXt and a dual-path Vision Transformer (ViT) was constructed. Textual features were extracted from the radiology reports using BERT and fused with imaging and clinical features to form a multimodal network (MSM-TC). The SHapley Additive exPlanations (SHAP) method and attention visualization were used to improve model interpretability.
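The data flow of the multimodal fusion step can be illustrated with a minimal sketch. It assumes late fusion by simple concatenation followed by a logistic output head; the abstract does not state the fusion operator, so this is an illustrative assumption rather than the authors' implementation. The function names, feature vectors, and weights below are all hypothetical, and the short lists stand in for the much larger embeddings that ConvNeXt, ViT, and BERT would produce.

```python
import math


def fuse_features(image_feats, text_feats, clinical_feats):
    """Concatenate per-modality feature vectors into one fused vector.

    Concatenation is an assumed fusion operator, chosen for illustration only.
    """
    return list(image_feats) + list(text_feats) + list(clinical_feats)


def risk_score(fused, weights, bias=0.0):
    """Map a fused feature vector to a value in (0, 1) with a logistic head."""
    if len(fused) != len(weights):
        raise ValueError("fused vector and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, fused))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy features for a single patient.
image_feats = [0.2, -0.1]    # stand-in for MSM (ConvNeXt + ViT) image features
text_feats = [0.5]           # stand-in for BERT report features
clinical_feats = [1.0, 0.0]  # stand-in for encoded clinical variables

fused = fuse_features(image_feats, text_feats, clinical_feats)
weights = [1.0] * len(fused)  # hypothetical weights; a real model would learn them
score = risk_score(fused, weights)
print(f"fused length = {len(fused)}, risk score = {score:.3f}")
```

In a trained network the weights would be learned end to end and the fusion layer might be more elaborate (for example, attention-based); the sketch only shows how features from the three modalities could meet in a single prediction head.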
Results
Among single-branch models, ConvNeXt and ViT achieved the best predictive performance, with AUCs of 0.798 and 0.766 in the internal validation cohort and 0.656 and 0.704 in the external validation cohort, respectively. The MSM integrating ConvNeXt and ViT achieved AUCs of 0.944, 0.837, and 0.681 in the training, internal validation, and external validation cohorts, respectively. Incorporating textual information into the MSM (MSM-T) further enhanced performance, with AUCs of 0.902, 0.860, and 0.742, and the final multimodal model integrating imaging, textual, and clinical data (MSM-TC) achieved the best overall performance, with AUCs of 0.930, 0.860, and 0.798 across the three cohorts. Kaplan–Meier analysis confirmed that the model-derived risk score effectively stratified patients into high- and low-risk groups, supporting its prognostic value.
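The two evaluation steps reported above, AUC and Kaplan–Meier risk stratification, can be sketched in plain Python. This is a minimal illustration on made-up toy data, not the study's analysis code. The median cut-off used to define the high- and low-risk groups is an assumption (the abstract does not state how the threshold was chosen), and all function and variable names are hypothetical.

```python
from statistics import median


def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def split_by_cutoff(scores, cutoff):
    """Label each score 'high' (above the cutoff) or 'low' (at or below it)."""
    return ["high" if s > cutoff else "low" for s in scores]


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if recurrence was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            surv *= 1.0 - n_events / at_risk
            curve.append((t, surv))
        at_risk -= n_leaving
    return curve


# Made-up toy cohort: 1 = recurrence observed, 0 = censored.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]  # hypothetical model risk scores
times = [5, 30, 12, 24, 36, 8]           # follow-up in months

auc = roc_auc(labels, scores)
groups = split_by_cutoff(scores, median(scores))  # median cut-off is an assumption

curves = {}
for name in ("high", "low"):
    idx = [i for i, g in enumerate(groups) if g == name]
    curves[name] = kaplan_meier([times[i] for i in idx], [labels[i] for i in idx])

print(f"AUC = {auc:.3f}")
print(curves)
```

A real analysis would also compare the two curves with a log-rank test and report confidence intervals for the AUC; those steps are omitted here for brevity.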
Conclusion
Our study demonstrates that a multimodal, multi-scale DL framework integrating imaging, textual, and clinical data can achieve robust prediction of recurrence and survival in operable CC patients, highlighting its potential for individualized prognostic assessment and future clinical translation.