A Hybrid Prompt–Guided Multimodal Framework for Automatic Clinical Target Volume Delineation in Esophageal Cancer: A Multicenter Study
Abstract
Purpose
Accurate clinical target volume (CTV) delineation is essential for radiotherapy in esophageal cancer but remains highly subjective and variable due to its reliance on physician experience. Most automated approaches focus on gross tumor volume (GTV) and do not adequately address CTV-related clinical complexity. This study aimed to develop and validate a hybrid prompt–guided multimodal framework for automatic CTV delineation in esophageal cancer.
Methods
This retrospective multicenter study included patients with esophageal cancer who underwent PET/CT at an internal center (n = 262) or an independent external center (n = 49). A hybrid prompt–guided ContextUNETR+ framework was proposed, integrating three complementary prompts: PET-derived image prompts, structural prompts from primary and nodal GTVs (GTVp + GTVn), and textual prompts from clinical baseline information. CT images, image prompts, and structural prompts were concatenated at the channel level and fed into the visual encoder, while textual features were extracted using a LLaMA3 large language model. Multimodal feature alignment was achieved via a Two-Way Transformer, followed by decoder-based CTV prediction. Model performance was evaluated using five-fold cross-validation in the internal cohort, and the best-performing model was further assessed in the external validation cohort. The Dice similarity coefficient (DSC) and the 95% Hausdorff distance (HD95) were used for evaluation.
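For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of how DSC and HD95 are commonly computed. It is not the evaluation code used in this study. The function names (`dice`, `hd95`), the representation of masks as collections of voxel indices, the nearest-rank percentile, and the choice to take the larger of the two directed 95th-percentile distances are all assumptions made for illustration; published implementations differ in these details.

```python
import math


def dice(pred, ref):
    """Dice similarity coefficient between two collections of voxel indices."""
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        return 1.0  # assumed convention: two empty masks count as full agreement
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


def _percentile_directed(src, dst, spacing, q):
    """q-th percentile (nearest rank) of nearest-neighbour distances src -> dst, in mm."""
    dst_mm = [tuple(c * s for c, s in zip(p, spacing)) for p in dst]
    dists = []
    for p in src:
        p_mm = tuple(c * s for c, s in zip(p, spacing))
        dists.append(min(math.dist(p_mm, d) for d in dst_mm))
    dists.sort()
    rank = max(1, -(-q * len(dists) // 100))  # integer ceil(q * n / 100)
    return dists[rank - 1]


def hd95(pred_surface, ref_surface, spacing=(1.0, 1.0, 1.0), q=95):
    """Symmetric percentile Hausdorff distance between two sets of surface voxels.

    `spacing` is the voxel size in mm along each axis (hypothetical default).
    """
    pred_surface, ref_surface = list(pred_surface), list(ref_surface)
    if not pred_surface or not ref_surface:
        raise ValueError("HD95 is undefined when either surface is empty")
    return max(
        _percentile_directed(pred_surface, ref_surface, spacing, q),
        _percentile_directed(ref_surface, pred_surface, spacing, q),
    )
```

Under these assumptions, `dice` returns a fraction in [0, 1], which would be multiplied by 100 to match the percentages reported here, and `hd95` returns a distance in millimetres when `spacing` is given in millimetres. The brute-force nearest-neighbour search is quadratic in the number of surface voxels and is intended only to make the definition concrete.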
Results
Across five-fold cross-validation in the internal cohort, the proposed method achieved mean DSC values ranging from 75.34% to 78.58% and HD95 values ranging from 19.06 to 27.82 mm. In the external validation cohort, the best-performing model achieved a DSC of 71.10% and an HD95 of 25.42 mm, indicating that performance generalized to an independent center. The proposed framework consistently outperformed comparison models across all evaluations.
Conclusion
The hybrid prompt–guided ContextUNETR+ framework enables effective integration of multimodal imaging, structural priors, and clinical textual information for automated CTV delineation in esophageal cancer. This approach demonstrates stable multicenter performance and has the potential to reduce inter-observer variability and support standardized target delineation in clinical radiotherapy practice.