Clinical Context–Aware Segmentation of Tumor Targets and Organs at Risk Using Multimodal Foundation Models
Abstract
Purpose
Target delineation remains one of the most significant workflow bottlenecks in radiotherapy. In particular, clinical target volume (CTV) segmentation is challenging because it depends on clinical reasoning beyond visible imaging features. This study aims to develop and evaluate a multimodal AI framework that integrates clinical context with imaging data to improve automated and interactive semi-automated segmentation of tumor targets, especially the CTV, while maintaining high accuracy for other radiotherapy structures.
Methods
We propose a multimodal segmentation framework that combines a vision transformer–based foundation model with a large language model (LLM). The image encoder was fine-tuned with low-rank adaptation (LoRA) to capture CT-specific features, while the LLM encoded structured clinical information (e.g., TNM stage, histology, target type) into text prompt embeddings. These embeddings were fused with spatial prompts and image features through a transformer-based mask decoder that uses cross-attention and memory mechanisms to enforce 3D consistency. The model was evaluated on the public NSCLC Radiomics dataset (300 cases), with 260 cases used for training and 40 for testing. Segmentation accuracy was assessed with 3D and 2D Dice similarity coefficients.
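To make the fusion step concrete, the following is a minimal sketch of how text prompt embeddings and spatial prompt embeddings might be combined with image features through cross-attention in a mask decoder. All module names, tensor dimensions, and the choice of PyTorch are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: module names, dimensions, and fusion order are
# assumptions; the paper's actual decoder may differ.
import torch
import torch.nn as nn

class PromptFusionDecoder(nn.Module):
    """Fuses clinical-text and spatial prompt tokens with image features
    via cross-attention, then predicts per-pixel mask logits for one slice."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mask_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1)
        )

    def forward(self, image_feats, text_emb, spatial_emb):
        # image_feats: (B, H*W, C) patch features from the fine-tuned ViT encoder
        # text_emb:    (B, T, C)   clinical-context embeddings from the LLM
        # spatial_emb: (B, S, C)   encoded point/box prompts
        prompts = torch.cat([text_emb, spatial_emb], dim=1)  # (B, T+S, C)
        # Image tokens attend to the combined prompt tokens.
        attended, _ = self.cross_attn(query=image_feats,
                                      key=prompts, value=prompts)
        fused = self.norm(image_feats + attended)            # residual + norm
        return self.mask_head(fused)                         # (B, H*W, 1) logits

# Toy forward pass with random tensors (B=1, 32x32 patch grid, C=256).
decoder = PromptFusionDecoder()
logits = decoder(torch.randn(1, 1024, 256),
                 torch.randn(1, 16, 256),
                 torch.randn(1, 2, 256))
print(logits.shape)  # torch.Size([1, 1024, 1])
```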
Results
The proposed method achieved mean 3D Dice scores of 0.742 for the gross tumor volume (GTV), 0.710 for the CTV, and 0.972 for the lung organ at risk (OAR), and mean 2D Dice scores of 0.925 for GTV and 0.910 for CTV (for slices without visible tumor, correctly predicted empty masks were assigned a Dice score of 1.0, so 2D Dice scores may appear relatively high). It outperformed conventional 3D U-Net and Medical SAM-based approaches. Incorporating clinical text prompt embeddings consistently improved performance across all targets compared with image-only settings.
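The parenthetical scoring convention can be stated precisely in a few lines. The snippet below is a hypothetical illustration of that per-slice convention, not the authors' evaluation code.

```python
import numpy as np

def dice_2d(pred: np.ndarray, gt: np.ndarray) -> float:
    """Per-slice Dice with the empty-slice convention described above:
    if both ground truth and prediction are empty, the slice scores 1.0."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if not gt.any() and not pred.any():
        return 1.0  # correctly predicted empty mask
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom

# Example: an all-empty slice scores 1.0; two 2-pixel masks with a
# 1-pixel overlap score 2*1 / (2+2) = 0.5.
gt = np.zeros((4, 4)); gt[0, :2] = 1        # two foreground pixels
pred = np.zeros((4, 4)); pred[0, 1:3] = 1   # one-pixel overlap with gt
print(dice_2d(np.zeros((4, 4)), np.zeros((4, 4))))  # 1.0
print(dice_2d(pred, gt))                            # 0.5
```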
Conclusion
Integrating clinical context via LLM-driven prompt embeddings substantially enhances automated target delineation in radiotherapy. The proposed multimodal framework enables accurate, universal segmentation of both tumor targets and OARs with a single model, demonstrating strong potential to reduce contouring workload and improve radiotherapy workflow efficiency.