BEST IN PHYSICS (MULTI-DISCIPLINARY): Text-Guided Segmentation of Glioma Substructures on Multi-Parametric MRI
Abstract
Purpose
Accurate delineation of glioma substructures on multi-parametric MRI (mp-MRI) is critical for diagnosis, radiotherapy planning, and treatment response assessment. However, manual annotation of tumor subregions is labor-intensive and subject to inter-observer variability. This study aims to develop a text-guided multimodal segmentation framework that leverages a single foundation model with short semantic prompts (e.g., “edema,” “enhancing tumor”) to automatically segment glioma subregions from mp-MRI.
Methods
We adopted a multimodal text-conditioned segmentation model and applied it to mp-MRI volumes from the BraTS-GLI dataset, including T1-weighted contrast-enhanced (T1c), T1-weighted non-contrast (T1n), T2-weighted (T2w), and T2-FLAIR (T2f) sequences. The model jointly processes image volumes and short text prompts describing each target structure to produce voxel-wise segmentation masks. Segmentation performance was evaluated for the whole tumor (WT), necrotic core (NCR), edema (ED), and enhancing tumor (ET) using the Dice similarity coefficient (DSC) and normalized surface Dice (NSD) across 250 patients. We investigated both zero-shot inference, using the pretrained text vocabulary without additional training, and task-specific fine-tuning on the 1,001-patient training set.
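The two reported metrics can be computed per subregion from binary prediction and reference masks. As a minimal sketch (not the study's evaluation code), DSC measures volumetric overlap, while NSD measures the fraction of surface voxels of each mask lying within a tolerance of the other mask's surface; the tolerance and voxel spacing below are illustrative defaults:

```python
import numpy as np
from scipy import ndimage

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def _surface(mask):
    """Surface voxels: mask voxels removed by one binary erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def nsd(pred, gt, tolerance_mm=1.0, spacing=(1.0, 1.0, 1.0)):
    """Normalized surface Dice at a given tolerance (in mm)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    sp, sg = _surface(pred), _surface(gt)
    total = sp.sum() + sg.sum()
    if total == 0:
        return 1.0
    # Distance from every voxel to the nearest surface voxel of each mask,
    # accounting for anisotropic voxel spacing.
    dist_to_gt = ndimage.distance_transform_edt(~sg, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~sp, sampling=spacing)
    # Count surface voxels of each mask within tolerance of the other surface.
    within = (dist_to_gt[sp] <= tolerance_mm).sum() \
           + (dist_to_pred[sg] <= tolerance_mm).sum()
    return within / total
```

In practice these would be computed per patient and per subregion (WT, NCR, ED, ET) and then averaged across the 250 evaluation cases.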
Results
In the zero-shot setting, our model achieved DSC/NSD values of 0.849/0.694 (WT), 0.565/0.552 (NCR), 0.684/0.652 (ED), and 0.621/0.594 (ET). After fine-tuning, segmentation performance improved substantially to 0.930/0.887 (WT), 0.822/0.862 (NCR), 0.851/0.869 (ED), and 0.862/0.920 (ET). Qualitative visualization of representative cases demonstrated improved boundary adherence and subregion delineation following fine-tuning. Whole-tumor segmentation consistently achieved the highest accuracy, while enhancing and necrotic regions exhibited greater variability due to their smaller volumes and heterogeneous contrast patterns.
Conclusion
The proposed text-guided multimodal framework enables accurate and flexible segmentation of glioma substructures from mp-MRI without task-specific architectural modifications. This approach offers a scalable solution for automated annotation and quantitative tumor analysis, supporting downstream radiology and radiation-oncology applications. Future work will address calibration, scanner-dependent variability, and improved modeling of small and morphologically complex tumor subregions.