Text-Guided Segmentation of Cardiac Substructures on Contrast-Enhanced CT
Abstract
Purpose
Dose sparing of cardiac substructures has the potential to improve thoracic radiotherapy outcomes beyond what conventional whole-heart dose-volume metrics allow. However, manual delineation of numerous substructures is prohibitively labor-intensive, limiting the availability of high-quality annotations for training auto-contouring models. This study introduces a text-guided segmentation framework that supports both limited-vocabulary zero-shot inference and expanded-vocabulary fine-tuning for cardiac substructure segmentation on contrast-enhanced CT.
Methods
We adapted the Segment Anything with Text (SAT) foundation model for cardiac substructure segmentation. For zero-shot inference, clinically relevant substructure names were mapped to SAT’s pretrained prompt vocabulary, and the corresponding structures were segmented on an institutional test cohort of 20 patients with thoracic CT scans. For expanded-vocabulary segmentation, SAT-Nano was fine-tuned for 50 epochs on an additional 49 patients from the same cohort and evaluated on the 20-patient test set. Performance was assessed using the Dice similarity coefficient (DSC) and normalized surface Dice (NSD) and compared with a 3D nnU-Net baseline trained on the same data split.
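The two evaluation metrics can be illustrated with a minimal sketch. This is not the study's evaluation code: masks are represented as sets of voxel indices, the surface is approximated by boundary voxels (reference NSD implementations typically weight by surface area), and the helper `box`, the tolerance value, and the voxel spacing are hypothetical choices made only for the example.

```python
from itertools import product
from math import dist

Voxel = tuple[int, int, int]

# Offsets of the six face-adjacent neighbours of a voxel.
_FACE_NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def box(lo: Voxel, hi: Voxel) -> set[Voxel]:
    """Hypothetical helper: all voxels with lo <= index < hi along each axis."""
    return set(product(*(range(a, b) for a, b in zip(lo, hi))))


def dice(pred: set[Voxel], ref: set[Voxel]) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not pred and not ref:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def surface(mask: set[Voxel]) -> set[Voxel]:
    """Voxels of the mask that have at least one face neighbour outside it."""
    return {
        v for v in mask
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in mask
               for dx, dy, dz in _FACE_NEIGHBOURS)
    }


def normalized_surface_dice(
    pred: set[Voxel],
    ref: set[Voxel],
    tolerance_mm: float,
    spacing_mm: tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Voxel-count approximation of NSD: the fraction of surface voxels of
    either mask lying within `tolerance_mm` of the other mask's surface."""
    surf_pred, surf_ref = surface(pred), surface(ref)
    if not surf_pred and not surf_ref:
        return 1.0
    if not surf_pred or not surf_ref:
        return 0.0

    def to_mm(v: Voxel) -> tuple[float, ...]:
        return tuple(c * s for c, s in zip(v, spacing_mm))

    pred_mm = [to_mm(v) for v in surf_pred]
    ref_mm = [to_mm(v) for v in surf_ref]

    def count_within(src, dst) -> int:
        # Brute-force nearest-surface search; fine for small illustrative masks.
        return sum(1 for p in src if min(dist(p, q) for q in dst) <= tolerance_mm)

    agree = count_within(pred_mm, ref_mm) + count_within(ref_mm, pred_mm)
    return agree / (len(pred_mm) + len(ref_mm))


reference = box((0, 0, 0), (4, 4, 4))
prediction = box((1, 0, 0), (5, 4, 4))  # same cube shifted one voxel along x
print(dice(prediction, reference))                                       # 0.75
print(normalized_surface_dice(prediction, reference, tolerance_mm=1.0))  # 1.0
```

The shifted cube shows why the two metrics are reported together: the volumetric overlap drops to 0.75, while every surface voxel still lies within the 1 mm tolerance, so the surface-based score stays at 1.0.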
Results
In zero-shot evaluation across 11 in-vocabulary structures, SAT achieved mean DSCs of 0.70 (SAT-Pro) and 0.69 (SAT-Nano), with the highest accuracy for the left atrium (0.861 ± 0.034), pulmonary artery (0.815 ± 0.055), and aorta (0.788 ± 0.051). After fine-tuning on a total of 22 structures, SAT demonstrated strong performance on large-volume cardiac structures, defined as those above the median volume (macro DSC 0.863; NSD 0.628), including the aorta (0.927 ± 0.021), left ventricle (0.918 ± 0.010), and pulmonary artery (0.878 ± 0.036). Smaller structures remained challenging, particularly coronary arteries and conduction nodes (DSC 0.1657-0.466). Compared with nnU-Net, fine-tuned SAT achieved comparable DSC on large structures (macro DSC 0.863 vs. 0.881) and improved surface agreement (NSD 0.638 vs. 0.590).
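The large-structure summary can be sketched as follows, assuming that "macro" denotes an unweighted mean over structures and that a structure counts as large when its volume exceeds the median across all structures. The function name, the structure names, and every number below are hypothetical placeholders, not values from this study.

```python
from statistics import mean, median


def macro_dsc_large(
    volumes_ml: dict[str, float],
    dsc: dict[str, float],
) -> tuple[float, list[str]]:
    """Unweighted mean DSC over structures whose volume exceeds the median.

    Returns the macro DSC and the sorted names of the structures included.
    """
    cutoff = median(volumes_ml.values())
    large = sorted(name for name, vol in volumes_ml.items() if vol > cutoff)
    return mean(dsc[name] for name in large), large


# Hypothetical per-structure mean volumes (mL) and mean DSCs.
volumes = {"structure_a": 10.0, "structure_b": 20.0, "structure_c": 30.0, "structure_d": 40.0}
scores = {"structure_a": 0.40, "structure_b": 0.60, "structure_c": 0.80, "structure_d": 0.90}

macro, included = macro_dsc_large(volumes, scores)
print(included)          # ['structure_c', 'structure_d']
print(round(macro, 3))   # 0.85
```

Because each structure contributes equally regardless of size, a macro average of this kind keeps a single very large structure from dominating the summary.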
Conclusion
Text-guided segmentation enables both limited-vocabulary zero-shot contouring and expanded-vocabulary fine-tuning for cardiac substructure segmentation on contrast-enhanced CT and may facilitate substructure-specific dose sparing in radiotherapy planning.