Zero-Shot Segment Anything Model 2 (SAM2)–Driven Interactive Delineation of Esophageal Clinical Target Volume: A Study of Prompting Strategy
Abstract
Purpose
Accurate delineation of the esophageal clinical target volume (CTV) is challenging because of its long longitudinal extent and substantial slice-to-slice shape variation. Contours produced by existing deep learning methods typically require extensive manual revision, and interactive segmentation offers an alternative. We investigate the potential of Segment Anything Model 2 (SAM2) as an interactive segmentation model and identify effective prompting strategies.
Methods
Zero-shot SAM2.1_hiera_l was evaluated on data from 83 postoperative esophageal cancer patients receiving radiotherapy. Bounding box and mask prompts, both derived from the ground truth (GT), were compared under a uniform placement strategy. The number of prompts ranged from 2 to 10, and the superior and inferior boundary slices were always included. Using the selected prompt type, an incremental slice selection strategy with exhaustive evaluation was applied to identify optimal prompt configurations. Two indices, the normalized relative z-position and the cross-sectional CTV area, were used to characterize the selection patterns of prompt slices. Evaluation metrics were the Dice similarity coefficient (DSC) and the 95th percentile Hausdorff distance (HD95).
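As an illustration of the uniform placement strategy and the DSC metric, the following is a minimal sketch rather than the study's implementation. The function names, the use of rounding to map evenly spaced positions to slice indices, and the representation of masks as sets of voxel coordinates are all assumptions made for this example.

```python
def uniform_prompt_slices(first_slice, last_slice, n_prompts):
    """Pick prompt slices evenly spaced between two boundary slices.

    Assumption: slices are integer indices and both boundary slices
    (first_slice and last_slice) are always part of the selection.
    """
    if n_prompts < 2:
        raise ValueError("at least the two boundary slices are required")
    if last_slice < first_slice:
        raise ValueError("last_slice must not precede first_slice")
    span = last_slice - first_slice
    # Evenly spaced positions, rounded to the nearest slice index.
    indices = [first_slice + round(i * span / (n_prompts - 1))
               for i in range(n_prompts)]
    # Remove duplicates that occur when the volume has few slices.
    return sorted(set(indices))


def dice(pred, truth):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # Convention chosen here: two empty masks agree fully.
    return 2 * len(pred & truth) / (len(pred) + len(truth))


if __name__ == "__main__":
    print(uniform_prompt_slices(10, 50, 5))
    print(dice({(0, 0), (0, 1), (1, 1)}, {(0, 1), (1, 1), (2, 2)}))
```

In this sketch, requesting more prompts than there are slices simply returns every slice once, since duplicate indices are dropped.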
Results
Segmentation accuracy increased with the number of prompts. Under uniform placement, mask prompts outperformed bounding box prompts on both evaluation metrics. With optimal slice selection, mask prompts achieved significantly higher DSC than with uniform placement across prompt numbers (p < 0.05). Performance gains diminished beyond five prompts (DSC: 0.85 ± 0.05, HD95: 7.52 ± 3.04 mm). Selected slices tended to have larger target areas and were more frequently located toward the upper portion of the CTV.
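The two slice-level indices used to describe the selected slices can be sketched as follows. This is a hedged illustration only: the abstract does not define the orientation of the normalized z-position, so the convention here (0 at the inferior boundary, 1 at the superior boundary) is an assumption, as are the function names and the `pixel_spacing_mm` parameter.

```python
def relative_z_position(slice_index, inferior_index, superior_index):
    """Normalized position of a slice along the CTV.

    Assumed convention: 0.0 at the inferior boundary slice and
    1.0 at the superior boundary slice.
    """
    if superior_index == inferior_index:
        raise ValueError("CTV must span more than one slice")
    return (slice_index - inferior_index) / (superior_index - inferior_index)


def cross_sectional_area(mask_slice, pixel_spacing_mm=(1.0, 1.0)):
    """Area of the CTV on one slice, in square millimetres.

    mask_slice is a 2D list of 0/1 values; pixel_spacing_mm is the
    (row, column) spacing, a hypothetical parameter for this example.
    """
    n_pixels = sum(1 for row in mask_slice for value in row if value)
    return n_pixels * pixel_spacing_mm[0] * pixel_spacing_mm[1]


if __name__ == "__main__":
    print(relative_z_position(30, 10, 50))
    print(cross_sectional_area([[0, 1, 1], [0, 1, 0]], (0.5, 0.5)))
```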
Conclusion
Zero-shot SAM2 demonstrated potential for esophageal CTV delineation, achieving high accuracy with only five GT slice annotations. Prompt slices should preferably be chosen where the target area is larger and in the upper portion of the CTV. By incorporating clinical prior knowledge, interactive models can improve delineation accuracy in complex target volumes. This approach holds promise for establishing a new clinical workflow and a more efficient delineation process.