Blinded Multi-Rater Evaluation of Automated Brain Metastasis Segmentation Using a Hybrid Foundation-Model Framework
Abstract
Purpose
Accurate segmentation of brain metastases (BM) is essential for stereotactic radiosurgery planning and longitudinal assessment, yet manual contouring is time-consuming and subject to substantial inter-observer variability. We developed TUM-SAM, a hybrid foundation-model–based framework for fully automated BM segmentation, and evaluated it in a blinded, bias-controlled multi-rater analysis that assessed both its agreement with expert physicians' contours and physicians' preference between its contours and those of experts.
Methods
TUM-SAM combines nnU-Net–based lesion detection with a tumor-adapted Med-SAM segmentation model to enable prompt-free, fully automated BM segmentation. The model was trained on 301 patients (2,548 lesions) and evaluated on an independent external cohort of 105 patients (397 lesions). Segmentation performance was compared against DeepMedic and nnU-Net using the Dice similarity coefficient and the 95th-percentile Hausdorff distance (HD95). Two physicians independently contoured all external cases, and a third physician contoured a 20-patient subset. A blinded, tumor-level preference study was performed, and pairwise preferences were analyzed using a Bradley–Terry probabilistic model to account for rater-specific bias and case difficulty.
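For readers unfamiliar with the overlap metric named above, the following is a minimal sketch of the Dice similarity coefficient on binary masks. It is an illustration only, not the authors' evaluation code: the function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made here. HD95 is omitted because its definition varies between implementations.

```python
import numpy as np

def dice_coefficient(pred, ref):
    """Dice = 2 * |pred AND ref| / (|pred| + |ref|) for binary masks.

    Hypothetical helper for illustration; inputs are array-likes of the
    same shape whose nonzero entries mark lesion voxels.
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, ref).sum() / denom)

# Toy 2x2 masks (not data from the study): one overlapping voxel,
# two predicted voxels, one reference voxel.
pred_mask = [[1, 1], [0, 0]]
ref_mask = [[1, 0], [0, 0]]
score = dice_coefficient(pred_mask, ref_mask)
```

In a lesion-wise evaluation such a function would be applied per lesion and the values averaged, but the exact aggregation used in the study is not specified in this abstract.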
Results
In the external cohort, TUM-SAM achieved a lesion-wise detection sensitivity of 0.94 and higher segmentation accuracy than both baselines across all tumor sizes, with a mean Dice of 0.84 and HD95 of 1.9 mm. In comparison, nnU-Net and DeepMedic achieved Dice values below 0.70 and HD95 values exceeding 3.3 mm. TUM-SAM’s voxel-wise performance fell within the range of inter-observer variability among physicians. In the blinded preference study, physicians favored TUM-SAM contours in 81–87% of raw comparisons. After bias correction using the Bradley–Terry model, the preference advantage narrowed but remained consistent, with estimated win probabilities of 55–56%.
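To make the notion of a Bradley–Terry win probability concrete, the sketch below fits the plain Bradley–Terry model to a matrix of pairwise win counts using the standard minorization-maximization update. It is a simplified illustration under stated assumptions: it does not include the rater-specific bias or case-difficulty terms of the model used in the study, and the function names and the win counts are hypothetical, not study data.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry strengths from a square win-count matrix.

    wins[i, j] is how often item i was preferred over item j (diagonal 0).
    Assumes every item appears in at least one comparison.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T             # comparisons between each pair
    total_wins = wins.sum(axis=1)     # total wins per item
    p = np.full(n, 1.0 / n)           # uniform starting strengths
    for _ in range(n_iter):
        pair_sum = p[:, None] + p[None, :]
        denom = (games / pair_sum).sum(axis=1)
        new_p = total_wins / denom    # MM update
        new_p /= new_p.sum()          # fix the scale: strengths sum to 1
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

def win_probability(p, i, j):
    """Model probability that item i is preferred over item j."""
    return float(p[i] / (p[i] + p[j]))

# Hypothetical counts: contour source 0 preferred 30 times, source 1 20 times.
example_wins = [[0, 30], [20, 0]]
strengths = fit_bradley_terry(example_wins)
prob_0_over_1 = win_probability(strengths, 0, 1)
```

With only two items the fitted win probability equals the raw win fraction. The estimates can differ from raw fractions once more items or additional covariates enter the model, which is consistent with the gap between the raw and corrected preference figures reported above.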
Conclusion
Using a blinded, bias-controlled multi-rater evaluation framework, TUM-SAM demonstrated expert-level performance for automated BM segmentation and was consistently preferred by physicians. These results support the clinical feasibility of foundation-model–based segmentation as a physician-in-the-loop tool for BM management and highlight the limitations of single-reference voxel-wise evaluation metrics.