Poster Poster Program Therapy Physics

Blinded Multi-Rater Evaluation of Automated Brain Metastasis Segmentation Using a Hybrid Foundation-Model Framework

Abstract
Purpose

Accurate segmentation of brain metastases (BM) is essential for stereotactic radiosurgery planning and longitudinal assessment, yet manual contouring is time-consuming and subject to substantial inter-observer variability. We developed TUM-SAM, a hybrid foundation-model–based framework for fully automated BM segmentation, and evaluated its performance using a blinded, bias-controlled multi-rater analysis to assess agreement with and preference relative to expert physicians.

Methods

TUM-SAM combines nnU-Net–based lesion detection with a tumor-adapted Med-SAM segmentation model to enable prompt-free, fully automated BM segmentation. The model was trained on 301 patients (2,548 lesions) and evaluated on an independent external cohort of 105 patients (397 lesions). Segmentation performance was compared against DeepMedic and nnU-Net using the Dice similarity coefficient and the 95th-percentile Hausdorff distance (HD95). Two physicians independently contoured all external cases, and a third physician contoured a 20-patient subset. A blinded, tumor-level preference study was performed, and pairwise preferences were analyzed using a Bradley–Terry probabilistic model to account for rater-specific bias and case difficulty.
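For readers less familiar with the two voxel-wise metrics named above, the following is a minimal sketch of how Dice and HD95 can be computed for a single lesion. It is not the authors' evaluation code: the representation of a mask as a set of voxel indices, the helper names (`dice`, `surface`, `hd95`), the 6-connected surface definition, and the choice to take the larger of the two directed 95th percentiles are all assumptions made for illustration, and HD95 conventions differ between toolkits.

```python
import math

Voxel = tuple[int, int, int]

def dice(a: set[Voxel], b: set[Voxel]) -> float:
    """Dice similarity coefficient between two masks given as voxel-index sets."""
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))

def surface(mask: set[Voxel]) -> set[Voxel]:
    """Voxels with at least one 6-connected neighbour outside the mask."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return {
        v for v in mask
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in mask for dx, dy, dz in offsets)
    }

def percentile(values: list[float], q: float) -> float:
    """q-th percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def directed_distances(src, dst, spacing):
    """Distance in mm from each source surface voxel to its nearest target voxel."""
    def to_mm(v):
        return tuple(c * s for c, s in zip(v, spacing))
    dst_mm = [to_mm(q) for q in dst]
    return [min(math.dist(to_mm(p), q) for q in dst_mm) for p in src]

def hd95(a: set[Voxel], b: set[Voxel], spacing=(1.0, 1.0, 1.0)) -> float:
    """Symmetric HD95: the larger of the two directed 95th-percentile distances."""
    if not a or not b:
        raise ValueError("HD95 is undefined when either mask is empty")
    sa, sb = surface(a), surface(b)
    return max(
        percentile(directed_distances(sa, sb, spacing), 95),
        percentile(directed_distances(sb, sa, spacing), 95),
    )

# Toy example: a 4x4x4 cube and the same cube shifted by one voxel along x.
cube = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
shifted = {(x + 1, y, z) for (x, y, z) in cube}
print(dice(cube, shifted), hd95(cube, shifted))
```

The brute-force nearest-neighbour search is only practical for small lesions; a distance transform would be the usual choice for full volumes.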

Results

In the external cohort, TUM-SAM achieved a lesion-wise detection sensitivity of 0.94 and demonstrated improved segmentation accuracy across all tumor sizes, with a mean Dice of 0.84 and a mean HD95 of 1.9 mm. In comparison, nnU-Net and DeepMedic achieved Dice values below 0.70 and HD95 values exceeding 3.3 mm. TUM-SAM’s voxel-wise performance fell within the range of inter-observer variability among physicians. In the blinded preference study, physicians favored TUM-SAM contours in 81–87% of raw comparisons. After bias correction using the Bradley–Terry model, TUM-SAM maintained a consistent preference advantage, with estimated win probabilities of 55–56%.
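To illustrate how pairwise preferences translate into a win probability, the sketch below fits a plain Bradley–Terry model with the standard minorization-maximization (Zermelo) update. It is a simplified illustration only: the abstract's model additionally accounts for rater-specific bias and case difficulty, which this version omits, and the function names (`fit_bradley_terry`, `win_probability`) and the toy counts are hypothetical and unrelated to the study data.

```python
def fit_bradley_terry(wins, iters=500, tol=1e-12):
    """Estimate Bradley-Terry strengths from {(winner, loser): count}.

    Plain model only: no rater-bias or case-difficulty terms.
    """
    items = sorted({name for pair in wins for name in pair})
    if not items:
        raise ValueError("no comparisons supplied")

    total_wins = {i: 0.0 for i in items}
    n_pair = {}  # number of comparisons per unordered pair
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        key = frozenset((winner, loser))
        n_pair[key] = n_pair.get(key, 0.0) + count

    p = {i: 1.0 / len(items) for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            denom = sum(
                n_pair[frozenset((i, j))] / (p[i] + p[j])
                for j in items
                if j != i and n_pair.get(frozenset((i, j)), 0.0) > 0
            )
            new[i] = total_wins[i] / denom if denom > 0 else p[i]
        scale = sum(new.values())
        new = {i: v / scale for i, v in new.items()}  # strengths sum to 1
        delta = max(abs(new[i] - p[i]) for i in items)
        p = new
        if delta < tol:
            break
    return p

def win_probability(p, a, b):
    """Model probability that item a is preferred over item b."""
    return p[a] / (p[a] + p[b])

# Toy counts (hypothetical): "auto" preferred 3 times, "manual" once.
toy = {("auto", "manual"): 3, ("manual", "auto"): 1}
strengths = fit_bradley_terry(toy)
print(win_probability(strengths, "auto", "manual"))
```

With only two items and no covariates, the fitted win probability equals the raw win fraction. The gap between raw and model-based preference reported above can only arise once additional terms such as rater bias and case difficulty enter the model.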

Conclusion

Using a blinded, bias-controlled multi-rater evaluation framework, TUM-SAM demonstrated expert-level performance for automated BM segmentation and was consistently preferred by physicians. These results support the clinical feasibility of foundation-model–based segmentation as a physician-in-the-loop tool for BM management and highlight limitations of single-reference voxel-wise evaluation metrics.
