Benchmarking Auto-Contouring AI for Radiotherapy: Robustness, Calibration, and Failure Modes
Abstract
Purpose
To establish a large-scale, independent benchmark for evaluating auto-contouring AI with an emphasis on radiotherapy-relevant requirements: robustness to domain shift, calibration of confidence, and clinically meaningful failure modes beyond average Dice.
Methods
We developed the Touchstone Benchmark to address common evaluation gaps in segmentation studies, including small test sets, inconsistent training protocols, and limited out-of-distribution (OOD) testing. Forty-four teams from 12 countries participated and submitted their original models. All methods were trained under a common protocol using an open abdominal CT training set of 5,195 scans from 76 hospitals across 8 countries with voxel-wise labels for nine structures. Independent evaluation was performed on a sequestered external test set of 13,420 CT volumes from 12 hospitals not seen during training, representing variability in scanner vendors, acquisition settings, patient demographics, and clinical conditions. Models spanning CNNs, transformers, diffusion-based approaches, and vision–language hybrids were assessed by the benchmark organizers using the Dice similarity coefficient (DSC), normalized surface distance (NSD), runtime, and subgroup analyses. We additionally evaluated reliability through calibration-style analyses linking model confidence to error rates and by characterizing systematic failure patterns across anatomy and patient subgroups.
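As context for the metrics named above, a minimal sketch of voxel-wise DSC and a confidence-binning calibration check might look as follows. This is an illustrative example only, not the benchmark's actual implementation; the function names and binning scheme are assumptions.

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient (DSC) between two binary masks.

    DSC = 2|A ∩ B| / (|A| + |B|); 1.0 means perfect overlap.
    """
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, gt).sum() / denom

def calibration_bins(conf, correct, n_bins=10):
    """Bin voxel confidences and pair mean confidence with observed accuracy.

    A well-calibrated model has mean confidence ≈ accuracy in every bin.
    """
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf < hi)
        if mask.any():
            rows.append((float(conf[mask].mean()), float(correct[mask].mean())))
    return rows

# Toy example: a 2x2 predicted square vs. a 3x3 ground-truth square.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True
gt = np.zeros((4, 4), dtype=bool); gt[1:4, 1:4] = True
print(round(dice(pred, gt), 3))  # → 0.615 (2*4 / (4+9))
```

In large-scale benchmarking, the per-case DSC values would then be aggregated per organ and per subgroup, and the calibration bins inspected for systematic over- or under-confidence.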
Results
Among the 19 models evaluated to date, overall mean performance ranged from ~82% to ~92% DSC, with organ-specific DSC spanning ~70–96%, indicating substantial heterogeneity across structures and methods. Inference speed varied from 0.12 to 0.61 s per CT slice, revealing large trade-offs between accuracy and throughput. The large OOD test set enabled detection of performance gaps that are both statistically significant and clinically meaningful yet typically hidden in small benchmarks. Stratified analyses showed measurable performance differences across demographic and clinical subgroups, underscoring the need for diverse evaluation when commissioning auto-contouring tools.
Conclusion
Touchstone provides a scalable, independent platform for benchmarking auto-contouring AI under realistic OOD conditions, supporting medical physics needs for robust validation, reliability assessment, and transparent identification of failure modes prior to clinical deployment.