Paper · Proffered Program · Therapy Physics

Benchmarking Auto-Contouring AI for Radiotherapy: Robustness, Calibration, and Failure Modes

Abstract
Purpose

To establish a large-scale, independent benchmark for evaluating auto-contouring AI with an emphasis on radiotherapy-relevant requirements: robustness to domain shift, calibration of confidence, and clinically meaningful failure modes beyond average Dice.

Methods

We developed the Touchstone Benchmark to address common evaluation gaps in segmentation studies, including small test sets, inconsistent training protocols, and limited out-of-distribution (OOD) testing. Forty-four teams from 12 countries participated and submitted their original models. All methods were trained under a common protocol on an open abdominal CT training set of 5,195 scans from 76 hospitals across 8 countries, with voxel-wise labels for nine structures. Independent evaluation was performed on a sequestered external test set of 13,420 CT volumes from 12 hospitals not seen during training, representing variability in scanner vendors, acquisition settings, patient demographics, and clinical conditions. Models spanning CNNs, transformers, diffusion-based approaches, and vision–language hybrids were assessed by the benchmark organizers using the Dice similarity coefficient (DSC), normalized surface distance (NSD), runtime, and subgroup analyses. We additionally evaluated reliability through calibration-style analyses linking model confidence to observed error rates, and by characterizing systematic failure patterns across anatomical structures and patient subgroups.
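The two primary metrics can be sketched in a few lines. The following is a minimal illustration (not the benchmark's actual evaluation code) of DSC and a tolerance-based NSD on binary 3D masks, using NumPy and SciPy; the function names and the 2 mm default tolerance are assumptions for the example.

```python
import numpy as np
from scipy import ndimage

def dice(pred, gt):
    """Dice similarity coefficient between two boolean 3D masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def nsd(pred, gt, tau=2.0, spacing=(1.0, 1.0, 1.0)):
    """Normalized surface distance: fraction of boundary voxels of each
    mask lying within tolerance tau (mm) of the other mask's boundary."""
    def boundary(m):
        # voxels removed by one erosion step form the surface shell
        return m ^ ndimage.binary_erosion(m)
    bp, bg = boundary(pred), boundary(gt)
    # distance (in mm, via voxel spacing) from every voxel to the
    # nearest boundary voxel of the other mask
    dist_to_gt = ndimage.distance_transform_edt(~bg, sampling=spacing)
    dist_to_pred = ndimage.distance_transform_edt(~bp, sampling=spacing)
    pred_ok = (dist_to_gt[bp] <= tau).sum()
    gt_ok = (dist_to_pred[bg] <= tau).sum()
    return (pred_ok + gt_ok) / (bp.sum() + bg.sum())
```

Both metrics return values in [0, 1]; unlike DSC, NSD rewards agreement of the contour surface within a clinically chosen tolerance rather than volumetric overlap.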

Results

Among the 19 models evaluated to date, overall mean performance ranged from ~82% to 92% DSC, with organ-specific DSC spanning ~70–96%, indicating substantial heterogeneity across structures and methods. Inference speed varied from 0.12 to 0.61 s per CT slice, revealing large trade-offs between accuracy and throughput. The large OOD test set enabled detection of statistically significant, clinically meaningful performance gaps that are typically hidden in small benchmarks. Stratified analyses showed measurable performance differences across demographic and clinical subgroups, underscoring the need for diverse evaluation when commissioning auto-contouring tools.
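A calibration-style analysis of the kind described above can be sketched as a reliability table: voxel (or structure) confidences are binned, and the mean confidence in each bin is compared with the observed accuracy, yielding an expected calibration error (ECE). This is an illustrative sketch, not the benchmark's implementation; the function name and the 10-bin default are assumptions.

```python
import numpy as np

def reliability_table(confidence, correct, n_bins=10):
    """Bin confidences in [0, 1] and compare mean confidence with
    observed accuracy per bin; also return expected calibration error."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows, ece, n = [], 0.0, confidence.size
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        in_bin = (confidence >= lo) & (confidence < hi)
        if i == n_bins - 1:            # include confidence == 1.0 in top bin
            in_bin |= confidence == hi
        if not in_bin.any():
            continue
        mean_conf = confidence[in_bin].mean()
        accuracy = correct[in_bin].mean()
        rows.append((mean_conf, accuracy, int(in_bin.sum())))
        # ECE: confidence-accuracy gap weighted by bin occupancy
        ece += abs(mean_conf - accuracy) * in_bin.sum() / n
    return rows, ece
```

A well-calibrated model has per-bin accuracy close to mean confidence (ECE near zero), so its confidence maps can flag contours needing manual review before treatment planning.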

Conclusion

Touchstone provides a scalable, independent platform for benchmarking auto-contouring AI under realistic OOD conditions, supporting medical physics needs for robust validation, reliability assessment, and transparent identification of failure modes prior to clinical deployment.
