Poster Poster Program Therapy Physics

Expert Agreement on Deep-Learning Auto-Segmentation for Head and Neck Organs at Risk

Abstract
Purpose

Deep-learning (DL) auto-segmentation may improve efficiency in head-and-neck (HN) planning, but clinical use depends on how consistently experts judge contour acceptability and how performance varies by organ. We evaluated a DL-based auto-segmentation system for HN organs-at-risk (OARs), emphasizing expert agreement and complementary quantitative metrics.

Methods

The DL system generated 34 OARs per case for fourteen patients. Seven HN-experienced reviewers scored each OAR on a 1–4 scale based on the estimated fraction of axial slices requiring correction: 4 (40%). Inter-reviewer agreement was assessed using Fleiss’ kappa (overall and score-specific). Quantitative agreement with clinical contours was characterized using Dice similarity coefficient (DSC) and sensitivity. The relationship between volume difference (Δvol) and DSC was examined to interpret cases where overlap metrics may be misleading.
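As a rough illustration of the agreement analysis described above, the following is a minimal sketch of Fleiss' kappa (overall and per score) in plain Python. It is not the authors' implementation; the function names `scores_to_counts` and `fleiss_kappa` and the toy ratings are hypothetical, and the p-value reported in the Results is not computed here.

```python
from collections import Counter


def scores_to_counts(scores, categories=(1, 2, 3, 4)):
    """Convert per-item lists of reviewer scores into per-item category counts."""
    counts = []
    for item in scores:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])
    return counts


def fleiss_kappa(counts):
    """Return (overall kappa, list of per-category kappas).

    counts[i][j] is the number of reviewers who assigned item i to category j.
    Every item must have been scored by the same number of reviewers.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])
    pairs = n_raters * (n_raters - 1)

    # Share of all ratings that fall into each category.
    p = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]

    # Mean observed pairwise agreement across items.
    p_obs = sum((sum(c * c for c in row) - n_raters) / pairs for row in counts) / n_items

    # Agreement expected by chance from the category shares.
    p_exp = sum(pj * pj for pj in p)
    overall = (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else float("nan")

    # Category-specific kappa: each category treated as "this score vs. any other".
    per_category = []
    for j in range(n_cats):
        disagreement = sum(row[j] * (n_raters - row[j]) for row in counts)
        denom = n_items * pairs * p[j] * (1 - p[j])
        per_category.append(1 - disagreement / denom if denom > 0 else float("nan"))
    return overall, per_category


# Hypothetical toy data: 5 contours, each scored 1-4 by 7 reviewers.
toy_scores = [
    [4, 4, 4, 4, 4, 4, 4],
    [4, 4, 3, 4, 3, 4, 2],
    [3, 2, 3, 1, 2, 3, 3],
    [1, 1, 2, 1, 1, 2, 1],
    [4, 3, 2, 1, 4, 3, 2],
]
overall, per_score = fleiss_kappa(scores_to_counts(toy_scores))
print("overall kappa:", round(overall, 3))
print("kappa for scores 1-4:", [round(k, 3) for k in per_score])
```

In this sketch each rated item would correspond to one OAR contour in one patient (14×34 items with 7 ratings each), which is an assumption about how the ratings were pooled.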

Results

Across 3332 total scores (14×34×7), overall agreement was low but non-random (Fleiss’ kappa = 0.065; p<0.0001). Agreement was highest for Score 4 (kappa = 0.091) and lower for Scores 1–3 (kappa = 0.027–0.043). Reviewer scoring tendencies differed (Score-4 counts of 436 versus 234 across reviewers). Geometric performance spanned a wide range (mean DSC 0.22±0.15 to 0.98±0.01). Sensitivity varied substantially (brain 0.99±0.01 versus lips 0.45±0.19; optic chiasm 0.62±0.19). Δvol and DSC were not monotonically related; for example, brain showed Δvol -33.9±13.7 cc with DSC 0.98±0.01, whereas optic chiasm showed Δvol -0.2±0.4 cc with DSC 0.51±0.16.
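The non-monotonic Δvol–DSC behavior can be reproduced on synthetic shapes. The sketch below is illustrative only: the voxel boxes, the 0.001 cc voxel volume, the helper names `box` and `compare`, and the sign convention Δvol = auto minus reference are all assumptions, not values or definitions taken from the study. It shows a large structure losing a whole layer of voxels yet keeping a high DSC, while a small structure with identical volume but a one-voxel shift drops to a DSC of 0.5.

```python
from itertools import product


def box(x0, x1, y0, y1, z0, z1):
    """Set of integer voxel indices inside the half-open box [x0,x1) x [y0,y1) x [z0,z1)."""
    return set(product(range(x0, x1), range(y0, y1), range(z0, z1)))


def compare(auto, ref, voxel_cc=0.001):
    """Return (DSC, sensitivity, volume difference in cc) for two voxel sets.

    Assumed conventions: sensitivity is the fraction of reference voxels
    covered by the auto contour; volume difference is auto minus reference.
    """
    overlap = len(auto & ref)
    dsc = 2 * overlap / (len(auto) + len(ref))
    sensitivity = overlap / len(ref)
    dvol_cc = (len(auto) - len(ref)) * voxel_cc
    return dsc, sensitivity, dvol_cc


# Large synthetic structure: the auto contour misses one full 20x20 layer.
large_ref = box(0, 20, 0, 20, 0, 20)
large_auto = box(0, 20, 0, 20, 0, 19)

# Small synthetic structure: same volume, shifted by one voxel along x.
small_ref = box(0, 2, 0, 2, 0, 2)
small_auto = box(1, 3, 0, 2, 0, 2)

for name, auto, ref in [("large", large_auto, large_ref),
                        ("small", small_auto, small_ref)]:
    dsc, sens, dvol = compare(auto, ref)
    print(f"{name}: DSC={dsc:.3f} sensitivity={sens:.3f} dvol={dvol:+.3f} cc")
```

The large structure has the bigger absolute volume error but the better overlap score, which mirrors the brain versus optic chiasm contrast reported above.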

Conclusion

DL auto-segmentation can provide useful starting contours for HN OARs, but expert agreement on acceptability is limited and organ-dependent. Reliability measures, sensitivity, and Δvol–DSC behavior add clinically relevant context beyond DSC alone. The observed reviewer-to-reviewer variability supports standardized scoring guidance, and DL-generated OARs should be reviewed before treatment plan optimization.
