Poster · Poster Program · Therapy Physics

Expert Agreement on Deep-Learning Auto-Segmentation for Head and Neck Organs at Risk

Abstract

Purpose

Deep-learning (DL) auto-segmentation may improve efficiency in head-and-neck (HN) planning, but clinical use depends on how consistently experts judge contour acceptability and how performance varies by organ. We evaluated a DL-based auto-segmentation system for HN organs-at-risk (OARs), emphasizing expert agreement and complementary quantitative metrics.

Methods

The DL system generated 34 OARs per case for 14 patients. Seven HN-experienced reviewers scored each OAR on a 1–4 scale based on the estimated fraction of axial slices requiring correction: 4 (40%). Inter-reviewer agreement was assessed using Fleiss’ kappa (overall and score-specific). Quantitative agreement with clinical contours was characterized using the Dice similarity coefficient (DSC) and sensitivity. The relationship between volume difference (Δvol) and DSC was examined to interpret cases where overlap metrics may be misleading.
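
As an illustration of the agreement analysis, the following is a minimal sketch of the standard Fleiss' kappa computation (overall and per category) for a fixed number of reviewers scoring each item on the 1–4 scale. It is not the authors' code: the function names, the `CATEGORIES` constant, and the toy scores are hypothetical and chosen only for illustration.

```python
from collections import Counter

CATEGORIES = (1, 2, 3, 4)  # assumed labels for the 1-4 acceptability scale


def count_table(scores_per_item, categories=CATEGORIES):
    """Convert raw scores (one list of reviewer scores per item) to per-item category counts."""
    table = []
    for scores in scores_per_item:
        counts = Counter(scores)
        table.append([counts.get(c, 0) for c in categories])
    return table


def fleiss_kappa(table):
    """Return (overall kappa, list of per-category kappas) for an items x categories count table.

    Every row must sum to the same number of raters. A per-category kappa is
    None when that category was never used or was used for every rating.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(table[0])
    total = n_items * n_raters

    # Proportion of all ratings that fall in each category.
    p = [sum(row[j] for row in table) / total for j in range(n_cats)]

    # Mean observed pairwise agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items

    # Agreement expected by chance from the category proportions.
    p_e = sum(pj * pj for pj in p)
    overall = (p_bar - p_e) / (1 - p_e) if p_e < 1 else float("nan")

    per_category = []
    for j in range(n_cats):
        denom = total * (n_raters - 1) * p[j] * (1 - p[j])
        if denom == 0:
            per_category.append(None)
            continue
        disagreement = sum(row[j] * (n_raters - row[j]) for row in table)
        per_category.append(1 - disagreement / denom)
    return overall, per_category


# Hypothetical toy data: 5 contours, each scored by 7 reviewers.
toy_scores = [
    [4, 4, 4, 4, 4, 3, 4],
    [4, 3, 3, 4, 2, 3, 4],
    [1, 2, 1, 1, 2, 1, 3],
    [4, 4, 3, 4, 4, 4, 4],
    [2, 3, 2, 4, 3, 2, 3],
]
toy_overall, toy_per_score = fleiss_kappa(count_table(toy_scores))
print("overall kappa:", round(toy_overall, 3))
for score, kappa in zip(CATEGORIES, toy_per_score):
    print("score", score, "kappa:", None if kappa is None else round(kappa, 3))
```

In a study with this design, each row of the count table would correspond to one patient-OAR pair (14×34 rows), with seven ratings per row.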

Results

Across 3332 total scores (14×34×7), overall agreement was low but non-random (Fleiss’ kappa = 0.065; p < 0.0001). Agreement was highest for Score 4 (kappa = 0.091) and lower for Scores 1–3 (kappa = 0.027–0.043). Reviewer scoring tendencies differed (Score-4 counts of 436 versus 234 across reviewers). Geometric performance spanned a wide range (mean DSC 0.22±0.15 to 0.98±0.01). Sensitivity varied substantially (brain 0.99±0.01 versus lips 0.45±0.19; optic chiasm 0.62±0.19). Δvol and DSC were not monotonically related; for example, brain showed Δvol −33.9±13.7 cc with DSC 0.98±0.01, whereas optic chiasm showed Δvol −0.2±0.4 cc with DSC 0.51±0.16.
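
To make the Δvol and DSC behavior concrete, the sketch below computes DSC, sensitivity, and Δvol on two synthetic voxel masks. It is an illustration under stated assumptions, not the study's pipeline: the helper names, the box-shaped structures, the 0.001 cc voxel volume, and the sign convention (auto minus reference) are all hypothetical.

```python
from itertools import product

VOXEL_CC = 0.001  # assumed voxel volume (1 mm^3 expressed in cc)


def box(x0, x1, y0, y1, z0, z1):
    """Set of integer voxel indices inside the half-open box [x0,x1) x [y0,y1) x [z0,z1)."""
    return set(product(range(x0, x1), range(y0, y1), range(z0, z1)))


def dice(auto, ref):
    """Dice similarity coefficient: 2|A and R| / (|A| + |R|)."""
    if not auto and not ref:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(auto & ref) / (len(auto) + len(ref))


def sensitivity(auto, ref):
    """Fraction of reference voxels that the auto-contour also covers."""
    return len(auto & ref) / len(ref)


def delta_vol_cc(auto, ref, voxel_cc=VOXEL_CC):
    """Volume difference in cc, assumed here to be auto minus reference."""
    return (len(auto) - len(ref)) * voxel_cc


# Large synthetic structure: the auto-contour misses one whole slice.
large_ref = box(0, 20, 0, 20, 0, 20)   # 8000 voxels
large_auto = box(0, 20, 0, 20, 0, 19)  # 7600 voxels

# Small synthetic structure: same volume, shifted by one voxel.
small_ref = box(0, 2, 0, 2, 0, 2)      # 8 voxels
small_auto = box(1, 3, 0, 2, 0, 2)     # 8 voxels, half overlapping

for name, auto, ref in [("large", large_auto, large_ref),
                        ("small", small_auto, small_ref)]:
    print(name,
          "DSC:", round(dice(auto, ref), 3),
          "sensitivity:", round(sensitivity(auto, ref), 3),
          "dvol_cc:", round(delta_vol_cc(auto, ref), 3))
```

In this toy case the large structure has the larger absolute volume difference yet the higher DSC, while the small structure has zero volume difference but only half overlap, which is the same qualitative pattern as the brain and optic chiasm results above.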

Conclusion

DL auto-segmentation can provide useful starting contours for HN OARs, but expert agreement on acceptability is limited and organ-dependent. Reliability measures, sensitivity, and Δvol–DSC behavior add clinically relevant context beyond DSC alone. The observed reviewer-to-reviewer variability supports standardized scoring guidance, and DL-generated OARs should be reviewed before treatment plan optimization.

People

Noufal Manthala Padannayil, PhD (Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida
Suresh Rana, PhD, MS (Corresponding Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida
Shyam Pokharel, PhD (Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida
Zachariah Khan (Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida
Nishan Shrestha, PhD (Presenting Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida
Rachel Jacques (Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida
Nebi Demez (Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida
Ranjith Cholakkara Poyil, PhD (Author) · Lynn Cancer Institute, Boca Raton Regional Hospital, Baptist Health South Florida