Expert Agreement on Deep-Learning Auto-Segmentation for Head and Neck Organs at Risk
Abstract
Purpose
Deep-learning (DL) auto-segmentation may improve efficiency in head-and-neck (HN) treatment planning, but clinical use depends on how consistently experts judge contour acceptability and on how performance varies by organ. We evaluated a DL-based auto-segmentation system for HN organs at risk (OARs), emphasizing expert agreement and complementary quantitative metrics.
Methods
The DL system generated 34 OARs per case for 14 patients. Seven HN-experienced reviewers scored each OAR on a 1–4 scale based on the estimated fraction of axial slices requiring correction: 4 (40%). Inter-reviewer agreement was assessed using Fleiss’ kappa (overall and score-specific). Quantitative agreement with clinical contours was characterized using the Dice similarity coefficient (DSC) and sensitivity. The relationship between volume difference (Δvol) and DSC was examined to interpret cases in which overlap metrics may be misleading.
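As an illustration of the agreement statistic named above, the following is a minimal sketch of overall and score-specific Fleiss' kappa computed from a table of score counts. It is not the study's analysis code: the function names, the input layout (one row per scored contour, one column per score), and the toy data are assumptions made for the example. In the study design this table would have 476 rows (14×34) with each row summing to 7 reviewers.

```python
import numpy as np

def fleiss_kappa(counts):
    """Overall Fleiss' kappa.

    counts[i, j] = number of reviewers who gave item i the score j.
    Assumes every item was scored by the same number of reviewers.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    if not np.all(counts.sum(axis=1) == n_raters):
        raise ValueError("every item must be scored by the same number of reviewers")
    # Proportion of all assignments that fall in each score category
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    # Per-item observed agreement among reviewer pairs
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar = P_i.mean()        # mean observed agreement
    P_e = (p_j ** 2).sum()    # agreement expected by chance
    return (P_bar - P_e) / (1.0 - P_e)

def category_kappa(counts):
    """Score-specific kappa, one value per column.

    Assumes every score category is used at least once and that no
    category receives all assignments (otherwise the denominator is zero).
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    disagree = (counts * (n_raters - counts)).sum(axis=0)
    return 1.0 - disagree / (n_items * n_raters * (n_raters - 1) * p_j * (1.0 - p_j))

# Hypothetical toy data (not study data): 4 contours, 7 reviewers, scores 1..4
toy = np.array([
    [0, 0, 1, 6],
    [0, 1, 3, 3],
    [2, 3, 2, 0],
    [0, 0, 0, 7],
])
overall = fleiss_kappa(toy)
per_score = category_kappa(toy)
print(overall, per_score)
```

The overall kappa is a weighted average of the score-specific values, with weights p_j(1 − p_j), so a score that is used often and agreed on more consistently pulls the overall value toward its own kappa.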
Results
Across 3332 total scores (14×34×7), overall agreement was low but non-random (Fleiss’ kappa = 0.065; p < 0.0001). Agreement was highest for Score 4 (kappa = 0.091) and lower for Scores 1–3 (kappa = 0.027–0.043). Reviewer scoring tendencies differed (Score-4 counts of 436 versus 234 across reviewers). Geometric performance spanned a wide range across OARs (mean DSC 0.22±0.15 to 0.98±0.01). Sensitivity varied substantially (brain 0.99±0.01 versus lips 0.45±0.19; optic chiasm 0.62±0.19). Δvol and DSC were not monotonically related; for example, brain showed Δvol −33.9±13.7 cc with DSC 0.98±0.01, whereas optic chiasm showed Δvol −0.2±0.4 cc with DSC 0.51±0.16.
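The non-monotonic Δvol–DSC behavior can be reproduced with a small sketch on binary masks. This is an illustrative example under stated assumptions, not the study's evaluation code: the function names, the sign convention Δvol = auto − reference, the 1 mm³ voxel size, and the toy masks are all hypothetical.

```python
import numpy as np

def dice(auto, ref):
    """Dice similarity coefficient: 2|A ∩ R| / (|A| + |R|)."""
    auto, ref = np.asarray(auto, dtype=bool), np.asarray(ref, dtype=bool)
    denom = int(auto.sum()) + int(ref.sum())
    if denom == 0:
        return float("nan")  # both masks empty: DSC is undefined
    return 2.0 * int(np.logical_and(auto, ref).sum()) / denom

def sensitivity(auto, ref):
    """Fraction of reference voxels that the auto-contour covers."""
    auto, ref = np.asarray(auto, dtype=bool), np.asarray(ref, dtype=bool)
    n_ref = int(ref.sum())
    if n_ref == 0:
        return float("nan")
    return int(np.logical_and(auto, ref).sum()) / n_ref

def delta_volume_cc(auto, ref, voxel_cc):
    """Volume difference in cc; sign convention (auto - reference) is an assumption."""
    auto, ref = np.asarray(auto, dtype=bool), np.asarray(ref, dtype=bool)
    return (int(auto.sum()) - int(ref.sum())) * voxel_cc

VOXEL_CC = 0.001  # assumed 1 mm^3 voxels

# Large structure: auto-contour misses 50 of 1000 voxels at one edge
ref_large = np.zeros(1200, dtype=bool)
ref_large[:1000] = True
auto_large = np.zeros(1200, dtype=bool)
auto_large[:950] = True

# Small structure: same size as the reference but shifted by half its length
ref_small = np.zeros(1200, dtype=bool)
ref_small[0:10] = True
auto_small = np.zeros(1200, dtype=bool)
auto_small[5:15] = True

for name, a, r in [("large", auto_large, ref_large), ("small", auto_small, ref_small)]:
    print(name, dice(a, r), sensitivity(a, r), delta_volume_cc(a, r, VOXEL_CC))
```

In this toy case the large structure has the larger absolute volume difference yet a DSC above 0.97, while the small structure has zero volume difference and a DSC of 0.5, because a positional shift changes overlap without changing volume.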
Conclusion
DL auto-segmentation can provide useful starting contours for HN OARs, but expert agreement on acceptability is limited and organ-dependent. Reliability measures, sensitivity, and Δvol–DSC behavior add clinically relevant context beyond DSC alone. The observed reviewer-to-reviewer variability supports standardized scoring guidance, and DL-generated OARs should be reviewed before treatment plan optimization.