Bridging Geometric Accuracy and Clinical Utility in Deep Learning Auto‑Segmentation: The Role of Organ Volume
Abstract
Purpose
To integrate qualitative assessments from experienced treatment planners with quantitative geometric comparisons to evaluate the clinical utility of a deep learning (DL)-based auto-segmentation tool and establish performance criteria grounded in clinical usability.
Methods
RayStation 2024A deep learning auto-segmentation was applied to 50 anonymized planning CTs across five body sites, generating 6–20 organ-at-risk (OAR) contours per case. Ten experienced planners (five dosimetrist–physicist pairs) independently reviewed all contours for their assigned site and provided a clinical utility (CU) score of 0 (unusable), 1 (potentially useful with major edits), or 2 (useful with minor edits). Across all cases, 349 DL-generated contours from 35 organs with corresponding expert contours were quantitatively evaluated using Dice coefficients, Hausdorff distance, and organ volumes. Linear regression models assessed associations between CU score and Dice, volume, and their interaction, with performance evaluated using R², RMSE, and coefficient p-values.
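The three geometric quantities named above can be illustrated with a minimal sketch. It assumes each contour has been rasterized to a set of (x, y, z) voxel indices on a common grid with known spacing; the function names, the toy cubes, and the choice to compute the Hausdorff distance over all voxels rather than surface points are illustrative assumptions, not the study's implementation.

```python
import math

def dice(a, b):
    """Dice coefficient between two sets of voxel indices."""
    if not a and not b:
        return 1.0  # convention for two empty masks (assumption)
    return 2.0 * len(a & b) / (len(a) + len(b))

def volume_cc(voxels, spacing_mm):
    """Volume in cubic centimetres: voxel count times voxel volume."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    return len(voxels) * voxel_mm3 / 1000.0

def hausdorff_mm(a, b, spacing_mm):
    """Symmetric Hausdorff distance in mm (brute force, small masks only)."""
    def to_mm(v):
        return tuple(i * s for i, s in zip(v, spacing_mm))

    pa = [to_mm(v) for v in a]
    pb = [to_mm(v) for v in b]

    def directed(p, q):
        # Largest distance from any point in p to its nearest point in q.
        return max(min(math.dist(x, y) for y in q) for x in p)

    return max(directed(pa, pb), directed(pb, pa))

# Toy data (hypothetical): a 4x4x4 "expert" cube and a "DL" cube
# shifted by one voxel along x, on a 1 x 1 x 2 mm grid.
spacing = (1.0, 1.0, 2.0)
expert = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
dl = {(x + 1, y, z) for (x, y, z) in expert}

print("Dice:", dice(expert, dl))
print("Hausdorff (mm):", hausdorff_mm(expert, dl, spacing))
print("Expert volume (cc):", volume_cc(expert, spacing))
```

For the toy cubes, 48 of 64 voxels overlap, so the Dice coefficient is 0.75, and the one-voxel shift along x gives a Hausdorff distance of 1 mm. The brute-force search is quadratic in voxel count and is meant only for illustration.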
Results
Of 49 organs evaluated, 33 were deemed clinically useful, 12 potentially useful, and 4 unusable, with mean CU scores of 1.85 ± 0.18, 1.59 ± 0.13, and 0.16 ± 0.26, respectively. A Dice-only model showed a statistically significant but weak association with CU (p = 0.021, R² = 0.015). Volume alone was not predictive; however, adding a Dice × volume interaction term markedly improved model performance (p = 4.2×10⁻¹⁸, R² = 0.216) and reduced RMSE from 0.644 to 0.576. The positive interaction coefficient (β₃ = 0.98) indicated that Dice became increasingly predictive as organ volume increased. In contrast, Hausdorff distance demonstrated strong predictive power for small-volume organs (<20 cc; p = 1.7×10⁻¹³, R² = 0.38).
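One way to read the interaction result is through a model of the form CU = β₀ + β₁·Dice + β₂·Volume + β₃·Dice·Volume, in which the slope of CU with respect to Dice is β₁ + β₃·Volume and therefore grows with volume when β₃ is positive. The sketch below fits such a model by ordinary least squares on synthetic, noise-free data. The coefficient values, the volume scaling (units of 100 cc), and all function names are hypothetical; the abstract does not specify the study's exact parameterization or software, and p-values are not computed here.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def design_row(dice, vol):
    """Columns: intercept, Dice, volume, Dice x volume interaction."""
    return [1.0, dice, vol, dice * vol]

def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def r2_rmse(X, y, beta):
    """Coefficient of determination and root-mean-square error."""
    pred = [sum(b * x for b, x in zip(beta, r)) for r in X]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / len(y))

def dice_slope(beta, vol):
    """Marginal effect of Dice on CU at a given volume: beta1 + beta3 * vol."""
    return beta[1] + beta[3] * vol

# Synthetic data (hypothetical): a grid of Dice values and scaled volumes,
# with CU generated exactly from made-up coefficients.
true_beta = [0.5, 0.4, -0.1, 0.2]
X = [design_row(d, v) for d in (0.6, 0.7, 0.8, 0.9)
     for v in (0.05, 0.5, 2.0, 10.0)]
y = [sum(b * x for b, x in zip(true_beta, row)) for row in X]

beta = fit_ols(X, y)
r2, rmse = r2_rmse(X, y, beta)
print("beta:", [round(b, 4) for b in beta])
print("Dice slope at small vs large volume:",
      dice_slope(beta, 0.05), dice_slope(beta, 10.0))
```

Because the synthetic data contain no noise, the fit recovers the made-up coefficients almost exactly; with real planner scores, a statistics package would also report standard errors and the coefficient p-values referred to above.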
Conclusion
Common geometric metrics alone have limited ability to predict subjective clinical utility. Accounting for organ volume, however, shows that data-driven metric combinations can align more closely with planners' perception of efficiency: Dice was more informative for larger organs, and Hausdorff distance for small ones. Ongoing work incorporating curvature metrics and comparative DL-versus-clinical dosimetry aims to further refine performance assessment.