Beyond Global Metrics: Unmasking Clinically Meaningful Deviations Using a Novel Local Metric in Regression Testing of Deployed Auto-Segmentation Models
Abstract
Purpose
Deep-learning (DL) auto-segmentation is routinely used clinically, yet domain shift and vendor model upgrades can change behavior in non-intuitive ways. We present a regression-testing framework to commission upgraded commercial auto-contouring models and to reveal spatially localized contour differences that may be obscured by global metrics.
Methods
We compared an initial and an upgraded commercial DL model against reference contours across Pelvis and Head-and-Neck sites (n=36, OARs=21). Performance was first quantified using standard global metrics: Dice Similarity Coefficient (DSC), 95th-percentile Hausdorff Distance (HD95), and Mean Surface Distance (MSD). To unmask localized discrepancies hidden by these aggregates, we applied Deformable Point Cloud Registration–Based Bidirectional Local Distance (DPCR-BLD). Unlike global summaries, this method visualizes spatial relationships to explicitly identify region-specific contouring shifts.
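As an illustration of the three global metrics named above, the following is a minimal sketch and not the implementation used in this study. It assumes each contour is available both as a set of voxel indices (for DSC) and as a list of surface points in millimetres (for HD95 and MSD). The function names are hypothetical, and HD95 is taken here as the larger of the two directed 95th-percentile distances, which is one of several conventions in use.

```python
import math

def dice(voxels_a, voxels_b):
    """Dice similarity coefficient between two collections of voxel indices."""
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))

def directed_distances(src, dst):
    """Distance from every point in src to its nearest point in dst (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def percentile(values, pct):
    """Percentile with linear interpolation between order statistics."""
    v = sorted(values)
    k = (len(v) - 1) * pct / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return v[lo] + (v[hi] - v[lo]) * (k - lo)

def hd95(surface_a, surface_b):
    """95th-percentile Hausdorff distance (max of the two directed percentiles)."""
    d_ab = directed_distances(surface_a, surface_b)
    d_ba = directed_distances(surface_b, surface_a)
    return max(percentile(d_ab, 95), percentile(d_ba, 95))

def msd(surface_a, surface_b):
    """Mean surface distance, averaged symmetrically over both directions."""
    d = directed_distances(surface_a, surface_b) + directed_distances(surface_b, surface_a)
    return sum(d) / len(d)
```

Each function returns a single number per structure, which is why these metrics can report that two contours differ without indicating where on the surface the difference lies.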
Results
Global metrics showed no statistically significant differences for most ROIs. For example, for the prostate the upgraded model improved mean DSC from 0.77 to 0.80 (p=0.06) and reduced HD95 from 10.43 mm to 7.84 mm. The global metric values did not indicate where discrepancies occurred. DPCR-BLD localized the prostate differences: the initial model over-contoured the posterior prostate adjacent to the rectal wall, whereas the upgraded model under-contoured peripheral regions in contact with the rectum. Despite unchanged brainstem DSC (0.89), DPCR-BLD indicated slight over-contouring of the superior region in the upgraded model. For the rectum, global metrics suggested no change, but the initial model failed on an outlier case with a rectal balloon; DPCR-BLD clearly highlighted the localized miss.
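To illustrate how a per-point distance can localize a discrepancy that an average dilutes, the sketch below computes a simplified bidirectional local distance on toy 2D contours. It is not DPCR-BLD: the deformable point cloud registration step is omitted entirely, correspondence is plain nearest-neighbour matching, and the result is unsigned, so it cannot distinguish over-contouring from under-contouring. The formulation (the larger of the forward nearest distance and the largest distance from any test point that maps back to a given reference point) is an assumption about how a bidirectional local distance is commonly defined, and the function name and coordinates are hypothetical.

```python
import math

def bidirectional_local_distance(ref_surface, test_surface):
    """Per-reference-point local distance (simplified, unsigned, no registration)."""
    # Forward: distance from each reference point to its nearest test point.
    forward = [min(math.dist(p, q) for q in test_surface) for p in ref_surface]
    # Backward: for each test point, find its nearest reference point and keep
    # the largest such distance attributed to that reference point.
    backward = [0.0] * len(ref_surface)
    for q in test_surface:
        dists = [math.dist(q, p) for p in ref_surface]
        i = dists.index(min(dists))
        backward[i] = max(backward[i], dists[i])
    return [max(f, b) for f, b in zip(forward, backward)]

# Toy contours (mm): the test contour matches the reference except for a
# 3 mm displacement near x = 4 to 5.
ref = [(float(x), 0.0) for x in range(6)]
test = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 3.0), (5.0, 3.0)]

bld = bidirectional_local_distance(ref, test)
mean_bld = sum(bld) / len(bld)                    # a single averaged summary
worst_index = max(range(len(bld)), key=bld.__getitem__)
worst_point = ref[worst_index]                    # where the largest deviation sits
```

In this toy case the per-point values are zero along most of the contour and 3 mm at the two displaced points, while the mean is 1 mm; the per-point map shows the location of the deviation that the average alone does not.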
Conclusion
Combining conventional global metrics with DPCR-BLD provides actionable, spatially resolved regression testing for upgraded DL auto-segmentation models, helping prevent inappropriate transfer of expectations between versions and supporting continuous clinical monitoring and QA.