Inter-Observer Variability Informed Weighted Anisotropic Surface DSC for Segmentation Evaluation
Abstract
Purpose
Common contour evaluation metrics (e.g., the Dice similarity coefficient, DSC, or the Hausdorff distance, HD) provide global summaries that can miss the localized, anisotropic disagreements that drive clinically meaningful edits. Although surface DSC (SDSC) better reflects boundary discrepancies, it typically applies a subjectively chosen isotropic tolerance (e.g., 5 mm) uniformly over the surface. We assume that contour acceptability can be judged by alignment with inter-observer variability (IOV) and introduce an IOV-informed, weighted anisotropic surface DSC (WeightedAniSDSC).
Methods
Using the CURVAS dataset (pancreas, kidneys, liver; three independent annotations per case), cases were split into training (n=20), validation (n=5), and testing (n=65). In training, local IOV was quantified across annotators using DPCR-BLD. Assuming a Gaussian distribution, a 2σ surface margin (µ±2σ) was computed at each surface location and mapped to new cases using deformable point-cloud registration (DPCR). AniSDSC replaces the uniform SDSC tolerance with this organ- and region-specific margin; WeightedAniSDSC further applies an outlier-distance penalty to reduce sensitivity to stray voxels and small islands. On validation cases, 60 pseudo-error contours (5 cases × 12 error types) were generated. ROC analyses were conducted with DSC, SDSC5mm, SDSC8mm, SDSC10mm, and WeightedAniSDSC to select an operating threshold. The validation-derived threshold was then applied to the testing subset using all pairwise annotator comparisons.
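As an illustration of the per-location margin described above, the following is a minimal sketch, not the authors' implementation. It assumes that signed surface distances (in mm) at corresponding surface points are already available, for example from a registration step, and it approximates surface area by point counts. The function names `two_sigma_margin` and `surface_agreement` are hypothetical, and the outlier-distance penalty of WeightedAniSDSC is omitted because its form is not specified here.

```python
import numpy as np


def two_sigma_margin(signed_dists):
    """Per-point tolerance interval (mu - 2*sigma, mu + 2*sigma).

    signed_dists: array of shape (n_comparisons, n_points) holding signed
    inter-observer surface distances in mm at corresponding surface points
    (assumed input; how correspondence is established is outside this sketch).
    """
    mu = signed_dists.mean(axis=0)
    sigma = signed_dists.std(axis=0, ddof=1)  # sample standard deviation
    return mu - 2.0 * sigma, mu + 2.0 * sigma


def surface_agreement(d_ref, d_test, margin_ref, margin_test):
    """Fraction of surface points whose signed distance lies inside its margin.

    d_ref:  signed distances from reference-surface points to the test surface.
    d_test: signed distances from test-surface points to the reference surface.
    margin_*: (lower, upper) pairs, either scalars or per-point arrays.
    Point counts stand in for surface area (a simplifying assumption).
    """
    ok_ref = (d_ref >= margin_ref[0]) & (d_ref <= margin_ref[1])
    ok_test = (d_test >= margin_test[0]) & (d_test <= margin_test[1])
    return (ok_ref.sum() + ok_test.sum()) / (ok_ref.size + ok_test.size)


# Toy example: three inter-observer comparisons at two surface points.
iov = np.array([[0.0, 1.0],
                [2.0, 1.0],
                [1.0, 1.0]])
lower, upper = two_sigma_margin(iov)

d_ref = np.array([0.5, 2.0])         # reference -> test distances
d_test = np.array([0.0, 4.0, -0.5])  # test -> reference distances
score = surface_agreement(d_ref, d_test, (lower, upper), (-1.0, 1.0))
```

Passing a scalar pair such as `(-5.0, 5.0)` for both margins reduces this sketch to a uniform-tolerance agreement score, which makes the contrast with a spatially varying margin easy to inspect on the same inputs.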
Results
WeightedAniSDSC improved discrimination of introduced errors (AUC 0.89–0.98) versus DSC (AUC 0.86–0.91) and consistently outperformed SDSC across fixed tolerances. The optimal threshold was less organ-dependent for WeightedAniSDSC (0.85–0.89) than for DSC (0.86–0.91). In testing, WeightedAniSDSC uniquely flagged 20 pancreas, 6 kidney, and 10 liver contours missed by DSC and SDSC5mm. Review identified stray-voxel artifacts and localized boundary deviations beyond expected IOV, and highlighted surface regions exceeding the mapped tolerance.
Conclusion
In this pilot evaluation, incorporating spatially varying, IOV-derived tolerances and outlier weighting into a surface-based agreement metric improved detection of realistic local contour discrepancies compared with conventional global metrics, suggesting WeightedAniSDSC may be a useful complement for segmentation evaluation and warrants broader validation.