Robustness Evaluation of the Det2Prompt Framework Under Various Partial-Label Scenarios for Abdominal Organ Segmentation
Abstract
Purpose
To evaluate the robustness and generalizability of Det2Prompt, a foundation-model-driven framework for partially labeled medical image segmentation, under various annotation conditions.
Methods
Det2Prompt uses a lightweight detector to generate bounding-box prompts for unlabeled organs; these prompts are passed to a frozen foundation segmentation model to produce pseudo-labels. The pseudo-labels are then fused with human annotations to train a segmentation model. Experiments were conducted on the multi-modality Abdominal Multi-Organ Segmentation (AMOS) 2022 dataset, where partial-label scenarios were simulated by removing organ annotations to leave 33% or 67% organ annotation coverage. We compared Det2Prompt with three training settings: (1) full supervision (full labels), trained with complete organ annotations and serving as an upper-bound benchmark; (2) full supervision (partial labels), trained with partial labels where unlabeled organs were treated as background; and (3) partial supervision (partial labels), where the loss was computed only on labeled organs. Model performance was evaluated using the Dice similarity coefficient (DSC) and the 95% Hausdorff distance (HD95).
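As an illustration of the label-fusion and evaluation steps described above, the following is a minimal sketch rather than the authors' implementation. It assumes integer label maps with 0 as background, and a fusion rule in which human annotations always take priority and pseudo-labels only fill voxels belonging to organs without human annotations. The names `fuse_labels`, `dice`, and `labeled_ids` are hypothetical.

```python
import numpy as np


def fuse_labels(human, pseudo, labeled_ids):
    """Fuse human annotations with pseudo-labels (assumed rule, not from the paper)."""
    human = np.asarray(human)
    pseudo = np.asarray(pseudo)
    fused = human.copy()
    # Pseudo-labels are used only for organs that have no human annotation.
    unlabeled = ~np.isin(pseudo, list(labeled_ids)) & (pseudo != 0)
    # Human-annotated voxels are never overwritten.
    fill = unlabeled & (human == 0)
    fused[fill] = pseudo[fill]
    return fused


def dice(pred, target, organ_id):
    """Dice similarity coefficient for one organ; NaN if the organ is absent in both."""
    p = np.asarray(pred) == organ_id
    t = np.asarray(target) == organ_id
    denom = p.sum() + t.sum()
    if denom == 0:
        return float("nan")
    return 2.0 * np.logical_and(p, t).sum() / denom


# Toy 2D example: organ 1 is human-labeled, organ 2 comes from pseudo-labels.
human = [[1, 1, 0], [0, 0, 0]]
pseudo = [[1, 2, 2], [0, 2, 1]]
fused = fuse_labels(human, pseudo, labeled_ids={1})
```

In this toy case the pseudo-label for organ 1 in the last voxel is ignored because organ 1 is already covered by human annotation, while organ 2 is filled in only where the human map is background.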
Results
Det2Prompt achieved a DSC of 0.8815 and an HD95 of 11.09 mm under the 67% annotation scenario, and a DSC of 0.8230 and an HD95 of 19.22 mm under the 33% annotation scenario. In comparison, full supervision (full labels) achieved a DSC of 0.8921 and an HD95 of 9.95 mm. Under full supervision (partial labels), performance degraded substantially, with DSC of 0.8581 and 0.5934 and HD95 of 88.17 mm and 57.96 mm for 67% and 33% annotation, respectively. Partial supervision (partial labels) improved DSC to 0.8639 and 0.7139 but still showed large HD95 values (36.70 mm and 295.89 mm, respectively). Overall, Det2Prompt substantially outperformed conventional partial-label training strategies, particularly under severe annotation scarcity, and approached fully labeled performance.
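To make the comparison concrete, the short sketch below computes each method's DSC gap to the full supervision (full labels) benchmark, using only the values reported above. The helper name `dsc_gap` is hypothetical, and the gap is a simple difference rather than a statistic reported by the study.

```python
# DSC values copied from the Results paragraph; keys are annotation coverage in percent.
FULL_LABEL_DSC = 0.8921
dsc = {
    "Det2Prompt": {67: 0.8815, 33: 0.8230},
    "partial supervision (partial labels)": {67: 0.8639, 33: 0.7139},
    "full supervision (partial labels)": {67: 0.8581, 33: 0.5934},
}


def dsc_gap(method, coverage):
    """Difference between the fully labeled benchmark DSC and a method's DSC."""
    return round(FULL_LABEL_DSC - dsc[method][coverage], 4)


for method in dsc:
    for coverage in (67, 33):
        print(f"{method}, {coverage}% coverage: gap = {dsc_gap(method, coverage):.4f}")
```

At 33% coverage the gap is 0.0691 for Det2Prompt, compared with 0.1782 for partial supervision (partial labels) and 0.2987 for full supervision (partial labels).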
Conclusion
Det2Prompt demonstrates strong robustness across annotation scenarios. By leveraging detection-guided prompting and foundation-model pseudo-labeling, Det2Prompt substantially narrows the performance gap between partially and fully annotated training, showing strong potential for clinical application.