Multimodal LLM-Based Quality Assurance for Auto-Contouring: Utility As a Primary Screening Tool
Abstract
Purpose
Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study evaluated the feasibility of automated visual quality assessment using a multimodal large language model (LLM), Gemini 2.5 Flash, to assess its utility as a clinical primary screening tool to streamline the quality assurance (QA) workflow.
Methods
Twenty male pelvic CT scans from an open dataset were used. Two distinct auto-contouring software packages (RatoGuide prototype and clinical syngo.via) were evaluated. Auto-contouring results for each slice were exported as PDF images with overlaid contours and input to Gemini 2.5 Flash. The LLM was instructed to rate contour quality on a 5-point clinical scale (5: Accept; 4: Minor revision; 3: Major revision; 2: Redraw from scratch; 1: Organ not detected). Using evaluations by two board-certified radiation oncologists as ground truth, we calculated Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ). Additionally, scores ≤2 were defined as "Fail" and scores >2 as "Pass," and sensitivity and specificity were calculated for each model to assess screening performance.
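The agreement and screening metrics described above can be sketched as follows. This is an illustrative example only, not the study's analysis code: the score arrays are hypothetical, and quadratic weighting for the kappa coefficient is an assumption (the abstract does not specify the weighting scheme).

```python
# Minimal sketch of the evaluation metrics, using scipy and scikit-learn.
# The rating arrays below are hypothetical examples, NOT study data.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score, confusion_matrix

expert = np.array([5, 4, 5, 3, 2, 1, 4, 5, 2, 3])  # ground-truth 5-point ratings
llm    = np.array([5, 5, 4, 3, 3, 2, 4, 5, 2, 4])  # LLM 5-point ratings

# Agreement between LLM and expert ratings.
rho, _ = spearmanr(expert, llm)
kappa = cohen_kappa_score(expert, llm, weights="quadratic")  # weighting assumed

# Screening: scores <= 2 are "Fail" (positive class), > 2 are "Pass".
expert_fail = expert <= 2
llm_fail = llm <= 2
tn, fp, fn, tp = confusion_matrix(expert_fail, llm_fail).ravel()
sensitivity = tp / (tp + fn)  # fraction of true "Fail" cases flagged
specificity = tn / (tn + fp)  # fraction of true "Pass" cases cleared
print(rho, kappa, sensitivity, specificity)
```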
Results
Visual assessment by Gemini showed substantial agreement with expert judgment. For RatoGuide and syngo.via, ρ was 0.744 and 0.765 and κ was 0.669 and 0.746, respectively, indicating substantial agreement for both software packages. In terms of screening performance (with "Fail" as the positive class), RatoGuide and syngo.via achieved sensitivities of 84.2% and 80.0% and specificities of 93.6% and 95.4%, respectively. Although the LLM tended to overestimate quality, it exhibited high specificity in identifying clinically acceptable cases.
Conclusion
Gemini 2.5 Flash demonstrated substantial agreement with expert evaluations in image-based auto-contouring quality assessment. While the tendency to overestimate quality (a risk of missing "Fail" cases) warrants caution, the observed sensitivity and high specificity suggest its feasibility as a primary screening QA tool to efficiently filter acceptable contours, thereby reducing the clinical workload.