Multimodal LLM-Based Quality Assurance for Auto-Contouring: Utility As a Primary Screening Tool
Abstract
Purpose
Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study evaluated the feasibility of automated visual quality assessment using a multimodal large language model (LLM), Gemini 2.5 Flash, to assess its utility as a clinical primary screening tool to streamline the quality assurance (QA) workflow.
Methods
Twenty male pelvic CT scans from an open dataset were used. Two distinct auto-contouring software packages (RatoGuide prototype and clinical syngo.via) were evaluated. Auto-contouring results for each slice were exported as PDF images with overlaid contours and input to Gemini 2.5 Flash. The LLM was instructed to rate contour quality on a 5-point clinical scale (5: Accept; 4: Minor revision; 3: Major revision; 2: Redraw from scratch; 1: Organ not detected). Using evaluations by two board-certified radiation oncologists as ground truth, we calculated Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ). Additionally, scores ≤2 were defined as "Fail" and scores >2 as "Pass," and sensitivity and specificity were calculated for each model to assess screening performance.
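The agreement and screening metrics described above can be sketched as follows. This is an illustrative example only, not the study's analysis code: the score arrays are hypothetical, and quadratic weighting for the kappa coefficient is an assumption (the abstract does not specify the weighting scheme).

```python
# Minimal sketch of the evaluation metrics, using scipy and scikit-learn.
# The rating arrays below are hypothetical examples, NOT study data.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score, confusion_matrix

expert = np.array([5, 4, 5, 3, 2, 1, 4, 5, 2, 3])  # ground-truth 5-point ratings
llm    = np.array([5, 5, 4, 3, 3, 2, 4, 5, 2, 4])  # LLM 5-point ratings

# Agreement between LLM and expert ratings.
rho, _ = spearmanr(expert, llm)
kappa = cohen_kappa_score(expert, llm, weights="quadratic")  # weighting assumed

# Screening: scores <= 2 are "Fail" (positive class), > 2 are "Pass".
expert_fail = expert <= 2
llm_fail = llm <= 2
tn, fp, fn, tp = confusion_matrix(expert_fail, llm_fail).ravel()
sensitivity = tp / (tp + fn)  # fraction of true "Fail" cases flagged
specificity = tn / (tn + fp)  # fraction of true "Pass" cases cleared
print(rho, kappa, sensitivity, specificity)
```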
Results
Visual assessment by Gemini showed substantial agreement with expert judgment. For RatoGuide and syngo.via, ρ was 0.744 and 0.765 and κ was 0.669 and 0.746, respectively, indicating substantial agreement for both software packages. In terms of screening performance (with "Fail" as the positive class), RatoGuide and syngo.via achieved sensitivities of 84.2% and 80.0% and specificities of 93.6% and 95.4%, respectively. Although the LLM tended to overestimate quality, it exhibited high specificity in identifying clinically acceptable cases.
Conclusion
Gemini 2.5 Flash demonstrated substantial agreement with expert evaluations in image-based auto-contouring quality assessment. While the tendency to overestimate quality (a risk of missing "Fail" cases) warrants caution, the observed sensitivity and high specificity suggest its feasibility as a primary screening QA tool to efficiently filter acceptable contours, thereby reducing the clinical workload.