Poster Program · Therapy Physics

Multimodal LLM-Based Quality Assurance for Auto-Contouring: Utility As a Primary Screening Tool

Abstract
Purpose

Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study evaluated whether a multimodal large language model (LLM), Gemini 2.5 Flash, can perform automated visual quality assessment of contours, and whether it is useful as a primary clinical screening tool that streamlines the quality assurance (QA) workflow.

Methods

Twenty male pelvic CT scans from an open dataset were used. Two auto-contouring software packages were evaluated: a RatoGuide prototype and the clinical syngo.via. Auto-contouring results for each slice were exported as PDF images with overlaid contours and input into Gemini 2.5 Flash. The LLM was instructed to rate contour quality on a 5-point clinical scale (5: Accept; 4: Minor revision; 3: Major revision; 2: Redraw from scratch; 1: Organ not detected). Using evaluations by two board-certified radiation oncologists as ground truth, we calculated Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ). To assess screening performance, scores ≤2 were defined as "Fail" and scores >2 as "Pass", and sensitivity and specificity were calculated for each software package.
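
The Pass/Fail dichotomization described above can be illustrated with a minimal Python sketch. It is not the authors' code. The function names and the example scores are hypothetical, and it assumes a single reference score per case, because the abstract does not state how the two oncologists' ratings were combined.

```python
from typing import Sequence, Tuple

# Threshold taken from the Methods: scores <= 2 are "Fail", scores > 2 are "Pass".
FAIL_THRESHOLD = 2


def is_fail(score: int) -> bool:
    """Return True if a 1-5 quality score falls in the "Fail" class."""
    return score <= FAIL_THRESHOLD


def screening_metrics(expert: Sequence[int], llm: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) with "Fail" as the positive class.

    `expert` holds the reference scores and `llm` the model scores,
    one pair per evaluated contour.
    """
    if len(expert) != len(llm):
        raise ValueError("expert and llm must have the same length")

    tp = fn = tn = fp = 0
    for ref, pred in zip(expert, llm):
        if is_fail(ref):
            if is_fail(pred):
                tp += 1  # failing contour correctly flagged
            else:
                fn += 1  # failing contour missed (quality overestimated)
        else:
            if is_fail(pred):
                fp += 1  # acceptable contour flagged unnecessarily
            else:
                tn += 1  # acceptable contour correctly passed

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the reference scores")

    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores for illustration only; not data from the study.
expert_scores = [5, 4, 2, 1, 3, 5, 2, 4]
llm_scores = [5, 5, 2, 3, 3, 4, 1, 2]

sensitivity, specificity = screening_metrics(expert_scores, llm_scores)
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```

With "Fail" as the positive class, a false negative is a failing contour that the LLM rated as passing. This is the error type a screening tool most needs to limit.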

Results

Visual assessment by Gemini showed substantial agreement with expert judgment. For RatoGuide and syngo.via, respectively, ρ was 0.744 and 0.765 and κ was 0.669 and 0.746. In terms of screening performance (with "Fail" as the positive class), RatoGuide and syngo.via achieved sensitivities of 84.2% and 80.0% and specificities of 93.6% and 95.4%, respectively. Although the LLM tended to overestimate quality, it showed high specificity, correctly passing most clinically acceptable cases.
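
For readers unfamiliar with the agreement statistic, the sketch below shows one way a weighted kappa on the 5-point scale could be computed. The abstract does not state the weighting scheme, so the linear weights used by default here are an assumption (`power=2` would give quadratic weights). The function name and parameters are hypothetical. Both reported κ values fall within the 0.61 to 0.80 band that the commonly used Landis and Koch convention labels "substantial".

```python
from typing import Sequence


def weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    categories: Sequence[int] = (1, 2, 3, 4, 5),
    power: int = 1,
) -> float:
    """Cohen's weighted kappa for two raters on an ordinal scale.

    power=1 gives linear disagreement weights (assumed here, since the
    source does not specify the scheme); power=2 gives quadratic weights.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater_a, rater_b) category pairs.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the distance between categories.
            w = (abs(i - j) / (k - 1)) ** power
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when all ratings are identical")

    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings for illustration only; not data from the study.
kappa_perfect = weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
kappa_reversed = weighted_kappa([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(kappa_perfect, kappa_reversed)
```

Unlike the unweighted statistic, weighted kappa penalizes a one-point disagreement (for example, 5 versus 4) less than a four-point disagreement (5 versus 1). This suits an ordinal clinical scale.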

Conclusion

Gemini 2.5 Flash demonstrated substantial agreement with expert evaluations in image-based auto-contouring quality assessment. Although its tendency to overestimate quality (and thus the risk of missing "Fail" cases) warrants caution, the observed sensitivity (80.0% to 84.2%) suggests that it is feasible as a primary screening QA tool that efficiently filters acceptable contours and thereby reduces clinical workload.
