Large Language Model-Guided Triage of Incident Learning System Forms in Radiation Oncology
Abstract
Purpose
To evaluate agreement between large language models (LLMs) and expert reviewers in triaging radiation oncology incident learning system (ILS) forms along three clinically relevant dimensions (workflow process step, severity, and dosimetric impact), with the goal of improving the efficiency and effectiveness of routine ILS review and follow-up at larger institutions.
Methods
Thirty-nine free-text ILS forms from a single institution (2024–2025) were independently reviewed by three expert observers and three open-source LLMs (GPT-oss120b, Llama3.1:70b, Qwen2.5:32b). Each reviewer classified forms by: (1) process step (nine categories: Simulation, Contouring, Treatment Planning, Patient Setup, Imaging, Treatment Delivery, Physics Check, Physician Review, Therapy Check); (2) severity (Low/Medium/High); and (3) dosimetric impact (Low/Medium/High). Inter-rater reliability was assessed using Fleiss' kappa for nominal data and weighted Fleiss' kappa with quadratic weights for ordinal data. LLM-produced classifications were evaluated against each expert via Cohen's kappa, comparing mean LLM-expert agreement to mean expert-expert agreement, and overall accuracy was computed against the majority expert consensus.
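The agreement statistics used here can be sketched in plain Python. The functions below are illustrative implementations of the textbook formulas for Fleiss' kappa (nominal) and quadratic-weighted Cohen's kappa (ordinal), not the study's actual analysis code:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for nominal data.
    counts[i][j] = number of raters assigning item i to category j;
    every row must sum to the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of assignments falling in each category
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed pairwise agreement on each item
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    P_bar = sum(P_i) / n_items          # mean observed agreement
    P_e = sum(p * p for p in p_j)       # chance agreement
    return (P_bar - P_e) / (1 - P_e)

def weighted_cohen_kappa(a, b, n_cats):
    """Quadratic-weighted Cohen's kappa for two raters.
    a, b: parallel lists of ordinal labels coded 0..n_cats-1
    (e.g. Low=0, Medium=1, High=2)."""
    n = len(a)
    # Quadratic disagreement weights: 0 on the diagonal, 1 at the extremes
    w = [[((i - j) / (n_cats - 1)) ** 2 for j in range(n_cats)]
         for i in range(n_cats)]
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for x, y in zip(a, b):
        obs[x][y] += 1 / n
    pa = [sum(obs[i][j] for j in range(n_cats)) for i in range(n_cats)]
    pb = [sum(obs[i][j] for i in range(n_cats)) for j in range(n_cats)]
    num = sum(w[i][j] * obs[i][j] for i in range(n_cats) for j in range(n_cats))
    den = sum(w[i][j] * pa[i] * pb[j] for i in range(n_cats) for j in range(n_cats))
    return 1 - num / den
```

For three raters and a Low/Medium/High scale, each ILS form becomes one row of three category counts summing to 3; perfect agreement yields kappa = 1, and chance-level agreement yields kappa near 0.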
Results
Expert reviewer reliability was as follows: process step Fleiss' κ = 0.28 (fair), severity weighted κ = 0.34 (fair), and dosimetric impact weighted κ = 0.51 (moderate). Agreement between LLM-produced and expert classifications was comparable to inter-expert agreement: process step mean LLM-expert κ = 0.40 vs. mean expert-expert κ = 0.30; severity mean LLM-expert κ = 0.33 vs. mean expert-expert κ = 0.37; dosimetric impact mean LLM-expert κ = 0.34 vs. mean expert-expert κ = 0.52. GPT-oss120b was the best-performing model based on weighted kappa against the expert majority vote across process step, severity, and dosimetric impact (κ = 0.403, 0.448, and 0.435, respectively).
Conclusion
LLMs demonstrated moderate agreement with expert reviewers across multiple clinically relevant dimensions, with performance within the range of inter-expert variability. These findings suggest that LLM-assisted triage is a viable approach for scaling ILS form review and providing consistent initial screening in radiation oncology quality improvement programs, pending validation on larger datasets.