Large Language Model Agreement for Classifying Adaptive Radiotherapy Rationales
Abstract
Purpose
Adaptive radiotherapy (ART) decisions made during quality assurance CT (QACT) review are commonly documented in free‑text clinical notes. Automated classification of these rationales could enable large‑scale workflow and quality‑improvement analysis. However, the robustness of artificial intelligence (AI) approaches across real‑world documentation remains uncertain. This study compared large language models (LLMs) with a deterministic rule‑based natural language processing (NLP) method for classifying ART rationales in datasets derived from actual clinical proton QACT notes with varying levels of structure.
Methods
Three independent datasets (n=500 each) were curated from de‑identified proton ART documentation collected over a 3‑year period. Notes were abstracted in a HIPAA‑compliant environment and categorized into three formats: (1) noisy free‑text reflecting real clinical variability; (2) standardized notes with consistent phrasing; and (3) highly structured, template‑like inputs. Ground‑truth labels were determined from the clinical adaptation decision and documented rationale. Each case was assigned one of eight primary rationale categories and a derived binary avoidability label. Classification was performed using three LLMs (ChatGPT, Copilot, Grok) and a rule‑based NLP system. Performance was evaluated using accuracy and Cohen’s κ.
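Cohen’s κ corrects raw percent agreement for the agreement expected by chance given each rater’s label frequencies. As a minimal illustrative sketch (the category labels below are hypothetical, not the study’s actual rationale categories), κ between model output and ground truth can be computed as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences
    (e.g., ground truth vs. model predictions)."""
    n = len(labels_a)
    # Observed agreement: fraction of cases where the two labels match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical five-case example
truth = ["anatomy", "setup", "anatomy", "weight", "anatomy"]
model = ["anatomy", "setup", "weight", "weight", "anatomy"]
print(round(cohens_kappa(truth, model), 2))  # → 0.69
```

In practice a library implementation such as scikit-learn’s `cohen_kappa_score` would typically be used; the hand-rolled version above only makes the chance-correction explicit.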
Results
Performance varied strongly with documentation structure. For noisy free‑text, primary‑category agreement was substantial for Copilot (κ=0.80; 84% accuracy), ChatGPT (κ=0.72; 77%), and rule‑based NLP (κ=0.69; 75%), while Grok achieved only fair agreement (κ=0.32; 46%). For standardized notes, agreement improved markedly, led by Grok (κ=0.94; 95%) and Copilot (κ=0.88; 90%). For structured inputs, all LLMs achieved perfect agreement (κ=1.00; 100%), whereas rule‑based NLP remained lower (κ=0.74; 80%). Across all datasets, the binary avoidability classification outperformed the eight‑class primary task.
Conclusion
LLM performance for classifying ART rationales is highly sensitive to documentation quality, achieving perfect agreement under structured inputs but reduced robustness with noisy free‑text. Rule‑based NLP provides stable but lower‑ceiling performance. These findings underscore the importance of documentation standardization and model validation for reliable AI‑enabled ART workflow analytics.