Large Language Model Agreement for Classifying Adaptive Radiotherapy Rationales
Abstract
Purpose
Adaptive radiotherapy (ART) decisions made during quality assurance CT (QACT) review are commonly documented in free‑text clinical notes. Automated classification of these rationales could enable large‑scale workflow and quality‑improvement analysis. However, the robustness of artificial intelligence (AI) approaches across real‑world documentation remains uncertain. This study compared large language models (LLMs) with a deterministic rule‑based natural language processing (NLP) method for classifying ART rationales in datasets derived from actual clinical proton QACT notes with varying levels of structure.
Methods
Three independent datasets (n=500 each) were curated from de‑identified proton ART documentation collected over a 3‑year period. Notes were abstracted in a HIPAA‑compliant environment and categorized into three formats: (1) noisy free‑text reflecting real clinical variability; (2) standardized notes with consistent phrasing; and (3) highly structured, template‑like inputs. Ground‑truth labels were determined from the clinical adaptation decision and documented rationale. Each case was assigned one of eight primary rationale categories and a derived binary avoidability label. Classification was performed using three LLMs (ChatGPT, Copilot, Grok) and a rule‑based NLP system. Performance was evaluated using accuracy and Cohen’s κ.
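Cohen’s κ corrects raw percent agreement for the agreement expected by chance given each rater’s label frequencies. As a minimal illustrative sketch (the category labels below are hypothetical, not the study’s actual rationale categories), κ between model output and ground truth can be computed as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences
    (e.g., ground truth vs. model predictions)."""
    n = len(labels_a)
    # Observed agreement: fraction of cases where the two labels match
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical five-case example
truth = ["anatomy", "setup", "anatomy", "weight", "anatomy"]
model = ["anatomy", "setup", "weight", "weight", "anatomy"]
print(round(cohens_kappa(truth, model), 2))  # → 0.69
```

In practice a library implementation such as scikit-learn’s `cohen_kappa_score` would typically be used; the hand-rolled version above only makes the chance-correction explicit.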
Results
Performance varied strongly with documentation structure. For noisy free‑text, primary‑category agreement was substantial for Copilot (κ=0.80; 84% accuracy), ChatGPT (κ=0.72; 77%), and rule‑based NLP (κ=0.69; 75%), while Grok achieved only fair agreement (κ=0.32; 46%). For standardized notes, agreement improved markedly, led by Grok (κ=0.94; 95%) and Copilot (κ=0.88; 90%). For structured inputs, all LLMs achieved perfect agreement (κ=1.00; 100%), whereas rule‑based NLP remained lower (κ=0.74; 80%). Across all datasets, the binary avoidability classification outperformed the eight‑class primary task.
Conclusion
LLM performance for classifying ART rationales is highly sensitive to documentation quality, achieving perfect agreement under structured inputs but reduced robustness with noisy free‑text. Rule‑based NLP provides stable but lower‑ceiling performance. These findings underscore the importance of documentation standardization and model validation for reliable AI‑enabled ART workflow analytics.