Automated Prescription and Dose Goal Generation in a Clinical Treatment Planning System from Free-Text Physician Orders Using Frontier Large Language Models
Abstract
Purpose
To develop and evaluate an automated approach that intelligently interprets free-text, or natural language (NL), physician treatment directives and generates patient-specific prescriptions and clinical goals within a commercial treatment planning system (TPS) using large language models (LLMs), and to compare accuracy, time, and cost among major frontier LLMs, including ChatGPT and Claude.
Methods
We developed a fully automated multi-agent LLM framework that accepted NL directives as input, extracted prescriptions and clinical goals through multiple interpretation steps, and automatically entered them into our TPS (RayStation 2023B). The approach was evaluated in 5 genitourinary (GU) and central nervous system (CNS) patients who had previously received intensity-modulated proton therapy, each with a unique archived NL physician directive. An experienced planner manually re-generated the clinical goal list for each directive, serving as ground truth. The LLM-generated and clinical, human-generated goal lists were each evaluated by their agreement with this ground truth. The accuracy, time, and inference cost of multiple frontier LLMs were compared, including ChatGPT-4o-mini, ChatGPT-4o, and Claude Opus 4, and the accuracy of each LLM was compared to that of the clinical plans using paired t-tests.
Results
The mean±standard deviation (SD) percent of correct goals for clinical plans was 82.2±14.4%, attributed to a combination of human error and intentional modification of goals during clinical planning. The mean±SD (p-value) percent of correct goals for 4o-mini, 4o, and Claude was 47.6±26.5% (0.039), 87.5±8.9% (0.236), and 96.0±4.4% (0.043), respectively. Mean±SD inference time per patient was 23.1±12.1 s, 24.2±12.1 s, and 224.7±148.5 s, and mean±SD inference cost per patient was $0.0042±0.0021, $0.059±0.031, and $0.658±0.415, respectively.
Conclusion
An LLM approach for interpreting NL physician treatment directives was developed and demonstrated better agreement with the initial directives than the final clinical plans did. Among the LLMs evaluated, the most costly reasoning model achieved the highest accuracy. This approach could enable intuitive and efficient auto-planning for complex and unique patients.