Development and Validation of a GPT-5.2-Based Multi-Agent Radiotherapy Plan Evaluation System for Stereotactic Body Radiotherapy Plans
Abstract
Purpose
To explore the feasibility of applying large language models (LLMs) to radiotherapy plan (RTP) evaluation in order to assist plan quality control and clinical decision-making.
Methods
This study developed a multi-agent RTP evaluation system (RTPEV) based on GPT-5.2 that supports multimodal inputs, including clinical prescriptions, plan PDFs, and plan-related DICOM data. The system incorporates a Retrieval-Augmented Generation (RAG) retriever agent and a dedicated medical knowledge base tailored for RTP, which combines clinical guidelines and literature to support the evaluation process. A reviewer agent generates an initial evaluation report, a critic agent performs secondary verification and refinement, and an arbitrator agent outputs the final evaluation report. Twenty stereotactic body radiation therapy (SBRT) plans from Sun Yat-sen University Cancer Center were retrospectively evaluated. System performance was assessed using the number of issues detected per case, the agreement between system outputs and physicist review, and quantitative metrics including retrieval evidence relevance, hallucination rate, and review efficiency.
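The reviewer, critic, and arbitrator flow described above can be illustrated with a minimal sketch. This is not the RTPEV implementation: the agents here are simple rule-based stand-ins for GPT-5.2 calls, the retriever is a keyword match rather than a RAG component, and every name (`Finding`, `retrieve`, `reviewer`, `critic`, `arbitrator`, `evaluate_plan`, the plan dictionary keys) and every number is a hypothetical placeholder chosen for illustration, not a clinical value.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical stand-in for the RTP knowledge base; entries are placeholders,
# not real guideline text.
KNOWLEDGE_BASE: Dict[str, str] = {
    "slice thickness": "placeholder guideline text on planning-CT slice thickness",
    "dose constraint": "placeholder guideline text on OAR dose constraints",
}

RISK_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Finding:
    item: str
    risk: str  # "high", "medium", or "low"
    evidence: List[str] = field(default_factory=list)
    needs_confirmation: bool = False


def retrieve(query: str) -> List[str]:
    """Toy retriever: keyword match instead of embedding-based search."""
    q = query.lower()
    return [text for key, text in KNOWLEDGE_BASE.items() if key in q]


def reviewer(plan: dict) -> List[Finding]:
    """Draft initial findings; a real reviewer agent would prompt an LLM."""
    findings: List[Finding] = []
    if plan["ct_slice_mm"] > plan["protocol_max_slice_mm"]:
        findings.append(Finding("CT slice thickness exceeds protocol", "high"))
    if plan["prescribed_gy"] != plan["planned_gy"]:
        findings.append(Finding("Prescription and plan dose differ", "medium"))
    for oar, dose in plan["oar_max_dose_gy"].items():
        limit = plan["oar_limit_gy"].get(oar)
        if limit is not None and dose > limit:
            findings.append(Finding(f"{oar} dose constraint exceeded", "high"))
    return findings


def critic(findings: List[Finding]) -> List[Finding]:
    """Second pass: attach retrieved evidence and flag unsupported findings."""
    for f in findings:
        f.evidence = retrieve(f.item)
        f.needs_confirmation = not f.evidence
    return findings


def arbitrator(findings: List[Finding]) -> List[Finding]:
    """Final report: findings ordered from highest to lowest risk."""
    return sorted(findings, key=lambda f: RISK_ORDER[f.risk])


def evaluate_plan(plan: dict) -> List[Finding]:
    return arbitrator(critic(reviewer(plan)))


# Illustrative numbers only; they are not clinical values.
example_plan = {
    "ct_slice_mm": 5.0,
    "protocol_max_slice_mm": 3.0,
    "prescribed_gy": 50.0,
    "planned_gy": 48.0,
    "oar_max_dose_gy": {"spinal cord": 20.0},
    "oar_limit_gy": {"spinal cord": 25.0},
}

for finding in evaluate_plan(example_plan):
    print(finding.risk, finding.item, finding.needs_confirmation)
```

In this sketch, a finding without retrieved evidence is not discarded but marked for physicist confirmation, which mirrors the idea of flagging uncertain items for manual review rather than resolving them automatically.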
Results
For the 20 plans, the RTPEV system flagged an average of 6.1 issues per case requiring physicist confirmation. After manual review, 91.1% of the AI-flagged uncertain items were considered clinically acceptable. In addition, physicists identified further potential plan risks after reviewing the system-generated evaluation reports. Specifically, the system identified issues such as noncompliant CT slice thickness and mismatched organ-at-risk (OAR) dose constraints, stratified the identified issues by risk level, and provided corresponding optimization recommendations, indicating that the system outputs facilitated the detection of plan-related concerns. Quantitative evaluation showed a hallucination rate of 0.106 and a retrieval evidence relevance score of 0.59, suggesting moderate evidence grounding and stable overall system performance.
Conclusion
The RTPEV multi-agent system demonstrates technical feasibility and auxiliary clinical value for SBRT plan evaluation. It stably identifies potential plan problems with a low hallucination rate and can support quality control and clinical decision-making for radiotherapy plans.