LLM-Based Clinical Note Summarization Feasibility Study
Abstract
Purpose
Cancer treatment planning requires clinicians to rapidly synthesize complex clinical information from detailed patient notes, a process that is time-consuming and cognitively demanding, particularly in multidisciplinary workflows involving non-physician clinicians. As clinical documentation grows in volume and complexity, this challenge is expected to intensify. Large language model (LLM)–based clinical text summarization offers a promising approach to streamline information extraction. However, multi-agent LLM performance depends strongly on system design choices, including reflective reasoning and prompt configuration, and hallucinations remain unacceptable in safety-critical clinical environments. This feasibility study evaluates whether multi-agent LLMs can practically generate accurate, evidence-grounded clinical summaries for cancer treatment planning, using adaptive radiotherapy (ART) clinical notes as a representative use case.
Methods
Twenty-one ART clinical notes authored by radiation oncologists were collected. Initial LLM-generated summaries were edited by a human reviewer to form the ground-truth dataset. Each summary included diagnosis, prior irradiation, treatment intent, and Eastern Cooperative Oncology Group (ECOG) performance status. A four-agent pipeline was implemented: Agent 1 generated an initial summary; Agent 2 evaluated the summary against the ground truth; Agent 3 revised the summary using reflective reasoning; and Agent 4 evaluated the revised output. Both 0-shot and 3-shot prompting were tested; combined with the pre- and post-reflection outputs, this yielded four configurations per clinical note. Evaluation agents classified each summary sentence into a confusion matrix based on agreement with the ground truth, from which precision, recall, and F1-score were computed.
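The metric computation above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes each summary sentence is labeled a true positive (agrees with the ground truth) or a false positive (unsupported by the ground truth), and that ground-truth facts missing from the summary count as false negatives. The function name and label scheme are hypothetical.

```python
from collections import Counter

def summary_scores(sentence_labels, n_missed_facts):
    """Compute precision, recall, and F1 from sentence-level labels.

    sentence_labels: list of "TP" (sentence supported by ground truth)
                     or "FP" (sentence unsupported by ground truth).
    n_missed_facts:  number of ground-truth facts absent from the
                     summary, counted as false negatives (FN).
    """
    counts = Counter(sentence_labels)
    tp, fp, fn = counts["TP"], counts["FP"], n_missed_facts
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical example: 8 supported sentences, 1 unsupported, 1 missed fact
p, r, f1 = summary_scores(["TP"] * 8 + ["FP"], n_missed_facts=1)
```

Because precision and recall share the same numerator (TP), a summary with equal FP and FN counts yields identical precision and recall, and F1 equals both.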
Results
For 0-shot and 3-shot prompting, mean F1-scores before reflection were 85.3% and 86.5%, respectively. After reflection, mean F1-scores were 85.1% and 86.9%, respectively.
Conclusion
For structured clinical summaries, 3-shot prompting modestly improved multi-agent LLM performance, and reflective reasoning produced a further small gain under 3-shot prompting while leaving 0-shot performance essentially unchanged. As a feasibility study, these findings demonstrate the practical potential of LLM-generated summaries for supporting cancer treatment planning and motivate further investigation across broader clinical contexts.