Patient-Reported Outcome Analysis Via Zero-Shot, Few-Shot, Self-Correction, and Fine-Tuning Large Language Models
Abstract
Purpose
Monitoring patient-reported outcomes (PROs) for radiation-induced toxicities is critical for providing clinical feedback for patients undergoing radiotherapy (RT). Although RT is a highly effective cancer treatment, a minority of patients still report severe symptoms caused by normal tissue damage. Traditional machine learning and deep learning approaches often suffer from high retraining costs, performance deterioration, and a lack of clinical interpretability. This study addresses these challenges by proposing a set of prompting and fine-tuning strategies for Large Language Models (LLMs) to robustly classify symptoms (e.g., bowel pain, depression, and sexual dysfunction) in prostate cancer patients as either mild or severe.
Methods
The proposed framework, AD-LLM, treats mild symptoms as normal cases and severe toxicities as anomalies. Using Meta's Llama 3.1-8B-Instruct, we implemented two experimental prompt settings: (1) Mild, which provides only descriptions of mild symptoms, and (2) Mild+Severe, which describes both mild and severe symptoms. We evaluated several prompt-based strategies, including zero-shot prompting (incorporating task definitions, scoring rules, and Chain-of-Thought reasoning) and three example-injection strategies (Few-shot, Mild-shot, and Severe-shot). Additionally, we introduced a self-correction mechanism that allows the LLM to refine its strategy based on previous misclassifications. Finally, we applied Low-Rank Adaptation (LoRA) to both Llama 3.1 and Qwen 2.5-7B-Instruct to adapt the models for domain-specific PRO classification.
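To make the prompt settings concrete, the following is a minimal sketch of how a Mild or Mild+Severe prompt with optional injected examples and self-correction notes might be assembled. It is an illustration only: the function name `build_prompt`, the definition texts, and the prompt layout are hypothetical and are not taken from the study, whose actual prompts are not given here.

```python
from typing import List, Optional, Tuple

# Hypothetical definition texts; the study's actual wording is not shown in the abstract.
MILD_DEF = "Mild: the symptom is absent or causes little interference with daily life."
SEVERE_DEF = "Severe: the symptom is frequent or substantially interferes with daily life."


def build_prompt(
    symptom: str,
    report: str,
    setting: str = "mild+severe",
    examples: Optional[List[Tuple[str, str]]] = None,
    lessons: Optional[List[str]] = None,
) -> str:
    """Assemble a classification prompt (illustrative layout, not the authors')."""
    if setting not in ("mild", "mild+severe"):
        raise ValueError(f"unknown setting: {setting}")

    # Task definition plus the mild description, which both settings include.
    parts = [f"Task: classify the patient's {symptom} as 'mild' or 'severe'.", MILD_DEF]

    # Only the Mild+Severe setting also describes severe symptoms.
    if setting == "mild+severe":
        parts.append(SEVERE_DEF)

    # Example injection: pass mixed examples (Few-shot), only mild ones
    # (Mild-shot), or only severe ones (Severe-shot).
    for text, label in examples or []:
        parts.append(f"Example report: {text}\nLabel: {label}")

    # Self-correction (assumed form): notes derived from earlier misclassifications.
    if lessons:
        parts.append("Notes from earlier misclassifications:")
        parts.extend(f"- {note}" for note in lessons)

    # Chain-of-Thought style instruction, then the report to classify.
    parts.append("Think step by step, then give the final label.")
    parts.append(f"Report: {report}\nLabel:")
    return "\n\n".join(parts)
```

Under these assumptions, a zero-shot prompt is `build_prompt("bowel pain", report_text)`, and a Severe-shot prompt passes only severe examples through `examples`.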
Results
Performance depended strongly on prompt design: including both Mild and Severe definitions significantly improved the detection of severe cases, and Few-shot prompting consistently outperformed zero-shot prompting. The self-correction mechanism improved recall for depression (0.7176) and sexual dysfunction (0.8471). LLMs fine-tuned with LoRA yielded the best overall performance. Specifically, Qwen 2.5-7B-LoRA achieved the highest AUC (0.9026/0.8299), AUCPR (0.3662/0.2517), and F1-scores for bowel pain (0.4131) and depression (0.3529), and also attained the top F1-score for sexual dysfunction (0.4668).
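For readers less familiar with the reported metrics, the sketch below shows how recall, precision, F1, and ROC AUC are computed when severe cases are treated as the positive (anomalous) class. The helper names are hypothetical, the code is not the study's evaluation pipeline, and any inputs used with it here are made-up toy values rather than study data.

```python
from typing import List, Tuple


def recall_precision_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Recall, precision, and F1 with label 1 = severe (positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """ROC AUC as the probability that a random severe case outscores a
    random mild case, counting ties as one half (rank-based definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because F1 and AUCPR depend on precision, they are sensitive to how rare the positive class is, whereas ROC AUC is not; the two kinds of metric can therefore differ considerably on the same predictions.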
Conclusion
This study demonstrates that LLMs with self-correction mechanisms and LoRA-based fine-tuning can provide interpretable classification of radiation-induced toxicities with improved performance.