Paper Proffered Program Therapy Physics

Patient-Reported Outcome Analysis Via Zero-Shot, Few-Shot, Self-Correction, and Fine-Tuning Large Language Models

Abstract
Purpose

Monitoring patient-reported outcomes (PROs) for radiation-induced toxicities is critical for providing clinical feedback to patients undergoing radiotherapy (RT). While RT is a highly effective cancer treatment, a minority of patients still report severe symptoms due to normal tissue damage. Traditional machine learning and deep learning approaches often suffer from high retraining costs, performance deterioration, and a lack of clinical interpretability. This study addresses these challenges by proposing a set of prompting and fine-tuning strategies for Large Language Models (LLMs) to robustly classify symptoms (e.g., bowel pain, depression, and sexual dysfunction) in prostate cancer patients as either mild or severe.

Methods

The proposed framework, AD-LLM, treats mild symptoms as normal cases and severe toxicities as anomalies. Using Meta’s Llama 3.1-8B-Instruct, we implemented two experimental prompt settings: (1) Mild, providing only descriptions of mild symptoms, and (2) Mild+Severe, describing both mild and severe symptoms. We evaluated several prompt-based strategies, including zero-shot (incorporating task definitions, scoring rules, and Chain-of-Thought reasoning) and three example-injection strategies (Few-shot, Mild-shot, and Severe-shot). Additionally, we introduced a self-correction mechanism that allows the LLM to refine its strategy based on previous misclassifications. Finally, we applied Low-Rank Adaptation (LoRA) to both Llama 3.1 and Qwen 2.5-7B-Instruct to adapt the models for domain-specific PRO classification.
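One way to picture the two prompt settings, the example-injection strategies, and the self-correction step is the minimal Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the definition wording, the function names `build_prompt` and `self_correct`, the `classify` callable (a stand-in for a call to the LLM), and the idea of feeding earlier mistakes back as notes are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical symptom descriptions; the abstract does not give the actual wording.
DEFINITIONS: Dict[str, str] = {
    "mild": "Mild: the symptom is absent or interferes little with daily life.",
    "severe": "Severe: the symptom is frequent or strongly interferes with daily life.",
}


def build_prompt(
    report: str,
    setting: str = "mild+severe",
    shots: Optional[List[Tuple[str, str]]] = None,
    lessons: Optional[List[str]] = None,
) -> str:
    """Assemble a classification prompt for one patient report."""
    if setting not in ("mild", "mild+severe"):
        raise ValueError(f"unknown setting: {setting}")
    parts = ["Task: classify the patient-reported symptom as 'mild' or 'severe'."]
    parts.append(DEFINITIONS["mild"])
    if setting == "mild+severe":
        parts.append(DEFINITIONS["severe"])
    # Example injection: mixed labels give few-shot, only mild examples give
    # mild-shot, only severe examples give severe-shot, none gives zero-shot.
    for text, label in shots or []:
        parts.append(f"Example report: {text}\nLabel: {label}")
    # Assumed form of self-correction: notes about earlier misclassifications.
    for note in lessons or []:
        parts.append(f"Note from an earlier mistake: {note}")
    parts.append("Think step by step, then finish with a single-word label.")
    parts.append(f"Report: {report}\nLabel:")
    return "\n\n".join(parts)


def self_correct(
    classify: Callable[[str], str],
    labelled: List[Tuple[str, str]],
    setting: str = "mild+severe",
) -> List[str]:
    """Make one pass over labelled reports and collect notes on each mistake."""
    lessons: List[str] = []
    for report, truth in labelled:
        predicted = classify(build_prompt(report, setting, lessons=lessons))
        if predicted != truth:
            lessons.append(f"'{report}' was labelled {predicted} but is {truth}.")
    return lessons
```

In this sketch the notes returned by `self_correct` would be passed back through the `lessons` argument when classifying new reports, so the prompt carries a record of what went wrong before.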

Results

Performance depended strongly on prompt design: including both Mild and Severe definitions significantly improved the detection of severe cases, and Few-shot prompting consistently outperformed zero-shot. The self-correction mechanism improved recall for depression (0.7176) and sexual dysfunction (0.8471). LLMs fine-tuned with LoRA yielded the best overall performance. Specifically, Qwen 2.5-7B-LoRA achieved the highest AUC (0.9026/0.8299), AUCPR (0.3662/0.2517), and F1-scores for bowel pain (0.4131) and depression (0.3529), and also the top F1-score for sexual dysfunction (0.4668).
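For readers less familiar with the reported metrics, the sketch below shows how recall, F1, AUC, and AUCPR can be computed for a binary mild (0) versus severe (1) task using only the Python standard library. The function names are hypothetical, the toy labels and scores are made up for illustration and are not data from the study, and AUCPR is computed here as step-wise average precision, which is one common estimator and may differ from the one the authors used.

```python
from typing import Sequence, Tuple


def recall_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Recall and F1 for the severe class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random severe case outscores a random mild case (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (tied scores are not grouped)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one severe case is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each newly recalled severe case
    return total / n_pos


# Toy data for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
recall, f1 = recall_and_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
aucpr = average_precision(y_true, scores)
```

Because severe cases are the minority class, AUCPR and F1 are generally more sensitive to missed or spurious severe predictions than AUC, which is consistent with a high AUC sitting alongside much lower AUCPR and F1 values.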

Conclusion

This study demonstrates that LLMs with self-correction mechanisms and LoRA-based fine-tuning could provide interpretable classification of radiation-induced toxicities with improved performance.
