Novel LLM-Based Cardiotoxicity Extraction from Electronic Health Records
Abstract
Purpose
Cardiotoxicity assessment following lung cancer radiotherapy is limited by the difficulty of extracting outcomes from unstructured Electronic Health Records (EHRs). Large Language Models (LLMs) offer a scalable solution, but privacy constraints necessitate locally hosted, open-source deployments. This work evaluated whether integrating structured screening with optimized prompting enables accurate cardiotoxicity surveillance using locally hosted, open-source LLMs. We compared strategies prioritizing evidence-based interpretability against those optimized for operational robustness.
Methods
Epic EHR data from 92 lung cancer patients were analyzed, including structured problem lists and unstructured physician notes. Six open-source LLMs were deployed locally via Ollama. A structured problem-list screen provided initial cardiotoxicity triage; the notes of the remaining patients underwent specialty- and cardiac-keyword-based prefiltering to reduce input length. Two prompting strategies were compared. Algorithm A used explicit chain-of-thought (CoT) prompting with structured evidence extraction, exposing step-by-step reasoning to support interpretability. Algorithm B applied context-specific prompts with implicit CoT reasoning, performing reasoning internally without intermediate outputs and generating binary classifications to enhance robustness. Performance was evaluated using F1 score and parsing success rate.
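The keyword-based note prefiltering step described above can be sketched as follows. This is a minimal illustration only: the keyword lexicon, function name, and example notes are assumptions for demonstration, not the study's actual implementation.

```python
import re

# Hypothetical cardiac keyword list; the study's actual lexicon is not given.
CARDIAC_KEYWORDS = [
    "pericardial effusion", "heart failure", "atrial fibrillation",
    "myocardial infarction", "cardiomyopathy", "arrhythmia", "pericarditis",
]

def prefilter_notes(notes, keywords=CARDIAC_KEYWORDS):
    """Keep only notes mentioning at least one cardiac keyword,
    reducing the text length passed to the LLM classifier."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [note for note in notes if pattern.search(note)]

notes = [
    "Patient denies chest pain; lungs clear.",
    "New pericardial effusion noted on echo.",
    "Follow-up for hypertension management.",
]
kept = prefilter_notes(notes)  # only the second note survives the filter
```

In practice such a filter trades recall for input-length reduction, which is why the study applies it only after the structured problem-list screen.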
Results
Of the 92 patients, 45 had physician-confirmed cardiotoxicity. Problem-list screening identified 22 cases (23.9%); the remaining patients required note-based classification. Under Algorithm A, GPT-OSS achieved the highest performance (F1 = 0.916) with a parsing success rate of 97.8%, followed by DeepSeek-R1 (F1 = 0.828; parsing success = 96.7%). Across all six models, parsing reliability ranged from 93.5% to 97.8%. Algorithm B showed slightly lower peak performance (GPT-OSS F1 = 0.889) but superior robustness, achieving 100% parsing success in five of the six models. Larger general-purpose LLMs consistently outperformed domain-specific medical models (F1 < 0.75).
Conclusion
Explicit CoT prompting improves interpretability but reduces robustness, whereas implicit CoT prompting preserves both performance and parsing reliability. Open-source LLMs enable reasonably accurate cardiotoxicity extraction, with clinical deployability driven more by prompt architecture and output constraints than by model specialization alone.