Novel LLM-Based Cardiotoxicity Extraction from Electronic Health Records
Abstract
Purpose
Cardiotoxicity assessment following lung cancer radiotherapy is limited by the difficulty of extracting outcomes from unstructured Electronic Health Records (EHRs). Large Language Models (LLMs) offer a scalable solution, but privacy constraints necessitate locally hosted, open-source deployments. This work evaluated whether integrating structured screening with optimized prompting enables accurate cardiotoxicity surveillance using locally hosted, open-source LLMs. We compared strategies prioritizing evidence-based interpretability against those optimized for operational robustness.
Methods
Epic EHR data from 92 lung cancer patients were analyzed, including structured problem lists and unstructured physician notes. Six open-source LLMs were deployed locally via Ollama. A structured problem-list screen provided initial cardiotoxicity triage; the notes of the remaining patients underwent specialty- and cardiac-keyword-based prefiltering to reduce input length. Two prompting strategies were compared. Algorithm A used explicit chain-of-thought (CoT) prompting with structured evidence extraction, exposing step-by-step reasoning to support interpretability. Algorithm B applied context-specific prompts with implicit CoT reasoning, performing reasoning internally without intermediate outputs and generating binary classifications to enhance robustness. Performance was evaluated using F1 score and parsing success rate.
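The keyword-based note prefiltering step described above can be sketched as follows. This is a minimal illustration only: the keyword lexicon, function name, and example notes are assumptions for demonstration, not the study's actual implementation.

```python
import re

# Hypothetical cardiac keyword list; the study's actual lexicon is not given.
CARDIAC_KEYWORDS = [
    "pericardial effusion", "heart failure", "atrial fibrillation",
    "myocardial infarction", "cardiomyopathy", "arrhythmia", "pericarditis",
]

def prefilter_notes(notes, keywords=CARDIAC_KEYWORDS):
    """Keep only notes mentioning at least one cardiac keyword,
    reducing the text length passed to the LLM classifier."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [note for note in notes if pattern.search(note)]

notes = [
    "Patient denies chest pain; lungs clear.",
    "New pericardial effusion noted on echo.",
    "Follow-up for hypertension management.",
]
kept = prefilter_notes(notes)  # only the second note survives the filter
```

In practice such a filter trades recall for input-length reduction, which is why the study applies it only after the structured problem-list screen.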
Results
Of the 92 patients, 45 had physician-confirmed cardiotoxicity. Problem-list screening identified 22 cases (23.9%); the remaining patients required note-based classification. Under Algorithm A, GPT-OSS achieved the highest performance (F1 = 0.916) with a parsing success rate of 97.8%, followed by DeepSeek-R1 (F1 = 0.828; parsing success = 96.7%). Across all six models, parsing reliability ranged from 93.5% to 97.8%. Algorithm B showed slightly lower peak performance (GPT-OSS F1 = 0.889) but superior robustness, achieving 100% parsing success in five of the six models. Larger general-purpose LLMs consistently outperformed domain-specific medical models (F1 < 0.75).
Conclusion
Explicit CoT prompting improves interpretability but reduces robustness, whereas implicit CoT prompting preserves both performance and parsing reliability. Open-source LLMs enable reasonably accurate cardiotoxicity extraction, with clinical deployability driven more by prompt architecture and output constraints than by model specialization alone.