Fluoroscopy/IR Skin-Reaction Follow-up Handout: Constrained LLM Rewrites Improve Readability with Preserved Expert-Rated Fidelity and Actionability
Abstract
Purpose
Patient instructions for radiation-related skin effects after fluoroscopy-guided interventional procedures are integral to substantial-dose follow-up programs. Patient-facing materials often exceed health-literacy targets, risking reduced comprehension and adherence. We evaluated whether a constrained large language model (LLM) workflow could improve readability of a patient handout while preserving safety-critical content and actionability.
Methods
HF#6089 served as the source text. Three independent LLM rewrites (V1-V3) were generated under strict constraints (source-text-only; no new clinical claims; missing safety information flagged as [NEEDS CLINICIAN INPUT]) using readability-, actionability-, and fidelity-focused prompts. Readability was quantified using Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), and Gunning Fog Index (GFI). A multidisciplinary panel (two diagnostic medical physicists, one interventional radiologist, and one data scientist working in IR physics), blinded to the identity of each rewrite, rated each rewrite for technical fidelity/appropriateness (Fully, Somewhat, or Not appropriate) and completed a predefined 10-item actionability checklist derived from HF#6089. The corresponding author did not participate in scoring. Rater agreement was summarized using rating distributions and checklist score ranges.
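For readers unfamiliar with the three indices, the sketch below writes out their standard published formulas. It takes raw counts as inputs because the abstract does not state which software or syllable-counting rule was used; the function names and the example counts are illustrative assumptions and are not taken from HF#6089.

```python
def flesch_reading_ease(words, sentences, syllables):
    """FRE: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """FKGL: approximate US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """GFI: complex_words are words of three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not measured from the handout).
    counts = {"words": 100, "sentences": 10, "syllables": 130, "complex_words": 5}
    print("FRE ", flesch_reading_ease(counts["words"], counts["sentences"], counts["syllables"]))
    print("FKGL", flesch_kincaid_grade(counts["words"], counts["sentences"], counts["syllables"]))
    print("GFI ", gunning_fog(counts["words"], counts["sentences"], counts["complex_words"]))
```

All three indices depend only on sentence length and word complexity, so shorter sentences and fewer polysyllabic words lower FKGL and GFI and raise FRE. Different tools count syllables and sentences slightly differently, so absolute scores can vary between implementations.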
Results
Baseline HF#6089 readability was FKGL 4.68, FRE 80.70, and GFI 7.06. All three rewrites (V1-V3) improved on baseline: FKGL decreased to 3.48-4.15, FRE increased to 81.97-83.38, and GFI decreased to 5.75-6.67. Relative to baseline, the mean change (+/- SD) across the three rewrites was Delta FKGL -0.76 +/- 0.38, Delta FRE +2.00 +/- 0.71, and Delta GFI -0.77 +/- 0.48. All four raters judged V1 and V3 Fully appropriate; V2 was rated Fully appropriate by three raters and Somewhat appropriate by one. Actionability checklist totals were 10/10 for V1 and V3 across all raters; V2 averaged 9.5/10 (range 8-10).
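The reported summary statistics can be checked for internal consistency. With three rewrites, the reported mean change and the reported range together imply the third (middle) score, from which the SD of the change can be recomputed. The sketch below does this under two assumptions that the abstract does not state: that the SD is the sample SD (n - 1 denominator) and that the range endpoints are the scores of two of the three rewrites. The implied middle values are inferences, not reported data, and the helper names are hypothetical.

```python
from statistics import stdev

# Reported figures: (baseline, mean change, lowest score, highest score, reported SD)
REPORTED = {
    "FKGL": (4.68, -0.76, 3.48, 4.15, 0.38),
    "FRE": (80.70, 2.00, 81.97, 83.38, 0.71),
    "GFI": (7.06, -0.77, 5.75, 6.67, 0.48),
}


def implied_middle(baseline, mean_delta, lo, hi, n=3):
    """Third score implied by the mean change and the range (assumes n = 3)."""
    mean_score = baseline + mean_delta
    return n * mean_score - lo - hi


def delta_sd(baseline, scores):
    """Sample SD of the change from baseline (assumption: n - 1 denominator)."""
    return stdev([s - baseline for s in scores])


if __name__ == "__main__":
    for metric, (base, mean_delta, lo, hi, sd_reported) in REPORTED.items():
        mid = implied_middle(base, mean_delta, lo, hi)
        sd = delta_sd(base, [lo, mid, hi])
        print(f"{metric}: implied middle score {mid:.2f}, "
              f"recomputed SD {sd:.2f}, reported SD {sd_reported:.2f}")
```

Under these assumptions the recomputed SDs agree with the reported values to two decimals, which is consistent with the SD having been calculated across the three rewrites rather than across raters.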
Conclusion
A constrained, source-text-only LLM workflow produced readability gains while preserving expert-rated appropriateness and high actionability. Prompt intent influenced performance, with readability- and fidelity-focused prompting yielding the most consistent balance. These findings support the use of constrained LLM prompting to refine substantial-dose follow-up materials without adding new clinical claims.