Staged Alignment of Decoder Large Language Models for Oncology Tasks Via Radiology-Pathology Note Pairing
Abstract
Purpose
Decoder large language models (LLMs) such as ChatGPT and Claude show strong text understanding in general domains but remain under-aligned to oncology-specific clinical context. This study aims to design and validate a staged alignment pipeline that enables decoder-based LLMs to produce progressively refined, oncology-relevant representations.
Methods
We designed a multi-stage alignment pipeline that uses LLM2Vec to repurpose Llama-3.1-8B as a clinical text embedder, trained on 97,398 matched radiology-pathology reports covering nine brain tumor types. Stage 1 used masked next-token prediction to adapt the decoder to bidirectional attention. Stage 2 applied Simple Contrastive Learning of Sentence Embeddings (SimCSE), a self-supervised contrastive learning method, to unlabeled reports. Stage 3 performed supervised contrastive alignment on paired reports to learn oncology-relevant representations. We evaluated the embeddings with 5-fold cross-validation on three predictive tasks: tumor type classification (n=4,046), MGMT promoter methylation prediction (n=201), and one-year survival prediction (n=539), with zero-shot GPT-4o as the decoder LLM baseline.
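Stages 2 and 3 both rest on a contrastive objective: embeddings of a matched pair are pulled together while the other items in the batch are pushed apart. The abstract does not state the exact loss, so the following is only a minimal sketch, assuming an InfoNCE loss with in-batch negatives (the form SimCSE uses) applied to radiology-pathology pairs. The function names, the temperature value, and the two-dimensional toy vectors are illustrative assumptions, not the study's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the matched report for anchors[i]; every other
    entry of `positives` serves as an in-batch negative for anchor i.
    The temperature is an assumed value, not one reported in the text.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matched report as the correct "class".
        total += log_denom - logits[i]
    return total / len(a)


# Toy stand-ins for report embeddings (hypothetical values).
radiology = [[1.0, 0.0], [0.0, 1.0]]
pathology_matched = [[0.9, 0.1], [0.1, 0.9]]
pathology_shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = info_nce(radiology, pathology_matched)
loss_shuffled = info_nce(radiology, pathology_shuffled)
print(loss_matched < loss_shuffled)
```

In this toy batch the correctly paired reports yield a lower loss than the shuffled pairing, which is the signal a supervised pairing stage would use to align the two report types.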
Results
Across all endpoints, performance improved from Stage 1 through Stage 3, with all stages substantially outperforming zero-shot GPT-4o. For tumor type classification, accuracy / F1-macro / AUROC rose from 61.5% / 49.0% / 81.4% at Stage 1 to 73.4% / 59.4% / 90.5% at Stage 2 and reached 87.2% / 86.1% / 96.2% at Stage 3, versus 85.5% / 84.0% / 35.1% for GPT-4o. For one-year survival prediction, Stage 2 achieved 71.8% accuracy, 69.4% F1-macro, and 79.5% AUROC, compared with 71.1% / 56.4% / 30.0% for GPT-4o. For MGMT promoter methylation prediction, Stage 3 reached 77.1% accuracy, 60.0% F1-macro, and 76.1% AUROC, far exceeding GPT-4o (45.6% / 33.1% / 54.5%).
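The three reported metrics respond differently to class balance and to score ranking. As a minimal sketch, assuming the standard definitions (macro-averaged F1 over all observed classes, and binary AUROC as the probability that a positive case is scored above a negative one), they can be computed as follows. The helper names and toy labels are illustrative assumptions and do not reproduce the study's data or evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def auroc_binary(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced labels: 7 positives, 3 negatives.
labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
always_positive = [1] * 10

print(accuracy(labels, always_positive))
print(f1_macro(labels, always_positive))
print(auroc_binary([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))
```

On these toy labels, a predictor that always outputs the majority class scores 0.7 accuracy but a much lower macro F1, because the minority class contributes an F1 of zero. An AUROC below 0.5 means the scores rank negatives above positives more often than the reverse.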
Conclusion
The fully aligned decoder achieved strong downstream performance. The consistent stage-wise gains demonstrate both the efficacy of the multi-stage pipeline design and the value of radiology-pathology matching as an alignment task for adapting generative LLMs to oncology-specific predictive tasks.