Evaluating a Vision Language Model for Autonomous Offline Adaptive Radiotherapy Decision Support in Head and Neck Cancers
Abstract
Purpose
Current automated offline triggers for adaptive radiotherapy often function as black boxes and do not provide the reasoning behind a decision. Vision language models (VLMs) offer a path toward explainability by exposing the rationale behind each decision. This study investigates an autonomous framework that uses a VLM to flag candidates for offline adaptive radiotherapy in head and neck (HN) cancer.
Methods
An autonomous framework was designed to query the oncology information system for volumetric cone-beam computed tomography (CBCT) images and longitudinal patient-specific clinical covariates, including weekly weight and image reconstruction diameter. For every fraction, the system automatically extracted three axial slices from the daily pretreatment CBCT and from the baseline (first) CBCT. Slices were taken at the isocenter and at 18 mm superior and 18 mm inferior to it along the longitudinal axis. These inputs were combined into a templated prompt instructing the Llama4-Maverick VLM to quantify external anatomical changes and to generate a binary adaptation decision based on a 1 cm threshold. The model was run with a sampling temperature of 0.5. The mean adaptation trigger point determined by the model was compared with the historical clinical mean using a paired two-tailed t-test.
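The slice selection and threshold rule described above can be illustrated with a minimal sketch. This is not the study's implementation: in the framework the VLM itself produces the decision, whereas the code below is a deterministic stand-in. The function names, the millimetre coordinate convention, the "any slice at or above threshold" rule, and the example numbers are all assumptions made for illustration; only the 18 mm offset and the 1 cm threshold come from the Methods.

```python
SLICE_OFFSET_MM = 18.0  # from the Methods: isocenter and 18 mm superior/inferior
THRESHOLD_MM = 10.0     # from the Methods: 1 cm adaptation threshold


def slice_indices(iso_z_mm, first_slice_z_mm, slice_spacing_mm, n_slices):
    """Return indices of the inferior, isocenter, and superior axial slices.

    Hypothetical helper: assumes slices are evenly spaced along z and that
    index 0 sits at first_slice_z_mm.
    """
    indices = []
    for offset in (-SLICE_OFFSET_MM, 0.0, SLICE_OFFSET_MM):
        idx = round((iso_z_mm + offset - first_slice_z_mm) / slice_spacing_mm)
        if not 0 <= idx < n_slices:
            raise ValueError(f"slice at offset {offset} mm lies outside the volume")
        indices.append(idx)
    return indices


def flag_adaptation(changes_mm):
    """Assumed rule: flag when any slice's external change reaches 1 cm."""
    return any(abs(change) >= THRESHOLD_MM for change in changes_mm)


def first_flagged_fraction(per_fraction_changes_mm):
    """Return the 1-based fraction of the first flag, or None if never flagged."""
    for fraction, changes in enumerate(per_fraction_changes_mm, start=1):
        if flag_adaptation(changes):
            return fraction
    return None


if __name__ == "__main__":
    # Illustrative geometry: 101 slices, 2 mm spacing, isocenter at z = 0 mm.
    print(slice_indices(0.0, -100.0, 2.0, 101))
    # Illustrative per-fraction external changes (mm) on the three slices.
    course = [[1.0, 2.0, 1.5], [4.0, 6.0, 5.0], [8.0, 11.0, 7.0]]
    print(first_flagged_fraction(course))
```

Returning `None` for a course that never reaches the threshold mirrors the possibility that a patient is not flagged at all, which keeps such cases out of any mean trigger-point calculation.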
Results
This study included 60 patients treated with a 35-fraction radiotherapy course. The system observed progressively larger external contour differences as the treatment course progressed. It flagged adaptation after 18.4 ± 1.3 fractions, significantly earlier than the historical clinical mean of 23.8 ± 0.4 fractions (p < 0.001). Seven patients were not flagged for adaptation by the VLM.
Conclusion
This framework establishes the feasibility of a VLM-based system as an explainable automated surveillance tool. Although the model triggered adaptation significantly earlier than the historical data, correlating these geometric flags with dosimetric impact remains essential to confirm specificity and to guide prompt refinement for clinical implementation.