Can VLMs Read Glioma MRIs? Building a Diagnostic Benchmark Using a Public Multi-Parametric MRI and EHR Dataset
Abstract
Purpose
The rapid advancement of Vision-Language Models (VLMs) offers transformative potential for automated radiology; however, their ability to interpret 3D brain tumor imaging remains underexplored. This study introduces a rigorous benchmark to evaluate the diagnostic proficiency of both general-purpose and medically specialized VLMs in preoperative glioma assessment. Using the high-fidelity UCSF-PDGM dataset, we aim to quantify the gap between current AI capabilities and the 3D spatial reasoning required for clinical assessment of gliomas.
Methods
We utilized 480 multi-parametric MRI volumes and corresponding expert radiology reports as ground truth. An automated NLP pipeline decomposed these reports into modality-specific observations for T1-weighted Contrast-Enhanced (T1CE) and Fluid-Attenuated Inversion Recovery (FLAIR) sequences, generating 7,657 multiple-choice question (MCQ) pairs. To bridge the dimensionality gap, 3D volumes (1 mm isotropic) were uniformly sampled into 2D axial grids (four 6×6 grids per volume). We evaluated GPT-5, Qwen3-VL-30B, and MedGemma (27B/4B) using Zero-Shot Chain-of-Thought (CoT) prompting. Each question was paired with its corresponding single-modality images to isolate perceptual accuracy for specific radiological features.
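The grid-sampling step described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact pipeline: the function name, axis ordering, and tile counts are assumptions chosen to match the description (uniformly sampled axial slices tiled into four 6×6 montages per volume).

```python
import numpy as np

def volume_to_grids(volume, n_grids=4, rows=6, cols=6):
    """Uniformly sample axial slices from a 3D MRI volume and tile
    them into 2D montages (n_grids grids of rows x cols slices each).

    volume: (depth, height, width) array, e.g. a 1 mm isotropic scan.
    Returns a list of n_grids 2D arrays of shape (rows*H, cols*W).
    Hypothetical sketch; the real pipeline's axis order and slice
    selection strategy may differ.
    """
    depth = volume.shape[0]
    n_slices = n_grids * rows * cols  # 4 * 6 * 6 = 144 slices total
    # Evenly spaced slice indices across the full volume depth
    idx = np.linspace(0, depth - 1, n_slices).round().astype(int)
    slices = volume[idx]  # (n_slices, H, W)
    grids = []
    for g in range(n_grids):
        tiles = slices[g * rows * cols:(g + 1) * rows * cols]
        # Concatenate each row of tiles horizontally, then stack rows
        row_imgs = [np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1)
                    for r in range(rows)]
        grids.append(np.concatenate(row_imgs, axis=0))
    return grids
```

Each resulting montage can then be passed to a VLM as a single 2D image alongside the corresponding MCQ prompt.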
Results
GPT-5 achieved the highest overall accuracy (67.04%), followed by MedGemma-27B (61.28%), Qwen3-VL-30B (60.05%), and MedGemma-4B (52.35%). While these results indicate that VLMs can identify macroscopic features, performance varied significantly across question categories. Notably, Lesion Characteristics, the most clinically critical category, proved most challenging for all models. Despite uniform 1mm sampling, all models struggled to synthesize cross-slice features, failing to characterize internal tumor morphology and border irregularities readily apparent to radiologists.
Conclusion
This benchmark reveals a critical dimensionality gap: current VLMs lack both 3D spatial reasoning in brain MRI and sufficient brain tumor pre-training to synthesize complex glioma phenotypes. Expert-level glioma interpretation will require native 3D architectures and large-scale VLM pre-training with brain cancer imaging and reports.