Can VLMs Read Glioma MRIs? Building a Diagnostic Benchmark Using a Public Multi-Parametric MRI and EHR Dataset
Abstract
Purpose
The rapid advancement of Vision-Language Models (VLMs) offers transformative potential for automated radiology; however, their ability to interpret 3D brain tumor imaging remains underexplored. This study introduces a rigorous benchmark to evaluate the diagnostic proficiency of both general-purpose and medically specialized VLMs in preoperative glioma assessment. Using the high-fidelity UCSF-PDGM dataset, we aim to quantify the gap between current AI capabilities and the 3D spatial reasoning required for clinical assessment of gliomas.
Methods
We utilized 480 multi-parametric MRI volumes and corresponding expert radiology reports as ground truth. An automated NLP pipeline decomposed these reports into modality-specific observations for T1-weighted Contrast-Enhanced (T1CE) and Fluid-Attenuated Inversion Recovery (FLAIR) sequences, generating 7,657 multiple-choice question (MCQ) pairs. To bridge the dimensionality gap, 3D volumes (1 mm isotropic) were uniformly sampled into 2D axial grids (four 6×6 grids per volume). We evaluated GPT-5, Qwen3-VL-30B, and MedGemma (27B/4B) using Zero-Shot Chain-of-Thought (CoT) prompting. Each question was paired with its corresponding single-modality images to isolate perceptual accuracy for specific radiological features.
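The grid-sampling step described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact pipeline: the function name, axis ordering, and tile counts are assumptions chosen to match the description (uniformly sampled axial slices tiled into four 6×6 montages per volume).

```python
import numpy as np

def volume_to_grids(volume, n_grids=4, rows=6, cols=6):
    """Uniformly sample axial slices from a 3D MRI volume and tile
    them into 2D montages (n_grids grids of rows x cols slices each).

    volume: (depth, height, width) array, e.g. a 1 mm isotropic scan.
    Returns a list of n_grids 2D arrays of shape (rows*H, cols*W).
    Hypothetical sketch; the real pipeline's axis order and slice
    selection strategy may differ.
    """
    depth = volume.shape[0]
    n_slices = n_grids * rows * cols  # 4 * 6 * 6 = 144 slices total
    # Evenly spaced slice indices across the full volume depth
    idx = np.linspace(0, depth - 1, n_slices).round().astype(int)
    slices = volume[idx]  # (n_slices, H, W)
    grids = []
    for g in range(n_grids):
        tiles = slices[g * rows * cols:(g + 1) * rows * cols]
        # Concatenate each row of tiles horizontally, then stack rows
        row_imgs = [np.concatenate(tiles[r * cols:(r + 1) * cols], axis=1)
                    for r in range(rows)]
        grids.append(np.concatenate(row_imgs, axis=0))
    return grids
```

Each resulting montage can then be passed to a VLM as a single 2D image alongside the corresponding MCQ prompt.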
Results
GPT-5 achieved the highest overall accuracy (67.04%), followed by MedGemma-27B (61.28%), Qwen3-VL-30B (60.05%), and MedGemma-4B (52.35%). While these results indicate that VLMs can identify macroscopic features, performance varied significantly across question categories. Notably, Lesion Characteristics, the most clinically critical category, proved most challenging for all models. Despite uniform 1mm sampling, all models struggled to synthesize cross-slice features, failing to characterize internal tumor morphology and border irregularities readily apparent to radiologists.
Conclusion
This benchmark reveals a critical dimensionality gap: current VLMs lack both 3D spatial reasoning in brain MRI and sufficient brain tumor pre-training to synthesize complex glioma phenotypes. Expert-level glioma interpretation will require native 3D architectures and large-scale VLM pre-training with brain cancer imaging and reports.