Paper Proffered Program Therapy Physics

A Multimodal Foundation Model for Pediatric Multiparametric MRI Synthesis

Abstract

Purpose

Multiparametric brain MRI is essential for delineating pediatric brain tumor subregions; however, long acquisition times often preclude complete multi-contrast imaging in pediatric patients, and large public pediatric lesion datasets remain scarce. We propose a multimodal foundation-model framework that synthesizes missing MRI sequences for pediatric patients from a single input sequence, conditioned on acquisition-metadata text prompts and tumor segmentation maps.

Methods

We used a subset of the BraTS-PEDs dataset (112 subjects; 87/5/20 train/validation/test) containing T1-weighted (T1w), T2-weighted (T2w), and FLAIR images, tumor masks, and acquisition metadata (demographics, scanner and field strength, voxel size, and sequence parameters including TR/TE/TI/FA). Starting from a pretrained TUMSyn checkpoint, the model was fine-tuned for 100 epochs across all ordered modality pairs (six synthesis tasks) using balanced sampling, while keeping the pretrained text encoder frozen. Acquisition metadata were encoded as text prompts, and tumor segmentation masks were concatenated as an additional input channel. Performance was evaluated using whole-volume and tumor-region PSNR and SSIM. The proposed model was compared with ablation models (without metadata or segmentation conditioning) and a zero-shot TUMSyn baseline using paired t-tests with Holm correction.
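To make the task setup concrete, the following is a minimal sketch of how the six ordered synthesis tasks can be enumerated and how acquisition metadata might be rendered as a text prompt. The prompt template, the dictionary keys, and the example values are assumptions for illustration only; the abstract does not specify the actual prompt format used with TUMSyn.

```python
from itertools import permutations

# The three contrasts named in the abstract.
MODALITIES = ("T1w", "T2w", "FLAIR")


def synthesis_tasks(modalities=MODALITIES):
    """Return every ordered (source, target) pair of distinct modalities."""
    return list(permutations(modalities, 2))


def build_prompt(meta):
    """Render a metadata dict as a single text prompt.

    The template and key names are hypothetical; they only illustrate
    the idea of encoding acquisition metadata as text.
    """
    return (
        f"Age: {meta['age_years']} years; sex: {meta['sex']}; "
        f"scanner: {meta['scanner']} at {meta['field_strength_t']} T; "
        f"voxel size: {meta['voxel_mm']} mm; "
        f"TR/TE/TI: {meta['tr_ms']}/{meta['te_ms']}/{meta['ti_ms']} ms; "
        f"flip angle: {meta['fa_deg']} degrees"
    )


# Illustrative values only, not taken from the dataset.
example_meta = {
    "age_years": 8,
    "sex": "F",
    "scanner": "ExampleVendor",
    "field_strength_t": 3.0,
    "voxel_mm": "1.0 x 1.0 x 1.0",
    "tr_ms": 9000,
    "te_ms": 120,
    "ti_ms": 2500,
    "fa_deg": 90,
}

tasks = synthesis_tasks()
prompt = build_prompt(example_meta)
```

With three modalities, `synthesis_tasks()` yields 3 × 2 = 6 ordered pairs, which matches the six synthesis tasks stated above.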

Results

The proposed method achieved mean whole-volume PSNR/SSIM of 21.8 dB/0.900 and mean tumor-region PSNR/SSIM of 14.2 dB/0.839 across the six synthesis tasks. Removing acquisition-metadata conditioning reduced performance to 20.1 dB/0.885 (tumor region: 12.7 dB/0.821), while zero-shot TUMSyn further degraded performance to 19.1 dB/0.865 (tumor PSNR: 11.9 dB). For the representative T1w→FLAIR task, fine-tuning with metadata achieved PSNR/SSIM of 22.7 dB/0.914 compared with 21.0 dB/0.875 without metadata (p=0.007 and p<0.01, respectively).
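For readers less familiar with the reported metrics, the sketch below shows one common way to compute whole-volume and mask-restricted PSNR, along with the Holm step-down adjustment of p-values. It assumes intensities normalized to [0, 1] (`data_range=1.0`) and flat voxel lists; the function names are illustrative, and the abstract does not state the exact normalization or implementation used.

```python
import math


def psnr(reference, synthesized, mask=None, data_range=1.0):
    """PSNR in dB over flat voxel lists.

    If a mask is given, only voxels with a truthy mask value contribute,
    which gives a region-restricted (e.g. tumor-region) PSNR.
    """
    if mask is None:
        mask = [1] * len(reference)
    sq_errors = [
        (r - s) ** 2 for r, s, m in zip(reference, synthesized, mask) if m
    ]
    if not sq_errors:
        raise ValueError("mask selects no voxels")
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy example: all of the error sits inside the masked region.
reference = [0.0, 0.0, 1.0, 1.0]
synthesized = [0.0, 0.0, 0.9, 0.9]
tumor_mask = [0, 0, 1, 1]

whole_volume_db = psnr(reference, synthesized)
tumor_region_db = psnr(reference, synthesized, mask=tumor_mask)
```

In this toy case the masked PSNR is lower than the whole-volume PSNR because the error-free background voxels no longer dilute the mean squared error. A similar effect may contribute to the gap between whole-volume and tumor-region values reported above.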

Conclusion

This work presents a unified, multi-task multimodal foundation model for synthesizing missing pediatric brain tumor MRI sequences. Conditioning on acquisition and demographic metadata, together with segmentation context, improves tumor-region fidelity and may help preserve clinically relevant multiparametric information for tumor delineation and treatment planning when complete multi-sequence acquisitions are not feasible.

People