Foundation Model-Guided PET Imaging for Low-Dose Acquisition and Inter-Radiotracer Translation
Abstract
Purpose
Low-count PET acquisition and inter-radiotracer translation offer effective strategies to reduce radiation dose and mitigate tracer availability constraints. Recent self-supervised learning (SSL) foundation models (e.g., DINOv3) have demonstrated a strong ability to extract high-level semantic representations in medical imaging when trained at scale; however, their application to PET imaging remains largely unexplored because large PET datasets are scarce. This study proposes a DINOv3-guided PET imaging framework for low-dose reconstruction and inter-radiotracer translation.
Methods
Two paired PET datasets were used: (1) an institutional whole-body PET dataset comprising 35 full-count scans, each paired with a simulated one-eighth-count reconstruction, and (2) an inter-radiotracer brain PET dataset from ADNI comprising 192 paired FDG and amyloid-targeted AV45 scans. A dual-encoder U-Net architecture was developed, comprising a trainable ViT-B encoder and a frozen ViT-B encoder initialized with CT-pretrained DINOv3 weights. Multi-level features from both encoders were concatenated and forwarded to a transformer-based decoder for image reconstruction. Model training used a weighted combination of L1 reconstruction loss, perceptual LPIPS loss, and adversarial GAN loss. Ablation studies were performed by removing the DINOv3 encoder branch. Image quality was quantitatively evaluated using the structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), and normalized mean absolute error (NMAE).
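As an illustration of two of the evaluation metrics named above, the following is a minimal NumPy sketch of PSNR and NMAE. The abstract does not state the normalization used, so normalizing by the dynamic range of the reference image is an assumption here, and the function names `psnr` and `nmae` are hypothetical rather than taken from the study's code.

```python
import numpy as np


def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB.

    Assumption: the peak value is the dynamic range (max - min) of the
    reference image; the study may use a different convention.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    data_range = reference.max() - reference.min()
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


def nmae(reference, estimate):
    """Mean absolute error normalized by the reference dynamic range.

    Assumption: same range-based normalization as in psnr above.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    data_range = reference.max() - reference.min()
    return float(np.mean(np.abs(reference - estimate)) / data_range)
```

Both functions take the full-count (or target-tracer) image as `reference` and the network output as `estimate`; higher PSNR and lower NMAE indicate closer agreement.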
Results
For low-count reconstruction, SSIM/PSNR/NMAE improved from 0.71/23.60 dB/0.04 to 0.77/24.84 dB/0.03 (p < 0.01, p < 0.05, and not statistically significant, respectively). For inter-radiotracer translation, SSIM/PSNR/NMAE improved from 0.63/15.97 dB/0.07 to 0.73/20.12 dB/0.04 (all p < 0.001). The proposed method also reduced noise in low-count reconstruction and preserved gray-to-white matter uptake patterns in inter-tracer translation.
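To make the size of these changes explicit, the short sketch below computes the absolute difference for each metric from the values reported above. Only the numbers come from the abstract; the dictionary layout and the names `reported` and `deltas` are illustrative assumptions.

```python
# (before, after) pairs copied from the reported results.
reported = {
    "low-count reconstruction": {
        "SSIM": (0.71, 0.77),
        "PSNR_dB": (23.60, 24.84),
        "NMAE": (0.04, 0.03),
    },
    "inter-radiotracer translation": {
        "SSIM": (0.63, 0.73),
        "PSNR_dB": (15.97, 20.12),
        "NMAE": (0.07, 0.04),
    },
}

# Absolute change per metric, rounded to the reported precision.
deltas = {
    task: {name: round(after - before, 2) for name, (before, after) in metrics.items()}
    for task, metrics in reported.items()
}

for task, metrics in deltas.items():
    print(task, metrics)
```

Under this arithmetic, the PSNR gain is 1.24 dB for low-count reconstruction and 4.15 dB for inter-radiotracer translation, so the larger benefit appears in the translation task.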
Conclusion
This work presents the first foundation-model-guided framework for PET imaging across multiple tasks. Results demonstrate that transferable semantic representations derived from non-PET-pretrained encoders can significantly improve supervised PET image synthesis performance, providing a promising pathway toward general-purpose, multi-task PET foundation models.