Real-Time Diffusion-Quality Synthetic CT Generation Via Reliability-Aware Vision Transformers for Online Adaptive Radiotherapy
Abstract
Purpose
While Denoising Diffusion Probabilistic Models (DDPMs) have set new benchmarks for synthetic CT (sCT) image quality, their prohibitive inference times hinder integration into online adaptive radiation therapy (ART) workflows. This study introduces HQ-PatchNet, a reliability-driven deep learning framework designed to resolve the trade-off between image fidelity and computational efficiency. We aim to achieve diffusion-level HU accuracy and structural preservation at real-time speeds, enabling clinically feasible daily dose recalculation for head-and-neck cancer.
Methods
We propose a Vision Transformer (ViT)-based CycleGAN architecture incorporating a novel reliability-driven selective patch learning strategy. The framework dynamically modulates loss functions using Structural Similarity Index (SSIM)-based weights via a temperature-controlled mechanism, effectively suppressing gradients from artifact-dominated regions (e.g., scatter, streaks). The model was trained on a retrospective cohort of 70 patients and compared against state-of-the-art baselines, including U-Net, CycleGAN, and a DDPM.
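The abstract does not give implementation details of the reliability-driven weighting, so the following is only a minimal illustrative sketch, assuming a PyTorch implementation, non-overlapping patches, per-patch SSIM computed against a reference CT normalized to [0, 1], and a softmax with temperature tau as the temperature-controlled mechanism. The function names (patch_ssim, reliability_weighted_l1) and parameters are hypothetical, not the authors' code.

```python
# Illustrative sketch only: reliability-weighted patch loss with SSIM-based
# weights and a softmax temperature. Names and formulation are assumptions.
import torch
import torch.nn.functional as F

def patch_ssim(x, y, patch=32, c1=1e-4, c2=9e-4):
    """Simplified SSIM per non-overlapping patch, using one set of statistics
    per patch (assumes single-channel inputs of shape [B, 1, H, W] in [0, 1])."""
    xp = F.unfold(x, kernel_size=patch, stride=patch)  # [B, patch*patch, N]
    yp = F.unfold(y, kernel_size=patch, stride=patch)
    mu_x, mu_y = xp.mean(dim=1), yp.mean(dim=1)
    var_x, var_y = xp.var(dim=1), yp.var(dim=1)
    cov = ((xp - mu_x.unsqueeze(1)) * (yp - mu_y.unsqueeze(1))).mean(dim=1)
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return ssim, xp, yp  # ssim: [B, N] reliability score per patch

def reliability_weighted_l1(sct, ref_ct, patch=32, tau=0.1):
    """Down-weight low-SSIM (artifact-dominated) patches; a lower temperature
    tau concentrates the loss on the most reliable patches."""
    ssim, xp, yp = patch_ssim(sct, ref_ct, patch=patch)
    weights = torch.softmax(ssim / tau, dim=-1)          # [B, N] patch weights
    per_patch_l1 = (xp - yp).abs().mean(dim=1)           # [B, N] L1 per patch
    return (weights * per_patch_l1).sum(dim=-1).mean()   # scalar training loss
```

In this sketch, patches whose SSIM against the reference is low contribute little gradient, which is one plausible way to realize the suppression of scatter- and streak-dominated regions described above.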
Results
HQ-PatchNet demonstrated superior performance across all quantitative metrics, improving Mean Absolute Error (MAE) from 45.02 HU (original CBCT) to 11.33 HU. Notably, it achieved an SSIM of 0.969 and a PSNR of 40.37 dB, image quality comparable to computationally intensive diffusion models (baseline DDPM SSIM: 0.948). Crucially, while diffusion-based approaches required approximately 496 seconds per volume, HQ-PatchNet completed inference in just 6.40 seconds. Qualitative assessment confirmed that HQ-PatchNet restored fine anatomical details and bone-soft tissue interfaces without the computational burden of iterative sampling.
Conclusion
HQ-PatchNet bridges the gap between high-fidelity imaging and clinical efficiency. By delivering diffusion-equivalent image quality at a fraction of the inference time, this framework provides a practical solution for CBCT-guided online ART, ensuring accurate electron density mapping and rapid decision-making in time-critical clinical environments.