BLUE RIBBON POSTER THERAPY: Advanced Conditioning Strategies for Diffusion Models for Radiation Therapy Dose Prediction
Abstract
Purpose
The purpose of this study is to improve the performance of diffusion-based neural networks for radiation therapy dose prediction by conditioning them on representations from pre-trained models. Conditioning on a 3D encoder enforces volumetric consistency across slice-wise 2D predictions, while conditioning on a pre-trained segmentation model reduces the need for manual annotation of anatomical structures.
Methods
Our method predicts 3D dose volumes by concatenating 2D slices generated through a reverse diffusion process using a Denoising Diffusion Probabilistic Model (DDPM). The diffusion process is conditioned on multiple sources of contextual information. Slice-level context from CT and planning target volume (PTV) is extracted using a 2D structure encoder and concatenated with the network input at multiple stages. Anatomical context is extracted using a pre-trained MedSAM segmentation encoder and incorporated via cross-attention. To enforce global spatial consistency, volumetric context is introduced using a 3D Variational Autoencoder (VAE) pre-trained on 3D CT reconstruction, together with a learnable slice-position embedding applied through adaptive group normalization. The network is trained to predict the noise at each diffusion step.
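The noise-prediction objective described above can be illustrated with a minimal NumPy sketch of the standard DDPM training target. It is not the authors' implementation: the linear variance schedule, the 1000-step horizon, the array shapes, and the names `linear_beta_schedule`, `q_sample`, `noise_prediction_loss`, and `zero_predictor` are all assumptions made for illustration, and the conditional network is replaced by a placeholder that returns zeros.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (assumed; the abstract does not state one)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, noise):
    """Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def noise_prediction_loss(predict_noise, x0, cond, t, alpha_bar, rng):
    """Mean squared error between the sampled noise and the predicted noise at step t."""
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bar, noise)
    return float(np.mean((predict_noise(x_t, t, cond) - noise) ** 2))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - linear_beta_schedule(1000))

dose_slice = rng.random((64, 64))    # stand-in for a normalised 2D dose slice
context = rng.random((2, 64, 64))    # stand-in for the CT and PTV channels

# Hypothetical placeholder for the conditional denoising network.
def zero_predictor(x_t, t, cond):
    return np.zeros_like(x_t)

loss = noise_prediction_loss(zero_predictor, dose_slice, context, 500, alpha_bar, rng)
```

In an actual model, `zero_predictor` would be the conditional denoising network, and `cond` would carry the structure, segmentation, and volumetric context rather than raw arrays.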
Results
We evaluated the proposed model on an in-house lung cancer radiotherapy dataset consisting of 115 treatment plans. The model outperformed the comparison methods across multiple metrics, achieving the lowest Mean Absolute Error (3.374 ± 0.158 Gy) and Volume Score (4.330 ± 0.473 %). It preserved high-frequency details and accurately captured the directional characteristics of radiation beams. Ablation studies showed that the proposed encoders consistently improved the base diffusion model. While the segmentation encoder could compensate for the absence of expert-annotated organ-at-risk (OAR) masks, providing these masks further improved performance.
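For readers unfamiliar with the Mean Absolute Error metric, the following sketch shows one common way to compute it between a predicted and a reference dose volume. The optional mask, the toy volumes, and the function name `mean_absolute_error` are assumptions for illustration; the abstract does not specify the evaluation region or how the reported spread was obtained.

```python
import numpy as np

def mean_absolute_error(pred_dose, ref_dose, mask=None):
    """Mean absolute voxel-wise dose difference (Gy), optionally restricted to a boolean mask."""
    diff = np.abs(np.asarray(pred_dose, dtype=float) - np.asarray(ref_dose, dtype=float))
    if mask is not None:
        diff = diff[np.asarray(mask, dtype=bool)]
    return float(diff.mean())

# Toy volumes: 4 slices of 8 x 8 voxels, doses in Gy.
ref = np.zeros((4, 8, 8))
pred = ref + 2.0          # every voxel off by 2 Gy
pred[0] += 2.0            # first slice off by 4 Gy

first_slice = np.zeros(ref.shape, dtype=bool)
first_slice[0] = True     # hypothetical evaluation mask

mae_all = mean_absolute_error(pred, ref)                  # whole volume
mae_masked = mean_absolute_error(pred, ref, first_slice)  # masked region only
```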
Conclusion
Integrating pre-trained 3D and segmentation encoders as conditioning inputs to the diffusion model allows our approach to outperform the comparison methods in preserving high-frequency details and maintaining volumetric consistency. It also reduces reliance on manual organ-at-risk segmentation without compromising prediction accuracy.