Structure-Guided Multi-Task CT-to-PET Synthesis with Cross-Attention and Frequency-Aware Dual Discrimination for Esophageal Cancer
Abstract
Purpose
To develop and evaluate a structure-guided multi-task deep learning framework for synthesizing positron emission tomography (PET) images from computed tomography (CT) in patients with esophageal cancer, with the aim of improving both global image fidelity and tumor-specific metabolic consistency.
Methods
A multi-task generative framework was proposed to jointly perform CT-to-PET synthesis and high-uptake region prediction. The network used a shared encoder with two task-specific decoders, so that the auxiliary metabolic-region branch guided feature learning toward metabolically informative structures. Structural priors derived from esophageal distance fields were incorporated to stabilize generation for elongated anatomy. A Metabolic-Guided Cross-Attention (MGCA) module was introduced to facilitate cross-task feature interaction, and a Frequency-aware Dual Discriminator (FDD) was designed to balance spatial realism and spectral consistency. The model was trained on 246 cases and evaluated on 62 internal test cases, with additional external validation on an independent cohort of 186 patients. Performance was assessed using image-level similarity metrics, namely peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and mean absolute error (MAE), and clinically relevant metabolic biomarkers, including metabolic tumor volume (MTV) and total lesion glycolysis (TLG).
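The evaluation quantities named above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the tumor region is segmented, so the fixed SUV threshold of 2.5, the voxel volume argument, and the function names `psnr`, `mae`, and `mtv_tlg` are all assumptions made for illustration. PSNR and MAE are computed on intensity arrays scaled to a known range; MTV is taken as the volume of voxels at or above the threshold, and TLG as MTV multiplied by the mean SUV inside that region.

```python
import numpy as np


def psnr(reference, synthesized, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((reference - synthesized) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


def mae(reference, synthesized):
    """Mean absolute error between two images of the same shape."""
    return float(np.mean(np.abs(reference - synthesized)))


def mtv_tlg(suv, voxel_volume_ml, threshold=2.5):
    """Return (MTV in mL, TLG) for an SUV volume.

    Assumption: the lesion is every voxel with SUV >= threshold.
    The threshold value is hypothetical, not taken from the paper.
    """
    mask = suv >= threshold
    mtv = float(mask.sum()) * voxel_volume_ml
    suv_mean = float(suv[mask].mean()) if mask.any() else 0.0
    return mtv, mtv * suv_mean


# Toy example with made-up values (not study data).
suv = np.array([[1.0, 3.0], [5.0, 2.0]])
mtv, tlg = mtv_tlg(suv, voxel_volume_ml=0.5)
```

In this toy volume two voxels reach the threshold, so MTV is two voxel volumes and TLG is that volume times their mean SUV. A clinical pipeline would restrict the mask to a delineated lesion rather than thresholding the whole volume.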
Results
On the external validation set, the proposed method achieved a PSNR of 34.90 dB, an SSIM of 0.920, and an MAE of 0.105, outperforming a conventional UNet-based baseline. For metabolic biomarker consistency, values derived from synthesized PET images correlated strongly with those derived from ground-truth PET images, with Pearson correlation coefficients of 0.827 for MTV and 0.763 for TLG. Qualitative and quantitative analyses showed that the proposed framework preserved high-frequency tumor details while maintaining stable low-frequency metabolic distributions, resulting in more reliable characterization of tumor-specific uptake.
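As a reminder of what the reported correlation coefficients measure, the sketch below computes a Pearson coefficient between per-patient biomarker values taken from ground-truth and synthesized PET. The function name `pearson_r` and the sample values are hypothetical and chosen only for illustration; they are not data from the study.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center each variable
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Made-up per-patient MTV values in mL (illustrative only).
mtv_true = [12.0, 30.5, 8.2, 45.1, 20.0]
mtv_synth = [10.5, 33.0, 9.0, 40.2, 22.4]
r_mtv = pearson_r(mtv_true, mtv_synth)
```

A coefficient near 1 indicates that the synthesized values rise and fall with the reference values across patients. It does not by itself show that the two agree in absolute terms, since a constant scaling or offset leaves the coefficient unchanged.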
Conclusion
A structure-guided multi-task CT-to-PET synthesis framework with cross-attention and frequency-aware dual discrimination was developed for esophageal cancer imaging. By modeling metabolically sensitive regions and balancing spatial and spectral supervision, the proposed method improved both visual fidelity and quantitative metabolic consistency, suggesting potential clinical value in settings where PET acquisition is limited.