Benchmarking Transformer-Based vs. Convolutional Neural Network-Based Registration for Rapid Response Assessment for Lung Ventilation Imaging in a Digital Anthropomorphic Phantom Study
Abstract
Purpose
CT-based lung ventilation imaging (LVI) can provide a functional assessment of tumor and normal tissue response to lung stereotactic body radiation therapy (SBRT). However, the speed of ventilation map generation hinders the clinical adoption of LVI for online adaptive SBRT. Deep learning techniques such as Convolutional Neural Networks (CNNs) and Vision Transformer (ViT)-based methods could enable rapid generation of deformation vector fields (DVFs) and ventilation volumes. Here, we benchmarked a transformer-based network against a CNN-based framework to determine the optimal architecture for rapidly generating ventilation maps, based on registration accuracy and computational efficiency.
Methods
Ventilation volumes were derived for a CNN-based implementation of the DEEDS deformable image registration method (DEEDS_DL) and a ViT-based method, PhysMorph, using a 4D-XCAT digital phantom with ground-truth DVFs. DEEDS_DL (CNN) utilizes a U-Net architecture optimized for inference speed, with DVFs generated using a VoxelMorph-based spatial transformation. PhysMorph integrates a U-Net with self-attention mechanisms to capture global lung motion. Both models were trained with supervised learning to predict DVFs between respiratory phases. Lung ventilation maps were derived from the Jacobian determinant of the predicted deformation fields to quantify local tissue expansion. Performance was evaluated using Target Registration Error (TRE), Dice coefficient, and inference time in a standard CPU environment.
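The Jacobian-determinant step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a dense DVF stored as a `(3, D, H, W)` array of displacements in voxel units, forms the pointwise Jacobian J = I + grad(u) with finite differences, and returns its determinant, where values above 1 indicate local expansion (inhalation) and values below 1 indicate compression.

```python
import numpy as np

def jacobian_ventilation(dvf):
    """Jacobian-determinant ventilation map from a dense DVF.

    dvf : ndarray of shape (3, D, H, W); dvf[i] is the displacement
          component along axis i, in voxel units.
    Returns det(J) per voxel, where J = I + grad(u) and grad is taken
    with central finite differences (np.gradient).
    """
    # grads[i][j] = d(u_i)/d(x_j), each of shape (D, H, W)
    grads = [np.gradient(dvf[i], axis=(0, 1, 2)) for i in range(3)]
    spatial = dvf.shape[1:]
    jac = np.empty(spatial + (3, 3))
    for i in range(3):
        for j in range(3):
            jac[..., i, j] = (1.0 if i == j else 0.0) + grads[i][j]
    return np.linalg.det(jac)
```

For a zero displacement field the map is identically 1 (no volume change); a displacement that grows linearly along one axis with slope 0.1 yields a uniform value of 1.1, i.e. 10% local expansion.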
Results
PhysMorph and DEEDS_DL achieved whole-lung Dice coefficients of 0.96 ± 0.01 and 0.95 ± 0.01, respectively, and TREs of 0.43 ± 0.06 mm and 0.38 ± 0.08 mm, respectively. The average inference times for DEEDS_DL and PhysMorph were 14 ms and 271 ms per volume, respectively. This represents an approximately 19× speedup for the CNN architecture while maintaining sub-voxel accuracy.
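The two evaluation metrics reported above are standard and easy to state concretely. The sketch below is illustrative (not the study's evaluation code): Dice is computed from binary lung masks, and TRE is the mean Euclidean distance between corresponding landmarks after scaling by voxel spacing, with the landmark arrays and spacing being assumed inputs.

```python
import numpy as np

def dice_coefficient(mask_a, mask_b):
    """Dice similarity between two binary masks of equal shape."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

def target_registration_error(pts_ref, pts_warped, spacing=(1.0, 1.0, 1.0)):
    """Mean Euclidean distance (e.g. in mm) between paired landmarks.

    pts_ref, pts_warped : (N, 3) arrays of voxel coordinates.
    spacing : voxel size per axis, converting voxels to physical units.
    """
    diff = (np.asarray(pts_ref) - np.asarray(pts_warped)) * np.asarray(spacing)
    return np.linalg.norm(diff, axis=1).mean()
```

Identical masks give a Dice of 1.0, and perfectly registered landmarks give a TRE of 0 mm; sub-voxel TRE, as reported here, means the mean landmark error is smaller than the voxel spacing.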
Conclusion
Here we demonstrated that the proposed CNN-based DEEDS_DL method produces DVFs more rapidly than, and with deformation accuracy comparable to, vision transformer methods such as PhysMorph. Future work will evaluate the CNN-based DEEDS_DL framework for ventilation assessment in online adaptive lung SBRT.