Learning What Can Move: Patient-Specific Motion Subspaces from a Single Scan
Abstract
Purpose
Motion estimation inherently requires two time points, making motion a relational quantity. This constrains how motion can be learned during training: the dependence on paired data makes cross-modality and cross-dimensional scenarios particularly challenging. We approach the problem differently by modeling the motion subspace rather than motion itself. Unlike motion, the motion subspace is an intrinsic property that can potentially be inferred from a single observation, since anatomical configuration inherently constrains the space of plausible deformations. We ask whether a learning system can acquire this ability: inferring patient-specific motion-subspace modes from a single static scan.
Methods
We parameterize deformation as K scalar spatial motion modes, which capture where motion (co-)occurs, multiplied by global vector coefficients, which specify 3D direction and magnitude. A neural network predicts the patient-specific motion modes from the fixed image alone. A key challenge in learning such a basis-coefficient decomposition is its inherent scale ambiguity, which causes unstable training or outright failure. We address this with a differentiable closed-form coefficient solver that, given the predicted modes and any observation, uniquely determines the optimal coefficients via linearized least squares.
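Concretely, the parameterization above makes the displacement field linear in the coefficients. Writing the displacement as

    u(x) = \sum_{k=1}^{K} m_k(x)\, c_k,  with scalar modes m_k : \Omega \to \mathbb{R} and global coefficients c_k \in \mathbb{R}^3,

and linearizing intensity constancy, M(x + u(x)) \approx M(x) + \nabla M(x)^\top u(x), the residual F(x) - M(x) - \nabla M(x)^\top u(x) is linear in the stacked coefficients c \in \mathbb{R}^{3K}, so the optimum has the closed form c^* = (A^\top A)^{-1} A^\top b, where row x of A holds the entries m_k(x)\, \nabla M(x) and b(x) = F(x) - M(x). The following is a minimal NumPy sketch under these assumptions, not the authors' implementation; the function name solve_coefficients and the damping term eps are illustrative.

    import numpy as np

    def solve_coefficients(modes, fixed, moving, eps=1e-6):
        """Closed-form least-squares coefficients for a linear motion-mode model.

        modes  : (K, X, Y, Z) scalar motion modes predicted from the fixed image
        fixed  : (X, Y, Z) fixed image F
        moving : (X, Y, Z) moving image M (a single linearization step shown)
        returns: (K, 3) global coefficient vectors c_k
        """
        K = modes.shape[0]
        grad = np.stack(np.gradient(moving), axis=-1)          # (X, Y, Z, 3) spatial gradient of M
        # Per-voxel design-matrix entries m_k(x) * dM/dx_d for all (k, d) pairs.
        A = np.moveaxis(modes[..., None] * grad[None], 0, -2)  # (X, Y, Z, K, 3)
        A = A.reshape(-1, 3 * K)                               # (N, 3K), N = number of voxels
        b = (fixed - moving).reshape(-1)                       # (N,) intensity residual
        # Damped normal equations keep the solve stable; the solution is
        # differentiable with respect to the predicted modes.
        c = np.linalg.solve(A.T @ A + eps * np.eye(3 * K), A.T @ b)
        return c.reshape(K, 3)

    # The instantiated displacement field is then u(x) = sum_k m_k(x) c_k:
    # u = np.einsum('kxyz,kd->xyzd', modes, c)

Because the coefficients are determined in closed form from any observation, the scale ambiguity between modes and coefficients never enters the learning problem: the network is trained only on the modes.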
Results
The learned motion modes effectively capture a patient-specific motion subspace that generalizes consistently across all other phases. With only two scalar modes, the instantiated deformation yields a PSNR of 33.1 dB for image warping and a Dice score of 0.92 for segmentation warping. The predicted modes exhibit interpretable spatial patterns, with co-moving structures sharing similar activations.
Conclusion
This work demonstrates that patient-specific motion modes can be inferred from static anatomy alone. The learned subspace can be used directly or integrated seamlessly into conventional deformable image registration (DIR) frameworks as a learned regularization. The asymmetric training scheme, in which only the fixed image is input to the network, offers potential for cross-modality applications. The compact parameterization makes the approach well suited to scenarios with sparse observations.