Synergistic Prediction of Neoadjuvant Immunotherapy Response in Esophageal Cancer via Multimodal Fusion of Orthogonal CT Latent Features and ViT-Based Histopathology
Abstract
Purpose
Predicting pathological complete response (pCR) following neoadjuvant immunotherapy (nICT) is critical for personalized management of esophageal cancer. This study develops an interpretable multimodal framework that integrates pre-treatment 3D CT latent features, extracted via a Forced-Orthogonal Autoencoder (FOAE), with whole-slide image (WSI) morphological features analyzed through Attention-based Multi-Instance Learning (AMIL).
Methods
We analyzed two cohorts: 297 patients in the pathology cohort and 285 in the radiology cohort. For histopathology, WSIs were tiled and each tile was encoded with a Vision Transformer (ViT-B/16); the resulting tile representations were aggregated by an AMIL model with dual-branch attention and dynamic Top-K sampling to produce a pCR probability. For CT, a 3D convolutional FOAE was designed to learn compact latent representations of esophageal tumor volumes. To disentangle the latent features and minimize redundancy, an orthogonality constraint was enforced in the latent space using a cosine-sine reparameterization strategy. A hierarchical stretch operation then isolated the 16 high-variance latent features most representative of tumor morphology. A final fusion model integrated these FOAE latent features with the AMIL-derived scores, and Layer-wise Relevance Propagation (LRP) was applied to decompose slide-level predictions into feature-specific relevance scores, providing interpretability.
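As an illustration of why a cosine-sine parameterization can guarantee orthogonality, the minimal sketch below builds a matrix as a product of plane rotations, each defined by one angle. This is an assumption about the general idea only: the abstract does not specify how the FOAE applies its reparameterization, and the names `rotation_product` and `max_orthogonality_error` are hypothetical.

```python
import math


def rotation_product(angles, dim):
    """Return a dim x dim matrix built as a product of plane rotations.

    Each angle parameterizes one rotation through its cosine and sine,
    so the result is orthogonal for any choice of angles (illustrative
    assumption, not the authors' implementation).
    """
    pairs = [(i, j) for i in range(dim) for j in range(i + 1, dim)]
    if len(angles) != len(pairs):
        raise ValueError("expected dim * (dim - 1) / 2 angles")
    # Start from the identity matrix.
    q = [[1.0 if r == c else 0.0 for c in range(dim)] for r in range(dim)]
    for theta, (i, j) in zip(angles, pairs):
        c, s = math.cos(theta), math.sin(theta)
        # Right-multiply by the rotation in the (i, j) plane:
        # only columns i and j of q change.
        for row in q:
            a, b = row[i], row[j]
            row[i] = c * a + s * b
            row[j] = -s * a + c * b
    return q


def max_orthogonality_error(q):
    """Largest absolute deviation of Q^T Q from the identity matrix."""
    dim = len(q)
    worst = 0.0
    for a in range(dim):
        for b in range(dim):
            dot = sum(q[r][a] * q[r][b] for r in range(dim))
            target = 1.0 if a == b else 0.0
            worst = max(worst, abs(dot - target))
    return worst


# Arbitrary example angles for a 4-dimensional latent basis.
angles = [0.3, -1.2, 2.0, 0.7, -0.4, 1.5]
Q = rotation_product(angles, 4)
error = max_orthogonality_error(Q)
```

Because the angles are unconstrained, a parameterization of this kind lets the columns stay orthonormal throughout gradient-based training without a separate penalty term; whether the FOAE uses this construction or a different one is not stated here.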
Results
The AMIL model predicted pCR with a validation F1 score of 0.732, and its attention maps highlighted nuclear heterogeneity as a key predictor. The FOAE reconstructed tumor volumes with high fidelity (validation MAE: 0.0027; Dice: 0.888), indicating that the learned latent space encoded complex 3D tumor structure. The multimodal fusion model outperformed the single-modality predictors, achieving a training AUC of 0.890 and a validation AUC of 0.744. LRP analysis quantified the contributions of individual orthogonal CT dimensions and the histopathological probabilities to the final pCR prediction.
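For readers unfamiliar with the two reconstruction metrics, the sketch below computes mean absolute error and a Dice coefficient on a toy flattened volume. The 0.5 binarization threshold and the function names are assumptions made for illustration; the abstract does not state how the reported values were computed.

```python
def mean_absolute_error(pred, target):
    """Average absolute voxel-wise difference between two volumes."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def dice_coefficient(pred, target, threshold=0.5):
    """Dice overlap after binarizing both volumes at `threshold`.

    The threshold is an illustrative assumption, not a value taken
    from the study.
    """
    p = [v >= threshold for v in pred]
    t = [v >= threshold for v in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: a flattened 8-voxel mask and its reconstruction.
target = [0, 0, 1, 1, 1, 0, 1, 0]
recon = [0.1, 0.0, 0.9, 0.8, 0.4, 0.6, 1.0, 0.0]

mae = mean_absolute_error(recon, target)
dice = dice_coefficient(recon, target)
```

MAE measures intensity error on the continuous reconstruction, whereas Dice measures spatial overlap after thresholding, so the two metrics capture complementary aspects of reconstruction quality.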
Conclusion
Integrating orthogonal latent features from CT with AI-derived histopathological scores improved prediction of nICT response compared with either modality alone. This interpretable framework may serve as an objective tool to support clinical decision-making, helping to identify optimal candidates for surgery or organ-preservation strategies.