Multi-Modal Self-Supervised Contrastive Learning for a Foundational Model for Brachytherapy
Abstract
Purpose
This project develops a multi-modal method for training a foundational model on medical images of cervical cancer patients. Several imaging modalities are acquired over the course of treatment: multiparametric MRI, CT, cone-beam CT, and, when indicated, PET. This motivates a paired deep learning approach that leverages the information shared across modalities to learn image features relevant to downstream clinical tasks.
Methods
Contrastive learning (CL) is a self-supervised learning method (requiring no labeled data) that pulls ‘positive’ pairs together and pushes ‘negative’ pairs apart in the model’s deep feature space. This project uses multi-modal CL, treating scans of different modalities from the same patient as positive pairs. CL typically treats all negative pairs equally, but different patients may have similar anatomy and disease progression. To address this, a soft contrastive loss was developed that uses clinical information to define a similarity metric for negative pairs. As a preliminary experiment, a vision transformer (ViT) pre-trained on 3D medical images was fine-tuned on an in-house MRI and CT dataset.
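One way such a soft contrastive loss could look is sketched below. It is an illustration under stated assumptions, not the loss used in this project: it starts from a symmetric InfoNCE loss over paired MRI and CT embeddings and moves a fraction of each anchor's target probability from the positive pair onto negatives in proportion to a clinical similarity score. The names `soft_contrastive_loss`, `clinical_sim`, `alpha`, and `temperature`, and the way similarity is turned into soft targets, are all hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def soft_contrastive_loss(z_mri, z_ct, clinical_sim, temperature=0.1, alpha=0.5):
    """Symmetric cross-entropy between similarity logits and soft targets.

    z_mri, z_ct  : (N, D) embeddings; row i of each comes from patient i.
    clinical_sim : (N, N) symmetric scores in [0, 1] (hypothetical metric
                   built from clinical information); the diagonal is ignored.
    alpha        : assumed fraction of target mass moved from the positive
                   pair to clinically similar negatives (0 gives InfoNCE).
    """
    n = z_mri.shape[0]
    logits = l2_normalize(z_mri) @ l2_normalize(z_ct).T / temperature

    # Distribute mass over negatives in proportion to clinical similarity.
    neg = clinical_sim * (1.0 - np.eye(n))
    row_sum = neg.sum(axis=1, keepdims=True)
    neg = np.divide(neg, row_sum, out=np.zeros_like(neg), where=row_sum > 0)

    # Anchors with no similar negatives keep a one-hot (hard) target.
    has_neg = (row_sum > 0).astype(float)
    targets = (1.0 - alpha * has_neg) * np.eye(n) + alpha * neg

    # Average the MRI-to-CT and CT-to-MRI directions.
    loss_mri_to_ct = -(targets * log_softmax(logits)).sum(axis=1).mean()
    loss_ct_to_mri = -(targets * log_softmax(logits.T)).sum(axis=1).mean()
    return 0.5 * (loss_mri_to_ct + loss_ct_to_mri)
```

With `alpha=0`, or with a `clinical_sim` matrix of zeros, the targets are one-hot and the function reduces to the usual hard contrastive loss, which makes the effect of the clinical weighting easy to isolate in an ablation.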
Results
The training progress of the ViT was evaluated using attention rollout, which visualizes the flow of attention for a given input and highlights the regions most important for extracting deep features. Initial epochs showed an emphasis on structures with high voxel intensity, while later epochs produced a more balanced attention map covering the full pelvic region. At 50 epochs, attention began leaking into the image background, which is indicative of overfitting.
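For readers unfamiliar with attention rollout, a minimal sketch of the standard procedure follows: per-layer attention is averaged over heads, mixed with the identity to account for residual connections, renormalized, and multiplied through the layers. The function name, the `residual_weight` parameter, and the input layout (a list of per-layer arrays of shape heads × tokens × tokens) are assumptions for this example; how the attention matrices are extracted from the 3D ViT used here is not covered.

```python
import numpy as np

def attention_rollout(attentions, residual_weight=0.5):
    """Propagate attention through all layers of a transformer.

    attentions      : list of (heads, tokens, tokens) arrays, first layer
                      first, each row already softmax-normalized.
    residual_weight : assumed share given to the identity (skip connection).
    Returns a (tokens, tokens) matrix; entry [i, j] estimates how much
    output token i draws on input token j.
    """
    n_tokens = attentions[0].shape[-1]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        # Fuse heads by averaging.
        attn = layer_attn.mean(axis=0)
        # Account for the residual connection, then renormalize each row.
        attn = (1.0 - residual_weight) * attn + residual_weight * identity
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Compose with the rollout of all earlier layers.
        rollout = attn @ rollout
    return rollout
```

If the model has a class token, its row of the result can be reshaped onto the patch grid to form a map; otherwise, averaging over rows is one possible summary. Either choice is an assumption about the architecture rather than something stated above.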
Conclusion
This work shows the potential of multi-modal CL with clinical information for training foundational ViTs. The attention maps demonstrated that the model learned to attend to relevant anatomical regions when extracting deep features, although overfitting eventually set in. The model will be evaluated on downstream tasks to determine whether these learned features improve performance.