Automated Vision-Language Model-Derived Clinical Descriptors Enhance Radiomic Profiling for Robust Breast Malignancy Prediction
Abstract
Purpose
To enhance breast malignancy prediction, this study develops a multimodal framework that integrates automated, Vision-Language Model (VLM)-derived BI-RADS lexicons with quantitative radiomic features.
Methods
This multi-center study included 889 patients from two institutions, partitioned into training (80%) and independent testing (20%) cohorts. A VLM-driven workflow using Gemini 3 Pro was developed to simulate expert-level observation. In place of traditional manual annotation, the VLM analyzed standard dual-view (CC/MLO) mammograms according to the BI-RADS 5th Edition guidelines and automatically generated qualitative descriptors covering calcification morphology (e.g., fine pleomorphic, amorphous), distribution patterns (e.g., linear, segmental), and architectural distortion. These "digitized clinical observations" were integrated with quantitative radiomic features (shape, first-order, and texture matrices) through a multimodal early-fusion strategy. Following LASSO-based feature selection, ten machine learning classifiers (including Random Forest, XGBoost, and SVM) were trained. Performance was quantified by the AUC, with 95% confidence intervals (CIs) estimated from 1,000 bootstrap resamples of the test cohort.
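The early-fusion step can be illustrated with a minimal sketch: categorical descriptors are one-hot encoded and concatenated with the radiomic values into a single feature vector before feature selection. The vocabularies, the dictionary keys, and the helper names `one_hot` and `fuse` are hypothetical, chosen only for illustration; the encoding actually used in the study is not specified here.

```python
# Hypothetical descriptor vocabularies. The category names follow the examples
# given in the text, but the full lists used in the study are not specified.
MORPHOLOGY = ["none", "amorphous", "fine pleomorphic"]
DISTRIBUTION = ["none", "linear", "segmental"]


def one_hot(value, vocabulary):
    """Return a 0/1 indicator list marking the position of value in vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown descriptor: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]


def fuse(radiomics, lexicon):
    """Early fusion: concatenate radiomic values with the encoded descriptors."""
    encoded = (
        one_hot(lexicon["morphology"], MORPHOLOGY)
        + one_hot(lexicon["distribution"], DISTRIBUTION)
        + [1.0 if lexicon["architectural_distortion"] else 0.0]
    )
    return list(radiomics) + encoded


# Made-up values for a single lesion, for illustration only.
radiomics = [0.42, 13.7, 0.08]
lexicon = {
    "morphology": "fine pleomorphic",
    "distribution": "segmental",
    "architectural_distortion": True,
}
fused = fuse(radiomics, lexicon)
print(len(fused), fused)
```

Because fusion happens at the feature level, a single selection step (LASSO in this study) and a single classifier see both modalities at once, so the model can learn interactions between a descriptor and a radiomic feature.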
Results
The VLM-augmented fusion framework demonstrated superior robustness and accuracy compared to unimodal baselines. The Random Forest classifier achieved the highest performance, with an AUC of 0.865 (95% CI: 0.802–0.920), outperforming both the radiomics-only model (AUC 0.847) and the lexicon-only model (AUC 0.758). This trend was consistent across other tree-based ensemble models such as XGBoost and LightGBM (AUCs > 0.84). Integrating the VLM-derived lexicons also raised the lower bound of the 95% CI from 0.777 (radiomics-only) to 0.802 (fusion).
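For readers who want to see how an AUC and its bootstrap interval of this kind are typically obtained, the following is a minimal sketch using only the Python standard library. It assumes a percentile bootstrap that resamples test cases with replacement; the exact resampling scheme used in the study (for example, whether it was stratified) is not stated, and the labels and scores below are made-up toy values, not study data.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that contain a single class
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(round((alpha / 2) * (n_boot - 1)))]
    hi = stats[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return lo, hi


# Toy labels (1 = malignant) and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.50]
point = auc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUC = {point:.3f}, 95% CI: {low:.3f} to {high:.3f}")
```

The lower bound of such an interval reflects how stable the ranking of cases is under resampling, which is why a higher lower bound is read as a sign of robustness.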
Conclusion
This study validates the novel application of VLMs as automated clinical observers in medical imaging. By effectively fusing VLM-derived semantic descriptors with micro-structural radiomics, the proposed pipeline offers an accurate decision-support tool for multimodal breast cancer diagnosis.