Impact of Class-Imbalance Handling on Discrimination and Calibration in Multicenter Radiomic Survival Models
Abstract
Purpose
Radiomic survival models are often trained on substantially imbalanced outcomes, which can distort patient ranking and reduce the reliability of predicted survival probabilities. This study evaluated commonly used imbalance-handling strategies and tested whether their effects change after ComBat harmonization.
Methods
A total of 1,648 pre-treatment CT radiomic features were extracted from 752 patients with head and neck squamous cell carcinoma from three institutions. The 2-year survival outcome was modeled using five feature-selection methods (Elastic Net, LASSO, Boruta, mRMR, and recursive feature elimination [RFE]), each combined with one of three classifiers: logistic regression (LR), random forest (RF), or XGBoost. Four imbalance strategies (none, class weighting, SMOTE, and ADASYN) were evaluated using stratified cross-validation on two centers, with external validation on the third. Discrimination was assessed by AUC, and calibration by the Brier score and expected calibration error (ECE). A sensitivity analysis repeated the workflow, applying ComBat only to the training features.
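As an illustration of the three metrics named above, the following is a minimal sketch in plain Python, not the study's implementation. The function names, the tie handling in the AUC, and the choice of 10 equal-width probability bins for the ECE are assumptions made for this example; the abstract does not state the binning scheme that was used.

```python
from typing import List, Sequence, Tuple


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Discrimination: probability that a random positive case is scored
    higher than a random negative case (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (assumed scheme): the sample-weighted mean of
    |observed event rate - mean predicted probability| over non-empty bins."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece


if __name__ == "__main__":
    # Toy external-validation predictions (hypothetical values).
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print(auc(y, p), brier_score(y, p), expected_calibration_error(y, p))
```

AUC depends only on the ranking of predictions, whereas the Brier score and ECE depend on the predicted probabilities themselves, so a resampling step that shifts predicted risk can leave AUC unchanged while degrading calibration.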
Results
External performance varied widely across pipelines (AUC 0.40–0.79; Brier score 0.16–0.54; ECE 0.08–0.56). Adjusting for class imbalance changed results in several configurations, but the direction of the change depended strongly on the classifier and the feature-selection method. After ComBat was applied, most pipelines built on Boruta, Elastic Net, LASSO, or mRMR showed a drop in AUC. In contrast, RFE combined with logistic regression was the only approach to show a clear and consistent benefit, with ΔAUC reaching +0.29 in the RF + class-weighting configuration. The strongest pipelines were all RFE-based, including LR with ComBat + SMOTE (AUC 0.793; Brier 0.185; ECE 0.174) and RF with ComBat + class weighting (AUC 0.774; Brier 0.165; ECE 0.0105).
Conclusion
Imbalance-handling strategies should be evaluated within the full radiomics pipeline rather than applied by default. Both calibration and discrimination should be reported on external data to ensure that models provide transparent, clinically meaningful survival estimates.