Multi-Task Machine Learning with Interpretability for Predicting the Imaging and Radiation Oncology Core Head and Neck Phantom Outcomes
Abstract
Purpose
The Imaging and Radiation Oncology Core (IROC) IMRT head and neck (HN) phantom audit continues to show a failure rate of ~10%. A better understanding of what drives performance on this audit would support targeted improvement of radiotherapy quality. Machine learning is a promising tool for interpreting phantom results. Unlike traditional single-task models, a multi-task learning (MTL) framework can exploit intrinsic correlations by treating target, organ-at-risk (OAR), and 2D film dosimetry as interconnected tasks, which may enhance the detection of unacceptable plans.
Methods
We analyzed 1,447 IROC HN phantom irradiations (101 failures) performed by over 1,000 institutions between 2012 and 2020. The feature set included plan complexity metrics, treatment planning system (TPS) parameters, dosiomics, and dose-volume histogram (DVH) metrics. We developed a chain-based multi-task architecture using XGBoost (MT-XGB) and Random Forest (MT-RF) to jointly learn regression tasks (predicting average gamma and the thermoluminescent dosimeter (TLD) ratios for the primary PTV, secondary PTV, and OAR) and a binary classification task (pass/fail). Models were built with bootstrap-voting feature selection and random oversampling, and evaluated with 20×5 repeated stratified k-fold cross-validation. SHAP partial dependence plots were used to interpret non-linear relationships between features and outcomes.
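The resampling scheme described above can be sketched in plain Python. This is a minimal illustration, not the study's code: it assumes that "20×5" means 20 repeats of 5-fold splitting and that random oversampling is applied to the training folds only, neither of which is stated explicitly here. The function names are hypothetical, and only the class counts (1,447 irradiations, 101 failures) come from the text.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, rng):
    """Split sample indices into n_splits folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_splits)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(idxs):
            folds[pos % n_splits].append(idx)
    return folds


def oversample(train_idx, labels, rng):
    """Randomly duplicate minority-class indices until the classes are balanced."""
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for y in sorted(by_class):
        idxs = by_class[y]
        balanced.extend(idxs)
        balanced.extend(rng.choice(idxs) for _ in range(target - len(idxs)))
    return balanced


def repeated_stratified_splits(labels, n_splits=5, n_repeats=20, seed=0):
    """Yield (oversampled_train_idx, test_idx) pairs; test folds are left untouched."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = stratified_folds(labels, n_splits, rng)
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield oversample(train_idx, labels, rng), test_idx


# Class counts from the text: 1 = failing irradiation, 0 = passing irradiation.
labels = [1] * 101 + [0] * (1447 - 101)
splits = list(repeated_stratified_splits(labels))
```

Oversampling inside each training fold, rather than before splitting, keeps duplicated failures out of the held-out fold, so the test data retain the original class imbalance.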
Results
Overall, the MTL framework performed as well as or better than the single-task (ST) models. MT-XGB achieved the highest classification performance, with an F1 score of 0.85 and an accuracy of 0.75, surpassing ST-XGB (accuracy: 0.73) and ST-RF (accuracy: 0.66). Notably, MT-XGB yielded the highest sensitivity (0.76), identifying significantly more failing plans than the ST models. SHAP analysis revealed that aperture-based complexity metrics, specifically plan irregularity, mean MLC speed, and mean MLC gap, were the dominant drivers of failure, suggesting that modulation complexity may affect the interplay between target coverage and OAR sparing.
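For reference, the three reported metrics follow from the confusion matrix as sketched below. The function name and the counts are purely illustrative and are not the study's results. Which class is treated as positive is not stated here, so the sketch simply takes the counts as given.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, sensitivity (recall), precision, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts chosen only to illustrate the formulas.
metrics = classification_metrics(tp=30, fp=10, fn=10, tn=50)
```

Because F1 ignores true negatives while accuracy counts them, the two metrics can differ substantially on imbalanced data such as this audit.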
Conclusion
Plan complexity remains the primary determinant of IMRT HN delivery accuracy. By leveraging interconnected endpoints, multi-task learning provides a robust and effective tool for predicting phantom audit outcomes and improving credentialing workflows.