From Radiomics to TI-RADS: A Stability-Aware Clinically-Interpretable Dictionary for Explainable Thyroid Nodule Classification on Ultrasound
Abstract
Purpose
AI-based radiomics models for thyroid ultrasound often lack interpretability, which limits clinical trust. This study develops and validates a fully interpretable radiomics framework for thyroid nodule classification that links quantitative ultrasound features to the TI-RADS semantic lexicon through a clinically grounded radiomics dictionary.
Methods
A clinically informed radiomics dictionary was developed to map the TI-RADS semantic categories (composition, echogenicity, shape, margin, and echogenic foci) to Image Biomarker Standardization Initiative (IBSI)-compliant radiomic features (RFs) extracted from 2D ultrasound images. Dictionary relationships were defined by expert consensus (four physicians, three physicists, one biologist) and statistically validated using SHapley Additive exPlanations (SHAP). Three multicenter datasets (TUCC, 192 cases; TN5000, 4,974 cases; DDTI, 376 cases) were combined, giving 5,542 nodules. In total, 107 RFs were extracted using PyRadiomics and normalized using min-max scaling. Twenty-seven feature-selection methods were paired with twenty-five classifiers and evaluated using stratified 5-fold cross-validation on 80% of the data, followed by independent testing on the remaining 20%. Robust model selection employed a stability-aware composite score that combined the normalized mean and the variability of balanced accuracy, recall, precision, F1-score, and ROC-AUC across cross-validation folds, penalizing unstable models and favoring reproducible classification.
Results
The proposed dictionary links RFs to TI-RADS semantic descriptors, enabling clinically interpretable machine-learning analysis. Benign nodules showed shape regularity and uniform echogenicity, whereas malignant nodules exhibited heterogeneity, hypoechoic texture, infiltrative margins, taller-than-wide morphology, and microcalcification signatures. The combination of Select_From_Model feature selection (with a logistic regression estimator) and an Extra-Trees classifier achieved strong performance (test ROC-AUC: 0.94 ± 0.005). SHAP analysis identified texture heterogeneity as the dominant malignancy signal, with Gray Level Run Length Matrix (GLRLM) Gray Level Non-Uniformity, intensity dispersion, and kurtosis aligning model decisions with high-risk TI-RADS descriptors.
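For orientation, the reported combination (min-max scaling, Select_From_Model with a logistic regression estimator, Extra-Trees classifier, stratified 5-fold cross-validation on an 80% split, and testing on the held-out 20%) might be assembled in scikit-learn roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the data are synthetic, and the hyperparameters (`max_iter`, `n_estimators`, the default selection threshold, the random seeds) are placeholders that the abstract does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a table of 107 radiomic features (not study data).
X, y = make_classification(
    n_samples=600, n_features=107, n_informative=15, random_state=0
)

# 80/20 split, stratified by class label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

pipeline = Pipeline(
    [
        ("scale", MinMaxScaler()),
        # Keeps features whose |coefficient| exceeds the default threshold.
        ("select", SelectFromModel(LogisticRegression(max_iter=1000))),
        ("clf", ExtraTreesClassifier(n_estimators=200, random_state=0)),
    ]
)

# Stratified 5-fold cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="roc_auc")

# Refit on the full training portion, then score the held-out 20%.
pipeline.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
n_selected = int(pipeline.named_steps["select"].get_support().sum())

print(f"CV ROC-AUC per fold: {cv_auc.round(3)}")
print(f"Held-out ROC-AUC: {test_auc:.3f} using {n_selected} features")
```

Wrapping the scaler and selector in a `Pipeline` means they are refit inside each training fold, so no information from validation folds leaks into feature selection. Whether the study scaled features inside or outside the folds is not stated in the abstract.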
Conclusion
This study introduces a clinically interpretable radiomics dictionary and a stability-aware model selection framework that together mitigate black-box limitations, enabling transparent and trustworthy thyroid nodule risk stratification from ultrasound.