AI Methods for the Diagnosis of Pneumonia on Chest Radiographs: Evaluation of Clinical Interpretability
Abstract
Purpose
This study investigated which performance assessment metrics, when applied to the outputs of artificial intelligence (AI) models designed to identify the presence of pneumonia on chest radiographs, best align with the subjective clinical opinion of radiologists.
Methods
A subset of 100 chest radiographs was collected from the test set of the Medical Imaging and Data Resource Center (MIDRC) XAI Challenge: Decoding AI Decisions for Pneumonia on Chest Radiographs. Each radiograph had an associated reference probability map generated from manual segmentations of pneumonia by three radiologists. The radiographs were processed through four of the top-performing models from the XAI Challenge (designated Models A, B, C, and D, from best to worst). The model outputs were compared with the reference maps using three metrics: weighted log-loss (WLL), which was the metric used in the XAI Challenge; Dice similarity coefficient (DSC); and weighted Dice similarity coefficient (wDSC). Because model outputs contain continuous pixel values, pixel values were binned into quartiles to produce weighted outputs for DSC-based calculations. The models were ranked from best to worst according to each metric separately, and average metric values with 95% confidence intervals (95% CIs) were calculated across all 100 cases. To assess clinical utility, five radiologists evaluated the outputs from each model for each case and ranked them based on their perceived clinical utility for pneumonia localization. The metric-based rankings were then compared with the radiologists’ rankings.
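The DSC-based comparison described above can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the function names, the 0.5 binarization threshold for the plain DSC, the four equal-width bins used for quartile binning, and the min-overlap form of the weighted DSC are hypothetical and may differ from the study's implementation. Maps are represented as flat lists of pixel values in [0, 1].

```python
import math

def quartile_bin(values):
    """Assumed binning: map each value in (0, 1] up to 0.25, 0.5, 0.75, or 1.0; 0 stays 0."""
    return [math.ceil(v * 4) / 4 for v in values]

def dice(pred, ref, threshold=0.5):
    """Plain DSC on maps binarized at an assumed threshold."""
    p = [v >= threshold for v in pred]
    r = [v >= threshold for v in ref]
    overlap = sum(1 for a, b in zip(p, r) if a and b)
    total = sum(p) + sum(r)
    return 1.0 if total == 0 else 2 * overlap / total

def weighted_dice(pred, ref):
    """One possible weighted DSC: per-pixel minimum as overlap, on the quartile-binned output."""
    binned = quartile_bin(pred)
    overlap = sum(min(a, b) for a, b in zip(binned, ref))
    total = sum(binned) + sum(ref)
    return 1.0 if total == 0 else 2 * overlap / total

# Toy four-pixel example (made-up values, not study data).
model_output = [0.9, 0.6, 0.2, 0.0]
reference_map = [1.0, 2 / 3, 0.0, 0.0]  # e.g. fraction of readers marking each pixel

dsc_value = dice(model_output, reference_map)
wdsc_value = weighted_dice(model_output, reference_map)
```

In this toy case the binarized maps coincide, so the plain DSC cannot distinguish the output from a perfect one, whereas the weighted form is penalized for the low-confidence activation on a pixel that no reader marked.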
Results
Radiologists ranked the models from best to worst as C, B, D, A. The WLL produced the ranking A, B, C, D, and the DSC produced the ranking B, C, A, D. The wDSC produced the ranking C, B, A, D.
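The abstract does not state how agreement between rankings was quantified. As one illustrative option (an assumption, not necessarily the study's method), the Kendall tau distance counts the model pairs that two rankings place in opposite order. The sketch below applies it to the rankings reported above; the function and variable names are hypothetical.

```python
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Count the pairs of items that the two rankings place in opposite order."""
    pos_a = {model: i for i, model in enumerate(order_a)}
    pos_b = {model: i for i, model in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

# Rankings from best to worst, as reported in the Results.
radiologists = ["C", "B", "D", "A"]
metric_rankings = {
    "WLL": ["A", "B", "C", "D"],
    "DSC": ["B", "C", "A", "D"],
    "wDSC": ["C", "B", "A", "D"],
}

distances = {
    name: kendall_tau_distance(radiologists, ranking)
    for name, ranking in metric_rankings.items()
}
```

Under this measure, the WLL ranking disagrees with the radiologists on 4 of the 6 model pairs, the DSC on 2, and the wDSC on 1.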
Conclusion
Of the three metrics, the wDSC produced the ranking that aligned most closely with that of the radiologists, differing only in the order of Models A and D. Future studies will explore alternative weighting schemes for the WLL and wDSC and investigate alternative metrics.