Evaluating a Deep Learning-Based CT Image Quality Model Using Radiologist Assessment in Clinical Abdominal CT
Abstract
Purpose
Deep learning-based image quality assessment (DL-IQA) models are commonly trained using radiologist ratings and may provide a more objective approach to CT image quality evaluation. However, clinical deployment may be limited by institutional variation in scanners, protocols, and reader preferences. This study evaluated how well a DL-IQA model aligns with radiologists’ subjective assessment of image quality for routine abdominal CT at our institution.
Methods
A retrospective dataset of 100 adult abdomen-pelvis CT examinations (50 contrast-enhanced, 50 non-contrast) acquired on two Siemens scanners (Edge, Force) was reconstructed with FBP and with ADMIRE at five strengths (IR1–IR5). Slice-level IQA was computed using a reference-free DL-IQA model, a hybrid CNN-transformer pretrained on natural images and fine-tuned on CT, that predicts perceptual quality scores on a 0–4 scale. Seven radiologists (four residents, one fellow, and two attendings with >15 years' experience) independently rated image quality with the model's grading rubric on a BMI-balanced subset of 40 cases. Per-slice water-equivalent diameter (WED), mAs, and DL-IQA scores were recorded. Model alignment was assessed with Spearman correlation against individual and mean reader scores. Inter-reader agreement was quantified with Spearman's ρ, Kendall's W, and Fleiss' κ, with subgroup analysis by training level.
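For concreteness, the agreement statistics named above could be computed as in the following Python sketch. The ratings matrix, random seed, and all variable names are illustrative stand-ins, not study data; Fleiss' κ and the rater aggregation follow statsmodels, and Kendall's W is computed from its standard rank-sum formula (average ranks for ties, no tie correction).

```python
# Hypothetical sketch: inter-reader agreement metrics on a (cases x readers)
# matrix of integer quality grades 0-4. Simulated data, not study ratings.
import numpy as np
from scipy.stats import spearmanr, rankdata
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 5, size=(40, 7))   # 40 cases, 7 readers (simulated)

# Pairwise Spearman correlation between readers, averaged over reader pairs.
rho_matrix, _ = spearmanr(ratings)           # (7 x 7) correlation matrix
mean_pairwise_rho = rho_matrix[np.triu_indices(7, k=1)].mean()

# Kendall's W: concordance of the readers' case rankings.
ranks = np.apply_along_axis(rankdata, 0, ratings)   # rank cases per reader
n, m = ratings.shape                                 # n cases, m readers
S = ((ranks.sum(axis=1) - m * (n + 1) / 2) ** 2).sum()
W = 12 * S / (m ** 2 * (n ** 3 - n))

# Fleiss' kappa: chance-corrected agreement on the categorical grades.
counts, _ = aggregate_raters(ratings)        # (cases x categories) counts
kappa = fleiss_kappa(counts)

print(f"mean pairwise rho={mean_pairwise_rho:.2f}, W={W:.2f}, kappa={kappa:.2f}")
```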
Results
Inter-reader agreement was low (Fleiss' κ≈0.14; Kendall's W≈0.17), underscoring the substantial subjectivity of clinical image quality assessment. DL-IQA showed modest correlation with the mean radiologist score (ρ=0.31), comparable in magnitude to the pairwise inter-reader correlations. Alignment was strongest for two early-career readers (1–3 years of experience; ρ=0.54 and 0.49). DL-IQA scores showed only weak dependence on patient size, even though mAs scaled with WED as expected (mAs–WED R²=0.41 non-contrast; 0.13 contrast-enhanced). No significant DL-IQA differences were observed between contrast-enhanced and non-contrast scans (p>0.05).
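The size-dependence analysis can be illustrated with a short sketch: regress mAs on WED to recover an R², and correlate DL-IQA scores with WED. All arrays and distribution parameters below are simulated stand-ins, not study measurements, and the linear fit is one simple modeling choice.

```python
# Illustrative sketch of the size-dependence analysis: regress mAs on
# water-equivalent diameter (WED) and correlate DL-IQA scores with WED.
# Simulated data only; parameters are hypothetical.
import numpy as np
from scipy.stats import linregress, spearmanr

rng = np.random.default_rng(1)
wed = rng.normal(28.0, 4.0, size=200)                      # per-slice WED in cm
mas = np.clip(10.0 * wed - 120.0 + rng.normal(0, 40, 200), 10, None)  # AEC-driven mAs
iqa = np.clip(rng.normal(2.5, 0.5, size=200), 0, 4)        # DL-IQA scores, 0-4 scale

fit = linregress(wed, mas)                                 # expected mAs scaling with size
print(f"mAs ~ WED: R^2 = {fit.rvalue ** 2:.2f}")

rho, p = spearmanr(iqa, wed)                               # size dependence of DL-IQA
print(f"DL-IQA vs WED: rho = {rho:.2f}, p = {p:.3f}")
```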
Conclusion
Radiologist assessment remains the clinical reference standard for CT image quality but is inherently variable. The evaluated DL-IQA model showed modest agreement with consensus ratings and stronger alignment with early-career readers, motivating further validation across institutions and protocols before clinical deployment.