Is Your AI Drifting? An Agentic Continuous Performance Monitoring System for Autosegmentation
Abstract
Purpose
While deep learning autosegmentation models are widely integrated into clinical workflows in radiation oncology, a critical gap has emerged: the "static deployment" trap. Once deployed, model performance can deteriorate due to real-world data evolution, making performance drift unavoidable in clinical AI systems. Current manual spot-check methods provide limited safeguards. We present a multi-agent performance monitoring system (MAPMS) that provides continuous, automated oversight of model performance, enabling early detection of degradation and generating prescriptive reports to guide clinical decision-making and model maintenance.
Methods
Built on LangGraph, the MAPMS comprises four specialized agents: an Input Data Monitoring Agent that tracks population-level shifts in model inputs; a Performance Surveillance Agent for multi-organ contour quality control (QC) assessment; a Root-Cause Analysis Agent that correlates input shifts with QC degradation; and a Clinical Synthesis Agent that integrates historical performance trends into actionable clinical recommendations. The framework was evaluated on a pelvic 3D CT autosegmentation model, using commissioning data (n = 104) as a baseline and post-deployment data (2013–2022, n = 1206) for longitudinal analysis. Agentic analysis focused on periodic and longitudinal drift trends in the Dice similarity coefficient (DSC) for rectum, bladder, and prostate contours.
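The four-agent flow above can be sketched as a sequential pipeline over a shared state. The paper's system is built on LangGraph; this minimal sketch uses plain Python instead, and every field, threshold, and function name here is an illustrative assumption rather than the actual implementation.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the four-agent monitoring pipeline. The real system
# runs on LangGraph; here each agent is a plain function over a shared state,
# and all field names and thresholds are hypothetical.

@dataclass
class MonitoringState:
    input_stats: dict                                   # population-level input features (z-scored)
    dsc_scores: dict                                    # per-organ mean DSC for the period
    input_shifts: dict = field(default_factory=dict)    # flagged input drifts
    qc_flags: dict = field(default_factory=dict)        # flagged organs
    root_causes: list = field(default_factory=list)     # (organ, driver) pairs
    report: str = ""

def input_data_monitoring_agent(state):
    # Flag input features drifting beyond an (illustrative) tolerance.
    state.input_shifts = {k: v for k, v in state.input_stats.items() if abs(v) > 2.0}
    return state

def performance_surveillance_agent(state):
    # Flag organs whose mean DSC falls below an (illustrative) floor.
    state.qc_flags = {organ: dsc for organ, dsc in state.dsc_scores.items() if dsc < 0.80}
    return state

def root_cause_analysis_agent(state):
    # Pair each QC flag with the concurrent input shifts as candidate drivers.
    state.root_causes = [(organ, shift) for organ in state.qc_flags for shift in state.input_shifts]
    return state

def clinical_synthesis_agent(state):
    # Summarize findings into an actionable report string.
    state.report = (f"{len(state.qc_flags)} organ(s) flagged; "
                    f"candidate drivers: {sorted(state.input_shifts)}")
    return state

def run_pipeline(state):
    for agent in (input_data_monitoring_agent, performance_surveillance_agent,
                  root_cause_analysis_agent, clinical_synthesis_agent):
        state = agent(state)
    return state

state = run_pipeline(MonitoringState(
    input_stats={"slice_thickness_z": 2.4, "kvp_z": 0.3},
    dsc_scores={"rectum": 0.74, "bladder": 0.84, "prostate": 0.74},
))
print(state.report)
```

In a LangGraph implementation each function would become a graph node, with conditional edges allowing, for example, the root-cause step to be skipped when no QC flags are raised.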
Results
Across 39 monitoring periods, the MAPMS agreed with expert human review in classifying performance-shift severity into routine (2013–2014), watch (2015), and investigate (2015–2022) levels. Assigned severity levels reflected DSC changes across multiple organs, with mean rectum/bladder/prostate values of 0.80/0.92/0.82 during routine periods, declines within 1–2 standard deviations of baseline during watch periods (0.77/0.90/0.80), and sustained degradation during investigate periods (0.74/0.84/0.74). For input-domain monitoring, the MAPMS consistently identified the dominant shift drivers in each period, and these aligned temporally with documented clinical protocol changes.
Conclusion
This study demonstrates the clinical feasibility of an agentic system for longitudinal autosegmentation monitoring. The framework supports long-term safety and continuous quality assurance, offering a scalable solution for sustained oversight of clinical AI workflows.