Paper Proffered Program Therapy Physics

Is Your AI Drifting? An Agentic Continuous Performance Monitoring System for Autosegmentation

Abstract
Purpose

While deep learning autosegmentation models are widely integrated into clinical workflows in radiation oncology, a critical gap has emerged: the "static deployment" trap. Once deployed, model performance can deteriorate due to real-world data evolution, making performance drift unavoidable in clinical AI systems. Current manual spot-check methods provide limited safeguards. We present a multi-agent performance monitoring system (MAPMS) that provides continuous, automated oversight of model performance, enabling early detection of degradation and generating prescriptive reports to guide clinical decision-making and model maintenance.

Methods

Built on LangGraph, the MAPMS comprises four specialized agents: an Input Data Monitoring Agent that tracks population-level shifts in model inputs; a Performance Surveillance Agent for multi-organ contour quality control (QC) assessment; a Root-Cause Analysis Agent that correlates input shifts with QC degradation; and a Clinical Synthesis Agent that integrates historical performance trends to generate actionable clinical recommendations. The framework was evaluated on a pelvic 3D CT autosegmentation model, using commissioning data (n=104) as the baseline and analyzing post-deployment data (2013–2022, n=1206). Agentic analysis focused on periodic and longitudinal drift trends in the Dice similarity coefficient (DSC) for rectum, bladder, and prostate contours.
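As a point of reference for the DSC metric tracked above, the following is a minimal sketch of the standard definition, DSC = 2|A∩B| / (|A| + |B|), applied to two flattened binary masks. The function name, the list-based mask representation, and the choice of reference contour are illustrative assumptions; the abstract does not describe how the MAPMS computes or sources its DSC values.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Return the Dice similarity coefficient of two flattened binary masks.

    `pred` is the autosegmented mask and `ref` is a reference mask
    (hypothetically, a clinician-approved contour). Nonzero means "inside".
    """
    if len(pred) != len(ref):
        raise ValueError("masks must contain the same number of voxels")

    # Voxels labeled as inside the structure by both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total number of inside voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)

    # Assumed convention: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

For example, a predicted mask `[1, 1, 0, 0]` compared against a reference `[1, 0, 0, 0]` shares one voxel out of three labeled in total, giving a DSC of 2/3.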

Results

Across 39 monitoring periods, the MAPMS demonstrated agreement with a human expert in classifying performance shift severity as routine (2013–2014), watch (2015), or investigate (2015–2022). Assigned severity levels reflected DSC changes across multiple organs, with mean values of 0.80/0.92/0.82 for rectum/bladder/prostate during routine periods, declines within 1–2 standard deviations during watch periods (0.77/0.90/0.80), and sustained degradation during investigate periods (0.74/0.84/0.74). For input-domain monitoring, the MAPMS consistently identified the dominant shift drivers in each period, which aligned temporally with documented clinical protocols.
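One way to picture the three severity tiers is as a threshold rule on how far a period's mean DSC has dropped below the commissioning baseline, measured in baseline standard deviations. The sketch below is only an illustration under assumptions: the cutoffs of 1 and 2 standard deviations, the function names, and the worst-organ aggregation are hypothetical, and the MAPMS assigns severity through its agents rather than through a fixed rule like this one.

```python
from statistics import mean, stdev
from typing import Dict, Sequence

SEVERITY_ORDER = ["routine", "watch", "investigate"]


def classify_organ(baseline_dsc: Sequence[float], period_dsc: Sequence[float],
                   watch_sd: float = 1.0, investigate_sd: float = 2.0) -> str:
    """Classify one organ's DSC drop for one monitoring period.

    The 1 SD and 2 SD cutoffs are assumed values for illustration only.
    """
    base_sd = stdev(baseline_dsc)
    if base_sd == 0:
        raise ValueError("baseline DSC values have zero spread")

    # Drop of the period mean below the baseline mean, in baseline SD units.
    drop_in_sd = (mean(baseline_dsc) - mean(period_dsc)) / base_sd

    if drop_in_sd >= investigate_sd:
        return "investigate"
    if drop_in_sd >= watch_sd:
        return "watch"
    return "routine"


def classify_period(baseline: Dict[str, Sequence[float]],
                    period: Dict[str, Sequence[float]]) -> str:
    """Report the most severe level found across organs (assumed aggregation)."""
    levels = [classify_organ(baseline[organ], period[organ]) for organ in baseline]
    return max(levels, key=SEVERITY_ORDER.index)
```

With made-up values that are not study data, a baseline of `[0.88, 0.90, 0.92, 0.90]` has a standard deviation of about 0.016, so a period averaging 0.88 falls roughly 1.2 standard deviations below baseline and would be labeled watch, while a period averaging 0.84 would be labeled investigate.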

Conclusion

This study demonstrates the clinical feasibility of an agentic system for longitudinal autosegmentation monitoring. The framework supports continuous quality assurance and offers a scalable approach to the long-term oversight of clinical AI workflows.
