When APIs Don’t Exist: Visual-Language–Driven Automation of Streamed Medical Software
Abstract
Purpose
Clinical medical software is often deployed through streamed environments such as Citrix, where conventional automation methods fail because native operating system controls and application programming interfaces (APIs) are not available. This work presents a visual-language–driven automation framework designed to enable reliable automation of complex clinical workflows that are otherwise inaccessible to traditional scripting approaches.
Methods
Before each action, the system evaluates the application’s screen state using a visual language model (VLM) that performs Boolean checkpoint validation. Once a checkpoint passes, the framework executes modular actions, including mouse input, keyboard entry, conditional branching, template and icon matching, looping over variable inputs, and optional human-in-the-loop intervention. A second VLM generates user interface actions: it interprets prompts such as “click the search bar” and returns screen coordinates and action types. These are translated into operating system–level input simulation, which enables control even over streamed connections. System performance was evaluated by extracting plan reports from the Varian Ethos Treatment Management System, with a list of 280 patient identifiers injected iteratively into the automation flow.
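The checkpoint-then-act pattern described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `UIAction`, `Step`, `run_step`, `check_vlm`, `action_vlm`, `send_input`, and `ask_human` are hypothetical, and both VLMs and the input simulator are passed in as plain callables so the control logic can be read and tested without a real model or input device.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class UIAction:
    """Action returned by the (hypothetical) action-generating VLM."""
    kind: str          # "click" or "type"
    x: int = 0         # screen coordinates for mouse actions
    y: int = 0
    text: str = ""     # text for keyboard actions


@dataclass
class Step:
    checkpoint: str    # yes/no question about the current screen
    prompt: str        # natural-language instruction, e.g. "click the search bar"


def run_step(
    step: Step,
    screenshot: bytes,
    check_vlm: Callable[[bytes, str], bool],
    action_vlm: Callable[[bytes, str], UIAction],
    send_input: Callable[[UIAction], None],
    ask_human: Optional[Callable[[Step], bool]] = None,
) -> bool:
    """Validate the screen state, then execute one action. Returns False if blocked."""
    if not check_vlm(screenshot, step.checkpoint):
        # Checkpoint failed: optionally defer to a human operator (assumed policy).
        if ask_human is None or not ask_human(step):
            return False
    action = action_vlm(screenshot, step.prompt)
    send_input(action)  # would be OS-level mouse/keyboard simulation in practice
    return True


# Usage with stand-in callables (no real VLM or input device involved).
sent: List[UIAction] = []
step = Step(
    checkpoint="Is the patient search screen visible?",
    prompt="click the search bar",
)
ok = run_step(
    step,
    screenshot=b"",
    check_vlm=lambda img, question: True,
    action_vlm=lambda img, prompt: UIAction(kind="click", x=412, y=96),
    send_input=sent.append,
)
```

Keeping the Boolean check separate from action generation means a failed checkpoint stops the step before any input reaches the streamed application.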
Results
All 280 plan reports were successfully extracted over a 12-hour period (about 2.6 minutes per report on average) using a standardized 14-step flow that navigated the interface, exported the plan report, closed the patient record, and returned the application to its original state before the next iteration. Five interruption events occurred when the interface entered states outside the programmed logic, including incorrect patient identifiers and transitions between active and inactive patient contexts.
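The iteration over patient identifiers, with interruptions recorded as they occur, might be sketched as follows. This is an illustration under stated assumptions, not the reported system: `run_batch`, `run_flow`, and `on_interrupt` are hypothetical names, `run_flow` stands in for the full 14-step flow for one patient, and the retry-once-after-intervention policy is assumed, since the abstract does not state how interruptions were resolved.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_batch(
    patient_ids: Iterable[str],
    run_flow: Callable[[str], bool],
    on_interrupt: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Run the flow once per identifier; log interruptions and retry once."""
    completed: List[str] = []
    interrupted: List[str] = []
    for pid in patient_ids:
        ok = run_flow(pid)
        if not ok:
            # Unexpected UI state: record it, let the handler restore the
            # application (assumed policy), then retry the same identifier.
            interrupted.append(pid)
            on_interrupt(pid)
            ok = run_flow(pid)
        if ok:
            completed.append(pid)
    return completed, interrupted


# Usage with a stand-in flow that fails once for one identifier.
attempts: Dict[str, int] = {}


def fake_flow(pid: str) -> bool:
    attempts[pid] = attempts.get(pid, 0) + 1
    return not (pid == "ID-002" and attempts[pid] == 1)


completed, interrupted = run_batch(
    ["ID-001", "ID-002", "ID-003"], fake_flow, lambda pid: None
)
```

Because each iteration is assumed to return the application to its original state, a failure on one identifier does not have to affect the identifiers that follow.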
Conclusion
Visual-language–driven automation enables robust control of streamed clinical software where APIs and native automation are unavailable. This modular approach fills critical automation gaps and is adaptable to a wide range of clinical workflows.