Reinforcement Learning for Automated IMRT Treatment Planning: Mathematical Optimization of the Reward Function Design
Abstract
Purpose
Reinforcement learning (RL) constitutes a strong candidate for AI-guided treatment planning for two distinguishing reasons: it differs from greedy algorithms by optimizing its strategy over the full horizon of a Markov decision process, and it contrasts with supervised learning by relying entirely on explorative interactions without a priori knowledge of the environment. The reward signal is the sole human input driving the agent and is therefore critical to its outcomes; it must be meticulously designed to instill clinical preferences and considerations into the derived treatment plan. We aim to optimize the reward function and improve RL-generated plan quality.
Methods
We implemented a custom soft actor-critic (SAC) RL framework on an in-house treatment planning system. We designed mathematical reward functions that coordinated the agent's focus among organs-at-risk (OARs) of variable management difficulty. Thirty-eight head-and-neck IMRT cases (Rx: 44 Gy) were randomized into training and testing sets (n=19 each). Agents were trained to conduct plan optimization on the training set—directed by the novel reward models—by making informed adjustments to the planning objectives. Performance was evaluated based on DVH metrics and 3D dose distributions of RL-generated plans on the testing set. Results were benchmarked against a re-implemented, previously published piecewise-linear model.
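The reward models contrasted here can be illustrated with a minimal sketch. The function names, the normalization by the clinical goal, and the steepness parameter k are illustrative assumptions, not the paper's actual formulation; the sketch only shows how a quadratic penalty grows polynomially with OAR overdose while an exponential penalty grows much more sharply, steering the agent toward hard-to-spare OARs.

```python
import math

def quadratic_reward(dose, goal, scale=1.0):
    """Hypothetical quadratic reward: penalty grows with the square of the
    fractional overdose beyond the clinical goal; no penalty if the goal is met."""
    overdose = max(dose - goal, 0.0)
    return -scale * (overdose / goal) ** 2

def exponential_reward(dose, goal, scale=1.0, k=5.0):
    """Hypothetical exponential reward: penalty grows exponentially with the
    fractional overdose, punishing goal violations far more aggressively."""
    overdose = max(dose - goal, 0.0)
    return -scale * (math.exp(k * overdose / goal) - 1.0)

# Example: a parotid median-dose goal of 10 Gy, achieved dose 12 Gy.
# The exponential model penalizes the 20% overshoot much more heavily,
# which is one way a reward can concentrate the agent's effort on
# OARs that are difficult to spare.
print(quadratic_reward(12.0, 10.0))    # small negative penalty
print(exponential_reward(12.0, 10.0))  # much larger negative penalty
```

Both functions return zero when the goal is met, so the agent is rewarded (less penalized) only for genuine dosimetric improvement rather than for overshooting already-satisfied objectives.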
Results
Agents trained with quadratic and exponential rewards outperformed the piecewise-linear baseline. Average maximum PTV doses were 114.4% (quadratic) and 114.7% (exponential), comparable to the piecewise-linear baseline (114.2%). Average median dose to the parotids decreased to 8.6 Gy (quadratic) and 9.0 Gy (exponential) from a baseline of 10.2 Gy, and to 30.4 Gy for the pharynx under both models (baseline: 33.1 Gy). Doses to other OARs remained comparable to the baseline.
Conclusion
RL agents trained with the novel reward function designs achieved a 10% reduction in average median dose to select OARs, with no loss of PTV coverage or uniformity. These results demonstrate that quadratic and exponential functions are superior reward models for RL-based head-and-neck treatment planning.