
Conditioning behavior styles of Reinforcement Learning policies

Mysore Sthaneshwar, Siddharth (19 September 2023)
Reinforcement Learning (RL) algorithms may learn any of an arbitrary set of behaviors that satisfy a reward-based objective, and this lack of consistency can limit the reliability and practical utility of RL. By examining how RL policies are trained, this thesis identifies aspects of the core optimization loop that significantly impact which behaviors are learned and how, and develops frameworks for manipulating these aspects to define and train desirable behavior in practical and more user-friendly ways.

A lack of smoothness in RL-based control was found to be a common issue among existing applications of RL to real-world control. Our initial work on REinforcement-based transferable Agents through Learning (RE+AL) demonstrates that, through principled reward engineering and training-environment tuning, it is possible to learn effective and smooth control, but extending this approach to new tasks remains tedious. Conditioning for Action Policy Smoothness (CAPS) instead introduces simple regularization terms directly into the policy optimization and serves as a generalized solution to smooth control that is more easily extensible across tasks.

Examining how neural network architectural choices impact policy learning, we observed that the burden of complexity in learning and representation often falls disproportionately on the value function approximations learned during training. Building on this observation, Multi-Critic Actor Learning (MultiCriticAL) was developed for multi-task RL, drawing on the intuition that, if value functions estimating policy quality are difficult to learn, giving each task its own critic eases this representational burden. MultiCriticAL provides an effective tool for learning policies that can smoothly transition between multiple behavior styles and outperforms commonly used single-critic techniques in both reward-based performance and data efficiency, even enabling learning in cases where baseline methods otherwise fail.

For non-expert practitioners, demonstrations of desirable behavior are often easier to provide than fine-tuned heuristics, making imitation learning an attractive avenue for user-friendly policy design. Where heuristic-based rewards can guide RL toward general competence, imitation can condition optimization toward specific behaviors, though this requires balancing possibly conflicting RL and imitation optimization signals. We overcome this challenge by extending MultiCriticAL to learning behavior from demonstrations. The Split-Critic Imitation Learning (SCIL) framework allows specific behaviors to be defined in the parts of the state space where they matter, while letting the policy learn any other compatible, generally useful behavior over the remaining states using a standard reward-based RL training loop. Inheriting the strengths of MultiCriticAL, SCIL better separates and balances reinforcement- and imitation-based optimization signals so that both are adequately handled where contemporary state-of-the-art imitation learning frameworks may fail, while also improving imitation performance and data efficiency.
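To make the CAPS idea mentioned above concrete: the regularization amounts to two penalty terms added to the policy loss, one tying consecutive actions together (temporal smoothness) and one tying actions at nearby states together (spatial smoothness). The sketch below is illustrative only and is not code from the thesis; it assumes a deterministic PyTorch policy network, and the names `caps_regularizers`, `lambda_t`, `lambda_s`, and the perturbation scale `sigma` are placeholders.

```python
import torch
import torch.nn.functional as F

def caps_regularizers(policy, s_t, s_next, sigma=0.05):
    """CAPS-style smoothness penalties (illustrative sketch, not thesis code)."""
    a_t = policy(s_t)
    # Temporal smoothness: keep the action close to the action taken at the next state.
    loss_temporal = F.mse_loss(a_t, policy(s_next))
    # Spatial smoothness: keep the action close to the action at a slightly perturbed state.
    s_perturbed = s_t + sigma * torch.randn_like(s_t)
    loss_spatial = F.mse_loss(a_t, policy(s_perturbed))
    return loss_temporal, loss_spatial

# Hypothetical use inside a policy update step:
#   l_t, l_s = caps_regularizers(policy, states, next_states)
#   total_loss = policy_loss + lambda_t * l_t + lambda_s * l_s
```

In this sketch the two weights trade off smoothness against task reward; the thesis's actual formulation and tuning should be consulted for how these terms are set in practice.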
