<p dir="ltr">Reinforcement learning (RL) has been a rapidly evolving field of research over the past years, enhancing developments in areas such as artificial intelligence, healthcare, and education, to name a few. Regardless of the success of RL, its inherent online learning nature presents obstacles for its real-world applications, since in many settings, online data collection with the latest learned policy can be expensive and/or dangerous (such as robotics, healthcare, and autonomous driving). This challenge has catalyzed research into offline RL, which involves reinforcement learning from previously collected static datasets, without the need for further online data collection. However, most existing offline RL methods depend on two key assumptions: unconfoundedness and positivity (also known as the full-coverage assumption), which frequently do not hold in the context of static datasets. </p><p dir="ltr">In the first part of this dissertation, we simultaneously address these two challenges by proposing a novel policy learning algorithm: PESsimistic CAusal Learning (PESCAL). We utilize the mediator variable based on Front-Door Criterion, to remove the confounding bias. Additionally, we adopt the pessimistic principle to tackle the distributional shift problem induced by the under-coverage issue. This issue refers to the mismatch of distributions between the action distributions induced by candidate policies, and the policy that generates the observational data (known as the behavior policy). Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function, to partially mitigate the issue of distributional shift. This insight significantly simplifies our algorithm, by circumventing the challenging task of sequential uncertainty quantification for the estimated Q-function. Moreover, we provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.</p><p dir="ltr">In the second part of this dissertation, in contrast to the first part, which approaches the distributional shift issue implicitly by penalizing the value function as a whole, we explicitly constrain the learned policy to not deviate significantly from the behavior policy, while still enabling flexible adjustment of the degree of constraints. Building upon the offline reinforcement learning algorithm, TD3+BC \cite{fujimoto2021minimalist}, we propose a model-free actor-critic algorithm with an adjustable behavior cloning (BC) term. We employ an ensemble of networks to quantify the uncertainty of the estimated value function, thus addressing the issue of overestimation. Moreover, we introduce a method that is both convenient and intuitively simple for controlling the degree of BC, through a Bernoulli random variable based on the user-specified confidence level for different offline datasets. Our proposed algorithm, named Ensemble-based Actor Critic with Adaptive Behavior Cloning (EABC), is straightforward to implement, exhibits low variance, and achieves strong performance across all D4RL benchmarks.</p>
Identifer | oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/25653462 |
Date | 20 April 2024 |
Creators | Danyang Wang (18414336) |
Source Sets | Purdue University |
Detected Language | English |
Type | Text, Thesis |
Rights | CC BY 4.0 |
Relation | https://figshare.com/articles/thesis/Statistical_Methods_for_Offline_Deep_Reinforcement_Learning/25653462 |
Page generated in 0.0056 seconds