Towards Causal Reinforcement Learning

Causal inference provides a set of principles and tools that allows one to combine data and knowledge about an environment to reason about questions of a counterfactual nature - i.e., what would have happened had reality been different - even when no data from this unrealized reality is available. Reinforcement learning provides a collection of methods that allows an agent to reason about optimal decision-making under uncertainty by trial and error - i.e., what would the consequences (e.g., subsequent rewards, states) be had the action been different? While these two disciplines have evolved independently and with virtually no interaction, they operate over different aspects of the same building block, i.e., counterfactual reasoning, making them umbilically connected.

This dissertation provides a unified framework of causal reinforcement learning (CRL) that formalizes the connection between causal inference and reinforcement learning and studies them side by side. The environment where the agent will be deployed is parsimoniously modeled as a structural causal model (Pearl, 2000) consisting of a collection of data-generating mechanisms that lead to different causal invariances. This formalization, in turn, allows for a unifying treatment of learning strategies that appear different in the literature, including online learning, off-policy learning, and causal identification. Moreover, novel learning opportunities naturally arise that are not addressed by existing strategies and that entail new dimensions of analysis.

Specifically, this work advances our understanding of several dimensions of optimal decision-making under uncertainty, which include the following capabilities and research questions:

Confounding-Robust Policy Evaluation. How to evaluate candidate policies from observations when unobserved confounders exist and actions appear more effective at producing rewards than they are? More importantly, how to derive a bound over the effect of a policy (i.e., a partial evaluation) when it cannot be uniquely determined from the observational data?

Offline-to-Online Learning. Online learning could be applied to fine-tune the partial evaluation of candidate policies. How to leverage such partial knowledge in a future online learning process without hurting the performance of the agent? More importantly, under what conditions can the learning process be accelerated instead?

Imitation Learning from Confounded Demonstrations. How to design a proper performance measure (e.g., reward or utility) from the behavioral trajectories of an expert demonstrating the task? Specifically, under which conditions could an imitator achieve the expert's performance by optimizing the learned reward, even when the imitator's and the expert's sensory capabilities differ and unobserved confounding bias is present in the demonstration data?
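
To make the notion of a "partial evaluation" in the first question concrete, the following is a minimal sketch of the classical natural (Manski-style) bounds on the expected reward of always taking a fixed action, computed from confounded observational data. It is an illustration under stated assumptions, not necessarily the procedure developed in the dissertation: rewards are assumed to lie in [0, 1], and the function name natural_bounds and the sample format (action, reward) are hypothetical choices made for this example.

```python
def natural_bounds(samples, action):
    """Bound E[Y | do(X = action)] from observational (x, y) pairs.

    Assumes every reward y lies in [0, 1]. Writing
        E[Y_x] = E[Y | X = x] P(X = x) + E[Y_x | X != x] P(X != x),
    the first term is observed, while the second is unobserved and can
    only be bounded between 0 and P(X != x).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    # Empirical P(X = action).
    p_x = sum(1 for x, _ in samples if x == action) / n
    # Empirical E[Y * 1{X = action}] = E[Y | X = action] * P(X = action).
    joint = sum(y for x, y in samples if x == action) / n
    lower = joint                  # unobserved part contributes 0
    upper = joint + (1.0 - p_x)    # unobserved part contributes 1
    return lower, upper


# Hypothetical observational data: (action, reward) pairs.
data = [(1, 1.0), (1, 1.0), (1, 0.0), (0, 0.0)]
bounds_for_action_1 = natural_bounds(data, action=1)
bounds_for_action_0 = natural_bounds(data, action=0)
```

In this toy data the naive conditional estimate for action 1 is 2/3, whereas the bounds only commit to an interval, because nothing is known about how the unit that chose action 0 would have fared under action 1. The interval is never wider than 1 and shrinks as the action of interest is observed more often.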

By building on the modern theory of causation, approximation, and statistical learning, this dissertation provides algebraic and graphical criteria and algorithmic procedures to support the inference required in the corresponding learning tasks. This characterization generalizes the computational framework of reinforcement learning by leveraging underlying causal relationships so that it is robust to the distribution shift present in data collected under the agent's different regimes of interaction with the environment, from passive observation to active intervention. The theory provided in this work is general in that it takes an arbitrary set of structural causal assumptions as input and decides whether this specific instance admits an optimal solution.

The problems and methods discussed have several applications in the empirical sciences, statistics, operations management, machine learning, and artificial intelligence.

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/8yzw-gv74
Date: January 2023
Creators: Zhang, Junzhe
Source Sets: Columbia University
Language: English
Detected Language: English
Type: Theses