Essays on Adaptive Experimentation: Bringing Real-World Challenges to Multi-Armed Bandits

Classical randomized controlled trials have long been the gold standard for estimating treatment effects. However, adaptive experimentation, especially through multi-armed bandit algorithms, aims to improve efficiency beyond what static randomization can achieve. While there is a vast literature on multi-armed bandits, a simple yet powerful framework in reinforcement learning, real-world challenges can hinder the successful implementation of adaptive algorithms. This thesis seeks to bridge this gap by integrating real-world challenges into multi-armed bandits.

The first chapter examines two competing priorities that practitioners often encounter in adaptive experiments: maximizing total welfare through effective treatment assignments and swiftly conducting experiments to implement population-wide treatments. We propose a unified model that simultaneously accounts for within-experiment performance and post-experiment outcomes. We provide a sharp theory of optimal performance that not only unifies canonical results from the literature on regret minimization and best-arm identification but also uncovers novel insights. Our theory reveals that familiar algorithms, such as the recently proposed top-two Thompson sampling algorithm, can optimize a broad class of objectives if a single scalar parameter is appropriately adjusted. Furthermore, we demonstrate that substantial reductions in experiment duration can often be achieved with minimal impact on total regret.
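The top-two sampling idea referenced above can be illustrated with a minimal sketch for Bernoulli-reward arms. This is an illustrative implementation of the generic top-two Thompson sampling scheme, not the thesis's tuned algorithm; the function and parameter names (e.g. `beta`, the probability of playing the sampled leader rather than the challenger) are assumptions for the example. Adjusting this single scalar is the kind of tuning the chapter shows can trade off within-experiment regret against experiment duration.

```python
import numpy as np

def top_two_thompson_step(successes, failures, beta=0.5, rng=None):
    """One assignment step of top-two Thompson sampling for Bernoulli arms.

    successes/failures: per-arm Beta posterior counts (Beta(1,1) prior -> start at 1).
    beta: probability of playing the sampled leader rather than the challenger.
    """
    rng = rng or np.random.default_rng()
    # Leader: the arm that looks best under one posterior sample.
    leader = int(np.argmax(rng.beta(successes, failures)))
    if rng.random() < beta:
        return leader
    # Challenger: resample the posterior until a different arm looks best.
    while True:
        challenger = int(np.argmax(rng.beta(successes, failures)))
        if challenger != leader:
            return challenger

def run_experiment(true_means, horizon, beta=0.5, seed=0):
    """Run an adaptive experiment and return the recommended arm."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    s, f = np.ones(k), np.ones(k)  # Beta(1,1) priors on each arm
    for _ in range(horizon):
        arm = top_two_thompson_step(s, f, beta, rng)
        reward = rng.random() < true_means[arm]
        s[arm] += reward
        f[arm] += 1 - reward
    # Post-experiment recommendation: arm with highest posterior mean.
    return int(np.argmax(s / (s + f)))
```

With well-separated arms and a moderate horizon, the recommendation concentrates on the best arm; for example, `run_experiment([0.2, 0.5, 0.8], horizon=2000)` returns the index of the 0.8 arm.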

The second chapter studies the fundamental tension between the distinct priorities of non-adaptive and adaptive experiments: robustness to exogenous variation and efficient information gathering. We introduce a novel multi-armed bandit model that incorporates nonstationary exogenous factors, and propose deconfounded Thompson sampling, a more robust variant of the prominent Thompson sampling algorithm. We provide bounds on both within-experiment and post-experiment regret of deconfounded Thompson sampling, illustrating its resilience to exogenous variation and the delicate balance it strikes between exploration and exploitation. Our proofs leverage inverse propensity weights to analyze the evolution of the posterior distribution, a departure from established methods in the literature. Hinting that new understanding is indeed necessary, we demonstrate that a deconfounded variant of the popular upper confidence bound algorithm can fail completely.
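The inverse propensity weighting idea underlying the analysis admits a small self-contained illustration. The sketch below shows generic IPW estimation of per-arm mean rewards under time-varying assignment probabilities; it is not the thesis's deconfounded Thompson sampling algorithm, and the function name `ipw_arm_means` is an assumption for the example.

```python
import numpy as np

def ipw_arm_means(arms, rewards, propensities, n_arms):
    """Inverse-propensity-weighted estimates of per-arm mean rewards.

    arms[t]: arm pulled at round t; rewards[t]: observed reward;
    propensities[t]: probability the algorithm assigned arms[t] at round t.
    Dividing each observation by its assignment probability corrects for
    the adaptive, time-varying sampling so that arms pulled rarely in some
    periods are not systematically under- or over-weighted.
    """
    T = len(arms)
    est = np.zeros(n_arms)
    for t in range(T):
        est[arms[t]] += rewards[t] / propensities[t]
    return est / T
```

For instance, an arm pulled with probability 0.5 in every round and always yielding reward 1 receives weight 1/0.5 = 2 on half the rounds, so its IPW estimate averages to 1, its true mean.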

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/ht74-e058
Date: January 2024
Creators: Qin, Chao
Source Sets: Columbia University
Language: English
Type: Theses