Data-dependent Regret Bounds for Adversarial Multi-Armed Bandits and Online Portfolio Selection

This dissertation studies data-dependent regret bounds for two online learning problems. As opposed to worst-case regret bounds, data-dependent bounds adapt to the particular sequence of losses seen by the player, and thus offer a more fine-grained performance guarantee.

We start with the Adversarial n-Armed Bandit problem. In the prior literature it was standard practice to assume that the loss vectors belong to a known domain, typically [0,1]ⁿ or [-1,1]ⁿ. We make no such assumption: the loss vectors may be completely arbitrary. We term this problem the Scale-Free Adversarial Multi-Armed Bandit. At the beginning of the game, the player knows only the number of arms n; it knows neither the scale and magnitude of the losses chosen by the adversary nor the number of rounds T. In each round, it receives bandit feedback about the loss vectors l₁, …, l_T ∈ ℝⁿ. Our goal is to bound its regret as a function of n and the norms of l₁, …, l_T. We design a bandit Follow The Regularized Leader (FTRL) algorithm that uses a log-barrier regularizer along with an adaptive learning rate tuned via the AdaFTRL technique. We give two different regret bounds, depending on the exploration parameter used. With non-adaptive exploration, our algorithm has a regret of Õ(√(nL₂) + L_∞√(nT)), and with adaptive exploration it has a regret of O(√(nL₂) + L_∞√(nL₁)). Here L_∞ = sup_t ∥l_t∥_∞, L₂ = Σ_{t=1}^T ∥l_t∥₂², L₁ = Σ_{t=1}^T ∥l_t∥₁, and the Õ notation suppresses logarithmic factors. These are the first MAB bounds that adapt to the ∥·∥₂ and ∥·∥₁ norms of the losses. The second bound is the first data-dependent scale-free MAB bound, as T does not directly appear in the regret. We also develop a new technique for obtaining a rich class of local-norm lower bounds for Bregman divergences. This technique plays a crucial role in our analysis, controlling the regret when using importance-weighted estimators of unbounded losses.
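The loop described above — log-barrier FTRL over the simplex, fed by importance-weighted loss estimates under explicit exploration — can be sketched as follows. This is a minimal illustration, not the dissertation's algorithm: the learning rate `eta` and exploration rate `gamma` are fixed placeholders rather than the AdaFTRL-tuned adaptive schedules, and all function names and parameter values are hypothetical.

```python
import numpy as np

def log_barrier_ftrl_weights(cum_loss, eta):
    """FTRL step with the log-barrier regularizer R(p) = -sum_i log p_i.

    Minimizes eta*<cum_loss, p> + R(p) over the probability simplex.
    The KKT conditions give p_i = 1/(eta*cum_loss_i + lam), with lam
    found by bisection so that the p_i sum to one.
    """
    c = eta * np.asarray(cum_loss, dtype=float)
    n = len(c)
    lo = -c.min() + 1e-12        # p_i > 0 requires lam > -min_i c_i
    hi = lo + n                  # here sum_i p_i <= n/(n + 1e-12) < 1
    for _ in range(100):         # bisect: sum(p) > 1 at lo, < 1 at hi
        mid = 0.5 * (lo + hi)
        if np.sum(1.0 / (c + mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    p = 1.0 / (c + hi)
    return p / p.sum()           # tiny renormalization for safety

def scale_free_bandit(losses, eta=0.5, gamma=0.05, rng=None):
    """One run of log-barrier bandit FTRL with importance weighting.

    `losses` is a (T, n) array of arbitrary real loss vectors; the
    player only observes the loss of the arm it actually pulls.
    Returns the player's total incurred loss.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, n = losses.shape
    cum_est = np.zeros(n)        # cumulative importance-weighted estimates
    total = 0.0
    for t in range(T):
        p = log_barrier_ftrl_weights(cum_est, eta)
        q = (1.0 - gamma) * p + gamma / n   # mix in uniform exploration
        q /= q.sum()
        arm = rng.choice(n, p=q)
        total += losses[t, arm]
        est = np.zeros(n)
        est[arm] = losses[t, arm] / q[arm]  # unbiased loss estimate
        cum_est += est
    return total
```

Note that nothing here assumes bounded losses: the importance-weighted estimates can be arbitrarily large, which is exactly the regime where the local-norm lower bounds for Bregman divergences are needed in the analysis.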

Next, we consider the Online Portfolio Selection (OPS) problem over n assets and T time periods. This problem was first studied by Cover [1], who proposed the Universal Portfolio (UP) algorithm. UP is a computationally expensive algorithm with minimax optimal regret of O(n log T). There has been renewed interest in OPS due to an open problem recently posed by Van Erven et al. [2], which asks for a computationally efficient algorithm that also has minimax optimal regret. We study data-dependent regret bounds for the OPS problem that adapt to the sequence of returns seen by the investor. Our proposed algorithm, AdaCurv ONS, modifies the Online Newton Step (ONS) algorithm of [3] using a new adaptive curvature surrogate function for the log losses −log(r_tᵀw). We show that AdaCurv ONS has O(Rn log T) regret, where R is a data-dependent quantity. For sequences where R = O(1), the regret of AdaCurv ONS matches the optimal regret. However, for some sequences R could be unbounded, making the regret bound vacuous. To overcome this issue, we propose the LB-AdaCurv ONS algorithm, which adds a log-barrier regularizer along with an adaptive learning rate tuned via the AdaFTRL technique. LB-AdaCurv ONS has an adaptive regret of the form O(min(nR log T, √(nT) log T)). Thus, LB-AdaCurv ONS has a worst-case regret of O(√(nT) log T) while also having a data-dependent regret of O(nR log T) when R = O(1). Additionally, we show logarithmic first-order and second-order regret bounds for AdaCurv ONS and LB-AdaCurv ONS.
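For context, the base Online Newton Step update that AdaCurv ONS modifies can be sketched as below. This is a hedged simplification, not the dissertation's algorithm: it uses the standard ONS second-order update with a fixed parameter `beta` instead of the adaptive curvature surrogate, and it projects onto the simplex in the Euclidean norm rather than the A_t-norm the analysis calls for; all names and parameter values are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def ons_portfolio(returns, beta=2.0, eps=1.0):
    """Vanilla ONS on the OPS log-loss -log(r_t . w).

    `returns` is a (T, n) array of per-period price relatives.
    Returns the final portfolio and the realized log-wealth.
    """
    T, n = returns.shape
    w = np.full(n, 1.0 / n)
    A = eps * np.eye(n)          # running curvature (Hessian-like) matrix
    log_wealth = 0.0
    for r in returns:
        gain = float(r @ w)
        log_wealth += np.log(gain)
        g = -r / gain            # gradient of -log(r . w) at w
        A += np.outer(g, g)
        # NOTE: the regret analysis projects in the A-norm; the plain
        # Euclidean projection below is a simplification for brevity.
        w = project_simplex(w - np.linalg.solve(A, g) / beta)
    return w, log_wealth
```

The data-dependent quantity R in the AdaCurv ONS bound measures how much curvature the realized loss sequence actually provides; the sketch above fixes that curvature through `beta` rather than adapting it per round.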

Finally, we consider the problem of Online Portfolio Selection with predicted returns. We are the first to extend the paradigm of online learning with predictions to the portfolio selection problem. In this setting, the investor has access to noisy predictions of the returns of the n assets, which can be incorporated into the portfolio selection process. We propose the Optimistic Expected Utility LB-FTRL (OUE-LB-FTRL) algorithm, which incorporates the predictions into the LB-FTRL algorithm through a utility function. We explore the consistency-robustness properties of our algorithm. If the predictions are accurate, OUE-LB-FTRL's regret is O(n log T), providing a consistency guarantee. Even if the predictions are arbitrary, OUE-LB-FTRL's regret is always bounded by O(√(nT) log T), providing a robustness guarantee. Our algorithm also recovers a gradual-variation regret bound for OPS. In the presence of predictions, we argue that the benchmark of static regret becomes less meaningful. So, we consider the regret with respect to an investor who uses only the predictions to select their portfolio (i.e., an expected utility investor). We provide a meta-algorithm, Best-of-Both-Worlds for OPS (BoB-OPS), that combines the portfolios of an expected utility investor and a purely regret-minimizing investor using a higher-level portfolio selection algorithm. By instantiating the meta-algorithm and the purely regret-minimizing investor with Cover's Universal Portfolio, we show that the regret of BoB-OPS with respect to the expected utility investor is O(log T). Simultaneously, BoB-OPS's static regret is O(n log T). This achieves a stronger form of consistency-robustness guarantee for OPS with predicted returns.
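The meta-algorithm idea can be illustrated with a small sketch. Everything here is a hedged stand-in, not the dissertation's construction: the top level uses a simple wealth-weighted (buy-and-hold) combination of the two investors instead of Cover's Universal Portfolio, the expected utility investor is reduced to going all-in on the asset with the highest predicted return, and the regret minimizer is a basic exponentiated-gradient rule; all names and parameters are hypothetical.

```python
import numpy as np

def bob_ops(returns, predictions, lr=0.05):
    """Wealth-weighted combination of a prediction-following investor
    and a regret-minimizing investor (a simplified BoB-OPS stand-in).

    `returns` and `predictions` are (T, n) arrays of price relatives
    and predicted price relatives. Returns the realized log-wealth.
    """
    T, n = returns.shape
    w_reg = np.full(n, 1.0 / n)  # exponentiated-gradient portfolio
    wealth_pred = 1.0            # wealth of the prediction follower
    wealth_reg = 1.0             # wealth of the regret minimizer
    log_wealth = 0.0
    for r, rhat in zip(returns, predictions):
        w_pred = np.zeros(n)
        w_pred[np.argmax(rhat)] = 1.0   # all-in on best predicted asset
        a = wealth_pred / (wealth_pred + wealth_reg)
        w = a * w_pred + (1.0 - a) * w_reg   # meta-portfolio
        log_wealth += np.log(r @ w)
        wealth_pred *= r @ w_pred
        wealth_reg *= r @ w_reg
        # exponentiated-gradient update for the regret minimizer
        w_reg = w_reg * np.exp(lr * r / (r @ w_reg))
        w_reg /= w_reg.sum()
    return log_wealth
```

The wealth-share rule above is exactly a buy-and-hold over the two base strategies, so the combined wealth is the average of theirs and trails the better investor by at most log 2 in log-wealth; the dissertation's BoB-OPS instead instantiates the top level with Cover's Universal Portfolio to obtain the stated O(log T) and O(n log T) guarantees.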

Identifier: oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/xf8d-vq32
Date: January 2024
Creators: Putta, Sudeep Raja
Source Sets: Columbia University
Language: English
Detected Language: English
Type: Theses
