1.
Thompson sampling-based online decision making in network routing
Huang, Zhiming, 02 September 2020
Online decision making is a class of machine learning problems in which decisions are made sequentially so as to accumulate as much reward as possible. Typical examples include multi-armed bandit (MAB) problems, where an agent must decide which arm to pull in each round, and network routing problems, where each router must decide the next hop for each packet. Thompson sampling (TS) is an efficient and effective algorithm for online decision making. Although TS was proposed long ago, theoretical guarantees for TS in the standard MAB setting were established only in recent years. In this thesis, we first analyze the performance of TS, both theoretically and empirically, in a special MAB called the combinatorial MAB with sleeping arms and long-term fairness constraints (CSMAB-F). We then apply TS to a novel reactive network routing problem, opportunistic routing without link metrics known a priori, and use the proof techniques developed for CSMAB-F to analyze its performance.
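For reference, the basic Thompson sampling loop this thesis builds on can be sketched for a Bernoulli bandit as follows; the arm means, horizon, and Beta(1, 1) priors below are illustrative choices, not values from the thesis.

```python
import numpy as np

def thompson_sampling(true_means, horizon, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)   # Beta posterior parameter alpha for each arm
    failures = np.ones(k)    # Beta posterior parameter beta for each arm
    total_reward = 0.0
    for _ in range(horizon):
        samples = rng.beta(successes, failures)    # sample a plausible mean per arm
        arm = int(np.argmax(samples))              # play the arm with the best sample
        reward = rng.binomial(1, true_means[arm])  # observe a Bernoulli reward
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: three arms; TS should concentrate its pulls on the 0.7 arm.
print(thompson_sampling([0.3, 0.5, 0.7], horizon=10_000))
```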
2.
Modelling Infertility with Markov Chains
Dorff, Rebecca, 20 June 2013
Infertility affects approximately 15% of couples. Testing and interventions are costly in time, money, and emotional energy. This paper discusses using Markov decision processes and multi-armed bandit processes to identify a systematic approach to interventions that leads to the desired baby while minimizing costs.
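As a toy illustration of the Markov decision process view (with hypothetical states, costs, and success probabilities, none of which are taken from the paper), a one-state value iteration might look like this:

```python
# One transient state ("not yet pregnant"), one absorbing goal state, and two actions
# with hypothetical per-cycle costs and success probabilities; all numbers are illustrative.
success_prob = {"wait": 0.05, "treat": 0.25}  # chance the cycle ends in pregnancy
cost = {"wait": 0.0, "treat": 1.0}            # cost of the action for one cycle
reward_on_success = 10.0
gamma = 0.95                                  # per-cycle discount factor

def q_value(action, v):
    p = success_prob[action]
    return -cost[action] + p * reward_on_success + (1 - p) * gamma * v

v = 0.0
for _ in range(500):  # value iteration on the single transient state
    v = max(q_value(a, v) for a in success_prob)

best_action = max(success_prob, key=lambda a: q_value(a, v))
print(round(v, 2), best_action)
```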
3.
Group-Envy Fairness in the Stochastic Bandit Setting
Scinocca, Stephen, 29 September 2022
We introduce a new group-fairness-inspired stochastic multi-armed bandit problem in the pure exploration setting. We look at the discrepancy between an arm's mean reward for a group and the highest mean reward any arm gives that group, and call this the disappointment that the group suffers from that arm. We define the optimal arm to be the one that minimizes the maximum disappointment over all groups. This optimal arm addresses a problem with maximin fairness, where the group used to choose the maximin best arm suffers little disappointment regardless of which arm is picked, while another group suffers significantly more disappointment when that arm is picked as the best one. The challenge of this problem is that the highest mean reward for a group, and the arm that gives that reward, are unknown. This means we need to pull arms for multiple goals: to find the optimal arm, and to estimate the highest mean rewards of certain groups. This leads to a new adaptive sampling algorithm for best arm identification in the fixed confidence setting, called MD-LUCB, or Minimax Disappointment LUCB. We prove bounds on MD-LUCB's sample complexity and then study its performance with empirical simulations.
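To make the objective concrete, the following sketch computes disappointments and the minimax-disappointment arm when the group-conditional means are known; in the thesis these means are unknown and must be estimated by MD-LUCB, and the matrix below is purely illustrative.

```python
import numpy as np

# means[g, a] = mean reward of arm a for group g (illustrative numbers).
means = np.array([
    [0.9, 0.6, 0.7],   # group 0
    [0.5, 0.8, 0.75],  # group 1
])

# Disappointment of arm a for group g: the group's best mean minus means[g, a].
disappointment = means.max(axis=1, keepdims=True) - means

# The optimal arm minimizes the worst-case (maximum over groups) disappointment.
worst_case = disappointment.max(axis=0)
best_arm = int(np.argmin(worst_case))

print(disappointment)
print("minimax-disappointment arm:", best_arm)
```

In this toy instance the chosen arm is neither group's favorite, but it keeps the larger of the two groups' disappointments as small as possible.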
4.
Essays in Econometrics and Dynamic Kidney Exchange
Baisi Hadad, Vitor, January 2018
Thesis advisor: Stefan Hoderlein

This dissertation is divided into two parts.

Part I - Dynamic Kidney Exchange. In recent years, kidney paired donation (KPD) has emerged as an attractive alternative for end-stage renal disease patients with incompatible living donors. However, we argue that the matching algorithm currently used by organ clearinghouses is inefficient, in the sense that a larger number of patients could be reached if kidney transplant centers took into consideration how their pool of patients and donors will evolve over time. In our work Two Novel Algorithms for Dynamic Kidney Exchange, we explore this claim and propose new computational algorithms to increase the cardinality of matchings in a discrete-time dynamic kidney exchange model with Poisson entries and geometric deaths. Our algorithms are classified into direct prediction methods and multi-armed bandit methods. In the direct prediction methods, we use a machine learning estimator to produce the probability that each patient-donor pair should be matched today, as opposed to being left for a future matching; the estimators are trained on offline optimal solutions. In contrast, in the multi-armed bandit methods, we use simulations to evaluate the desirability of different matchings. Since the number of possible matchings is enormous, multi-armed bandits (MAB) are employed to reduce the computational burden. Our methods are evaluated in a variety of simulation configurations. We find that at least one of our methods, based on multi-armed bandit algorithms, uniformly dominates the myopic method used by kidney transplant centers in practice. We restrict our experiments to pairwise kidney exchange, but the methods described here are easily extensible, computational constraints permitting.

Part II - Econometrics. In our econometric paper Heterogeneous Production Functions, Panel Data, and Productivity, we present methods for identification of moments and nonparametric marginal distributions of endogenous random coefficient models in fixed-T linear panel data models. Our identification strategy is constructive, immediately leading to relatively simple estimators. Because the strategy makes use of special properties of "small" (measure-zero) subpopulations, our estimators are irregularly identified: they can be shown to be consistent and asymptotically normal, but they converge at rates slower than root-n. We illustrate our methods by estimating first and second moments of random Cobb-Douglas coefficients in production functions, using Indian plant-level microdata.

Thesis (PhD), Boston College, Graduate School of Arts and Sciences, 2018. Discipline: Economics.
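The bandit-over-matchings idea can be illustrated with a generic UCB loop in which each arm is a candidate matching and each pull is one simulated rollout of its long-run value; the simulator, candidate set, and values below are placeholders, and the dissertation's actual algorithms are more involved.

```python
import math
import random

def simulate_value(matching):
    # Placeholder simulator: in practice this would roll the exchange pool forward
    # under Poisson arrivals and geometric departures after applying `matching`.
    return random.gauss(matching["true_value"], 1.0)

def ucb_over_matchings(matchings, budget):
    counts = [0] * len(matchings)
    sums = [0.0] * len(matchings)
    for t in range(1, budget + 1):
        if t <= len(matchings):
            i = t - 1  # try each candidate matching once first
        else:
            i = max(range(len(matchings)),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        value = simulate_value(matchings[i])
        counts[i] += 1
        sums[i] += value
    return max(range(len(matchings)), key=lambda j: sums[j] / counts[j])

# Three hypothetical candidate matchings with different long-run values.
candidates = [{"true_value": 3.0}, {"true_value": 4.5}, {"true_value": 4.0}]
print("chosen matching:", ucb_over_matchings(candidates, budget=2000))
```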
5.
Minimizing Regret in Combinatorial Bandits and Reinforcement Learning
Talebi Mazraeh Shahi, Mohammad Sadegh, January 2017
This thesis investigates sequential decision making tasks that fall in the framework of reinforcement learning (RL). These tasks involve a decision maker who repeatedly interacts with an environment modeled by an unknown finite Markov decision process (MDP) and who wishes to maximize a notion of reward accumulated during her experience. Her performance can be measured through the notion of regret, which compares her accumulated expected reward against that achieved by an oracle algorithm always following an optimal behavior. In order to maximize her accumulated reward, or equivalently to minimize the regret, she needs to face a trade-off between exploration and exploitation.

The first part of this thesis investigates combinatorial multi-armed bandit (MAB) problems, which are RL problems whose state space is a singleton, and addresses some applications that can be cast as combinatorial MAB problems. The number of arms in such problems generically grows exponentially with the number of basic actions, but the rewards of the various arms are correlated. Hence, the challenge in these problems is to exploit the underlying combinatorial structure. For these problems, we derive asymptotic (i.e., when the time horizon grows large) lower bounds on the regret of any admissible algorithm and investigate how these bounds scale with the dimension of the underlying combinatorial structure. We then propose several algorithms and provide finite-time analyses of their regret. The proposed algorithms efficiently exploit the structure of the problem, provide better performance guarantees than existing algorithms, and significantly outperform these algorithms in practice.

The second part of the thesis concerns RL in an unknown and discrete MDP under the average-reward criterion. We develop variations of the transportation lemma that can serve as novel tools for the regret analysis of RL algorithms. Revisiting existing regret lower bounds allows us to derive alternative bounds, which suggest that the local variance of the bias function of the MDP, i.e., the variance with respect to next-state transition laws, could serve as a notion of problem complexity for regret minimization in RL. Leveraging these tools also allows us to report a novel regret analysis of the KL-UCRL algorithm for ergodic MDPs. The leading term in our regret bound depends on the local variance of the bias function, thus coinciding with observations obtained from the presented lower bounds. Numerical evaluations on benchmark MDPs indicate that the leading term of the derived bound can provide an order of magnitude improvement over previously known results for this algorithm.
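The KL-divergence-based confidence indices that appear throughout this line of work (in the lower bounds and in algorithms such as KL-UCRL and its bandit counterpart KL-UCB) can be sketched for Bernoulli rewards as follows; this is a simplified bandit-style index computed by bisection, not the full MDP algorithm analysed in the thesis.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for numerical stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t):
    """Largest q >= mean such that pulls * KL(mean, q) <= log(t), found by bisection."""
    target = math.log(max(t, 2))
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# An arm with empirical mean 0.4 after 100 pulls, evaluated at round 1000.
print(kl_ucb_index(mean=0.4, pulls=100, t=1000))
```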
6.
Bandits Manchots sur Flux de Données Non Stationnaires / Multi-armed Bandits on Non-Stationary Data Streams
Allesiardo, Robin, 19 October 2016
The multi-armed bandit is a framework for studying the trade-off between exploration and exploitation under partial feedback. At each turn t ∈ [1, T] of the game, a player chooses an arm kt from a set of K arms and receives a reward ykt drawn from a reward distribution D(µkt) with mean µkt and support [0, 1]. The problem is challenging because the player only observes the reward associated with the played arm and does not know what the reward would have been had she played another arm. Before each play, she therefore faces the dilemma between exploration and exploitation: exploring sharpens the reward estimators, while exploiting accumulates more reward by playing the empirically best arm (provided the empirically best arm is indeed the actual best arm). In the first part of the thesis, we tackle the multi-armed bandit problem when the reward distributions are non-stationary. We first study the case where, even though the distributions change over time, the best arm stays the same; we then study the case where the best arm can also change over time. The second part of the thesis addresses the contextual bandit problem, where the means of the reward distributions depend on the current state of the environment. We study the use of neural networks and random forests for contextual bandits, and then propose meta-bandit approaches for selecting online the best-performing expert during its learning.
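A standard way to handle such non-stationarity is to base estimates on recent observations only; the sliding-window UCB sketch below illustrates the idea on a toy abrupt change (the window size, reward model, and change point are illustrative, and this is not necessarily the algorithm proposed in the thesis).

```python
import math
import random
from collections import deque

def sliding_window_ucb(reward_fn, n_arms, horizon, window=200):
    history = deque()  # (arm, reward) pairs observed within the current window
    choices = []
    for t in range(1, horizon + 1):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r
        if min(counts) == 0:
            arm = counts.index(0)  # play any arm unseen in the current window
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(min(t, window)) / counts[a]))
        r = reward_fn(arm, t)
        history.append((arm, r))
        if len(history) > window:
            history.popleft()  # forget observations older than the window
        choices.append(arm)
    return choices

# Toy drift: arm 0 is best during the first half, arm 1 during the second half.
reward = lambda a, t: float(random.random() < (0.8 if (a == 0) == (t < 2500) else 0.2))
print(sliding_window_ucb(reward, n_arms=2, horizon=5000)[-10:])
```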
7.
Hyperparameter Tuning for Reinforcement Learning with Bandits and Off-Policy Sampling
Hauser, Kristen, 21 June 2021
No description available.
8.
A Recommender System for Suggested Sites using Multi-Armed Bandits: Initialising Bandit Contexts by Neural Collaborative Filtering / Ett rekommendationssystem för länkförslag byggt på flerarmade banditer
Stenberg, William, January 2021
The abundance of information available on the internet necessitates means of quickly finding what is relevant for the individual user. To this end, there has been much research on recommender systems and, lately, on methods using deep learning for such systems. This work proposes a multi-armed bandit as a recommender for suggested sites on a browser start page. The system is compared to a pre-existing baseline and does not manage to outperform it in the setting used in controlled experiments. A neural collaborative filtering system is then constructed using a stacked autoencoder and is used to produce user preference vectors that are fed into the bandit in the hope of improving its performance. Analysis indicates that the bandit solution works better as the number of items grows. The user-informed initialisation used in this work shows a trend of improving over a randomly initialised bandit, but the results are inconclusive. This work also contributes an analysis of the problem domain, including which factors affect the training of the preference-vector model and the performance of the bandit algorithms.
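The combination described here, user preference vectors feeding a contextual bandit, can be sketched with a LinUCB-style arm update; the preference vector would come from the autoencoder, but below it is a random placeholder, and the thesis's exact bandit variant may differ.

```python
import numpy as np

class LinUCBArm:
    """One arm (candidate site) of a LinUCB contextual bandit."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)      # ridge-regression Gram matrix
        self.b = np.zeros(dim)    # accumulated reward-weighted contexts
        self.alpha = alpha        # exploration strength

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                           # current coefficient estimate
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

dim, n_sites = 16, 5
arms = [LinUCBArm(dim) for _ in range(n_sites)]
rng = np.random.default_rng(0)

user_vector = rng.normal(size=dim)          # placeholder for an autoencoder preference vector
site = int(np.argmax([arm.score(user_vector) for arm in arms]))
arms[site].update(user_vector, reward=1.0)  # e.g. the user clicked the suggested site
print("recommended site:", site)
```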
9.
Efficient Online Learning with Bandit Feedback
Liu, Fang, 29 September 2020
No description available.
10.
Online Combinatorial Optimization under Bandit Feedback
Talebi Mazraeh Shahi, Mohammad Sadegh, January 2016
Multi-Armed Bandits (MAB) constitute the most fundamental model for sequential decision making problems with an exploration vs. exploitation trade-off. In such problems, the decision maker selects an arm in each round and observes a realization of the corresponding unknown reward distribution. Each decision is based on past decisions and observed rewards. The objective is to maximize the expected cumulative reward over some time horizon by balancing exploitation (arms with higher observed rewards should be selected often) and exploration (all arms should be explored to learn their average rewards). Equivalently, the performance of a decision rule or algorithm can be measured through its expected regret, defined as the gap between the expected reward achieved by the algorithm and that achieved by an oracle algorithm always selecting the best arm. This thesis investigates stochastic and adversarial combinatorial MAB problems, where each arm is a collection of several basic actions taken from a set of d elements, such that the set of arms has a certain combinatorial structure. Examples of such sets include fixed-size subsets, matchings, spanning trees, and paths. These problems are specific forms of online linear optimization, where the decision space is a subset of the d-dimensional hypercube. Due to the combinatorial nature, the number of arms generically grows exponentially with d. Hence, treating arms as independent and applying classical sequential arm selection policies would yield prohibitive regret. It is therefore crucial to exploit the combinatorial structure of the problem to design efficient arm selection algorithms.

As the first contribution of this thesis, in Chapter 3 we investigate combinatorial MABs in the stochastic setting with Bernoulli rewards. We derive asymptotic (i.e., when the time horizon grows large) lower bounds on the regret of any algorithm under bandit and semi-bandit feedback. The proposed lower bounds are problem-specific and tight, in the sense that there exists an algorithm that achieves these regret bounds. Our derivation leverages theoretical results in adaptive control of Markov chains. Under semi-bandit feedback, we further discuss the scaling of the proposed lower bound with the dimension of the underlying combinatorial structure. For the case of semi-bandit feedback, we propose ESCB, an algorithm that efficiently exploits the structure of the problem, and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms and significantly outperforms them in practice.

In the fourth chapter, we consider stochastic combinatorial MAB problems where the underlying combinatorial structure is a matroid. Specializing the results of Chapter 3 to matroids, we provide explicit regret lower bounds for this class of problems. For the case of semi-bandit feedback, we propose KL-OSM, a computationally efficient greedy-based algorithm that exploits the matroid structure. Through a finite-time analysis, we prove that the regret upper bound of KL-OSM matches the proposed lower bound, making it the first asymptotically optimal algorithm for this class of problems. Numerical experiments validate that KL-OSM outperforms state-of-the-art algorithms in practice as well.

In the fifth chapter, we investigate the online shortest-path routing problem, an instance of combinatorial MABs with geometric rewards. We consider and compare three different types of online routing policies, depending (i) on where routing decisions are taken (at the source or at each node), and (ii) on the received feedback (semi-bandit or bandit). For each case, we derive the asymptotic regret lower bound. These bounds help us understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use, while (ii) can have a spectacular impact. For source routing under semi-bandit feedback, we then propose two algorithms with a trade-off between computational complexity and performance. The regret upper bounds of these algorithms improve over those of existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments.

Finally, we discuss combinatorial MABs in the adversarial setting under bandit feedback. We concentrate on the case where arms have the same number of basic actions but are otherwise arbitrary. We propose CombEXP, an algorithm that has the same regret scaling as state-of-the-art algorithms. Furthermore, we show that CombEXP admits lower computational complexity for some combinatorial problems.
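As a concrete illustration of the semi-bandit shortest-path setting studied here, the sketch below keeps one optimistic delay estimate per link and repeatedly routes along the path that looks best under those estimates; this is a simplified index-based loop on a toy graph, not the ESCB or CombEXP algorithms analysed in the thesis, and the link parameters are illustrative.

```python
import math
import random

# Toy routing instance: three candidate source-destination paths over five links.
paths = [("a", "c"), ("a", "d", "e"), ("b", "e")]
# Per-link success probabilities; a link's delay is geometric with mean 1/p (illustrative values).
link_p = {"a": 0.8, "b": 0.5, "c": 0.4, "d": 0.9, "e": 0.7}

def geometric_delay(p):
    d = 1
    while random.random() > p:  # trials until first success, mean 1/p
        d += 1
    return d

counts = {l: 0 for l in link_p}
sums = {l: 0.0 for l in link_p}

for path in paths:  # initialisation: traverse each candidate path once
    for link in path:
        counts[link] += 1
        sums[link] += geometric_delay(link_p[link])

def optimistic_delay(link, t):
    # Optimistic (lower-confidence) estimate of the link's mean delay, for minimization.
    mean = sums[link] / counts[link]
    return max(1.0, mean - math.sqrt(2 * math.log(t) / counts[link]))

for t in range(1, 5001):
    # Route along the path that looks best under the optimistic per-link estimates.
    path = min(paths, key=lambda p: sum(optimistic_delay(l, t) for l in p))
    for link in path:  # semi-bandit feedback: every traversed link's delay is observed
        counts[link] += 1
        sums[link] += geometric_delay(link_p[link])

print("empirically best path:",
      min(paths, key=lambda p: sum(sums[l] / counts[l] for l in p)))
```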