1 |
Sélection Séquentielle en Environnement Aléatoire Appliquée à l'Apprentissage Supervisé / Sequential Selection in a Random Environment Applied to Supervised Learning
Caelen, Olivier. 25 September 2009
This thesis addresses decision problems in which choices must be made sequentially in a random environment. At each step of such a problem, one alternative must be selected from a set of alternatives. Each alternative has its own mean reward, and when one is selected it yields a random reward. The selection can pursue two kinds of objectives.
In the first case, the trials aim to maximize the sum of the collected rewards. A balance must then be struck between exploitation and exploration. This problem is commonly called the "multi-armed bandit problem" in the scientific literature.
In the second case, a maximum number of selections is imposed, and the objective is to allocate these selections so as to increase the chances of finding the alternative with the highest mean reward. This second problem is commonly referred to in the scientific literature as "selecting the best".
Greedy selection plays an important role in solving these decision problems; it chooses the alternative that has so far appeared optimal. However, the generally random nature of the environment makes the outcome of such a selection uncertain.
In this thesis, we introduce a new quantity, called the "expected gain of a greedy action". Based on some properties of this quantity, new algorithms for solving the two decision problems above are proposed.
Particular attention is paid to applying the presented techniques to model selection in supervised machine learning.
Collaboration with the anesthesia department of Hôpital Erasme allowed us to apply the proposed algorithms to real data from the medical field. We also developed a decision support system, a prototype of which has already been tested under real conditions on a small sample of patients.
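To make the notion of greedy selection in this abstract concrete, here is a minimal Python sketch. It is not the thesis's algorithm: the Bernoulli rewards, the function names `greedy_choice` and `run_greedy`, and the rule of trying each alternative once before comparing means are all assumptions made for illustration.

```python
import random


def greedy_choice(sums, counts):
    """Return the index of the alternative with the highest empirical mean.

    Alternatives that were never tried are selected first, so every
    empirical mean is defined before it is compared (an assumption of
    this sketch).
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    means = [s / n for s, n in zip(sums, counts)]
    return max(range(len(means)), key=means.__getitem__)


def run_greedy(true_means, steps, seed=0):
    """Simulate purely greedy selection on hypothetical Bernoulli alternatives.

    Returns how many times each alternative was selected.
    """
    rng = random.Random(seed)
    sums = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        i = greedy_choice(sums, counts)
        reward = 1.0 if rng.random() < true_means[i] else 0.0
        sums[i] += reward
        counts[i] += 1
    return counts
```

Because the rewards are random, a purely greedy rule of this kind can lock onto an alternative that merely looked good early on, which is the uncertainty the abstract points to.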
|
2 |
Multi-Armed Bandit Problems under Delayed Feedback
Joulani, Pooria. Unknown Date
No description available.
|
3 |
Minimizing age of information for semi-periodic arrivals of multiple packets
Chen, Mianlong. 04 December 2019
Age of information (AoI) captures the freshness of information and has been used broadly for scheduling data transmission in the Internet of Things (IoT). We consider a general scenario where a meaningful piece of information consists of multiple packets and the information is not considered complete until all related packets have been correctly received. This general scenario, seemingly a trivial extension of existing work in which an information update consists of a single packet, is actually challenging in both scheduling algorithm design and theoretical analysis, because we need to track the history of received packets before a complete piece of information can be updated. We first analyse the necessary condition for optimal scheduling, based on which we present an optimal scheduling method. The optimal solution, however, has high time complexity. To address this, we investigate the problem in the framework of the restless multi-armed bandit (RMAB) and propose an index-based scheduling policy by applying the Whittle index. We also propose a new transmission strategy based on erasure codes to improve the performance of scheduling policies in lossy networks. Performance evaluation results demonstrate that our solution outperforms baseline policies such as the greedy policy and the naive Whittle index policy in both lossless and lossy networks. / Graduate
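The point that an update only counts once all of its packets have arrived can be illustrated with a small age trace. This is a simplified sketch, not the thesis's model: it assumes a single source, one transmission attempt per slot, a new update generated as soon as the previous one completes, and an initial age of 1. The name `simulate_age` and its parameters are hypothetical.

```python
def simulate_age(deliveries, packets_per_update):
    """Trace the age of information, slot by slot, for multi-packet updates.

    deliveries: list of bool, True if a packet was delivered in that slot.
    packets_per_update: packets needed before an update counts as complete.
    """
    age = 1          # assumed initial age
    received = 0     # packets of the current update received so far
    start = 0        # slot in which the current update was generated
    trace = []
    for t, delivered in enumerate(deliveries):
        if delivered:
            received += 1
        if received == packets_per_update:
            # Update complete: age becomes the number of slots elapsed
            # since the update was generated, counting the delivery slot.
            age = t - start + 1
            received = 0
            start = t + 1
        else:
            # No complete update yet: the information keeps ageing.
            age += 1
        trace.append(age)
    return trace
```

With the same delivery pattern, a two-packet update yields a higher age trace than a single-packet update, since partial deliveries do not refresh the information.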
|
4 |
A Study of Thompson Sampling Approach for the Sleeping Multi-Armed Bandit Problem
Chatterjee, Aritra. January 2017
The multi-armed bandit (MAB) problem provides a convenient abstraction for many online decision problems arising in modern applications including Internet display advertising, crowdsourcing, online procurement, smart grids, etc. Several variants of the MAB problem have been proposed to extend the basic model to a variety of practical and general settings. The sleeping multi-armed bandit (SMAB) problem is one such variant where the set of available arms varies with time. This study is focused on analyzing the efficacy of the Thompson Sampling algorithm for solving the SMAB problem.
Any algorithm for the classical MAB problem is expected to choose one of K available arms (actions) in each of T consecutive rounds. Each choice of an arm generates a stochastic reward from an unknown but fixed distribution. The goal of the algorithm is to maximize the expected sum of rewards over the T rounds (or equivalently minimize the expected total regret), relative to the best fixed action in hindsight. In many real-world settings, however, not all arms may be available in any given round. For example, in Internet display advertising, some advertisers might choose to stay away from the auction due to budget constraints; in crowdsourcing, some workers may not be available at a given time due to timezone difference, etc. Such situations give rise to the sleeping MAB abstraction.
In the literature, several upper confidence bound (UCB)-based approaches have been proposed and investigated for the SMAB problem. Our contribution is to investigate the efficacy of a Thompson Sampling-based approach. Our key result is a logarithmic regret bound, which non-trivially generalizes a similar bound known for this approach in the classical MAB setting. Our bound also matches (up to constants) the best-known lower bound for the SMAB problem. Furthermore, we show via detailed simulations that the Thompson Sampling approach in fact outperforms the known algorithms for the SMAB problem.
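As an illustration of how Thompson Sampling can be restricted to the arms that are awake in a given round, here is a minimal Python sketch. It assumes Bernoulli rewards with uniform Beta(1, 1) priors, which the abstract does not specify; the names `thompson_sleeping_step` and `run_sleeping_thompson` are hypothetical, and this is not claimed to be the exact algorithm analysed in the thesis.

```python
import random


def thompson_sleeping_step(successes, failures, available, rng):
    """Sample each available arm's Beta posterior and pick the largest sample."""
    samples = {
        arm: rng.betavariate(successes[arm] + 1, failures[arm] + 1)
        for arm in available
    }
    return max(samples, key=samples.get)


def run_sleeping_thompson(true_means, availability, seed=0):
    """Play one round per entry of `availability` (a list of available-arm lists).

    Returns the sequence of arms pulled.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = []
    for available in availability:
        arm = thompson_sleeping_step(successes, failures, available, rng)
        # Hypothetical Bernoulli reward drawn from the arm's true mean.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls.append(arm)
    return pulls
```

The only change from the classical setting in this sketch is that posterior samples are drawn for the available arms only, so a sleeping arm can never be selected.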
|
5 |
Multi-channel opportunistic access : a restless multi-armed bandit perspective / Accès opportuniste dans les systèmes de communication multi-canaux : une perspective du problème de bandit-manchot
Wang, Kehao. 22 June 2012
In this thesis, we address the fundamental problem of opportunistic spectrum access in a multi-channel communication system. Specifically, we consider a communication system in which a user has access to multiple channels but is limited to sensing and transmitting on only one at a given time. We explore how a smart user should exploit past observations and knowledge of the stochastic properties of these channels to maximize its transmission rate by switching channels opportunistically. Formally, we provide a generic analysis of the opportunistic spectrum access problem by casting it as a restless multi-armed bandit (RMAB) problem, one of the best-known generalizations of the classic multi-armed bandit (MAB) problem, which is of fundamental importance in stochastic decision theory. Despite significant research efforts in the field, the RMAB problem in its generic form remains open. To date, very few results have been reported on the structure of the optimal policy. Obtaining the optimal policy for a general RMAB problem is often intractable because of its exponential computational complexity. Hence, a natural alternative is to seek a simple myopic policy that maximizes the short-term reward. We therefore develop three axioms characterizing a family of functions, which we refer to as regular functions, that are generic and practically important.
We then establish the optimality of the myopic policy when the reward function can be expressed as a regular function and the discount factor is bounded by a closed-form threshold determined by the reward function. We also illustrate how the derived results, generic in nature, apply to the analysis of a class of RMAB problems arising from multi-channel opportunistic access. Next, we investigate the more challenging problem in which the user has to decide the number of channels to sense in each slot in order to maximize its utility (e.g., throughput). After showing the exponential complexity of the problem, we develop a heuristic v-step look-ahead strategy. In this strategy, the parameter v makes it possible to achieve a desired tradeoff between social efficiency and computational complexity. We demonstrate the benefits of the proposed strategy via numerical experiments on several typical settings.
|
6 |
Decision making using Thompson Sampling
Mellor, Joseph Charles. January 2014
The ability to make decisions is crucial for many autonomous systems. In many scenarios the consequence of a decision is unknown and often stochastic: the same decision may lead to a different outcome every time it is taken. An agent that can learn to make decisions based purely on its past experience needs less tuning and is likely to be more robust. An agent must often balance learning the payoff of actions by exploring against exploiting the knowledge it currently has. The multi-armed bandit problem exhibits such an exploration-exploitation dilemma. Thompson Sampling is a strategy for the problem, first proposed in 1933. In the last several years there has been renewed interest in it, with the emergence of strong empirical and theoretical justification for its use. This thesis seeks to take advantage of the benefits of Thompson Sampling while applying it to other decision-making models, and proposes different algorithms for these scenarios. First, we explore a switching multi-armed bandit problem. In real applications the most appropriate decision often changes over time. We show that an agent that assumes switching is often robust to many types of changing environment. Second, we consider the best arm identification problem. Unlike the multi-armed bandit problem, where an agent wants to increase reward over the entire period of decision making, best arm identification is concerned with increasing the reward gained by a final decision. This thesis argues that both problems can be tackled effectively using Thompson Sampling-based approaches and provides empirical evidence to support this claim.
|
7 |
Intelligent Data Mining Techniques for Automatic Service Management
Wang, Qing. 07 November 2018
Today, as more and more industries enter the era of artificial intelligence, business enterprises constantly explore innovative ways to expand their outreach and meet customers' high requirements, with the aim of gaining a competitive advantage in the marketplace. However, the success of a business relies heavily on its IT services. Value-creating activities cannot be accomplished without solid and continuous delivery of IT services, especially in an increasingly intricate and specialized world. Driven by both the growing complexity of IT environments and rapidly changing business needs, service providers are urgently seeking intelligent data mining and machine learning techniques to build a cognitive "brain" for IT service management, capable of automatically understanding, reasoning about, and learning from operational data collected from human engineers and virtual engineers during IT service maintenance.
The ultimate goal of IT service management optimization is to maximize the automation of routine IT procedures such as problem detection, determination, and resolution. However, fully automating the entire routine without any human intervention remains a challenging task. In real IT systems, both step-wise resolution descriptions and scripted resolutions are often logged with their corresponding incidents, and these logs typically contain abundant, valuable human domain knowledge. Hence, modeling, gathering, and utilizing the domain knowledge from IT system maintenance logs plays a crucial role in IT service management optimization. To optimize IT service management from the perspective of intelligent data mining, three research directions are identified as especially helpful for automatic service management: (1) efficiently extracting and organizing the domain knowledge from IT system maintenance logs; (2) collecting and updating the existing domain knowledge online by interactively recommending possible resolutions; (3) automatically discovering the latent relations among scripted resolutions and intelligently suggesting proper scripted resolutions for IT problems.
My dissertation addresses these challenges by designing and implementing a set of intelligent data-driven solutions, including (1) constructing the domain knowledge base for problem resolution inference; (2) recommending resolutions online in light of the explicit hierarchical resolution categories provided by domain experts; and (3) interactively recommending resolutions using the latent resolution relations learned through a collaborative filtering model.
|
8 |
New Methods for Learning from Heterogeneous and Strategic Agents
Divya, Padmanabhan. January 2017
In this doctoral thesis, we address several representative problems that arise in the context of learning from multiple heterogeneous agents. These problems are relevant to many modern applications such as crowdsourcing and internet advertising. In scenarios such as crowdsourcing, there is a planner who is interested in learning a task and a set of noisy agents provide the training data for this learning task. Any learning algorithm making use of the data provided by these noisy agents must account for their noise levels. The noise levels of the agents are unknown to the planner, leading to a non-trivial difficulty. Further, the agents are heterogeneous as they differ in terms of their noise levels. A key challenge in such settings is to learn the noise levels of the agents while simultaneously learning the underlying model. Another challenge arises when the agents are strategic. For example, when the agents are required to perform a task, they could be strategic on the efforts they put in. As another example, when required to report their costs incurred towards performing the task, the agents could be strategic and may not report the costs truthfully. In general, the performance of the learning algorithms could be severely affected if the information elicited from the agents is incorrect. We address the above challenges that arise in the following representative learning problems.
Multi-label Classification from Heterogeneous Noisy Agents
Multi-label classification is a well-known supervised machine learning problem where each instance is associated with multiple classes. Since several labels can be assigned to a single instance, one of the key challenges in this problem is to learn the correlations between the classes. We first assume labels from a perfect source and propose a novel topic model called Multi-Label Presence-Absence Latent Dirichlet Allocation (ML-PA-LDA). In the current day scenario, a natural source for procuring the training dataset is through mining user-generated content or directly through users in a crowdsourcing platform. In the more practical scenario of crowdsourcing, an additional challenge arises as the labels of the training instances are provided by noisy, heterogeneous crowd-workers with unknown qualities. With this as the motivation, we further adapt our topic model to the scenario where the labels are provided by multiple noisy sources and refer to this model as ML-PA-LDA-MNS (ML-PA-LDA with Multiple Noisy Sources). With experiments on standard datasets, we show that the proposed models achieve superior performance over existing methods.
Active Linear Regression with Heterogeneous, Noisy and Strategic Agents
In this work, we study the problem of training a linear regression model by procuring labels from multiple noisy agents or crowd annotators, under a budget constraint. We propose a Bayesian model for linear regression from multiple noisy sources and use variational inference for parameter estimation. When labels are sought from agents, it is important to minimize the number of labels procured as every call to an agent incurs a cost. Towards this, we adopt an active learning approach. In this specific context, we prove the equivalence of well-studied criteria of active learning such as entropy minimization and expected error reduction. For the purpose of annotator selection in active learning, we observe a useful connection with the multi-armed bandit framework. Due to the nature of the distribution of the rewards on the arms, we resort to the Robust Upper Confidence Bound (UCB) scheme with truncated empirical mean estimator to solve the annotator selection problem. This yields provable guarantees on the regret. We apply our model to the scenario where annotators are strategic and design suitable incentives to induce them to put in their best efforts.
Ranking with Heterogeneous Strategic Agents
We look at the problem where a planner must rank multiple strategic agents, a problem that has many applications including sponsored search auctions (SSA). Stochastic multi-armed bandit (MAB) mechanisms have been used in the literature to solve this problem. Existing stochastic MAB mechanisms with a deterministic payment rule necessarily suffer a regret of $\Omega(T^{2/3})$, where $T$ is the number of time steps. This happens because these mechanisms address the worst-case scenario where the means of the agents' stochastic rewards are separated by a very small amount that depends on $T$. We instead allow the planner to indicate the resolution, $\Delta$, with which the agents must be distinguished. This immediately leads us to introduce the notion of $\Delta$-regret. We propose a dominant strategy incentive compatible (DSIC) and individually rational (IR) deterministic MAB mechanism, based on ideas from the Upper Confidence Bound (UCB) family of MAB algorithms. The proposed mechanism, $\Delta$-UCB, achieves a $\Delta$-regret of $O(\log T)$. We first establish the results for single-slot SSA and then non-trivially extend them to the case of multi-slot SSA.
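For readers unfamiliar with the UCB family mentioned in this abstract, the following Python sketch shows the classical UCB1 selection rule. It is a standard textbook variant shown for orientation only, not the mechanism proposed in the thesis, and it includes no payment rule or incentive properties. The names `ucb1_index` and `ucb1_choice` are hypothetical.

```python
import math


def ucb1_index(mean, pulls, t):
    """Empirical mean plus the UCB1 exploration bonus sqrt(2 ln t / pulls)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def ucb1_choice(sums, counts, t):
    """Pick the arm with the highest UCB1 index at round t (t >= 1).

    Arms that were never pulled are tried first so that every index
    is well defined.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(counts)),
        key=lambda i: ucb1_index(sums[i] / counts[i], counts[i], t),
    )
```

The bonus term shrinks as an arm is pulled more often, so rarely tried arms keep a high index until enough evidence has been gathered about them.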
|
9 |
Méthodes optimistes d’apprentissage actif pour la classification / Optimistic Methods in Active Learning for Classification
Collet, Timothé. 11 July 2016
A classification problem makes use of a training set consisting of data labeled by an oracle. The larger the training set, the better the performance. However, querying the oracle may be costly. The goal of active learning is thus to minimize the number of requests to the oracle while achieving the best performance. To do so, the data presented to the oracle must be carefully selected from a large number of unlabeled instances acquired at no cost. However, the true profitability of labeling a particular instance may not be known perfectly; it can only be estimated, along with a measure of uncertainty. To increase the precision of the estimate, we need to label more data. Thus, there is a dilemma between labeling data in order to increase the performance of the classifier and labeling data in order to learn how to select data better. This dilemma is well studied in the context of finite-budget optimization under the name of the exploration versus exploitation dilemma. The best-known solutions make use of the principle of Optimism in the Face of Uncertainty. In this thesis, we show that it is possible to adapt this principle to the active learning problem for classification. Several algorithms have been developed for classifiers of increasing complexity, each using the principle of Optimism in the Face of Uncertainty, and their performance has been evaluated empirically.
|
10 |
Multi-channel opportunistic access : a restless multi-armed bandit perspective
Wang, Kehao. 22 June 2012
In this thesis, we address the fundamental problem of opportunistic spectrum access in a multi-channel communication system. Specifically, we consider a communication system in which a user has access to multiple channels but is limited to sensing and transmitting on only one at a given time. We explore how a smart user should exploit past observations and knowledge of the stochastic properties of these channels to maximize its transmission rate by switching channels opportunistically. Formally, we provide a generic analysis of the opportunistic spectrum access problem by casting it as a restless multi-armed bandit (RMAB) problem, one of the best-known generalizations of the classic multi-armed bandit (MAB) problem, which is of fundamental importance in stochastic decision theory. Despite significant research efforts in the field, the RMAB problem in its generic form remains open. To date, very few results have been reported on the structure of the optimal policy. Obtaining the optimal policy for a general RMAB problem is often intractable because of its exponential computational complexity. Hence, a natural alternative is to seek a simple myopic policy that maximizes the short-term reward. We therefore develop three axioms characterizing a family of functions, which we refer to as regular functions, that are generic and practically important. We then establish the optimality of the myopic policy when the reward function can be expressed as a regular function and the discount factor is bounded by a closed-form threshold determined by the reward function. We also illustrate how the derived results, generic in nature, apply to the analysis of a class of RMAB problems arising from multi-channel opportunistic access. Next, we investigate the more challenging problem in which the user has to decide the number of channels to sense in each slot in order to maximize its utility (e.g., throughput).
After showing the exponential complexity of the problem, we develop a heuristic v-step look-ahead strategy. In this strategy, the parameter v makes it possible to achieve a desired tradeoff between social efficiency and computational complexity. We demonstrate the benefits of the proposed strategy via numerical experiments on several typical settings.
|