1 |
A control theoretic perspective on learning in robotics
O'Flaherty, Rowland Wilde, 27 May 2016
For robotic systems to continue to move towards ubiquity, robots need to be more autonomous. More autonomy dictates that robots need to be able to make better decisions. Control theory and machine learning are fields of robotics that focus on the decision-making process. However, each of these fields implements decision making at different levels of abstraction and at different time scales. Control theory defines low-level decisions at high rates, while machine learning defines high-level decisions at low rates. The objective of this research is to integrate tools from both machine learning and control theory to solve higher-dimensional, complex problems, and to optimize the decision-making process. Throughout this research, multiple algorithms were created that use concepts from both control theory and machine learning, which provide new tools for robots to make better decisions. One algorithm enables a robot to learn how to optimally explore an unknown space, and autonomously decide when to explore for new information or exploit its current information. Another algorithm enables a robot to learn how to locomote with complex dynamics. These algorithms are evaluated both in simulation and on real robots. The results and analysis of these experiments are presented, which demonstrate the utility of the algorithms introduced in this work. Additionally, a new notion of “learnability” is introduced to define and determine when a given dynamical system has the ability to gain knowledge to optimize a given objective function.
|
2 |
Stochastic Optimization in Dynamic Environments: with applications in e-commerce
Bastani, Spencer; Andersson, Olov, January 2007
In this thesis we address the problem of how to construct an optimal algorithm for displaying banners (i.e., advertisements shown on web sites). The optimization is based on the revenue each banner generates, with the aim of selecting those banners which maximize future total revenue. Banner optimality is of major importance in the e-commerce industry, in particular on web sites with heavy traffic. The 'micropayments' from showing banners add up to substantial profits due to the large volumes involved. We provide a broad, up-to-date and primarily theoretical treatment of this global optimization problem. Through a synthesis of mathematical modeling, statistical methodology and computer science we construct a stochastic 'planning algorithm'. The superiority of our algorithm is supported by empirical analysis conducted by us on real internet data at TradeDoubler AB, as well as test results on a selection of stylized data sets. The algorithm is flexible and adapts well to new environments.
|
4 |
Regret Minimization in the Gain Estimation Problem
Tourkaman, Mahan, January 2019
A novel approach to the gain estimation problem, using a multi-armed bandit formulation, is studied. The gain estimation problem is that of estimating the largest L2-gain that a signal of bounded norm experiences when passing through a linear and time-invariant system. Under certain conditions, this new approach is guaranteed to surpass traditional System Identification methods in terms of accuracy. The bandit algorithms Upper Confidence Bound, Thompson Sampling and Weighted Thompson Sampling are implemented with the aim of designing the optimal input for maximizing the gain of an unknown system. The regret performance of each algorithm is studied using simulations on a test system. Upper Confidence Bound performed best with its exploration parameter set to zero, compared with all other tested values of this parameter. Weighted Thompson Sampling performed better than Thompson Sampling.
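To make the role of the exploration parameter concrete, the following is a minimal sketch of a generic Upper Confidence Bound rule on a toy bandit. It is not the thesis's gain-estimation setup: the Gaussian reward model, the arm means, and the names `ucb_select`, `run_ucb`, and `c` are all assumptions chosen for illustration. With `c = 0` the rule reduces to purely greedy selection on the empirical means.

```python
import math
import random


def ucb_select(counts, means, t, c):
    """Return the arm maximizing mean + c * sqrt(2 ln t / n).

    Arms that have never been pulled are tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        m + c * math.sqrt(2.0 * math.log(t) / n)
        for m, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb(arm_means, horizon, c, noise_std=0.1, seed=0):
    """Simulate UCB on hypothetical Gaussian arms; return pull counts and pseudo-regret."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    means = [0.0] * k
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = ucb_select(counts, means, t, c)
        reward = rng.gauss(arm_means[arm], noise_std)
        counts[arm] += 1
        # Incremental update of the empirical mean of the pulled arm.
        means[arm] += (reward - means[arm]) / counts[arm]
        # Pseudo-regret: gap between the best true mean and the pulled arm's true mean.
        regret += best - arm_means[arm]
    return counts, regret


if __name__ == "__main__":
    for c in (0.0, 0.5, 1.0):
        counts, regret = run_ucb([0.2, 0.5, 0.8], horizon=1000, c=c)
        print(f"c={c}: pulls={counts}, pseudo-regret={regret:.2f}")
```

Whether a small or zero exploration bonus helps depends on the noise level and the gaps between arms, so the toy numbers printed here should not be read as reproducing the thesis's findings.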
|
5 |
Reinforcement Learning for Procedural Game Animation: Creating Uncanny Zombie Movements
Tayeh, Adrian; Almquist, Arvid, January 2024
This thesis explores the use of reinforcement learning within the Unity ML Agents framework to simulate zombie-like movements in humanoid ragdolls. The generated locomotion aims to embrace the Uncanny Valley phenomenon, partly through the way the agent walks, but also through limb disablement. Additionally, the paper tests the effectiveness of reinforcement learning as a tool for generating adaptive locomotion. The research implements reward functions and addresses technical challenges, with a focus on adaptability through the limb disablement system. A user study comparing the reinforcement learning agent to Mixamo animations evaluates how effectively zombie-like movements were simulated and whether the Uncanny Valley effect was achieved. Results show that while the reinforcement learning agent may lack believability and uncanniness when compared to the Mixamo animation, it features a level of adaptability that is worth expanding upon. Given the inconclusive results, there is room for further research on the topic to achieve the Uncanny Valley effect and enhance zombie-like locomotion with reinforcement learning.
|
6 |
Contributions to Multi-Armed Bandits: Risk-Awareness and Sub-Sampling for Linear Contextual Bandits / Contributions aux bandits manchots : gestion du risque et sous-échantillonnage pour les bandits contextuels linéaires
Galichet, Nicolas, 28 September 2015
This thesis focuses on sequential decision making in unknown environments, and more particularly on the Multi-Armed Bandit (MAB) setting, defined by Lai and Robbins in the 50s. During the last decade, many theoretical and algorithmic studies have addressed the exploration vs exploitation tradeoff at the core of MABs: exploitation is biased toward the best options visited so far, while exploration is biased toward options rarely visited, to enforce the discovery of the true best choices. MAB applications range from medicine (the elicitation of the best prescriptions) to e-commerce (recommendations, advertisements) and optimal policies (e.g., in the energy domain). The contributions presented in this dissertation tackle the exploration vs exploitation dilemma from two angles.
The first contribution is centered on risk avoidance. Exploration in unknown environments often has adverse effects: for instance, exploratory trajectories of a robot can entail physical damage to the robot or its environment. We thus define the exploration vs exploitation vs safety (EES) tradeoff, and propose three new algorithms addressing the EES dilemma. Firstly, and under strong assumptions, the MIN algorithm provides guarantees of logarithmic regret, matching the state of the art, together with high robustness w.r.t. hyper-parameter setting (as opposed to, e.g., UCB (Auer 2002)). Secondly, the MARAB algorithm aims at optimizing the cumulative 'Conditional Value at Risk' (CVaR) of the rewards, a criterion originating from the economics literature, with excellent empirical performance compared to (Sani et al. 2012), though without any theoretical guarantees. Finally, the MARABOUT algorithm modifies the CVaR estimation and yields both theoretical guarantees and good empirical behavior.
The second contribution concerns the contextual bandit setting, where additional information is provided to support the decision making, such as the user details in the content recommendation domain, or the patient history in the medical domain. The study focuses on how to make a choice between two arms with different numbers of samples. Traditionally, a confidence region is derived for each arm based on the associated samples, and the 'optimism in the face of uncertainty' principle selects the arm with maximal upper confidence bound. An alternative, pioneered by (Baransi et al. 2014) and called BESA, proceeds instead by subsampling without replacement from the larger sample set, so that both arms are compared on the same number of samples. In this framework, we designed a contextual bandit algorithm based on sub-sampling without replacement, relaxing the (unrealistic) assumption that all arm reward distributions rely on the same parameter. The resulting CL-BESA algorithm yields both theoretical guarantees of logarithmic regret and good empirical behavior.
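The sub-sampling idea behind BESA can be illustrated with a minimal two-arm sketch. This is the plain, non-contextual comparison rule as commonly described, not the thesis's CL-BESA algorithm; the function name `besa_choose`, the tie-breaking rule in favor of the less-sampled arm, and the toy reward lists are assumptions made for illustration.

```python
import random


def besa_choose(rewards_a, rewards_b, rng):
    """Choose between two arms by comparing equal-size sub-samples.

    Each history is sub-sampled without replacement down to the size of the
    shorter one, so both empirical means rest on the same number of samples.
    Returns 0 for arm A and 1 for arm B. Assumes both histories are non-empty.
    """
    n = min(len(rewards_a), len(rewards_b))
    sub_a = rng.sample(rewards_a, n)  # without replacement
    sub_b = rng.sample(rewards_b, n)
    mean_a = sum(sub_a) / n
    mean_b = sum(sub_b) / n
    if mean_a > mean_b:
        return 0
    if mean_b > mean_a:
        return 1
    # Tie: prefer the arm that has been pulled less often (assumed rule).
    return 0 if len(rewards_a) <= len(rewards_b) else 1


if __name__ == "__main__":
    rng = random.Random(0)
    history_a = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]  # arm A pulled 6 times
    history_b = [0.0, 1.0]                      # arm B pulled 2 times
    print("chosen arm:", besa_choose(history_a, history_b, rng))
```

Because the comparison uses only as many samples as the less-visited arm has, no explicit confidence bound or exploration parameter is needed in this sketch; the randomness of the sub-sample plays that role.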
|