1. The Essential Dynamics Algorithm: Essential Results
Martin, Martin C. (01 May 2003)
This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades a small amount of accuracy for speed. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously performs policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle-riding domain are presented. The algorithm learns near-optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
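The abstract does not spell out the transform, but a common construction in this spirit is certainty equivalence: replace the stochastic transition with its expectation, so a single rollout of the deterministic model evaluates a policy. The sketch below assumes that reading, a linear mean model, and user-supplied `policy` and `reward` callbacks; it is an illustration, not the thesis's algorithm.

```python
import numpy as np

def fit_mean_model(transitions):
    """Least-squares model of the expected next state, E[s' | s, a] ~ W @ [s; a; 1].
    A linear stand-in for whatever model class the thesis estimates online."""
    X = np.array([np.concatenate([s, a, [1.0]]) for s, a, _ in transitions])
    Y = np.array([s_next for _, _, s_next in transitions])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W.T

def rollout_value(W, policy, theta, s0, reward, horizon=50, gamma=0.99):
    """Value of a policy under the deterministic (mean) dynamics: a single
    rollout suffices, since there is no stochasticity left to average over."""
    s, total, discount = np.asarray(s0, float), 0.0, 1.0
    for _ in range(horizon):
        a = policy(theta, s)
        total += discount * reward(s, a)
        s = W @ np.concatenate([s, a, [1.0]])
        discount *= gamma
    return total

def policy_search(W, policy, dim, s0, reward, iters=200, step=0.1, seed=0):
    """Simple hill climbing on policy parameters against the deterministic model."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)
    best = rollout_value(W, policy, theta, s0, reward)
    for _ in range(iters):
        cand = theta + step * rng.normal(size=dim)
        val = rollout_value(W, policy, cand, s0, reward)
        if val > best:
            theta, best = cand, val
    return theta
```

Because the transformed MDP is deterministic, `rollout_value` needs no averaging over sampled noise, which is where the speed-for-accuracy trade comes from.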
2. Reinforcement Learning by Policy Search
Peshkin, Leonid (14 February 2003)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy---a mapping of observations into actions---based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
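As a concrete (hypothetical) instance of the gradient-ascent learners described above, here is a minimal REINFORCE-style update for a stochastic finite-state controller, the kind of memory-equipped architecture the dissertation studies. The `env_reset`/`env_step` callbacks and all hyperparameters are placeholders, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class FSC:
    """Stochastic finite-state controller: an internal memory node n stands in
    for unobservable history; parameters are logits for P(a | n, o) and P(n' | n, o)."""
    def __init__(self, n_mem, n_obs, n_act):
        self.A = np.zeros((n_mem, n_obs, n_act))  # action logits
        self.M = np.zeros((n_mem, n_obs, n_mem))  # memory-transition logits

    def step(self, n, o):
        pa = softmax(self.A[n, o]); a = rng.choice(len(pa), p=pa)
        pm = softmax(self.M[n, o]); n2 = rng.choice(len(pm), p=pm)
        ga = -pa; ga[a] += 1.0      # grad of log P(a | n, o) w.r.t. logits
        gm = -pm; gm[n2] += 1.0     # grad of log P(n' | n, o) w.r.t. logits
        return a, n2, (n, o, ga, gm)

def reinforce_episode(fsc, env_reset, env_step, T=30, lr=0.1):
    """One episode of REINFORCE: ascend the gradient of expected return.
    env_reset() -> o and env_step(a) -> (o, r, done) are placeholder callbacks."""
    o, n, traj, R = env_reset(), 0, [], 0.0
    for _ in range(T):
        a, n2, grads = fsc.step(n, o)
        o, r, done = env_step(a)
        traj.append(grads); R += r; n = n2
        if done:
            break
    for nn, oo, ga, gm in traj:     # REINFORCE: (grad log-prob) * episode return
        fsc.A[nn, oo] += lr * R * ga
        fsc.M[nn, oo] += lr * R * gm
    return R
```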
3. Variable Risk Policy Search for Dynamic Robot Control
Kuindersma, Scott Robert (01 September 2012)
A central goal of the robotics community is to develop general optimization algorithms for producing high-performance dynamic behaviors in robot systems. This goal is challenging because many robot control tasks are characterized by significant stochasticity, high-dimensionality, expensive evaluations, and unknown or unreliable system models. Despite these challenges, a range of algorithms exist for performing efficient optimization of parameterized control policies with respect to average cost criteria. However, other statistics of the cost may also be important. In particular, for many stochastic control problems, it can be advantageous to select policies based not only on their average cost, but also their variance (or risk).
In this thesis, I present new efficient global and local risk-sensitive stochastic optimization algorithms suitable for performing policy search in a wide variety of problems of interest to robotics researchers. These algorithms exploit new techniques in nonparametric heteroscedastic regression to directly model the policy-dependent distribution of cost. For local search, learned cost models can be used as critics for performing risk-sensitive gradient descent. Alternatively, decision-theoretic criteria can be applied to globally select policies, either to balance exploration and exploitation in a principled way or to perform greedy minimization with respect to various risk-sensitive criteria. This separation of learning and policy selection permits variable risk control: risk sensitivity can be adjusted flexibly, and appropriate policies can be selected at runtime without requiring additional policy executions.
To evaluate these algorithms and highlight the importance of risk in dynamic control tasks, I describe several experiments with the UMass uBot-5 that include learning dynamic arm motions to stabilize after large impacts, lifting heavy objects while balancing, and developing safe fall bracing behaviors. The results of these experiments suggest that the ability to select policies based on risk-sensitive criteria can lead to greater flexibility in dynamic behavior generation.
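A minimal sketch of the separation described above, under assumed details: one Gaussian process models mean cost and a second models input-dependent noise (a crude stand-in for the thesis's nonparametric heteroscedastic regression), and policies are selected by a mean-plus-risk criterion whose risk weight `kappa` can be changed at runtime without new policy executions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def fit_cost_model(thetas, costs):
    """thetas: (N, d) policy parameters; costs: (N,) observed episode costs.
    One GP models the mean cost; a second models log squared residuals,
    giving an input-dependent (heteroscedastic) noise estimate."""
    mean_gp = GaussianProcessRegressor(normalize_y=True).fit(thetas, costs)
    resid2 = (costs - mean_gp.predict(thetas)) ** 2
    var_gp = GaussianProcessRegressor(normalize_y=True).fit(thetas, np.log(resid2 + 1e-8))
    return mean_gp, var_gp

def select_policy(mean_gp, var_gp, candidates, kappa):
    """Variable-risk selection: minimize mu + kappa * sigma over candidate
    parameters. kappa is set at runtime; no new policy executions are needed."""
    mu = mean_gp.predict(candidates)
    sigma = np.sqrt(np.exp(var_gp.predict(candidates)))
    return candidates[np.argmin(mu + kappa * sigma)]
```

A large `kappa` prefers policies with reliably moderate cost over policies that are better on average but occasionally catastrophic, which is the sense of "risk" used in the experiments.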
4. On Developmental Variation in Hierarchical Symbiotic Policy Search
Kelly, Stephen (16 August 2012)
A hierarchical symbiotic framework for policy search with genetic programming (GP) is evaluated in two control-style temporal sequence learning domains. The symbiotic formulation assumes each policy takes the form of a cooperative team of multiple symbiont programs. An initial cycle of evolution establishes a diverse range of host behaviours with limited capability. The second cycle uses these initial policies as meta actions for reuse by symbiont programs. The relationship between development and ecology is explored by explicitly altering the interaction between learning agent and environment at fixed points throughout evolution. In both task domains, this developmental diversity significantly improves performance. Specifically, ecologies designed to promote good specialists in the first developmental phase and good generalists in the second result in much stronger organisms in terms of generalization ability and efficiency. Conversely, when there is no diversity in the interaction between task environment and policy learner, the resulting hierarchy is not as robust or general.

The relative contribution of each cycle of evolution to the resulting hierarchical policies is measured from the perspective of multi-level selection. These multi-level policies are shown to be significantly better than the sum of their contributing meta actions.
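For readers unfamiliar with symbiotic policy search, the following sketch illustrates the bid-based team structure common in this line of GP research (an assumption on my part; the thesis's exact representation may differ). Random linear bidding functions stand in for evolved programs.

```python
import numpy as np

rng = np.random.default_rng(0)

class Symbiont:
    """Pairs a bidding program (here just a random linear function of the
    state, standing in for an evolved GP program) with an action. In the
    second cycle of evolution the action is itself a first-cycle policy."""
    def __init__(self, n_inputs, action):
        self.w = rng.normal(size=n_inputs)
        self.action = action

    def bid(self, state):
        return float(self.w @ state)

class Host:
    """A host policy is a cooperative team of symbionts; for each state,
    the highest bidder wins the right to act."""
    def __init__(self, symbionts):
        self.symbionts = symbionts

    def act(self, state):
        winner = max(self.symbionts, key=lambda sym: sym.bid(state))
        a = winner.action
        return a(state) if callable(a) else a  # meta action vs. atomic action

# First cycle: hosts over atomic actions {0, 1}; second cycle: a host whose
# symbionts deploy first-cycle policies as meta actions.
cycle1 = [Host([Symbiont(4, a) for a in (0, 1)]) for _ in range(5)]
cycle2 = Host([Symbiont(4, h.act) for h in cycle1])
print(cycle2.act(rng.normal(size=4)))  # delegates to the winning first-cycle policy
```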
5. Policy Hyperparameter Exploration for Behavioral Learning of Smartphone Robots
Wang, Jiexin (23 March 2017)
Kyoto University, Doctor of Informatics (doctoral dissertation no. 20519). Graduate School of Informatics, Department of Systems Science. Examination committee: Prof. Shin Ishii (chair), Prof. Toshiharu Sugie, Prof. Toshiyuki Ohtsuka, and Kenji Doya.
6. Utilizing Trajectory Optimization in the Training of Neural Network Controllers
Kimball, Nicholas (01 September 2019)
Applying reinforcement learning to control systems enables the use of machine learning to develop elegant and efficient control laws. Coupled with the representational power of neural networks, reinforcement learning algorithms can learn complex policies that would be difficult to emulate using traditional control system design approaches. In this thesis, three model-free reinforcement learning algorithms (Monte Carlo Control, REINFORCE with baseline, and Guided Policy Search) are compared in simulated, continuous action-space environments. The results show that the Guided Policy Search algorithm learns a desired control policy much faster than the other algorithms: up to three times faster on the inverted pendulum system, and nearly fifteen times faster on the cartpole system.
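Of the three algorithms, REINFORCE with baseline is the easiest to state compactly. A standard form of its gradient estimator (the textbook version, not necessarily the exact variant used in the thesis) is:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      \bigl(G_t - b(s_t)\bigr)
    \right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k .
```

Subtracting the baseline $b(s_t)$ leaves the estimator unbiased, since $\mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)] = 0$, while often reducing its variance substantially.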
7. Micro-Data Reinforcement Learning for Adaptive Robots
Chatzilygeroudis, Konstantinos (14 December 2018)
Robots have to face the real world, in which trying something might take seconds, hours, or even days. Unfortunately, current state-of-the-art reinforcement learning algorithms (e.g., deep reinforcement learning) require long interaction times to find effective policies. In this thesis, we explored approaches that tackle the challenge of learning by trial and error in a few minutes on physical robots. We call this challenge "micro-data reinforcement learning". In our first contribution, we introduced a novel learning algorithm called "Reset-free Trial-and-Error" that allows complex robots to quickly recover from unknown circumstances (e.g., damage or different terrain) while completing their tasks and taking the environment into account; in particular, a damaged physical hexapod robot recovered most of its locomotion abilities in an environment with obstacles, and without any human intervention.

In our second contribution, we introduced a novel model-based reinforcement learning algorithm, called Black-DROPS, that: (1) does not impose any constraint on the reward function or the policy (they are treated as black boxes), (2) is as data-efficient as the state-of-the-art algorithm for data-efficient RL in robotics, and (3) is as fast as (or faster than) analytical approaches when several cores are available. We additionally proposed Multi-DEX, a model-based policy search approach that takes inspiration from novelty-based ideas and effectively solves several sparse-reward scenarios. In our third contribution, we introduced a new model-learning procedure in Black-DROPS (called GP-MI) that leverages parameterized black-box priors to scale up to high-dimensional systems; for instance, it found high-performing walking policies for a physically damaged hexapod robot (48D state and 18D action space) in less than 1 minute of interaction time. Finally, in the last part of the thesis, we explored a few ideas on how to incorporate safety constraints, improve robustness, and leverage multiple priors in Bayesian optimization in order to tackle the micro-data reinforcement learning challenge. Throughout this thesis, our goal was to design algorithms that work on physical robots, not only in simulation. Consequently, all the proposed approaches have been evaluated on at least one physical robot. Overall, this thesis aims at providing methods and algorithms that allow physical robots to be more autonomous and able to learn in a handful of trials.
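A rough sketch of the Black-DROPS recipe as described above, with assumed details: Gaussian-process models of the dynamics, Monte Carlo rollouts that treat policy and reward as black boxes, and a crude (1+1)-style random search standing in for the more capable black-box optimizer (a CMA-ES variant) that Black-DROPS actually employs.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def learn_model(S, A, S_next):
    """One GP per state dimension, mapping (s, a) to the change in that dimension."""
    X = np.hstack([S, A])
    return [GaussianProcessRegressor(normalize_y=True).fit(X, S_next[:, i] - S[:, i])
            for i in range(S.shape[1])]

def model_return(gps, policy, theta, s0, reward, horizon=30):
    """A single Monte Carlo rollout on the learned model; the policy and the
    reward function are opaque callables, as in Black-DROPS."""
    s, total = np.asarray(s0, float), 0.0
    for _ in range(horizon):
        a = policy(theta, s)
        x = np.concatenate([s, a])[None, :]
        preds = [gp.predict(x, return_std=True) for gp in gps]
        mu = np.array([m[0] for m, _ in preds])
        sd = np.array([sig[0] for _, sig in preds])
        s = s + mu + rng.normal(scale=sd)   # propagate model uncertainty as noise
        total += reward(s, a)
    return total

def policy_search(gps, policy, dim, s0, reward, iters=200, step=0.3):
    """Crude (1+1)-style random search over policy parameters, standing in
    for the CMA-ES variant suited to this noisy objective."""
    best, best_val = rng.normal(size=dim), -np.inf
    for _ in range(iters):
        cand = best + step * rng.normal(size=dim)
        val = model_return(gps, policy, cand, s0, reward)
        if val > best_val:
            best, best_val = cand, val
    return best
```

Since all interaction cost is paid only when collecting (S, A, S_next), the expensive inner optimization runs entirely against the learned model, which is what makes the approach micro-data.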
8. Automatic State Construction using Decision Trees for Reinforcement Learning Agents
Au, Manix (January 2005)
Reinforcement Learning (RL) is a learning framework in which an agent learns a policy from continual interaction with the environment. A policy is a mapping from states to actions. The agent receives rewards as feedback on the actions performed. The objective of RL is to design autonomous agents that search for the policy maximizing the expected cumulative reward. When the environment is partially observable, the agent cannot determine the states with certainty; such states are called hidden in the literature. An agent that relies exclusively on the current observations will not always find the optimal policy. For example, a mobile robot moving down a corridor of identical doors needs to remember how many doors it has passed in order to reach a specific one. To overcome the problem of partial observability, an agent uses both current and past (memory) observations to construct an internal state representation, which is treated as an abstraction of the environment. This research focuses on how features of past events can be extracted at variable granularity during internal state construction. The project introduces a new method that applies information theory and decision-tree techniques to derive a tree structure representing both the state and the policy. The relevance of a candidate feature is assessed by ranking its Information Gain Ratio with respect to the expected cumulative reward. Experiments carried out on three different RL tasks have shown that our variant of the U-Tree (McCallum, 1995) produces a more robust state representation and faster learning. This better performance can be explained by the fact that the Information Gain Ratio exhibits lower variance in return prediction than the Kolmogorov-Smirnov statistical test used in the original U-Tree algorithm.
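The Information Gain Ratio ranking itself is standard (Quinlan's C4.5 criterion); a minimal implementation is below. How the cumulative returns are discretized into the `labels` argument is left open here, as the abstract does not specify it.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete label sample."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    """Quinlan's gain ratio: information gain of splitting `labels` on the
    discrete `feature`, normalized by the split information so that
    many-valued features are not unfairly favoured."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    split_info = -np.sum(weights * np.log2(weights))
    return (entropy(labels) - cond) / split_info if split_info > 0 else 0.0

# e.g. rank candidate history features by gain_ratio(feature_values, return_bins)
```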