A thesis submitted to the Faculty of Science, School of Computational and Applied Mathematics University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Doctor of Philosophy. Johannesburg, South Africa, July 2015. / Reinforcement learning agents solve tasks by finding policies that maximise their reward over time. The policy can be found from the value function, which represents the value of each state-action pair. In continuous state spaces, the value function must be approximated. Often, this is done using a fixed linear combination of functions across all dimensions. We introduce and demonstrate the wavelet basis for reinforcement learning, a basis function scheme competitive against state of the art fixed bases. We extend two online adaptive tiling schemes to wavelet functions and show their performance improvement across standard domains. Finally we introduce the Multiscale Adaptive Wavelet Basis (MAWB), a wavelet-based adaptive basis scheme which is dimensionally scalable and insensitive to the initial level of detail. This scheme adaptively grows the basis function set by combining across dimensions, or splitting within a dimension those candidate functions which have a high estimated projection onto the Bellman error. A number of novel measures are used to find this estimate. i
Hengst, Bernhard, Computer Science & Engineering, Faculty of Engineering, UNSW
This thesis addresses the open problem of automatically discovering hierarchical structure in reinforcement learning. Current algorithms for reinforcement learning fail to scale as problems become more complex. Many complex environments empirically exhibit hierarchy and can be modeled as interrelated subsystems, each in turn with hierarchic structure. Subsystems are often repetitive in time and space, meaning that they reoccur as components of different tasks or occur multiple times in different circumstances in the environment. A learning agent may sometimes scale to larger problems if it successfully exploits this repetition. Evidence suggests that a bottom up approach that repetitively finds building-blocks at one level of abstraction and uses them as background knowledge at the next level of abstraction, makes learning in many complex environments tractable. An algorithm, called HEXQ, is described that automatically decomposes and solves a multi-dimensional Markov decision problem (MDP) by constructing a multi-level hierarchy of interlinked subtasks without being given the model beforehand. The effectiveness and efficiency of the HEXQ decomposition depends largely on the choice of representation in terms of the variables, their temporal relationship and whether the problem exhibits a type of constrained stochasticity. The algorithm is first developed for stochastic shortest path problems and then extended to infinite horizon problems. The operation of the algorithm is demonstrated using a number of examples including a taxi domain, various navigation tasks, the Towers of Hanoi and a larger sporting problem. The main contributions of the thesis are the automation of (1)decomposition, (2) sub-goal identification, and (3) discovery of hierarchical structure for MDPs with states described by a number of variables or features. It points the way to further scaling opportunities that encompass approximations, partial observability, selective perception, relational representations and planning. The longer term research aim is to train rather than program intelligent agents
Gaskett, Chris, email@example.com
Q-Learning is a method for solving reinforcement learning problems. Reinforcement learning problems require improvement of behaviour based on received rewards. Q-Learning has the potential to reduce robot programming effort and increase the range of robot abilities. However, most currentQ-learning systems are not suitable for robotics problems: they treat continuous variables, for example speeds or positions, as discretised values. Discretisation does not allow smooth control and does not fully exploit sensed information. A practical algorithm must also cope with real-time constraints, sensing and actuation delays, and incorrect sensor data. This research describes an algorithm that deals with continuous state and action variables without discretising. The algorithm is evaluated with vision-based mobile robot and active head gaze control tasks. As well as learning the basic control tasks, the algorithm learns to compensate for delays in sensing and actuation by predicting the behaviour of its environment. Although the learned dynamic model is implicit in the controller, it is possible to extract some aspects of the model. The extracted models are compared to theoretically derived models of environment behaviour. The difficulty of working with robots motivates development of methods that reduce experimentation time. This research exploits Q-learnings ability to learn by passively observing the robots actionsrather than necessarily controlling the robot. This is a valuable tool for shortening the duration of learning experiments.
09 May 1996
Reinforcement Learning (RL) is the study of learning agents that improve their performance from rewards and punishments. Most reinforcement learning methods optimize the discounted total reward received by an agent, while, in many domains, the natural criterion is to optimize the average reward per time step. In this thesis, we introduce a model-based average reward reinforcement learning method called "H-learning" and show that it performs better than other average reward and discounted RL methods in the domain of scheduling a simulated Automatic Guided Vehicle (AGV). We also introduce a version of H-learning which automatically explores the unexplored parts of the state space, while always choosing an apparently best action with respect to the current value function. We show that this "Auto-exploratory H-Learning" performs much better than the original H-learning under many previously studied exploration strategies. To scale H-learning to large state spaces, we extend it to learn action models and reward functions in the form of Bayesian networks, and approximate its value function using local linear regression. We show that both of these extensions are very effective in significantly reducing the space requirement of H-learning, and in making it converge much faster in the AGV scheduling task. Further, Auto-exploratory H-learning synergistically combines with Bayesian network model learning and value function approximation by local linear regression, yielding a highly effective average reward RL algorithm. We believe that the algorithms presented here have the potential to scale to large applications in the context of average reward optimization. / Graduation date:1996
Thesis (Ph. D.)--Oregon State University, 2010. / Printout. Includes bibliographical references (leaves 121-123). Also available on the World Wide Web.
Jong, Nicholas K.
18 December 2012
Reinforcement Learning (RL) offers a promising approach towards achieving the dream of autonomous agents that can behave intelligently in the real world. Instead of requiring humans to determine the correct behaviors or sufficient knowledge in advance, RL algorithms allow an agent to acquire the necessary knowledge through direct experience with its environment. Early algorithms guaranteed convergence to optimal behaviors in limited domains, giving hope that simple, universal mechanisms would allow learning agents to succeed at solving a wide variety of complex problems. In practice, the field of RL has struggled to apply these techniques successfully to the full breadth and depth of real-world domains. This thesis extends the reach of RL techniques by demonstrating the synergies among certain key developments in the literature. The first of these developments is model-based exploration, which facilitates theoretical convergence guarantees in finite problems by explicitly reasoning about an agent's certainty in its understanding of its environment. A second branch of research studies function approximation, which generalizes RL to infinite problems by artificially limiting the degrees of freedom in an agent's representation of its environment. The final major advance that this thesis incorporates is hierarchical decomposition, which seeks to improve the efficiency of learning by endowing an agent's knowledge and behavior with the gross structure of its environment. Each of these ideas has intuitive appeal and sustains substantial independent research efforts, but this thesis defines the first RL agent that combines all their benefits in the general case. In showing how to combine these techniques effectively, this thesis investigates the twin issues of generalization and exploration, which lie at the heart of efficient learning. This thesis thus lays the groundwork for the next generation of RL algorithms, which will allow scientific agents to know when it suffices to estimate a plan from current data and when to accept the potential cost of running an experiment to gather new data. / text
The goal of this thesis is to explore the use of reinforcement learning (RL) in commercial computer games. Although RL has been applied with success to many types of board games and non-game simulated environments, there has been little work in applying RL to the most popular genres of games: first-person shooters, role-playing games, and real-time strategies. In this thesis we use a first-person shooter environment to create computer players, or bots, that learn to play the game using reinforcement learning techniques. / We have created three experimental bots: ChaserBot, ItemBot and HybridBot. The two first bots each focus on a different aspect of the first-person shooter genre, and learn using basic RL. ChaserBot learns to chase down and shoot an enemy player. ItemBot, on the other hand, learns how to pick up the items---weapons, ammunition, armor---that are available, scattered on the ground, for the players to improve their arsenal. Both of these bots become reasonably proficient at their assigned task. Our goal for the third bot, HybridBot, was to create a bot that both chases and shoots an enemy player and goes after the items in the environment. Unlike the two previous bots which only have primitive actions available (strafing right or left, moving forward or backward, etc.), HybridBot uses options. At any state, it may choose either the player chasing option or the item gathering option. These options' internal policies are determined by the data learned by ChaserBot and ItemBot. HybridBot uses reinforcement learning to learn which option to pick at a given state. / Each bot learns to perform its given tasks. We compare the three bots' ability to gather items, and ChaserBot's and HybridBot's ability to chase their opponent. HybridBot's results are of particular interest as it outperforms ItemBot at picking up items by a large amount. However, none of our experiments yielded bots that are competitive with human players. We discuss the reasons for this and suggest improvements for future work that could lead to competitive reinforcement learning bots.
Ferns, Norman Francis.
In recent years, various metrics have been developed for measuring the similarity of states in probabilistic transition systems (Desharnais et al., 1999; van Breugel & Worrell, 2001a). In the context of Markov decision processes, we have devised metrics providing a robust quantitative analogue of bisimulation. Most importantly, the metric distances can be used to bound the differences in the optimal value function that is integral to reinforcement learning (Ferns et al. 2004; 2005). More recently, we have discovered an efficient algorithm to calculate distances in the case of finite systems (Ferns et al., 2006). In this thesis, we seek to properly extend state-similarity metrics to Markov decision processes with continuous state spaces both in theory and in practice. In particular, we provide the first distance-estimation scheme for metrics based on bisimulation for continuous probabilistic transition systems. Our work, based on statistical sampling and infinite dimensional linear programming, is a crucial first step in real-world planning; many practical problems are continuous in nature, e.g. robot navigation, and often a parametric model or crude finite approximation does not suffice. State-similarity metrics allow us to reason about the quality of replacing one model with another. In practice, they can be used directly to aggregate states.
The hardware implementation of an artificial neural network using stochastic pulse rate encoding principlesGlover, John Sigsworth January 1995 (has links)
In this thesis the development of a hardware artificial neuron device and artificial neural network using stochastic pulse rate encoding principles is considered. After a review of neural network architectures and algorithmic approaches suitable for hardware implementation, a critical review of hardware techniques which have been considered in analogue and digital systems is presented. New results are presented demonstrating the potential of two learning schemes which adapt by the use of a single reinforcement signal. The techniques for computation using stochastic pulse rate encoding are presented and extended with new novel circuits relevant to the hardware implementation of an artificial neural network. The generation of random numbers is the key to the encoding of data into the stochastic pulse rate domain. The formation of random numbers and multiple random bit sequences from a single PRBS generator have been investigated. Two techniques, Simulated Annealing and Genetic Algorithms, have been applied successfully to the problem of optimising the configuration of a PRBS random number generator for the formation of multiple random bit sequences and hence random numbers. A complete hardware design for an artificial neuron using stochastic pulse rate encoded signals has been described, designed, simulated, fabricated and tested before configuration of the device into a network to perform simple test problems. The implementation has shown that the processing elements of the artificial neuron are small and simple, but that there can be a significant overhead for the encoding of information into the stochastic pulse rate domain. The stochastic artificial neuron has the capability of on-line weight adaption. The implementation of reinforcement schemes using the stochastic neuron as a basic element are discussed.
Successive discrimination and reversal learning as a function of differential sensory reinforcement and discriminative cues in two sensory modalities /Duckmanton, Robert Antony. January 1971 (has links) (PDF)
Thesis (B.A. (Hons.)), Department of Psychology, University of Adelaide, 1971.
Page generated in 0.1372 seconds