151

Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

Aberdeen, Douglas Alexander, doug.aberdeen@anu.edu.au January 2003 (has links)
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs).

In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.

Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.

The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author's knowledge, no other policy-gradient algorithms have performed well at such tasks.

The high variance of Monte-Carlo methods requires lengthy simulation and hence a supercomputer to train agents within a reasonable time. The ANU "Bunyip" Linux cluster was built with such tasks in mind and was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell prize for price/performance in 2001.
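For context, the abstract above describes adjusting parameters in the direction that maximises the long-term average reward, using an eligibility trace of the gradient estimator. Below is a minimal sketch of a GPOMDP-style Monte-Carlo estimator of that gradient, the standard estimator in this line of work; the `env` and `policy` interfaces are assumptions for the sketch, not the thesis's code.

```python
import numpy as np

def gpomdp_gradient(env, policy, theta, beta=0.95, T=100_000, rng=None):
    """Minimal sketch of a GPOMDP-style Monte-Carlo gradient estimate of the
    long-term average reward; `env` and `policy` are hypothetical interfaces.
    beta < 1 discounts the eligibility trace, trading bias for variance."""
    rng = rng or np.random.default_rng(0)
    z = np.zeros_like(theta)   # eligibility trace of score functions
    g = np.zeros_like(theta)   # running gradient estimate
    obs = env.reset()
    for t in range(1, T + 1):
        action = policy.sample(theta, obs, rng)              # draw action from pi_theta(.|obs)
        z = beta * z + policy.grad_log(theta, obs, action)   # accumulate score functions
        obs, reward = env.step(action)
        g += (reward * z - g) / t                            # running average of reward-weighted trace
    return g
```

The high variance mentioned in the abstract comes from the product of rewards with this ever-changing trace, which is what the thesis's filtering and value-function hybrids aim to reduce.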
152

Policy Gradient Methods: Variance Reduction and Stochastic Convergence

Greensmith, Evan, evan.greensmith@gmail.com January 2005 (has links)
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies and, using a policy from the class and a trajectory through the environment taken by the agent under this policy, estimate the gradient of the policy's performance with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance.

In Part I of this thesis we study the estimation variance of policy gradient algorithms, in particular, when augmenting the estimate with a baseline, a common method for reducing estimation variance, and when using actor-critic methods. A baseline adjusts the reward signal supplied by the environment and can be used to reduce the variance of a policy gradient estimate without adding any bias. We find the baseline that minimizes the variance. We also consider the class of constant baselines, and find the constant baseline that minimizes the variance. We compare this to the common technique of adjusting the rewards by an estimate of the performance measure. Actor-critic methods usually attempt to learn a value function accurate enough to be used in a gradient estimate without adding much bias. In this thesis we propose that in learning the value function we should also consider the variance. We show how considering the variance of the gradient estimate when learning a value function can be beneficial, and we introduce a new optimization criterion for selecting a value function.

In Part II of this thesis we consider online versions of policy gradient algorithms, where we update our policy for selecting actions at each step in time, and study the convergence of these online algorithms. For such online gradient-based algorithms, convergence results aim to show that the gradient of the performance measure approaches zero. Such a result has been shown for an algorithm based on observing trajectories between visits to a special state of the environment. However, that algorithm is not suitable in a partially observable setting, where we are unable to access the full state of the environment, and its variance depends on the time between visits to the special state, which may be large even when only a few samples are needed to estimate the gradient. To date, convergence results for algorithms that do not rely on a special state are weaker. We show that, for a certain algorithm that does not rely on a special state, the gradient of the performance measure approaches zero. We show that this continues to hold when using certain baseline algorithms suggested by the results of Part I.
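For context on the baseline technique discussed above, the standard identity (not specific to this thesis) shows why subtracting a baseline b from the return leaves the gradient estimate unbiased, so that b can be chosen purely to minimise variance:

```latex
\[
  \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\bigl[\, b \,\nabla_\theta \log \pi_\theta(a \mid s) \bigr]
  = b \sum_{a} \nabla_\theta \pi_\theta(a \mid s)
  = b \,\nabla_\theta \sum_{a} \pi_\theta(a \mid s)
  = b \,\nabla_\theta 1 = 0,
\]
\[
  \widehat{\nabla_\theta \eta} = (R - b)\,\nabla_\theta \log \pi_\theta(a \mid s),
  \qquad
  \mathbb{E}\bigl[\widehat{\nabla_\theta \eta}\bigr]
  = \mathbb{E}\bigl[R\,\nabla_\theta \log \pi_\theta(a \mid s)\bigr].
\]
```

Choosing the b (or the value function, in the actor-critic case) that minimises the variance of this estimator is precisely the optimisation studied in Part I.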
153

All learning is local: Multi-agent learning in global reward games

Chang, Yu-Han, Ho, Tracey, Kaelbling, Leslie P. 01 1900 (has links)
In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and effectively learn a near-optimal policy in a wide variety of settings. A sequence of increasingly complex empirical tests verifies the efficacy of this technique. / Singapore-MIT Alliance (SMA)
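A hedged sketch of the idea: treat the observed global reward as the agent's local contribution plus a slowly drifting term for everything the other agents do, and track that drifting term with a scalar Kalman filter. The variable names and noise parameters below are illustrative assumptions, not the authors' notation.

```python
import numpy as np

def kalman_credit_signal(global_rewards, q_drift=0.1, r_local=1.0):
    """Illustrative sketch, not the authors' exact formulation: a scalar Kalman
    filter tracks the drifting 'rest of the world' component of the global
    reward; subtracting it recovers an approximate local training signal."""
    b_hat, p = 0.0, 1.0                    # estimate of the drifting term and its variance
    local = []
    for g in global_rewards:
        p += q_drift                       # predict: the drift follows a random walk
        k = p / (p + r_local)              # Kalman gain
        b_hat += k * (g - b_hat)           # correct with the observed global reward
        p *= (1.0 - k)
        local.append(g - b_hat)            # credit assigned to this agent at this step
    return np.array(local)
```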
154

Importance Sampling for Reinforcement Learning with Multiple Objectives

Shelton, Christian Robert 01 August 2001 (has links)
This thesis considers three complications that arise from applying reinforcement learning to a real-world application. In the process of using reinforcement learning to build an adaptive electronic market-maker, we find that the sparsity of data, the partial observability of the domain, and the multiple objectives of the agent cause serious problems for existing reinforcement learning algorithms. We employ importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with little data. Our importance sampling estimator requires no knowledge about the environment and places few restrictions on the method of collecting data. It can be used efficiently with reactive controllers, finite-state controllers, or policies with function approximation. We present theoretical analyses of the estimator and incorporate it into a reinforcement learning algorithm. Additionally, this method provides a complete return surface which can be used to balance multiple objectives dynamically. We demonstrate the need for multiple goals in a variety of applications and natural solutions based on our sampling method. The thesis concludes with example results from applying our algorithm to the domain of automated electronic market-making.
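As background, a likelihood-ratio (importance sampling) estimate of a policy's return reweights trajectories collected under other policies by how likely the evaluated policy would have been to produce them. The sketch below illustrates the generic estimator; the function names and trajectory format are assumptions for the sketch, not the thesis's API.

```python
import numpy as np

def importance_sampled_return(trajectories, log_pi_eval, log_pi_behavior):
    """Hedged sketch of a likelihood-ratio estimate of a policy's expected return
    from off-policy data.  Assumed interfaces:
      log_pi_eval(obs, act)        -> log prob of act under the policy being evaluated
      log_pi_behavior(i, obs, act) -> log prob under the policy that collected trajectory i
    Each trajectory is (list of (obs, act) pairs, total_return)."""
    weighted, weights = [], []
    for i, (steps, ret) in enumerate(trajectories):
        log_w = sum(log_pi_eval(o, a) - log_pi_behavior(i, o, a) for o, a in steps)
        w = np.exp(log_w)
        weighted.append(w * ret)
        weights.append(w)
    unnormalized = np.mean(weighted)                   # unbiased but can have high variance
    normalized = np.sum(weighted) / np.sum(weights)    # self-normalised variant, usually lower variance
    return unnormalized, normalized
```

Because the estimate can be evaluated for any candidate policy from the same fixed data, sweeping it over the parameter space yields the return surface the abstract refers to.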
155

The Essential Dynamics Algorithm: Essential Results

Martin, Martin C. 01 May 2003 (has links)
This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously does policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
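One natural reading of the transform described above, shown purely as an illustrative sketch and not as the paper's exact construction, is to evaluate policies by rolling the state forward through the expected dynamics rather than sampling transitions; all interfaces below are assumptions.

```python
def essential_rollout(mean_dynamics, reward, policy, s0, horizon):
    """Illustrative sketch only: evaluate a policy on a deterministic surrogate of
    a stochastic MDP by stepping through mean_dynamics(s, a) ~ E[s' | s, a]
    instead of sampling next states."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        total += reward(s, a)
        s = mean_dynamics(s, a)   # deterministic step through the mean dynamics
    return total
```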
156

Mobilized ad-hoc networks: A reinforcement learning approach

Chang, Yu-Han, Ho, Tracey, Kaelbling, Leslie Pack 04 December 2003 (has links)
Research in mobile ad-hoc networks has focused on situations in which nodes have no control over their movements. We investigate an important but overlooked domain in which nodes do have control over their movements. Reinforcement learning methods can be used to control both packet routing decisions and node mobility, dramatically improving the connectivity of the network. We first motivate the problem by presenting theoretical bounds for the connectivity improvement of partially mobile networks and then present superior empirical results under a variety of different scenarios in which the mobile nodes in our ad-hoc network are embedded with adaptive routing policies and learned movement policies.
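For the packet-routing half of the problem, one natural way to apply reinforcement learning is a Q-routing style update in the spirit of Boyan and Littman; the sketch below is illustrative and not necessarily the authors' exact update rule.

```python
def q_routing_update(Q, node, dest, next_hop, queue_delay, link_delay, alpha=0.5):
    """Hedged sketch of a Q-routing style update.  Q[node][dest][next_hop]
    estimates the remaining delivery time of a packet for `dest` when `node`
    forwards it via `next_hop`; the table layout is an assumption."""
    # best remaining estimate from the neighbour's own table (0 if it is the destination)
    remaining = 0.0 if next_hop == dest else min(Q[next_hop][dest].values())
    target = queue_delay + link_delay + remaining
    Q[node][dest][next_hop] += alpha * (target - Q[node][dest][next_hop])
    return Q
```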
157

Reinforcement Learning by Policy Search

Peshkin, Leonid 14 February 2003 (has links)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn them. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy, a mapping of observations into actions, based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have memory. We investigate controllers with memory, including controllers with external memory, finite-state controllers, and distributed controllers for multi-agent systems. For these various controllers we work out the details of algorithms that learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
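To illustrate gradient ascent for a controller with memory, the sketch below gives a likelihood-ratio (REINFORCE-style) episode estimate for a stochastic finite-state controller, where both the action and the internal-memory transition are sampled from parameterised distributions. The `env` and `controller` interfaces are assumptions for the sketch, not the dissertation's code.

```python
import numpy as np

def fsc_episode_gradient(env, controller, theta, horizon, rng):
    """Hedged sketch of a single-episode likelihood-ratio gradient estimate for a
    stochastic finite-state controller; average over many episodes in practice."""
    score_sum = np.zeros_like(theta)
    ret = 0.0
    obs, mem = env.reset(), controller.initial_memory()
    for _ in range(horizon):
        # sample (action, next memory) and get d/dtheta log P(action, next_mem | obs, mem)
        action, next_mem, grad_log = controller.sample(theta, obs, mem, rng)
        score_sum += grad_log
        obs, reward, done = env.step(action)
        ret += reward
        mem = next_mem
        if done:
            break
    return ret * score_sum
```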
158

Reinforcement Learning and Simulation-Based Search in Computer Go

Silver, David 11 1900 (has links)
Learning and planning are two fundamental problems in artificial intelligence. The learning problem can be tackled by reinforcement learning methods, such as temporal-difference learning, which update a value function from real experience, and use function approximation to generalise across states. The planning problem can be tackled by simulation-based search methods, such as Monte-Carlo tree search, which update a value function from simulated experience, but treat each state individually. We introduce a new method, temporal-difference search, that combines elements of both reinforcement learning and simulation-based search methods. In this new method the value function is updated from simulated experience, but it uses function approximation to efficiently generalise across states. We also introduce the Dyna-2 architecture, which combines temporal-difference learning with temporal-difference search. Whereas temporal-difference learning acquires general domain knowledge from its past experience, temporal-difference search acquires local knowledge that is specialised to the agent's current state, by simulating future experience. Dyna-2 combines both forms of knowledge together. We apply our algorithms to the game of 9x9 Go. Using temporal-difference learning, with a million binary features matching simple patterns of stones, and using no prior knowledge except the grid structure of the board, we learnt a fast and effective evaluation function. Using temporal-difference search with the same representation produced a dramatic improvement: without any explicit search tree, and with equivalent domain knowledge, it achieved better performance than a vanilla Monte-Carlo tree search. When combined together using the Dyna-2 architecture, our program outperformed all handcrafted, traditional search, and traditional machine learning programs on the 9x9 Computer Go Server. We also use our framework to extend the Monte-Carlo tree search algorithm. By forming a rapid generalisation over subtrees of the search space, and incorporating heuristic pattern knowledge that was learnt or handcrafted offline, we were able to significantly improve the performance of the Go program MoGo. Using these enhancements, MoGo became the first 9x9 Go program to achieve human master level.
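A hedged sketch of the core idea of temporal-difference search as described above: run simulated episodes from the agent's current (root) state and update a linear value function with TD(lambda), so knowledge generalises across states through shared features rather than being stored per node of a search tree. The `simulator`, `features`, and `policy` interfaces are assumptions for the sketch.

```python
import numpy as np

def td_search(simulator, features, w, root_state, policy,
              n_episodes=1000, alpha=0.01, lam=0.8, rng=None):
    """Hedged sketch of TD(lambda) over simulated experience with a linear value
    function; interfaces and hyperparameters are illustrative assumptions."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_episodes):
        s = root_state
        z = np.zeros_like(w)                            # eligibility trace
        while not simulator.terminal(s):
            a = policy(w, s, rng)
            s_next, r = simulator.step(s, a, rng)
            phi = features(s)
            v_next = 0.0 if simulator.terminal(s_next) else features(s_next) @ w
            delta = r + v_next - phi @ w                # TD error (undiscounted episodic case)
            z = lam * z + phi
            w = w + alpha * delta * z                   # TD(lambda) update
            s = s_next
    return w
```

The contrast with a plain Monte-Carlo tree search is that the simulated experience updates a shared weight vector rather than per-node visit statistics, which is what lets the method generalise across the 9x9 board positions.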
159

Dynamic Tuning of PI-Controllers based on Model-free Reinforcement Learning Methods

Abbasi Brujeni, Lena 06 1900 (has links)
In this thesis, a Reinforcement Learning (RL) method called Sarsa is used to dynamically tune a PI-controller for a Continuous Stirred Tank Heater (CSTH) experimental setup. The proposed approach uses an approximate model to train the RL agent in the simulation environment before implementation on the real plant. This is done to help the RL agent start from a reasonably stable initial policy. Learning without any information about the dynamics of the process is not practically feasible, due to the great amount of data (time) that the RL algorithm requires and due to safety issues. The process in this thesis is modeled with a First Order Plus Time Delay (FOPTD) transfer function, because almost all chemical processes can be sufficiently represented by this class of transfer functions. The presence of a delay term makes this type of transfer function an inherently more complicated model for RL methods. RL methods must be combined with generalization techniques to handle the continuous state space. Here, parameterized quadratic function approximation combined with k-nearest neighbor function approximation is used for the regions close to and far from the origin, respectively. Applying each of these generalization methods separately has some disadvantages, hence their combination is used to overcome these flaws. The proposed RL-based PI-controller is initially trained in the simulation environment. Thereafter, the policy of the simulation-based RL agent is used as the starting policy of the RL agent during implementation on the experimental setup. As a result of the existing plant-model mismatch, the performance of the RL-based PI-controller using this initial policy is not as good as the simulation results; however, training on the real plant results in a significant improvement in performance. The IMC-tuned PI-controllers, which are the most commonly used feedback controllers, are also compared, and they likewise degrade because of the inevitable plant-model mismatch. To improve the performance of these IMC-tuned PI-controllers, re-tuning based on a more precise model of the process is necessary. The experimental tests are carried out for the cases of set-point tracking and disturbance rejection. In both cases, the successful adaptability of the RL-based PI-controller is clearly evident. Finally, when a disturbance enters the process, the performance of the proposed model-free self-tuning PI-controller degrades more than that of the existing IMC controllers. However, the adaptability of the RL-based PI-controller provides a good solution to this problem: after being trained to handle disturbances in the process, an improved control policy is obtained, which is able to successfully return the output to the set-point. / Process Control
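For readers unfamiliar with Sarsa, the update at the heart of this kind of approach is sketched below in tabular form; the state/action encodings and hyperparameters are assumptions for the sketch, and the thesis combines Sarsa with function approximation rather than a plain table.

```python
def sarsa_update(Q, state, gains, reward, next_state, next_gains,
                 alpha=0.1, gamma=0.95):
    """Hedged sketch of the tabular Sarsa update: here the 'action' is a choice of
    PI gains (e.g. a (Kc, tau_I) pair), the state is derived from the measured
    control error, and the reward penalises poor tracking."""
    td_target = reward + gamma * Q[(next_state, next_gains)]
    Q[(state, gains)] += alpha * (td_target - Q[(state, gains)])
    return Q
```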
160

RELPH: A Computational Model for Human Decision Making

Mohammadi Sepahvand, Nazanin January 2013 (has links)
The updating process, which consists of building mental models and adapting them to changes occurring in the environment, is impaired in neglect patients. A simple rock-paper-scissors experiment was conducted in our lab to examine updating impairments in neglect patients. The results of this experiment demonstrate a significant difference between the performance of healthy and brain-damaged participants: while healthy controls did not show any difficulty learning the computer's strategy, right brain-damaged patients failed to learn it. A computational modeling approach is employed to help us better understand the reason behind this difference, and thus to learn more about the updating process in healthy people and its impairment in right brain-damaged patients. More broadly, we hope to learn about the nature of the updating process in general. The hope is also that knowing what must be changed in the model to "brain-damage" it can shed light on the updating deficit in right brain-damaged patients. To do so, I adapted a pattern-detection method named "ELPH" into a reinforcement-learning model of human decision making called "RELPH". This model is capable of capturing the behavior of both healthy and right brain-damaged participants in our task, according to our defined measures. Indeed, this thesis is an effort to discuss the possible differences among these groups using this computational model.
