Partially observable Markov decision processes (POMDPs) are interesting
because they can model most conceivable real-world learning problems,
for example, robot navigation, driving a car, speech recognition, stock
trading, and playing games. The downside of this generality is that
exact algorithms are computationally intractable. This computational
complexity motivates approximate approaches. One such class of
algorithms is the so-called policy-gradient methods from reinforcement
learning, which seek to adjust the parameters of an agent in the
direction that maximises the long-term average of a reward signal.
Policy-gradient methods are attractive as a \emph{scalable} approach
for controlling POMDPs.
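To fix notation for what follows (the symbols here are illustrative and
not necessarily those used in the body of the thesis), a policy-gradient
method parameterises a stochastic policy $\mu(u \mid \theta, y)$ over
actions $u$ given observations $y$, and performs stochastic gradient
ascent on the long-term average reward:
\[
\eta(\theta) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}_{\theta}\!\left[ \sum_{t=0}^{T-1} r_t \right],
\qquad
\theta_{k+1} \;=\; \theta_k + \gamma_k \,\widehat{\nabla \eta}(\theta_k),
\]
where $\widehat{\nabla \eta}$ is an estimate of the gradient of $\eta$
with respect to the policy parameters and $\gamma_k$ is a step size.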
¶
In the most general case POMDP policies require some form of internal
state, or memory, in order to act optimally. Policy-gradient methods
have shown promise for problems admitting memory-less policies, but have
been less successful when memory is required. This thesis develops
several improved algorithms for learning policies with memory in an
infinite-horizon setting: directly, when the dynamics of the world
are known, and via Monte-Carlo methods otherwise.
The algorithms simultaneously learn how to act and what to remember.
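To make the notion of learned memory concrete, one common
parameterisation (sketched here in the spirit of finite-state
controllers; the exact form is developed in the thesis itself) augments
the agent with a finite internal state $g$. At each step the agent
samples a new internal state and an action from parameterised
distributions conditioned on the current observation $y_t$ and internal
state $g_t$:
\[
g_{t+1} \;\sim\; \omega(\cdot \mid \phi, g_t, y_t),
\qquad
u_t \;\sim\; \mu(\cdot \mid \theta, g_{t+1}, y_t),
\]
and gradient ascent is performed jointly on $(\phi, \theta)$, so that
the agent learns what to remember (through $\omega$) at the same time
as how to act (through $\mu$).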
¶
Monte-Carlo policy-gradient approaches tend to produce gradient
estimates with high variance. Two novel methods for reducing variance
are introduced. The first uses high-order filters to replace the
eligibility trace of the gradient estimator. The second uses a
low-variance value-function method to learn a subset of the parameters
and a policy-gradient method to learn the remainder.
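As an illustration of the first of these methods (again with
illustrative notation), estimators in the GPOMDP family accumulate a
discounted eligibility trace, which is a first-order filter of the
log-policy gradients:
\[
z_{t+1} \;=\; \beta z_t + \nabla_{\theta} \log \mu(u_t \mid \theta, y_t),
\qquad
\widehat{\nabla \eta} \;\approx\; \frac{1}{T} \sum_{t=1}^{T} r_t\, z_t,
\]
with discount $\beta \in [0,1)$. Replacing this first-order recursion
with a higher-order linear filter, schematically
\[
z_{t+1} \;=\; \sum_{k=0}^{p} a_k\, z_{t-k}
\;+\; \sum_{k=0}^{q} b_k\, \nabla_{\theta}
\log \mu(u_{t-k} \mid \theta, y_{t-k}),
\]
gives additional freedom in trading the bias of the estimate against
its variance; the specific filter designs are developed in the thesis.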
¶
The algorithms are applied to large domains including a simulated
robot navigation scenario, a multi-agent scenario with 21,000 states,
and the complex real-world task of large vocabulary continuous speech
recognition. To the best of the author's knowledge, no other policy-gradient
algorithms have performed well at such tasks.
¶
The high variance of Monte-Carlo methods requires lengthy simulation,
and hence a supercomputer, to train agents within a reasonable time. The
ANU ``Bunyip'' Linux cluster was built with such tasks in mind and was
used for several of the experimental results presented here. One
chapter of this thesis describes an application written for the Bunyip
cluster that won the international Gordon Bell prize for
price/performance in 2001.
Identifier | oai:union.ndltd.org:ADTP/216735 |
Date | January 2003 |
Creators | Aberdeen, Douglas Alexander, doug.aberdeen@anu.edu.au |
Publisher | The Australian National University. Research School of Information Sciences and Engineering |
Source Sets | Australasian Digital Theses Program |
Language | English |
Rights | http://www.anu.edu.au/legal/copyright/copyrit.html, Copyright Douglas Alexander Aberdeen |