A study of model-based average reward reinforcement learning

Reinforcement Learning (RL) is the study of learning agents that improve
their performance from rewards and punishments. Most reinforcement learning
methods optimize the discounted total reward received by an agent, although in many
domains the natural criterion is the average reward per time step. In this
thesis, we introduce a model-based average reward reinforcement learning method
called "H-learning" and show that it performs better than other average reward and
discounted RL methods in the domain of scheduling a simulated Automatic Guided
Vehicle (AGV).
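
The difference between the two criteria can be seen in a small sketch. It is not taken from the thesis: the two reward streams, the discount factor of 0.8, and the function names are all hypothetical, chosen only to show that the discounted and average reward criteria can rank the same two behaviors differently.

```python
# Illustrative only: reward streams and gamma are assumptions, not thesis data.

def average_reward(cycle):
    """Long-run average reward per step of a reward sequence that repeats forever."""
    return sum(cycle) / len(cycle)


def discounted_return(cycle, gamma):
    """Discounted total reward of the same repeating sequence, in closed form.

    One pass through the cycle is worth sum(gamma**k * r_k); each later pass
    is worth gamma**len(cycle) times the previous one, which gives a
    geometric series.
    """
    one_pass = sum(r * gamma**k for k, r in enumerate(cycle))
    return one_pass / (1.0 - gamma ** len(cycle))


steady = [1.0]                 # reward 1 on every step
delayed = [0.0] * 9 + [20.0]   # nothing for 9 steps, then 20 on the 10th
gamma = 0.8                    # assumed discount factor

for name, cycle in (("steady", steady), ("delayed", delayed)):
    print(name, average_reward(cycle), discounted_return(cycle, gamma))
```

Under these assumptions the delayed stream has the higher average reward per step, while the steady stream has the higher discounted total, so an agent optimizing the discounted criterion would prefer the behavior that earns less in the long run.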
We also introduce a version of H-learning which automatically explores the
unexplored parts of the state space, while always choosing an apparently best action
with respect to the current value function. We show that this "Auto-exploratory H-learning"
performs much better than the original H-learning under many previously
studied exploration strategies.
To scale H-learning to large state spaces, we extend it to learn action models
and reward functions in the form of Bayesian networks, and approximate its value
function using local linear regression. We show that both extensions significantly
reduce the space requirement of H-learning and make it converge much faster
in the AGV scheduling task. Further, Auto-exploratory H-learning
combines synergistically with Bayesian network model learning and value
function approximation by local linear regression, yielding a highly effective average
reward RL algorithm.
We believe that the algorithms presented here have the potential to scale to
large applications in the context of average reward optimization.
Graduation date: 1996

Identifier: oai:union.ndltd.org:ORGSU/oai:ir.library.oregonstate.edu:1957/34698
Date: 09 May 1996
Creators: Ok, DoKyeong
Contributors: Tadepalli, Prasad
Source Sets: Oregon State University
Language: en_US
Detected Language: English
Type: Thesis/Dissertation
