
Value-gradient learning

This thesis presents an Adaptive Dynamic Programming method, Value-Gradient Learning, for solving a control optimisation problem, using a neural network to represent a critic function in a large continuous-valued state space. The algorithm developed, called VGL(λ), requires a learned differentiable model of the environment. VGL(λ) extends Dual Heuristic Programming (DHP) to include a bootstrapping parameter, λ, analogous to that used in the reinforcement learning algorithm TD(λ). Online and batch-mode implementations of the algorithm are provided, and its theoretical relationships to its precursor algorithms, DHP and TD(λ), are described. A theoretical result shows that, to achieve trajectory optimality in a continuous-valued state space, the critic must learn the value-gradient, a fact that affects any critic-learning algorithm; the connection of this result to Pontryagin's Minimum Principle is made clear. It is hence proven that learning the value-gradient directly obviates the need for local exploration of the value function, which motivates value-gradient learning methods in terms of automatic local value exploration and improved learning speed. Empirical results are given for several benchmark problems, and the algorithm's improved speed, convergence, and ability to work without local value exploration are demonstrated in comparison to its precursor algorithms, TD(λ) and DHP. A convergence proof for one instance of the VGL(λ) algorithm is given, valid for control problems with a greedy policy and a general nonlinear function approximator representing the critic. This is a non-trivial accomplishment, since most, if not all, other related algorithms can be made to diverge under similar conditions, and new divergence proofs demonstrating this for certain algorithms are given in the thesis. Several technical problems must be overcome to make a robust VGL(λ) implementation, and the solutions to these are described, including implementing an efficient greedy policy, applying trajectory clipping correctly, and computing second-order gradients efficiently with a neural network.
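
For readers unfamiliar with value-gradient methods, the sketch below illustrates the general style of update the abstract describes; it is not the thesis's own implementation. It assumes a differentiable model with known state Jacobians `f_x` and cost gradients `r_x`, a critic `g_tilde(x, w)` that outputs an approximate value-gradient, and, for brevity, it ignores the dependence of the greedy action on the state, which the full VGL(λ) recursion accounts for. All function and variable names here are hypothetical.

```python
# Hypothetical sketch of a VGL(lambda)-style critic update (not the thesis code).
# Assumes: a differentiable model f(x, u) and cost r(x, u) with known derivatives,
# a critic g_tilde(x, w) approximating dV/dx, and a fixed action sequence along
# one trajectory (the dependence of the greedy action on the state is omitted).
import numpy as np

def vgl_lambda_targets(xs, us, f_x, r_x, g_tilde, w, gamma=1.0, lam=0.7):
    """Backward recursion for value-gradient targets G'_t along one trajectory.

    xs: states x_0..x_T; us: actions u_0..u_{T-1}.
    f_x(x, u): Jacobian dx_{t+1}/dx_t of the model.
    r_x(x, u): gradient of the instantaneous cost w.r.t. the state.
    g_tilde(x, w): critic output, an approximation of dV/dx.
    """
    T = len(us)
    targets = [None] * (T + 1)
    targets[T] = np.zeros_like(xs[T])   # terminal value-gradient assumed zero in this sketch
    for t in reversed(range(T)):
        # Blend the fully recursed target with the critic's bootstrapped gradient.
        blended = lam * targets[t + 1] + (1 - lam) * g_tilde(xs[t + 1], w)
        targets[t] = r_x(xs[t], us[t]) + gamma * f_x(xs[t], us[t]).T @ blended
    return targets

def vgl_lambda_update(xs, us, f_x, r_x, g_tilde, g_tilde_dw, w, alpha=0.01, **kw):
    """One batch-mode weight update: move the critic's gradients toward their targets."""
    targets = vgl_lambda_targets(xs, us, f_x, r_x, g_tilde, w, **kw)
    dw = np.zeros_like(w)
    for t in range(len(us)):
        err = targets[t] - g_tilde(xs[t], w)   # value-gradient error at x_t
        dw += g_tilde_dw(xs[t], w).T @ err     # pull the error back through dG_tilde/dw
    return w + alpha * dw
```

In this simplified form, setting λ = 0 makes each target depend only on the critic's own bootstrapped gradient at the next state, in the spirit of DHP, while λ = 1 unrolls the recursion along the whole trajectory; the thesis develops the exact recursion through the greedy policy, the online variant, and the convergence analysis that this sketch omits.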

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:600663
Date: January 2014
Creators: Fairbank, Michael
Publisher: City University London
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: http://openaccess.city.ac.uk/3438/
