
Value-gradient learning

This thesis presents an Adaptive Dynamic Programming method, Value-Gradient Learning, for solving a control optimisation problem, using a neural network to represent a critic function in a large continuous-valued state space. The algorithm developed, called VGL(λ), requires a learned differentiable model of the environment. VGL(λ) extends Dual Heuristic Programming (DHP) to include a bootstrapping parameter, λ, analogous to that used in the reinforcement learning algorithm TD(λ). Online and batch-mode implementations of the algorithm are provided, and its theoretical relationships to its precursor algorithms, DHP and TD(λ), are described. A theoretical result shows that, to achieve trajectory optimality in a continuous-valued state space, the critic must learn the value-gradient, a fact that affects any critic-learning algorithm; the connection of this result to Pontryagin's Minimum Principle is made clear. It is hence proven that learning the value-gradient directly obviates the need for local exploration of the value function, which motivates value-gradient learning methods in terms of automatic local value exploration and improved learning speed. Empirical results are given for several benchmark problems, and the algorithm's improved speed, convergence, and ability to work without local value exploration are demonstrated in comparison to its precursor algorithms, TD(λ) and DHP. A convergence proof for one instance of the VGL(λ) algorithm is given, valid for control problems with a greedy policy and a general nonlinear function approximator representing the critic. This is a non-trivial accomplishment, since most, if not all, other related algorithms can be made to diverge under similar conditions, and new divergence proofs demonstrating this for certain algorithms are given in the thesis. Several technical problems must be overcome to make a robust VGL(λ) implementation, and the solutions to these are described, including implementing an efficient greedy policy, applying trajectory clipping correctly, and computing second-order gradients efficiently with a neural network.
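
For readers unfamiliar with value-gradient methods, the sketch below illustrates the general style of update the abstract describes; it is not the thesis's own implementation. It assumes a differentiable model with known state Jacobians `f_x` and cost gradients `r_x`, a critic `g_tilde(x, w)` that outputs an approximate value-gradient, and, for brevity, it ignores the dependence of the greedy action on the state, which the full VGL(λ) recursion accounts for. All function and variable names here are hypothetical.

```python
# Hypothetical sketch of a VGL(lambda)-style critic update (not the thesis code).
# Assumes: a differentiable model f(x, u) and cost r(x, u) with known derivatives,
# a critic g_tilde(x, w) approximating dV/dx, and a fixed action sequence along
# one trajectory (the dependence of the greedy action on the state is omitted).
import numpy as np

def vgl_lambda_targets(xs, us, f_x, r_x, g_tilde, w, gamma=1.0, lam=0.7):
    """Backward recursion for value-gradient targets G'_t along one trajectory.

    xs: states x_0..x_T; us: actions u_0..u_{T-1}.
    f_x(x, u): Jacobian dx_{t+1}/dx_t of the model.
    r_x(x, u): gradient of the instantaneous cost w.r.t. the state.
    g_tilde(x, w): critic output, an approximation of dV/dx.
    """
    T = len(us)
    targets = [None] * (T + 1)
    targets[T] = np.zeros_like(xs[T])   # terminal value-gradient assumed zero in this sketch
    for t in reversed(range(T)):
        # Blend the fully recursed target with the critic's bootstrapped gradient.
        blended = lam * targets[t + 1] + (1 - lam) * g_tilde(xs[t + 1], w)
        targets[t] = r_x(xs[t], us[t]) + gamma * f_x(xs[t], us[t]).T @ blended
    return targets

def vgl_lambda_update(xs, us, f_x, r_x, g_tilde, g_tilde_dw, w, alpha=0.01, **kw):
    """One batch-mode weight update: move the critic's gradients toward their targets."""
    targets = vgl_lambda_targets(xs, us, f_x, r_x, g_tilde, w, **kw)
    dw = np.zeros_like(w)
    for t in range(len(us)):
        err = targets[t] - g_tilde(xs[t], w)   # value-gradient error at x_t
        dw += g_tilde_dw(xs[t], w).T @ err     # pull the error back through dG_tilde/dw
    return w + alpha * dw
```

In this simplified form, setting λ = 0 makes each target depend only on the critic's own bootstrapped gradient at the next state, in the spirit of DHP, while λ = 1 unrolls the recursion along the whole trajectory; the thesis develops the exact recursion through the greedy policy, the online variant, and the convergence analysis that this sketch omits.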

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:600663
Date: January 2014
Creators: Fairbank, Michael
Publisher: City University London
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: http://openaccess.city.ac.uk/3438/
