151 |
Bounding Box Improvement with Reinforcement LearningCleland, Andrew Lewis 12 June 2018 (has links)
In this thesis, I explore a reinforcement learning technique for improving bounding box localizations of objects in images. The model takes as input a bounding box already known to overlap an object and aims to improve the fit of the box through a series of transformations that shift the location of the box by translation, or change its size or aspect ratio. Over the course of these actions, the model adapts to new information extracted from the image. This active localization approach contrasts with existing bounding-box regression methods, which extract information from the image only once. I implement, train, and test this reinforcement learning model using data taken from the Portland State Dog-Walking image set.
The model balances exploration with exploitation in training using an ε-greedy policy. I find that the performance of the model is sensitive to the ε-greedy configuration used during training, performing best when the epsilon parameter is set to very low values over the course of training. With = 0.01, I find the algorithm can improve bounding boxes in about 78% of test cases for the "dog" object category, and 76% for the "human" category.
|
152 |
On the Selection of Just-in-time InterventionsJaimes, Luis Gabriel 20 March 2015 (has links)
A deeper understanding of human physiology, combined with improvements in sensing technologies, is fulfilling the vision of affective computing, where applications monitor and react to changes in affect. Further, the proliferation of commodity mobile devices is extending these applications into the natural environment, where they become a pervasive part of our daily lives. This work examines one such pervasive affective computing application with significant implications for long-term health and quality of life adaptive just-in-time interventions (AJITIs). We discuss fundamental components needed to design AJITIs based for one kind of affective data, namely stress. Chronic stress has significant long-term behavioral and physical health consequences, including an increased risk of cardiovascular disease, cancer, anxiety and depression. This dissertation presents the state-of-the-art of Just-in-time interventions for stress. It includes a new architecture. that is used to describe the most important issues in the design, implementation, and evaluation of AJITIs. Then, the most important mechanisms available in the literature are described, and classified. The dissertation also presents a simulation model to study and evaluate different strategies and algorithms for interventions selection. Then, a new hybrid mechanism based on value iteration and monte carlo simulation method is proposed. This semi-online algorithm dynamically builds a transition probability matrix (TPM) which is used to obtain a new policy for intervention selection. We present this algorithm in two different versions. The first version uses a pre-determined number of stress episodes as a training set to create a TPM, and then to generate the policy that will be used to select interventions in the future. In the second version, we use each new stress episode to update the TPM, and a pre-determined number of episodes to update our selection policy for interventions. We also present a completely online learning algorithm for intervention selection based on Q-learning with eligibility traces. We show that this algorithm could be used by an affective computing system to select and deliver in mobile environments. Finally, we conducts posthoc experiments and simulations to demonstrate feasibility of both real-time stress forecasting and stress intervention adaptation and optimization.
|
153 |
Policy-Gradient Algorithms for Partially Observable Markov Decision ProcessesAberdeen, Douglas Alexander, doug.aberdeen@anu.edu.au January 2003 (has links)
Partially observable Markov decision processes are interesting because
of their ability to model most conceivable real-world learning
problems, for example, robot navigation, driving a car, speech
recognition, stock trading, and playing games. The downside of this
generality is that exact algorithms are computationally
intractable. Such computational complexity motivates approximate
approaches. One such class of algorithms are the so-called
policy-gradient methods from reinforcement learning. They seek to
adjust the parameters of an agent in the direction that maximises the
long-term average of a reward signal. Policy-gradient methods are
attractive as a \emph{scalable} approach for controlling partially
observable Markov decision processes (POMDPs).
¶
In the most general case POMDP policies require some form of internal
state, or memory, in order to act optimally. Policy-gradient methods
have shown promise for problems admitting memory-less policies but have
been less successful when memory is required. This thesis develops
several improved algorithms for learning policies with memory in an
infinite-horizon setting. Directly, when the dynamics of the world
are known, and via Monte-Carlo methods otherwise.
The algorithms simultaneously learn how to act and what to remember.
¶
Monte-Carlo policy-gradient approaches tend to produce gradient
estimates with high variance. Two novel methods for reducing variance
are introduced. The first uses high-order filters to replace the
eligibility trace of the gradient estimator. The second uses a
low-variance value-function method to learn a subset of the parameters
and a policy-gradient method to learn the remainder.
¶
The algorithms are applied to large domains including a simulated
robot navigation scenario, a multi-agent scenario with 21,000 states,
and the complex real-world task of large vocabulary continuous speech
recognition. To the best of the author's knowledge, no other policy-gradient
algorithms have performed well at such tasks.
¶
The high variance of Monte-Carlo methods requires lengthy simulation
and hence a super-computer to train agents within a reasonable time. The ANU
``Bunyip'' Linux cluster was built with such tasks in mind. It was
used for several of the experimental results presented here. One
chapter of this thesis describes an application written for the Bunyip
cluster that won the international Gordon-Bell prize for
price/performance in 2001.
|
154 |
Policy Gradient Methods: Variance Reduction and Stochastic ConvergenceGreensmith, Evan, evan.greensmith@gmail.com January 2005 (has links)
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies, and using a policy from the class, and a trajectory through the environment taken by the agent using this policy, estimate the performance of the policy with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance.¶
In Part I of this thesis we study the estimation variance of policy gradient algorithms, in particular, when augmenting the estimate with a baseline, a common method for reducing estimation variance, and when using actor-critic methods. A baseline adjusts the reward signal supplied by the environment, and can be used to reduce the variance of a policy gradient estimate without adding any bias. We find the baseline that minimizes the variance. We also consider the class of constant baselines, and find the constant baseline that minimizes the variance. We compare this to the common technique of adjusting the rewards by an estimate of the performance measure. Actor-critic methods usually attempt to learn a value function accurate enough to be used in a gradient estimate without adding much bias. In this thesis we propose that in learning the value function we should also consider the variance. We show how considering the variance of the gradient estimate when learning a value function can be beneficial, and we introduce a new optimization criterion for selecting a value function.¶
In Part II of this thesis we consider online versions of policy gradient algorithms, where we update our policy for selecting actions at each step in time, and study the convergence of the these online algorithms. For such online gradient-based algorithms, convergence results aim to show that the gradient of the performance measure approaches zero. Such a result has been shown for an algorithm which is based on observing trajectories between visits to a special state of the environment. However, the algorithm is not suitable in a partially observable setting, where we are unable to access the full state of the environment, and its variance depends on the time between visits to the special state, which may be large even when only few samples are needed to estimate the gradient. To date, convergence results for algorithms that do not rely on a special state are weaker. We show that, for a certain algorithm that does not rely on a special state, the gradient of the performance measure approaches zero. We show that this continues to hold when using certain baseline algorithms suggested by the results of Part I.
|
155 |
All learning is local: Multi-agent learning in global reward gamesChang, Yu-Han, Ho, Tracey, Kaelbling, Leslie P. 01 1900 (has links)
In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and effectively learn a near-optimal policy in a wide variety of settings. A sequence of increasingly complex empirical tests verifies the efficacy of this technique. / Singapore-MIT Alliance (SMA)
|
156 |
Importance Sampling for Reinforcement Learning with Multiple ObjectivesShelton, Christian Robert 01 August 2001 (has links)
This thesis considers three complications that arise from applying reinforcement learning to a real-world application. In the process of using reinforcement learning to build an adaptive electronic market-maker, we find the sparsity of data, the partial observability of the domain, and the multiple objectives of the agent to cause serious problems for existing reinforcement learning algorithms. We employ importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with few data. Our importance sampling estimator requires no knowledge about the environment and places few restrictions on the method of collecting data. It can be used efficiently with reactive controllers, finite-state controllers, or policies with function approximation. We present theoretical analyses of the estimator and incorporate it into a reinforcement learning algorithm. Additionally, this method provides a complete return surface which can be used to balance multiple objectives dynamically. We demonstrate the need for multiple goals in a variety of applications and natural solutions based on our sampling method. The thesis concludes with example results from employing our algorithm to the domain of automated electronic market-making.
|
157 |
The Essential Dynamics Algorithm: Essential ResultsMartin, Martin C. 01 May 2003 (has links)
This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously does policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
|
158 |
Mobilized ad-hoc networks: A reinforcement learning approachChang, Yu-Han, Ho, Tracey, Kaelbling, Leslie Pack 04 December 2003 (has links)
Research in mobile ad-hoc networks has focused on situations in which nodes have no control over their movements. We investigate an important but overlooked domain in which nodes do have control over their movements. Reinforcement learning methods can be used to control both packet routing decisions and node mobility, dramatically improving the connectivity of the network. We first motivate the problem by presenting theoretical bounds for the connectivity improvement of partially mobile networks and then present superior empirical results under a variety of different scenarios in which the mobile nodes in our ad-hoc network are embedded with adaptive routing policies and learned movement policies.
|
159 |
Reinforcement Learning by Policy SearchPeshkin, Leonid 14 February 2003 (has links)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy---a mapping of observations into actions---based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
|
160 |
Reinforcement Learning and Simulation-Based Search in Computer GoSilver, David 11 1900 (has links)
Learning and planning are two fundamental problems in artificial intelligence. The learning problem can be tackled by reinforcement learning methods, such as temporal-difference learning, which update a value function from real experience, and use function approximation to generalise across states. The planning problem can be tackled by simulation-based search methods, such as Monte-Carlo tree search, which update a value function from simulated experience, but treat each state individually. We introduce a new method, temporal-difference search, that combines elements of both reinforcement learning and simulation-based search methods. In this new method the value function is updated from simulated experience, but it uses function approximation to efficiently generalise across states. We also introduce the Dyna-2 architecture, which combines temporal-difference learning with temporal-difference search. Whereas temporal-difference learning acquires general domain knowledge from its past experience, temporal-difference search acquires local knowledge that is specialised to the agent's current state, by simulating future experience. Dyna-2 combines both forms of knowledge together.
We apply our algorithms to the game of 9x9 Go. Using temporal-difference learning, with a million binary features matching simple patterns of stones, and using no prior knowledge except the grid structure of the board, we learnt a fast and effective evaluation function. Using temporal-difference search with the same representation produced a dramatic improvement: without any explicit search tree, and with equivalent domain knowledge, it achieved better performance than a vanilla Monte-Carlo tree search. When combined together using the Dyna-2 architecture, our program outperformed all handcrafted, traditional search, and traditional machine learning programs on the 9x9 Computer Go Server.
We also use our framework to extend the Monte-Carlo tree search algorithm. By forming a rapid generalisation over subtrees of the search space, and incorporating heuristic pattern knowledge that was learnt or handcrafted offline, we were able to significantly improve the performance of the Go program MoGo. Using these enhancements, MoGo became the first 9x9 Go program to achieve human master level.
|
Page generated in 0.0965 seconds