151 |
Towards a mechanistic understanding of the neurobiological mechanisms underlying psychosis
Haarsma, Joost January 2018
Psychotic symptoms are prevalent in a wide variety of psychiatric and neurological disorders. Yet, despite decades of research, the neurobiological mechanisms via which these symptoms come to manifest themselves remain to be elucidated. I argue in this thesis that a mechanistic approach towards understanding psychosis, one that borrows heavily from the predictive coding framework, can help us understand the relationship between neurobiology and symptomology. In the first results chapter I present new data on a biomarker that has often been cited in relation to psychotic disorders: glutamate levels in the anterior cingulate cortex (ACC), as measured with magnetic resonance spectroscopy. In this chapter I aimed to replicate previous results that show differences in glutamate levels between psychosis and health. However, no statistically significant group differences or correlations with symptomology were found. In order to elucidate the potential mechanism underlying glutamate changes in the anterior cingulate cortex in psychosis, I tested whether a pharmacological challenge of Bromocriptine or Sulpiride altered glutamate levels in the anterior cingulate cortex. However, no significant differences were found between medication groups. In the second results chapter I aimed to address a long-standing question in the field of computational psychiatry, namely whether prior expectations have a stronger or weaker influence on inference in psychosis. I go on to show that this depends on the origin of the prior expectation and on disease stage. That is, cognitive priors are stronger in first episode psychosis but not in people at risk for psychosis, whereas perceptual priors seem to be weakened in individuals at risk for psychosis compared to healthy individuals and individuals with first episode psychosis. Furthermore, there is some evidence that these alterations are correlated with glutamate levels.
In the third results chapter I aimed to elucidate the nature of reward prediction error aberrancies in chronic schizophrenia. There has been some evidence suggesting that schizophrenia is associated with aberrant coding of reward prediction errors during reinforcement learning. However, it is unclear whether these aberrancies are related to disease years and medication use. Here I provide evidence for a small but significant alteration in the coding of reward prediction errors that is correlated with medication use. In the fourth results chapter I aimed to study the influence of uncertainty on the coding of unsigned prediction errors during learning. It has been hypothesized by predictive coding theorists that dopamine plays a role in the precision-weighting of unsigned prediction error. This theory is of particular relevance to psychosis research, as it might provide a mechanism via which dopamine aberrancies might lead to psychotic symptoms. I found that blocking dopamine using Sulpiride abolishes precision-weighting of unsigned prediction error, providing evidence for a dopamine-mediated precision-weighting mechanism. In the fifth results chapter I aimed to extend this research into early psychosis, to elucidate whether psychosis is indeed associated with a failure to precision-weight prediction error. I found that first episode psychosis is indeed associated with a failure to precision-weight prediction errors, an effect that is explained by the experience of positive symptoms. In the sixth results chapter I explore whether the degree of precision-weighting of unsigned prediction errors is correlated with glutamate levels in the anterior cingulate cortex. Such a correlation might be plausible given that psychosis has been associated with both. However, I did not find such a relationship, even in a sample of 137 individuals.
Thus, I concluded that anterior cingulate glutamate levels might be more related to non-positive symptoms associated with psychotic disorders. In summary, a mechanistic approach towards understanding psychosis can give us valuable insights into the disease mechanisms at play. I have shown here that the influence of expectations on perception differs across disease stages in psychosis. Furthermore, aberrancies in prediction error mechanisms might explain positive symptoms in psychosis, a process likely mediated by dopaminergic mechanisms, whereas evidence for glutamatergic mediation remains absent.
|
152 |
Bounding Box Improvement with Reinforcement Learning
Cleland, Andrew Lewis 12 June 2018
In this thesis, I explore a reinforcement learning technique for improving bounding box localizations of objects in images. The model takes as input a bounding box already known to overlap an object and aims to improve the fit of the box through a series of transformations that shift the location of the box by translation, or change its size or aspect ratio. Over the course of these actions, the model adapts to new information extracted from the image. This active localization approach contrasts with existing bounding-box regression methods, which extract information from the image only once. I implement, train, and test this reinforcement learning model using data taken from the Portland State Dog-Walking image set.
The model balances exploration with exploitation in training using an ε-greedy policy. I find that the performance of the model is sensitive to the ε-greedy configuration used during training, performing best when the epsilon parameter is set to very low values over the course of training. With ε = 0.01, I find the algorithm can improve bounding boxes in about 78% of test cases for the "dog" object category, and 76% for the "human" category.
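As a rough illustration of the ε-greedy rule mentioned in this abstract, the sketch below picks a random action with probability ε and the highest-valued action otherwise. The function name `epsilon_greedy`, the three-action setup, and the value estimates are assumptions made for the example; they are not taken from the thesis.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value
    # (the first such action if several are tied).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for three box transformations
# (for instance shift, resize, change aspect ratio); the numbers are made up.
q = [0.2, 0.7, 0.1]
greedy_choice = epsilon_greedy(q, epsilon=0.0)   # never explores
mostly_greedy = epsilon_greedy(q, epsilon=0.01)  # explores about 1% of the time
```

With a small ε such as 0.01, almost every step follows the current value estimates, which is the regime the abstract reports as working best.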
|
153 |
On the Selection of Just-in-time Interventions
Jaimes, Luis Gabriel 20 March 2015
A deeper understanding of human physiology, combined with improvements in sensing technologies, is fulfilling the vision of affective computing, where applications monitor and react to changes in affect. Further, the proliferation of commodity mobile devices is extending these applications into the natural environment, where they become a pervasive part of our daily lives. This work examines one such pervasive affective computing application with significant implications for long-term health and quality of life: adaptive just-in-time interventions (AJITIs). We discuss fundamental components needed to design AJITIs based on one kind of affective data, namely stress. Chronic stress has significant long-term behavioral and physical health consequences, including an increased risk of cardiovascular disease, cancer, anxiety and depression. This dissertation presents the state of the art in just-in-time interventions for stress. It includes a new architecture that is used to describe the most important issues in the design, implementation, and evaluation of AJITIs. Then, the most important mechanisms available in the literature are described and classified. The dissertation also presents a simulation model to study and evaluate different strategies and algorithms for intervention selection. Then, a new hybrid mechanism based on value iteration and Monte Carlo simulation is proposed. This semi-online algorithm dynamically builds a transition probability matrix (TPM) which is used to obtain a new policy for intervention selection. We present this algorithm in two different versions. The first version uses a pre-determined number of stress episodes as a training set to create a TPM, and then to generate the policy that will be used to select interventions in the future. In the second version, we use each new stress episode to update the TPM, and a pre-determined number of episodes to update our selection policy for interventions.
We also present a completely online learning algorithm for intervention selection based on Q-learning with eligibility traces. We show that this algorithm could be used by an affective computing system to select and deliver interventions in mobile environments. Finally, we conduct post hoc experiments and simulations to demonstrate the feasibility of both real-time stress forecasting and stress intervention adaptation and optimization.
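The idea of building a transition probability matrix (TPM) from observed episodes can be sketched as counting transitions and normalizing the counts. This is only a minimal sketch under assumptions: the state and intervention labels, the `(state, action, next_state)` episode format, and both function names are hypothetical, and the policy step (value iteration over the resulting TPM) is omitted.

```python
from collections import defaultdict

def update_counts(counts, episode):
    """Add one episode's (state, action, next_state) transitions to the counts."""
    for state, action, next_state in episode:
        counts[(state, action)][next_state] += 1

def transition_probabilities(counts):
    """Normalize counts into empirical P(next_state | state, action)."""
    tpm = {}
    for key, successors in counts.items():
        total = sum(successors.values())
        tpm[key] = {s2: n / total for s2, n in successors.items()}
    return tpm

counts = defaultdict(lambda: defaultdict(int))
# Hypothetical stress episodes; the labels are made up for illustration.
update_counts(counts, [("stressed", "breathing", "calm")])
update_counts(counts, [("stressed", "breathing", "stressed")])
update_counts(counts, [("stressed", "breathing", "calm")])
tpm = transition_probabilities(counts)
```

Calling `update_counts` after every new episode and re-normalizing mirrors the incremental flavour of the second version described in the abstract, while counting over a fixed training set first corresponds to the first version.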
|
154 |
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Aberdeen, Douglas Alexander, doug.aberdeen@anu.edu.au January 2003
Partially observable Markov decision processes are interesting because
of their ability to model most conceivable real-world learning
problems, for example, robot navigation, driving a car, speech
recognition, stock trading, and playing games. The downside of this
generality is that exact algorithms are computationally
intractable. Such computational complexity motivates approximate
approaches. One such class of algorithms is the so-called
policy-gradient methods from reinforcement learning. They seek to
adjust the parameters of an agent in the direction that maximises the
long-term average of a reward signal. Policy-gradient methods are
attractive as a scalable approach for controlling partially
observable Markov decision processes (POMDPs).
¶
In the most general case POMDP policies require some form of internal
state, or memory, in order to act optimally. Policy-gradient methods
have shown promise for problems admitting memory-less policies but have
been less successful when memory is required. This thesis develops
several improved algorithms for learning policies with memory in an
infinite-horizon setting: directly, when the dynamics of the world
are known, and via Monte-Carlo methods otherwise.
The algorithms simultaneously learn how to act and what to remember.
¶
Monte-Carlo policy-gradient approaches tend to produce gradient
estimates with high variance. Two novel methods for reducing variance
are introduced. The first uses high-order filters to replace the
eligibility trace of the gradient estimator. The second uses a
low-variance value-function method to learn a subset of the parameters
and a policy-gradient method to learn the remainder.
¶
The algorithms are applied to large domains including a simulated
robot navigation scenario, a multi-agent scenario with 21,000 states,
and the complex real-world task of large vocabulary continuous speech
recognition. To the best of the author's knowledge, no other policy-gradient
algorithms have performed well at such tasks.
¶
The high variance of Monte-Carlo methods requires lengthy simulation
and hence a super-computer to train agents within a reasonable time. The ANU
"Bunyip" Linux cluster was built with such tasks in mind. It was
used for several of the experimental results presented here. One
chapter of this thesis describes an application written for the Bunyip
cluster that won the international Gordon Bell Prize for
price/performance in 2001.
|
155 |
Policy Gradient Methods: Variance Reduction and Stochastic Convergence
Greensmith, Evan, evan.greensmith@gmail.com January 2005
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies, and using a policy from the class, and a trajectory through the environment taken by the agent using this policy, estimate the performance of the policy with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance.¶
In Part I of this thesis we study the estimation variance of policy gradient algorithms, in particular, when augmenting the estimate with a baseline, a common method for reducing estimation variance, and when using actor-critic methods. A baseline adjusts the reward signal supplied by the environment, and can be used to reduce the variance of a policy gradient estimate without adding any bias. We find the baseline that minimizes the variance. We also consider the class of constant baselines, and find the constant baseline that minimizes the variance. We compare this to the common technique of adjusting the rewards by an estimate of the performance measure. Actor-critic methods usually attempt to learn a value function accurate enough to be used in a gradient estimate without adding much bias. In this thesis we propose that in learning the value function we should also consider the variance. We show how considering the variance of the gradient estimate when learning a value function can be beneficial, and we introduce a new optimization criterion for selecting a value function.¶
In Part II of this thesis we consider online versions of policy gradient algorithms, where we update our policy for selecting actions at each step in time, and study the convergence of these online algorithms. For such online gradient-based algorithms, convergence results aim to show that the gradient of the performance measure approaches zero. Such a result has been shown for an algorithm which is based on observing trajectories between visits to a special state of the environment. However, the algorithm is not suitable in a partially observable setting, where we are unable to access the full state of the environment, and its variance depends on the time between visits to the special state, which may be large even when only a few samples are needed to estimate the gradient. To date, convergence results for algorithms that do not rely on a special state are weaker. We show that, for a certain algorithm that does not rely on a special state, the gradient of the performance measure approaches zero. We show that this continues to hold when using certain baseline algorithms suggested by the results of Part I.
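The claim in Part I that a baseline can reduce variance without adding bias can be checked exactly on a toy problem. The sketch below is not the thesis's setting: it assumes a one-step problem with a softmax policy over two actions and made-up rewards, and it uses the average reward as the baseline (the "common technique" the abstract mentions), not the variance-minimizing baseline the thesis derives. It computes the exact mean and total variance of the estimate `(r_a - b) * grad log pi(a)` by enumerating actions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_mean_and_variance(logits, rewards, baseline):
    """Exact mean and total variance of (r_a - baseline) * grad log pi(a)."""
    pi = softmax(logits)
    k = len(pi)
    mean = [0.0] * k
    second_moment = 0.0
    for a in range(k):
        # Gradient of log softmax w.r.t. the logits: one-hot(a) - pi.
        score = [(1.0 if i == a else 0.0) - pi[i] for i in range(k)]
        g = [(rewards[a] - baseline) * s for s in score]
        for i in range(k):
            mean[i] += pi[a] * g[i]
        second_moment += pi[a] * sum(x * x for x in g)
    # Total variance = E[|g|^2] - |E[g]|^2 (trace of the covariance).
    variance = second_moment - sum(x * x for x in mean)
    return mean, variance

logits = [0.0, 0.0]      # uniform policy over two actions
rewards = [10.0, 11.0]   # made-up rewards
mean_no_b, var_no_b = gradient_mean_and_variance(logits, rewards, 0.0)
avg_reward = sum(p * r for p, r in zip(softmax(logits), rewards))
mean_b, var_b = gradient_mean_and_variance(logits, rewards, avg_reward)
```

In this example both estimators have the same mean, because the expected score `sum_a pi(a) * grad log pi(a)` is zero, while the variance drops sharply once the baseline is subtracted.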
|
156 |
All learning is local: Multi-agent learning in global reward games
Chang, Yu-Han, Ho, Tracey, Kaelbling, Leslie P. 01 1900
In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and effectively learn a near-optimal policy in a wide variety of settings. A sequence of increasingly complex empirical tests verifies the efficacy of this technique. / Singapore-MIT Alliance (SMA)
|
157 |
Importance Sampling for Reinforcement Learning with Multiple Objectives
Shelton, Christian Robert 01 August 2001
This thesis considers three complications that arise from applying reinforcement learning to a real-world application. In the process of using reinforcement learning to build an adaptive electronic market-maker, we find the sparsity of data, the partial observability of the domain, and the multiple objectives of the agent to cause serious problems for existing reinforcement learning algorithms. We employ importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with little data. Our importance sampling estimator requires no knowledge about the environment and places few restrictions on the method of collecting data. It can be used efficiently with reactive controllers, finite-state controllers, or policies with function approximation. We present theoretical analyses of the estimator and incorporate it into a reinforcement learning algorithm. Additionally, this method provides a complete return surface which can be used to balance multiple objectives dynamically. We demonstrate the need for multiple goals in a variety of applications and natural solutions based on our sampling method. The thesis concludes with example results from applying our algorithm to the domain of automated electronic market-making.
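The likelihood-ratio idea behind importance sampling can be illustrated in its simplest form: rewards collected under one policy are reweighted to estimate the expected reward of another. This is a hedged one-step sketch, not the thesis's estimator, which handles whole trajectories in partially observable domains; the action names, probabilities, rewards, and the function name `importance_sampling_estimate` are assumptions for the example.

```python
def importance_sampling_estimate(samples, behavior, target):
    """Average reward under `target`, estimated from samples drawn under `behavior`."""
    total = 0.0
    for action, reward in samples:
        # Likelihood ratio: how much more (or less) likely the target
        # policy is to take this action than the data-collecting policy.
        weight = target[action] / behavior[action]
        total += weight * reward
    return total / len(samples)

# Hypothetical policies and logged data for a toy market-making choice.
behavior = {"buy": 0.5, "sell": 0.5}
target = {"buy": 0.9, "sell": 0.1}
samples = [("buy", 1.0), ("sell", 0.0), ("buy", 1.0), ("sell", 0.0)]

estimate = importance_sampling_estimate(samples, behavior, target)
same_policy = importance_sampling_estimate(samples, behavior, behavior)
```

Note that the estimator only needs the action probabilities of the two policies, not a model of the environment, which matches the abstract's remark that no knowledge about the environment is required.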
|
158 |
The Essential Dynamics Algorithm: Essential Results
Martin, Martin C. 01 May 2003
This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously does policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
|
159 |
Mobilized ad-hoc networks: A reinforcement learning approach
Chang, Yu-Han, Ho, Tracey, Kaelbling, Leslie Pack 04 December 2003
Research in mobile ad-hoc networks has focused on situations in which nodes have no control over their movements. We investigate an important but overlooked domain in which nodes do have control over their movements. Reinforcement learning methods can be used to control both packet routing decisions and node mobility, dramatically improving the connectivity of the network. We first motivate the problem by presenting theoretical bounds for the connectivity improvement of partially mobile networks and then present superior empirical results under a variety of different scenarios in which the mobile nodes in our ad-hoc network are embedded with adaptive routing policies and learned movement policies.
|
160 |
Reinforcement Learning by Policy Search
Peshkin, Leonid 14 February 2003
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy (a mapping of observations into actions) based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
|