151

Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

Aberdeen, Douglas Alexander, doug.aberdeen@anu.edu.au January 2003 (has links)
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs).

In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.

Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.

The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author's knowledge, no other policy-gradient algorithms have performed well at such tasks.

The high variance of Monte-Carlo methods requires lengthy simulation and hence a supercomputer to train agents within a reasonable time. The ANU "Bunyip" Linux cluster was built with such tasks in mind and was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell prize for price/performance in 2001.
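For context, the abstract above describes adjusting parameters in the direction that maximises the long-term average reward, using an eligibility trace of the gradient estimator. Below is a minimal sketch of a GPOMDP-style Monte-Carlo estimator of that gradient, the standard estimator in this line of work; the `env` and `policy` interfaces are assumptions for the sketch, not the thesis's code.

```python
import numpy as np

def gpomdp_gradient(env, policy, theta, beta=0.95, T=100_000, rng=None):
    """Minimal sketch of a GPOMDP-style Monte-Carlo gradient estimate of the
    long-term average reward; `env` and `policy` are hypothetical interfaces.
    beta < 1 discounts the eligibility trace, trading bias for variance."""
    rng = rng or np.random.default_rng(0)
    z = np.zeros_like(theta)   # eligibility trace of score functions
    g = np.zeros_like(theta)   # running gradient estimate
    obs = env.reset()
    for t in range(1, T + 1):
        action = policy.sample(theta, obs, rng)              # draw action from pi_theta(.|obs)
        z = beta * z + policy.grad_log(theta, obs, action)   # accumulate score functions
        obs, reward = env.step(action)
        g += (reward * z - g) / t                            # running average of reward-weighted trace
    return g
```

The high variance mentioned in the abstract comes from the product of rewards with this ever-changing trace, which is what the thesis's filtering and value-function hybrids aim to reduce.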
152

Policy Gradient Methods: Variance Reduction and Stochastic Convergence

Greensmith, Evan, evan.greensmith@gmail.com January 2005 (has links)
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies and, using a policy from the class and a trajectory through the environment taken by the agent under this policy, estimate the gradient of the policy's performance with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance.

In Part I of this thesis we study the estimation variance of policy gradient algorithms, in particular, when augmenting the estimate with a baseline, a common method for reducing estimation variance, and when using actor-critic methods. A baseline adjusts the reward signal supplied by the environment and can be used to reduce the variance of a policy gradient estimate without adding any bias. We find the baseline that minimizes the variance. We also consider the class of constant baselines, and find the constant baseline that minimizes the variance. We compare this to the common technique of adjusting the rewards by an estimate of the performance measure. Actor-critic methods usually attempt to learn a value function accurate enough to be used in a gradient estimate without adding much bias. In this thesis we propose that in learning the value function we should also consider the variance. We show how considering the variance of the gradient estimate when learning a value function can be beneficial, and we introduce a new optimization criterion for selecting a value function.

In Part II of this thesis we consider online versions of policy gradient algorithms, where we update our policy for selecting actions at each step in time, and study the convergence of these online algorithms. For such online gradient-based algorithms, convergence results aim to show that the gradient of the performance measure approaches zero. Such a result has been shown for an algorithm based on observing trajectories between visits to a special state of the environment. However, that algorithm is not suitable in a partially observable setting, where we are unable to access the full state of the environment, and its variance depends on the time between visits to the special state, which may be large even when only a few samples are needed to estimate the gradient. To date, convergence results for algorithms that do not rely on a special state are weaker. We show that, for a certain algorithm that does not rely on a special state, the gradient of the performance measure approaches zero. We show that this continues to hold when using certain baseline algorithms suggested by the results of Part I.
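For context on the baseline technique discussed above, the standard identity (not specific to this thesis) shows why subtracting a baseline b from the return leaves the gradient estimate unbiased, so that b can be chosen purely to minimise variance:

```latex
\[
  \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\bigl[\, b \,\nabla_\theta \log \pi_\theta(a \mid s) \bigr]
  = b \sum_{a} \nabla_\theta \pi_\theta(a \mid s)
  = b \,\nabla_\theta \sum_{a} \pi_\theta(a \mid s)
  = b \,\nabla_\theta 1 = 0,
\]
\[
  \widehat{\nabla_\theta \eta} = (R - b)\,\nabla_\theta \log \pi_\theta(a \mid s),
  \qquad
  \mathbb{E}\bigl[\widehat{\nabla_\theta \eta}\bigr]
  = \mathbb{E}\bigl[R\,\nabla_\theta \log \pi_\theta(a \mid s)\bigr].
\]
```

Choosing the b (or the value function, in the actor-critic case) that minimises the variance of this estimator is precisely the optimisation studied in Part I.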
153

All learning is local: Multi-agent learning in global reward games

Chang, Yu-Han, Ho, Tracey, Kaelbling, Leslie P. 01 1900 (has links)
In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and effectively learn a near-optimal policy in a wide variety of settings. A sequence of increasingly complex empirical tests verifies the efficacy of this technique. / Singapore-MIT Alliance (SMA)
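A hedged sketch of the idea: treat the observed global reward as the agent's local contribution plus a slowly drifting term for everything the other agents do, and track that drifting term with a scalar Kalman filter. The variable names and noise parameters below are illustrative assumptions, not the authors' notation.

```python
import numpy as np

def kalman_credit_signal(global_rewards, q_drift=0.1, r_local=1.0):
    """Illustrative sketch, not the authors' exact formulation: a scalar Kalman
    filter tracks the drifting 'rest of the world' component of the global
    reward; subtracting it recovers an approximate local training signal."""
    b_hat, p = 0.0, 1.0                    # estimate of the drifting term and its variance
    local = []
    for g in global_rewards:
        p += q_drift                       # predict: the drift follows a random walk
        k = p / (p + r_local)              # Kalman gain
        b_hat += k * (g - b_hat)           # correct with the observed global reward
        p *= (1.0 - k)
        local.append(g - b_hat)            # credit assigned to this agent at this step
    return np.array(local)
```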
154

Importance Sampling for Reinforcement Learning with Multiple Objectives

Shelton, Christian Robert 01 August 2001 (has links)
This thesis considers three complications that arise from applying reinforcement learning to a real-world application. In the process of using reinforcement learning to build an adaptive electronic market-maker, we find that the sparsity of data, the partial observability of the domain, and the multiple objectives of the agent cause serious problems for existing reinforcement learning algorithms. We employ importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with little data. Our importance sampling estimator requires no knowledge about the environment and places few restrictions on the method of collecting data. It can be used efficiently with reactive controllers, finite-state controllers, or policies with function approximation. We present theoretical analyses of the estimator and incorporate it into a reinforcement learning algorithm. Additionally, this method provides a complete return surface which can be used to balance multiple objectives dynamically. We demonstrate the need for multiple goals in a variety of applications and natural solutions based on our sampling method. The thesis concludes with example results from applying our algorithm to the domain of automated electronic market-making.
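As background, a likelihood-ratio (importance sampling) estimate of a policy's return reweights trajectories collected under other policies by how likely the evaluated policy would have been to produce them. The sketch below illustrates the generic estimator; the function names and trajectory format are assumptions for the sketch, not the thesis's API.

```python
import numpy as np

def importance_sampled_return(trajectories, log_pi_eval, log_pi_behavior):
    """Hedged sketch of a likelihood-ratio estimate of a policy's expected return
    from off-policy data.  Assumed interfaces:
      log_pi_eval(obs, act)        -> log prob of act under the policy being evaluated
      log_pi_behavior(i, obs, act) -> log prob under the policy that collected trajectory i
    Each trajectory is (list of (obs, act) pairs, total_return)."""
    weighted, weights = [], []
    for i, (steps, ret) in enumerate(trajectories):
        log_w = sum(log_pi_eval(o, a) - log_pi_behavior(i, o, a) for o, a in steps)
        w = np.exp(log_w)
        weighted.append(w * ret)
        weights.append(w)
    unnormalized = np.mean(weighted)                   # unbiased but can have high variance
    normalized = np.sum(weighted) / np.sum(weights)    # self-normalised variant, usually lower variance
    return unnormalized, normalized
```

Because the estimate can be evaluated for any candidate policy from the same fixed data, sweeping it over the parameter space yields the return surface the abstract refers to.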
155

The Essential Dynamics Algorithm: Essential Results

Martin, Martin C. 01 May 2003 (has links)
This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously does policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
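One natural reading of the transform described above, shown purely as an illustrative sketch and not as the paper's exact construction, is to evaluate policies by rolling the state forward through the expected dynamics rather than sampling transitions; all interfaces below are assumptions.

```python
def essential_rollout(mean_dynamics, reward, policy, s0, horizon):
    """Illustrative sketch only: evaluate a policy on a deterministic surrogate of
    a stochastic MDP by stepping through mean_dynamics(s, a) ~ E[s' | s, a]
    instead of sampling next states."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)
        total += reward(s, a)
        s = mean_dynamics(s, a)   # deterministic step through the mean dynamics
    return total
```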
156

Mobilized ad-hoc networks: A reinforcement learning approach

Chang, Yu-Han, Ho, Tracey, Kaelbling, Leslie Pack 04 December 2003 (has links)
Research in mobile ad-hoc networks has focused on situations in which nodes have no control over their movements. We investigate an important but overlooked domain in which nodes do have control over their movements. Reinforcement learning methods can be used to control both packet routing decisions and node mobility, dramatically improving the connectivity of the network. We first motivate the problem by presenting theoretical bounds for the connectivity improvement of partially mobile networks and then present superior empirical results under a variety of different scenarios in which the mobile nodes in our ad-hoc network are embedded with adaptive routing policies and learned movement policies.
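For the packet-routing half of the problem, one natural way to apply reinforcement learning is a Q-routing style update in the spirit of Boyan and Littman; the sketch below is illustrative and not necessarily the authors' exact update rule.

```python
def q_routing_update(Q, node, dest, next_hop, queue_delay, link_delay, alpha=0.5):
    """Hedged sketch of a Q-routing style update.  Q[node][dest][next_hop]
    estimates the remaining delivery time of a packet for `dest` when `node`
    forwards it via `next_hop`; the table layout is an assumption."""
    # best remaining estimate from the neighbour's own table (0 if it is the destination)
    remaining = 0.0 if next_hop == dest else min(Q[next_hop][dest].values())
    target = queue_delay + link_delay + remaining
    Q[node][dest][next_hop] += alpha * (target - Q[node][dest][next_hop])
    return Q
```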
157

Reinforcement Learning by Policy Search

Peshkin, Leonid 14 February 2003 (has links)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn them. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy, a mapping of observations into actions, based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have memory. We investigate controllers with memory, including controllers with external memory, finite-state controllers, and distributed controllers for multi-agent systems. For these various controllers we work out the details of algorithms that learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
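To illustrate gradient ascent for a controller with memory, the sketch below gives a likelihood-ratio (REINFORCE-style) episode estimate for a stochastic finite-state controller, where both the action and the internal-memory transition are sampled from parameterised distributions. The `env` and `controller` interfaces are assumptions for the sketch, not the dissertation's code.

```python
import numpy as np

def fsc_episode_gradient(env, controller, theta, horizon, rng):
    """Hedged sketch of a single-episode likelihood-ratio gradient estimate for a
    stochastic finite-state controller; average over many episodes in practice."""
    score_sum = np.zeros_like(theta)
    ret = 0.0
    obs, mem = env.reset(), controller.initial_memory()
    for _ in range(horizon):
        # sample (action, next memory) and get d/dtheta log P(action, next_mem | obs, mem)
        action, next_mem, grad_log = controller.sample(theta, obs, mem, rng)
        score_sum += grad_log
        obs, reward, done = env.step(action)
        ret += reward
        mem = next_mem
        if done:
            break
    return ret * score_sum
```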
158

Reinforcement Learning and Simulation-Based Search in Computer Go

Silver, David 11 1900 (has links)
Learning and planning are two fundamental problems in artificial intelligence. The learning problem can be tackled by reinforcement learning methods, such as temporal-difference learning, which update a value function from real experience, and use function approximation to generalise across states. The planning problem can be tackled by simulation-based search methods, such as Monte-Carlo tree search, which update a value function from simulated experience, but treat each state individually. We introduce a new method, temporal-difference search, that combines elements of both reinforcement learning and simulation-based search methods. In this new method the value function is updated from simulated experience, but it uses function approximation to efficiently generalise across states. We also introduce the Dyna-2 architecture, which combines temporal-difference learning with temporal-difference search. Whereas temporal-difference learning acquires general domain knowledge from its past experience, temporal-difference search acquires local knowledge that is specialised to the agent's current state, by simulating future experience. Dyna-2 combines both forms of knowledge together. We apply our algorithms to the game of 9x9 Go. Using temporal-difference learning, with a million binary features matching simple patterns of stones, and using no prior knowledge except the grid structure of the board, we learnt a fast and effective evaluation function. Using temporal-difference search with the same representation produced a dramatic improvement: without any explicit search tree, and with equivalent domain knowledge, it achieved better performance than a vanilla Monte-Carlo tree search. When combined together using the Dyna-2 architecture, our program outperformed all handcrafted, traditional search, and traditional machine learning programs on the 9x9 Computer Go Server. We also use our framework to extend the Monte-Carlo tree search algorithm. By forming a rapid generalisation over subtrees of the search space, and incorporating heuristic pattern knowledge that was learnt or handcrafted offline, we were able to significantly improve the performance of the Go program MoGo. Using these enhancements, MoGo became the first 9x9 Go program to achieve human master level.
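A hedged sketch of the core idea of temporal-difference search as described above: run simulated episodes from the agent's current (root) state and update a linear value function with TD(lambda), so knowledge generalises across states through shared features rather than being stored per node of a search tree. The `simulator`, `features`, and `policy` interfaces are assumptions for the sketch.

```python
import numpy as np

def td_search(simulator, features, w, root_state, policy,
              n_episodes=1000, alpha=0.01, lam=0.8, rng=None):
    """Hedged sketch of TD(lambda) over simulated experience with a linear value
    function; interfaces and hyperparameters are illustrative assumptions."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_episodes):
        s = root_state
        z = np.zeros_like(w)                            # eligibility trace
        while not simulator.terminal(s):
            a = policy(w, s, rng)
            s_next, r = simulator.step(s, a, rng)
            phi = features(s)
            v_next = 0.0 if simulator.terminal(s_next) else features(s_next) @ w
            delta = r + v_next - phi @ w                # TD error (undiscounted episodic case)
            z = lam * z + phi
            w = w + alpha * delta * z                   # TD(lambda) update
            s = s_next
    return w
```

The contrast with a plain Monte-Carlo tree search is that the simulated experience updates a shared weight vector rather than per-node visit statistics, which is what lets the method generalise across the 9x9 board positions.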
159

Dynamic Tuning of PI-Controllers based on Model-free Reinforcement Learning Methods

Abbasi Brujeni, Lena 06 1900 (has links)
In this thesis, a Reinforcement Learning (RL) method called Sarsa is used to dynamically tune a PI-controller for a Continuous Stirred Tank Heater (CSTH) experimental setup. The proposed approach uses an approximate model to train the RL agent in the simulation environment before implementation on the real plant. This is done to help the RL agent start from a reasonably stable initial policy. Learning without any information about the dynamics of the process is not practically feasible, due to the great amount of data (time) that the RL algorithm requires and due to safety issues. The process in this thesis is modeled with a First Order Plus Time Delay (FOPTD) transfer function, because almost all chemical processes can be sufficiently represented by this class of transfer functions. The presence of a delay term makes this type of transfer function an inherently more complicated model for RL methods. RL methods must be combined with generalization techniques to handle the continuous state space. Here, parameterized quadratic function approximation combined with k-nearest neighbor function approximation is used for the regions close to and far from the origin, respectively. Applying each of these generalization methods separately has some disadvantages, hence their combination is used to overcome these flaws. The proposed RL-based PI-controller is initially trained in the simulation environment. Thereafter, the policy of the simulation-based RL agent is used as the starting policy of the RL agent during implementation on the experimental setup. As a result of the existing plant-model mismatch, the performance of the RL-based PI-controller using this initial policy is not as good as the simulation results; however, training on the real plant results in a significant improvement in performance. The IMC-tuned PI-controllers, which are the most commonly used feedback controllers, are also compared, and they likewise degrade because of the inevitable plant-model mismatch. To improve the performance of these IMC-tuned PI-controllers, re-tuning based on a more precise model of the process is necessary. The experimental tests are carried out for the cases of set-point tracking and disturbance rejection. In both cases, the successful adaptability of the RL-based PI-controller is clearly evident. Finally, when a disturbance enters the process, the performance of the proposed model-free self-tuning PI-controller degrades more than that of the existing IMC controllers. However, the adaptability of the RL-based PI-controller provides a good solution to this problem: after being trained to handle disturbances in the process, an improved control policy is obtained, which is able to successfully return the output to the set-point. / Process Control
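For readers unfamiliar with Sarsa, the update at the heart of this kind of approach is sketched below in tabular form; the state/action encodings and hyperparameters are assumptions for the sketch, and the thesis combines Sarsa with function approximation rather than a plain table.

```python
def sarsa_update(Q, state, gains, reward, next_state, next_gains,
                 alpha=0.1, gamma=0.95):
    """Hedged sketch of the tabular Sarsa update: here the 'action' is a choice of
    PI gains (e.g. a (Kc, tau_I) pair), the state is derived from the measured
    control error, and the reward penalises poor tracking."""
    td_target = reward + gamma * Q[(next_state, next_gains)]
    Q[(state, gains)] += alpha * (td_target - Q[(state, gains)])
    return Q
```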
160

RELPH: A Computational Model for Human Decision Making

Mohammadi Sepahvand, Nazanin January 2013 (has links)
The updating process, which consists of building mental models and adapting them to changes occurring in the environment, is impaired in neglect patients. A simple rock-paper-scissors experiment was conducted in our lab to examine updating impairments in neglect patients. The results of this experiment demonstrate a significant difference between the performance of healthy and brain-damaged participants: while healthy controls did not show any difficulty learning the computer's strategy, right brain-damaged patients failed to learn it. A computational modeling approach is employed to help us better understand the reason behind this difference, and thus to learn more about the updating process in healthy people and its impairment in right brain-damaged patients. More broadly, we hope to learn about the nature of the updating process in general. The hope is also that knowing what must be changed in the model to "brain-damage" it can shed light on the updating deficit in right brain-damaged patients. To do so, I adapted a pattern-detection method named "ELPH" into a reinforcement-learning model of human decision making called "RELPH". This model is capable of capturing the behavior of both healthy and right brain-damaged participants in our task, according to our defined measures. Indeed, this thesis is an effort to discuss the possible differences among these groups using this computational model.
