161
All learning is local: Multi-agent learning in global reward games
Chang, Yu-Han; Ho, Tracey; Kaelbling, Leslie P. 01 1900 (has links)
In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and effectively learn a near-optimal policy in a wide variety of settings. A sequence of increasingly complex empirical tests verifies the efficacy of this technique. / Singapore-MIT Alliance (SMA)
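The core trick this abstract points to, as far as it can be reconstructed from the summary alone, is to treat the observed global reward as the agent's own contribution plus a slowly drifting term produced by the other agents, and to strip that drift out with a Kalman filter so the residual can serve as a local training signal. A minimal sketch in Python; the random-walk noise model, the variable names, and the hyperparameters are assumptions of this sketch rather than details taken from the thesis.

    class RewardFilter:
        """Scalar Kalman filter over the drifting part of the global reward.

        Assumed model: observed global reward g_t = r_t + b_t, where r_t is the
        agent's own reward (treated as observation noise with variance r) and
        b_t, the other agents' contribution, follows a random walk with variance q.
        """
        def __init__(self, q=0.1, r=1.0):
            self.b = 0.0   # estimate of the other agents' contribution
            self.p = 1.0   # variance of that estimate
            self.q, self.r = q, r

        def personal_reward(self, global_reward):
            self.p += self.q                    # predict: the bias drifts
            k = self.p / (self.p + self.r)      # Kalman gain
            residual = global_reward - self.b   # what the bias does not explain
            self.b += k * residual              # correct the bias estimate
            self.p *= (1.0 - k)
            return residual                     # filtered local training signal

The filtered signal would then drive an ordinary value update, e.g. Q[s, a] += alpha * (filt.personal_reward(g) + gamma * max(Q[s2]) - Q[s, a]).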
162
Importance Sampling for Reinforcement Learning with Multiple Objectives
Shelton, Christian Robert 01 August 2001 (has links)
This thesis considers three complications that arise from applying reinforcement learning to a real-world application. In the process of using reinforcement learning to build an adaptive electronic market-maker, we find the sparsity of data, the partial observability of the domain, and the multiple objectives of the agent to cause serious problems for existing reinforcement learning algorithms. We employ importance sampling (likelihood ratios) to achieve good performance in partially observable Markov decision processes with limited data. Our importance sampling estimator requires no knowledge about the environment and places few restrictions on the method of collecting data. It can be used efficiently with reactive controllers, finite-state controllers, or policies with function approximation. We present theoretical analyses of the estimator and incorporate it into a reinforcement learning algorithm. Additionally, this method provides a complete return surface, which can be used to balance multiple objectives dynamically. We demonstrate the need for multiple goals in a variety of applications and present natural solutions based on our sampling method. The thesis concludes with example results from applying our algorithm to the domain of automated electronic market-making.
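The estimator sketched below is a standard weighted importance-sampling estimate of a policy's return from off-policy trajectories, which is the general technique the abstract names; the data format and the weighted normalisation are assumptions of this sketch, not necessarily the thesis' exact estimator.

    def is_return_estimate(trajectories, target_policy, behavior_policy):
        """Weighted importance-sampling estimate of the target policy's return.

        Each trajectory is a list of (observation, action, reward) tuples
        collected under behavior_policy; both policies map an (observation,
        action) pair to the probability of taking that action.
        """
        weighted_sum, weight_total = 0.0, 0.0
        for trajectory in trajectories:
            likelihood_ratio, episode_return = 1.0, 0.0
            for obs, act, rew in trajectory:
                likelihood_ratio *= target_policy(obs, act) / behavior_policy(obs, act)
                episode_return += rew
            weighted_sum += likelihood_ratio * episode_return
            weight_total += likelihood_ratio
        return weighted_sum / weight_total if weight_total else 0.0

Evaluating many candidate policies against the same fixed set of trajectories in this way is what yields the "return surface" the abstract refers to, which can then be searched while trading off several objectives.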
163
The Essential Dynamics Algorithm: Essential Results
Martin, Martin C. 01 May 2003 (has links)
This paper presents a novel algorithm, which trades accuracy for speed, for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously performs policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle-riding domain are presented. The algorithm learns near-optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
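Reading the abstract, the payoff of the deterministic transform is that a policy can be scored with a single model rollout instead of many stochastic ones. A rough sketch of that evaluation step; the interfaces (a learned point-prediction model, a deterministic policy function) are assumptions of this sketch.

    def rollout_value(model, policy, start_state, horizon, gamma=0.99):
        """Score a policy by one rollout of a learned deterministic model.

        `model(s, a)` returns a point prediction (next_state, reward), e.g. the
        estimated mean of the stochastic dynamics; because the model is
        deterministic, a single rollout gives a repeatable value that a policy
        search can hill-climb on.
        """
        state, total, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward = model(state, action)
            total += discount * reward
            discount *= gamma
        return total

An online variant would refit the model from the latest real transitions and keep whichever policy perturbation raises this rollout score.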
164
Mobilized ad-hoc networks: A reinforcement learning approach
Chang, Yu-Han; Ho, Tracey; Kaelbling, Leslie Pack 04 December 2003 (has links)
Research in mobile ad-hoc networks has focused on situations in which nodes have no control over their movements. We investigate an important but overlooked domain in which nodes do have control over their movements. Reinforcement learning methods can be used to control both packet routing decisions and node mobility, dramatically improving the connectivity of the network. We first motivate the problem by presenting theoretical bounds for the connectivity improvement of partially mobile networks, and then present superior empirical results under a variety of scenarios in which the mobile nodes in our ad-hoc network are equipped with adaptive routing policies and learned movement policies.
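For the routing half of the problem, a Q-routing style update is one natural reading of "reinforcement learning methods ... control packet routing decisions"; the sketch below is that generic update, not necessarily the thesis' exact algorithm, and the table layout is an assumption.

    def q_routing_update(Q, node, dest, next_hop, hop_delay, neighbor_best, alpha=0.1):
        """Q[node][dest][next_hop] estimates the remaining delivery time via next_hop.

        When a packet is forwarded, the neighbour reports neighbor_best, its own
        best estimate of the time still needed to reach dest; together with the
        observed hop_delay this gives the bootstrapped target.
        """
        target = hop_delay + neighbor_best
        Q[node][dest][next_hop] += alpha * (target - Q[node][dest][next_hop])

A movement policy for the controllable nodes could be learned with the same kind of update, using a reward that reflects how well the network stays connected.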
165
Reinforcement Learning by Policy Search
Peshkin, Leonid 14 February 2003 (has links)
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy---a mapping of observations into actions---based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
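The algorithms described here ascend the gradient of expected cumulative reinforcement; for a memoryless (reactive) softmax controller, that gradient step reduces to the familiar REINFORCE form. A minimal sketch under that assumption: a tabular softmax policy over discrete observations (memory-based controllers would replace the observation with an observation/memory-state pair).

    import numpy as np

    def softmax_policy(theta, obs):
        """Action probabilities of a stochastic reactive controller.

        theta is a (num_observations, num_actions) table of preferences."""
        preferences = theta[obs]
        exp_prefs = np.exp(preferences - preferences.max())
        return exp_prefs / exp_prefs.sum()

    def reinforce_update(theta, episode, learning_rate=0.01):
        """One ascent step on expected cumulative reinforcement.

        `episode` is a list of (obs, action, reward) tuples obtained by
        sampling actions from softmax_policy."""
        cumulative_reward = sum(r for _, _, r in episode)
        grad = np.zeros_like(theta)
        for obs, action, _ in episode:
            probs = softmax_policy(theta, obs)
            grad[obs] -= probs               # d/dtheta log pi(action | obs) ...
            grad[obs, action] += 1.0         # ... for a softmax parameterisation
        return theta + learning_rate * cumulative_reward * grad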
166
Reinforcement Learning and Simulation-Based Search in Computer Go
Silver, David 11 1900 (has links)
Learning and planning are two fundamental problems in artificial intelligence. The learning problem can be tackled by reinforcement learning methods, such as temporal-difference learning, which update a value function from real experience, and use function approximation to generalise across states. The planning problem can be tackled by simulation-based search methods, such as Monte-Carlo tree search, which update a value function from simulated experience, but treat each state individually. We introduce a new method, temporal-difference search, that combines elements of both reinforcement learning and simulation-based search methods. In this new method the value function is updated from simulated experience, but it uses function approximation to efficiently generalise across states. We also introduce the Dyna-2 architecture, which combines temporal-difference learning with temporal-difference search. Whereas temporal-difference learning acquires general domain knowledge from its past experience, temporal-difference search acquires local knowledge that is specialised to the agent's current state, by simulating future experience. Dyna-2 combines both forms of knowledge together.
We apply our algorithms to the game of 9x9 Go. Using temporal-difference learning, with a million binary features matching simple patterns of stones, and using no prior knowledge except the grid structure of the board, we learnt a fast and effective evaluation function. Using temporal-difference search with the same representation produced a dramatic improvement: without any explicit search tree, and with equivalent domain knowledge, it achieved better performance than a vanilla Monte-Carlo tree search. When combined together using the Dyna-2 architecture, our program outperformed all handcrafted, traditional search, and traditional machine learning programs on the 9x9 Computer Go Server.
We also use our framework to extend the Monte-Carlo tree search algorithm. By forming a rapid generalisation over subtrees of the search space, and incorporating heuristic pattern knowledge that was learnt or handcrafted offline, we were able to significantly improve the performance of the Go program MoGo. Using these enhancements, MoGo became the first 9x9 Go program to achieve human master level.
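A compact way to see how temporal-difference search differs from plain Monte-Carlo tree search is that the simulated experience updates a shared, feature-based value function rather than per-node statistics. The sketch below shows that inner loop; the `simulate` and `features` interfaces, and the use of TD(0), are assumptions of this sketch rather than the thesis' exact formulation.

    def td_search(simulate, features, weights, root_state, num_episodes,
                  alpha=0.01, gamma=1.0):
        """Run simulated episodes from the agent's current (root) state and
        apply TD(0) updates to a linear value function over binary features,
        so value estimates generalise across the states met in simulation.

        `simulate(s)` samples (next_state, reward, done) under the current
        simulation policy; `features(s)` returns the indices of the active
        binary features of state s."""
        for _ in range(num_episodes):
            state, done = root_state, False
            while not done:
                next_state, reward, done = simulate(state)
                value = sum(weights[f] for f in features(state))
                next_value = 0.0 if done else sum(weights[f] for f in features(next_state))
                td_error = reward + gamma * next_value - value
                for f in features(state):
                    weights[f] += alpha * td_error   # binary feature: gradient is 1
                state = next_state
        return weights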
167
Dynamic Tuning of PI-Controllers based on Model-free Reinforcement Learning Methods
Abbasi Brujeni, Lena 06 1900 (has links)
In this thesis, a Reinforcement Learning (RL) method called Sarsa is used to dynamically tune a PI-controller for a Continuous Stirred Tank Heater (CSTH) experimental setup. The proposed approach uses an approximate model to train the RL agent in a simulation environment before implementation on the real plant, so that the agent starts from a reasonably stable initial policy. Learning without any information about the process dynamics is not practically feasible, both because of the large amount of data (and hence time) the RL algorithm requires and because of safety concerns.
The process in this thesis is modeled with a First Order Plus Time Delay (FOPTD) transfer function, because almost all chemical processes can be adequately represented by this class of transfer functions. The delay term in these transfer functions makes them inherently more challenging models for RL methods.
RL methods should be combined with generalization techniques to handle the continuous state space. Here, parameterized quadratic function approximation is combined with k-nearest-neighbour function approximation, used for the regions close to and far from the origin, respectively. Applying either generalization method on its own has drawbacks, so their combination is used to overcome these flaws.
The proposed RL-based PI-controller is initially trained in the simulation environment. Thereafter, the policy of the simulation-based RL agent is used as the starting policy of the RL agent during implementation on the experimental setup. Because of the plant-model mismatch, the performance of the RL-based PI-controller under this initial policy is not as good as the simulation results; however, training on the real plant yields a significant improvement. The IMC-tuned PI-controllers, which are among the most commonly used feedback controllers, are compared as well, and their performance likewise degrades because of the inevitable plant-model mismatch. Improving these IMC-tuned PI-controllers would require re-tuning them based on a more precise model of the process.
The experimental tests are carried out for the cases of set-point tracking and disturbance rejection. In both cases, the successful adaptability of the RL-based PI-controller is clearly evident.
Finally, when a disturbance enters the process, the performance of the proposed model-free self-tuning PI-controller degrades more than that of the existing IMC controllers. However, the adaptability of the RL-based PI-controller provides a good solution to this problem: after being trained to handle process disturbances, the agent obtains an improved control policy that successfully returns the output to the set-point. / Process Control
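The abstract does not spell out the learning loop, but a Sarsa-based gain tuner can be sketched as follows: the state encodes the tracking error, each action selects a (Kp, Ki) pair, and the reward penalises the resulting error. Those choices, and the tabular representation, are assumptions of this sketch (the thesis uses function approximation rather than a plain table).

    def pi_output(kp, ki, error, integral, dt):
        """Standard PI control law with the gains currently chosen by the agent."""
        integral += error * dt
        return kp * error + ki * integral, integral

    def sarsa_update(Q, state, action, reward, next_state, next_action,
                     alpha=0.1, gamma=0.95):
        """On-policy Sarsa update of the action-value table used to pick gains."""
        td_target = reward + gamma * Q[(next_state, next_action)]
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])

In use, Q would typically be a collections.defaultdict(float), with the selected gains applied to the plant through pi_output between updates.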
168
RELPH: A Computational Model for Human Decision Making
Mohammadi Sepahvand, Nazanin January 2013 (has links)
The updating process, which consists of building mental models and adapting them to changes occurring in the environment, is impaired in neglect patients. A simple rock-paper-scissors experiment was conducted in our lab to examine updating impairments in these patients. The results demonstrate a significant difference between the performance of healthy and brain-damaged participants: while healthy controls had no difficulty learning the computer's strategy, right-brain-damaged patients failed to learn it. A computational modeling approach is employed to help us better understand the reason behind this difference and thus learn more about the updating process in healthy people and its impairment in right-brain-damaged patients. More broadly, we hope to learn about the nature of the updating process in general, and that knowing what must be changed in the model to "brain-damage" it can shed light on the updating deficit in right-brain-damaged patients. To do so, I adapted a pattern-detection method named "ELPH" into a reinforcement-learning model of human decision making called "RELPH". This model is capable of capturing the behavior of both healthy and right-brain-damaged participants in our task, according to our defined measures. This thesis is an effort to characterize the possible differences among these groups using this computational model.
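To make the modelling task concrete: an agent that learns to exploit the computer's strategy in rock-paper-scissors must predict the opponent's next move from recent history and play its counter. The toy sketch below does only that generic prediction; it is not the ELPH or RELPH model itself, and the fixed history length and simple counting scheme are assumptions of this sketch.

    from collections import defaultdict

    BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

    def play_round(history, counts, opponent_move, k=2):
        """Predict the opponent's move from the last k moves, answer with its
        counter, then record what the opponent actually played.

        `history` is the list of the opponent's past moves and `counts` a
        defaultdict(dict) mapping a length-k context to move frequencies.
        The choice uses only moves made before this round."""
        context = tuple(history[-k:])
        if counts[context]:
            predicted = max(counts[context], key=counts[context].get)
        else:
            predicted = "rock"                      # no pattern observed yet
        my_move = BEATS[predicted]
        counts[context][opponent_move] = counts[context].get(opponent_move, 0) + 1
        history.append(opponent_move)
        return my_move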
169
Modeling, Analysis and Control of Nonlinear Switching Systems
Kaisare, Niket S. 22 December 2004 (has links)
The first part of this two-part thesis examines the reverse-flow operation of auto-thermal methane reforming in a microreactor. A theoretical study is undertaken to explain the physical origins of the experimentally observed improvements in the performance of the reverse-flow operation compared to the unidirectional operation. First, a scaling analysis is presented to understand the effect of the various time scales existing within the microreactor and to obtain guidelines for optimal reverse-flow operation. Then, the effect of kinetic parameters, transport properties, reactor design and operating conditions on the reactor operation is studied parametrically through numerical simulations. The reverse-flow operation is shown to be more robust than the unidirectional operation with respect to both the optimal operating conditions and variations in hydrogen throughput requirements. A rational scheme for improved catalyst placement in the microreactor, which exploits the spatial temperature profiles in the reactor, is also presented. Finally, a design modification of the microreactor, called the "opposed-flow" reactor, is suggested; it retains the performance benefits of the reverse-flow operation without requiring input/output port switching.
In the second part of this thesis, a novel simulation-based Approximate Dynamic Programming (ADP) framework is presented for optimal control of switching between multiple metabolic states in a microbial bioreactor. The cybernetic modeling framework is used to capture these cellular metabolic switches. Model Predictive Control (MPC), one of the most popular advanced control methods, is able to drive the reactor to the desired steady state; however, the nonlinearity and switching nature of the system cause computational and performance problems for MPC. The proposed ADP approach has an advantage over MPC, as the closed-loop optimal policy is computed offline in the form of a so-called value or cost-to-go function. Through an approximation of the value function, the infinite-horizon problem is converted into an equivalent single-stage problem that can be solved online. Various issues in the implementation of ADP are also addressed.
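The second paragraph describes the usual two-phase structure of simulation-based ADP: an offline pass that fits a cost-to-go (value) function over sampled states, and an online controller that only has to solve a one-stage problem against that function. A rough sketch of both phases; the model, approximator, and discounting interfaces are assumptions of this sketch, not the thesis' exact formulation.

    def fit_cost_to_go(sample_states, model, approximator, actions,
                       num_sweeps=50, gamma=0.98):
        """Offline phase: repeated Bellman backups over sampled states,
        refitting the function approximator after each sweep.

        `model(x, u)` returns (next_state, stage_cost); `approximator` exposes
        predict(state) and fit(states, targets)."""
        for _ in range(num_sweeps):
            targets = []
            for x in sample_states:
                backups = []
                for u in actions:
                    next_x, cost = model(x, u)
                    backups.append(cost + gamma * approximator.predict(next_x))
                targets.append(min(backups))          # minimise cost-to-go
            approximator.fit(sample_states, targets)
        return approximator

    def online_control(x, model, approximator, actions, gamma=0.98):
        """Online phase: the infinite-horizon problem collapses to a
        single-stage lookahead against the learned cost-to-go function."""
        def one_stage_cost(u):
            next_x, cost = model(x, u)
            return cost + gamma * approximator.predict(next_x)
        return min(actions, key=one_stage_cost)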
170
Development and evaluation of an arterial adaptive traffic signal control system using reinforcement learning
Xie, Yuanchang 15 May 2009 (has links)
This dissertation develops and evaluates a new adaptive traffic signal control system for arterials. This control system is based on reinforcement learning, which is an important research area in distributed artificial intelligence and has been extensively used in many applications including real-time control.

In this dissertation, a systematic comparison between reinforcement learning control methods and existing adaptive traffic control methods is first presented from a theoretical perspective. This comparison shows both the connections between them and the benefits of using reinforcement learning. A Neural-Fuzzy Actor-Critic Reinforcement Learning (NFACRL) method is then introduced for traffic signal control. NFACRL integrates fuzzy logic and neural networks into reinforcement learning and can better handle the curse of dimensionality and generalization problems associated with ordinary reinforcement learning methods.

This NFACRL method is first applied to isolated intersection control. Two different implementation schemes are considered. The first scheme uses a fixed phase sequence and variable cycle length, while the second one optimizes the phase sequence in real time and is not constrained by the concept of a cycle. Both schemes are further extended for arterial control, with each intersection being controlled by one NFACRL controller. Different strategies for coordinating reinforcement learning controllers are reviewed, and a simple but robust method is adopted for coordinating traffic signals along the arterial.

The proposed NFACRL control system is tested at both the isolated intersection and arterial levels based on VISSIM simulation. The testing is conducted under different traffic volume scenarios using real-world traffic data collected during morning, noon, and afternoon peak periods. The performance of the NFACRL control system is compared with that of optimized pre-timed and actuated control.

Testing results based on VISSIM simulation show that the proposed NFACRL control has very promising performance. It outperforms optimized pre-timed and actuated control in most cases for both isolated intersection and arterial control. At the end of this dissertation, issues regarding how to further improve the NFACRL method and implement it in the real world are discussed.
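A bare-bones reading of the NFACRL idea is an actor-critic update in which the fuzzy rule membership degrees act as the feature vector shared by a linear critic and a per-phase actor. The sketch below shows one such update; the feature encoding, array shapes, and learning rates are assumptions of this sketch rather than the dissertation's exact formulation.

    import numpy as np

    def choose_phase(memberships, actor_weights):
        """Pick the signal phase with the highest fuzzy-weighted preference.

        `memberships` is the vector of fuzzy rule activation degrees computed
        from traffic measurements (queue lengths, elapsed green time, ...);
        `actor_weights` has shape (num_phases, num_features)."""
        return int(np.argmax(actor_weights @ memberships))

    def actor_critic_update(memberships, next_memberships, phase, reward,
                            critic_weights, actor_weights,
                            alpha_critic=0.05, alpha_actor=0.01, gamma=0.95):
        """One TD-error-driven update of the critic (state value) and of the
        actor's preference for the phase that was just displayed."""
        value = critic_weights @ memberships
        next_value = critic_weights @ next_memberships
        td_error = reward + gamma * next_value - value
        critic_weights += alpha_critic * td_error * memberships
        actor_weights[phase] += alpha_actor * td_error * memberships
        return critic_weights, actor_weights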