1 |
Proactive Planning through Active Policy Inference in Stochastic EnvironmentsPoulin, Nolan 01 May 2018 (has links)
In multi-agent Markov Decision Processes, a controllable agent must perform optimal planning in a dynamic and uncertain environment that includes another unknown and uncontrollable agent. Given a task specification for the controllable agent, its ability to complete the task can be impeded by an inaccurate model of the intent and behaviors of other agents. In this work, we introduce an active policy inference algorithm that allows a controllable agent to infer a policy of the environmental agent through interaction. Active policy inference is data-efficient and is particularly useful when data are time-consuming or costly to obtain. The controllable agent synthesizes an exploration-exploitation policy that incorporates the knowledge learned about the environment's behavior. Whenever possible, the agent also tries to elicit behavior from the other agent to improve the accuracy of the environmental model. This is done by mapping the uncertainty in the environmental model to a bonus reward, which helps elicit the most informative exploration, and allows the controllable agent to return to its main task as fast as possible. Experiments demonstrate the improved sample efficiency of active learning and the convergence of the policy for the controllable agents.
|
2 |
Policy-Gradient Algorithms for Partially Observable Markov Decision ProcessesAberdeen, Douglas Alexander, doug.aberdeen@anu.edu.au January 2003 (has links)
Partially observable Markov decision processes are interesting because
of their ability to model most conceivable real-world learning
problems, for example, robot navigation, driving a car, speech
recognition, stock trading, and playing games. The downside of this
generality is that exact algorithms are computationally
intractable. Such computational complexity motivates approximate
approaches. One such class of algorithms are the so-called
policy-gradient methods from reinforcement learning. They seek to
adjust the parameters of an agent in the direction that maximises the
long-term average of a reward signal. Policy-gradient methods are
attractive as a \emph{scalable} approach for controlling partially
observable Markov decision processes (POMDPs).
¶
In the most general case POMDP policies require some form of internal
state, or memory, in order to act optimally. Policy-gradient methods
have shown promise for problems admitting memory-less policies but have
been less successful when memory is required. This thesis develops
several improved algorithms for learning policies with memory in an
infinite-horizon setting. Directly, when the dynamics of the world
are known, and via Monte-Carlo methods otherwise.
The algorithms simultaneously learn how to act and what to remember.
¶
Monte-Carlo policy-gradient approaches tend to produce gradient
estimates with high variance. Two novel methods for reducing variance
are introduced. The first uses high-order filters to replace the
eligibility trace of the gradient estimator. The second uses a
low-variance value-function method to learn a subset of the parameters
and a policy-gradient method to learn the remainder.
¶
The algorithms are applied to large domains including a simulated
robot navigation scenario, a multi-agent scenario with 21,000 states,
and the complex real-world task of large vocabulary continuous speech
recognition. To the best of the author's knowledge, no other policy-gradient
algorithms have performed well at such tasks.
¶
The high variance of Monte-Carlo methods requires lengthy simulation
and hence a super-computer to train agents within a reasonable time. The ANU
``Bunyip'' Linux cluster was built with such tasks in mind. It was
used for several of the experimental results presented here. One
chapter of this thesis describes an application written for the Bunyip
cluster that won the international Gordon-Bell prize for
price/performance in 2001.
|
3 |
Policy Gradient Methods: Variance Reduction and Stochastic ConvergenceGreensmith, Evan, evan.greensmith@gmail.com January 2005 (has links)
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies, and using a policy from the class, and a trajectory through the environment taken by the agent using this policy, estimate the performance of the policy with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance.¶
In Part I of this thesis we study the estimation variance of policy gradient algorithms, in particular, when augmenting the estimate with a baseline, a common method for reducing estimation variance, and when using actor-critic methods. A baseline adjusts the reward signal supplied by the environment, and can be used to reduce the variance of a policy gradient estimate without adding any bias. We find the baseline that minimizes the variance. We also consider the class of constant baselines, and find the constant baseline that minimizes the variance. We compare this to the common technique of adjusting the rewards by an estimate of the performance measure. Actor-critic methods usually attempt to learn a value function accurate enough to be used in a gradient estimate without adding much bias. In this thesis we propose that in learning the value function we should also consider the variance. We show how considering the variance of the gradient estimate when learning a value function can be beneficial, and we introduce a new optimization criterion for selecting a value function.¶
In Part II of this thesis we consider online versions of policy gradient algorithms, where we update our policy for selecting actions at each step in time, and study the convergence of the these online algorithms. For such online gradient-based algorithms, convergence results aim to show that the gradient of the performance measure approaches zero. Such a result has been shown for an algorithm which is based on observing trajectories between visits to a special state of the environment. However, the algorithm is not suitable in a partially observable setting, where we are unable to access the full state of the environment, and its variance depends on the time between visits to the special state, which may be large even when only few samples are needed to estimate the gradient. To date, convergence results for algorithms that do not rely on a special state are weaker. We show that, for a certain algorithm that does not rely on a special state, the gradient of the performance measure approaches zero. We show that this continues to hold when using certain baseline algorithms suggested by the results of Part I.
|
4 |
Adaptive Curvature for Stochastic OptimizationJanuary 2019 (has links)
abstract: This thesis presents a family of adaptive curvature methods for gradient-based stochastic optimization. In particular, a general algorithmic framework is introduced along with a practical implementation that yields an efficient, adaptive curvature gradient descent algorithm. To this end, a theoretical and practical link between curvature matrix estimation and shrinkage methods for covariance matrices is established. The use of shrinkage improves estimation accuracy of the curvature matrix when data samples are scarce. This thesis also introduce several insights that result in data- and computation-efficient update equations. Empirical results suggest that the proposed method compares favorably with existing second-order techniques based on the Fisher or Gauss-Newton and with adaptive stochastic gradient descent methods on both supervised and reinforcement learning tasks. / Dissertation/Thesis / Masters Thesis Computer Science 2019
|
5 |
DYNAMIC TASK OFFLOADING FOR LATENCY MINIMIZATION IN IOT EDGE-CLOUD ENVIRONMENTSHaimin Ku (12457464) 26 April 2022 (has links)
<p>With the exponential growth and diversity of Internet of Things (IoT) devices, computational-intensive and delay-sensitive applications, such as object detection, smart homes, and smart grids, are emerging constantly. We can adopt the paradigm of cloud computing to offload computation-heavy tasks from IoT devices to a cloud server which can break through the limitation of IoT devices with more powerful resources. However, cloud computing architecture can cause high latency which is not suitable for IoT devices that have limited computing and storage capabilities. Edge computing has been introduced to improve this situation by deploying an edge device nearby IoT devices that can provide IoT devices computing resources with low latency compared to cloud computing. Nevertheless, the edge server may not be able to complete all the offloaded tasks from the devices in time when the requests are flooding. In such cases, the edge server can offload some of the requested tasks to a cloud server to further speed up the offloading process with more powerful cloud resources. In this paper, we aim to minimize the average completion time of tasks in an IoT edge-cloud environment, by optimizing the task offloading ratio from edge to cloud, based on Deep Deterministic Policy Gradient (DDPG), a type of Reinforcement Learning (RL) approach. We propose a dynamic task offloading decision mechanism deployed on the edge that can determine the amounts of computational resources to be processed in the cloud server considering multiple factors to complete a task. Simulation results demonstrate that our dynamic task offloading decision mechanism can improve the overall completion time of tasks than naïve approaches. </p>
|
6 |
Bayesian Reinforcement Learning Methods for Network Intrusion PreventionNesti Lopes, Antonio Frederico January 2021 (has links)
A growing problem in network security stems from the fact that both attack methods and target systems constantly evolve. This problem makes it difficult for human operators to keep up and manage the security problem. To deal with this challenge, a promising approach is to use reinforcement learning to adapt security policies to a changing environment. However, a drawback of this approach is that traditional reinforcement learning methods require a large amount of data in order to learn effective policies, which can be both costly and difficult to obtain. To address this problem, this thesis investigates ways to incorporate prior knowledge in learning systems for network security. Our goal is to be able to learn security policies with less data compared to traditional reinforcement learning algorithms. To investigate this question, we take a Bayesian approach and consider Bayesian reinforcement learning methods as a complement to current algorithms in reinforcement learning. Specifically, in this work, we study the following algorithms: Bayesian Q-learning, Bayesian REINFORCE, and Bayesian Actor-Critic. To evaluate our approach, we have implemented the mentioned algorithms and techniques and applied them to different simulation scenarios of intrusion prevention. Our results demonstrate that the Bayesian reinforcement learning algorithms are able to learn more efficiently compared to their non-Bayesian counterparts but that the Bayesian approach is more computationally demanding. Further, we find that the choice of prior and the kernel function have a large impact on the performance of the algorithms. / Ett växande problem inom cybersäkerhet är att både attackmetoder samt system är i en konstant förändring och utveckling: å ena sidan så blir attackmetoder mer och mer sofistikerade, och å andra sidan så utvecklas system via innovationer samt uppgraderingar. Detta problem gör det svårt för mänskliga operatörer att hantera säkerhetsproblemet. En lovande metod för att hantera denna utmaning är förstärkningslärande. Med förstärkningslärande kan en autonom agent automatiskt lära sig att anpassa säkerhetsstrategier till en föränderlig miljö. En utmaning med detta tillvägagångsätt är dock att traditionella förstärkningsinlärningsmetoder kräver en stor mängd data för att lära sig effektiva strategier, vilket kan vara både kostsamt och svårt att erskaffa. För att lösa detta problem så undersöker denna avhandling Bayesiska metoder för att inkorporera förkunskaper i inlärningsalgoritmen, vilket kan möjliggöra lärande med mindre data. Specifikt så studerar vi följande Bayesiska algoritmer: Bayesian Q-learning, Bayesian REINFORCE och Bayesian Actor- Critic. För att utvärdera vårt tillvägagångssätt har vi implementerat de nämnda algoritmerna och utvärderat deras prestanda i olika simuleringsscenarier för intrångsförebyggande samt analyserat deras komplexitet. Våra resultat visar att de Bayesiska förstärkningsinlärningsalgoritmerna kan användas för att lära sig strategier med mindre data än vad som kravs vid användande av icke-Bayesiska motsvarigheter, men att den Bayesiska metoden är mer beräkningskrävande. Vidare finner vi att metoden för att inkorporera förkunskap i inlärningsalgoritmen, samt val av kernelfunktion, har stor inverkan på algoritmernas prestanda.
|
7 |
Comparison of Modern Controls and Reinforcement Learning for Robust Control of Autonomously Backing Up Tractor-Trailers to Loading DocksMcDowell, Journey 01 November 2019 (has links)
Two controller performances are assessed for generalization in the path following task of autonomously backing up a tractor-trailer. Starting from random locations and orientations, paths are generated to loading docks with arbitrary pose using Dubins Curves. The combination vehicles can be varied in wheelbase, hitch length, weight distributions, and tire cornering stiffness. The closed form calculation of the gains for the Linear Quadratic Regulator (LQR) rely heavily on having an accurate model of the plant. However, real-world applications cannot expect to have an updated model for each new trailer. Finding alternative robust controllers when the trailer model is changed was the motivation of this research.
Reinforcement learning, with neural networks as their function approximators, can allow for generalized control from its learned experience that is characterized by a scalar reward value. The Linear Quadratic Regulator and the Deep Deterministic Policy Gradient (DDPG) are compared for robust control when the trailer is changed. This investigation quantifies the capabilities and limitations of both controllers in simulation using a kinematic model. The controllers are evaluated for generalization by altering the kinematic model trailer wheelbase, hitch length, and velocity from the nominal case.
In order to close the gap from simulation and reality, the control methods are also assessed with sensor noise and various controller frequencies. The root mean squared and maximum errors from the path are used as metrics, including the number of times the controllers cause the vehicle to jackknife or reach the goal. Considering the runs where the LQR did not cause the trailer to jackknife, the LQR tended to have slightly better precision. DDPG, however, controlled the trailer successfully on the paths where the LQR jackknifed. Reinforcement learning was found to sacrifice a short term reward, such as precision, to maximize the future expected reward like reaching the loading dock. The reinforcement learning agent learned a policy that imposed nonlinear constraints such that it never jackknifed, even when it wasn't the trailer it trained on.
|
8 |
Domain Transfer for End-to-end Reinforcement Learning / Domain Transfer for End-to-end Reinforcement LearningOlsson, Anton, Rosberg, Felix January 2020 (has links)
In this master thesis project a LiDAR-based, depth image-based and semantic segmentation image-based reinforcement learning agent is investigated and compared forlearning in simulation and performing in real-time. The project utilize the Deep Deterministic Policy Gradient architecture for learning continuous actions and was designed to control a RC car. One of the first project to deploy an agent in a real scenario after training in a similar simulation. The project demonstrated that with a proper reward function and by tuning driving parameters such as restricting steering, maximum velocity, minimum velocity and performing input data scaling a LiDAR-based agent could drive indefinitely on a simple but completely unseen track in real-time.
|
9 |
Partially Observable Markov Decision Processes for Faster Object RecognitionOlafsson, Björgvin January 2016 (has links)
Object recognition in the real world is a big challenge in the field of computer vision. Given the potentially enormous size of the search space it is essential to be able to make intelligent decisions about where in the visual field to obtain information from to reduce the computational resources needed. In this report a POMDP (Partially Observable Markov Decision Process) learning framework, using a policy gradient method and information rewards as a training signal, has been implemented and used to train fixation policies that aim to maximize the information gathered in each fixation. The purpose of such policies is to make object recognition faster by reducing the number of fixations needed. The trained policies are evaluated by simulation and comparing them with several fixed policies. Finally it is shown that it is possible to use the framework to train policies that outperform the fixed policies for certain observation models.
|
10 |
MODEL-FREE ALGORITHMS FOR CONSTRAINED REINFORCEMENT LEARNING IN DISCOUNTED AND AVERAGE REWARD SETTINGSQinbo Bai (19804362) 07 October 2024 (has links)
<p dir="ltr">Reinforcement learning (RL), which aims to train an agent to maximize its accumulated reward through time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov Decision Process, where the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been greatly studied in the literature, most proposed algorithms are model-based, which requires estimating the transition kernel. To this end, we begin to study the sample efficient model-free algorithms under different settings.</p><p dir="ltr">Firstly, we propose a conservative stochastic primal-dual algorithm in the infinite horizon discounted reward setting. The proposed algorithm converts the original problem from policy space to the occupancy measure space, which makes the non-convex problem linear. Then, we advocate the use of a randomized primal-dual approach to achieve O(\eps^-2) sample complexity, which matches the lower bound.</p><p dir="ltr">However, when it comes to the infinite horizon average reward setting, the problem becomes more challenging since the environment interaction never ends and can’t be reset, which makes reward samples not independent anymore. To solve this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories with an interval between each two of them. Such intervals are long enough so that the reward samples are asymptotically independent. By controlling the length of trajectory and intervals, we obtain a good gradient estimator and prove the proposed algorithm achieves O(T^3/4) regret bound.</p>
|
Page generated in 0.0993 seconds