121 |
Time-normalised discounting in reinforcement learning. Akan, Oguzhan; Waara Ankarstrand, Wilmer. January 2024.
Reinforcement learning has emerged as a powerful paradigm in machine learning, witnessing remarkable progress in recent years. Among reinforcement learning algorithms, Q-learning stands out, enabling agents to learn quickly from past actions. This study aims to investigate and enhance Q-learning methodologies, with a specific focus on tabular Q-learning. In particular, it addresses Q-learning with an action space containing actions that require different amounts of time to execute. With such an action space the algorithm might converge to a suboptimal solution when using a constant discount factor, since discounting occurs per action and not per time step. We refer to this issue as the non-temporal discounting (NTD) problem. By introducing a time-normalised discounting function, we were able to address the issue of NTD. In addition, we were able to stabilise the solution by implementing a cost for specific actions. As a result, the model converged to the expected solution. Building on these results, it would be wise to implement time-normalised discounting in a state-of-the-art reinforcement learning model such as deep Q-learning.
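The idea of time-normalised discounting can be made concrete with a single tabular Q-learning update in which the discount is raised to the power of the action's duration, so that discounting tracks elapsed time rather than the number of decisions. The following is a minimal sketch of that idea under assumed interfaces, not the authors' implementation:

```python
import numpy as np

def q_update_time_normalised(Q, s, a, r, s_next, duration, gamma=0.95, alpha=0.1):
    """One tabular Q-learning step where the discount is applied per unit of
    elapsed time (gamma ** duration) instead of once per action.

    Q        : 2-D array of action values, Q[state, action]
    duration : time the action took to execute, in environment time steps
    """
    target = r + (gamma ** duration) * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# With a constant per-action discount, a slow action and a fast action that
# reach the same state are discounted identically; raising gamma to the
# action's duration removes that bias.
```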
|
122 |
Inverse Reinforcement Learning and Routing Metric Discovery. Shiraev, Dmitry Eric. 01 September 2003.
Uncovering the metrics and procedures employed by an autonomous networking system is an important problem with applications in instrumentation, traffic engineering, and game-theoretic studies of multi-agent environments. This thesis presents a method for utilizing inverse reinforcement learning (IRL) techniques for the purpose of discovering a composite metric used by a dynamic routing algorithm on an Internet Protocol (IP) network. The network and routing algorithm are modeled as a reinforcement learning (RL) agent and a Markov decision process (MDP). The problem of routing metric discovery is then posed as a problem of recovering the reward function, given observed optimal behavior. We show that this approach is empirically suited for determining the relative contributions of factors that constitute a composite metric. Experimental results for many classes of randomly generated networks are presented. / Master of Science
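As a rough, hedged illustration of the reward-recovery idea in a routing setting, the toy sketch below treats each observed forwarding decision as a one-step revealed preference and solves a small linear program for the weights of a composite metric; the feature vectors, the maximum-margin formulation, and the function names are assumptions made for illustration rather than the method developed in the thesis:

```python
import numpy as np
from scipy.optimize import linprog

def recover_metric_weights(decisions):
    """decisions: list of (chosen_features, [alternative_features, ...]) pairs,
    each feature vector holding per-link factors such as delay or load.
    Finds non-negative weights (summing to 1) under which every observed choice
    has a lower composite metric than its alternatives, with maximum margin."""
    k = len(decisions[0][0])                     # number of metric components
    A_ub, b_ub = [], []
    for chosen, alternatives in decisions:
        for alt in alternatives:
            # w . chosen + margin <= w . alt  ->  w.(chosen - alt) + margin <= 0
            A_ub.append(np.append(np.array(chosen) - np.array(alt), 1.0))
            b_ub.append(0.0)
    A_eq = [np.append(np.ones(k), 0.0)]          # weights sum to 1
    b_eq = [1.0]
    c = np.zeros(k + 1)
    c[-1] = -1.0                                 # maximise the margin
    bounds = [(0, 1)] * k + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:k], res.x[-1]                  # weights and achieved margin

# Example: two observed decisions over links described by (delay, load) features.
weights, margin = recover_metric_weights([
    ((2.0, 1.0), [(3.0, 0.5), (2.5, 2.0)]),
    ((1.0, 2.0), [(1.5, 3.0)]),
])
print(weights, margin)
```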
|
123 |
Encoding the Sensor Allocation Problem for Reinforcement Learning. Penn, Dylan R. 16 May 2024.
Traditionally, space situational awareness (SSA) sensor networks have relied on dynamic programming theory to generate tasking plans which govern how sensors are allocated to observe resident space objects. Deep reinforcement learning (DRL) techniques can be trained on simulated environments, which are readily available for the SSA sensor allocation problem, and have demonstrated strong performance in other fields; they therefore have the potential to exceed the performance of deterministic methods. The research presented in this dissertation develops techniques for encoding an SSA environment model to apply DRL to the sensor allocation problem. This dissertation is the compilation of two separate but related studies. The first study compares two alternative invalid-action handling techniques, penalization and masking. The second study examines the performance of policies that have forecast state knowledge incorporated in the observation space. / Doctor of Philosophy / Resident space objects (RSOs) are typically tracked by ground-based sensors (telescopes and radar). Determining how to allocate sensors to RSOs is a complex problem traditionally solved by dynamic programming techniques. Deep reinforcement learning (DRL), a subset of machine learning, has demonstrated strong performance in other fields and has the potential to exceed the performance of traditional techniques. The research presented in this dissertation develops techniques for encoding a space situational awareness environment model to apply DRL to the sensor allocation problem. This dissertation is the compilation of two separate but related studies. The first study compares two alternative invalid-action handling techniques, penalization and masking. The second study examines the performance of policies that have forecast state knowledge incorporated in the observation space.
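The two invalid-action handling schemes examined in the first study can be sketched briefly: penalization keeps every action available and relies on a large negative reward, whereas masking removes invalid actions from consideration before the agent selects. The snippet below is an illustrative sketch of logit masking for a generic policy network, not the encoding used in the dissertation:

```python
import numpy as np

def masked_action_probabilities(logits, valid_mask):
    """Turn raw policy logits into a distribution over valid actions only.

    logits     : array of shape (n_actions,) from the policy network
    valid_mask : boolean array, True where the action is currently valid
    """
    masked = np.where(valid_mask, logits, -np.inf)   # invalid actions get -inf
    shifted = masked - np.max(masked)                # numerical stability
    probs = np.exp(shifted)
    return probs / probs.sum()

# Penalization, by contrast, keeps all actions available and instead returns a
# large negative reward whenever an invalid action is chosen, e.g.:
#     reward = -100.0 if not valid_mask[action] else env_reward

logits = np.array([1.2, 0.3, -0.5, 2.0])
mask = np.array([True, False, True, True])
print(masked_action_probabilities(logits, mask))
```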
|
124 |
Learning-based Optimal Control of Time-Varying Linear Systems Over Large Time Intervals. Baddam, Vasanth Reddy. January 2023.
We solve the problem of two-point boundary optimal control of linear time-varying systems with unknown model dynamics using reinforcement learning. Leveraging singular perturbation theory techniques, we transform the time-varying optimal control problem into two time-invariant subproblems. This allows the utilization of an off-policy iteration method to learn the controller gains. We show that the performance of the learning-based controller approximates that of the model-based optimal controller, and that the approximation accuracy improves as the control problem's time horizon increases. We also provide a simulation example to verify the results. / M.S. / We use reinforcement learning to find two-point boundary optimal controls for linear time-varying systems with uncertain model dynamics. We divide the LTV control problem into two LTI subproblems using singular perturbation theory techniques. As a result, it is possible to identify the controller gains via a learning technique. We show that the learning-based controller's performance approaches that of the model-based optimal controller, with approximation accuracy growing with the time horizon of the control problem. In addition, we provide a simulated scenario to back up our findings.
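The time-invariant subproblems obtained after the singular perturbation decomposition are linear-quadratic problems, and the model-based optimal controller that the learned controller is compared against can be computed by policy iteration. The sketch below is a loosely related, model-based, discrete-time illustration of that iteration (the thesis itself is model-free and learns the gains off-policy); the matrices are invented for the example:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_policy_iteration(A, B, Q, R, K0, iters=50):
    """Model-based policy iteration for a discrete-time, time-invariant LQR
    subproblem: evaluate the current gain by solving a Lyapunov equation,
    then improve it. K0 must stabilize (A - B K0)."""
    K = K0
    for _ in range(iters):
        A_cl = A - B @ K
        # Policy evaluation: P = A_cl' P A_cl + Q + K' R K
        P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
        # Policy improvement
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

# Tiny example: A is already stable, so K0 = 0 is a stabilizing initial gain.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, P = lqr_policy_iteration(A, B, Q, R, K0=np.zeros((1, 2)))
print(K)
```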
|
125 |
Bayesian methods for knowledge transfer and policy search in reinforcement learning. Wilson, Aaron (Aaron Creighton). 28 July 2012.
How can an agent generalize its knowledge to new circumstances? To learn effectively, an agent acting in a sequential decision problem must make intelligent action selection choices based on its available knowledge. This dissertation focuses on Bayesian methods of representing learned knowledge and develops novel algorithms that exploit the represented knowledge when selecting actions.

Our first contribution introduces the multi-task Reinforcement Learning setting in which an agent solves a sequence of tasks. An agent equipped with knowledge of the relationship between tasks can transfer knowledge between them. We propose the transfer of two distinct types of knowledge: knowledge of domain models and knowledge of policies. To represent the transferable knowledge, we propose hierarchical Bayesian priors on domain models and policies respectively. To transfer domain model knowledge, we introduce a new algorithm for model-based Bayesian Reinforcement Learning in the multi-task setting which exploits the learned hierarchical Bayesian model to improve exploration in related tasks. To transfer policy knowledge, we introduce a new policy search algorithm that accepts a policy prior as input and uses the prior to bias policy search. A specific implementation of this algorithm is developed that accepts a hierarchical policy prior. The algorithm learns the hierarchical structure and reuses components of the structure in related tasks.

Our second contribution addresses the basic problem of generalizing knowledge gained from previously-executed policies. Bayesian Optimization is a method of exploiting a prior model of an objective function to quickly identify the point maximizing the modeled objective. Successful use of Bayesian Optimization in Reinforcement Learning requires a model relating policies and their performance. Given such a model, Bayesian Optimization can be applied to search for an optimal policy. Early work using Bayesian Optimization in the Reinforcement Learning setting ignored the sequential nature of the underlying decision problem. The work presented in this thesis explicitly addresses this problem. We construct new Bayesian models that take advantage of sequence information to better generalize knowledge across policies. We empirically evaluate the value of this approach in a variety of Reinforcement Learning benchmark problems. Experiments show that our method significantly reduces the amount of exploration required to identify the optimal policy.

Our final contribution is a new framework for learning parametric policies from queries presented to an expert. In many domains it is difficult to provide expert demonstrations of desired policies. However, it may still be a simple matter for an expert to identify good and bad performance. To take advantage of this limited expert knowledge, our agent presents experts with pairs of demonstrations and asks which of the demonstrations best represents a latent target behavior. The goal is to use a small number of queries to elicit the latent behavior from the expert. We formulate a Bayesian model of the querying process, an inference procedure that estimates the posterior distribution over the latent policy space, and an active procedure for selecting new queries for presentation to the expert. We show, in multiple domains, that the algorithm successfully learns the target policy and that the active learning strategy generally improves the speed of learning. / Graduation date: 2013
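For orientation, standard Bayesian Optimization over a policy parameter looks like the sketch below: a Gaussian process models the mapping from parameter to return and an acquisition rule picks the next parameter to evaluate. This plain form ignores the sequential structure that the thesis's models exploit, and the kernel, acquisition rule, and toy objective are assumptions for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def bayes_opt_policy_search(evaluate_return, bounds=(0.0, 1.0), n_init=3,
                            n_iter=20, beta=2.0, seed=0):
    """Bayesian Optimization over a one-dimensional policy parameter.

    evaluate_return(theta) runs the policy with parameter theta and returns an
    estimate of its expected return; a GP models theta -> return and an
    upper-confidence-bound rule picks the next parameter to try."""
    rng = np.random.default_rng(seed)
    thetas = list(rng.uniform(*bounds, size=n_init))
    returns = [evaluate_return(t) for t in thetas]
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2,
                                  normalize_y=True)
    candidates = np.linspace(*bounds, 200).reshape(-1, 1)
    for _ in range(n_iter):
        gp.fit(np.array(thetas).reshape(-1, 1), np.array(returns))
        mean, std = gp.predict(candidates, return_std=True)
        next_theta = float(candidates[int(np.argmax(mean + beta * std)), 0])
        thetas.append(next_theta)
        returns.append(evaluate_return(next_theta))
    return thetas[int(np.argmax(returns))]

# Toy objective standing in for a noisy policy-evaluation rollout.
best = bayes_opt_policy_search(lambda t: -(t - 0.7) ** 2 + 0.05 * np.random.randn())
print(best)
```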
|
126 |
Embodied Evolution of Learning Ability. Elfwing, Stefan. January 2007.
Embodied evolution is a methodology for evolutionary robotics that mimics the distributed, asynchronous, and autonomous properties of biological evolution. The evaluation, selection, and reproduction are carried out by cooperation and competition of the robots, without any need for human intervention. An embodied evolution framework is therefore well suited to studying the adaptive learning mechanisms for artificial agents that share the same fundamental constraints as biological agents: self-preservation and self-reproduction. The main goal of the research in this thesis has been to develop a framework for performing embodied evolution with a limited number of robots, by utilizing time-sharing of subpopulations of virtual agents inside each robot. The framework integrates reproduction as a directed autonomous behavior, and allows for learning of basic behaviors for survival by reinforcement learning. The purpose of the evolution is to evolve the learning ability of the agents, by optimizing meta-properties in reinforcement learning, such as the selection of basic behaviors, meta-parameters that modulate the efficiency of the learning, and additional and richer reward signals that guide the learning in the form of shaping rewards. The realization of the embodied evolution framework has been a cumulative research process in three steps: 1) investigation of the learning of a cooperative mating behavior for directed autonomous reproduction; 2) development of an embodied evolution framework, in which the selection of pre-learned basic behaviors and the optimization of battery recharging are evolved; and 3) development of an embodied evolution framework that includes meta-learning of basic reinforcement learning behaviors for survival, and in which the individuals are evaluated by an implicit and biologically inspired fitness function that promotes reproductive ability. The proposed embodied evolution methods have been validated in a simulation environment of the Cyber Rodent robot, a robotic platform developed for embodied evolution purposes. The evolutionarily obtained solutions have also been transferred to the real robotic platform. The evolutionary approach to meta-learning has also been applied to the automatic design of task hierarchies in hierarchical reinforcement learning, and to co-evolving meta-parameters and potential-based shaping rewards to accelerate reinforcement learning, both in terms of finding initial solutions and in terms of convergence to robust policies. / QC 20100706
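One of the meta-properties mentioned above, the potential-based shaping reward, has the standard form F(s, s') = gamma * Phi(s') - Phi(s) and can be folded directly into a Q-learning update, as in this minimal sketch (illustrative only, not the Cyber Rodent implementation; the potential table is assumed to be what evolution tunes):

```python
import numpy as np

def shaped_q_update(Q, potential, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """Tabular Q-learning step with a potential-based shaping reward
    F(s, s') = gamma * potential[s'] - potential[s] added to the environment
    reward. Potential-based shaping preserves the optimal policy while often
    speeding up learning; here the potential table would be tuned by evolution."""
    F = gamma * potential[s_next] - potential[s]
    target = (r + F) + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```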
|
127 |
Adaptation-based programming. Bauer, Tim (Timothy R.). 31 January 2013.
Partial programming is a field of study where users specify an outline or skeleton of a program, but leave various parts undefined. The undefined parts are then completed by an external mechanism to form a complete program. Adaptation-Based Programming (ABP) is a method of partial programming that utilizes techniques from the field of reinforcement learning (RL), a subfield of machine learning, to find good completions of those partial programs.

An ABP user writes a partial program in some host programming language. At various points where the programmer is uncertain of the best course of action, they include choices that non-deterministically select amongst several options. Additionally, users indicate program success through a reward construct somewhere in their program. The resulting non-deterministic program is completed by treating it as an equivalent RL problem and solving the problem with techniques from that field. Over repeated executions, the RL algorithms within the ABP system will learn to select choices at various points that maximize the reward received.

This thesis explores various aspects of ABP, such as the semantics of different implementations, including the different design trade-offs encountered with each approach. The goal of all approaches is to present a model for programs that adapt to their environment based on the points of uncertainty within the program that the programmer has indicated.

The first approach presented in this work is an implementation of ABP as a domain-specific language embedded within a functional language. This language provides constructs for common patterns and situations that arise in adaptive programs. This language proves to be compositional and to foster rapid experimentation with different adaptation methods (e.g. learning algorithms). A second approach presents an implementation of ABP as an object-oriented library that models adaptive programs as formal systems from the field of RL called Markov Decision Processes (MDPs). This approach abstracts away many of the details of the learning algorithm from the casual user and uses a fixed learning algorithm to control the program adaptation rather than allowing it to vary. This abstraction results in an easier-to-use library, but limits the scenarios in which ABP can effectively be used. Moreover, treating adaptive programs as MDPs leads to some unintuitive situations where seemingly reasonable programs fail to adapt efficiently. This work addresses this problem with algorithms that analyze the adaptive program's structure and data flow to boost the rate at which these problematic adaptive programs learn, thus increasing the number of problems that ABP can effectively be used to solve. / Graduation date: 2013
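To make the choice and reward constructs concrete, the sketch below shows one plausible shape for an adaptive choice point backed by a simple bandit-style learner in a host language. It is an illustrative mock-up in Python, not the embedded DSL or the object-oriented library developed in the thesis, and the class and method names are invented:

```python
import random
from collections import defaultdict

class ChoicePoint:
    """A named point of uncertainty: repeatedly choosing and rewarding it
    adapts the selection toward options that lead to higher program reward."""
    def __init__(self, name, options, epsilon=0.1, alpha=0.2):
        self.name, self.options = name, list(options)
        self.epsilon, self.alpha = epsilon, alpha
        self.values = defaultdict(float)          # estimated value per option
        self.taken = []                           # options chosen this run

    def choose(self):
        if random.random() < self.epsilon:
            option = random.choice(self.options)  # explore
        else:
            option = max(self.options, key=lambda o: self.values[o])  # exploit
        self.taken.append(option)
        return option

    def reward(self, r):
        # Credit every choice made during this execution with the final reward.
        for option in self.taken:
            self.values[option] += self.alpha * (r - self.values[option])
        self.taken.clear()

# A partial program: the sorting strategy is left to adaptation.
strategy = ChoicePoint("sort-strategy", ["quick", "merge", "insertion"])
for _ in range(100):
    chosen = strategy.choose()
    runtime = {"quick": 1.0, "merge": 1.2, "insertion": 3.0}[chosen]
    strategy.reward(-runtime)                     # faster runs earn more reward
print(max(strategy.values, key=strategy.values.get))
```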
|
128 |
Utilizing negative policy information to accelerate reinforcement learning. Irani, Arya John. 08 June 2015.
A pilot study by Subramanian et al. on Markov decision problem task decomposition by humans revealed that participants break down tasks into both short-term subgoals with a defined end-condition (such as "go to food") and long-term considerations and invariants with no end-condition (such as "avoid predators"). In the context of Markov decision problems, behaviors having clear start and end conditions are well-modeled by an abstraction known as options, but no abstraction exists in the literature for continuous constraints imposed on the agent's behavior.
We propose two representations to fill this gap: the state constraint (a set or predicate identifying states that the agent should avoid) and the state-action constraint (identifying state-action pairs that should not be taken). State-action constraints can be directly utilized by an agent, which must choose an action in each state, while state constraints require an approximation of the MDP’s state transition function to be used; however, it is important to support both representations, as certain constraints may be more easily expressed in terms of one as compared to the other, and users may conceive of rules in either form.
Using domains inspired by classic video games, this dissertation demonstrates the thesis that explicitly modeling this negative policy information improves reinforcement learning performance by decreasing the amount of training needed to achieve a given level of performance. In particular, we will show that even the use of negative policy information captured from individuals with no background in artificial intelligence yields improved performance.
We also demonstrate that the use of options and constraints together form a powerful combination: an option and constraint can be taken together to construct a constrained option, which terminates in any situation where the original option would violate a constraint. In this way, a naive option defined to perform well in a best-case scenario may still accelerate learning in domains where the best-case scenario is not guaranteed.
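The constrained-option construction can be illustrated with a small wrapper that terminates an option whenever its next action would violate a state-action constraint. The interfaces and the example predicates below are assumptions for illustration, not the dissertation's code:

```python
class ConstrainedOption:
    """Wrap an option so it terminates whenever its next action would violate
    a state-action constraint (a predicate over (state, action) pairs)."""
    def __init__(self, option_policy, option_terminates, violates):
        self.option_policy = option_policy          # state -> action
        self.option_terminates = option_terminates  # state -> bool
        self.violates = violates                    # (state, action) -> bool

    def act(self, state):
        """Return the option's action, or None to signal termination."""
        if self.option_terminates(state):
            return None
        action = self.option_policy(state)
        if self.violates(state, action):
            return None                             # stop rather than violate
        return action

# Example: a "go to food" option that is cut short near a predator.
go_to_food = ConstrainedOption(
    option_policy=lambda s: "move_right",
    option_terminates=lambda s: s.get("at_food", False),
    violates=lambda s, a: s.get("predator_right", False) and a == "move_right",
)
print(go_to_food.act({"predator_right": True}))     # -> None (terminate)
```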
|
129 |
Revisiting user simulation in dialogue systems: do we still need them? Will imitation play the role of simulation? Chandramohan, Senthilkumar. 25 September 2012.
Recent advancements in the area of spoken language processing and the wide acceptance of portable devices have attracted significant interest in spoken dialogue systems. These conversational systems are man-machine interfaces which use natural language (speech) as the medium of interaction. In order to conduct dialogues, computers must have the ability to decide when and what information has to be exchanged with the users. The dialogue management module is responsible for making these decisions so that the intended task (such as ticket booking or appointment scheduling) can be achieved. Thus learning a good strategy for dialogue management is a critical task. In recent years reinforcement learning-based dialogue management optimization has evolved to be the state-of-the-art. A majority of the algorithms used for this purpose need vast amounts of training data. However, data generation in the dialogue domain is an expensive and time-consuming process. In order to cope with this, and also to evaluate the learnt dialogue strategies, user modelling in dialogue systems was introduced. These models simulate real users in order to generate synthetic data. Being computational models, they introduce some degree of modelling error. In spite of this, system designers are forced to employ user models because of the data requirements of conventional reinforcement learning algorithms. Sample-efficient reinforcement learning algorithms, in contrast, can learn optimal dialogue strategies from a limited amount of training data compared to the conventional algorithms. As a consequence, user models are no longer required for the purpose of optimization, yet they continue to provide a fast and easy means for quantifying the quality of dialogue strategies. Since existing methods for user modelling are relatively less realistic compared to real user behaviour, the focus is shifted towards user modelling by means of inverse reinforcement learning. Using experimental results, the proposed method's ability to learn a computational model with real-user-like qualities is showcased as part of this work.
|
130 |
Value methods for efficiently solving stochastic games of complete and incomplete information. Mac Dermed, Liam Charles. 13 January 2014.
Multi-agent reinforcement learning (MARL) poses the same planning problem as traditional reinforcement learning (RL): What actions over time should an agent take in order to maximize its rewards? MARL tackles a challenging set of problems that can be better understood by modeling them as having a relatively simple environment but with complex dynamics attributed to the presence of other agents who are also attempting to maximize their rewards. A great wealth of research has developed around specific subsets of this problem, most notably when the rewards for each agent are either the same or directly opposite each other. However, there has been relatively little progress made on the general problem. This thesis addresses this gap.
Our goal is to tackle the most general, least restrictive class of MARL problems. These are general-sum, non-deterministic, infinite horizon, multi-agent sequential decision problems of complete and incomplete information. Towards this goal, we engage in two complementary endeavors: the creation of tractable models and the construction of efficient algorithms to solve these models. We tackle three well known models: stochastic games, decentralized partially observable Markov decision problems, and partially observable stochastic games. We also present a new fourth model, Markov games of incomplete information, to help solve the partially observable models.
For stochastic games and decentralized partially observable Markov decision problems, we develop novel and efficient value iteration algorithms to solve for game theoretic solutions. We empirically evaluate these algorithms on a range of problems, including well known benchmarks and show that our value iteration algorithms perform better than current policy iteration algorithms. Finally, we argue that our approach is easily extendable to new models and solution concepts, thus providing a foundation for a new class of multi-agent value iteration algorithms.
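For the zero-sum special case of these models, value iteration takes a particularly clean form (Shapley's algorithm): back up a matrix-game value in every state, solving each stage game with a linear program. The sketch below illustrates only that special case; the general-sum and partially observable algorithms developed in the thesis are substantially more involved:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of a zero-sum matrix game for the row (maximizing) player."""
    m, n = M.shape
    # Variables: row strategy x (m entries) and game value v; maximise v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v <= x . M[:, j] for all j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[-1]

def shapley_value_iteration(R, P, gamma=0.9, iters=100):
    """R[s] is the stage-game payoff matrix in state s; P[s][i][j] is a vector
    giving the distribution over next states after joint action (i, j)."""
    S = len(R)
    V = np.zeros(S)
    for _ in range(iters):
        V_new = np.empty(S)
        for s in range(S):
            m, n = R[s].shape
            Q = np.array([[R[s][i, j] + gamma * P[s][i][j] @ V
                           for j in range(n)] for i in range(m)])
            V_new[s] = matrix_game_value(Q)
        V = V_new
    return V

# Single-state repeated matching pennies: the value is 0.
R = [np.array([[1.0, -1.0], [-1.0, 1.0]])]
P = [[[np.array([1.0]), np.array([1.0])], [np.array([1.0]), np.array([1.0])]]]
print(shapley_value_iteration(R, P))
```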
|