Multi-Task Reinforcement Learning: From Single-Agent to Multi-Agent Systems
Trang, Matthew Luu (06 January 2023)
Generalized collaborative drones are a technology that has many potential benefits. General purpose drones that can handle exploration, navigation, manipulation, and more without having to be reprogrammed would be an immense breakthrough for usability and adoption of the technology. The ability to develop these multi-task, multi-agent drone systems is limited by the lack of available training environments, as well as deficiencies of multi-task learning due to a phenomenon known as catastrophic forgetting. In this thesis, we present a set of simulation environments for exploring the abilities of multi-task drone systems and provide a platform for testing agents in incremental single-agent and multi-agent learning scenarios. The multi-task platform is an extension of an existing drone simulation environment written in Python using the PyBullet Physics Simulation Engine, with these environments incorporated. Using this platform, we present an analysis of Incremental Learning and detail the beneficial impacts of using the technique for multi-task learning, with respect to multi-task learning speed and catastrophic forgetting. Finally, we introduce a novel algorithm, Incremental Learning with Second-Order Approximation Regularization (IL-SOAR), to mitigate some of the effects of catastrophic forgetting in multi-task learning. We show the impact of this method and contrast the performance relative to a multi-agent multi-task approach using a centralized policy sharing algorithm. / Master of Science / Machine Learning techniques allow drones to be trained to achieve tasks which are otherwise time-consuming or difficult. The goal of this thesis is to facilitate the work of creating these complex drone machine learning systems by exploring Reinforcement Learning (RL), a field of machine learning which involves learning the correct actions to take through experience. Currently, RL methods are effective in the design of drones which are able to solve one particular task. 
The next step in this technology is to develop RL systems which are able to handle generalization and perform well across multiple tasks. In this thesis, simulation environments for drones to learn complex tasks are created, and algorithms which are able to train drones in multiple hard tasks are developed and tested. We explore the benefits of using a specific multi-task training technique known as Incremental Learning. Additionally, we consider one of the prohibitive factors of multi-task machine learning-based solutions, the degradation problem of agent performance on previously learned tasks, known as catastrophic forgetting. We create an algorithm that aims to prevent the impact of forgetting when training drones sequentially on new tasks. We contrast this approach with a multi-agent solution, where multiple drones learn simultaneously across the tasks.
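The abstract does not spell out how IL-SOAR works, so the following is only a generic sketch of the broader idea its name suggests: penalising movement away from the parameters learned on an earlier task, using a second-order (quadratic) approximation of the old task's loss. The function names, the diagonal `curvature` estimate, and the `strength` parameter are all assumptions made for illustration and are not taken from the thesis.

```python
def quadratic_penalty(theta, theta_old, curvature, strength):
    """Second-order approximation of the old-task loss increase.

    theta      -- current parameters (list of floats)
    theta_old  -- parameters after training on the previous task
    curvature  -- assumed per-parameter (diagonal) curvature estimates
    strength   -- hypothetical weight of the penalty
    """
    return 0.5 * strength * sum(
        h * (t - t0) ** 2 for t, t0, h in zip(theta, theta_old, curvature)
    )


def penalty_gradient(theta, theta_old, curvature, strength):
    """Gradient of quadratic_penalty with respect to theta."""
    return [strength * h * (t - t0) for t, t0, h in zip(theta, theta_old, curvature)]


def regularized_step(theta, new_task_grad, theta_old, curvature, strength, lr):
    """One gradient step on the new task plus the forgetting penalty."""
    pen = penalty_gradient(theta, theta_old, curvature, strength)
    return [t - lr * (g + p) for t, g, p in zip(theta, new_task_grad, pen)]
```

Parameters with a large curvature estimate are pulled back toward their old values more strongly, so the new task is learned mostly through parameters the old task was less sensitive to.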
Action selection in modular reinforcement learning
Zhang, Ruohan (16 September 2014)
Modular reinforcement learning is an approach to resolving the curse of dimensionality in traditional reinforcement learning. We design and implement a modular reinforcement learning algorithm based on three major components: Markov decision process decomposition, module training, and global action selection. We define and formalize the concepts of module class and module instance in the decomposition step. Under our decomposition framework, we train each module efficiently using the SARSA($\lambda$) algorithm. We then design, implement, test, and compare three action selection algorithms based on different heuristics: Module Combination, Module Selection, and Module Voting. For the last two algorithms, we propose a method to calculate module weights efficiently, using the standard deviation of each module's Q-values. We show that the Module Combination and Module Voting algorithms produce satisfactory performance in our test domain.
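The three action selection heuristics can be sketched as follows. The abstract names the heuristics and says module weights come from the standard deviation of each module's Q-values, but it gives no formulas, so the normalisation, the tie-breaking, and all function names here are assumptions for illustration only.

```python
from statistics import pstdev


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def module_weights(q_values):
    """q_values[m][a] is module m's Q-value for action a in the current state.

    Assumed form: weight proportional to the standard deviation of the
    module's Q-values, so a module that cares more about the choice of
    action gets a larger say.
    """
    stds = [pstdev(q) for q in q_values]
    total = sum(stds)
    if total == 0:
        return [1.0 / len(stds)] * len(stds)  # all modules indifferent
    return [s / total for s in stds]


def module_combination(q_values):
    """Choose the action with the largest Q-value summed over modules."""
    n_actions = len(q_values[0])
    return argmax([sum(q[a] for q in q_values) for a in range(n_actions)])


def module_selection(q_values):
    """Let the highest-weight module choose its own greedy action."""
    weights = module_weights(q_values)
    return argmax(q_values[argmax(weights)])


def module_voting(q_values):
    """Each module casts a weighted vote for its greedy action."""
    weights = module_weights(q_values)
    votes = [0.0] * len(q_values[0])
    for weight, q in zip(weights, q_values):
        votes[argmax(q)] += weight
    return argmax(votes)
```

For example, with three modules and three actions, a module whose Q-values are nearly flat contributes almost nothing to the vote, while a module with one clearly preferred action dominates.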
Design of optimal neural network control strategies with minimal a priori knowledge
Paraskevopoulos, Vasileios (January 2000)
No description available.
Learning user modelling strategies for adaptive referring expression generation in spoken dialogue systems
Janarthanam, Srinivasan Chandrasekaran (January 2011)
We address the problem of dynamic user modelling for referring expression generation in spoken dialogue systems, i.e., how a spoken dialogue system should choose referring expressions to refer to domain entities to users with different levels of domain expertise, whose domain knowledge is initially unknown to the system. We approach this problem using a statistical planning framework: Reinforcement Learning techniques in Markov Decision Processes (MDP). We present a new reinforcement learning framework to learn user modelling strategies for adaptive referring expression generation (REG) in resource-scarce domains (i.e., where no large corpus exists for learning). As a part of the framework, we present novel user simulation models that are sensitive to the referring expressions used by the system and are able to simulate users with different levels of domain knowledge. Such models are shown to simulate real user behaviour more closely than baseline user simulation models. In contrast to previous approaches to user-adaptive systems, we do not assume that the user’s domain knowledge is available to the system before the conversation starts. We show that using a small corpus of non-adaptive dialogues it is possible to learn an adaptive user modelling policy in resource-scarce domains using our framework. We also show that the learned user modelling strategies performed better in terms of adaptation than hand-coded baseline policies on both simulated and real users. With real users, the learned policy produced around a 20% increase in adaptation in comparison to the best-performing hand-coded adaptive baseline. We also show that adaptation to users’ domain knowledge improves task success (99.47% for the learned policy vs 84.7% for the hand-coded baseline) and reduces dialogue time (11% relative difference). This is because users found it easier to identify domain objects when the system used adaptive referring expressions during the conversations.
Neural mechanisms of suboptimal decisions
Chau, Ka Hung Bolton (January 2014)
Making good decisions and adapting flexibly to environmental change are critical to the survival of animals. In this thesis, I investigated neural mechanisms underlying suboptimal decision making in humans and underlying behavioural adaptation in monkeys with the use of functional magnetic resonance imaging (fMRI) in both species. In recent decades, in the neuroscience of decision making, there has been a prominent focus on binary decisions. Whether the presence of an additional third option could have an impact on behaviour and neural signals has been largely overlooked. I designed an experiment in which decisions were made between two options in the presence of a third option. A biophysical model simulation made surprising predictions that more suboptimal decisions were made in the presence of a very poor third alternative. Subsequent human behavioural testing showed consistent results with these predictions. In the ventromedial prefrontal cortex (vmPFC), I found that a value comparison signal that is critical for decision making became weaker in the presence of a poor value third option. The effect contrasts with another prominent potential mechanism during multi-alternative decision making – divisive normalization – the signatures of which were observed in the posterior parietal cortex. It has long been thought that the orbitofrontal cortex (OFC) and amygdala mediate reward-guided behavioural adaptation. However, this viewpoint has been recently challenged. I recorded whole brain activity in macaques using fMRI while they performed an object discrimination reversal task over multiple testing sessions. I identified a lateral OFC (lOFC) region in which activity predicted adaptive win-stay/lose-shift behaviour. In contrast, anterior cingulate cortex (ACC) activity predicted future exploratory decisions regardless of reward outcome. Amygdala and lOFC activity was more strongly coupled for adaptive choice shifting and decoupled for task irrelevant reward memory. 
Day-to-day fluctuations in signals and signal coupling were correlated with day-to-day fluctuations in performance. These data demonstrate OFC, ACC, and amygdala each make unique contributions to flexible behaviour and credit assignment.
The behavior of institutional investors in IPO markets and the decision of going public abroad
Fu, Youyan (January 2016)
This thesis comprehensively studies three questions. First of all, I use a unique set of institutional investor bids to examine the impact of personal experience on the behavior of institutional investors in an IPO market. I find that, when deciding to participate in future IPOs, institutions take into account initial returns of past IPOs in which they submitted bids more than IPOs which they merely observed. In addition, initial returns from past IPOs in which institutions’ bids were qualified for share allocation were given more consideration than IPOs for which unqualified bids were submitted. This phenomenon is consistent with reinforcement learning. I also find that institutions do not distinguish the returns that are derived from random events. Furthermore, institutions become more aggressive bidders after experiencing high returns in recent IPOs, conditional on personal participation or being qualified for share allocation in those IPOs. This bidding behavior provides additional evidence of reinforcement learning in IPO markets. Secondly, I merge the dataset of institutional investor bids with post-IPO institutional holdings data to examine whether institutional investors such as fund companies reveal their true valuations through bids in a unique quasi-bookbuilding IPO mechanism. I find that fund companies do truthfully disclose their private information via bids, despite these being without guaranteed compensation. My results contribute to the existing literature by providing new evidence on the information compensation theory and have implications for the IPO mechanism design. Finally, I explore the impact on firm valuation of going public abroad using a sample of 136 Chinese firms that conducted IPOs in the US during the period of 1999-2012. I find that US-listed Chinese firms have higher price multiples and experience less underpricing than their domestic-listed peers. 
The valuation premium stays consistent when a firm’s characteristics and listing cost are being controlled. These findings are consistent with the theories of foreign listing. Moreover, I find that high-tech Chinese firms with a high growth rate but low profitability are more likely to issue shares in the US, particularly for specific industries such as semiconductors, software and online business services. This industry clustering is interpreted as an incentive to access foreign expertise through listing abroad.
An architecture for situated learning agents
Mitchell, Matthew Winston, 1968- (January 2003)
Abstract not available
Reinforcement learning by incremental patching
Kim, Min Sub, Computer Science & Engineering, Faculty of Engineering, UNSW (January 2007)
This thesis investigates how an autonomous reinforcement learning agent can improve on an approximate solution by augmenting it with a small patch, which overrides the approximate solution at certain states of the problem. In reinforcement learning, many approximate solutions are smaller and easier to produce than “flat” solutions that maintain distinct parameters for each fully enumerated state, but the best solution within the constraints of the approximation may fall well short of global optimality. This thesis proposes that the remaining gap to global optimality can be efficiently minimised by learning a small patch over the approximate solution. In order to improve the agent’s behaviour, algorithms are presented for learning the overriding patch. The patch is grown around particular regions of the problem where the approximate solution is found to be deficient. Two heuristic strategies are proposed for concentrating resources to those areas where inaccuracies in the approximate solution are most costly, drawing a compromise between solution quality and storage requirements. Patching also handles problems with continuous state variables, by two alternative methods: Kuhn triangulation over a fixed discretisation and nearest neighbour interpolation with a variable discretisation. As well as improving the agent’s behaviour, patching is also applied to the agent’s model of the environment. Inaccuracies in the agent’s model of the world are detected by statistical testing, using a selective sampling strategy to limit storage requirements for collecting data. The patching algorithms are demonstrated in several problem domains, illustrating the effectiveness of patching under a wide range of conditions. A scenario drawn from a real-time strategy game demonstrates the ability of patching to handle large complex tasks. These contributions combine to form a general framework for patching over approximate solutions in reinforcement learning.
Complex problems cannot be solved by brute force alone, and some form of approximation is necessary to handle large problems. However, this does not mean that the limitations of approximate solutions must be accepted without question. Patching demonstrates one way in which an agent can leverage approximation techniques without losing the ability to handle fine yet important details.
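The core idea of a patch that overrides an approximate solution at selected states can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis's algorithms: the class and function names are hypothetical, states are assumed to be hashable, and the `estimated_gain` values stand in for whatever measure of deficiency the agent actually uses. The `max_size` limit is an assumed way to express the trade-off between solution quality and storage.

```python
class PatchedPolicy:
    """An approximate policy plus a small table of per-state overrides."""

    def __init__(self, base_policy):
        self.base_policy = base_policy  # callable: state -> action
        self.patch = {}                 # state -> overriding action

    def act(self, state):
        # The patch takes precedence wherever it has an entry.
        if state in self.patch:
            return self.patch[state]
        return self.base_policy(state)

    def add_patch(self, state, action):
        self.patch[state] = action


def grow_patch(policy, candidates, max_size):
    """Patch the states where the base policy is most costly.

    candidates -- dict mapping state -> (better_action, estimated_gain);
                  how the gain is estimated is left open here.
    max_size   -- hypothetical storage budget for the patch.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1][1], reverse=True)
    for state, (action, gain) in ranked:
        if len(policy.patch) >= max_size:
            break
        if gain > 0:
            policy.add_patch(state, action)
```

States outside the patch keep the behaviour of the approximate solution, so storage grows only with the number of states where the approximation is judged worth correcting.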
Modelling motivation for experience-based attention focus in reinforcement learning
Merrick, Kathryn (January 2007)
Doctor of Philosophy / Computational models of motivation are software reasoning processes designed to direct, activate or organise the behaviour of artificial agents. Models of motivation inspired by psychological motivation theories permit the design of agents with a key reasoning characteristic of natural systems: experience-based attention focus. The ability to focus attention is critical for agent behaviour in complex or dynamic environments where only a small amount of the available information is relevant at a particular time. Furthermore, experience-based attention focus enables adaptive behaviour that focuses on different tasks at different times in response to an agent’s experiences in its environment. This thesis is concerned with the synthesis of motivation and reinforcement learning in artificial agents. This extends reinforcement learning to adaptive, multi-task learning in complex, dynamic environments. Reinforcement learning algorithms are computational approaches to learning characterised by the use of reward or punishment to direct learning. The focus of much existing reinforcement learning research has been on the design of the learning component. In contrast, the focus of this thesis is on the design of computational models of motivation as approaches to the reinforcement component that generates reward or punishment. The primary aim of this thesis is to develop computational models of motivation that extend reinforcement learning with three key aspects of attention focus: rhythmic behavioural cycles, adaptive behaviour and multi-task learning in complex, dynamic environments. This is achieved by representing such environments using context-free grammars, modelling maintenance tasks as observations of these environments and modelling achievement tasks as events in these environments.
Motivation is modelled by processes for task selection, the computation of experience-based reward signals for different tasks and arbitration between reward signals to produce a motivation signal. Two specific models of motivation based on the experience-oriented psychological concepts of interest and competence are designed within this framework. The first models motivation as a function of environmental experiences while the second models motivation as an introspective process. This thesis synthesises motivation and reinforcement learning as motivated reinforcement learning agents. Three models of motivated reinforcement learning are presented to explore the combination of motivation with three existing reinforcement learning components. The first model combines motivation with flat reinforcement learning for highly adaptive learning of behaviours for performing multiple tasks. The second model facilitates the recall of learned behaviours by combining motivation with multi-option reinforcement learning. In the third model, motivation is combined with an hierarchical reinforcement learning component to allow both the recall of learned behaviours and the reuse of these behaviours as abstract actions for future learning. Because motivated reinforcement learning agents have capabilities beyond those of existing reinforcement learning approaches, new techniques are required to measure their performance. The secondary aim of this thesis is to develop metrics for measuring the performance of different computational models of motivation with respect to the adaptive, multi-task learning they motivate. This is achieved by analysing the behaviour of motivated reinforcement learning agents incorporating different motivation functions with different learning components. Two new metrics are introduced that evaluate the behaviour learned by motivated reinforcement learning agents in terms of the variety of tasks learned and the complexity of those tasks. 
Persistent, multi-player computer game worlds are used as the primary example of complex, dynamic environments in this thesis. Motivated reinforcement learning agents are applied to control the non-player characters in games. Simulated game environments are used for evaluating and comparing motivated reinforcement learning agents using different motivation and learning components. The performance and scalability of these agents are analysed in a series of empirical studies in dynamic environments and environments of progressively increasing complexity. Game environments simulating two types of complexity increase are studied: environments with increasing numbers of potential learning tasks and environments with learning tasks that require behavioural cycles comprising more actions. A number of key conclusions can be drawn from the empirical studies, concerning both different computational models of motivation and their combination with different reinforcement learning components. Experimental results confirm that rhythmic behavioural cycles, adaptive behaviour and multi-task learning can be achieved using computational models of motivation as an experience-based reward signal for reinforcement learning. In dynamic environments, motivated reinforcement learning agents incorporating introspective competence motivation adapt more rapidly to change than agents motivated by interest alone. Agents incorporating competence motivation also scale to environments of greater complexity than agents motivated by interest alone. Motivated reinforcement learning agents combining motivation with flat reinforcement learning are the most adaptive in dynamic environments and exhibit scalable behavioural variety and complexity as the number of potential learning tasks is increased. However, when tasks require behavioural cycles comprising more actions, motivated reinforcement learning agents using a multi-option learning component exhibit greater scalability. 
Motivated multi-option reinforcement learning also provides a more scalable approach to recall than motivated hierarchical reinforcement learning. In summary, this thesis makes contributions in two key areas. Computational models of motivation and motivated reinforcement learning extend reinforcement learning to adaptive, multi-task learning in complex, dynamic environments. Motivated reinforcement learning agents allow the design of non-player characters for computer games that can progressively adapt their behaviour in response to changes in their environment.
Hierarchical average reward reinforcement learning
Seri, Sandeep (15 March 2002)
Reinforcement Learning (RL) is the study of agents that learn optimal behavior by interacting with and receiving rewards and punishments from an unknown environment. RL agents typically do this by learning value functions that assign a value to each state (situation) or to each state-action pair. Recently, there has been a growing interest in using hierarchical methods to cope with the complexity that arises due to the huge number of states found in most interesting real-world problems. Hierarchical methods seek to reduce this complexity by the use of temporal and state abstraction. Like most RL methods, most hierarchical RL methods optimize the discounted total reward that the agent receives. However, in many domains, the proper criterion to optimize is the average reward per time step. In this thesis, we adapt the concepts of hierarchical and recursive optimality, which are used to describe the kind of optimality achieved by hierarchical methods, to the average reward setting and show that they coincide under a condition called Result Distribution Invariance. We present two new model-based hierarchical RL methods, HH-learning and HAH-learning, that are intended to optimize the average reward. HH-learning is a hierarchical extension of the model-based, average-reward RL method, H-learning. Like H-learning, HH-learning requires exploration in order to learn correct domain models and an optimal value function. HH-learning can be used with any exploration strategy whereas HAH-learning uses the principle of "optimism under uncertainty", which gives it a built-in "auto-exploratory" feature. We also give the hierarchical and auto-exploratory hierarchical versions of R-learning, a model-free average reward method, and a hierarchical version of ARTDP, a model-based discounted total reward method. We compare the performance of the "flat" and hierarchical methods in the task of scheduling an Automated Guided Vehicle (AGV) in a variety of settings.
The results show that hierarchical methods can take advantage of temporal and state abstraction and converge in fewer steps than the flat methods. The exception is the hierarchical version of ARTDP. We give an explanation for this anomaly. Auto-exploratory hierarchical methods are faster than the hierarchical methods with ε-greedy exploration. Finally, hierarchical model-based methods are faster than hierarchical model-free methods. / Graduation date: 2003
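To make the average-reward criterion concrete, here is a minimal sketch of flat, tabular R-learning with ε-greedy exploration, following the commonly described form of the update. It is not the hierarchical or auto-exploratory variant developed in the thesis, and the class name, step sizes, and default parameters are assumptions for illustration.

```python
import random
from collections import defaultdict


class RLearner:
    """Tabular R-learning: learns relative action values and an
    estimate rho of the average reward per time step."""

    def __init__(self, actions, alpha=0.1, beta=0.01, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha        # step size for the action values
        self.beta = beta          # step size for the average-reward estimate
        self.epsilon = epsilon    # probability of a random exploratory action
        self.q = defaultdict(float)  # (state, action) -> relative value
        self.rho = 0.0
        self.rng = random.Random(seed)

    def best_value(self, state):
        return max(self.q[(state, a)] for a in self.actions)

    def choose(self, state):
        """Epsilon-greedy action selection."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        # Rewards are measured relative to the current average-reward estimate.
        target = r - self.rho + self.best_value(s_next)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
        # Update rho only when the action taken is greedy in s.
        if self.q[(s, a)] == self.best_value(s):
            self.rho += self.beta * (
                r - self.rho + self.best_value(s_next) - self.best_value(s)
            )
```

Unlike discounted methods, there is no discount factor: the value of a state-action pair is its long-run advantage over the average reward, and `rho` itself is the quantity being optimised.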