  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
121

Utilizing negative policy information to accelerate reinforcement learning

Irani, Arya John 08 June 2015 (has links)
A pilot study by Subramanian et al. on Markov decision problem task decomposition by humans revealed that participants break tasks down into both short-term subgoals with a defined end-condition (such as "go to food") and long-term considerations and invariants with no end-condition (such as "avoid predators"). In the context of Markov decision problems, behaviors having clear start and end conditions are well modeled by an abstraction known as options, but no abstraction exists in the literature for continuous constraints imposed on the agent's behavior. We propose two representations to fill this gap: the state constraint (a set or predicate identifying states that the agent should avoid) and the state-action constraint (identifying state-action pairs that should not be taken). State-action constraints can be used directly by an agent, which must choose an action in each state, while state constraints require an approximation of the MDP's state transition function; it is nevertheless important to support both representations, as certain constraints may be more easily expressed in one form than the other, and users may conceive of rules in either form. Using domains inspired by classic video games, this dissertation demonstrates the thesis that explicitly modeling this negative policy information improves reinforcement learning performance by decreasing the amount of training needed to reach a given level of performance. In particular, we show that even negative policy information captured from individuals with no background in artificial intelligence yields improved performance. We also demonstrate that options and constraints together form a powerful combination: an option and a constraint can be combined to construct a constrained option, which terminates in any situation where the original option would violate the constraint.
In this way, a naive option defined to perform well in a best-case scenario may still accelerate learning in domains where the best-case scenario is not guaranteed.
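The constrained-option construction described in this abstract can be sketched in a few lines. The corridor world, the option, and the constraint below are illustrative assumptions, not the dissertation's actual video-game domains.

```python
# Sketch of a "constrained option": an option that terminates early whenever
# its next action would violate a state-action constraint.
# The corridor world, the option, and the constraint are illustrative
# assumptions, not the dissertation's actual domains.

def make_constrained_option(option_policy, option_terminates, constraint):
    """Wrap an option so it also terminates when the constraint would be violated.

    option_policy(state)      -> action the option would take
    option_terminates(state)  -> True if the original option ends here
    constraint(state, action) -> True if the state-action pair is forbidden
    """
    def policy(state):
        return option_policy(state)

    def terminates(state):
        # Terminate where the original option would, OR wherever executing the
        # option's next action would violate the state-action constraint.
        return option_terminates(state) or constraint(state, option_policy(state))

    return policy, terminates

# Toy 1-D corridor: states 0..5, option "walk right until state 5".
walk_right = lambda s: +1
reach_end = lambda s: s == 5
# Constraint: never step right from state 3 (say, a predator waits at 4).
avoid = lambda s, a: s == 3 and a == +1

policy, terminates = make_constrained_option(walk_right, reach_end, avoid)

s = 0
while not terminates(s):
    s += policy(s)
print(s)  # the constrained option halts at state 3 instead of reaching 5
```

The wrapper never edits the option's policy; it only adds termination conditions, which is what lets a naive best-case option remain safe in worse domains.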
122

Revisiting user simulation in dialogue systems: do we still need them? Will imitation play the role of simulation?

Chandramohan, Senthilkumar 25 September 2012 (has links) (PDF)
Recent advancements in the area of spoken language processing and the wide acceptance of portable devices have attracted significant interest in spoken dialogue systems. These conversational systems are man-machine interfaces which use natural language (speech) as the medium of interaction. In order to conduct dialogues, computers must have the ability to decide when and what information has to be exchanged with the users. The dialogue management module is responsible for making these decisions so that the intended task (such as ticket booking or appointment scheduling) can be achieved. Thus learning a good strategy for dialogue management is a critical task. In recent years, reinforcement learning-based dialogue management optimization has become the state of the art. A majority of the algorithms used for this purpose need vast amounts of training data. However, data generation in the dialogue domain is an expensive and time-consuming process. In order to cope with this, and also to evaluate the learnt dialogue strategies, user modelling in dialogue systems was introduced. These models simulate real users in order to generate synthetic data. Being computational models, they introduce some degree of modelling error. In spite of this, system designers are forced to employ user models because of the data requirements of conventional reinforcement learning algorithms. Sample-efficient reinforcement learning algorithms, however, can learn optimal dialogue strategies from a limited amount of training data compared to the conventional algorithms. As a consequence, user models are no longer required for the purpose of optimization, yet they continue to provide a fast and easy means of quantifying the quality of dialogue strategies. Since existing methods for user modelling are relatively less realistic than real user behaviour, the focus is shifted towards user modelling by means of inverse reinforcement learning.
Using experimental results, this work showcases the proposed method's ability to learn computational models with real-user-like qualities.
123

Value methods for efficiently solving stochastic games of complete and incomplete information

Mac Dermed, Liam Charles 13 January 2014 (has links)
Multi-agent reinforcement learning (MARL) poses the same planning problem as traditional reinforcement learning (RL): What actions over time should an agent take in order to maximize its rewards? MARL tackles a challenging set of problems that can be better understood by modeling them as having a relatively simple environment but with complex dynamics attributed to the presence of other agents who are also attempting to maximize their rewards. A great wealth of research has developed around specific subsets of this problem, most notably when the rewards for each agent are either the same or directly opposite each other. However, there has been relatively little progress made on the general problem. This thesis addresses that gap. Our goal is to tackle the most general, least restrictive class of MARL problems. These are general-sum, non-deterministic, infinite horizon, multi-agent sequential decision problems of complete and incomplete information. Towards this goal, we engage in two complementary endeavors: the creation of tractable models and the construction of efficient algorithms to solve these models. We tackle three well-known models: stochastic games, decentralized partially observable Markov decision problems, and partially observable stochastic games. We also present a new fourth model, Markov games of incomplete information, to help solve the partially observable models. For stochastic games and decentralized partially observable Markov decision problems, we develop novel and efficient value iteration algorithms to solve for game theoretic solutions. We empirically evaluate these algorithms on a range of problems, including well-known benchmarks, and show that our value iteration algorithms perform better than current policy iteration algorithms. Finally, we argue that our approach is easily extendable to new models and solution concepts, thus providing a foundation for a new class of multi-agent value iteration algorithms.
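For readers unfamiliar with the value iteration family the thesis builds on, a minimal single-agent sketch may help; the multi-agent algorithms described above generalize this scheme by replacing the max over actions with a game-theoretic (equilibrium) backup. The two-state MDP here is an invented example, not from the thesis.

```python
# Background sketch: classical single-agent value iteration on a tiny MDP.
# The thesis's algorithms generalize this backup to multi-agent models by
# replacing the max over actions with an equilibrium computation over joint
# actions; the two-state MDP below is an illustrative assumption.

GAMMA = 0.9
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

V = {s: 0.0 for s in transitions}
for _ in range(200):  # iterate until numerically converged
    V = {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in transitions.items()
    }

print(round(V[1], 2))  # state 1 can "stay" for reward 2 forever: 2/(1-0.9) = 20.0
```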
124

Towards a Deep Reinforcement Learning based approach for real-time decision making and resource allocation for Prognostics and Health Management applications

Ludeke, Ricardo Pedro João January 2020 (has links)
Industrial operational environments are stochastic and can have complex system dynamics which introduce multiple levels of uncertainty. This uncertainty leads to sub-optimal decision making and resource allocation. Digitalisation and automation of production equipment and the maintenance environment enable predictive maintenance, meaning that equipment can be stopped for maintenance at the optimal time. Resource constraints in maintenance capacity could, however, result in further undesired downtime if maintenance cannot be performed when scheduled. In this dissertation the applicability of a Multi-Agent Deep Reinforcement Learning-based approach to decision making is investigated to determine the optimal maintenance scheduling policy in a fleet of assets subject to maintenance resource constraints. By considering the underlying system dynamics of maintenance capacity, as well as the health state of individual assets, a near-optimal decision making policy is found that increases equipment availability while also maximising maintenance capacity. The implemented solution is compared to a run-to-failure corrective maintenance strategy, a constant-interval preventive maintenance strategy and a condition-based predictive maintenance strategy. The proposed approach outperformed traditional maintenance strategies across several asset and operational maintenance performance metrics. It is concluded that Deep Reinforcement Learning-based decision making for asset health management and resource allocation is more effective than human-based decision making. / Dissertation (MEng (Mechanical Engineering))--University of Pretoria, 2020. / Mechanical and Aeronautical Engineering / MEng (Mechanical Engineering) / Unrestricted
125

Comparison of deep reinforcement learning algorithms in a self-play setting

Kumar, Sunil 30 August 2021 (has links)
In this exciting era of artificial intelligence and machine learning, the success of AlphaGo, AlphaZero, and MuZero has generated great interest in deep reinforcement learning, especially under self-play settings. The methods used by AlphaZero are finding their way into many different application areas, such as clinical medicine, intelligent military command decision support systems, and recommendation systems. While specific methods of reinforcement learning with self-play have found their place in application domains, there is much to be explored from existing reinforcement learning methods not originally intended for self-play settings. This thesis focuses on evaluating the performance of existing reinforcement learning techniques in self-play settings. In this research, we trained and evaluated the performance of two deep reinforcement learning algorithms with self-play settings on game environments, namely the games Connect Four and Chess. We demonstrate how a simple on-policy, policy-based method such as REINFORCE shows signs of learning, whereas an off-policy value-based method such as Deep Q-Networks does not perform well with self-play settings in the selected environments. The results show that the REINFORCE agent wins 85% of the games after training against a random baseline agent and 60% of the games against the greedy baseline agent in the game Connect Four. The agent's strength under both techniques was measured and plotted against different baseline agents. We also investigate the impact of selected significant hyper-parameters on the performance of the agents. Finally, we provide our recommendations for these hyper-parameters' values for training deep reinforcement learning agents in similar environments. / Graduate
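The on-policy REINFORCE update underlying the better-performing agent can be sketched on a toy two-action task; the task is an illustrative stand-in for Connect Four or Chess, not the thesis's actual environments.

```python
# Minimal sketch of the REINFORCE policy-gradient update: sample an action
# from a softmax policy, observe the return G, and move the logits along
# G * grad log pi(a). The two-action task below is an illustrative
# assumption, not one of the thesis's game environments.
import math, random

random.seed(0)
theta = [0.0, 0.0]   # one logit per action
LR = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(action):
    # Action 1 is better in expectation.
    return 1.0 if action == 1 else 0.0

for _ in range(2000):
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]   # sample from pi
    G = reward(a)                                  # one-step return
    for i in range(2):
        # grad of log softmax w.r.t. logit i: indicator(i == a) - probs[i]
        grad = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += LR * G * grad

print(softmax(theta)[1])  # the policy now strongly prefers action 1
```

In the self-play setting, the same update is applied with G taken from the game outcome, which is where the on-policy property becomes important: the data always comes from the current policy playing itself.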
126

Single asset trading: a recurrent reinforcement learning approach

Nikolic, Marko January 2020 (has links)
Asset trading using machine learning has become popular within the financial industry in recent years. This can, for instance, be seen in the large share of daily trading volume that is generated by automatic algorithms. This thesis presents a recurrent reinforcement learning model for trading a single asset. The benefits, drawdowns and the derivations of the model are presented. Different parameters of the model are calibrated and tuned considering a traditional division between training and testing data sets, and also with the help of nested cross-validation. The results of the single asset trading model are compared to the benchmark strategy, which consists of buying the underlying asset and holding it for a long period of time regardless of the asset's volatility. The proposed model outperforms the buy-and-hold strategy on three out of four stocks selected for the experiment. Additionally, the returns of the model are sensitive to changes in the epochs, m, the learning rate and the training/test ratio.
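A recurrent reinforcement learning trader in the Moody-Saffell style, which this abstract appears to follow, feeds the previous position back into the next decision and charges a transaction cost on position changes. The prices, weights, and cost below are illustrative assumptions, not the thesis's calibrated values.

```python
# Sketch of a recurrent reinforcement learning trader's forward pass:
# the position F_t = tanh(w * x_t + u * F_{t-1}) feeds back into the next
# decision, and each step's trading return charges a transaction cost on
# the position change. Prices, weights, and cost are illustrative
# assumptions, not the thesis's calibrated values.
import math

prices = [100, 101, 103, 102, 104, 107, 106, 108]   # synthetic price series
w, u, cost = 2.0, 0.5, 0.001                        # fixed (untrained) parameters

returns, F_prev = [], 0.0
for t in range(1, len(prices)):
    x_t = (prices[t] - prices[t - 1]) / prices[t - 1]   # latest price return
    F_t = math.tanh(w * x_t + u * F_prev)               # position in [-1, 1]
    # Profit from the position held over this step, minus switching cost.
    returns.append(F_prev * x_t - cost * abs(F_t - F_prev))
    F_prev = F_t

print(len(returns), sum(returns))
```

Training then amounts to adjusting w and u to maximize a performance measure of these returns (for example a Sharpe-like ratio), which is where the calibration and nested cross-validation described in the abstract come in.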
127

Efficiency Comparison Between Curriculum Reinforcement Learning & Reinforcement Learning Using ML-Agents.

Tabell Johnsson, Marco, Jafar, Ala January 2020 (has links)
No description available.
128

Training reinforcement learning model with custom OpenAI gym for IIoT scenario

Norman, Pontus January 2022 (has links)
This study consists of an experiment to see, as a proof of concept, how well it would work to implement an industrial gym environment to train a reinforcement learning model. To determine this, the reinforcement learning model is trained repeatedly and tested. If the model completes the training scenario, that training iteration counts as a success. The time it takes to train for a certain number of game episodes is measured. The number of episodes it takes for the reinforcement learning model to achieve an acceptable outcome of 80% of the maximum score is measured, along with the time it takes to train those episodes. These measurements are evaluated, and conclusions are drawn about how well the reinforcement learning models worked. The tools used are a Q-learning algorithm implemented from scratch and deep Q-learning with TensorFlow. The manually implemented Q-learning algorithm showed varying results depending on environment design and how long the agent was trained, with success rates varying from 100% to 0%. The times it took to train the agent to an acceptable level were 0.116, 0.571 and 3.502 seconds depending on which environment was tested (see the results chapter for more information on the environments). The TensorFlow implementation gave either a 100% or a 0% success rate; since the polarizing results were likely caused by an implementation issue, measurements were limited to a single environment, and because the model never reached a stable outcome above 80%, no training time was measured for this implementation.
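The study's setup, a custom Gym-style environment trained with a hand-rolled tabular Q-learning loop, can be sketched as follows. The toy corridor stands in for the IIoT scenario, and all hyper-parameter values are assumptions.

```python
# Sketch of the study's setup: a custom Gym-style environment (reset/step
# API) trained by a hand-rolled tabular Q-learning loop. The 5-state
# corridor stands in for the IIoT scenario and is an illustrative
# assumption, as are the hyper-parameters.
import random

random.seed(1)

class CorridorEnv:
    """Gym-like toy environment: start at state 0, reach state 4 for reward 1."""
    N_STATES, N_ACTIONS = 5, 2   # actions: 0 = left, 1 = right

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        self.s = max(0, min(self.N_STATES - 1, self.s + (1 if action == 1 else -1)))
        done = self.s == self.N_STATES - 1
        return self.s, (1.0 if done else 0.0), done

env = CorridorEnv()
Q = [[0.0, 0.0] for _ in range(env.N_STATES)]
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.5

for _ in range(1000):                 # training episodes
    s = env.reset()
    for _ in range(50):               # cap episode length
        if random.random() < EPS:     # epsilon-greedy exploration
            a = random.randrange(env.N_ACTIONS)
        else:
            a = Q[s].index(max(Q[s]))
        s2, r, done = env.step(a)
        # Standard Q-learning update.
        Q[s][a] += ALPHA * (r + GAMMA * (0.0 if done else max(Q[s2])) - Q[s][a])
        s = s2
        if done:
            break

print([q.index(max(q)) for q in Q[:-1]])  # greedy policy: move right everywhere
```

Replacing the tabular Q with a small neural network approximator gives the deep Q-learning variant the study compares against.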
129

Performance Enhancement of Aerial Base Stations via Reinforcement Learning-Based 3D Placement Techniques

Parvaresh, Nahid 21 December 2022 (has links)
Deploying unmanned aerial vehicles (UAVs) as aerial base stations (BSs) to assist terrestrial connectivity has drawn significant attention in recent years. UAV-BSs can quickly take over as service providers during natural disasters and many other emergency situations when ground BSs fail in an unanticipated manner. UAV-BSs can also provide cost-effective Internet connection to users who are outside infrastructure coverage. UAV-BSs benefit from their mobility, which enables them to change their 3D locations as the demand of ground users changes. In order to make effective use of the mobility of UAV-BSs in a dynamic network and maximize performance, the 3D location of UAV-BSs should be continuously optimized. However, the location optimization problem of UAV-BSs is NP-hard, with no optimal solution obtainable in polynomial time, so near-optimal solutions have to be exploited. Besides conventional heuristic solutions, machine learning (ML), specifically reinforcement learning (RL), has emerged as a promising approach to the positioning problem of UAV-BSs. The common practice when optimizing the 3D location of UAV-BSs with RL algorithms is to assume fixed locations for ground users (i.e., UEs) and to define a discrete, limited action space for the RL agent. In this thesis, we focus on improving the location optimization of UAV-BSs in two ways: first, by taking the mobility of users into account in the design of the RL algorithm; and second, by extending the action space of RL to a continuous action space so that the UAV-BS agent can flexibly change its location by any distance (limited by the maximum speed of the UAV-BS). Three types of RL algorithms, i.e. Q-learning (QL), deep Q-learning (DQL) and actor-critic deep Q-learning (ACDQL), are employed in this thesis to step-by-step improve the performance results of a UAV-assisted cellular network.
QL is the first type of RL algorithm we use for the autonomous movement of the UAV-BS in the presence of mobile users. As a solution to the limitations of QL, we next propose a DQL-based strategy for the location optimization of the UAV-BS, which largely improves the performance results of the network compared to the QL-based model. Third, we propose an ACDQL-based solution for autonomously moving the UAV-BS in a continuous action space, whose performance significantly outperforms both the QL and DQL strategies.
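The contrast between the discrete action space used by QL/DQL and the continuous action space used by ACDQL can be sketched as follows; the step size, maximum speed, and coordinates are illustrative assumptions, not values from the thesis.

```python
# Sketch of the two action spaces compared in the thesis: the fixed
# per-axis moves of a discrete QL/DQL agent versus the arbitrary 3D
# displacement a continuous (actor-critic) agent can output, clipped to
# the UAV's maximum speed. All numbers are illustrative assumptions.
import math

MAX_SPEED = 10.0  # metres per time step (assumed)

# Discrete action set: hover, or move one fixed step along a single axis.
DISCRETE_ACTIONS = {
    "hover": (0.0, 0.0, 0.0),
    "+x": (MAX_SPEED, 0.0, 0.0), "-x": (-MAX_SPEED, 0.0, 0.0),
    "+y": (0.0, MAX_SPEED, 0.0), "-y": (0.0, -MAX_SPEED, 0.0),
    "+z": (0.0, 0.0, MAX_SPEED), "-z": (0.0, 0.0, -MAX_SPEED),
}

def apply_discrete(pos, name):
    dx, dy, dz = DISCRETE_ACTIONS[name]
    return (pos[0] + dx, pos[1] + dy, pos[2] + dz)

def apply_continuous(pos, delta):
    # Continuous action: any 3D displacement, scaled down if it exceeds
    # the maximum speed.
    norm = math.sqrt(sum(d * d for d in delta))
    scale = min(1.0, MAX_SPEED / norm) if norm > 0 else 0.0
    return tuple(p + d * scale for p, d in zip(pos, delta))

pos = (0.0, 0.0, 100.0)
print(apply_discrete(pos, "+x"))               # → (10.0, 0.0, 100.0)
print(apply_continuous(pos, (3.0, 4.0, 0.0)))  # → (3.0, 4.0, 100.0)
```

The continuous agent can reach any point the discrete agent can, plus everything in between, which is what motivates the move from QL/DQL to ACDQL in the thesis.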
130

The interaction of working memory and uncertainty (mis)estimation in context-dependent outcome estimation

Li Xin Lim (9230078) 13 November 2023 (has links)
In the context of reinforcement learning, extensive research has shown how learning is facilitated by the estimation of uncertainty, which improves the ability to make decisions. However, the constraints imposed by limited observation of the process of forming an environment representation have seldom been a subject of discussion. This study therefore set out to demonstrate that when a limited memory is incorporated into uncertainty estimation, individuals can misestimate outcomes and environmental statistics. The study included a computational model incorporating active working memory and lateral inhibition in working memory (WM) to describe how relevant information is chosen and stored to form estimations of uncertainty when building outcome expectations. Active working memory maintains relevant information based not only on recency but also on utility. With the relevant information stored in WM, the model was able to estimate expected uncertainty and perceived volatility and to detect contextual changes or dynamics in the outcome structure. Two experiments investigating limitations in information availability and uncertainty estimation were carried out. The first experiment investigated the impact of cognitive load on the reliance on memories to form outcome estimations. The findings revealed that introducing cognitive load diminished the reliance on memory for uncertainty estimation and lowered the expected uncertainty, leading to an increased perception of environmental volatility. The second experiment investigated the ability to detect changes in outcome noise under different conditions of outcome exposure. The study found differences in the mechanisms used for detecting environmental changes in the various conditions. Through the experiments and model fitting, the study showed that the misestimation of uncertainties depends on individual experiences and on the relevant information stored in WM under a limited capacity.
