441
Sample-Efficient Reinforcement Learning of Robot Control Policies in the Real World. January 2019.
abstract: The goal of reinforcement learning is to enable systems to autonomously solve tasks in the real world, even in the absence of prior data. To succeed in such situations, reinforcement learning algorithms collect new experience through interactions with the environment to further the learning process. The behaviour is optimized by maximizing a reward function, which assigns high numerical values to desired behaviours. Especially in robotics, such interactions with the environment are expensive in terms of the required execution time, human involvement, and mechanical degradation of the system itself. Therefore, this thesis aims to introduce sample-efficient reinforcement learning methods which are applicable to real-world settings and control tasks such as bimanual manipulation and locomotion. Sample efficiency is achieved through directed exploration, either by using dimensionality reduction or trajectory optimization methods. Finally, it is demonstrated how data-efficient reinforcement learning methods can be used to optimize the behaviour and morphology of robots at the same time. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2019
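To make the interaction-and-reward loop described in the abstract concrete, the sketch below shows a minimal tabular Q-learning agent on a toy five-state chain; the environment, reward values, and hyperparameters are illustrative assumptions and are not taken from the thesis.

    import numpy as np

    # Toy 5-state chain: action 1 moves right (goal at the last state), action 0 moves left.
    N_STATES, N_ACTIONS, GOAL = 5, 2, 4

    def step(state, action):
        next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == GOAL else 0.0      # the reward encodes the desired behaviour
        return next_state, reward, next_state == GOAL

    Q = np.zeros((N_STATES, N_ACTIONS))
    alpha, gamma, epsilon = 0.1, 0.95, 0.1

    for episode in range(500):
        s = 0
        for t in range(100):                             # collect new experience by interacting
            greedy = int(Q[s].argmax()) if Q[s].max() > Q[s].min() else np.random.randint(N_ACTIONS)
            a = np.random.randint(N_ACTIONS) if np.random.rand() < epsilon else greedy
            s_next, r, done = step(s, a)
            # temporal-difference update towards the reward-maximizing value estimate
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
            if done:
                break

    print(np.round(Q, 2))   # learned values favour moving right, towards the rewarded goal state

Sample-efficient methods such as those proposed in the thesis aim to need far fewer of these environment interactions than a naive loop like this one.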
442
Game AI of StarCraft II based on Deep Reinforcement Learning. Junjie Luo (8786552), 30 April 2020.
This thesis addresses a Game AI agent for StarCraft II based on Deep Reinforcement Learning (DRL). StarCraft II is widely viewed as the most challenging Real-Time Strategy (RTS) game at present, and it is also the most popular game in which researchers develop and improve AI agents. Building AI agents for StarCraft II can help machine learning researchers identify the weaknesses of DRL and improve this family of algorithms. In 2018, DeepMind and Blizzard released the StarCraft II Learning Environment with its Python interface (PySC2) to enable researchers to advance the development of AI agents. DeepMind then started a new DRL-based project, AlphaStar, as a successor to AlphaGo, while several other laboratories also published work on StarCraft II agents. Most of this work concerns agents for Terran and Zerg, two of the three races in StarCraft II. These agents show high-level performance compared with most StarCraft II players, but they remain far from defeating professional e-sports players because StarCraft II presents a very large observation space and a very large action space. Moreover, there is no publication on Protoss, the remaining race and, owing to its characteristics (an even larger action space and observation space), the most complicated for an AI agent to handle. The research question of this paper is therefore whether a Protoss agent, developed with a DRL-based model, can defeat the high-level built-in cheating AI in a full-length game on a particular map. The population of this research design is the set of StarCraft II AI agents that researchers have built with their DRL models, while the sample is the Protoss agent developed here. The raw data come from game matches between the Protoss agent and the built-in AI agents; PySC2 captures features and numerical variables in each match to produce the training data. The expected outcome is a DRL-based model that can train a Protoss AI agent to defeat high-level built-in AI agents, as measured by the win rate. The model includes the Protoss action space, the observation space, and the realization of the DRL algorithms, and it is built on PySC2 v2.0, which provides additional action functions. Owing to the complexity and the unique characteristics of Protoss in StarCraft II, the model cannot be applied directly to other games or platforms; however, the way the model trains a Protoss agent exposes limitations of DRL and pushes DRL algorithms a little further forward.
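One concrete difficulty the abstract points to is the very large action space: at any game step only a subset of actions is legal. A common way StarCraft II agents handle this is to mask out unavailable actions before sampling from the policy. The PyTorch sketch below illustrates only this masking step; the feature and action dimensions are illustrative assumptions, PySC2-specific observation processing is not shown, and this is not the thesis's actual model.

    import torch

    def sample_masked_action(logits: torch.Tensor, available: torch.Tensor):
        """Sample an action, assigning zero probability to unavailable actions."""
        masked_logits = logits.masked_fill(~available, float("-inf"))
        dist = torch.distributions.Categorical(logits=masked_logits)
        action = dist.sample()
        return action, dist.log_prob(action)

    # Illustrative sizes: the real PySC2 action space has several hundred action functions,
    # but the numbers below are placeholders.
    n_actions = 16
    policy_logits = torch.randn(n_actions)              # output of some policy network
    available = torch.zeros(n_actions, dtype=torch.bool)
    available[[0, 3, 7]] = True                          # e.g. no-op, move, attack are legal

    action, log_prob = sample_masked_action(policy_logits, available)
    print(int(action), float(log_prob))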
443
Autonomous Guidance for Multi-body Orbit Transfers using Reinforcement Learning. Nicholas Blaine LaFarge (8790908), 01 May 2020.
While human presence in cislunar space continues to expand, so too does the demand for 'lightweight' automated on-board processes. In nonlinear dynamical environments, computationally efficient guidance strategies are challenging. Many traditional approaches rely on either simplifying assumptions in the dynamical model or on abundant computational resources. This research employs reinforcement learning, a subset of machine learning, to produce a controller that is suitable for on-board low-thrust guidance in challenging dynamical regions of space. The proposed controller functions without knowledge of the simplifications and assumptions of the dynamical model, and direct interaction with the nonlinear equations of motion creates a flexible learning scheme that is not limited to a single force model. The learning process leverages high-performance computing to train a closed-loop neural network controller. This controller may be employed on-board, and autonomously generates low-thrust control profiles in real-time without imposing a heavy workload on a flight computer. Control feasibility is demonstrated through sample transfers between Lyapunov orbits in the Earth-Moon system. The sample low-thrust controller exhibits remarkable robustness to perturbations and generalizes effectively to nearby motion. Effective guidance in sample scenarios suggests extendibility of the learning framework to higher-fidelity domains.
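A minimal sketch of the kind of closed-loop mapping described above: a small neural network that takes the spacecraft state and returns a bounded thrust command, cheap enough to evaluate on-board. The state and action dimensions, network size, and thrust bound are illustrative assumptions, not values from the thesis.

    import torch
    import torch.nn as nn

    class LowThrustPolicy(nn.Module):
        """Maps a 6-dimensional position/velocity state to a bounded 3-dimensional thrust."""
        def __init__(self, state_dim=6, action_dim=3, hidden=64, max_thrust=1e-3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, action_dim), nn.Tanh(),   # output in (-1, 1)
            )
            self.max_thrust = max_thrust

        def forward(self, state):
            return self.max_thrust * self.net(state)        # scale to the thrust limit

    policy = LowThrustPolicy()
    state = torch.randn(6)        # nondimensional position/velocity state (placeholder values)
    thrust = policy(state)        # evaluated in real time, no on-board optimization required
    print(thrust)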
444
Bestärkendes Lernen zur Steuerung und Regelung nichtlinearer dynamischer Systeme (Reinforcement Learning for the Control of Nonlinear Dynamical Systems). Pritzkoleit, Max, 21 January 2020.
In der vorliegenden Arbeit wird das bestärkende Lernen im Kontext der Steuerung und Regelung nichtlinearer dynamischer Systeme untersucht. Es werden zunächst die Grundlagen der stochastischen Optimalsteuerung sowie des maschinellen Lernens, die für die Betrachtungen dieser Arbeit relevant sind, erläutert. Anschließend werden die Methoden des bestärkenden Lernens im Kontext der datenbasierten Steuerung und Regelung dargelegt, um anschließend auf drei Methoden des tiefen bestärkenden Lernens näher einzugehen. Der Algorithmus Deep-Deterministic-Policy-Gradient (DDPG) wird zum Gegenstand intensiver Untersuchungen an verschiedenen mechanischen Beispielsystemen.
Weiterhin erfolgt der Vergleich mit einem klassischen Ansatz, bei dem die zu bewältigenden Steuerungsaufgaben mit einer modellbasierten Trajektorienberechnung, die auf dem iterativen linear-quadratischen Regler (iLQR) basiert, gelöst werden. Mit dem iLQR können zwar alle Steuerungsaufgaben erfolgreich bewältigt werden, aber für neue Anfangswerte muss das Problem erneut gelöst werden. Bei DDPG hingegen wird ein Regler erlernt, der das zu steuernde dynamische System – aus nahezu beliebigen Anfangswerten – in den gewünschten Zustand überführt. Nachteilig ist jedoch, dass der Algorithmus sich auf hochgradig nichtlineare Systeme bisher nicht anwenden lässt und eine geringe Dateneffizienz aufweist. / In this thesis, the application of reinforcement learning to the control of nonlinear dynamical systems is researched. At first, the relevant principles of stochastic optimal control and machine learning are explained. Afterwards, reinforcement learning is embedded in the context of optimal control. Three methods of deep reinforcement learning are analyzed. A particular algorithm, namely Deep Deterministic Policy Gradient (DDPG), is chosen for further studies on a variety of mechanical systems. Furthermore, the reinforcement learning approach is compared to a model-based trajectory optimization method, the iterative linear-quadratic regulator (iLQR). All control problems can be successfully solved with the trajectory optimization approach, but for new initial conditions the problem has to be solved again. In contrast, DDPG learns a global feedback controller that can drive the controlled system to the desired state from nearly arbitrary initial conditions. Its disadvantages are poor data efficiency and, so far, a lack of applicability to highly nonlinear systems.
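For reference, the core DDPG update compared against iLQR above can be summarised in a few lines: the critic regresses onto a bootstrapped target, the actor follows the deterministic policy gradient through the critic, and target networks are updated softly. The sketch below is a generic single-batch update with illustrative dimensions and hyperparameters, not the exact implementation studied in the thesis.

    import torch
    import torch.nn as nn

    state_dim, action_dim, gamma, tau = 4, 1, 0.99, 0.005

    def mlp(inp, out):
        return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

    actor, critic = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
    actor_t, critic_t = mlp(state_dim, action_dim), mlp(state_dim + action_dim, 1)
    actor_t.load_state_dict(actor.state_dict()); critic_t.load_state_dict(critic.state_dict())
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

    def ddpg_update(s, a, r, s_next, done):
        # critic: minimise the TD error against the frozen target networks
        with torch.no_grad():
            q_next = critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
            y = r + gamma * (1.0 - done) * q_next
        critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
        opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

        # actor: deterministic policy gradient, i.e. ascend the critic's value
        actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
        opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

        # soft (Polyak) update of the target networks
        for net, net_t in ((actor, actor_t), (critic, critic_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)

    batch = 32
    ddpg_update(torch.randn(batch, state_dim), torch.rand(batch, action_dim),
                torch.randn(batch, 1), torch.randn(batch, state_dim), torch.zeros(batch, 1))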
445
Adipositas- und geschlechtsspezifische Einflüsse auf phasische kardiale Reaktionen bei verstärkendem Lernen (Obesity- and Sex-Specific Influences on Phasic Cardiac Responses during Reinforcement Learning). Kastner, Lucas, 02 October 2018.
Obesity is one of the greatest medical and socioeconomic challenges facing modern health care systems. Previous studies comparing obese and lean men and women have identified characteristic behavioural differences, divergent brain-morphological and brain-functional findings, and differing activity in the branches of the autonomic nervous system as important factors underlying obesity. After further, more differentiated investigation, these differences could provide important starting points for new forms of therapy.
In the present study, we used a probabilistic learning task to examine learning performance and cardiac response patterns during reinforcement learning, and the influence of feedback valence, sex, and obesity on learning performance and autonomic responses.
To differentiate precisely between learning behaviour under positive versus negative feedback, we used a dedicated task design for a probabilistic learning experiment based on operant conditioning with monetary feedback. In addition to learning performance, we examined differences in cardiac reactivity during the processing of the two feedback valences, as well as the influence of sex and obesity on these processes.
The analysis of the magnitude of phasic cardiac responses to feedback presentation revealed a direct relationship with the magnitude of the prediction error. The prediction error is the neural signal that triggers the re-evaluation of cortical value representations whenever the actual outcome of a decision deviates from the expected outcome. There are therefore direct interactions between phasic heart-rate decelerations and higher-level feedback-monitoring processes, which, to the best of our knowledge, the present study demonstrates as a direct relationship for the first time.
The observed sex-dependent deficits in reinforcement learning were not caused by differences in the acquisition of knowledge but by an insufficient application of what had been learned. In particular, female participants showed more inconsistent behaviour than male participants in the reward condition, which in this task led to fewer advantageous decisions and thus to lower learning performance.
Furthermore, our results provide important evidence of obesity-specific differences in learning behaviour. In the initial learning phase, learning to avoid punishment was slowed in obese participants, in line with findings in the literature on impairments in avoiding negative long-term consequences. This finding should be examined in more detail in future studies in order to advance the development of suitable therapies.
1. Introduction
1.1 Obesity
1.2 Learning
1.3 Obesity-specific learning deficits
1.4 Sex differences in learning behaviour
1.5 Learning and the autonomic nervous system
1.6 Obesity-specific changes of the autonomic nervous system
1.7 Phasic cardiac responses – interbeat intervals
1.8 Rationale of the study
2. Paper
3. Summary of the thesis
3.1 Behavioural results
3.2 Influence of obesity on the learning process
3.3 Influence of sex on the learning process
3.4 Relationships between phasic cardiac responses and the learning process
3.5 Conclusions
4. References
5. Appendix
5.1 Supplementary material
5.1.1 Heart rate variability (HRV)
5.1.2 Interbeat intervals (IBIs)
5.3 Declaration of authorship
5.4 Curriculum vitae
5.5 Acknowledgements
446
Exploration of Intelligent HVAC Operation Strategies for Office Buildings. Xiaoqi Liu (9681032), 15 December 2020.
Commercial buildings not only have significant impacts on occupants' well-being, but also contribute to more than 19% of the total energy consumption in the United States. Along with improvements in building equipment efficiency and utilization of renewable energy, there has been significant focus on the development of advanced heating, ventilation, and air conditioning (HVAC) system controllers that incorporate predictions (e.g., occupancy patterns, weather forecasts) and current state information to execute optimization-based strategies. For example, model predictive control (MPC) provides a systematic implementation option, using a system model and an optimization algorithm to adjust the control setpoints dynamically. This approach automatically satisfies component and operation constraints related to building dynamics, HVAC equipment, etc. However, the wide adoption of advanced controls still faces several practical challenges: such approaches involve significant engineering effort and require site-specific solutions for complex problems that need to account for uncertain weather forecasts and for engaging the building occupants. This thesis explores smart building operation strategies that address these issues from the following three aspects.
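As a minimal illustration of the MPC idea described above (a model plus an optimizer that selects inputs over a horizon subject to constraints), the sketch below solves a toy one-zone cooling problem with cvxpy; the first-order thermal model, bounds, weights, and weather forecast are illustrative assumptions rather than anything from the thesis.

    import numpy as np
    import cvxpy as cp

    # Toy discrete-time one-zone model: T[k+1] = a*T[k] - b*u[k] + c*T_out[k], u is cooling power
    a, b, c, N = 0.9, 0.5, 0.1, 12
    T_out = 30.0 + 2.0 * np.sin(np.linspace(0, np.pi, N))   # assumed outdoor-temperature forecast
    T0, T_low, T_high, u_max = 26.0, 22.0, 24.0, 10.0

    u = cp.Variable(N)          # cooling input over the horizon
    T = cp.Variable(N + 1)      # indoor temperature trajectory

    constraints = [T[0] == T0, u >= 0, u <= u_max]
    for k in range(N):
        constraints += [T[k + 1] == a * T[k] - b * u[k] + c * T_out[k]]
        constraints += [T[k + 1] >= T_low, T[k + 1] <= T_high]

    # trade off energy use against deviation from a comfort reference of 23 C
    cost = cp.sum(u) + 0.5 * cp.sum_squares(T[1:] - 23.0)
    problem = cp.Problem(cp.Minimize(cost), constraints)
    problem.solve()
    print("first control move:", u.value[0])   # only the first input is applied, then re-solve

In receding-horizon operation, only the first computed input is applied and the problem is re-solved at the next step with updated measurements and forecasts.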
First, the thesis explores a stochastic model predictive control (SMPC) method for the optimal utilization of solar energy in buildings with integrated solar systems. This approach considers the uncertainty in the solar irradiance forecast over a prediction horizon, using a new probabilistic time-series autoregressive model calibrated on the sky-cover forecast from a weather service provider. In the optimal control formulation, we model the effect of solar irradiance as a non-Gaussian stochastic disturbance affecting the cost and constraints, and the nonconvex cost function is an expectation over the stochastic process. To solve this optimization problem, we introduce a new approximate dynamic programming methodology that represents the optimal cost-to-go functions using Gaussian processes and achieves good solution quality. We use an emulator to evaluate the closed-loop operation of a building-integrated system with a solar-assisted heat pump coupled with radiant floor heating. For the system and climate considered, the SMPC saves up to 44% of the electricity consumption for heating in a winter month compared to a well-tuned rule-based controller, and it is robust, imposing less uncertainty on thermal comfort violations.
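A rough sketch of the approximate-dynamic-programming ingredient described above: fit a Gaussian process to sampled cost-to-go values, then choose the input that minimises stage cost plus the expected GP value under sampled non-Gaussian disturbances. The one-dimensional toy dynamics, costs, and disturbance model are illustrative assumptions, not the thesis's formulation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(1)

    # 1) Fit a GP surrogate of the cost-to-go J(x) from sampled (state, cost) pairs.
    x_samples = rng.uniform(-2.0, 2.0, size=(40, 1))
    j_samples = x_samples[:, 0] ** 2 + 0.1 * rng.standard_normal(40)   # toy cost-to-go data
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2).fit(x_samples, j_samples)

    # 2) One-step lookahead: minimise stage cost + E[J(next state)] over candidate inputs,
    #    taking the expectation over the skewed disturbance by Monte Carlo sampling.
    def next_state(x, u, w):
        return 0.8 * x + u + w

    x0 = 1.5
    disturbances = rng.gamma(shape=2.0, scale=0.1, size=500) - 0.2   # non-Gaussian, near-zero mean
    candidates = np.linspace(-1.0, 1.0, 41)

    def lookahead_cost(u):
        stage = 0.5 * u ** 2 + x0 ** 2
        xn = next_state(x0, u, disturbances).reshape(-1, 1)
        return stage + gp.predict(xn).mean()

    best_u = min(candidates, key=lookahead_cost)
    print("chosen input:", best_u)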
Second, this thesis explores user-interactive thermal environment control systems that aim to increase energy efficiency and occupant satisfaction in office buildings. Towards this goal, we present a new modeling approach for occupant interactions with a temperature control and energy use interface, based on utility theory, that reveals causal effects in the human decision-making process. The model is a utility function that quantifies occupants' preference over temperature setpoints, incorporating their comfort and energy use considerations. We demonstrate our approach by implementing the user-interactive system in actual office spaces with an energy-efficient model predictive HVAC controller. The results show that, with the developed interactive system, occupants achieved the same level of overall satisfaction with selected setpoints that are closer to the temperatures determined by the MPC strategy to reduce energy use. Also, occupants often accept the default MPC setpoints when a significant improvement in the thermal environment conditions is not needed to satisfy their preference. Our results show that occupants' overrides can contribute up to 55% of the HVAC energy consumption on average with MPC. The prototype user-interactive system recovered 36% of this additional energy consumption while achieving the same overall occupant satisfaction level. Based on these findings, we propose that the utility model can become a generalized approach to evaluate the design of similar user-interactive systems for different office layouts and building operation scenarios.
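A stylised version of the kind of utility model described above, in which an occupant's preference over setpoints trades off thermal discomfort against displayed energy use; the quadratic-discomfort form, the linear energy term, and all coefficients are illustrative assumptions, not the fitted model from the study.

    import numpy as np

    def utility(setpoint, preferred_temp=23.0, w_comfort=1.0, w_energy=0.15, baseline=26.0):
        """Occupant utility: discomfort grows with distance from the preferred temperature,
        and (for cooling) energy use grows as the setpoint is lowered below a baseline."""
        discomfort = (setpoint - preferred_temp) ** 2
        energy = max(baseline - setpoint, 0.0)
        return -w_comfort * discomfort - w_energy * energy

    setpoints = np.arange(21.0, 27.5, 0.5)          # options offered by the interface
    chosen = max(setpoints, key=utility)
    print("setpoint maximising the assumed utility:", chosen)

In the study, such a utility function would be estimated from occupants' observed interactions with the interface; here the parameters are simply fixed by hand for illustration.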
Finally, this thesis presents an approach based on meta-reinforcement learning (Meta-RL) that enables autonomous optimal building controls with minimal engineering effort. In reinforcement learning (RL), the controller acts as an agent that executes control actions in response to the real-time building system status and exogenous disturbances according to a policy. The agent has the ability to update the policy towards improving energy efficiency and occupant satisfaction based on the previously achieved control performance. In order to ensure satisfactory performance upon deployment to a target building, the agent is first trained with the Meta-RL algorithm using a model universe obtained from available building information, i.e., a probability measure over the possible building dynamical models. Starting from what is learned in the training process, the agent then fine-tunes the policy to adapt to the target building based on on-site observations. The control performance and adaptability of the Meta-RL agent are evaluated using an emulator of a private office space over 3 summer months. For the system and climate under consideration, the Meta-RL agent successfully maintains the indoor air temperature within the first week and results in only 16% higher energy consumption in the 3rd month than MPC, which serves as the theoretical upper performance bound. It also significantly outperforms agents trained with a conventional RL approach.
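The sketch below illustrates the train-on-a-model-universe-then-fine-tune idea with a first-order meta-learning (Reptile-style) loop on a toy regression stand-in for the building dynamics; the "building model" distribution, network, and hyperparameters are illustrative assumptions, and the actual Meta-RL algorithm used in the thesis is not reproduced here.

    import copy
    import torch
    import torch.nn as nn

    # A "task" is one plausible building model; here: predict next temperature from (temp, input).
    def sample_task():
        a, b = 0.8 + 0.15 * torch.rand(1), 0.2 + 0.2 * torch.rand(1)
        def data(n=64):
            x = torch.rand(n, 2) * torch.tensor([10.0, 5.0]) + torch.tensor([18.0, 0.0])
            y = a * x[:, :1] + b * x[:, 1:] + 0.05 * torch.randn(n, 1)
            return x, y
        return data

    meta_model = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
    meta_lr, inner_lr, inner_steps = 0.1, 1e-2, 5

    for _ in range(200):                                   # meta-training over the model universe
        task = sample_task()
        model = copy.deepcopy(meta_model)
        opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                       # adapt to the sampled building model
            x, y = task()
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
        # Reptile meta-update: move the meta-parameters towards the adapted parameters
        with torch.no_grad():
            for p_meta, p in zip(meta_model.parameters(), model.parameters()):
                p_meta += meta_lr * (p - p_meta)

    # Deployment would repeat the inner adaptation loop, but on on-site observations
    # from the target building, starting from the meta-learned initialisation.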
447
Hybrid Station-Keeping Controller Design Leveraging Floquet Mode and Reinforcement Learning Approaches. Andrew Blaine Molnar (9746054), 15 December 2020.
The general station-keeping problem is a focal topic when considering any spacecraft mission application. Recent missions increasingly require complex trajectories to satisfy mission requirements, necessitating accurate station-keeping controllers. An ideal controller reliably corrects for spacecraft state error, minimizes the required propellant, and is computationally efficient. To that end, this investigation assesses the effectiveness of several controller formulations in the circular restricted three-body model. In particular, a spacecraft is positioned in an L1 southern halo orbit within the Sun-Earth/Moon barycenter system. To prevent the spacecraft from departing the vicinity of this reference halo orbit, the Floquet mode station-keeping approach is introduced and evaluated. While this control strategy generally succeeds in the station-keeping objective, a breakdown in performance is observed proportional to increases in state error. Therefore, a new hybrid controller is developed which leverages Floquet mode and reinforcement learning. The hybrid controller is observed to efficiently determine corrective maneuvers that consistently recover the reference orbit for all evaluated scenarios. A comparative analysis of the performance metrics of both control strategies is conducted, highlighting differences in the rates of success and the expected propellant costs. The performance comparison demonstrates a relative improvement in the ability of the hybrid controller to meet the mission objectives, and suggests the applicability of reinforcement learning to the station-keeping problem.
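To give a flavour of the Floquet-mode idea referenced above, the sketch below decomposes a state error along the eigenvectors of a placeholder monodromy matrix and computes the minimum-norm impulsive velocity change that cancels the unstable component. The matrix, spectrum, and error vector are illustrative stand-ins; in the actual problem the monodromy matrix comes from integrating the state transition matrix over one period of the reference halo orbit, and the hybrid controller adds a reinforcement learning layer not shown here.

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder monodromy matrix with a halo-like spectrum: one unstable, one stable,
    # two unit eigenvalues, and a remaining pair replaced by real values for simplicity.
    lams = np.array([2.5, 1 / 2.5, 1.0, 1.0, 0.9, 1 / 0.9])
    V = np.linalg.qr(rng.standard_normal((6, 6)))[0]       # placeholder eigenvector basis
    M = V @ np.diag(lams) @ V.T

    eigvals, eigvecs = np.linalg.eig(M)
    k = int(np.argmax(np.abs(eigvals)))                    # dominant (unstable) mode
    w_u = np.linalg.inv(eigvecs)[k]                        # corresponding left eigenvector

    delta_x = np.array([1e-4, -2e-4, 5e-5, 1e-6, -3e-6, 2e-6])   # position/velocity error

    # Find dv (velocity components only) such that the corrected error has no unstable
    # component:  w_u @ (delta_x + B @ dv) = 0,  with B selecting the velocity states.
    B = np.zeros((6, 3)); B[3:, :] = np.eye(3)
    a = w_u @ B
    c_u = w_u @ delta_x
    dv = np.real(-np.conj(a) * c_u / (a @ np.conj(a)))     # minimum-norm solution

    print("corrective delta-v:", dv)
    print("unstable component after the burn:", abs(w_u @ (delta_x + B @ dv)))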
448
A framework for training Spiking Neural Networks using Evolutionary Algorithms and Deep Reinforcement Learning. Anirudh Shankar (10276349), 12 March 2021.
In this work, two novel frameworks for training Spiking Neural Networks (SNNs), one using evolutionary algorithms and another using Reinforcement Learning, are proposed and analyzed. A novel multi-agent evolutionary robotics (ER) based framework, inspired by competitive evolutionary environments in nature, is demonstrated for training SNNs. The weights of a population of SNNs, along with morphological parameters of the bots they control in the ER environment, are treated as phenotypes. Rules of the framework select certain bots and their SNNs for reproduction and others for elimination based on their efficacy in capturing food in a competitive environment. While the bots and their SNNs are given no explicit reward to survive or reproduce via any loss function, these drives emerge implicitly as they evolve to hunt food and survive within these rules. Their efficiency in capturing food as a function of generations exhibits the evolutionary signature of punctuated equilibria. Two evolutionary inheritance algorithms on the phenotypes, Mutation and Crossover with Mutation, along with their variants, are demonstrated. Performances of these algorithms are compared using ensembles of 100 experiments for each algorithm. We find that one of the Crossover with Mutation variants promotes 40% faster learning in the SNN than mere Mutation, with a statistically significant margin. Along with the evolutionary approach to training SNNs, we also describe a novel Reinforcement Learning (RL) based framework using Proximal Policy Optimization to train an SNN for an image classification task. The experiments and results of the framework are then discussed, highlighting future directions of the work.
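The two inheritance operators discussed above (Mutation, and Crossover with Mutation) can be written compactly for populations of weight vectors. The sketch below applies them to a toy fitness function instead of the evolutionary-robotics food-capture environment; the population size, rates, and the fitness itself are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    pop_size, n_weights, mut_std, elite_frac = 50, 200, 0.05, 0.2

    def fitness(weights):
        # Stand-in for "food captured by the bot this SNN controls" in the ER environment.
        return -np.sum((weights - 0.5) ** 2)

    def mutate(parent):
        return parent + mut_std * rng.standard_normal(parent.shape)

    def crossover_with_mutation(p1, p2):
        mask = rng.random(p1.shape) < 0.5          # uniform crossover of the two parents
        return mutate(np.where(mask, p1, p2))

    population = rng.standard_normal((pop_size, n_weights))
    for generation in range(100):
        scores = np.array([fitness(ind) for ind in population])
        elite = population[np.argsort(scores)[::-1][: int(elite_frac * pop_size)]]
        children = []
        while len(children) < pop_size - len(elite):
            p1, p2 = elite[rng.integers(len(elite))], elite[rng.integers(len(elite))]
            children.append(crossover_with_mutation(p1, p2))
        population = np.vstack([elite, np.array(children)])

    print("best fitness after evolution:", fitness(population[0]))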
449
Regularized Greedy Gradient Q-Learning with Mobile Health Applications. Lu, Xiaoqi, January 2021.
Recent advances in health and technology have made mobile apps a viable approach to delivering behavioral interventions in areas including physical activity encouragement, smoking cessation, substance abuse prevention, and mental health management. Due to the chronic nature of most of the disorders and the heterogeneity among mobile users, delivery of the interventions needs to be sequential and tailored to individual needs. We operationalize the sequential decision making via a policy that takes a mobile user's past usage pattern and health status as input and outputs an app/intervention recommendation, with the goal of optimizing the cumulative rewards of interest in an indefinite-horizon setting. There is a plethora of reinforcement learning methods for the development of optimal policies in this case. However, the vast majority of the literature focuses on studying the convergence of the algorithms with an infinite amount of data in the computer science domain. Their performance in health applications with limited amounts of data and high noise is yet to be explored. Technically, the nature of sequential decision making results in an objective function that is non-smooth (not even Lipschitz) and non-convex in the model parameters. This poses theoretical challenges for the characterization of the asymptotic properties of the optimizer of the objective function, as well as computational challenges for optimization. This problem is especially exacerbated by the presence of high-dimensional data in mobile health applications.
In this dissertation we propose a regularized greedy gradient Q-learning (RGGQ) method to tackle this estimation problem. The optimal policy is estimated via an algorithm which synthesizes the PGM and the GGQ algorithms in the presence of an L₁ regularization, and its asymptotic properties are established. The theoretical framework initiated in this work can be applied to tackle other non-smooth high dimensional problems in reinforcement learning.
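A much-simplified sketch of the two ingredients the abstract combines: a Q-learning update with linear function approximation followed by an L1 proximal (soft-thresholding) step that drives irrelevant feature weights towards zero. The toy two-action problem, features, and step sizes are illustrative; the full RGGQ estimator, with its gradient-correction terms and asymptotic analysis, is considerably more involved.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_actions, alpha, gamma, lam = 10, 2, 0.05, 0.9, 0.01

    def phi(state, action):
        """Feature map; only the first three features carry signal, the rest are nuisance."""
        f = np.zeros(n_features)
        f[:3] = state * (1.0 if action == 1 else -1.0)
        f[3:] = 0.01 * np.cos(np.arange(3, n_features) * state.sum())
        return f

    def soft_threshold(w, t):
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

    w = np.zeros(n_features)
    state = rng.standard_normal(3)
    for _ in range(2000):
        action = int(rng.integers(n_actions))                 # exploratory behaviour policy
        reward = float(state.sum()) * (1.0 if action == 1 else -1.0) + 0.1 * rng.standard_normal()
        next_state = rng.standard_normal(3)
        q_next = max(phi(next_state, a) @ w for a in range(n_actions))   # greedy bootstrap
        td_error = reward + gamma * q_next - phi(state, action) @ w
        w = w + alpha * td_error * phi(state, action)          # Q-learning step (linear FA)
        w = soft_threshold(w, alpha * lam)                     # L1 proximal (soft-thresholding) step
        state = next_state

    print(np.round(w, 3))   # informative weights grow while nuisance weights stay near zero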
450
[pt] COORDENAÇÃO INTELIGENTE PARA MULTIAGENTES BASEADOS EM MODELOS NEURO-FUZZY HIERÁRQUICOS COM APRENDIZADO POR REFORÇO / [en] INTELLIGENT COORDINATION FOR MULTIAGENTS BASED ON HIERARCHICAL NEURO-FUZZY MODELS WITH REINFORCEMENT LEARNING. 08 November 2018.
[pt] Esta tese consiste na investigação e no desenvolvimento de estratégias de coordenação inteligente que possam ser integradas a modelos neuro-fuzzy hierárquicos para sistemas de múltiplos agentes em ambientes complexos. Em ambientes dinâmicos ou complexos a organização dos agentes deve se adaptar a mudanças nos objetivos do sistema, na disponibilidade de recursos, nos relacionamentos entre os agentes, e assim por diante. Esta flexibilidade é um problema chave nos sistemas multiagente. O objetivo principal dos modelos propostos é fazer com que múltiplos agentes interajam de forma inteligente entre si em sistemas complexos. Neste trabalho foram desenvolvidos dois novos modelos inteligentes neuro-fuzzy hierárquicos com mecanismo de coordenação para sistemas multiagentes, a saber: modelo Neuro-Fuzzy Hierárquico com Aprendizado por Reforço com mecanismo de coordenação Market-Driven (RL-NFHP-MA-MD); e o Modelo Neuro-Fuzzy Hierárquico com Aprendizado por Reforço com modelo de coordenação por grafos (RL-NFHP-MA-CG). A inclusão de modelos de coordenação ao modelo Neuro-Fuzzy Hierárquicos com Aprendizado por Reforço (RL-NHFP-MA) foi motivada principalmente pela importância de otimizar o desempenho do trabalho em conjunto dos agentes, melhorando os resultados do modelo e visando aplicações mais complexas. Os modelos foram concebidos a partir do estudo das limitações existentes nos modelos atuais e das características desejáveis para sistemas de aprendizado baseados em RL, em particular quando aplicados a ambientes contínuos e/ou ambientes considerados de grande dimensão. Os modelos desenvolvidos foram testados através de basicamente dois estudos de caso: a aplicação benchmark do jogo da presa-predador (Pursuit-Game) e Futebol de robôs (simulado e com agentes robóticos). Os resultados obtidos tanto no jogo da presa-predador quanto no futebol de robô através dos novos modelos RL-NFHP-MA-MD e RL-NFHP-MA-CG para múltiplos agentes se mostraram bastante promissores. Os testes demonstraram que o novo sistema mostrou capacidade de coordenar as ações entre agentes com uma velocidade de convergência quase 30 por cento maior que a versão original. Os resultados de futebol de robô foram obtidos com o modelo RL-NFHP-MA-MD e o modelo RL-NFHP-MA-CG, os resultados são bons em jogos completos como em jogadas específicas, ganhando de times desenvolvidos com outros modelos similares.
/ [en] This thesis consists of the investigation and development of intelligent coordination strategies that can be integrated into hierarchical neuro-fuzzy models for multi-agent systems in complex environments. In dynamic or complex environments, the organization of the agents must adapt to changes in the objectives of the system, in the availability of resources, in the relationships between the agents, and so on. This flexibility is a key problem in multi-agent systems. The main objective of the proposed models is to make multiple agents interact intelligently with each other in complex systems. In this work, two new intelligent hierarchical neuro-fuzzy models with coordination mechanisms for multi-agent systems were developed, namely the Hierarchical Neuro-Fuzzy model with Reinforcement Learning and Market-Driven coordination (RL-NFHP-MA-MD), and the Hierarchical Neuro-Fuzzy model with Reinforcement Learning and coordination by graphs (RL-NFHP-MA-CG). The inclusion of coordination models in the Hierarchical Neuro-Fuzzy model with Reinforcement Learning (RL-NHFP-MA) was primarily motivated by the importance of optimizing how the agents work together, improving the model's results and targeting more complex applications. The models were designed based on a study of the limitations of current models and of the desirable features of RL-based learning systems, in particular when applied to continuous environments and/or environments of large dimension. The developed models were tested through essentially two case studies: the benchmark predator-prey application (Pursuit Game) and robot soccer (simulated and with robotic agents). The results obtained with the new multi-agent models RL-NFHP-MA-MD and RL-NFHP-MA-CG, both in the predator-prey game and in robot soccer, proved quite promising. The tests showed that the new system is able to coordinate the actions of the agents with a convergence speed nearly 30 percent higher than the original version. The robot soccer results were obtained with the RL-NFHP-MA-MD and RL-NFHP-MA-CG models; the results are good both in complete games and in specific plays, defeating teams developed with other similar models.
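As a tiny illustration of what a market-driven coordination mechanism does: each agent bids its estimated cost for each task, and tasks are awarded to the cheapest free bidder. The cost matrix and greedy assignment below are illustrative; the thesis's RL-NFHP-MA-MD mechanism operates on learned neuro-fuzzy policies rather than a fixed cost table.

    import numpy as np

    rng = np.random.default_rng(0)
    n_agents, n_tasks = 3, 3
    bids = rng.uniform(1.0, 10.0, size=(n_agents, n_tasks))   # bids[i, j]: agent i's cost for task j

    assignment = {}
    free_agents = set(range(n_agents))
    for task in range(n_tasks):
        winner = min(free_agents, key=lambda i: bids[i, task])   # lowest-cost free agent wins
        assignment[task] = winner
        free_agents.remove(winner)

    print(assignment)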