1

Learning Strategies in Multi-Agent Systems - Applications to the Herding Problem

Gadre, Aditya Shrikant 14 December 2001 (has links)
"Multi-Agent systems" is a topic for a lot of research, especially research involving strategy, evolution and cooperation among various agents. Various learning algorithm schemes have been proposed such as reinforcement learning and evolutionary computing. In this thesis two solutions to a multi-agent herding problem are presented. One solution is based on Q-learning algorithm, while the other is based on modeling of artificial immune system. Q-learning solution for the herding problem is developed, using region-based local learning for each individual agent. Individual and batch processing reinforcement algorithms are implemented for non-cooperative agents. Agents in this formulation do not share any information or knowledge. Issues such as computational requirements, and convergence are discussed. An idiotopic artificial immune network is proposed that includes individual B-cell model for agents and T-cell model for controlling the interaction among these agents. Two network models are proposed--one for evolving group behavior/strategy arbitration and the other for individual action selection. A comparative study of the Q-learning solution and the immune network solution is done on important aspects such as computation requirements, predictability, and convergence. / Master of Science
2

Learning the Parameters of Reinforcement Learning from Data for Adaptive Spoken Dialogue Systems / Apprentissage automatique des paramètres de l'apprentissage par renforcement pour les systèmes de dialogues adaptatifs

Asri, Layla El 21 January 2016 (has links)
This thesis proposes to learn the behaviour of the dialogue manager of a spoken dialogue system from a set of rated dialogues. This learning is performed through reinforcement learning. Our method requires neither a hand-crafted representation of the state space nor a hand-crafted reward function: these two high-level parameters are learnt from the corpus of rated dialogues. It is shown that the spoken dialogue designer can optimise dialogue management by simply defining the dialogue logic and a criterion to maximise (e.g. user satisfaction). The methodology suggested in this thesis first considers the dialogue parameters that are necessary to compute a representation of the state space relevant to the criterion to be maximised. For instance, if the chosen criterion is user satisfaction, then it is important to account for parameters such as dialogue duration and the average speech recognition confidence score. The state space is represented as a sparse distributed memory. The Genetic Sparse Distributed Memory for Reinforcement Learning (GSDMRL) accommodates many dialogue parameters and selects, through genetic evolution, the parameters that are most important for learning. The resulting state space and the policy learnt on it are easily interpretable by the system designer. Secondly, the rated dialogues are used to learn a reward function which teaches the system to optimise the criterion. Two algorithms, reward shaping and distance minimisation, are proposed to learn the reward function; both consider the criterion to be the return for the entire dialogue. These functions are compared on simulated dialogues, and it is shown that they enable faster learning than using the criterion directly as the final reward. A spoken dialogue system for appointment scheduling was designed during this thesis, based on previous systems, and a corpus of rated dialogues with this system was collected. This corpus illustrates the scaling capability of the state space representation and is a good example of an industrial spoken dialogue system upon which the proposed methodology could be applied.
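The contrast between delivering the rated criterion only at the end of a dialogue and spreading it over turns can be pictured as follows. This is a hedged illustration of the general idea, not the thesis's reward shaping or distance minimisation algorithms; estimate_potential is a hypothetical stand-in for a learned per-turn estimate of dialogue quality.

    GAMMA = 0.99  # assumed discount factor for the example

    def final_reward_only(dialogue_turns, rating):
        # Baseline: the whole dialogue-level rating arrives as the last turn's reward.
        rewards = [0.0] * len(dialogue_turns)
        rewards[-1] = rating
        return rewards

    def shaped_rewards(dialogue_turns, rating, estimate_potential):
        # Potential-based shaping: r'_t = r_t + gamma * phi(s_{t+1}) - phi(s_t),
        # which leaves optimal policies unchanged while giving earlier feedback.
        rewards = final_reward_only(dialogue_turns, rating)
        phis = [estimate_potential(turn) for turn in dialogue_turns] + [0.0]
        return [r + GAMMA * phis[t + 1] - phis[t] for t, r in enumerate(rewards)]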
3

Measuring and Influencing Sequential Joint Agent Behaviours

Raffensperger, Peter Abraham January 2013 (has links)
Algorithmically designed reward functions can influence groups of learning agents toward measurable desired sequential joint behaviours. Influencing learning agents toward desirable behaviours is non-trivial due to the difficulties of assigning credit for global success to the deserving agents and of inducing coordination. Quantifying joint behaviours lets us identify global success by ranking some behaviours as more desirable than others. We propose a real-valued metric for turn-taking, demonstrating how to measure one sequential joint behaviour. We describe how to identify the presence of turn-taking in simulation results and we calculate the quantity of turn-taking that could be observed between independent random agents. We demonstrate our turn-taking metric by reinterpreting previous work on turn-taking in emergent communication and by analysing a recorded human conversation. Given a metric, we can explore the space of reward functions and identify those reward functions that result in global success in groups of learning agents. We describe 'medium access games' as a model for human and machine communication and we present simulation results for an extensive range of reward functions for pairs of Q-learning agents. We use the Nash equilibria of medium access games to develop predictors for determining which reward functions result in turn-taking. Having demonstrated the predictive power of Nash equilibria for turn-taking in medium access games, we focus on synthesis of reward functions for stochastic games that result in arbitrary desirable Nash equilibria. Our method constructs a reward function such that a particular joint behaviour is the unique Nash equilibrium of a stochastic game, provided that such a reward function exists. This method builds on techniques for designing rewards for Markov decision processes and for normal form games. We explain our reward design methods in detail and formally prove that they are correct.
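As one concrete (and deliberately crude) illustration of quantifying a sequential joint behaviour, the snippet below scores turn-taking as the fraction of time steps in which exactly one agent uses a shared resource. This is not the metric defined in the thesis; it only shows the kind of real-valued measurement the abstract refers to.

    def turn_taking_score(usage_sequences):
        # usage_sequences: one 0/1 activity list per agent, all the same length.
        steps = list(zip(*usage_sequences))
        hits = [1.0 if sum(step) == 1 else 0.0 for step in steps]
        return sum(hits) / len(hits) if hits else 0.0

    # Two agents alternating perfectly on a shared medium score 1.0.
    print(turn_taking_score([[1, 0, 1, 0], [0, 1, 0, 1]]))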
4

Evolution of reward functions for reinforcement learning applied to stealth games

Mendonça, Matheus Ribeiro Furtado de January 2016 (has links)
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Many modern games contain stealth elements that allow the player to accomplish an objective without being spotted by enemy patrols. This gave rise to a new genre called stealth games, in which covertness plays a major role. Although quite popular in modern games, stealthy behavior has not been studied extensively. In this work, we tackle three distinct problems: (i) how to use a machine learning approach that allows the stealthy agent to learn good behaviors for any environment, (ii) how to create an efficient stealthy path-planning method that can be coupled with our machine learning formulation, and (iii) how to use evolutionary computing to define specific parameters for our machine learning approach without any prior knowledge of the problem. We use reinforcement learning to learn covert behavior capable of achieving a high success rate in random trials of a stealth game. We also propose an evolutionary approach that automatically defines a good reward function for the reinforcement learning approach.
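The coupling between evolutionary computing and the reinforcement learner can be pictured with the toy loop below: a small vector of reward weights is evolved, and each candidate's fitness is the success rate of an agent trained with that reward. The parameterisation, selection scheme, and the train_and_evaluate callback are assumptions for illustration, not the thesis's implementation.

    import random

    def evolve_reward_weights(train_and_evaluate, n_weights=4, pop_size=10,
                              generations=20, mutation_std=0.1):
        # train_and_evaluate(weights) is assumed to train an RL agent using the
        # weighted reward terms and return its success rate over random trials.
        population = [[random.uniform(-1.0, 1.0) for _ in range(n_weights)]
                      for _ in range(pop_size)]
        for _ in range(generations):
            ranked = sorted(population, key=train_and_evaluate, reverse=True)
            parents = ranked[:pop_size // 2]          # truncation selection
            children = [[w + random.gauss(0.0, mutation_std)
                         for w in random.choice(parents)]
                        for _ in range(pop_size - len(parents))]
            population = parents + children
        return max(population, key=train_and_evaluate)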
5

Benchmarking Deep Reinforcement Learning on Continuous Control Tasks : A Comparison of Neural Network Architectures and Environment Designs / Prestandajämförelse av djup förstärkningsinlärning för kontinuerliga system : En jämförelse av neurala nätverksarkitekturer och miljödesigner

Sahlin, Daniel January 2022 (has links)
Deep Reinforcement Learning (RL) has received much attention in recent years. This thesis investigates how reward functions, environment termination conditions, Neural Network (NN) architectures, and the type of deep RL algorithm affect performance on continuous control tasks. To this end, the Furuta pendulum swing-up task is adopted as the primary benchmark, since it offers low input and state dimensionality without being trivial. Focusing on model-free algorithms, the results indicate that DDPG, an actor-critic algorithm, performs significantly better than the other algorithms. They also suggest that larger NN architectures may benefit performance in some instances. Comparing reward functions, Potential Based Reward Shaping (PBRS) applied to a sparse reward signal shows promising results compared to a reward function from previous work, and combining PBRS with large negative rewards for terminations due to unwanted behavior seems to improve performance for some algorithms. However, although designs such as PBRS can improve performance, they are shown not to be necessary for achieving adequate performance, and the same applies to environment terminations upon unwanted behavior. Applying a DDPG agent trained in a simulator to a physical Furuta pendulum results in performance that closely resembles what is observed in the simulator for certain training seeds. The results and test suite of this thesis are available on GitHub and should hopefully help inspire future research in environment design and NN architectures for deep RL. Specifically, future work may investigate whether extensive parameter tuning alters the results.
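To make the reward-design comparison concrete, the sketch below combines a sparse swing-up reward, a potential-based shaping term, and a large negative reward on termination due to unwanted behaviour, in the spirit of the designs compared above. The specific potential, tolerance, and penalty values are assumptions for illustration, not the thesis's exact reward functions.

    import math

    GAMMA = 0.99
    UPRIGHT_TOLERANCE = 0.1        # radians from upright counted as success (assumed)
    TERMINATION_PENALTY = -100.0   # assumed penalty for unwanted termination

    def wrap(angle):
        # Wrap an angle (measured from upright) into [-pi, pi).
        return ((angle + math.pi) % (2.0 * math.pi)) - math.pi

    def potential(angle):
        # Higher (less negative) the closer the pendulum is to upright (angle 0).
        return -abs(wrap(angle))

    def shaped_reward(angle, next_angle, terminated_badly):
        sparse = 1.0 if abs(wrap(next_angle)) < UPRIGHT_TOLERANCE else 0.0
        shaping = GAMMA * potential(next_angle) - potential(angle)
        penalty = TERMINATION_PENALTY if terminated_badly else 0.0
        return sparse + shaping + penalty

Shaping terms of this potential-based form leave the set of optimal policies unchanged, which is one reason PBRS is attractive when the underlying task reward is sparse.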
