1 |
Reinforcement Learning and Simulation-Based Search in Computer Go. Silver, David.
Learning and planning are two fundamental problems in artificial intelligence. The learning problem can be tackled by reinforcement learning methods, such as temporal-difference learning, which update a value function from real experience, and use function approximation to generalise across states. The planning problem can be tackled by simulation-based search methods, such as Monte-Carlo tree search, which update a value function from simulated experience, but treat each state individually. We introduce a new method, temporal-difference search, that combines elements of both reinforcement learning and simulation-based search methods. In this new method the value function is updated from simulated experience, but it uses function approximation to efficiently generalise across states. We also introduce the Dyna-2 architecture, which combines temporal-difference learning with temporal-difference search. Whereas temporal-difference learning acquires general domain knowledge from its past experience, temporal-difference search acquires local knowledge that is specialised to the agent's current state, by simulating future experience. Dyna-2 combines both forms of knowledge together.
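As a concrete illustration of the idea, the following is a minimal sketch of a temporal-difference search update, assuming a linear value function over sparse binary features; the `simulate_step` and `features` interfaces and all parameter values are illustrative placeholders, not the thesis's implementation.

```python
import numpy as np

def td_search(root_state, simulate_step, features, num_features,
              n_episodes=1000, alpha=0.1, gamma=1.0):
    """Minimal temporal-difference search sketch: learn a linear value
    function from simulated episodes that all start at the current state.
    simulate_step(state) -> (next_state, reward, done) and
    features(state) -> list of active binary feature indices are
    hypothetical interfaces, not the thesis's code."""
    theta = np.zeros(num_features)              # weights of the linear value function

    def value(state):
        return theta[features(state)].sum()     # dot product with a sparse binary vector

    for _ in range(n_episodes):
        state, done = root_state, False
        while not done:
            next_state, reward, done = simulate_step(state)
            target = reward if done else reward + gamma * value(next_state)
            delta = target - value(state)       # TD error on simulated experience
            theta[features(state)] += alpha * delta   # shared weights generalise across states
            state = next_state
    return theta
```

The point of the sketch is the contrast drawn in the abstract: the TD error updates shared weights rather than per-node statistics, so experience gained in one simulated position generalises to similar positions.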
We apply our algorithms to the game of 9x9 Go. Using temporal-difference learning, with a million binary features matching simple patterns of stones, and using no prior knowledge except the grid structure of the board, we learnt a fast and effective evaluation function. Using temporal-difference search with the same representation produced a dramatic improvement: without any explicit search tree, and with equivalent domain knowledge, it achieved better performance than a vanilla Monte-Carlo tree search. When combined together using the Dyna-2 architecture, our program outperformed all handcrafted, traditional search, and traditional machine learning programs on the 9x9 Computer Go Server.
We also use our framework to extend the Monte-Carlo tree search algorithm. By forming a rapid generalisation over subtrees of the search space, and incorporating heuristic pattern knowledge that was learnt or handcrafted offline, we were able to significantly improve the performance of the Go program MoGo. Using these enhancements, MoGo became the first 9x9 Go program to achieve human master level.
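One common way to inject offline-learned heuristic knowledge into Monte-Carlo tree search is to seed each node's statistics with a prior value before any simulations pass through it; the sketch below shows that idea only, with invented parameter names and values, and does not reproduce MoGo's actual enhancements (such as the rapid generalisation over subtrees mentioned above).

```python
import math

class PriorNode:
    """UCT-style node whose statistics are seeded with a heuristic prior,
    one common way to inject offline pattern knowledge into Monte-Carlo
    tree search; parameter names and values are illustrative only."""
    def __init__(self, prior_value=0.5, prior_weight=10.0):
        self.visits = prior_weight               # virtual visits contributed by the prior
        self.total = prior_value * prior_weight  # virtual wins contributed by the prior
        self.children = {}                       # move -> PriorNode

    def value(self):
        return self.total / self.visits

    def ucb_score(self, parent_visits, c=1.0):
        # Exploration term shrinks as real and virtual visits accumulate.
        return self.value() + c * math.sqrt(math.log(parent_visits) / self.visits)

    def update(self, outcome):
        # Backpropagate one simulation result (e.g. 1 for a win, 0 for a loss).
        self.visits += 1
        self.total += outcome
```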
|
3 |
Gradient Temporal-Difference Learning Algorithms. Maei, Hamid Reza. Unknown Date.
No description available.
|
4 |
Reinforcement learning: theory, methods and application to decision support systems. Mouton, Hildegarde Suzanne.
Thesis (MSc (Applied Mathematics))--University of Stellenbosch, 2010.
ENGLISH ABSTRACT: In this dissertation we study the machine learning subfield of Reinforcement Learning (RL). After developing a coherent background, we apply a Monte Carlo (MC) control algorithm with exploring starts (MCES), as well as an off-policy Temporal-Difference (TD) learning control algorithm, Q-learning, to a simplified version of the Weapon Assignment (WA) problem.
For the MCES control algorithm, a discount parameter of τ = 1 is used. This gives very promising results when applied to 7 × 7 grids, as well as to 71 × 71 grids. The same discount parameter cannot be applied to the Q-learning algorithm, as it causes the Q-values to diverge. We take a greedy approach, setting ε = 0, and vary the learning rate (α) and the discount parameter (τ). Experimentation shows that the best results are obtained with α set to 0.1 and τ constrained to the region 0.4 ≤ τ ≤ 0.7.
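As a point of reference, the following is a minimal tabular Q-learning sketch using the parameter settings reported above (α = 0.1, the abstract's discount parameter τ written here as gamma and kept inside 0.4 to 0.7, and a greedy policy with ε = 0); the environment interface is a placeholder, since the WA problem's states, actions and rewards are not specified in the abstract.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=5000,
               alpha=0.1, gamma=0.6, epsilon=0.0):
    """Illustrative tabular Q-learning with the parameter ranges reported in
    the abstract (alpha = 0.1, gamma between 0.4 and 0.7, greedy epsilon = 0).
    `env` is a placeholder with reset() -> state and step(a) -> (s', r, done)."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))   # never taken when epsilon = 0
            else:
                a = int(np.argmax(Q[s]))           # greedy action selection
            s2, r, done = env.step(a)
            target = r if done else r + gamma * np.max(Q[s2])
            Q[s, a] += alpha * (target - Q[s, a])  # standard Q-learning update
            s = s2
    return Q
```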
The MC control algorithm with exploring starts gives promising results when applied to the WA problem. It performs significantly better than the off-policy TD algorithm, Q-learning, even though it is almost twice as slow.
The modern battlefield is a fast-paced, information-rich environment, where discovery of intent, situation awareness and the rapid evolution of concepts of operation and doctrine are critical success factors. Combining the techniques investigated and tested in this work with other techniques in Artificial Intelligence (AI) and modern computational techniques may hold the key to solving some of the problems we now face in warfare.
|
5 |
Mathematical Description of Differential Hebbian Plasticity and its Relation to Reinforcement Learning / Mathematische Beschreibung Hebb'scher Plastizität und deren Beziehung zu Bestärkendem Lernen. Kolodziejski, Christoph Markus. 13 February 2009.
No description available.
|
6 |
A formal investigation of dopamine's role in Attention-Deficit/Hyperactive Disorder: evidence for asymmetrically effective reinforcement learning signals. Cockburn, Jeffrey. 14 January 2010.
Attention-Deficit/Hyperactive Disorder (ADHD) is a well-studied but poorly understood disorder. Given that the underlying neurological mechanisms involved in the disorder have yet to be established, diagnosis is dependent upon behavioural markers. However, recent research has begun to associate a dopamine system dysfunction with ADHD, though consensus on the nature of dopamine's role in ADHD has yet to be established. Here, I use a computational modelling approach to investigate two opposing theories of the dopaminergic dysfunction in ADHD. The hyper-active dopamine theory posits that ADHD is associated with a midbrain dopamine system that produces abnormally large prediction error signals, whereas the dynamic developmental theory argues that abnormally small prediction errors give rise to ADHD. Given that these two theories center on the size of the prediction errors encoded by the midbrain dopamine system, I have formally investigated the implications of each theory within the framework of temporal-difference learning, a reinforcement learning algorithm demonstrated to model midbrain dopamine activity. The results presented in this thesis suggest that neither theory provides a good account of the behaviour of children and animal models of ADHD. Instead, my results suggest that ADHD is the result of asymmetrically effective reinforcement learning signals encoded by the midbrain dopamine system. More specifically, the model presented here reproduced behaviours associated with ADHD when positive prediction errors were more effective than negative prediction errors. The biological sources of this asymmetry are considered, as are other computational models of ADHD.
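A minimal sketch of what an asymmetrically effective learning signal can look like in a temporal-difference update is given below; the specific learning rates are assumptions chosen for illustration, not values fitted in the thesis.

```python
import numpy as np

def asymmetric_td_update(V, s, r, s_next, done,
                         alpha_pos=0.2, alpha_neg=0.05, gamma=0.95):
    """Illustrative TD(0) update in which positive prediction errors are more
    effective than negative ones, the asymmetry the thesis associates with
    ADHD-like behaviour. The learning rates here are illustrative assumptions."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]                        # dopamine-like prediction error
    alpha = alpha_pos if delta > 0 else alpha_neg
    V[s] += alpha * delta                        # asymmetric weighting of the error
    return delta
```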
|
7 |
Hraní nedeterministických her s učením / Playing of Nondeterministic Games with Learning. Bukovský, Marek. January 2011.
The thesis is dedicated to the study and implementation of methods for learning from the course of play, with Backgammon as the chosen game. The algorithm used to train the neural networks is temporal-difference learning with eligibility traces, also known as TD(λ). The theoretical part describes algorithms for playing games without learning and introduces reinforcement learning, temporal-difference learning and artificial neural networks. The practical part deals with the application of the combination of neural networks and the TD(λ) algorithm.
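The following is a minimal sketch of the TD(λ) update with eligibility traces over a linear evaluator; the thesis trains a neural network instead, and the `features` and `transitions` interfaces here are illustrative assumptions.

```python
import numpy as np

def td_lambda_episode(features, transitions, theta,
                      alpha=0.01, gamma=1.0, lam=0.7):
    """Illustrative TD(lambda) update with eligibility traces over a linear
    evaluator, standing in for the neural-network evaluator trained in the
    thesis. `transitions` is a list of (state, reward, next_state, done) and
    `features(s)` returns a feature vector; both interfaces are assumptions."""
    e = np.zeros_like(theta)                 # eligibility trace vector
    for s, r, s_next, done in transitions:
        x = features(s)
        v = theta @ x
        v_next = 0.0 if done else theta @ features(s_next)
        delta = r + gamma * v_next - v       # TD error
        e = gamma * lam * e + x              # decay and accumulate traces
        theta += alpha * delta * e           # credit recently visited states too
    return theta
```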
|
8 |
Modèle informatique du coapprentissage des ganglions de la base et du cortex : l'apprentissage par renforcement et le développement de représentations / A computational model of co-learning between the basal ganglia and the cortex: reinforcement learning and the development of representations. Rivest, François.
Throughout its lifetime, the brain develops abstract representations of its environment that allow the individual to maximize benefit. How these representations are developed while trying to acquire rewards remains a mystery. It is reasonable to assume that these representations arise in the cortex and that the basal ganglia play an important role in reward maximization. In particular, dopaminergic neurons appear to encode a reward prediction error signal. This thesis studies the problem by constructing, using machine learning tools, a computational model that incorporates a number of relevant neurophysiological findings.
After an introduction to the machine learning framework and to some of its algorithms, an overview of learning in psychology and neuroscience, and a review of models of learning in the basal ganglia, the thesis comprises three papers. The first article shows that it is possible to learn a better representation of the inputs while learning to maximize reward. The second paper addresses the important and still unresolved problem of the representation of time in the brain. The paper shows that a time representation can be acquired automatically in an artificial neural network acting like a working memory. The representation learned by the model closely resembles the activity of cortical neurons in similar tasks. Moreover, the model shows that the reward prediction error signal could accelerate the development of the temporal representation. Finally, it shows that if such a learned representation exists in the cortex, it could provide the necessary information to the basal ganglia to explain the dopaminergic signal. The third article evaluates the explanatory and predictive power of the model on the effects of differences in task conditions such as the presence or absence of a stimulus (classical versus trace conditioning) while waiting for the reward. Beyond making interesting predictions relevant to the timing literature, the paper reveals some shortcomings of the model that will need to be resolved.
In summary, this thesis extends current models of reinforcement learning in the basal ganglia and the dopaminergic system to the concurrent development of representations in the cortex and to the interactions between these two regions.
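For context, the sketch below reproduces the classic temporal-difference account of the dopaminergic prediction-error signal with a fixed tapped-delay-line (complete serial compound) time code; the thesis's contribution is to replace such a hand-built time representation with one learned in a cortical working-memory network, so this is only the baseline it builds on, with illustrative parameters.

```python
import numpy as np

def td_dopamine_trace(n_steps=20, cue_t=2, reward_t=12,
                      alpha=0.1, gamma=0.98, n_trials=200):
    """Classic TD model of the dopaminergic prediction-error signal with a
    tapped-delay-line time code: one unit per delay since cue onset. All
    parameters are illustrative, not the thesis's fitted values."""
    w = np.zeros(n_steps)                     # one weight per delay since the cue
    for _ in range(n_trials):
        x_prev = np.zeros(n_steps)
        for t in range(n_steps):
            x = np.zeros(n_steps)
            if t >= cue_t:
                x[t - cue_t] = 1.0            # delay-line unit active since cue onset
            r = 1.0 if t == reward_t else 0.0
            delta = r + gamma * (w @ x) - (w @ x_prev)   # prediction error
            w += alpha * delta * x_prev       # update the weights of the previous step
            x_prev = x
    return w
```

With training, the prediction error modelled by `delta` migrates from the time of reward to the time of the cue, the signature behaviour attributed to dopaminergic neurons.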
|
10 |
MP-Draughts - Um Sistema Multiagente de Aprendizagem Automática para Damas Baseado em Redes Neurais de Kohonen e Perceptron Multicamadas / MP-Draughts - A Multiagent Machine Learning System for Draughts Based on Kohonen Neural Networks and a Multilayer Perceptron. Duarte, Valquíria Aparecida Rosa. 17 July 2009.
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
The goal of this work is to present MP-Draughts (MultiPhase-Draughts), a multiagent environment for Draughts in which one agent, named IIGA (Initial/Intermediate Game Agent), is built and trained to specialise in the initial and intermediate phases of the game, and the remaining agents in the final phases. Each agent of MP-Draughts is a neural network that learns almost without human supervision (unlike the world champion agent Chinook). MP-Draughts grew out of a continuing line of research whose previous product was the efficient agent VisionDraughts. Despite its good general performance, VisionDraughts frequently fails in the final phases of a game, even when it is in an advantageous position relative to its opponent (for instance, getting into endgame loops). In order to reduce this misbehaviour during endgames, MP-Draughts counts on 25 agents specialised for the endgame phases, each one trained to deal with a particular cluster of endgame board states. These 25 clusters are mined by a Kohonen-SOM network from a database containing a large quantity of endgame board states. Once trained, MP-Draughts operates in the following way: first, an optimised version of VisionDraughts is used as the IIGA; then, the endgame agent representing the cluster that best fits the current endgame board state replaces it up to the end of the game. This work shows that such a strategy significantly improves the general performance of the player agents.
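A minimal sketch of the routing idea, a self-organising map whose units act as prototypes of endgame board states and whose best-matching unit selects the specialised endgame agent, is given below; the map size, training schedule and board encoding are illustrative assumptions, not MP-Draughts' actual configuration.

```python
import numpy as np

def train_som(boards, n_units=25, epochs=20, lr=0.5, seed=0):
    """Tiny 1-D self-organising map: each unit's weight vector becomes a
    prototype endgame board, and its index identifies which specialised
    agent to call. A simplified stand-in for the Kohonen-SOM clustering
    described in the abstract."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_units, boards.shape[1]))
    for epoch in range(epochs):
        sigma = max(1.0, n_units / 2 * (1 - epoch / epochs))     # shrinking neighbourhood
        for x in boards[rng.permutation(len(boards))]:
            bmu = int(np.argmin(np.linalg.norm(W - x, axis=1)))  # best-matching unit
            dist = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-(dist ** 2) / (2 * sigma ** 2))          # neighbourhood kernel
            W += lr * h[:, None] * (x - W)                       # pull prototypes toward the sample
    return W

def route_to_agent(board, W):
    """Pick the endgame specialist whose prototype is closest to the board."""
    return int(np.argmin(np.linalg.norm(W - board, axis=1)))
```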
|