21 |
Situation-appropriate Investment of Cognitive Resources
Ott, Florian 29 March 2022
The human brain is equipped with the ability to plan ahead, i.e. to mentally simulate the expected consequences of candidate actions to select the one with the most desirable expected long-term outcome. Insufficient planning can lead to maladaptive behaviour and may even be a contributory cause of important societal problems such as the depletion of natural resources or man-made climate change. Understanding the cognitive and neural mechanisms of forward planning and its regulation is therefore of great importance and could ultimately give us clues on how to better align our behaviour with long-term goals.
Apart from its potential beneficial effects, planning is time-consuming and therefore associated with opportunity costs. It is assumed that the brain regulates the investment into planning based on a cost-benefit analysis, so that planning only takes place when the perceived benefits outweigh the costs. But how can the brain know in advance how beneficial or costly planning will be? One potential solution is that people learn from experience how valuable planning would be in a given situation. It is however largely unknown how the brain implements such learning, especially in environments with large state spaces.
This dissertation tested the hypothesis that humans construct and use so-called control contexts to efficiently adjust the degree of planning to the demands of the current situation. Control contexts can be seen as abstract state representations that conveniently cluster together situations with a similar demand for planning. Inferring the context thus allows the control system to be adjusted prospectively to the learned demands of the global context. To test the control context hypothesis, two complex sequential decision-making tasks were developed. Each of the two tasks had to fulfil two important criteria. First, the tasks had to generate both situations in which planning had the potential to improve performance and situations in which a simple strategy was sufficient. Second, the tasks had to feature rich state spaces requiring participants to compress their state representation for efficient regulation of planning. Participants’ planning was modelled using a parametrized dynamic programming solution to a Markov Decision Process, with parameters estimated via hierarchical Bayesian inference.
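As an illustration of what a parametrized dynamic programming solution to a Markov Decision Process can look like, the sketch below runs finite-horizon backward induction and turns the resulting action values into choice probabilities with a softmax. The function names, the array layout and the inverse-temperature parameter `beta` are assumptions made for this example and are not taken from the dissertation; the hierarchical Bayesian estimation of the parameters is not shown.

```python
import math

def backward_induction(P, R, horizon):
    """Finite-horizon dynamic programming on a small MDP.

    P[a][s][t]: probability of moving from state s to state t under action a.
    R[a][s]:    expected immediate reward for taking action a in state s.
    Returns Q, where Q[k][a][s] is the value of action a in state s
    at step k (k = 0 is the first decision, with `horizon` steps left).
    """
    n_actions, n_states = len(R), len(R[0])
    V = [0.0] * n_states              # value after the final step
    Q = []
    for _ in range(horizon):
        # One Bellman backup: immediate reward plus expected future value.
        q_step = [[R[a][s] + sum(P[a][s][t] * V[t] for t in range(n_states))
                   for s in range(n_states)]
                  for a in range(n_actions)]
        V = [max(q_step[a][s] for a in range(n_actions))
             for s in range(n_states)]
        Q.insert(0, q_step)           # earlier steps go to the front
    return Q

def choice_probabilities(q_values, beta):
    """Softmax over action values.

    beta is a hypothetical free parameter: beta = 0 gives random choice,
    large beta gives choices that follow the planned values closely.
    """
    top = max(q_values)               # subtract the maximum for numerical stability
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]
```

In a model of this kind, free parameters such as `beta` would be fitted to each participant's choices; a low fitted value in a given context would indicate that behaviour there follows the planned values only weakly.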
The first study used a 15-step task in which participants had to make a series of decisions to achieve one or multiple goals. In this task, the computational costs of accurate forward planning increased exponentially with the length of the planning horizon. We therefore hypothesized that participants identify ‘distance from goal’ as the relevant contextual feature to guide their regulation of forward planning. As expected, we found that participants predominantly relied on a simple heuristic when still far from the goal but progressively switched towards forward planning as the goal approached.
In the second study, participants had to sustainably invest a limited but replenishable energy resource, which was needed to accept offers, in order to accumulate a maximum number of points in the long run. The demand for planning varied across the different situations of the task, but due to the large number of possible situations (n = 448) it would be difficult for the participants to develop an expectation for each individual situation of how beneficial planning would be. We therefore hypothesized that, to regulate their forward planning, participants used a compressed task representation, clustering together states with similar demands for planning. Consistent with this, reaction times (operationalising planning duration) increased with trial-by-trial value conflict (operationalising approximate planning demand), and this increase was more pronounced in a context with generally high demand for planning. We further found that fMRI activity in the dorsal anterior cingulate cortex (dACC) increased with conflict, and that this increase was likewise more pronounced in a context with generally high demand for planning. Taken together, the results suggest that the dACC integrates representations of planning demand on different levels of abstraction to regulate prospective information sampling in an efficient and situation-appropriate way.
This dissertation provides novel insights into the question of how humans adapt their planning to the demands of the current situation. The results are consistent with the view that the regulation of planning is based on an integrated signal of the expected costs and benefits of planning. Furthermore, the results of this dissertation provide evidence that the regulation of planning in environments with real-world complexity critically relies on the brain’s powerful ability to construct and use abstract hierarchical representations.
|
22 |
Oracle-based algorithms for optimizing sophisticated decision criteria in sequential, robust and fair decision problems / Algorithmes à base d'oracles pour optimiser des critères décisionnels sophistiqués pour les problèmes de décision séquentielle, robuste et équitable
Gilbert, Hugo 11 December 2017
Cette thèse s'inscrit dans le cadre de la théorie de la décision algorithmique, qui est une discipline au croisement de la théorie de la décision, la recherche opérationnelle et l'intelligence artificielle. Dans cette thèse, nous étudions l'utilisation de plusieurs modèles décisionnels pour résoudre des problèmes de décision séquentielle dans l'incertain, d'optimisation robuste, et d'optimisation multi-agents équitable. Pour résoudre efficacement ces problèmes, nous utilisons des méthodes de type maître-esclaves, dites à base d'oracles dans la thèse. Ces méthodes permettent de résoudre des problèmes de grande taille en procédant de manière incrémentale. Une attention particulière est portée au modèle de l'espérance d'utilité antisymétrique et bilinéaire, au modèle de l'espérance d'utilité pondérée et à leurs pendants en décision multicritère. L'intérêt de ces modèles est multiple. En effet, ils étendent les modèles standards (e.g., modèle de l'espérance d'utilité) et permettent de représenter un spectre étendu de préférences tout en conservant leurs bonnes propriétés théoriques et algorithmiques. La thèse apporte des réponses sur des aspects théoriques (e.g., résultats de complexité algorithmique) et sur des aspects opérationnels (e.g., conception de méthodes de résolution efficaces) aux problèmes soulevés par l'emploi de ces critères dans les contextes susmentionnés. / This thesis falls within the area of algorithmic decision theory, which is at the crossroads between decision theory, operational research and artificial intelligence. In this thesis, we study several decision models to solve problems in different domains: sequential decision problems under risk, robust optimization problems, and fair multi-agent optimization problems. To solve these problems efficiently, we use master-slave algorithms which solve the problem through an incremental process. 
These procedures, referred to as oracle methods in the thesis, make it possible to solve large problems. Particular attention is given to the skew-symmetric bilinear utility model, the weighted expected utility model and their counterparts in multicriteria decision making. These models are interesting in several respects. They extend the standard models (e.g., the expected utility model) and make it possible to represent a broader class of preferences while retaining their good theoretical and algorithmic properties. The thesis addresses both theoretical (e.g., complexity results) and operational (e.g., design of practically efficient solution methods) aspects of the problems raised by the use of these criteria in the aforementioned domains.
|
23 |
[pt] RESOLVENDO ONLINE PACKING IPS SOB A PRESENÇA DE ENTRADAS ADVERSÁRIAS / [en] SOLVING THE ONLINE PACKING IP UNDER SOME ADVERSARIAL INPUTS
DAVID BEYDA 23 January 2023
[pt] Nesse trabalho, estudamos online packing integer programs, cujas colunas são
reveladas uma a uma. Já que algoritmos ótimos foram encontrados para o modelo
RANDOMORDER– onde a ordem na qual as colunas são reveladas para o algoritmo
é aleatória – o foco da área se voltou para modelo menos otimistas. Um desses
modelos é o modelo MIXED, no qual algumas colunas são ordenadas de forma
adversária, enquanto outras chegam em ordem aleatória. Pouquíssimos resultados
são conhecidos para online packing IPs no modelo MIXED, que é o objeto do nosso
estudo. Consideramos problemas de online packing com d dimensões de ocupação
(d restrições de empacotamento), cada uma com capacidade B. Assumimos que
todas as recompensas e ocupações dos itens estão no intervalo [0, 1]. O objetivo do
estudo é projetar um algoritmo no qual a presença de alguns itens adversários tenha
um efeito limitado na competitividade do algoritmo relativa às colunas de ordem
aleatória. Portanto, usamos como benchmark OPTStoch, que é o valor da solução
ótima offline que considera apenas a parte aleatória da instância. Apresentamos um
algoritmo que obtém recompensas de pelo menos (1 − 5λ − O(ε)) OPTStoch com
alta probabilidade, onde λ é a fração de colunas em ordem adversária.
Para conseguir tal garantia, projetamos um algoritmo primal-dual onde as
decisões são tomadas pelo algoritmo pela avaliação da recompensa e ocupação
de cada item, de acordo com as variáveis duais do programa inteiro. Entretanto,
diferentemente dos algoritmos primais-duais para o modelo RANDOMORDER, não
podemos estimar as variáveis duais pela resolução de um problema reduzido. A
causa disso é que, no modelo MIXED, um adversário pode facilmente manipular
algumas colunas, para atrapalhar nossa estimação. Para contornar isso, propomos o
uso de tecnicas conhecidas de online learning para aprender as variáveis duais do
problema de forma online, conforme o problema progride. / [en] We study online packing integer programs, where the columns arrive one
by one. Since optimal algorithms were found for the RANDOMORDER model –
where columns arrive in random order – much focus of the area has been on less
optimistic models. One of those models is the MIXED model, where some columns
are adversarially ordered, while others come in random order. Very few results are
known for packing IPs in the MIXED model, which is the object of our study.
We consider online IPs with d occupation dimensions (d packing constraints),
each one with capacity (or right-hand side) B. We also assume all item rewards
and occupations to be at most 1. Our goal is to design an algorithm
where the presence of adversarial columns has a limited effect on the algorithm's
competitiveness relative to the random-order columns. Thus, we use OPTStoch – the
offline optimal solution considering only the random-order part of the input – as a
benchmark. We present an algorithm that, relative to OPTStoch, is (1 − 5λ − O(ε))-competitive with high probability, where λ is the fraction of adversarial columns.
In order to achieve such a guarantee, we make use of a primal-dual algorithm
where the decision variables are set by evaluating each item's reward and occupation
according to the dual variables of the IP, as other algorithms for the RANDOMORDER
model do. However, we cannot hope to estimate those dual variables by
solving a scaled version of the problem, because they could easily be manipulated by
an adversary in the MIXED model. Our solution was to use online learning techniques
to learn all aspects of the dual variables in an online fashion, as the problem
progresses.
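The general idea of pricing items with dual variables that are learned online can be sketched as follows. This is not the algorithm from the thesis: the acceptance rule, the projected gradient update, the step size and all names (`online_packing`, `prices`, `step`) are assumptions chosen for illustration, and the sketch assumes the total number of items is known in advance.

```python
def online_packing(items, capacity, step=0.1):
    """Accept or reject items one by one using learned dual prices.

    items:    list of (reward, occupations) pairs, all values in [0, 1];
              occupations has one entry per packing constraint.
    capacity: budget B shared by every packing constraint.
    Returns (accepted item indices, capacity used per constraint).
    """
    n = len(items)                    # assumed known up front
    d = len(items[0][1])
    prices = [0.0] * d                # one dual price per constraint
    used = [0.0] * d
    accepted = []
    for i, (reward, occ) in enumerate(items):
        # Price the item: what its occupation costs at the current duals.
        cost = sum(p * a for p, a in zip(prices, occ))
        fits = all(u + a <= capacity for u, a in zip(used, occ))
        take = fits and reward > cost
        if take:
            accepted.append(i)
            used = [u + a for u, a in zip(used, occ)]
        # Projected gradient step on the duals: a price rises when the
        # item consumes more than the per-item budget B/n, falls otherwise.
        prices = [max(0.0, p + step * ((a if take else 0.0) - capacity / n))
                  for p, a in zip(prices, occ)]
    return accepted, used
```

Because the prices are updated from the decisions actually made rather than estimated once from an early sample of columns, a few adversarially placed columns can only shift them gradually, which is the intuition behind learning the duals online.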
|
24 |
Neurobiologically-inspired models: exploring behaviour prediction, learning algorithms, and reinforcement learning
Spinney, Sean 11 1900
Le développement du domaine de l’apprentissage profond doit une grande part de son avancée
aux idées inspirées par la neuroscience et aux études sur l’apprentissage humain. De la
découverte de l’algorithme de rétropropagation à la conception d’architectures neuronales
comme les Convolutional Neural Networks, ces idées ont été couplées à l’ingénierie et aux
améliorations technologiques pour engendrer des algorithmes performants en utilisation
aujourd’hui. Cette thèse se compose de trois articles, chacun éclairant des aspects distincts
du thème central de ce domaine interdisciplinaire. Le premier article explore la modélisation
prédictive avec des données d’imagerie du cerveau de haute dimension en utilisant une nouvelle
approche de régularisation hybride. Dans de nombreuses applications pratiques (comme
l’imagerie médicale), l’attention se porte non seulement sur la précision, mais également
sur l’interprétabilité d’un modèle prédictif formé sur des données haute dimension. Cette
étude s’attache à combiner la régularisation l1 et l2, qui régularisent la norme des gradients,
avec l’approche récemment proposée pour la modélisation prédictive robuste, l’Invariant
Learning Consistency, qui impose l’alignement entre les gradients de la même classe lors
de l’entraînement. Nous examinons ici la capacité de cette approche combinée à identifier
des prédicteurs robustes et épars, et nous présentons des résultats prometteurs sur plusieurs
ensembles de données. Cette approche tend à améliorer la robustesse des modèles épars dans
presque tous les cas, bien que les résultats varient en fonction des conditions. Le deuxième
article se penche sur les algorithmes d’apprentissage inspirés de la biologie, en se concentrant
particulièrement sur la méthode Difference Target Propagation (DTP) tout en l’intégrant à
l’optimisation Gauss-Newton. Le développement de tels algorithmes biologiquement plausibles
possède une grande importance pour comprendre les processus d’apprentissage neuronale,
cependant leur extensibilité pratique à des tâches réelles est souvent limitée, ce qui entrave
leur potentiel explicatif pour l’apprentissage cérébral réel. Ainsi, l’exploration d’algorithmes
d’apprentissage qui offrent des fondements théoriques solides et peuvent rivaliser avec la
rétropropagation dans des tâches complexes gagne en importance. La méthode Difference
Target Propagation (DTP) se présente comme une candidate prometteuse, caractérisée par
son étroite relation avec les principes de l’optimisation Gauss-Newton. Néanmoins, la rigueur
de cette relation impose des limites, notamment en ce qui concerne la formation couche par
couche des poids synaptiques du chemin de rétroaction, une configuration considérée comme
plus biologiquement plausible. De plus, l’alignement entre les mises à jour des poids DTP
et les gradients de perte est conditionnel et dépend des scénarios d’architecture spécifiques.
Cet article relève ces défis en introduisant un schéma innovant d’entraînement des poids
de rétroaction. Ce schéma harmonise la DTP avec la BP, rétablissant la viabilité de la
formation des poids de rétroaction couche par couche sans compromettre l’intégrité théorique.
La validation empirique souligne l’efficacité de ce schéma, aboutissant à des performances
exceptionnelles de la DTP sur CIFAR-10 et ImageNet 32×32. Enfin, le troisième article
explore la planification efficace dans la prise de décision séquentielle en intégrant le calcul
adaptatif à des architectures d’apprentissage profond existantes, dans le but de résoudre des
casse-tête complexes. L’étude introduit des principes de calcul adaptatif inspirés des processus
cognitifs humains, ainsi que des avancées récentes dans le domaine du calcul adaptatif. En
explorant en profondeur les comportements émergents du modèle de mémoire adaptatif
entraîné, nous identifions plusieurs comportements reconnaissables similaires aux processus
cognitifs humains. Ce travail élargit la discussion sur le calcul adaptatif au-delà des gains
évidents en efficacité, en explorant les comportements émergents en raison des contraintes
variables généralement attribuées aux processus de la prise de décision chez les humains. / The development of the field of deep learning has benefited greatly from biologically inspired
insights from neuroscience and the study of human learning more generally, from the discovery
of backpropagation to neural architectures such as the Convolutional Neural Network. Coupled
with engineering and technological improvements, the distillation of good strategies and
algorithms for learning inspired from biological observation is at the heart of these advances.
Although it would be difficult to enumerate all useful biases that can be learned by observing
humans, they can serve as a blueprint for intelligent systems. The following thesis is composed
of three research articles, each shedding light on distinct facets of the overarching theme. The
first article delves into the realm of predictive modeling on high-dimensional fMRI data, a
landscape where not only accuracy but also interpretability are crucial. Employing a hybrid
approach blending l1 and l2 regularization with Invariant Learning Consistency, this study
unveils the potential of identifying robust, sparse predictors capable of transmuting noise-laden datasets into coherent observations useful for pushing the field forward. Conversely,
the second article delves into the domain of biologically-plausible learning algorithms, a
pivotal endeavor in the comprehension of neural learning processes. In this context, the
investigation centers upon Difference Target Propagation (DTP), a prospective framework
closely related to Gauss-Newton optimization principles. This exploration delves into the
intricate interplay between DTP and the tenets of biologically-inspired learning mechanisms,
revealing an innovative schema for training feedback weights. This schema reinstates the
feasibility of layer-wise feedback weight training within the DTP framework, while concurrently
upholding its theoretical integrity. Lastly, the third article explores the role of memory in
sequential decision-making, and proposes a model with adaptive memory. This domain entails
navigating complex decision sequences within discrete state spaces, where the pursuit of
efficiency encounters difficult scenarios such as the risk of critical irreversibility. The study
introduces adaptive computation principles inspired by human cognitive processes, as well
as recent advances in adaptive computing. By studying in depth the emergent behaviours
exhibited by the trained adaptive memory model, we identify several recognizable behaviours
akin to human cognitive processes. This work expands the discussion of adaptive computing beyond the obvious gains in efficiency to the behaviours that emerge from varying constraints
usually attributed to dynamic response times in humans.
|
25 |
Ant Colony Optimization and its Application to Adaptive Routing in Telecommunication Networks
Di Caro, Gianni 10 November 2004
In ant societies, and, more generally, in insect societies, the activities of the individuals, as well as of the society as a whole, are not regulated by any explicit form of centralized control. Nevertheless, adaptive and robust behaviors transcending the behavioral repertoire of the single individual can be easily observed at the society level. These complex global behaviors are the result of self-organizing dynamics driven by local interactions and communications among a number of relatively simple individuals.
The simultaneous presence of these and other fascinating and unique characteristics has made ant societies an attractive and inspiring model for building new algorithms and new multi-agent systems. In the last decade, ant societies have been taken as a reference for an ever-growing body of scientific work, mostly in the fields of robotics, operations research, and telecommunications.
Among the different works inspired by ant colonies, the Ant Colony Optimization metaheuristic (ACO) is probably the most successful and popular one. The ACO metaheuristic is a multi-agent framework for combinatorial optimization whose main components are: a set of ant-like agents, the use of memory and of stochastic decisions, and strategies of collective and distributed learning.
It finds its roots in the experimental observation of a specific foraging behavior of some ant colonies that, under appropriate conditions, are able to select the shortest path among a few possible paths connecting their nest to a food site. The pheromone, a volatile chemical substance laid on the ground by the ants while walking, which in turn affects their movement decisions according to its local intensity, is the mediator of this behavior.
All the elements playing an essential role in the ant colony foraging behavior were understood, thoroughly reverse-engineered and put to work to solve combinatorial optimization problems by Marco Dorigo and his co-workers at the beginning of the 1990s.
Since then, there has been a flourishing of new combinatorial optimization algorithms modelled on the first algorithms of Dorigo et al., and of related scientific events.
In 1999 the ACO metaheuristic was defined by Dorigo, Di Caro and Gambardella with the purpose of providing a common framework for describing and analyzing all these algorithms inspired by the same ant colony behavior and by the same common process of reverse-engineering of this behavior. Therefore, the ACO metaheuristic was defined a posteriori, as the result of a synthesis effort based on the study of the characteristics of all these ant-inspired algorithms and on the abstraction of their common traits.
The synthesis of ACO was also motivated by the usually good performance shown by the algorithms (e.g., for several important combinatorial problems such as quadratic assignment, vehicle routing and job shop scheduling, ACO implementations have outperformed state-of-the-art algorithms).
The definition and study of the ACO metaheuristic is one of the two fundamental goals of the thesis. The other, closely related to the first, consists of the design, implementation, and testing of ACO instances for problems of adaptive routing in telecommunication networks.
This thesis is an in-depth journey through the ACO metaheuristic, during which we have (re)defined ACO and tried to get a clear understanding of its potential, limits, and relationships with other frameworks and with its biological background. The thesis takes into account all the developments that have followed the original 1999 definition, and provides a formal and comprehensive systematization of the subject, as well as an up-to-date and quite comprehensive review of current applications. We have also identified dynamic problems in telecommunication networks as the most appropriate domain of application for the ACO ideas. Accordingly, in the most applied part of the thesis we have focused on problems of adaptive routing in networks and we have developed and tested four new algorithms.
Adopting an original point of view with respect to the way ACO was first defined (but maintaining full conceptual and terminological consistency), ACO is here defined and mainly discussed in terms of sequential decision processes and Monte Carlo sampling and learning.
More precisely, ACO is characterized as a policy search strategy aimed at learning the distributed parameters (called pheromone variables in accordance with the biological metaphor) of the stochastic decision policy which is used by so-called ant agents to generate solutions. Each ant represents in practice an independent sequential decision process aimed at constructing a possibly feasible solution for the optimization problem at hand by using only information local to the decision step.
Ants are repeatedly and concurrently generated in order to sample the solution set according to the current policy. The outcomes of the generated solutions are used to partially evaluate the current policy, spot the most promising search areas, and update the policy parameters in order to possibly focus the search in those promising areas while keeping a satisfactory level of overall exploration.
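The sample, evaluate and update loop just described can be illustrated on a toy shortest-path problem. The sketch below is a generic, simplified ant-colony loop written for this listing; the function name, the evaporation rate, and the deposit rule of 1/length are common textbook choices assumed here and are not claimed to match the algorithms defined in the thesis.

```python
import random

def aco_shortest_path(graph, source, target, n_iters=50, n_ants=10,
                      evaporation=0.5, seed=0):
    """graph: dict mapping node -> {neighbour: positive edge length}."""
    rng = random.Random(seed)
    # Pheromone variables: the parameters of the stochastic decision policy.
    pheromone = {(u, v): 1.0 for u in graph for v in graph[u]}
    best_path, best_len = None, float("inf")
    for _ in range(n_iters):
        solutions = []
        for _ in range(n_ants):
            # Each ant is an independent sequential decision process that
            # uses only information local to the current node.
            path, length = [source], 0.0
            while path[-1] != target:
                u = path[-1]
                options = [v for v in graph[u] if v not in path]
                if not options:       # dead end: discard this ant
                    path = None
                    break
                weights = [pheromone[(u, v)] for v in options]
                v = rng.choices(options, weights=weights)[0]
                length += graph[u][v]
                path.append(v)
            if path is not None:
                solutions.append((path, length))
        # Evaporation keeps some exploration alive ...
        for edge in pheromone:
            pheromone[edge] *= 1.0 - evaporation
        # ... and reinforcement focuses the search on good solutions.
        for path, length in solutions:
            for u, v in zip(path, path[1:]):
                pheromone[(u, v)] += 1.0 / length
            if length < best_len:
                best_path, best_len = path, length
    return best_path, best_len
```

Shorter paths receive larger deposits per ant, so the edges along them are sampled more often in later iterations, while evaporation prevents early choices from becoming permanent.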
This way of looking at ACO has helped to reveal the close relationships between ACO and other well-known frameworks, like dynamic programming, Markov and non-Markov decision processes, and reinforcement learning. In turn, this has favored reasoning on the general properties of ACO in terms of the amount of complete state information that is used by the ACO ants to take optimized decisions and to encode in pheromone variables a memory of both the decisions that belonged to the sampled solutions and their quality.
The biological context that inspired ACO is fully acknowledged in the thesis. We discuss extensively the shortest-path behaviors of ant colonies and the identification and analysis of the few nonlinear dynamics that are at the very core of self-organized behaviors in both ants and other societal organizations. We discuss these dynamics in the general framework of stigmergic modeling, based on asynchronous environment-mediated communication protocols, and (pheromone) variables priming coordinated responses of a number of "cheap" and concurrent agents.
The second half of the thesis is devoted to the study of the application of ACO to problems of online routing in telecommunication networks. This class of problems has been identified in the thesis as the most appropriate for the application of the multi-agent, distributed, and adaptive nature of the ACO architecture.
Four novel ACO algorithms for problems of adaptive routing in telecommunication networks are thoroughly described. The four algorithms cover a wide spectrum of possible types of network: two of them deliver best-effort traffic in wired IP networks, one is intended for quality-of-service (QoS) traffic in ATM networks, and the fourth is for best-effort traffic in mobile ad hoc networks.
The two algorithms for wired IP networks have been extensively tested by simulation studies and compared to state-of-the-art algorithms for a wide set of reference scenarios. The algorithm for mobile ad hoc networks is still under development, but quite extensive results and comparisons with a popular state-of-the-art algorithm are reported. No results are reported for the QoS algorithm, which has not been fully tested. The observed experimental performance is excellent, especially for the case of wired IP networks: our algorithms always perform comparably to or much better than the state-of-the-art competitors.
In the thesis we try to understand the rationale behind the brilliant performance obtained and the good level of popularity reached by our algorithms. More generally, we discuss the reasons for the general efficacy of the ACO approach for network routing problems compared to the characteristics of more classical approaches. Moving further, we also informally define Ant Colony Routing (ACR), a multi-agent framework explicitly integrating learning components into the ACO design in order to define a general and in a sense futuristic architecture for autonomic network control.
Most of the material of the thesis comes from a re-elaboration of material co-authored and published in a number of books, journal papers, conference proceedings, and technical reports. The detailed list of references is provided in the Introduction.
|
26 |
Contributions to Simulation-based High-dimensional Sequential Decision Making / Contributions sur la prise de décision séquentielle basée sur des simulations dans des environnements complexes de grande dimension
Hoock, Jean-Baptiste 10 April 2013
Ma thèse s'intitule « Contributions sur la prise de décision séquentielle basée sur des simulations dans des environnements complexes de grande dimension ». Le cadre de la thèse s'articule autour du jeu, de la planification et des processus de décision markovien. Un agent interagit avec son environnement en prenant successivement des décisions. L'agent part d'un état initial jusqu'à un état final dans lequel il ne peut plus prendre de décision. A chaque pas de temps, l'agent reçoit une observation de l'état de l'environnement. A partir de cette observation et de ses connaissances, il prend une décision qui modifie l'état de l'environnement. L'agent reçoit en conséquence une récompense et une nouvelle observation. Le but est de maximiser la somme des récompenses obtenues lors d'une simulation qui part d'un état initial jusqu'à un état final. La politique de l'agent est la fonction qui, à partir de l'historique des observations, retourne une décision. Nous travaillons dans un contexte où (i) le nombre d'états est immense, (ii) les récompenses apportent peu d'information, (iii) la probabilité d'atteindre rapidement un bon état final est faible et (iv) les connaissances a priori de l'environnement sont soit inexistantes soit difficilement exploitables. Les 2 applications présentées dans cette thèse répondent à ces contraintes : le jeu de Go et le simulateur 3D du projet européen MASH (Massive Sets of Heuristics). Afin de prendre une décision satisfaisante dans ce contexte, plusieurs solutions sont apportées :1. simuler en utilisant le compromis exploration/exploitation (MCTS)2. réduire la complexité du problème par des recherches locales (GoldenEye)3. construire une politique qui s'auto-améliore (RBGP)4. apprendre des connaissances a priori (CluVo+GMCTS) L'algorithme Monte-Carlo Tree Search (MCTS) est un algorithme qui a révolutionné le jeu de Go. 
A partir d'un modèle de l'environnement, MCTS construit itérativement un arbre des possibles de façon asymétrique en faisant des simulations de Monte-Carlo et dont le point de départ est l'observation courante de l'agent. L'agent alterne entre l'exploration du modèle en prenant de nouvelles décisions et l'exploitation des décisions qui obtiennent statistiquement une bonne récompense cumulée. Nous discutons de 2 moyens pour améliorer MCTS : la parallélisation et l'ajout de connaissances a priori. La parallélisation ne résout pas certaines faiblesses de MCTS ; notamment certains problèmes locaux restent des verrous. Nous proposons un algorithme (GoldenEye) qui se découpe en 2 parties : détection d'un problème local et ensuite sa résolution. L'algorithme de résolution réutilise des principes de MCTS et fait ses preuves sur une base classique de problèmes difficiles. L'ajout de connaissances à la main est laborieuse et ennuyeuse. Nous proposons une méthode appelée Racing-based Genetic Programming (RBGP) pour ajouter automatiquement de la connaissance. Le point fort de cet algorithme est qu'il valide rigoureusement l'ajout d'une connaissance a priori et il peut être utilisé non pas pour optimiser un algorithme mais pour construire une politique. Dans certaines applications telles que MASH, les simulations sont coûteuses en temps et il n'y a ni connaissance a priori ni modèle de l'environnement; l'algorithme Monte-Carlo Tree Search est donc inapplicable. Pour rendre MCTS applicable dans MASH, nous proposons une méthode pour apprendre des connaissances a priori (CluVo). Nous utilisons ensuite ces connaissances pour améliorer la rapidité de l'apprentissage de l'agent et aussi pour construire un modèle. A partir de ce modèle, nous utilisons une version adaptée de Monte-Carlo Tree Search (GMCTS). Cette méthode résout de difficiles problématiques MASH et donne de bons résultats dans une application dont le but est d'améliorer un tirage de lettres. 
/ My thesis is entitled "Contributions to Simulation-based High-dimensional Sequential Decision Making". The context of the thesis is games, planning and Markov Decision Processes. An agent interacts with its environment by successively making decisions. The agent proceeds from an initial state to a final state in which it can no longer make any decision. At each timestep, the agent receives an observation of the state of the environment. From this observation and its knowledge, the agent makes a decision which modifies the state of the environment. Then, the agent receives a reward and a new observation. The goal is to maximize the sum of rewards obtained during a simulation from an initial state to a final state. The policy of the agent is the function which, from the history of observations, returns a decision. We work in a context where (i) the number of states is huge, (ii) rewards carry little information, (iii) the probability of quickly reaching a good final state is low and (iv) prior knowledge is either nonexistent or hard to exploit. Both applications described in this thesis present these constraints: the game of Go and a 3D simulator of the European project MASH (Massive Sets of Heuristics). In order to make a satisfactory decision in this context, several solutions are proposed:
1. Simulating with the exploration/exploitation compromise (MCTS)
2. Reducing the complexity by local solving (GoldenEye)
3. Building a policy which improves itself (RBGP)
4. Learning prior knowledge (CluVo+GMCTS)
Monte-Carlo Tree Search (MCTS) is the state of the art for the game of Go. From a model of the environment, MCTS incrementally and asymmetrically builds a tree of possible futures by performing Monte-Carlo simulations. The tree starts from the current observation of the agent. The agent alternates between the exploration of the model and the exploitation of decisions which statistically give a good cumulative reward.
We discuss two ways of improving MCTS: parallelization and the addition of prior knowledge. Parallelization does not address some weaknesses of MCTS; in particular, some local problems remain obstacles. We propose an algorithm (GoldenEye) composed of two parts: detection of a local problem, followed by its resolution. The resolution algorithm reuses some concepts of MCTS and solves difficult problems from a classical problem database. Adding prior knowledge by hand is laborious and tedious. We propose a method called Racing-based Genetic Programming (RBGP) to add prior knowledge automatically. Its strong point is that it rigorously validates each addition of prior knowledge, and it can be used to build a policy (instead of only optimizing an algorithm). In some applications such as MASH, simulations are expensive in time and there is neither prior knowledge nor a model of the environment; Monte-Carlo Tree Search therefore cannot be used. To make MCTS usable in this context, we propose a method for learning prior knowledge (CluVo). We then use this prior knowledge to speed up the agent's learning and also to build a model. From this model, we use an adapted version of Monte-Carlo Tree Search (GMCTS). This method solves difficult MASH problems and gives good results in an application to a word game.
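The alternation between exploration and exploitation described in this abstract can be illustrated with the UCB1 rule, a common selection rule in MCTS implementations (the UCT variant). The abstract does not say which rule the thesis uses, so this is only a sketch under that assumption; the function name `ucb1_select` and the default exploration constant `c` are hypothetical.

```python
import math


def ucb1_select(visits, total_rewards, c=math.sqrt(2)):
    """Return the index of the child decision to simulate next.

    visits[i]        -- number of simulations that went through child i
    total_rewards[i] -- sum of the cumulative rewards of those simulations
    c                -- exploration constant (hypothetical default)
    """
    # An unvisited child is always tried first (pure exploration).
    for i, n in enumerate(visits):
        if n == 0:
            return i

    parent_visits = sum(visits)

    def score(i):
        mean = total_rewards[i] / visits[i]  # exploitation term
        bonus = c * math.sqrt(math.log(parent_visits) / visits[i])  # exploration term
        return mean + bonus

    return max(range(len(visits)), key=score)
```

With equal visit counts the rule picks the child with the better average reward; a rarely visited child receives a large bonus and is revisited even when its average is lower.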
|
27 |
Des algorithmes presque optimaux pour les problèmes de décision séquentielle à des fins de collecte d'information / Near-Optimal Algorithms for Sequential Information-Gathering Decision Problems
Araya-López, Mauricio 04 February 2013
Cette thèse s'intéresse à des problèmes de prise de décision séquentielle dans lesquels l'acquisition d'information est une fin en soi. Plus précisément, elle cherche d'abord à savoir comment modifier le formalisme des POMDP pour exprimer des problèmes de collecte d'information et à proposer des algorithmes pour résoudre ces problèmes. Cette approche est alors étendue à des tâches d'apprentissage par renforcement consistant à apprendre activement le modèle d'un système. De plus, cette thèse propose un nouvel algorithme d'apprentissage par renforcement bayésien, lequel utilise des transitions locales optimistes pour recueillir des informations de manière efficace tout en optimisant la performance escomptée. Grâce à une analyse de l'existant, des résultats théoriques et des études empiriques, cette thèse démontre que ces problèmes peuvent être résolus de façon optimale en théorie, que les méthodes proposées sont presque optimales, et que ces méthodes donnent des résultats comparables ou meilleurs que des approches de référence. Au-delà de ces résultats concrets, cette thèse ouvre la voie (1) à une meilleure compréhension de la relation entre la collecte d'informations et les politiques optimales dans les processus de prise de décision séquentielle, et (2) à une extension des très nombreux travaux traitant du contrôle de l'état d'un système à des problèmes de collecte d'informations / The purpose of this dissertation is to study sequential decision problems where acquiring information is an end in itself. More precisely, it first covers the question of how to modify the POMDP formalism to model information-gathering problems and which algorithms to use for solving them. This idea is then extended to reinforcement learning problems where the objective is to actively learn the model of the system. 
This dissertation also proposes a novel Bayesian reinforcement learning algorithm that uses optimistic local transitions to gather information efficiently while optimizing the expected return. Through a review of the existing literature, theoretical results and empirical studies, it is shown that these information-gathering problems are optimally solvable in theory, that the proposed methods are near-optimal, and that these methods offer comparable or better results than reference approaches. Beyond these specific results, this dissertation paves the way (1) for a better understanding of the relationship between information gathering and optimal policies in sequential decision processes, and (2) for extending the large body of work on system state control to information-gathering problems.
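One way to picture an information-gathering objective is a reward that depends on the agent's belief rather than on the hidden state. The sketch below assumes a static hidden state, a discrete belief, and negative Shannon entropy as the reward; the abstract does not specify the reward used in the dissertation, and the names `belief_update` and `negative_entropy` are hypothetical.

```python
import math


def belief_update(belief, likelihood):
    """Bayes update of a discrete belief.

    belief[s]     -- prior probability of hidden state s
    likelihood[s] -- P(observation | s) for the observation just received
    """
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


def negative_entropy(belief):
    """Belief-dependent reward: 0 for a certain belief, lower when uncertain."""
    return sum(p * math.log(p) for p in belief if p > 0)


prior = [0.5, 0.5]
posterior = belief_update(prior, [0.9, 0.1])
# An informative observation sharpens the belief, so the reward increases.
gain = negative_entropy(posterior) - negative_entropy(prior)
```

Under this choice, an action is valuable when the observation it produces is expected to reduce uncertainty, which is the sense in which acquiring information becomes an end in itself.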
|
28 |
Monte Carlo Tree Search for Continuous and Stochastic Sequential Decision Making Problems / Monte Carlo Tree Search pour les problèmes de décision séquentielle en milieu continus et stochastiques
Couetoux, Adrien 30 September 2013
Dans cette thèse, nous avons étudié les problèmes de décisions séquentielles, avec comme application la gestion de stocks d'énergie. Traditionnellement, ces problèmes sont résolus par programmation dynamique stochastique. Mais la grande dimension, et la non convexité du problème, amènent à faire des simplifications sur le modèle pour pouvoir faire fonctionner ces méthodes. Nous avons donc étudié une méthode alternative, qui ne requiert pas de simplifications du modèle: Monte Carlo Tree Search (MCTS). Nous avons commencé par étendre le MCTS classique (qui s’applique aux domaines finis et déterministes) aux domaines continus et stochastiques. Pour cela, nous avons utilisé la méthode de Double Progressive Widening (DPW), qui permet de gérer le ratio entre largeur et profondeur de l’arbre, à l’aide de deux méta paramètres. Nous avons aussi proposé une heuristique nommée Blind Value (BV) pour améliorer la recherche de nouvelles actions, en utilisant l’information donnée par les simulations passées. D’autre part, nous avons étendu l’heuristique RAVE aux domaines continus. Enfin, nous avons proposé deux nouvelles méthodes pour faire remonter l’information dans l’arbre, qui ont beaucoup amélioré la vitesse de convergence sur deux cas tests. Une part importante de notre travail a été de proposer une façon de mêler MCTS avec des heuristiques rapides pré-existantes. C’est une idée particulièrement intéressante dans le cas de la gestion d’énergie, car ces problèmes sont pour le moment résolus de manière approchée. Nous avons montré comment utiliser Direct Policy Search (DPS) pour rechercher une politique par défaut efficace, qui est ensuite utilisée à l’intérieur de MCTS. Les résultats expérimentaux sont très encourageants. Nous avons aussi appliqué MCTS à des processus markoviens partiellement observables (POMDP), avec comme exemple le jeu de démineur. 
Dans ce cas, les algorithmes actuels ne sont pas optimaux, et notre approche l’est, en transformant le POMDP en MDP, par un changement de vecteur d’état. Enfin, nous avons utilisé MCTS dans un cadre de méta-bandit, pour résoudre des problèmes d’investissement. Le choix d’investissement est fait par des algorithmes de bandits à bras multiples, tandis que l’évaluation de chaque bras est faite par MCTS. Une des conclusions importantes de ces travaux est que MCTS en continu a besoin de très peu d’hypothèses (uniquement un modèle génératif du problème), converge vers l’optimum, et peut facilement améliorer des méthodes suboptimales existantes. / In this thesis, we study sequential decision making problems, with a focus on the unit commitment problem. Traditionally solved by dynamic programming methods, this problem remains a challenge because of its high dimension and the sacrifices in model accuracy required to apply state-of-the-art methods. We investigate the applicability of Monte Carlo Tree Search methods to this problem and to other single-player, stochastic and continuous sequential decision making problems. We started by extending the traditional finite-state MCTS to continuous domains, with a method called Double Progressive Widening (DPW). This method relies on two hyperparameters and determines the ratio between width and depth in the nodes of the tree. We developed a heuristic called Blind Value (BV) to improve the exploration of new actions, using the information from past simulations. We also extended the RAVE heuristic to continuous domains. Finally, we proposed two new ways of backing up information through the tree, which considerably improved the convergence speed on two test cases. An important part of our work was to propose a way to combine MCTS with existing powerful heuristics, with the application to energy management in mind. 
We did so by proposing a framework that makes it possible to learn a good default policy by Direct Policy Search (DPS) and to include it in MCTS. The experimental results are very positive. To extend the reach of MCTS, we showed how it could be used to solve Partially Observable Markov Decision Processes, with an application to the game of Minesweeper, for which no consistent method had been proposed before. Finally, we used MCTS in a meta-bandit framework to solve energy investment problems: the investment decision was handled by classical bandit algorithms, while the evaluation of each investment was done by MCTS. The most important takeaway is that continuous MCTS requires almost no assumptions (besides the need for a generative model), is consistent, and can easily improve existing suboptimal solvers by using a method similar to what we proposed with DPS.
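The idea behind progressive widening, which controls the width of the tree as visits accumulate, can be sketched with a single test. The criterion below (allow a new child while the child count is below `ceil(c * n**alpha)`) is a commonly used form and is an assumption here; the abstract does not give the exact rule, and the name `should_widen` and the default values of `c` and `alpha` are hypothetical.

```python
import math


def should_widen(num_children, num_visits, c=1.0, alpha=0.5):
    """Return True if a node visited num_visits times may receive a new child.

    c, alpha -- widening hyperparameters (hypothetical defaults); a larger
                alpha gives wider, shallower trees.
    """
    return num_children < math.ceil(c * num_visits ** alpha)


def children_after(num_visits, c=1.0, alpha=0.5):
    """Count the children a node ends up with if it widens whenever allowed."""
    children = 0
    for n in range(1, num_visits + 1):
        if should_widen(children, n, c, alpha):
            children += 1
    return children
```

In a "double" scheme the same test would be applied twice: at decision nodes, to decide whether to sample a new action, and at chance nodes, to decide whether to sample a new random outcome or reuse an existing one.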
|
29 |
Regularization in reinforcement learning
Farahmand, Amir-massoud Unknown Date
No description available.
|
30 |
Multi-objective sequential decision making
Wang, Weijia 11 July 2014
This thesis is concerned with multi-objective sequential decision making (MOSDM). The motivation is twofold. On the one hand, many decision problems in domains such as robotics, scheduling or games involve the optimization of sequences of decisions. On the other hand, many real-world applications are most naturally formulated in terms of multi-objective optimization (MOO). The proposed approach extends the well-known Monte-Carlo tree search (MCTS) framework to the MOO setting, with the goal of discovering several optimal sequences of decisions by growing a single search tree. The main challenge is to define a new reward able to guide the exploration of the tree even though the MOO setting does not enforce a total order among solutions. The main contribution of the thesis is to propose and experimentally study two such rewards, inspired by the MOO literature, which assess a solution with respect to the archive of previous solutions (the Pareto archive): the hypervolume indicator and the Pareto dominance reward. The study shows that these two criteria are complementary. The hypervolume indicator suffers from its known computational complexity; however, the proposed extension thereof provides fine-grained information about the quality of solutions with respect to the current archive. By contrast, the Pareto dominance reward has linear computational cost, but the information it provides becomes increasingly rare. Proofs of principle are given on artificial problems and challenges, and confirm the merits of the approach. In particular, the resulting algorithm, MOMCTS, is able to discover policies lying in non-convex regions of the Pareto front, in contrast with the state of the art: existing Multi-Objective Reinforcement Learning algorithms are based on linear scalarization and thus fail to sample such non-convex regions. Finally, MOMCTS competes honorably with the state of the art in the 2013 MOPTSP competition.
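The two notions that the rewards build on, Pareto dominance with respect to an archive and the hypervolume of that archive, can be sketched for two objectives as follows. This illustrates the standard definitions only, not the reward functions defined in the thesis; it assumes every objective is maximized and every point is at least as good as the reference point `ref`, and all function names are hypothetical.

```python
def dominates(a, b):
    """True if a Pareto-dominates b (every objective is maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Offer candidate to the archive; return the new list of non-dominated points."""
    if any(p == candidate or dominates(p, candidate) for p in archive):
        return list(archive)  # candidate brings nothing new
    return [p for p in archive if not dominates(candidate, p)] + [candidate]


def hypervolume_2d(archive, ref=(0.0, 0.0)):
    """Area dominated by a two-objective archive, bounded below by ref."""
    area, best_y = 0.0, ref[1]
    # Sweep from the largest first objective to the smallest, adding the
    # horizontal strip that each point contributes above the previous ones.
    for x, y in sorted(archive, reverse=True):
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


archive = update_archive([(3, 1), (1, 3)], (2, 2))
```

The point `(2, 2)` is kept because neither `(3, 1)` nor `(1, 3)` dominates it. The dominance test costs one pass over the archive, whereas exact hypervolume computation becomes expensive as the number of objectives grows.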
|