Global ETD Search

11	Differentiable best response shaping Aghajohari, Milad 07 1900 (has links) Cette thèse est structurée en quatre sections. La première constitue une introduction au problème de la formation d'agents coopératifs non exploitables dans les jeux à somme non nulle. La deuxième section, soit le premier chapitre, fournit le contexte nécessaire pour discuter de l'étendue et des outils mathématiques requis pour explorer ce problème. La troisième section, correspondant au deuxième chapitre, expose un cadre spécifique, nommé Best Response Shaping, que nous avons élaboré pour relever ce défi. La quatrième section contient les conclusions que nous tirons de ce travail et nous y discutons des travaux futurs potentiels. Le chapitre introductif se divise en quatre sections. Dans la première, nous présentons le cadre d'apprentissage par renforcement (Reinforcement Learning) afin de formaliser le problème d'un agent interagissant avec l'environnement pour maximiser une récompense scalaire. Nous introduisons ensuite les Processus Décisionnels de Markov (Markov Decision Processes) en tant qu'outil mathématique pour formaliser le problème d'apprentissage par renforcement. Nous discutons de deux méthodes générales de solution pour résoudre le problème d'apprentissage par renforcement. Les premières sont des méthodes basées sur la valeur qui estiment la récompense cumulée optimale réalisable pour chaque paire action-état, et la politique serait alors apprise. Les secondes sont des méthodes basées sur les politiques où la politique est optimisée directement sans estimer les valeurs. Dans la deuxième section, nous introduisons le cadre d'apprentissage par renforcement multi-agents (Multi-Agent Reinforcement Learning) pour formaliser le problème de plusieurs agents tentant de maximiser une récompense cumulative scalaire dans un environnement partagé. Nous présentons les Jeux Stochastiques comme une extension théorique du processus de décision de Markov pour permettre la présence de plusieurs agents. Nous discutons des trois types de jeux possibles entre agents en fonction de la structure de leur système de récompense. Nous traitons des défis spécifiques à l'apprentissage par renforcement multi-agents. En particulier, nous examinons le défi de l'apprentissage par renforcement profond multi-agents dans des environnements partiellement compétitifs, où les méthodes traditionnelles peinent à promouvoir une coopération non exploitable. Dans la troisième section, nous introduisons le Dilemme du prisonnier itéré (Iterated Prisoner's Dilemma) comme un jeu matriciel simple utilisé comme exemple de jouet pour étudier les dilemmes sociaux. Dans la quatrième section, nous présentons le Coin Game comme un jeu à haute dimension qui doit être résolu grâce à des politiques paramétrées par des réseaux de neurones. Dans le deuxième chapitre, nous introduisons la méthode Forme de la Meilleure Réponse (Best Response Shaping). Des approches existantes, comme celles des agents LOLA et POLA, apprennent des politiques coopératives non exploitables en se différenciant grâce à des étapes d'optimisation prédictives de leur adversaire. Toutefois, ces techniques présentent une limitation majeure car elles sont susceptibles d'être exploitées par une optimisation supplémentaire. En réponse à cela, nous introduisons une nouvelle approche, Forme de la Meilleure Réponse, qui se différencie par le fait qu'un adversaire approxime la meilleure réponse, que nous appelons le "détective". Pour conditionner le détective sur la politique de l'agent dans les jeux complexes, nous proposons un mécanisme de conditionnement différenciable sensible à l'état, facilité par une méthode de questions-réponses (QA) qui extrait une représentation de l'agent basée sur son comportement dans des états d'environnement spécifiques. Pour valider empiriquement notre méthode, nous mettons en évidence sa performance améliorée face à un adversaire utilisant l'Arbre de Recherche Monte Carlo (Monte Carlo Tree Search), qui sert d'approximation de la meilleure réponse dans le Coin Game. / This thesis is organized in four sections.The first is an introduction to the problem of training non-exploitable cooperative agents in general-sum games. The second section, the first chapter, provides the necessary background for discussing the scope and necessary mathematical tools for exploring this problem. The third section, the second chapter, explains a particular framework, Best Response Shaping, that we developed for tackling this challenge. In the fourth section, is the conclusion that we drive from this work and we discuss the possible future works. The background chapter consists of four section. In the first section, we introduce the \emph{Reinforcement Learning } framework for formalizing the problem of an agent interacting with the environment maximizing a scalar reward. We then introduce \emph{Markov Decision Processes} as a mathematical tool to formalize the Reinforcement Learning problem. We discuss two general solution methods for solving the Reinforcement Learning problem. The first are Value-based methods that estimate the optimal achievable accumulative reward in each action-state pair and the policy would be learned. The second are Policy-based methods where the policy is optimized directly without estimating the values. In the second section, we introduce \emph{Multi-Agent Reinforcement Learning} framework for formalizing multiple agents trying to maximize a scalar accumulative reward in a shared environment. We introduce \emph{Stochastic Games} as a theoretical extension of the Markov Decision Process to allow multiple agents. We discuss the three types of possible games between agents based on the setup of their reward structure. We discuss the challenges that are specific to Multi-Agent Reinforcement Learning. In particular, we investigate the challenge of multi-agent deep reinforcement learning in partially competitive environments, where traditional methods struggle to foster non-exploitable cooperation. In the third section, we introduce the \emph{Iterated Prisoner's Dilemma} game as a simple matrix game used as a toy-example for studying social dilemmas. In the Fourth section, we introduce the \emph{Coin Game} as a high-dimensional game that should be solved via policies parameterized by neural networks. In the second chapter, we introduce the Best Response Shaping (BRS) method. The existing approaches like LOLA and POLA agents learn non-exploitable cooperative policies by differentiation through look-ahead optimization steps of their opponent. However, there is a key limitation in these techniques as they are susceptible to exploitation by further optimization. In response, we introduce a novel approach, Best Response Shaping (BRS), which differentiates through an opponent approximating the best response, termed the "detective." To condition the detective on the agent's policy for complex games we propose a state-aware differentiable conditioning mechanism, facilitated by a question answering (QA) method that extracts a representation of the agent based on its behaviour on specific environment states. To empirically validate our method, we showcase its enhanced performance against a Monte Carlo Tree Search (MCTS) opponent, which serves as an approximation to the best response in the Coin Game. This work expands the applicability of multi-agent RL in partially competitive environments and provides a new pathway towards achieving improved social welfare in general sum games. Multi-Agent Reinforcement Learning General-Sum Games Best Response Shaping Jeux à somme générale mise en forme de la meilleure réponse
12	Reinforcement learning applied to the real world : uncertainty, sample efficiency, and multi-agent coordination Mai, Vincent 12 1900 (has links) L'immense potentiel des approches d'apprentissage par renforcement profond (ARP) pour la conception d'agents autonomes a été démontré à plusieurs reprises au cours de la dernière décennie. Son application à des agents physiques, tels que des robots ou des réseaux électriques automatisés, est cependant confrontée à plusieurs défis. Parmi eux, l'inefficacité de leur échantillonnage, combinée au coût et au risque d'acquérir de l'expérience dans le monde réel, peut décourager tout projet d'entraînement d'agents incarnés. Dans cette thèse, je me concentre sur l'application de l'ARP sur des agents physiques. Je propose d'abord un cadre probabiliste pour améliorer l'efficacité de l'échantillonnage dans l'ARP. Dans un premier article, je présente la pondération BIV (batch inverse-variance), une fonction de perte tenant compte de la variance du bruit des étiquettes dans la régression bruitée hétéroscédastique. La pondération BIV est un élément clé du deuxième article, où elle est combinée avec des méthodes de pointe de prédiction de l'incertitude pour les réseaux neuronaux profonds dans un pipeline bayésien pour les algorithmes d'ARP avec différences temporelles. Cette approche, nommée apprentissage par renforcement à variance inverse (IV-RL), conduit à un entraînement nettement plus rapide ainsi qu'à de meilleures performances dans les tâches de contrôle. Dans le troisième article, l'apprentissage par renforcement multi-agent (MARL) est appliqué au problème de la réponse rapide à la demande, une approche prometteuse pour gérer l'introduction de sources d'énergie renouvelables intermittentes dans les réseaux électriques. En contrôlant la coordination de plusieurs climatiseurs, les agents MARL obtiennent des performances nettement supérieures à celles des approches basées sur des règles. Ces résultats soulignent le rôle potentiel que les agents physiques entraînés par MARL pourraient jouer dans la transition énergétique et la lutte contre le réchauffement climatique. / The immense potential of deep reinforcement learning (DRL) approaches to build autonomous agents has been proven repeatedly in the last decade. Its application to embodied agents, such as robots or automated power systems, is however facing several challenges. Among them, their sample inefficiency, combined to the cost and the risk of gathering experience in the real world, can deter any idea of training embodied agents. In this thesis, I focus on the application of DRL on embodied agents. I first propose a probabilistic framework to improve sample efficiency in DRL. In the first article, I present batch inverse-variance (BIV) weighting, a loss function accounting for label noise variance in heteroscedastic noisy regression. BIV is a key element of the second article, where it is combined with state-of-the-art uncertainty prediction methods for deep neural networks in a Bayesian pipeline for temporal differences DRL algorithms. This approach, named inverse-variance reinforcement learning (IV-RL), leads to significantly faster training as well as better performance in control tasks. In the third article, multi-agent reinforcement learning (MARL) is applied to the problem of fast-timescale demand response, a promising approach to the manage the introduction of intermittent renewable energy sources in power-grids. As MARL agents control the coordination of multiple air conditioners, they achieve significantly better performance than rule-based approaches. These results underline to the potential role that DRL trained embodied agents could take in the energetic transition and the fight against global warming. Uncertainty estimation Estimation d'incertitude Multi agent reinforcement learning Apprentissage par renforcement profond Deep reinforcement learning Heteroscedastic regression Régression hétéroscédastique Demand response Régulation de fréquence Réseau électrique Power grid
13	The multilevel critical node problem : theoretical intractability and a curriculum learning approach Nabli, Adel 08 1900 (has links) Évaluer la vulnérabilité des réseaux est un enjeu de plus en plus critique. Dans ce mémoire, nous nous penchons sur une approche étudiant la défense d’infrastructures stratégiques contre des attaques malveillantes au travers de problèmes d'optimisations multiniveaux. Plus particulièrement, nous analysons un jeu séquentiel en trois étapes appelé le « Multilevel Critical Node problem » (MCN). Ce jeu voit deux joueurs s'opposer sur un graphe: un attaquant et un défenseur. Le défenseur commence par empêcher préventivement que certains nœuds soient attaqués durant une phase de vaccination. Ensuite, l’attaquant infecte un sous ensemble des nœuds non vaccinés. Finalement, le défenseur réagit avec une stratégie de protection. Dans ce mémoire, nous fournissons les premiers résultats de complexité pour MCN ainsi que ceux de ses sous-jeux. De plus, en considérant les différents cas de graphes unitaires, pondérés ou orientés, nous clarifions la manière dont la complexité de ces problèmes varie. Nos résultats contribuent à élargir les familles de problèmes connus pour être complets pour les classes NP, $\Sigma_2^p$ et $\Sigma_3^p$. Motivés par l’insolubilité intrinsèque de MCN, nous concevons ensuite une heuristique efficace pour le jeu. Nous nous appuyons sur les approches récentes cherchant à apprendre des heuristiques pour des problèmes d’optimisation combinatoire en utilisant l’apprentissage par renforcement et les réseaux de neurones graphiques. Contrairement aux précédents travaux, nous nous intéressons aux situations dans lesquelles de multiples joueurs prennent des décisions de manière séquentielle. En les inscrivant au sein du formalisme d’apprentissage multiagent, nous concevons un algorithme apprenant à résoudre des problèmes d’optimisation combinatoire multiniveaux budgétés opposant deux joueurs dans un jeu à somme nulle sur un graphe. Notre méthode est basée sur un simple curriculum : si un agent sait estimer la valeur d’une instance du problème ayant un budget au plus B, alors résoudre une instance avec budget B+1 peut être fait en temps polynomial quelque soit la direction d’optimisation en regardant la valeur de tous les prochains états possibles. Ainsi, dans une approche ascendante, nous entraînons notre agent sur des jeux de données d’instances résolues heuristiquement avec des budgets de plus en plus grands. Nous rapportons des résultats quasi optimaux sur des graphes de tailles au plus 100 et un temps de résolution divisé par 185 en moyenne comparé au meilleur solutionneur exact pour le MCN. / Evaluating the vulnerability of networks is a problem which has gain momentum in recent decades. In this work, we focus on a Multilevel Programming approach to study the defense of critical infrastructures against malicious attacks. We analyze a three-stage sequential game played in a graph called the Multilevel Critical Node problem (MCN). This game sees two players competing with each other: a defender and an attacker. The defender starts by preventively interdicting nodes from being attacked during what is called a vaccination phase. Then, the attacker infects a subset of non-vaccinated nodes and, finally, the defender reacts with a protection strategy. We provide the first computational complexity results associated with MCN and its subgames. Moreover, by considering unitary, weighted, undirected and directed graphs, we clarify how the theoretical tractability or intractability of those problems vary. Our findings contribute with new NP-complete, $\Sigma_2^p$-complete and $\Sigma_3^p$-complete problems. Motivated by the intrinsic intractability of the MCN, we then design efficient heuristics for the game by building upon the recent approaches seeking to learn heuristics for combinatorial optimization problems through graph neural networks and reinforcement learning. But contrary to previous work, we tackle situations with multiple players taking decisions sequentially. By framing them in a multi-agent reinforcement learning setting, we devise a value-based method to learn to solve multilevel budgeted combinatorial problems involving two players in a zero-sum game over a graph. Our framework is based on a simple curriculum: if an agent knows how to estimate the value of instances with budgets up to B, then solving instances with budget B+1 can be done in polynomial time regardless of the direction of the optimization by checking the value of every possible afterstate. Thus, in a bottom-up approach, we generate datasets of heuristically solved instances with increasingly larger budgets to train our agent. We report results close to optimality on graphs up to 100 nodes and a 185 x speedup on average compared to the quickest exact solver known for the MCN. Optimisation Combinatoire Optimisation Multiniveaux Théorie de la Complexité Hiérarchie Polynomiale Apprentissage par Renforcement Apprentissage Multiagent Réseaux de Neurones Graphiques Combinatorial Optimization Multilevel Programming Computational Complexity Theory Polynomial Hierarchy Reinforcement Learning Multi-Agent Reinforcement Learning Graph Neural Networks
14	Uncontrolled intersection coordination of the autonomous vehicle based on multi-agent reinforcement learning. McSey, Isaac Arnold January 2023 (has links) This study explores the application of multi-agent reinforcement learning (MARL) to enhance the decision-making, safety, and passenger comfort of Autonomous Vehicles (AVs)at uncontrolled intersections. The research aims to assess the potential of MARL in modeling multiple agents interacting within a shared environment, reflecting real-world situations where AVs interact with multiple actors. The findings suggest that AVs trained using aMARL approach with global experiences can better navigate intersection scenarios than AVs trained on local (individual) experiences. This capability is a critical precursor to achieving Level 5 autonomy, where vehicles are expected to manage all aspects of the driving task under all conditions. The research contributes to the ongoing discourse on enhancing autonomous vehicle technology through multi-agent reinforcement learning and informs the development of sophisticated training methodologies for autonomous driving. Autonomous Vehicles (AVs) Road Safety Fuel Efficiency Business Dynamics Intersections Human-Driven Vehicles (HDVs) Pedestrians Algorithmic Interactions Uncontrolled Intersections Global Insights Safety Improvements Comfort Improvements Learning Process Global Experiences Complex Environments Passenger Comfort Navigation Computer Sciences Datavetenskap (datalogi) Information Systems

Page generated in 0.1407 seconds