  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
101

On two sequential problems: the load planning and sequencing problem and the non-normal recurrent neural network

Goyette, Kyle 07 1900 (has links)
The work in this thesis is separated into two parts. The first part deals with the load planning and sequencing problem for double-stack intermodal railcars, an operational problem found at many rail container terminals. In this problem, containers must be assigned to a platform on which they will be loaded, and the loading order must be determined. These decisions are made with the objective of minimizing the costs associated with handling the containers, as well as the cost of containers left behind. The deterministic version of the problem can be cast as a shortest-path problem on an ordered graph. This problem is challenging to solve because of the large size of the graph. We propose a two-stage heuristic based on the Iterative Deepening A* algorithm to compute solutions to the load planning and sequencing problem within a five-minute time budget. We then illustrate how a Deep Q-learning algorithm can be used to heuristically solve the same problem. The second part of this thesis considers sequential models in deep learning. A recent strategy to circumvent the exploding and vanishing gradient problems in recurrent neural networks (RNNs) is to constrain recurrent weight matrices to be orthogonal or unitary. While this ensures stable dynamics during training, it comes at the cost of reduced expressivity due to the limited variety of orthogonal transformations. We propose a parameterization of RNNs, based on the Schur decomposition, that mitigates the exploding and vanishing gradient problems while allowing for non-orthogonal recurrent weight matrices in the model.
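The abstract above leans on Iterative Deepening A*. For reference, a minimal generic IDA* over a toy weighted graph might look as follows; this is only the textbook algorithm, and the thesis's two-stage heuristic, graph construction, and cost model are not reproduced here.

```python
import math

def ida_star(start, goal, neighbors, h):
    """Cost of a cheapest start-to-goal path, or None if unreachable."""
    def search(node, g, bound, path):
        f = g + h(node)
        if f > bound:
            return f                      # over budget: candidate next bound
        if node == goal:
            return ("found", g)
        next_bound = math.inf
        for succ, w in neighbors(node):
            if succ in path:
                continue                  # avoid revisiting along this path
            path.add(succ)
            t = search(succ, g + w, bound, path)
            path.discard(succ)
            if isinstance(t, tuple):      # goal found below this node
                return t
            next_bound = min(next_bound, t)
        return next_bound

    bound = h(start)                      # start with the heuristic estimate
    while True:
        t = search(start, 0, bound, {start})
        if isinstance(t, tuple):
            return t[1]
        if t == math.inf:
            return None                   # search space exhausted
        bound = t                         # deepen: retry with the next bound
```

With an admissible heuristic, memory use stays linear in path depth, which is what makes this family of algorithms attractive on very large graphs like the one described above.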
102

GAME-THEORETIC MODELING OF MULTI-AGENT SYSTEMS: APPLICATIONS IN SYSTEMS ENGINEERING AND ACQUISITION PROCESSES

Salar Safarkhani (9165011) 24 July 2020 (has links)
The process of acquiring large-scale complex systems is usually characterized by cost and schedule overruns. To investigate the causes of this problem, we may view the acquisition of a complex system at several different time scales. At finer time scales, one may study different stages of the acquisition process, from the intricate details of the entire systems engineering process to communication between design teams to how individual designers solve problems. At the largest time scale, one may consider the acquisition process as a series of actions: request for bids, bidding and auctioning, contracting, and finally building and deploying the system, without resolving the fine details that occur within each step. In this work, we study the acquisition process at multiple scales. First, we develop a game-theoretic model for engineering the system in the building-and-deploying stage. We model the interactions among the system and subsystem engineers as a principal-agent problem. We develop a one-shot shallow systems engineering process and obtain the optimal transfer functions that best incentivize the subsystem engineers to maximize the expected system-level utility. The core of the principal-agent model is the quality function, which maps the effort of the agent to the performance (quality) of the system. We therefore build a stochastic quality function by modeling the design process as a sequential decision-making problem. Second, we develop and evaluate a model of the acquisition process that accounts for the strategic behavior of the different parties. We cast our model in terms of government-funded projects and assume the following steps. First, the government publishes a request for bids. Then, private firms offer their proposals in a bidding process, and the winning bidder enters into a contract with the government. The contract describes the system requirements and the corresponding monetary transfers for meeting them. The winning firm devotes effort to deliver a system that fulfills the requirements. This can be viewed as a game that the government plays with the bidding firms. We study how different parameters in the acquisition procedure affect the bidders' behavior and, therefore, the utility of the government. Using reinforcement learning, we seek to learn the optimal policies of the actors involved in this game. In particular, we study how the requirements, contract types (such as cost-plus and incentive-based contracts), number of bidders, and problem complexity affect the acquisition procedure. Furthermore, we study the bidding strategy of the private firms and how the contract types affect their strategic behavior.
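The principal-agent core described above can be caricatured with a deliberately tiny sketch: a principal offers a linear transfer t(q) = a0 + a1·q, the agent best-responds with effort, quality simply equals effort, and effort has a quadratic cost. All functional forms and grids here are illustrative assumptions, not the model from the abstract.

```python
# Toy principal-agent sketch (illustrative assumptions throughout).
EFFORTS = [i / 10 for i in range(11)]      # agent's effort grid in [0, 1]

def agent_best_effort(a0, a1):
    # The agent maximizes its transfer minus its effort cost e**2.
    return max(EFFORTS, key=lambda e: a0 + a1 * e - e ** 2)

def principal_payoff(a0, a1):
    e = agent_best_effort(a0, a1)          # anticipate the best response
    q = e                                  # quality equals effort in this toy
    return q - (a0 + a1 * q)               # system utility minus transfer paid

# The principal tunes the incentive slope a1 against the agent's response.
best_a1 = max((i / 10 for i in range(11)),
              key=lambda a1: principal_payoff(0.0, a1))
```

The point of the sketch is the structure, not the numbers: the principal optimizes over contracts while anticipating the agent's self-interested response, which is exactly the loop the abstract describes at larger scale.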
103

STUDY OF REINFORCEMENT LEARNING TECHNIQUES APPLIED TO THE CONTROL OF CHEMICAL PROCESSES

30 December 2021 (has links)
Industry 4.0 has driven the development of new technologies to meet current market demands. One of these technologies is the incorporation of computational intelligence techniques into the daily operations of the chemical industry. In this context, this work evaluated the performance of controllers based on reinforcement learning in industrial chemical processes. The control strategy directly affects the safety and cost of the process: the better its performance, the lower the production of effluents and the consumption of inputs and energy. The reinforcement learning algorithms showed excellent results for the first case study, a CSTR with Van de Vusse kinetics. However, the implementation of these algorithms in the Tennessee Eastman Process chemical plant showed that more studies are needed: the weak (or nonexistent) Markov property, the high dimensionality, and the peculiarities of the plant made it difficult for the developed controllers to obtain satisfactory results. For case study 1, the Q-Learning, Actor-Critic TD, DQL, DDPG, SAC, and TD3 algorithms were evaluated; for case study 2, the CMA-ES, TRPO, PPO, DDPG, SAC, and TD3 algorithms were evaluated.
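Of the algorithms listed for case study 1, tabular Q-learning is the simplest. Its update rule can be sketched on a toy one-dimensional setpoint task; the CSTR dynamics, discretizations, and hyperparameters of the actual study are not reproduced, and everything below is an illustrative assumption.

```python
import random

# Minimal tabular Q-learning on a toy 1-D setpoint-tracking task.
random.seed(0)
N_STATES, ACTIONS, TARGET = 5, (-1, 0, 1), 2
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)       # move within bounds
    return s2, (1.0 if s2 == TARGET else -0.1)  # reward at the setpoint

for _ in range(2000):                           # short training episodes
    s = random.randrange(N_STATES)
    for _ in range(10):
        a = (random.choice(ACTIONS) if random.random() < eps
             else max(ACTIONS, key=lambda a: Q[(s, a)]))  # eps-greedy
        s2, r = step(s, a)
        best_next = max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # TD update
        s = s2

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
```

The learned greedy policy drives the state toward the setpoint from either side and holds it there, which is the discrete analogue of what an RL controller is asked to do for a reactor setpoint.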
104

EXPANDING THE AUTONOMOUS SURFACE VEHICLE NAVIGATION PARADIGM THROUGH INLAND WATERWAY ROBOTIC DEPLOYMENT

Reeve David Lambert (13113279) 19 July 2022 (has links)
This thesis presents solutions to some of the problems facing Autonomous Surface Vehicle (ASV) deployments in inland waterways through the development of navigational and control systems. Fluvial systems are among the hardest inland waterways to navigate and are thus used as a use case for system development. The systems are built to reduce reliance on a priori information during ASV operation. This is crucial for exceptionally dynamic environments such as fluvial bodies of water, which have poorly defined routes and edges, can change course in short time spans, carry away and deposit obstacles, and expose or cover shoals and man-made structures as their water level changes. While navigation of fluvial systems is exceptionally difficult, autonomous data collection can aid important scientific missions in understudied environments.

The work has four contributions targeting solutions to four fundamental problems in fluvial navigation and control. To sense the course of fluvial systems for navigable-path determination, a fluvial segmentation study is conducted and a novel dataset detailed. To enable rapid path computation and augmentation in a fast-moving environment, a Dubins path generation and augmentation algorithm is presented and used in conjunction with an Integral Line-Of-Sight (ILOS) path-following method. To rapidly avoid unseen or undetected obstacles in fluvial environments, a Deep Reinforcement Learning (DRL) agent is built and tested across domains to create dynamic local paths that can be rapidly followed for collision avoidance. Finally, a custom low-cost, deployable ASV, BREAM (Boat for Robotic Engineering and Applied Machine-Learning), capable of operating in fluvial environments, is presented along with an autonomy package that provides base-level sensing and autonomy processing capability to varying platforms.

Each of these contributions forms part of a larger documented Fluvial Navigation Control Architecture (FNCA), proposed as a way to aid a-priori-free navigation of fluvial waterways. The architecture organizes the navigational structures into high-, mid-, and low-level Guidance and Navigation Control (GNC) layers designed to facilitate cross-vehicle and cross-domain deployments. Each component of the architecture is documented, tested, and its application to the control architecture as a whole is reported.
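The ILOS path-following method mentioned above has a common textbook form in which a lookahead-based heading reference is augmented with an integral of cross-track error; the integral term compensates for steady disturbances such as current. A sketch in that standard formulation follows, with the caveat that the gains and the exact ILOS variant used in the thesis are assumptions.

```python
import math

def ilos_heading(path_angle, cross_track, y_int, lookahead=5.0, kappa=0.5):
    """Desired heading and integral-state derivative for one control step.

    path_angle: orientation of the straight path segment (rad)
    cross_track: signed lateral distance to the path (m)
    y_int: current integral state of the cross-track error
    """
    # Steer back toward the path, biased by the integral term.
    psi_d = path_angle - math.atan((cross_track + kappa * y_int) / lookahead)
    # The integral state grows with cross-track error but saturates far from
    # the path, which limits wind-up under currents and other disturbances.
    y_int_dot = (lookahead * cross_track /
                 ((cross_track + kappa * y_int) ** 2 + lookahead ** 2))
    return psi_d, y_int_dot
```

In use, a low-level heading controller tracks psi_d while y_int is integrated forward at each control step; on the path with no accumulated error the law simply commands the path's own heading.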
105

Deep Reinforcement Learning for Autonomous Highway Driving Scenario

Pradhan, Neil January 2021 (has links)
We present an autonomous driving agent in a simulated highway driving scenario with vehicles such as cars and trucks moving with stochastically variable velocity profiles. The focus of the simulated environment is to test tactical decision-making in highway driving scenarios. When an agent (vehicle) maintains an optimal range of velocity, it is beneficial both in terms of energy efficiency and a greener environment. To maintain an optimal range of velocity, this thesis proposes two novel reward structures: (a) a Gaussian reward structure and (b) an exponential rise-and-fall reward structure. Two deep reinforcement learning agents were trained, one per reward structure, to study their differences and evaluate their performance on a set of parameters most relevant in highway driving scenarios. The algorithm implemented in this work is a double dueling deep Q-network with a prioritized experience replay buffer. Experiments were performed by adding noise to the inputs, simulating a Partially Observable Markov Decision Process, in order to compare the reliability of the different reward structures. A velocity occupancy grid was found to be a better input for the algorithm than a binary occupancy grid. Furthermore, a methodology for generating fuel-efficient policies is discussed and demonstrated with an example.
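The two proposed reward shapes can be sketched directly from their descriptions as functions of the ego velocity. The constants below are assumptions, and the exponential variant is written here to penalize overspeed more sharply than underspeed; the thesis's actual parameterizations may differ.

```python
import math

def gaussian_reward(v, v_opt=25.0, sigma=3.0):
    # Peaks at the optimal velocity and falls off symmetrically.
    return math.exp(-((v - v_opt) ** 2) / (2 * sigma ** 2))

def exp_rise_fall_reward(v, v_opt=25.0, k_rise=0.2, k_fall=0.5):
    # Rises exponentially toward v_opt and falls exponentially beyond it,
    # so exceeding the optimal velocity is punished more sharply than
    # falling short of it (an assumed asymmetry for illustration).
    if v <= v_opt:
        return math.exp(-k_rise * (v_opt - v))
    return math.exp(-k_fall * (v - v_opt))
```

Both shapes give maximal reward at the optimal velocity; they differ in how quickly reward decays away from it, which is precisely the design choice the two trained agents were meant to compare.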
106

Learning Adaptive Computation Strategies for Deep Neural Networks

Kamanda, Aton 07 1900 (has links)
The dual-process theory states that human cognition operates in two distinct modes: one for rapid, habitual, and associative processing, commonly referred to as "system 1," and a second, slower, deliberate, and controlled mode, commonly called "system 2." This distinction points to an important underlying feature of human cognition: the ability to switch adaptively between different computational strategies depending on the situation. This ability has long been studied in various fields, and many hypothesized benefits seem to be linked to it. However, deep neural networks are often built without this ability to optimally manage their computational resources. This limitation of current models is all the more concerning as recent work increasingly shows a linear relationship between the computational capacity used and model performance during evaluation. To address this problem, this thesis proposes several approaches and studies their impact on models. First, we study a deep reinforcement learning agent that is able to allocate more computation to more difficult situations. Our approach allows the agent to adapt its computational resources according to the demands of the situation in which it finds itself, which, in addition to improving computation time, improves transfer between related tasks and generalization. The central idea common to all our approaches is based on cost-of-effort theories from the cognitive control literature, which hold that by making the use of cognitive resources costly for the agent, and allowing it to allocate them when making decisions, the agent will itself learn to deploy its computational capacity optimally. We then study variations of the method on a reference deep learning task in order to analyze precisely how the model behaves and what the benefits of adopting such an approach are. We also create our own task, "Stroop MNIST," inspired by the Stroop test used in psychology, to validate certain hypotheses about the behavior of neural networks employing our method. We then highlight the strong links between dual-process learning and knowledge distillation methods; a particularity of our approach is that it saves computational resources during inference. Finally, we approach the problem with energy-based models: by learning an energy landscape during training, the model can, at inference time, employ a computational capacity that depends on the difficulty of the example it faces, rather than a fixed forward propagation with systematically the same computational cost. Despite unsuccessful experimental results, we analyze the promise of such an approach and hypothesize potential improvements. With our contributions, we hope to pave the way toward algorithms that make better use of their computational resources, and thus become more efficient in terms of cost and performance, as well as to provide a more intimate understanding of the links between certain machine learning methods and dual-process theory.
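The cost-of-effort idea above, charging the agent for each unit of computation so that spending more steps only pays off on harder inputs, can be illustrated with a toy utility model. The accuracy curve and cost constant below are assumptions for illustration, not the thesis's learned model.

```python
import math

def utility(steps, difficulty, cost=0.05):
    # Accuracy saturates with diminishing returns, more slowly on harder
    # inputs; each computation step is charged a fixed effort cost.
    accuracy = 1.0 - math.exp(-steps / difficulty)
    return accuracy - cost * steps

def best_steps(difficulty, max_steps=50):
    # The optimal compute budget under the effort cost.
    return max(range(max_steps + 1), key=lambda k: utility(k, difficulty))
```

Under this model the optimal number of steps grows with difficulty, which is the qualitative behavior the effort-charged agent is expected to learn: cheap, fast processing on easy inputs and deeper computation only where it pays.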
107

Deep Reinforcement Learning Adaptive Traffic Signal Control / Reinforcement Learning Traffic Signal Control

Genders, Wade 22 November 2018 (has links)
Sub-optimal automated transportation control systems incur high mobility, human-health, and environmental costs. With society reliant on its transportation systems for the movement of individuals, goods, and services, minimizing these costs benefits many. Intersection traffic signal controllers are an important element of modern transportation systems, governing how vehicles traverse road infrastructure. Many types of traffic signal controllers exist: fixed-time, actuated, and adaptive. Adaptive traffic signal controllers seek to minimize transportation costs through dynamic control of the intersection. However, many existing adaptive traffic signal controllers rely on heuristic or expert knowledge and were not originally designed for scalability or for transportation's big-data future. This research addresses these challenges by developing a scalable system for adaptive traffic signal control model development using deep reinforcement learning in traffic simulation. Traffic signal control can be modelled as a sequential decision-making problem, and reinforcement learning can solve sequential decision-making problems by learning an optimal policy. Deep reinforcement learning makes use of deep neural networks, powerful function approximators that benefit from large amounts of data. Distributed, parallel computing techniques are used to provide scalability, with the proposed methods validated on a simulation of the City of Luxembourg, Luxembourg, consisting of 196 intersections. This research contributes to the body of knowledge by successfully developing a scalable system for adaptive traffic signal control model development and validating it on the largest traffic microsimulation in the literature. The proposed system reduces delay, queues, vehicle stopped time, and travel time compared to conventional traffic signal controllers. Findings from this research include that reinforcement learning methods which explicitly develop the policy offer improved performance over purely value-based methods. The developed methods are expected to mitigate the problems caused by sub-optimal automated transportation control systems, improving mobility and human health and reducing environmental costs. / Thesis / Doctor of Philosophy (PhD) / Inefficient transportation systems negatively impact mobility, human health, and the environment. The goal of this research is to mitigate these negative impacts by improving automated transportation control systems, specifically intersection traffic signal controllers. This research presents a system for developing adaptive traffic signal controllers that can efficiently scale to the size of cities using machine learning and parallel computation techniques. The proposed system is validated by developing adaptive traffic signal controllers for 196 intersections in a simulation of the City of Luxembourg, Luxembourg, successfully reducing delay, queues, vehicle stopped time, and travel time.
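The finding that methods which "explicitly develop the policy" outperform purely value-based ones refers to policy-gradient-style algorithms. A minimal REINFORCE sketch on a toy two-phase choice shows what updating the policy directly means; the reward model is a stand-in with assumed success rates, not the thesis's traffic simulation.

```python
import math, random

# REINFORCE with a softmax policy over two "signal phases": the policy's
# logits are updated directly from sampled returns, rather than being
# derived from a learned value function.
random.seed(1)
theta = [0.0, 0.0]                     # one logit per phase
alpha, baseline = 0.1, 0.0

def policy():
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

def reward(phase):
    p = 0.8 if phase == 0 else 0.3     # phase 0 clears more queue on average
    return 1.0 if random.random() < p else 0.0

for _ in range(3000):
    probs = policy()
    a = 0 if random.random() < probs[0] else 1
    r = reward(a)
    baseline += 0.01 * (r - baseline)  # running baseline reduces variance
    for i in range(2):                 # grad of log-softmax: 1[a==i] - probs[i]
        theta[i] += alpha * (r - baseline) * ((1.0 if a == i else 0.0) - probs[i])
```

After training, the policy concentrates probability on the more rewarding phase; value-based methods reach similar decisions indirectly, by first estimating action values and then acting greedily on them.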
108

Reinforcement Learning for Market Making

Carlsson, Simon, Regnell, August January 2022 (has links)
Market making – the process of simultaneously and continuously providing buy and sell prices in a financial asset – is rather complicated to optimize. Applying reinforcement learning (RL) to infer optimal market making strategies is a relatively uncharted and novel research area. Most published articles in the field are notably opaque concerning most aspects, including precise methods, parameters, and results. This thesis attempts to explore and shed some light on the techniques, problem formulations, algorithms, and hyperparameters used to construct RL-derived strategies for market making. First, a simple probabilistic model of a limit order book is used to compare analytical and RL-derived strategies. Second, a market making agent is trained on a more complex Markov chain model of a limit order book, using tabular Q-learning and deep reinforcement learning with double deep Q-learning. Results and strategies are analyzed, compared, and discussed. Finally, we propose some extensions and directions for future work in this research field.
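The tabular Q-learning setup can be caricatured as a single-period, inventory-free quoting problem: the agent picks a half-spread, fills arrive with probability decreasing in the spread, and profit per fill equals the spread. The fill model and all parameters below are assumptions for illustration, not the thesis's limit order book models.

```python
import random

# Bandit-style Q-learning over quoted half-spreads with incremental
# sample-average estimates (no inventory, no adverse selection).
random.seed(2)
SPREADS = [1, 2, 3, 4]
Q = {d: 0.0 for d in SPREADS}
pulls = {d: 0 for d in SPREADS}
eps = 0.2

def fill_prob(d):
    return max(0.0, 1.0 - 0.25 * d)    # wider quotes fill less often

for _ in range(20000):
    d = (random.choice(SPREADS) if random.random() < eps
         else max(SPREADS, key=lambda x: Q[x]))         # eps-greedy quote
    pnl = float(d) if random.random() < fill_prob(d) else 0.0
    pulls[d] += 1
    Q[d] += (pnl - Q[d]) / pulls[d]    # incremental sample-average update

best_spread = max(SPREADS, key=lambda x: Q[x])
```

Even this stripped-down version exhibits the core tension of market making: quoting tighter fills more often but earns less per fill, and the learned Q-values settle on the spread with the best expected profit.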
109

Deep Reinforcement Learning for Multi-Agent Path Planning in 2D Cost Map Environments: Using Unity Machine Learning Agents Toolkit

Persson, Hannes January 2024 (has links)
Multi-agent path planning is applied in a wide range of applications in robotics and autonomous vehicles, including aerial vehicles such as drones and other unmanned aerial vehicles (UAVs), to solve tasks in areas like surveillance, search and rescue, and transportation. With today's rapidly evolving automation and artificial intelligence technology, multi-agent path planning is growing increasingly relevant. The main problems encountered in multi-agent path planning are collision avoidance with other agents, obstacle evasion, and pathfinding from a starting point to an endpoint. In this project, the objectives were to create intelligent agents capable of navigating through two-dimensional, eight-agent cost-map environments to a static target, while avoiding collisions with other agents and minimizing the path cost. Reinforcement learning was used via the Unity development platform and the open-source ML-Agents toolkit, which enables the development of intelligent agents with reinforcement learning inside Unity. Perlin noise was used to generate the cost maps, and the reinforcement learning algorithm Proximal Policy Optimization was used to train the agents. The training was structured as a curriculum with two lessons: the first was designed to teach the agents to reach the target without colliding with other agents or moving out of bounds; the second was designed to teach the agents to minimize the path cost. The project achieved its objectives, as determined from visual inspection and by comparing the final model with a baseline model trained only to reach the target while avoiding collisions, without minimizing the path cost. The comparison showed that the final model outperformed the baseline, achieving an average path cost 27.6% lower.
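The cost-map idea can be sketched without Unity: a smooth 2-D cost field (here a sum of sinusoids standing in for the Perlin noise used in the project) and a path cost defined as the sum of cell costs along the path, which is the quantity the trained agents are rewarded for minimizing. Field shape, size, and values are illustrative assumptions.

```python
import math

SIZE = 16

def cost(x, y):
    # Smooth pseudo-terrain cost in roughly [0, 2]; in the project, Perlin
    # noise generated inside Unity would play this role.
    return 1.0 + 0.5 * (math.sin(0.7 * x) + math.cos(0.5 * y))

def path_cost(path):
    # Total traversal cost of a path given as a sequence of (x, y) cells.
    return sum(cost(x, y) for x, y in path)

# Compare straight horizontal crossings row by row: even this crude family
# of paths shows that routing through low-cost terrain pays off.
rows = {y: path_cost([(x, y) for x in range(SIZE)]) for y in range(SIZE)}
cheapest_row = min(rows, key=rows.get)
```

A learned policy goes further than this row-by-row comparison by trading path length against cell cost locally at every step, but the objective it optimizes is the same accumulated path cost.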
