Spelling suggestions: "subject:"markov codecision canprocess"" "subject:"markov codecision 3.3vprocess""
131 |
Parsimonious reasoning in reinforcement learning for better credit assignmentMa, Michel 08 1900 (has links)
Le contenu de cette thèse explore la question de l’attribution de crédits à long terme dans l’apprentissage par renforcement du point de vue d’un biais inductif de parcimonie. Dans ce contexte, un agent parcimonieux cherche à comprendre son environnement en utilisant le moins de variables possible. Autrement dit, si l’agent est crédité ou blâmé pour un certain comportement, la parcimonie l’oblige à attribuer ce crédit (ou blâme) à seulement quelques variables latentes sélectionnées. Avant de proposer de nouvelles méthodes d’attribution parci- monieuse de crédits, nous présentons les travaux antérieurs relatifs à l’attribution de crédits à long terme en relation avec l’idée de sparsité. Ensuite, nous développons deux nouvelles idées pour l’attribution de crédits dans l’apprentissage par renforcement qui sont motivées par un raisonnement parcimonieux : une dans le cadre sans modèle et une pour l’apprentissage basé sur un modèle. Pour ce faire, nous nous appuyons sur divers concepts liés à la parcimonie issus de la causalité, de l’apprentissage supervisé et de la simulation, et nous les appliquons dans un cadre pour la prise de décision séquentielle.
La première, appelée évaluation contrefactuelle de la politique, prend en compte les dévi- ations mineures de ce qui aurait pu être compte tenu de ce qui a été. En restreignant l’espace dans lequel l’agent peut raisonner sur les alternatives, l’évaluation contrefactuelle de la politique présente des propriétés de variance favorables à l’évaluation des politiques. L’évaluation contrefactuelle de la politique offre également une nouvelle perspective sur la rétrospection, généralisant les travaux antérieurs sur l’attribution de crédits a posteriori. La deuxième contribution de cette thèse est un algorithme augmenté d’attention latente pour l’apprentissage par renforcement basé sur un modèle : Latent Sparse Attentive Value Gra- dients (LSAVG). En intégrant pleinement l’attention dans la structure d’optimisation de la politique, nous montrons que LSAVG est capable de résoudre des tâches de mémoire active que son homologue sans modèle a été conçu pour traiter, sans recourir à des heuristiques ou à un biais de l’estimateur original. / The content of this thesis explores the question of long-term credit assignment in reinforce- ment learning from the perspective of a parsimony inductive bias. In this context, a parsi- monious agent looks to understand its environment through the least amount of variables possible. Alternatively, given some credit or blame for some behavior, parsimony forces the agent to assign this credit (or blame) to only a select few latent variables. Before propos- ing novel methods for parsimonious credit assignment, previous work relating to long-term credit assignment is introduced in relation to the idea of sparsity. Then, we develop two new ideas for credit assignment in reinforcement learning that are motivated by parsimo- nious reasoning: one in the model-free setting, and one for model-based learning. To do so, we build upon various parsimony-related concepts from causality, supervised learning, and simulation, and apply them to the Markov Decision Process framework.
The first of which, called counterfactual policy evaluation, considers minor deviations of what could have been given what has been. By restricting the space in which the agent can reason about alternatives, counterfactual policy evaluation is shown to have favorable variance properties for policy evaluation. Counterfactual policy evaluation also offers a new perspective to hindsight, generalizing previous work in hindsight credit assignment. The second contribution of this thesis is a latent attention augmented algorithm for model-based reinforcement learning: Latent Sparse Attentive Value Gradients (LSAVG). By fully inte- grating attention into the structure for policy optimization, we show that LSAVG is able to solve active memory tasks that its model-free counterpart was designed to tackle, without resorting to heuristics or biasing the original estimator.
|
132 |
Deep Reinforcement Learning for Autonomous Highway Driving ScenarioPradhan, Neil January 2021 (has links)
We present an autonomous driving agent on a simulated highway driving scenario with vehicles such as cars and trucks moving with stochastically variable velocity profiles. The focus of the simulated environment is to test tactical decision making in highway driving scenarios. When an agent (vehicle) maintains an optimal range of velocity it is beneficial both in terms of energy efficiency and greener environment. In order to maintain an optimal range of velocity, in this thesis work I proposed two novel reward structures: (a) gaussian reward structure and (b) exponential rise and fall reward structure. I trained respectively two deep reinforcement learning agents to study their differences and evaluate their performance based on a set of parameters that are most relevant in highway driving scenarios. The algorithm implemented in this thesis work is double-dueling deep-Q-network with prioritized experience replay buffer. Experiments were performed by adding noise to the inputs, simulating Partially Observable Markov Decision Process in order to obtain reliability comparison between different reward structures. Velocity occupancy grid was found to be better than binary occupancy grid as input for the algorithm. Furthermore, methodology for generating fuel efficient policies has been discussed and demonstrated with an example. / Vi presenterar ett autonomt körföretag på ett simulerat motorvägsscenario med fordon som bilar och lastbilar som rör sig med stokastiskt variabla hastighetsprofiler. Fokus för den simulerade miljön är att testa taktiskt beslutsfattande i motorvägsscenarier. När en agent (fordon) upprätthåller ett optimalt hastighetsområde är det fördelaktigt både när det gäller energieffektivitet och grönare miljö. För att upprätthålla ett optimalt hastighetsområde föreslog jag i detta avhandlingsarbete två nya belöningsstrukturer: (a) gaussisk belöningsstruktur och (b) exponentiell uppgång och nedgång belöningsstruktur. Jag utbildade respektive två djupförstärkande inlärningsagenter för att studera deras skillnader och utvärdera deras prestanda baserat på en uppsättning parametrar som är mest relevanta i motorvägsscenarier. Algoritmen som implementeras i detta avhandlingsarbete är dubbel-duell djupt Q- nätverk med prioriterad återuppspelningsbuffert. Experiment utfördes genom att lägga till brus i ingångarna, simulera delvis observerbar Markov-beslutsprocess för att erhålla tillförlitlighetsjämförelse mellan olika belöningsstrukturer. Hastighetsbeläggningsgaller visade sig vara bättre än binärt beläggningsgaller som inmatning för algoritmen. Dessutom har metodik för att generera bränsleeffektiv politik diskuterats och demonstrerats med ett exempel.
|
133 |
Contributions to Simulation-based High-dimensional Sequential Decision Making / Contributions sur la prise de décision séquentielle basée sur des simulations dans des environnements complexes de grande dimensionHoock, Jean-Baptiste 10 April 2013 (has links)
Ma thèse s'intitule « Contributions sur la prise de décision séquentielle basée sur des simulations dans des environnements complexes de grande dimension ». Le cadre de la thèse s'articule autour du jeu, de la planification et des processus de décision markovien. Un agent interagit avec son environnement en prenant successivement des décisions. L'agent part d'un état initial jusqu'à un état final dans lequel il ne peut plus prendre de décision. A chaque pas de temps, l'agent reçoit une observation de l'état de l'environnement. A partir de cette observation et de ses connaissances, il prend une décision qui modifie l'état de l'environnement. L'agent reçoit en conséquence une récompense et une nouvelle observation. Le but est de maximiser la somme des récompenses obtenues lors d'une simulation qui part d'un état initial jusqu'à un état final. La politique de l'agent est la fonction qui, à partir de l'historique des observations, retourne une décision. Nous travaillons dans un contexte où (i) le nombre d'états est immense, (ii) les récompenses apportent peu d'information, (iii) la probabilité d'atteindre rapidement un bon état final est faible et (iv) les connaissances a priori de l'environnement sont soit inexistantes soit difficilement exploitables. Les 2 applications présentées dans cette thèse répondent à ces contraintes : le jeu de Go et le simulateur 3D du projet européen MASH (Massive Sets of Heuristics). Afin de prendre une décision satisfaisante dans ce contexte, plusieurs solutions sont apportées :1. simuler en utilisant le compromis exploration/exploitation (MCTS)2. réduire la complexité du problème par des recherches locales (GoldenEye)3. construire une politique qui s'auto-améliore (RBGP)4. apprendre des connaissances a priori (CluVo+GMCTS) L'algorithme Monte-Carlo Tree Search (MCTS) est un algorithme qui a révolutionné le jeu de Go. A partir d'un modèle de l'environnement, MCTS construit itérativement un arbre des possibles de façon asymétrique en faisant des simulations de Monte-Carlo et dont le point de départ est l'observation courante de l'agent. L'agent alterne entre l'exploration du modèle en prenant de nouvelles décisions et l'exploitation des décisions qui obtiennent statistiquement une bonne récompense cumulée. Nous discutons de 2 moyens pour améliorer MCTS : la parallélisation et l'ajout de connaissances a priori. La parallélisation ne résout pas certaines faiblesses de MCTS ; notamment certains problèmes locaux restent des verrous. Nous proposons un algorithme (GoldenEye) qui se découpe en 2 parties : détection d'un problème local et ensuite sa résolution. L'algorithme de résolution réutilise des principes de MCTS et fait ses preuves sur une base classique de problèmes difficiles. L'ajout de connaissances à la main est laborieuse et ennuyeuse. Nous proposons une méthode appelée Racing-based Genetic Programming (RBGP) pour ajouter automatiquement de la connaissance. Le point fort de cet algorithme est qu'il valide rigoureusement l'ajout d'une connaissance a priori et il peut être utilisé non pas pour optimiser un algorithme mais pour construire une politique. Dans certaines applications telles que MASH, les simulations sont coûteuses en temps et il n'y a ni connaissance a priori ni modèle de l'environnement; l'algorithme Monte-Carlo Tree Search est donc inapplicable. Pour rendre MCTS applicable dans MASH, nous proposons une méthode pour apprendre des connaissances a priori (CluVo). Nous utilisons ensuite ces connaissances pour améliorer la rapidité de l'apprentissage de l'agent et aussi pour construire un modèle. A partir de ce modèle, nous utilisons une version adaptée de Monte-Carlo Tree Search (GMCTS). Cette méthode résout de difficiles problématiques MASH et donne de bons résultats dans une application dont le but est d'améliorer un tirage de lettres. / My thesis is entitled "Contributions to Simulation-based High-dimensional Sequential Decision Making". The context of the thesis is about games, planning and Markov Decision Processes. An agent interacts with its environment by successively making decisions. The agent starts from an initial state until a final state in which the agent can not make decision anymore. At each timestep, the agent receives an observation of the state of the environment. From this observation and its knowledge, the agent makes a decision which modifies the state of the environment. Then, the agent receives a reward and a new observation. The goal is to maximize the sum of rewards obtained during a simulation from an initial state to a final state. The policy of the agent is the function which, from the history of observations, returns a decision. We work in a context where (i) the number of states is huge, (ii) reward carries little information, (iii) the probability to reach quickly a good final state is weak and (iv) prior knowledge is either nonexistent or hardly exploitable. Both applications described in this thesis present these constraints : the game of Go and a 3D simulator of the european project MASH (Massive Sets of Heuristics). In order to take a satisfying decision in this context, several solutions are brought : 1. Simulating with the compromise exploration/exploitation (MCTS) 2. Reducing the complexity by local solving (GoldenEye) 3. Building a policy which improves itself (RBGP) 4. Learning prior knowledge (CluVo+GMCTS) Monte-Carlo Tree Search (MCTS) is the state of the art for the game of Go. From a model of the environment, MCTS builds incrementally and asymetrically a tree of possible futures by performing Monte-Carlo simulations. The tree starts from the current observation of the agent. The agent switches between the exploration of the model and the exploitation of decisions which statistically give a good cumulative reward. We discuss 2 ways for improving MCTS : the parallelization and the addition of prior knowledge. The parallelization does not solve some weaknesses of MCTS; in particular some local problems remain challenges. We propose an algorithm (GoldenEye) which is composed of 2 parts : detection of a local problem and then its resolution. The algorithm of resolution reuses some concepts of MCTS and it solves difficult problems of a classical database. The addition of prior knowledge by hand is laborious and boring. We propose a method called Racing-based Genetic Programming (RBGP) in order to add automatically prior knowledge. The strong point is that RBGP rigorously validates the addition of a prior knowledge and RBGP can be used for building a policy (instead of only optimizing an algorithm). In some applications such as MASH, simulations are too expensive in time and there is no prior knowledge and no model of the environment; therefore Monte-Carlo Tree Search can not be used. So that MCTS becomes usable in this context, we propose a method for learning prior knowledge (CluVo). Then we use pieces of prior knowledge for improving the rapidity of learning of the agent and for building a model, too. We use from this model an adapted version of Monte-Carlo Tree Search (GMCTS). This method solves difficult problems of MASH and gives good results in an application to a word game.
|
134 |
Analyse et algorithmes de résolution de systèmes ATO (Assemble-To-Order) : Applications aux systèmes du type W / Analysis and Computational Algorithms for Assemble-To-Order systems : Application to W-configuration systemsFang, Jianxin 02 October 2017 (has links)
Nous analysons un type W de système de l’Assemble-à-commande avec des délais de livraison aléatoires, l'arrivée aléatoire de la demande et des ventes perdues, en temps continu. Nous formulons le problème en tant que processus de décision Markov à l'horizon infini. Nous nous éloignons de l'approche standard en caractérisant une région de l'espace d'état où toutes les propriétés de la fonction de coût tiennent. Nous caractérisons la politique optimale dans cette région. En particulier, nous montrons que, dans l'intérieur de la région récurrente, les composants sont toujours produits. Nous caractérisons également la politique d'allocation de composants optimale qui spécifie si une demande de produit arrivant devrait être remplie. Notre analyse révèle que la politique d'allocation optimale est contre-intuitive. Par exemple, même lorsqu'un produit domine l'autre, en termes de coût/taux de vente perdue, sa demande peut ne pas avoir une priorité absolue par rapport à la demande de l'autre produit. Une telle caractéristique n'a pas été observée dans de nombreux paramètres intégrés de production/inventaire où l'allocation d'inventaire suit une priorité fixe pour satisfaire les exigences. Nous montrons également que la structure de la politique optimale reste la même pour les systèmes à production par lots, les temps de production répartis par Erlang et la demande de produits non unitaire. Enfin, nous proposons des heuristiques efficaces qui peuvent être utilisées comme substitut à la politique optimale ou peuvent être utilisées comme une politique de départ pour les algorithmes communs utilisés pour obtenir une politique optimale dans le but de réduire leur temps de calcul. / We analyze a W-configuration assemble-to-order system with random lead times, random arrival of demand, and lost sales, in continuous time. We formulate the problem as an infinite-horizon Markov decision process. We deviate from the standard approach by first characterizing a region (the recurrent region) of the state space where all properties of the cost function hold. We then characterize the optimal policy within this region. In particular, we show that within the interior of the recurrent region components are always produced. We also characterize the optimal component allocation policy which specifies whether an arriving product demand should be fulfilled. Our analysis reveals that the optimal allocation policy is counter-intuitive. For instance, even when one product dominates the other, in terms of lost sale cost and lost sale cost rate (i.e., demand rate times the lost sale cost), its demand may not have absolute priority over the other product’s demand. Such a feature has not been observed in many integrated production/inventory settings where inventory allocation follows a fixed priority in satisfying demands. We also show that the structure of the optimal policy remains the same for systems with batch production, Erlang distributed production times, and non-unitary product demand. Finally, we propose efficient heuristics that can be either used as a substitute for the optimal policy or can be used as a starting policy for the common algorithms that are used to obtain the optimal policy in an effort to reduce their computational time
|
135 |
Contrôle adaptatif des feux de signalisation dans les carrefours : modélisation du système de trafic dynamique et approches de résolution / Adaptative traffic signal control at intersections : dynamic traffic system modeling and algorithmsYin, Biao 11 December 2015 (has links)
La régulation adaptative des feux de signalisation est un problème très important. Beaucoup de chercheurs travaillent continuellement afin de résoudre les problémes liés à l’embouteillage dans les intersections urbaines. Il devient par conséquent très utile d’employer des algorithmes intelligents afin d’améliorer les performances de régulation et la qualité du service. Dans cette thèse, nous essayons d'étudier ce problème d’une part à travers une modèlisation microscopique et dynamique en temps discret, et d’autre part en explorant plusieurs approches de résoltion pour une intersection isolée ainsi que pour un réseau distribué d'intersections.La première partie se concentre sur la modélisation dynamique des problèmes des feux de signalisation ainsi que de la charge du réseau d’intersections. Le mode de la “séquence de phase adaptative” (APS) dans un plan de feux est d'abord considéré. Quant à la modélisation du contrôle des feux aux intersections, elle est formulée grâce à un processus décisionnel de markov (MDP). En particulier, la notion de “l'état du système accordable” est alors proposée pour la coordination du réseau de trafic. En outre, un nouveau modèle de “véhicule-suiveur” est proposé pour l'environnement de trafic. En se basant sur la modélisation proposée, les méthodes de contrôle des feux dans cette thèse comportent des algorithmes optimaux et quasi-optimaux. Deux algorithmes exacts de résolution basées sur la programmation dynamique (DP) sont alors étudiés et les résultats montrent certaines limites de cette solution DP surtout dans quelques cas complexes où l'espace d'états est assez important. En raison de l’importance du temps d’execution de l'algorithme DP et du manque d'information du modèle (notamment l’information exacte relative à l’arrivée des véhicules à l’intersection), nous avons opté pour un algorithme de programmation dynamique approximative (ADP). Enfin, un algorithme quasi-optimal utilisant l'ADP combinée à la méthode d’amélioration RLS-TD (λ) est choisi. Dans les simulations, en particulier avec l'intégration du mode de phase APS, l'algorithme proposé montre de bons résultats notamment en terme de performance et d'efficacité de calcul. / Adaptive traffic signal control is a decision making optimization problem. People address this crucial problem constantly in order to solve the traffic congestion at urban intersections. It is very popular to use intelligent algorithms to improve control performances, such as traffic delay. In the thesis, we try to study this problem comprehensively with a microscopic and dynamic model in discrete-time, and investigate the related algorithms both for isolated intersection and distributed network control. At first, we focus on dynamic modeling for adaptive traffic signal control and network loading problems. The proposed adaptive phase sequence (APS) mode is highlighted as one of the signal phase control mechanisms. As for the modeling of signal control at intersections, problems are fundamentally formulated by Markov decision process (MDP), especially the concept of tunable system state is proposed for the traffic network coordination. Moreover, a new vehicle-following model supports for the network loading environment.Based on the model, signal control methods in the thesis are studied by optimal and near-optimal algorithms in turn. Two exact DP algorithms are investigated and results show some limitations of DP solution when large state space appears in complex cases. Because of the computational burden and unknown model information in dynamic programming (DP), it is suggested to use an approximate dynamic programming (ADP). Finally, the online near-optimal algorithm using ADP with RLS-TD(λ) is confirmed. In simulation experiments, especially with the integration of APS, the proposed algorithm indicates a great advantage in performance measures and computation efficiency.
|
136 |
System Availability Maximization and Residual Life Prediction under Partial ObservationsJiang, Rui 10 January 2012 (has links)
Many real-world systems experience deterioration with usage and age, which often leads to low product quality, high production cost, and low system availability. Most previous maintenance and reliability models in the literature do not incorporate condition monitoring information for decision making, which often results in poor failure prediction for partially observable deteriorating systems. For that reason, the development of fault prediction and control scheme using condition-based maintenance techniques has received considerable attention in recent years. This research presents a new framework for predicting failures of a partially observable deteriorating system using Bayesian control techniques. A time series model is fitted to a vector observation process representing partial information about the system state. Residuals are then calculated using the fitted model, which are indicative of system deterioration. The deterioration process is modeled as a 3-state continuous-time homogeneous Markov process. States 0 and 1 are not observable, representing healthy (good) and unhealthy (warning) system operational conditions, respectively. Only the failure state 2 is assumed to be observable. Preventive maintenance can be carried out at any sampling epoch, and corrective maintenance is carried out upon system failure. The form of the optimal control policy that maximizes the long-run expected average availability per unit time has been investigated. It has been proved that a control limit policy is optimal for decision making. The model parameters have been estimated using the Expectation Maximization (EM) algorithm. The optimal Bayesian fault prediction and control scheme, considering long-run average availability maximization along with a practical statistical constraint, has been proposed and compared with the age-based replacement policy. The optimal control limit and sampling interval are calculated in the semi-Markov decision process (SMDP) framework. Another Bayesian fault prediction and control scheme has been developed based on the average run length (ARL) criterion. Comparisons with traditional control charts are provided. Formulae for the mean residual life and the distribution function of system residual life have been derived in explicit forms as functions of a posterior probability statistic. The advantage of the Bayesian model over the well-known 2-parameter Weibull model in system residual life prediction is shown. The methodologies are illustrated using simulated data, real data obtained from the spectrometric analysis of oil samples collected from transmission units of heavy hauler trucks in the mining industry, and vibration data from a planetary gearbox machinery application.
|
137 |
System Availability Maximization and Residual Life Prediction under Partial ObservationsJiang, Rui 10 January 2012 (has links)
Many real-world systems experience deterioration with usage and age, which often leads to low product quality, high production cost, and low system availability. Most previous maintenance and reliability models in the literature do not incorporate condition monitoring information for decision making, which often results in poor failure prediction for partially observable deteriorating systems. For that reason, the development of fault prediction and control scheme using condition-based maintenance techniques has received considerable attention in recent years. This research presents a new framework for predicting failures of a partially observable deteriorating system using Bayesian control techniques. A time series model is fitted to a vector observation process representing partial information about the system state. Residuals are then calculated using the fitted model, which are indicative of system deterioration. The deterioration process is modeled as a 3-state continuous-time homogeneous Markov process. States 0 and 1 are not observable, representing healthy (good) and unhealthy (warning) system operational conditions, respectively. Only the failure state 2 is assumed to be observable. Preventive maintenance can be carried out at any sampling epoch, and corrective maintenance is carried out upon system failure. The form of the optimal control policy that maximizes the long-run expected average availability per unit time has been investigated. It has been proved that a control limit policy is optimal for decision making. The model parameters have been estimated using the Expectation Maximization (EM) algorithm. The optimal Bayesian fault prediction and control scheme, considering long-run average availability maximization along with a practical statistical constraint, has been proposed and compared with the age-based replacement policy. The optimal control limit and sampling interval are calculated in the semi-Markov decision process (SMDP) framework. Another Bayesian fault prediction and control scheme has been developed based on the average run length (ARL) criterion. Comparisons with traditional control charts are provided. Formulae for the mean residual life and the distribution function of system residual life have been derived in explicit forms as functions of a posterior probability statistic. The advantage of the Bayesian model over the well-known 2-parameter Weibull model in system residual life prediction is shown. The methodologies are illustrated using simulated data, real data obtained from the spectrometric analysis of oil samples collected from transmission units of heavy hauler trucks in the mining industry, and vibration data from a planetary gearbox machinery application.
|
138 |
Generalized Sampling-Based Feedback Motion PlannersKumar, Sandip 2011 December 1900 (has links)
The motion planning problem can be formulated as a Markov decision process (MDP), if the uncertainties in the robot motion and environments can be modeled probabilistically. The complexity of solving these MDPs grow exponentially as the dimension of the problem increases and hence, it is nearly impossible to solve the problem even without constraints. Using hierarchical methods, these MDPs can be transformed into a semi-Markov decision process (SMDP) which only needs to be solved at certain landmark states. In the deterministic robotics motion planning community, sampling based algorithms like probabilistic roadmaps (PRM) and rapidly exploring random trees (RRTs) have been successful in solving very high dimensional deterministic problem. However they are not robust to system with uncertainties in the system dynamics and hence, one of the primary objective of this work is to generalize PRM/RRT to solve motion planning with uncertainty.
We first present generalizations of randomized sampling based algorithms PRM and RRT, to incorporate the process uncertainty, and obstacle location uncertainty, termed as "generalized PRM" (GPRM) and "generalized RRT" (GRRT). The controllers used at the lower level of these planners are feedback controllers which ensure convergence of trajectories while mitigating the effects of process uncertainty. The results indicate that the algorithms solve the motion planning problem for a single agent in continuous state/control spaces in the presence of process uncertainty, and constraints such as obstacles and other state/input constraints.
Secondly, a novel adaptive sampling technique, termed as "adaptive GPRM" (AGPRM), is proposed for these generalized planners to increase the efficiency and overall success probability of these planners. It was implemented on high-dimensional robot n-link manipulators, with up to 8 links, i.e. in a 16-dimensional state-space. The results demonstrate the ability of the proposed algorithm to handle the motion planning problem for highly non-linear systems in very high-dimensional state space.
Finally, a solution methodology, termed the "multi-agent AGPRM" (MAGPRM), is proposed to solve the multi-agent motion planning problem under uncertainty. The technique uses a existing solution technique to the multiple traveling salesman problem (MTSP) in conjunction with GPRM. For real-time implementation, an ?inter-agent collision detection and avoidance? module was designed which ensures that no two agents collide at any time-step. Algorithm was tested on teams of homogeneous and heterogeneous agents in cluttered obstacle space and the algorithm demonstrate the ability to handle such problems in continuous state/control spaces in presence of process uncertainty.
|
139 |
Distribution de Processus Décisionnels Markoviens pour une gestion prédictive d’une ressource partagée : application aux voies navigables des Hauts-de-France dans le contexte incertain du changement climatique / Distributing Markov Decision Processes for a predictive management of a shared resource : application to the Hauts-de-France waterways in the uncertain context of climate changeDesquesnes, Guillaume, Louis, Florent 23 October 2018 (has links)
Les travaux de cette thèse visent à mettre en place une gestion prédictive sous incertitudes de la ressource en eau pour les réseaux de voies navigables. L'objectif est de proposer un plan de gestion de l'eau pour optimiser les conditions de navigation de l'ensemble du réseau supervisé sur un horizon spécifié. La solution attendue doit rendre le réseau résilient aux effets probables du changement climatique et aux évolutions du trafic fluvial. Dans un premier temps, une modélisation générique d'une ressource distribuée sur un réseau est proposée. Celle-ci, basée sur les processus décisionnels markoviens, prend en compte les nombreuses incertitudes affectant les réseaux considérés. L'objectif de cette modélisation est de couvrir l'ensemble des cas possibles, prévus ou non, afin d'avoir une gestion résiliente de ces réseaux. La seconde contribution consiste en une distribution du modèle sur plusieurs agents afin de permettre son passage à l'échelle. Ceci consiste en une répartition des capacités de contrôle du réseau entre les agents. Chaque agent ne possède ainsi qu'une connaissance locale du réseau supervisé. De ce fait, les agents ont besoin de se cordonner pour proposer une gestion efficace du réseau. Une résolution itérative avec échanges de plans temporaires de chaque agent est utilisée pour l'obtention de politiques de gestion locales à chaque agent. Finalement, des expérimentations ont été réalisées sur des réseaux réels de voies navigables françaises pour observer la qualité des solutions produites. Plusieurs scénarios climatiques différents ont été simulés pour tester la résilience des politiques produites. / The work of this thesis aims to introduce and implement a predictive management under uncertainties of the water resource for inland waterway networks. The objective is to provide a water management plan to optimize the navigation conditions of the entire supervised network over a specified horizon. The expected solution must render the network resilient to probable effects of the climate change and changes in waterway traffic. Firstly, a generic modeling of a resource distributed on a network is proposed. This modeling, based on Markovian Decision Processes, takes into account the numerous uncertainties affecting considered networks. The objective of this modeling is to cover all possible cases, foreseen or not, in order to have a resilient management of those networks. The second contribution consists in a distribution of the model over several agents to facilitate the scaling. This consists of a repartition of the network's control capacities among the agents. Thus, each agent has only local knowledge of the supervised network. As a result, agents require coordination to provide an efficient management of the network. An iterative resolution, with exchanges of temporary plans from each agent, is used to obtain local management policies for each agent. Finally, experiments were carried out on realistic and real networks of the French waterways to observe the quality of the solutions produced. Several different climatic scenarios have been simulated to test the resilience of the produced policies.
|
140 |
Optimization Algorithms for Deterministic, Stochastic and Reinforcement Learning SettingsJoseph, Ajin George January 2017 (has links) (PDF)
Optimization is a very important field with diverse applications in physical, social and biological sciences and in various areas of engineering. It appears widely in ma-chine learning, information retrieval, regression, estimation, operations research and a wide variety of computing domains. The subject is being deeply studied both theoretically and experimentally and several algorithms are available in the literature. These algorithms which can be executed (sequentially or concurrently) on a computing machine explore the space of input parameters to seek high quality solutions to the optimization problem with the search mostly guided by certain structural properties of the objective function. In certain situations, the setting might additionally demand for “absolute optimum” or solutions close to it, which makes the task even more challenging.
In this thesis, we propose an optimization algorithm which is “gradient-free”, i.e., does not employ any knowledge of the gradient or higher order derivatives of the objective function, rather utilizes objective function values themselves to steer the search. The proposed algorithm is particularly effective in a black-box setting, where a closed-form expression of the objective function is unavailable and gradient or higher-order derivatives are hard to compute or estimate. Our algorithm is inspired by the well known cross entropy (CE) method. The CE method is a model based search method to solve continuous/discrete multi-extremal optimization problems, where the objective function has minimal structure. The proposed method seeks, in the statistical manifold of the parameters which identify the probability distribution/model defined over the input space to find the degenerate distribution concentrated on the global optima (assumed to be finite in quantity). In the early part of the thesis, we propose a novel stochastic approximation version of the CE method to the unconstrained optimization problem, where the objective function is real-valued and deterministic. The basis of the algorithm is a stochastic process of model parameters which is probabilistically dependent on the past history, where we reuse all the previous samples obtained in the process till the current instant based on discounted averaging. This approach can save the overall computational and storage cost. Our algorithm is incremental in nature and possesses attractive features such as stability, computational and storage efficiency and better accuracy. We further investigate, both theoretically and empirically, the asymptotic behaviour of the algorithm and find that the proposed algorithm exhibits global optimum convergence for a particular class of objective functions.
Further, we extend the algorithm to solve the simulation/stochastic optimization problem. In stochastic optimization, the objective function possesses a stochastic characteristic, where the underlying probability distribution in most cases is hard to comprehend and quantify. This begets a more challenging optimization problem, where the ostentatious nature is primarily due to the hardness in computing the objective function values for various input parameters with absolute certainty. In this case, one can only hope to obtain noise corrupted objective function values for various input parameters. Settings of this kind can be found in scenarios where the objective function is evaluated using a continuously evolving dynamical system or through a simulation. We propose a multi-timescale stochastic approximation algorithm, where we integrate an additional timescale to accommodate the noisy measurements and decimate the effects of the gratuitous noise asymptotically. We found that if the objective function and the noise involved in the measurements are well behaved and the timescales are compatible, then our algorithm can generate high quality solutions.
In the later part of the thesis, we propose algorithms for reinforcement learning/Markov decision processes using the optimization techniques we developed in the early stage. MDP can be considered as a generalized framework for modelling planning under uncertainty. We provide a novel algorithm for the problem of prediction in reinforcement learning, i.e., estimating the value function of a given stationary policy of a model free MDP (with large state and action spaces) using the linear function approximation architecture. Here, the value function is defined as the long-run average of the discounted transition costs. The resource requirement of the proposed method in terms of computational and storage cost scales quadratically in the size of the feature set. The algorithm is an adaptation of the multi-timescale variant of the CE method proposed in the earlier part of the thesis for simulation optimization. We also provide both theoretical and empirical evidence to corroborate the credibility and effectiveness of the approach.
In the final part of the thesis, we consider a modified version of the control problem in a model free MDP with large state and action spaces. The control problem most commonly addressed in the literature is to find an optimal policy which maximizes the value function, i.e., the long-run average of the discounted transition payoffs. The contemporary methods also presume access to a generative model/simulator of the MDP with the hidden premise that observations of the system behaviour in the form of sample trajectories can be obtained with ease from the model. In this thesis, we consider a modified version, where the cost function to be optimized is a real-valued performance function (possibly non-convex) of the value function. Additionally, one has to seek the optimal policy without presuming access to the generative model. In this thesis, we propose a stochastic approximation algorithm for this peculiar control problem. The only information, we presuppose, available to the algorithm is the sample trajectory generated using a priori chosen behaviour policy. The algorithm is data (sample trajectory) efficient, stable, robust as well as computationally and storage efficient. We provide a proof of convergence of our algorithm to a high performing policy relative to the behaviour policy.
|
Page generated in 0.087 seconds