1 |
Algorithms for stochastic finite memory control of partially observable systems. Marwah, Gaurav, 06 August 2005.
A partially observable Markov decision process (POMDP) is a mathematical framework for planning and control problems in which actions have stochastic effects and observations provide uncertain state information. It is widely used for research in decision-theoretic planning and reinforcement learning. To cope with partial observability, a policy (or plan) must use memory, and previous work has shown that a finite-state controller provides a good policy representation. This thesis considers a previously developed bounded policy iteration algorithm for POMDPs that finds policies in the form of stochastic finite-state controllers, and develops two improvements to it. The first simplifies the basic linear program used to find improved controllers, which considerably speeds up the original algorithm. The second is a branch-and-bound algorithm for adding the best possible node to the controller, which provides an error bound and a test for global optimality. Experimental results show that these enhancements significantly improve the algorithm's performance.
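To make the bounded-policy-iteration idea concrete, here is a rough Python sketch of the kind of node-improvement linear program the abstract refers to, following the standard formulation rather than the thesis's own code; the array layouts R, T, O, the value table V, and the use of cvxpy are assumptions made for illustration only.

import numpy as np
import cvxpy as cp

def improve_node(n, R, T, O, V, gamma):
    """Try to improve controller node n.
    R[s, a]      -- immediate reward (assumed layout)
    T[s, a, s2]  -- transition probabilities
    O[a, s2, o]  -- observation probabilities
    V[n, s]      -- value of each controller node in each state
    Returns the improvement margin eps and the new node's stochastic parameters."""
    S, A = R.shape
    Z = O.shape[2]
    N = V.shape[0]

    eps = cp.Variable()                           # how much the new node improves on the old
    ca = cp.Variable(A, nonneg=True)              # P(action a | node)
    cao = cp.Variable((A, Z * N), nonneg=True)    # P(a, next node | node, o), flattened over (o, n2)

    cons = [cp.sum(ca) == 1]
    for a in range(A):
        for o in range(Z):
            # conditional next-node choices must be consistent with the action marginal
            cons.append(cp.sum(cao[a, o * N:(o + 1) * N]) == ca[a])

    for s in range(S):
        expr = ca @ R[s, :]
        for a in range(A):
            for o in range(Z):
                for n2 in range(N):
                    w = gamma * np.sum(T[s, a, :] * O[a, :, o] * V[n2, :])
                    expr = expr + w * cao[a, o * N + n2]
        # the new node must dominate the old node in every state by at least eps
        cons.append(expr >= V[n, s] + eps)

    cp.Problem(cp.Maximize(eps), cons).solve()
    return eps.value, ca.value, cao.value

If the optimal eps is positive, the node's parameters are replaced. The thesis's first contribution is a simplification of a linear program of this kind; the second is a branch-and-bound search for the best node to add.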
|
2 |
STRUCTURED MAINTENANCE POLICIES ON INTERIOR SAMPLE PATHS. Zheltova, Ludmila, 08 July 2010.
No description available.
|
3 |
Quick and Automatic Selection of POMDP Implementations on Mobile Platform Based on Battery Consumption Estimation. Yang, Xiao, January 2014.
A partially observable Markov decision process (POMDP) is widely used to model sequential decision making under uncertainty and incomplete knowledge of the environment. Solving a POMDP requires strong computation capability, so POMDP systems are usually deployed on powerful machines. However, as mobile platforms become more advanced and more popular, the potential of combining POMDPs with mobile devices to provide a broader range of services has been studied. This trend raises a question: how should we implement POMDPs on a mobile platform so that we can take advantage of mobile features while avoiding the platform's restrictions, such as short battery life, weak CPUs, unstable network connections, and other limited resources?
In response to this question, we first point out that the answer varies with the nature of the problem, the accuracy requirements, and the mobile device model. Rather than relying on pure mathematical analysis, our approach is to run experiments on a mobile device and to focus on a more specific question: which POMDP implementation is the "best" for a particular problem on a particular kind of device? Second, we propose and justify a selection criterion, based mainly on battery consumption, that quantifies the "goodness" of a POMDP implementation in terms of the rate at which it depletes the mobile battery. We then present a mobile battery consumption model that translates CPU and Wi-Fi usage into an estimated battery depletion rate, which greatly accelerates the experimental process. With this model, we combine a set of simple benchmark experiments with the CPU and Wi-Fi usage of each candidate POMDP implementation to estimate its battery depletion rate, instead of conducting hours of real battery experiments for each implementation individually. The final result is a ranking of POMDP implementations by estimated battery depletion rate, which serves as guidance for mobile developers selecting a POMDP implementation.
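As a purely illustrative sketch of what such a battery consumption model might look like (the linear form, the function name, and the coefficients below are assumptions, not the thesis's fitted model), the translation from resource usage to a depletion rate could be as simple as:

def estimated_depletion_rate(cpu_load, wifi_bytes_per_s,
                             k_idle=0.05, k_cpu=0.8, k_wifi=2.0e-6):
    """Estimated battery depletion rate (percent per minute) for one POMDP
    implementation, from the CPU and Wi-Fi usage it generates during benchmark
    runs. The k_* coefficients are hypothetical device-specific values that
    would be fitted from the simple benchmark experiments."""
    return k_idle + k_cpu * cpu_load + k_wifi * wifi_bytes_per_s

# Hypothetical candidates: an exact local solver, an approximate local solver,
# and a thin client that off-loads solving to a server over Wi-Fi.
candidates = {
    "exact-local": estimated_depletion_rate(0.95, 0.0),
    "approx-local": estimated_depletion_rate(0.40, 0.0),
    "remote-server": estimated_depletion_rate(0.05, 5.0e4),
}
ranking = sorted(candidates, key=candidates.get)   # lowest depletion rate first

A ranking produced in this way is what the toolkit described below reports to the developer.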
We develop a mobile software toolkit that automates this process. Given a basic POMDP problem specification, a set of candidate POMDP implementations, and a press of the "start" button, the toolkit automatically performs benchmark experiments on the target device on which it is installed and records CPU and Wi-Fi statistics for each candidate. It then feeds the data to its embedded battery consumption model and produces an estimated battery depletion rate for each candidate. Finally, the toolkit visualizes the ranking of POMDP implementations for mobile developers' reference.
We evaluate the toolkit by comparing the ranking derived from the estimated battery depletion rates with the ranking derived from battery depletion rates measured in real experiments. The two rankings are identical, as expected. Moreover, the cosine similarity between the estimated and measured depletion rates is almost 0.999, where 1 indicates that they are exactly the same.
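For reference, the cosine-similarity comparison mentioned above amounts to the following computation (the numbers here are made up; only the formula is meaningful):

import numpy as np

estimated = np.array([0.82, 0.41, 0.19])   # hypothetical estimated depletion rates
measured = np.array([0.80, 0.43, 0.20])    # hypothetical measured depletion rates

cos_sim = estimated @ measured / (np.linalg.norm(estimated) * np.linalg.norm(measured))
print(round(cos_sim, 4))                   # a value near 1 means the two vectors nearly coincide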
|
4 |
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes. Aberdeen, Douglas Alexander (doug.aberdeen@anu.edu.au), January 2003.
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning, which seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs).
In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.
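As a schematic illustration of the kind of policy being learned (this is a generic stochastic finite-state controller, not the thesis's implementation; all class and parameter names are invented), the parameters split naturally into an action table and a memory-update table:

import numpy as np

class StochasticFSC:
    """Internal-state policy: given (memory, observation) it samples an action
    and a next memory state. Both tables are what a policy-gradient method
    would adjust, i.e. 'how to act' and 'what to remember'."""

    def __init__(self, n_mem, n_obs, n_act, seed=0):
        self.rng = np.random.default_rng(seed)
        self.act_params = np.zeros((n_mem, n_obs, n_act))   # action preferences
        self.mem_params = np.zeros((n_mem, n_obs, n_mem))   # memory-update preferences

    @staticmethod
    def _softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def step(self, mem, obs):
        action = self.rng.choice(self.act_params.shape[2],
                                 p=self._softmax(self.act_params[mem, obs]))
        next_mem = self.rng.choice(self.mem_params.shape[2],
                                   p=self._softmax(self.mem_params[mem, obs]))
        return action, next_mem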
Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.
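For context, the baseline estimator whose eligibility trace the first method replaces looks roughly like the following GPOMDP-style sketch (a generic sketch, not the thesis's code; grad_log_policy and the trace parameter beta are placeholders). The thesis swaps the simple first-order trace below for higher-order filters:

import numpy as np

def gradient_estimate(episode, grad_log_policy, beta=0.95):
    """episode: sequence of (observation, action, reward) tuples;
    grad_log_policy(obs, action) -> gradient of the log action probability."""
    trace = None   # first-order (exponentially discounted) eligibility trace
    total = None
    for obs, action, reward in episode:
        g = grad_log_policy(obs, action)
        trace = g if trace is None else beta * trace + g
        total = reward * trace if total is None else total + reward * trace
    return total / len(episode)   # Monte-Carlo estimate of the gradient

A parameter update then ascends this estimate, e.g. theta += step_size * gradient_estimate(...).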
The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author's knowledge, no other policy-gradient algorithms have performed well at such tasks.
The high variance of Monte-Carlo methods requires lengthy simulation and hence a super-computer to train agents within a reasonable time. The ANU "Bunyip" Linux cluster was built with such tasks in mind and was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell prize for price/performance in 2001.
|
5 |
Distributed Decision-Making and Task Coordination in Dynamic, Uncertain and Real-Time Multiagent Environments. Paquet, Sébastien, 19 December 2005.
Decision making under uncertainty and coordination are at the heart of multiagent systems. In such systems, agents must be able to perceive their environment and to make decisions that take the other agents into account. When the environment is only partially observable, agents must manage this uncertainty in order to make the most informed decisions possible given the incomplete information they have been able to acquire. Moreover, in cooperative multiagent environments, agents must be able to coordinate their actions so as to accomplish tasks that require the collaboration of more than one agent. In this thesis, we consider complex cooperative multiagent environments (dynamic, uncertain and real-time). For this type of environment, we propose an approach to decision making under uncertainty that allows flexible coordination among the agents. More precisely, we present an online algorithm for solving partially observable Markov decision processes (POMDPs). Furthermore, in such environments the tasks the agents must accomplish can become very complex, and it can be difficult for the agents to determine the number of resources needed to accomplish each task. To address this problem, we propose a learning algorithm that learns the number of resources required to accomplish a task from the task's characteristics. In the same vein, we also propose a scheduling method that orders the agents' tasks so as to maximise the number of tasks that can be accomplished within a limited time. All of these approaches aim to enable agents to coordinate so that they can accomplish complex tasks efficiently in a partially observable, dynamic and uncertain multiagent environment. They have all demonstrated their effectiveness in tests carried out in the RoboCupRescue simulation environment.
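To give a flavour of online POMDP decision making of the kind described above (this is a generic belief-update-plus-lookahead sketch, not the algorithm developed in the thesis; the array layouts T[s, a, s2], O[a, s2, o], R[s, a] and the search depth are assumptions):

import numpy as np

def q_value(b, a, depth, T, O, R, gamma):
    """Depth-limited expected value of taking action a from belief b."""
    value = float(b @ R[:, a])
    if depth > 0:
        predicted = b @ T[:, a, :]                  # state distribution after the action
        for o in range(O.shape[2]):
            p_o = float(predicted @ O[a, :, o])     # probability of observing o
            if p_o > 1e-9:
                b_next = (O[a, :, o] * predicted) / p_o   # Bayes update of the belief
                value += gamma * p_o * max(
                    q_value(b_next, a2, depth - 1, T, O, R, gamma)
                    for a2 in range(R.shape[1]))
    return value

def choose_action(b, T, O, R, gamma, depth=2):
    """Online choice: expand the belief tree a few steps and act greedily."""
    return int(np.argmax([q_value(b, a, depth, T, O, R, gamma)
                          for a in range(R.shape[1])]))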
|
6 |
Reinforcement Learning by Policy Search. Peshkin, Leonid, 14 February 2003.
One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy---a mapping of observations into actions---based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.
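The experience re-use mentioned above can be illustrated with a generic likelihood-ratio (importance sampling) evaluator; this is a textbook-style sketch under assumed interfaces (pi_new and pi_old returning action probabilities), not the dissertation's estimator:

import numpy as np

def reuse_policy_value(episodes, pi_new, pi_old):
    """Estimate the value of pi_new from episodes collected under pi_old.
    episodes: list of trajectories, each a list of (observation, action, reward)."""
    weighted_returns = []
    for trajectory in episodes:
        likelihood_ratio = 1.0
        total_reward = 0.0
        for obs, action, reward in trajectory:
            # reweight by how much more (or less) likely pi_new is to produce this action
            likelihood_ratio *= pi_new(obs, action) / pi_old(obs, action)
            total_reward += reward
        weighted_returns.append(likelihood_ratio * total_reward)
    return float(np.mean(weighted_returns))

The sample-complexity question in the abstract concerns how many such episodes are needed before estimates like this one are uniformly accurate over the whole policy class.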
|
7 |
Design of an Adaptive System for Upper-limb Stroke Rehabilitation. Kan, Patricia Wai Ling, 24 February 2009.
Stroke is the primary cause of adult disability. To support this large population in recovery, robotic technologies are being developed to assist in the delivery of rehabilitation. A partially observable Markov decision process (POMDP) system was designed for a rehabilitation robotic device that guides stroke patients through an upper-limb reaching task. The performance of the POMDP system was evaluated by comparing its decisions with those of a human therapist. Overall, the therapist agreed with the POMDP decisions approximately 65% of the time. The therapist found the POMDP decisions believable and could envision the system being used both in the clinic and at home, with the patient using it as the primary method of rehabilitation. Limitations of the current system have been identified that require improvement in future research. This research shows that POMDPs have promising potential to facilitate upper-extremity rehabilitation.
|
9 |
A Framework for Integrating Influence Diagrams and POMDPs. Shi, Jinchuan, 04 May 2018.
An influence diagram is a widely-used graphical model for representing and solving problems of sequential decision making under imperfect information. A closely-related model for the same class of problems is a partially observable Markov decision process (POMDP). This dissertation leverages the relationship between these two models to develop improved algorithms for solving influence diagrams. The primary contribution is to generalize two classic dynamic programming algorithms for solving influence diagrams, Arc Reversal and Variable Elimination, by integrating them with a dynamic programming technique originally developed for solving POMDPs. This generalization relaxes constraints on the ordering of the steps of these algorithms in a way that dramatically improves scalability, especially in solving complex, multi-stage decision problems. A secondary contribution is the adoption of a more compact and intuitive representation of the solution of an influence diagram, called a strategy. Instead of representing a strategy as a table or as a tree, a strategy is represented as an acyclic graph, which can be exponentially more compact, making the strategy easier to interpret and understand.
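A toy sketch of the strategy representation described above (the node structure and the example decisions are invented for illustration; the dissertation's actual encoding is not reproduced here), where the graph form lets two branches share a sub-strategy that a tree would have to duplicate:

from dataclasses import dataclass

@dataclass(frozen=True)
class StrategyNode:
    decision: str            # the prescribed decision at this point
    successors: tuple = ()   # (observation, StrategyNode) pairs

# Both branches below point at the same 'stop' node, so the shared
# sub-strategy is stored once instead of being copied per branch.
stop = StrategyNode("stop")
repair_then_stop = StrategyNode("repair", (("repaired", stop),))
root = StrategyNode("inspect", (("ok", stop), ("faulty", repair_then_stop)))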
|
10 |
Efficient Bayesian Nonparametric Methods for Model-Free Reinforcement Learning in Centralized and Decentralized Sequential Environments. Liu, Miao, January 2014.
As a growing number of agents are deployed in complex environments for scientific research and human well-being, there are increasing demands for designing efficient learning algorithms for these agents to improve their control policies. Such policies must account for uncertainties, including those caused by environmental stochasticity, sensor noise and communication restrictions. These challenges exist in missions such as planetary navigation, forest firefighting, and underwater exploration. Ideally, good control policies should allow the agents to deal with all the situations in an environment and enable them to accomplish their mission within the budgeted time and resources. However, a correct model of the environment is not typically available in advance, requiring the policy to be learned from data. Model-free reinforcement learning (RL) is a promising candidate for agents to learn control policies while engaged in complex tasks, because it allows the control policies to be learned directly from a subset of experiences and with time efficiency. Moreover, to ensure persistent performance improvement for RL, it is important that the control policies be concisely represented based on existing knowledge, and have the flexibility to accommodate new experience. Bayesian nonparametric methods (BNPMs) both allow the complexity of models to be adaptive to data, and provide a principled way for discovering and representing new knowledge.
In this thesis, we investigate approaches for RL in centralized and decentralized sequential decision-making problems using BNPMs. We show how the control policies can be learned efficiently under model-free RL schemes with BNPMs. Specifically, for centralized sequential decision-making, we study Q-learning with Gaussian processes to solve Markov decision processes, and we also employ hierarchical Dirichlet processes as the prior for the control policy parameters to solve partially observable Markov decision processes. For decentralized partially observable Markov decision processes, we use stick-breaking processes as the prior for the controller of each agent. We develop efficient inference algorithms for learning the corresponding control policies. We demonstrate that by combining model-free RL and BNPMs with efficient algorithm design, we are able to scale up RL methods for complex problems that cannot otherwise be solved due to the lack of model knowledge. We adaptively learn control policies with concise structure and high value from a relatively small amount of data.
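As a small illustration of the stick-breaking construction referred to above (the generic construction only, with made-up parameters; the thesis's inference algorithms for learning controllers are not shown):

import numpy as np

def stick_breaking_weights(alpha, n_atoms, seed=0):
    """Draw a truncated sample of mixture weights from a stick-breaking prior,
    e.g. as a prior over which controller node an agent moves to."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_atoms)        # fraction of the remaining stick broken off
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining                          # weights; their sum approaches 1 as n_atoms grows

weights = stick_breaking_weights(alpha=2.0, n_atoms=10)
print(weights.round(3), weights.sum().round(3))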
|