31. Texplore: temporal difference reinforcement learning for robots and time-constrained domains
Hester, Todd, 30 January 2013
Robots have the potential to solve many problems in society because of their ability to work in dangerous places doing necessary jobs that no one wants or is able to do. One barrier to their widespread deployment is that they are mainly limited to tasks where it is possible to hand-program behaviors for every situation that may be encountered. For robots to meet their potential, they need methods that enable them to learn and adapt to novel situations that they were not programmed for. Reinforcement learning (RL) is a paradigm for learning sequential decision-making processes that could solve the problems of learning and adaptation on robots. This dissertation identifies four key challenges that must be addressed for an RL algorithm to be practical for robotic control tasks. These RL for Robotics Challenges are: 1) it must learn in very few samples; 2) it must learn in domains with continuous state features; 3) it must handle sensor and/or actuator delays; and 4) it should continually select actions in real time. This dissertation addresses all four of these challenges, with particular focus on time-constrained domains where the first challenge is critically important. In these domains, the agent's lifetime is not long enough for it to explore the domain thoroughly, and it must learn in very few samples. Although existing RL algorithms successfully address one or more of the RL for Robotics Challenges, no prior algorithm addresses all four of them. To fill this gap, this dissertation introduces TEXPLORE, the first algorithm to address all four challenges. TEXPLORE is a model-based RL method that learns a random forest model of the domain which generalizes dynamics to unseen states. Each tree in the random forest model represents a hypothesis of the domain's true dynamics, and the agent uses these hypotheses to explore states that are promising for the final policy while ignoring states that do not appear promising. With sample-based planning and a novel parallel architecture, TEXPLORE can select actions continually in real time whenever necessary. We empirically evaluate each component of TEXPLORE in comparison with other state-of-the-art approaches. In addition, we present modifications of TEXPLORE's exploration mechanism for different types of domains. The key result of this dissertation is a demonstration of TEXPLORE learning to control the velocity of an autonomous vehicle on-line, in real time, while running on-board the robot. After controlling the vehicle for only two minutes, TEXPLORE is able to learn to move the pedals of the vehicle to drive at the desired velocities. The work presented in this dissertation represents an important step towards applying RL to robotics and enabling robots to perform more tasks in society. By enabling robots to learn in few actions while acting on-line in real time on robots with continuous state and actuator delays, TEXPLORE significantly broadens the applicability of RL to robots.
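To make the model-learning idea concrete, the following is a minimal sketch of a random-forest dynamics model in the spirit of TEXPLORE. It is not the dissertation's implementation; the class name, hyperparameters, and use of scikit-learn are assumptions made for illustration.

```python
# Illustrative sketch only: each tree in a small forest is trained on a
# bootstrap sample of observed transitions and acts as one hypothesis
# of the domain's true dynamics.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class ForestDynamicsModel:
    def __init__(self, n_trees=5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.trees = [DecisionTreeRegressor(random_state=i)
                      for i in range(n_trees)]

    def fit(self, states, actions, next_states):
        # Predict the state *change*, which generalizes to unseen
        # states better than predicting absolute next states.
        X = np.hstack([states, actions])
        y = next_states - states
        n = len(X)
        for tree in self.trees:
            idx = self.rng.integers(0, n, size=n)  # bootstrap sample
            tree.fit(X[idx], y[idx])

    def sample_next_state(self, state, action):
        # Sampling one tree per query treats each tree as a competing
        # hypothesis, which drives exploration toward states the
        # hypotheses disagree about.
        tree = self.trees[self.rng.integers(len(self.trees))]
        x = np.hstack([state, action]).reshape(1, -1)
        return state + tree.predict(x)[0]
```

Each call to sample_next_state commits to one tree's hypothesis, so a sample-based planner run over this model naturally balances the competing hypotheses when choosing actions.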
32. Multi-state Bayesian Process Control
Wang, Jue, 14 January 2014
Bayesian process control is a statistical process control (SPC) scheme that uses the posterior state probabilities as the control statistic. The key issue is deciding when to restore the process based on real-time observations. Such problems have been extensively studied in the framework of partially observable Markov decision processes (POMDPs), with particular emphasis on the structure of the optimal control policy.
Almost all existing structural results on optimal policies are limited to two-state processes, where the class of control-limit policies is optimal. However, the two-state model is a gross simplification, as real production processes almost always involve multiple states. For example, a machine in a production system often has multiple failure modes differing in their effects; a deterioration process can often be divided into multiple stages with different degradation levels; and the condition of a complex multi-unit system also requires a multi-state representation.
We investigate optimal control policies for multi-state processes under a fixed sampling scheme, in which information about the process is represented by a belief vector within a high-dimensional probability simplex. It is well known that obtaining structural results for such high-dimensional POMDPs is challenging. First, we prove that for an infinite-horizon process subject to multiple competing assignable causes, a so-called conditional control-limit policy is optimal. The optimal policy divides the belief space into two individually connected regions, which have analytical bounds. Next, we address a finite-horizon process with at least one absorbing state and show that a structured optimal policy can be established by transforming the belief space into a polar coordinate system, where a so-called polar control-limit policy is optimal. Our model is general enough to include many existing models in the literature as special cases. The structural results also lead to significantly more efficient algorithms for computing the optimal policies. In addition, we characterize the conditions under which some out-of-control state is more desirable than the in-control state. The existence of such counterintuitive situations indicates that multi-state process control is drastically different from the two-state case.
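To make the control statistic concrete, here is a minimal sketch of the Bayesian belief update; the three-state chain, normal observation model, and threshold rule are invented for illustration and are not the thesis's model.

```python
# Illustrative belief update for a multi-state process (invented numbers).
import numpy as np

# Three states: 0 = in control, 1 and 2 = out-of-control modes.
P = np.array([[0.90, 0.07, 0.03],
              [0.00, 0.85, 0.15],
              [0.00, 0.00, 1.00]])   # deterioration is one-directional

def likelihoods(y, means=(0.0, 1.0, 2.5), sd=1.0):
    # Observation density f(y | state): normal kernels with
    # state-dependent means (normalizing constants cancel below).
    return np.exp(-0.5 * ((y - np.array(means)) / sd) ** 2)

def update_belief(belief, y):
    prior = belief @ P                  # predict one step ahead
    posterior = prior * likelihoods(y)  # weight by the new observation
    return posterior / posterior.sum()

belief = np.array([1.0, 0.0, 0.0])      # start in control
for y in [0.2, 1.1, 2.3]:
    belief = update_belief(belief, y)

# A two-state-style control-limit rule would restore the process once the
# in-control probability falls below a threshold; the thesis shows the
# optimal multi-state regions are more intricate than this.
restore = belief[0] < 0.5
```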
33. A constrained MDP-based vertical handoff decision algorithm for wireless networks
Sun, Chi, 11 1900
Fourth-generation (4G) wireless communication systems aim to provide users with seamless roaming among heterogeneous wireless access networks. To achieve this goal, support for vertical handoff is important in mobility management. This thesis focuses on the vertical handoff decision algorithm, which determines the criteria under which a vertical handoff should be performed. The problem is formulated as a constrained Markov decision process. The objective is to maximize the expected total reward of a connection subject to a constraint on the expected total access cost. In our model, a benefit function is used to assess the quality of the connection, and a penalty function is used to model the signaling overhead incurred and call dropping. The user's velocity and location information are also considered when making handoff decisions. The policy iteration and Q-learning algorithms are employed to determine the optimal policy. Structural results on the optimal vertical handoff policy are derived using the concept of supermodularity. We show that the optimal policy is a threshold policy in bandwidth, delay, and velocity. Numerical results show that our proposed vertical handoff decision algorithm outperforms other decision schemes over a wide range of conditions, including variations in connection duration, user velocity, user budget, traffic type, signaling cost, and monetary access cost.
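A common way to handle such a cost constraint numerically is a Lagrangian relaxation that folds the access cost into the reward; the sketch below assumes that approach together with an invented environment interface, and is not the thesis's exact formulation.

```python
# Sketch: Q-learning on the Lagrangian reward r(s,a) - lam * c(s,a).
# The environment interface and all hyperparameters are assumptions.
import random

def q_learning_lagrangian(env, lam, episodes=1000, alpha=0.1,
                          gamma=0.95, eps=0.1):
    """Assumed interface: env.reset() -> state, env.actions(state) ->
    list of actions, env.step(action) -> (next_state, reward, cost, done)."""
    Q = {}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)   # e.g. stay, or handoff to network k
            if random.random() < eps:
                a = random.choice(acts)           # explore
            else:
                a = max(acts, key=lambda b: Q.get((s, b), 0.0))
            s2, reward, cost, done = env.step(a)
            best_next = max((Q.get((s2, b), 0.0)
                             for b in env.actions(s2)), default=0.0)
            target = (reward - lam * cost) + gamma * best_next
            Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
            s = s2
    return Q

# The multiplier lam would itself be tuned (e.g. by bisection) until the
# learned policy's expected total access cost meets the user's budget.
```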
35. Finite Memory Policies for Partially Observable Markov Decision Processes
Lusena, Christopher, 01 January 2001
This dissertation makes contributions to two areas of research on planning with POMDPs: complexity-theoretic results and heuristic techniques. Its most important contributions are the complexity of approximating the optimal history-dependent finite-horizon policy for a POMDP, and the idea of heuristic search over the space of FFTs.
36. Regret-based Reward Elicitation for Markov Decision Processes
Regan, Kevin, 22 August 2014
Markov decision processes (MDPs) have proven to be a useful model for sequential decision-theoretic reasoning under uncertainty, yet they demand the specification of a reward function that can require sophisticated human judgement to assess relevant tradeoffs. This dissertation casts the problem of specifying rewards as one of preference elicitation and aims to minimize the degree of precision with which a reward function must be specified while still allowing optimal or near-optimal policies to be produced. We demonstrate how robust policies can be computed for MDPs given only partial reward information using the minimax regret criterion.
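For reference, the minimax regret criterion over a feasible reward set R can be stated as follows; the notation is the standard formulation and is assumed here rather than quoted from the dissertation.

```latex
% Max regret of policy f over the feasible reward set R, and the
% minimax-regret-optimal policy:
\[
  \mathrm{MR}(f, \mathcal{R}) = \max_{r \in \mathcal{R}} \, \max_{g}
      \bigl( V^{g}(r) - V^{f}(r) \bigr),
  \qquad
  f^{*} \in \arg\min_{f} \mathrm{MR}(f, \mathcal{R}),
\]
% where V^f(r) denotes the expected value of policy f under reward r.
```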
Minimax regret offers an intuitive bound on loss; however, it is computationally intractable in general. This work develops techniques for exploiting MDP structure to allow offline precomputation that enables efficient online minimax regret computation. To complement this exact approach, we develop several general approximations that offer both upper and lower bounds on minimax regret. We further show how these approximations can be improved online during the elicitation procedure to balance accuracy and efficiency.
To effectively reduce regret, we investigate a spectrum of elicitation approaches that range from the computationally demanding optimal selection of complex queries about full MDP policies (which are informative but, we believe, cognitively difficult) to the heuristic selection of simple queries that focus on a small set of reward parameters. Results are demonstrated on MDPs drawn from the domains of assistive technology and autonomic computing.
Finally, we demonstrate our framework on a realistic website optimization domain, performing elicitation on websites with tens of thousands of webpages. We show that minimax regret can be efficiently computed, and we develop informative and cognitively reasonable queries that quickly lower minimax regret, producing policies that offer significant improvements in the design of the underlying websites.
38. Time-dependence in Markovian decision processes
McMahon, Jeremy James, January 2008
The main focus of this thesis is Markovian decision processes, with an emphasis on incorporating time-dependence into the system dynamics. When considering such decision processes, we provide value equations that apply to a large range of classes of Markovian decision processes, including Markov decision processes (MDPs) and semi-Markov decision processes (SMDPs), time-homogeneous or otherwise. We then formulate a simple decision process with exponential state transitions and solve this decision process using two separate techniques. The first technique solves the value equations directly, and the second utilizes an existing continuous-time MDP solution technique. To incorporate time-dependence into the transition dynamics of the process, we examine a particular decision process with state transitions determined by the Erlang distribution. Although this process is originally classed as a generalized semi-Markov decision process, we re-define it as a time-inhomogeneous SMDP. We show that even for a simply stated process with desirable state-space properties, the complexity of the value equations becomes so substantial that useful analytic expressions for the optimal solutions for all states of the process are unattainable. We develop a new technique, utilizing phase-type (PH) distributions, in an effort to address these complexity issues. By using PH representations, we construct a new state-space for the process, referred to as the phase-space, incorporating the phases of the state transition probability distributions. In performing this step, we effectively model the original process as a continuous-time MDP. The information available in this system is, however, richer than that of the original system. In the interest of maintaining the physical characteristics of the original system, we define a new valuation technique for the phase-space that shields some of this information from the decision maker. Using the process of phase-space construction and our valuation technique, we define an original system of value equations for this phase-space that are equivalent to those for the general Markovian decision processes mentioned earlier. An example of our phase-space technique is given for the aforementioned Erlang decision process, and we identify certain characteristics of the optimal solution such that, when applicable, the implementation of our phase-space technique is greatly simplified. These newly defined value equations for the phase-space are potentially as complex to solve as those defined for the original model. Restricting our focus to systems with acyclic state-spaces, though, we describe a top-down approach to solving the phase-space value equations for more general processes than those considered thus far. Again, we identify characteristics of the optimal solution to look for when implementing this technique and provide simplifications of the value equations where these characteristics are present. We note, however, that it is almost impossible to determine a priori the class of processes for which the simplifications outlined in our phase-space technique will be applicable. Nevertheless, we do no worse in terms of complexity by utilizing our phase-space technique, and we leave open the opportunity to simplify the solution process if an appropriate situation arises. The phase-space technique can handle time-dependence in the state transition probabilities, but is insufficient for any process with time-dependent reward structures or discounting.
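The phase-space construction rests on the fact that an Erlang(k, lambda) holding time is a chain of k exponential phases. In standard phase-type notation (assumed here, not quoted from the thesis), the representation is:

```latex
% Erlang(k, lambda) as a phase-type representation (alpha, T):
% start in phase 1 and traverse k exponential(lambda) phases in turn.
\[
  \alpha = (1, 0, \dots, 0), \qquad
  T = \begin{pmatrix}
        -\lambda & \lambda  &        &           \\
                 & -\lambda & \ddots &           \\
                 &          & \ddots & \lambda   \\
                 &          &        & -\lambda
      \end{pmatrix} \in \mathbb{R}^{k \times k},
\]
% so augmenting the state with the current phase yields an equivalent
% continuous-time Markov chain, which is what allows the process to be
% treated as a continuous-time MDP.
```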
To address such decision processes, we define an approximation technique for solving the class of infinite-horizon decision processes whose state transitions and reward structures are described with reference to a single global clock. This technique discretizes time into intervals of exponentially distributed length and incorporates this absolute time information into the state-space. For processes where the state transitions are not exponentially distributed, we use the hazard rates of the transition probability distributions, evaluated at the discrete time points, to model the transition dynamics of the system. We provide a suitable reward structure approximation using our discrete time points, along with guidelines for sensible truncation that use an MDP approximation to the tail behaviour of the original infinite-horizon process. The result is a finite-state, time-homogeneous MDP approximation to the original process, and this MDP may be solved using standard existing solution techniques. The approximate solution to the original process can then be inferred from the solution to our MDP approximation. / Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 2008
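The hazard-rate step can be sketched briefly; the Weibull transition time and the fixed-width clock grid below are illustrative assumptions (the thesis draws interval lengths from an exponential distribution).

```python
# Sketch of the hazard-rate discretization: evaluate the hazard of a
# non-exponential transition time at discrete global-clock points and
# use it as the per-interval transition probability. Illustrative only.
import math

def weibull_hazard(t, shape=2.0, scale=5.0):
    # Hazard rate h(t) = f(t) / S(t) of a Weibull distribution.
    return (shape / scale) * (t / scale) ** (shape - 1)

def interval_transition_probs(clock_points, dt):
    # P(transition in the next interval | no transition yet), taken as
    # 1 - exp(-h(t) * dt) at each discrete clock point t.
    return [1.0 - math.exp(-weibull_hazard(t) * dt) for t in clock_points]

dt = 0.5
clock = [i * dt for i in range(1, 21)]   # truncated global clock
probs = interval_transition_probs(clock, dt)
# Augmenting each state with the clock index yields a finite,
# time-homogeneous MDP that standard solvers can handle; tail behaviour
# beyond the truncation point is approximated separately.
```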
39. TaxiWorld: Developing and Evaluating Solution Methods for Multi-Agent Planning Domains
January 2011
TaxiWorld is a MATLAB simulation of a city in which a fleet of taxis operates, with the goal of transporting passengers to their destinations. The size of the city, the number of available taxis, and the frequency and general locations of fare appearances can all be set on a scenario-by-scenario basis. The taxis must attempt to service the fares as quickly as possible, picking each one up and carrying it to its drop-off location. The TaxiWorld scenario is formally modeled using both Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) and Multi-agent Markov Decision Processes (MMDPs). The purpose of developing formal models is to learn how to build and use formal Markov models that can be given to planners to solve for optimal policies in problem domains. However, finding optimal solutions for Dec-POMDPs is NEXP-complete, so an empirical algorithm was also developed as an improvement to the method already in use in the simulator, and the two methods were compared in identical scenarios to determine which is more effective. The empirical method is of course not optimal; rather, it attempts to account for some of the most important factors, achieving an acceptable level of effectiveness while retaining a reasonable level of computational complexity for online solving. / Dissertation/Thesis / M.S. Computer Science 2011
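As a rough illustration of the kind of empirical assignment heuristic described, here is a greedy nearest-taxi rule on a grid; the distance metric and data layout are invented, and the simulator's actual method weighs more factors than distance alone.

```python
# Illustrative greedy taxi-to-fare assignment (invented example data).
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def assign_fares(taxis, fares):
    """Greedily pair each fare with the nearest still-free taxi."""
    free = dict(taxis)          # taxi_id -> (x, y)
    assignment = {}
    for fare_id, pickup in sorted(fares.items()):
        if not free:
            break               # more fares than taxis
        best = min(free, key=lambda t: manhattan(free[t], pickup))
        assignment[best] = fare_id
        del free[best]
    return assignment

taxis = {0: (0, 0), 1: (5, 5)}
fares = {"f1": (1, 2), "f2": (6, 4)}
print(assign_fares(taxis, fares))   # {0: 'f1', 1: 'f2'}
```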
40. Representações compactas para processos de decisão de Markov e sua aplicação na administração de impressoras / Compact representations of Markov decision processes and their application to printer management
Torres, João Vitor, 02 June 2006
Markov Decision Processes (MDPs) are an important tool for planning and optimization in environments under uncertainty. The specification and computational representation of the probability distributions underlying MDPs are central difficulties for their application. This work proposes two strategies for representing these probabilities compactly and efficiently, using Bayesian networks and regularities among states and variables. The proposed strategies are particularly useful in systems whose variables have many categories and are strongly interrelated. The work has been applied to the management of clusters of printers, a real industrial problem that motivated it, allowing printers to act as a group rather than individually. Finally, the same application is analyzed comparatively using Fuzzy Logic.
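The flavor of the compact representation can be conveyed with a small factored-transition sketch; the two-variable printer model and all numbers below are invented for illustration and are not the dissertation's actual model.

```python
# Sketch of a factored (Bayesian-network style) transition model: the
# joint transition probability is a product of small per-variable
# conditional tables instead of one exponentially large joint table.
# P(toner' | toner, action) and P(queue' | queue, action), assumed
# conditionally independent given the current state and action.
P_toner = {("low",  "replace"): {"full": 0.95, "low": 0.05},
           ("low",  "print"):   {"low": 0.80, "empty": 0.20},
           ("full", "print"):   {"full": 0.90, "low": 0.10}}
P_queue = {("busy", "print"):   {"busy": 0.60, "idle": 0.40},
           ("busy", "replace"): {"busy": 1.00},
           ("idle", "print"):   {"idle": 0.70, "busy": 0.30}}

def transition_prob(state, action, next_state):
    toner, queue = state
    toner2, queue2 = next_state
    return (P_toner.get((toner, action), {}).get(toner2, 0.0)
            * P_queue.get((queue, action), {}).get(queue2, 0.0))

p = transition_prob(("low", "busy"), "print", ("low", "idle"))
# 0.80 * 0.40 = 0.32, computed from two small tables rather than one
# table over all pairs of joint states.
```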