1 
Acceleration of Iterative Methods for Markov Decision Processes. Shlakhter, Oleksandr, 21 April 2010
This research focuses on Markov Decision Processes (MDP). MDP is one of the most important and challenging areas of Operations Research. Every day people make many decisions: today's decisions impact
tomorrow's and tomorrow's will impact the ones made the day after. Problems in Engineering, Science, and Business often pose similar challenges: a large number of options and uncertainty about the
future. MDP is one of the most powerful tools for solving such problems.
There are several standard methods for finding optimal or approximately optimal policies for MDPs. Approaches widely employed
to solve MDP problems include value iteration and policy iteration. Although simple to implement, these approaches are nevertheless limited in the size of problems that can be solved, due to the excessive
computation required to find close-to-optimal solutions.
My thesis proposes new value iteration and modified policy iteration methods for classes of expected discounted MDPs and
average cost MDPs.
We establish a class of operators that can be integrated into value iteration and modified policy iteration algorithms for Markov Decision Processes, so as to speed up the convergence of the iterative search. Applying these operators requires little additional computation per iteration but reduces the number of iterations significantly. The development of the acceleration operators relies on two key properties of the Markov operator, namely
contraction mapping and monotonicity in a restricted region. Since the Markov operators of the classical value iteration and modified
policy iteration methods for average cost MDPs do not possess the contraction mapping property, for these models we restrict our study to average cost problems that can be formulated as stochastic shortest path problems.
The performance improvement is significant, while the implementation of the operator into the value iteration is trivial. Numerical studies show that the accelerated methods can be hundreds of times more efficient for solving MDP problems than other known approaches. The computational savings are especially significant
when the discount factor approaches 1 and the transition probability matrix becomes dense, in which case standard iterative algorithms suffer from slow convergence.
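For reference, the baseline that these operators accelerate can be sketched as plain value iteration; the sketch below is the textbook algorithm only (the thesis's acceleration operators are not reproduced), with array shapes chosen for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=100000):
    """Standard value iteration for a discounted MDP.

    P: transition tensor of shape (A, S, S); R: rewards of shape (A, S).
    Returns the value vector and a greedy policy.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Bellman backup: Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=0)
```

Each accelerated variant in the thesis inserts an extra operator between the Bellman backup and the convergence check; with a discount factor near 1, the loop above is exactly where the slow convergence described in the abstract appears.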

3 
A minimum cost and risk mitigation approach for blood collection. Zeng, Chenxi, 27 May 2016
Due to the limited supply and perishable nature of blood products, effective management of blood collection is critical for high quality healthcare delivery. Whole blood is typically collected over a 6 to 8 hour collection window from volunteer donors at sites, e.g., schools, universities, churches, companies, that are a significant distance
from the blood products processing facility and then transported from collection site to processing facility by a blood mobile. The length of time between collecting whole blood and processing it into cryoprecipitate ("cryo"), a critical blood product for controlling massive hemorrhaging, cannot take longer than 8 hours (the 8 hour collection to completion constraint), while the collection to completion constraint for other blood products is 24 hours. In order to meet the collection to completion constraint for cryo, it is often necessary
to have a "mid-drive collection"; i.e., for a vehicle other than the blood mobile to pick up and transport, at extra cost, whole blood units collected early in the collection window to the processing facility. In this dissertation, we develop analytical models to: (1)
analyze which collection sites should be designated as cryo collection sites to minimize total collection costs while satisfying the collection to completion constraint and meeting the weekly production target (the non-split case), (2) analyze the impact of changing the current process to allow collection windows to be split into two intervals and then determining which intervals should be designated as cryo collection intervals (the split case), and (3) ensure that the weekly production target is met with high probability. These problems lead to MDP models with large state and action spaces and constraints to guarantee that the weekly production target is met with high probability. These models are computationally intractable for problems having state and action spaces of realistic cardinality. We consider two approaches to guarantee that the weekly production target is met with high probability: (1) a penalty function approach and (2) a chance constraint approach. For the MDP with the penalty function approach, we first relax a constraint that significantly reduces the cardinality of the state space and provides a lower bound on the optimal expected weekly cost of collecting whole blood for cryo while satisfying the collection to completion constraint. We then present an action elimination procedure that, coupled with the constraint relaxation, leads to a computationally tractable lower bound. We then develop several heuristics that generate suboptimal policies and provide an analytical description of the difference between the upper and lower bounds in order to determine the quality of the heuristics. For the multiple decision epoch MDP model with the chance constraint approach, we first note by example that a straightforward application of dynamic programming can lead to a suboptimal policy. We then restrict the model to a single decision epoch and use a computationally tractable rolling horizon procedure for policy determination.
We also present a simple greedy heuristic (another rolling horizon decision-making procedure) based on ranking the collection intervals by mid-drive pickup cost per unit of expected cryo collected, which results in a competitive suboptimal solution and leads to the development of a practical decision support tool (DST). Using real data from the American Red Cross (ARC), we estimate that this DST reduces total cost by about 30% for the non-split case and 70% for the split case, compared to current practice. Initial implementation of the DST at the ARC Southern regional manufacturing and service center supports our estimates and indicates the potential for significant improvement in current practice.
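The ranking idea behind the greedy heuristic can be illustrated with a minimal sketch; the function name and data layout below are hypothetical stand-ins for illustration, not the ARC decision support tool itself:

```python
def greedy_cryo_selection(intervals, weekly_target):
    """Illustrative greedy heuristic: rank candidate collection intervals by
    mid-drive pickup cost per unit of expected cryo collected, then select
    the cheapest-per-unit intervals until the expected weekly production
    target is met.

    intervals: list of (name, pickup_cost, expected_cryo_units) tuples.
    Returns the chosen interval names and the expected units collected.
    """
    # Sort by cost per expected unit of cryo (ascending).
    ranked = sorted(intervals, key=lambda x: x[1] / x[2])
    chosen, collected = [], 0.0
    for name, cost, units in ranked:
        if collected >= weekly_target:
            break
        chosen.append(name)
        collected += units
    return chosen, collected
```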

4 
Reinforcement learning by incremental patching. Kim, Min Sub, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2007
This thesis investigates how an autonomous reinforcement learning agent can improve on an approximate solution by augmenting it with a small patch, which overrides the approximate solution at certain states of the problem. In reinforcement learning, many approximate solutions are smaller and easier to produce than 'flat' solutions that maintain distinct parameters for each fully enumerated state, but the best solution within the constraints of the approximation may fall well short of global optimality. This thesis proposes that the remaining gap to global optimality can be efficiently minimised by learning a small patch over the approximate solution. In order to improve the agent's behaviour, algorithms are presented for learning the overriding patch. The patch is grown around particular regions of the problem where the approximate solution is found to be deficient. Two heuristic strategies are proposed for concentrating resources on those areas where inaccuracies in the approximate solution are most costly, drawing a compromise between solution quality and storage requirements. Patching also handles problems with continuous state variables, by two alternative methods: Kuhn triangulation over a fixed discretisation and nearest neighbour interpolation with a variable discretisation. As well as improving the agent's behaviour, patching is also applied to the agent's model of the environment. Inaccuracies in the agent's model of the world are detected by statistical testing, using a selective sampling strategy to limit storage requirements for collecting data. The patching algorithms are demonstrated in several problem domains, illustrating the effectiveness of patching under a wide range of conditions. A scenario drawn from a real-time strategy game demonstrates the ability of patching to handle large complex tasks. These contributions combine to form a general framework for patching over approximate solutions in reinforcement learning.
Complex problems cannot be solved by brute force alone, and some form of approximation is necessary to handle large problems. However, this does not mean that the limitations of approximate solutions must be accepted without question. Patching demonstrates one way in which an agent can leverage approximation techniques without losing the ability to handle fine yet important details.
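The core override mechanism described above can be sketched minimally; the class and method names below are illustrative assumptions, and the thesis's patch-growth heuristics and continuous-state interpolation are not shown:

```python
class PatchedPolicy:
    """Illustrative sketch of the patching idea: a small explicit table of
    state -> action overrides layered over an approximate base policy.
    How the patch is grown (and kept small) follows the thesis, not this sketch.
    """

    def __init__(self, base_policy):
        self.base_policy = base_policy   # callable: state -> action
        self.patch = {}                  # explicit per-state overrides

    def add_override(self, state, action):
        # Grow the patch at a state where the approximation is deficient.
        self.patch[state] = action

    def act(self, state):
        # The patch overrides the approximate solution at the states it covers;
        # everywhere else the agent falls back to the compact base policy.
        if state in self.patch:
            return self.patch[state]
        return self.base_policy(state)
```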

5 
Systems Medicine: An Integrated Approach with Decision Making Perspective. Faryabi, Babak, 14 January 2010
Two models are proposed to describe interactions among genes, transcription
factors, and signaling cascades involved in regulating a cellular subsystem. These
models fall within the class of Markovian regulatory networks, and can accommodate
for different biological time scales. These regulatory networks are used to study
pathological cellular dynamics and discover treatments that beneficially alter those
dynamics. The salient translational goal is to design effective therapeutic actions that
desirably modify a pathological cellular behavior via external treatments that vary
the expressions of targeted genes. The objective of therapeutic actions is to reduce
the likelihood of the pathological phenotypes related to a disease. The task of finding
effective treatments is formulated as a sequential decision making process that discriminates between gene-expression profiles with high pathological competence and those with low pathological competence. Thereby, the proposed computational frameworks
provide tools that facilitate the discovery of effective drug targets and the design of
potent therapeutic actions on them. Each of the proposed system-based therapeutic
methods in this dissertation is motivated by practical and analytical considerations.
First, it is determined how asynchronous regulatory models can be used as a tool
to search for effective therapeutic interventions. Then, a constrained intervention method is introduced to incorporate the side-effects of treatments while searching for
a sequence of potent therapeutic actions. Lastly, to bypass the impediment of model
inference and to mitigate the numerical challenges of exhaustive search algorithms, a
heuristic method is proposed for designing system-based therapies. The presentation
of the key ideas in each method is facilitated with the help of several case studies.
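The idea of choosing external treatments that shift a Markovian regulatory network away from pathological states can be illustrated with a deliberately small sketch; the brute-force comparison below is an assumption for illustration, not the dissertation's constrained or heuristic intervention methods:

```python
import numpy as np

def stationary_distribution(P, iters=2000):
    """Power iteration for the stationary distribution of an ergodic chain."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = pi @ P
    return pi

def best_intervention(P_options, undesirable):
    """Illustrative only: among candidate control actions, each yielding a
    perturbed transition matrix over network states, pick the action that
    minimizes the long-run probability of the undesirable (pathological)
    states.

    P_options: dict mapping action name -> transition matrix.
    undesirable: list of state indices regarded as pathological.
    """
    scores = {name: stationary_distribution(P)[undesirable].sum()
              for name, P in P_options.items()}
    return min(scores, key=scores.get)
```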

7 
Sample Complexity of Incremental Policy Gradient Methods for Solving Multi-Task Reinforcement Learning. Bai, Yitao, 05 April 2024
We consider a multi-task learning problem, where an agent is presented with N reinforcement learning tasks. To solve this problem, we are interested in studying the gradient approach, which iteratively updates an estimate of the optimal policy using the gradients of the value functions. The classic policy gradient method, however, may be expensive to implement in the multi-task setting, as it requires access to the gradients of all the tasks at every iteration. To circumvent this issue, in this paper we propose to study an incremental policy gradient method, where the agent uses the gradient of only one task at each iteration. Our main contribution is to provide theoretical results characterizing the performance of the proposed method. In particular, we show that incremental policy gradient methods converge to the optimal value of the multi-task reinforcement learning objective at a sublinear rate O(1/√k), where k is the number of iterations. To illustrate its performance, we apply the proposed method to solve a simple multi-task variant of GridWorld problems, where an agent seeks to find a policy to navigate effectively in different environments. / Master of Science / First, we introduce a popular machine learning technique called Reinforcement Learning (RL), where an agent, such as a robot, uses a policy to choose an action, like moving forward, based on observations from sensors like cameras. The agent receives a reward that helps judge whether the policy is good or bad. The objective of the agent is to find a policy that maximizes the cumulative reward it receives by repeating the above process. RL has many applications, including Cruise autonomous cars, Google industry automation, training ChatGPT language models, and Walmart inventory management. However, RL suffers from task sensitivity and requires a lot of training data. For example, if the task changes slightly, the agent needs to train the policy from the beginning.
This motivates the technique called Multi-Task Reinforcement Learning (MTRL), where different tasks give different rewards and the agent maximizes the sum of the cumulative rewards of all the tasks. We focus on the incremental setting, where the agent can only access the tasks one by one, at random. In this case, we need only one agent, and it is not required to know which task it is performing. We show that the incremental policy gradient methods we propose converge to the optimal value of the MTRL objective at a sublinear rate O(1/√k), where k is the number of iterations. To illustrate its performance, we apply the proposed method to solve a simple multi-task variant of GridWorld problems, where an agent seeks to find a policy to navigate effectively in different environments.
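The incremental update scheme can be illustrated with a toy sketch; scalar quadratic surrogate objectives and a cyclic task order stand in for the true value gradients and random task access of the paper (both are assumptions, made so the sketch is self-contained and checkable):

```python
def incremental_policy_gradient(grads, theta0, iters, step0=0.5):
    """Illustrative sketch of the incremental scheme: at iteration k the agent
    ascends the gradient of only one task, with a diminishing step size
    proportional to 1/sqrt(k). `grads` is a list of per-task gradient
    functions of a scalar parameter theta; tasks are visited cyclically here,
    whereas the paper samples them randomly.
    """
    theta = theta0
    n = len(grads)
    for k in range(1, iters + 1):
        g = grads[(k - 1) % n]              # one task's gradient per iteration
        theta += (step0 / k ** 0.5) * g(theta)
    return theta
```

With surrogate task objectives -(theta - c_i)^2, the sum of rewards is maximized at the mean of the c_i, so the iterates should settle near that point, mirroring convergence to the optimum of the summed MTRL objective.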

8 
Optimizing Trading Decisions for Hydro Storage Systems using Approximate Dual Dynamic Programming. Löhndorf, Nils, Wozabal, David, Minner, Stefan, 22 August 2013
We propose a new approach to optimize operations of hydro storage systems with multiple connected reservoirs whose operators participate in wholesale electricity markets. Our formulation integrates short-term intraday with long-term interday decisions. The intraday problem considers bidding decisions as well as storage operation during the day and is formulated as a stochastic program. The interday problem is modeled as a Markov decision process of managing storage operation over time, for which we propose integrating stochastic dual dynamic programming with approximate dynamic programming. We show that the approximate solution converges towards an upper bound of the optimal solution. To demonstrate the efficiency of the solution approach, we fit an econometric model to actual price and inflow data and apply the approach to a case study of an existing hydro storage system. Our results indicate that the approach is tractable for a real-world application and that the gap between the theoretical upper and a simulated lower bound decreases sufficiently fast. (authors' abstract)
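The interplay between an immediate market decision and an approximate future value of stored water, which is at the heart of dual dynamic programming methods, can be illustrated on a toy single-reservoir problem. The piecewise-linear cut representation is standard in this literature, but the function names and grid search below are assumptions for illustration, not the paper's multi-reservoir model:

```python
def cut_value(cuts, s):
    """Approximate future value of storage level s as the max over linear
    cuts (a, b), i.e. max_i a_i + b_i * s, as in cut-based dynamic programming."""
    return max(a + b * s for a, b in cuts)

def release_decision(s, price, cuts, step=0.1):
    """Toy one-step decision (illustrative only): choose a release r in [0, s]
    maximizing immediate revenue price * r plus the cut-approximated value of
    the water left in storage, by simple grid search over r.
    """
    best_r, best_v = 0.0, float("-inf")
    r = 0.0
    while r <= s + 1e-9:
        v = price * r + cut_value(cuts, s - r)
        if v > best_v:
            best_v, best_r = v, r
        r += step
    return best_r
```

The decision rule this produces is the intuitive one: release when today's price exceeds the marginal value of stored water encoded in the cuts, and hold otherwise.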

9 
The Essential Dynamics Algorithm: Essential Results. Martin, Martin C., 01 May 2003
This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously does policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
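The simplest version of such a transform is the certainty-equivalent one, replacing each stochastic transition by its expected next state; the paper's transform is more refined than this, so the sketch below is an illustrative assumption only:

```python
import numpy as np

def deterministic_transform(P, states):
    """Certainty-equivalent sketch (an assumption; not the paper's exact
    construction): replace each stochastic transition by the expected next
    state value, yielding deterministic dynamics over a continuous state.

    P: (A, S, S) transition probabilities; states: (S,) numeric state values.
    Returns f(a, s_index) = E[value of next state | action a, state s_index].
    """
    expected = P @ states                  # shape (A, S): expected next value
    def f(a, s):
        return expected[a, s]
    return f
```

In the transformed model, value calculation no longer requires an expectation over successor states, which is the simplification the abstract points to.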

10 
A Stochastic Vendor Managed Inventory Problem and Its Variations. Balun, Pairote, 14 May 2004
We analyze the problem of distributing units of a product, by a capacitated vehicle, from one storage location (depot) to multiple retailers. The demand processes at the retailers are stochastic and time-dependent. Based on current inventory information, the decision maker decides how many units of the product to deposit at the current retailer, or pick up at the depot, and which location to visit next. We refer to this problem as the stochastic vendor managed inventory (SVMI) problem. In the Markov decision process model of the SVMI problem, we show how a retailer continues to be the vehicle's optimal destination as inventory levels of the retailers vary. Furthermore, an optimal inventory action is shown to have monotone relations with the inventory levels. The multi-period SVMI problem and the infinite horizon (periodic) SVMI problem are analyzed. Additionally, we develop three suboptimal solution procedures, complete a numerical study, and present a case study, which involves a distribution problem at Coca-Cola Enterprises, Inc. We consider four variations of the SVMI problem, which differ in the available state information and/or the vehicle routing procedure. Analytically, we compare the optimal expected total rewards for the SVMI problem and its variations. Our computational experience suggests a complementary relationship between the quality of state information and the size of the set of retailers that the vehicle can visit.
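The monotone structure mentioned above can be checked on a toy single-retailer inventory DP; this sketch is a drastic simplification (no vehicle, no routing, hypothetical cost parameters), intended only to make the monotone-action property concrete:

```python
import numpy as np

def deposit_policy(capacity, demands, probs, horizon, h=1.0, p=4.0):
    """Toy finite-horizon inventory DP (illustrative, far simpler than the
    SVMI model): choose how many units q to deposit at a retailer with
    current inventory x, minimizing expected holding (h) plus shortage (p)
    cost over the horizon. Returns the first-period policy, indexed by x,
    so the monotone property (q nonincreasing in x) can be inspected.
    """
    V = np.zeros(capacity + 1)
    policy = np.zeros(capacity + 1, dtype=int)
    for _ in range(horizon):
        V_new = np.zeros(capacity + 1)
        for x in range(capacity + 1):
            best, best_q = float("inf"), 0
            for q in range(capacity - x + 1):        # feasible deposits
                cost = 0.0
                for d, pr in zip(demands, probs):    # expected stage cost
                    left = x + q - d
                    stage = h * max(left, 0) + p * max(-left, 0)
                    cost += pr * (stage + V[max(left, 0)])
                if cost < best:
                    best, best_q = cost, q
            V_new[x], policy[x] = best, best_q
        V = V_new
    return policy
```

For convex holding and shortage costs this yields an order-up-to rule, so the optimal deposit shrinks as on-hand inventory grows, one concrete instance of the monotone relations the thesis establishes.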
