1. Modeling exotic options with maturity extensions by stochastic dynamic programming. Tapeinos, Socratis, January 2009.
The exotic options examined in this thesis have a combination of non-standard characteristics which can be found in shout, multi-callable, path-dependent and Bermudan options. These options are called reset options. A reset option is an option which allows the holder to reset, one or more times, certain terms of the contract based on pre-specified rules during the life of the option. Overall, this thesis attempts to tackle the modeling challenges that arise from the exotic properties of the reset option embedded in segregated funds. Initially, the relevant literature was reviewed and the lack of published work advanced enough to deal with the complexities of the reset option was identified. Hence, there appears to be a clear and urgent need for more sophisticated approaches to model the reset option.

The reset option on the maturity guarantee of segregated funds is formulated as a non-stationary finite horizon Markov Decision Process. The returns from the underlying asset are modeled using a discrete time approximation of the lognormal model. An Optimal Exercise Boundary of the reset option is derived, depicting a threshold value such that if the value of the underlying asset exceeds it then it is optimal for the policyholder to reset his maturity guarantee; otherwise, it is optimal to roll over the maturity guarantee. Notably, the model is able to depict the Optimal Exercise Boundary of not just the first but of all the segregated fund contracts which can be issued throughout the planning horizon of the policyholder. The main finding of the model is that as the segregated fund contract approaches its maturity, the threshold value in the Optimal Exercise Boundary increases. However, in the last period before the maturity of the segregated fund, the threshold value decreases, because a reset option that is not exercised then expires worthless.

The model is then extended to reflect the characteristics of the range of products traded in the market. Firstly, the issuer of the segregated fund contract is allowed to charge a management fee to the policyholder. The effect of incorporating this fee is that the policyholder requires a higher return in order to optimally reset his maturity guarantee, while the total value of the segregated fund is diminished. Secondly, the maturity guarantee becomes a function of the number of times that the reset option has been exercised. The effect is that the policyholder requires a higher return in order to choose to reset his maturity guarantee, while the total value of the segregated fund is diminished. Thirdly, the policyholder is allowed to reset the maturity guarantee at any point in time within each year of the planning horizon, but only once per year. The effect is that the total value of the segregated fund is increased, since the policyholder may lock in higher market gains as he has more reset decision points.

In response to the well documented deficiencies of the lognormal model in capturing the jumps experienced by stock markets, extensions were built which incorporate such jumps into the original model. The effect of incorporating such jumps is that the policyholder requires a higher return in order to choose to reset his maturity guarantee, while the total value of the segregated fund is diminished due to the adverse effect of the negative jumps on the value of the underlying asset.
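To make the backward-induction formulation concrete, here is a minimal sketch, assuming a binomial lattice in place of the thesis's discrete-time lognormal approximation, a guarantee that locks in the asset price at the last reset, no management fee, and no maturity extension on reset; all parameter values are illustrative:

```python
from functools import lru_cache

# Toy finite-horizon MDP for the reset decision. Assumptions (not the
# thesis model): binomial price lattice, guarantee = asset price at the
# last reset, terminal payoff max(guarantee, asset value), no fees.
T = 10                       # decision epochs until maturity
u, d, p = 1.08, 0.93, 0.55   # up/down factors and up-move probability
S0 = 100.0                   # initial asset price (and initial guarantee)

def price(t, i):
    """Asset price at epoch t after i up-moves."""
    return S0 * u**i * d**(t - i)

@lru_cache(maxsize=None)
def value(t, i, g):
    """Value at epoch t, lattice node i, with current guarantee g."""
    s = price(t, i)
    if t == T:
        return max(g, s)                     # maturity guarantee pays off
    # Roll over: keep guarantee g.  Reset: lock the guarantee in at s.
    roll = p * value(t + 1, i + 1, g) + (1 - p) * value(t + 1, i, g)
    reset = p * value(t + 1, i + 1, s) + (1 - p) * value(t + 1, i, s)
    return max(roll, reset)

# Exercise boundary at each epoch, assuming the guarantee still sits at S0:
for t in range(1, T):
    better = [price(t, i) for i in range(t + 1)
              if p * value(t + 1, i + 1, price(t, i))
                 + (1 - p) * value(t + 1, i, price(t, i))
                 > p * value(t + 1, i + 1, S0) + (1 - p) * value(t + 1, i, S0)]
    print(t, min(better) if better else None)
```

In such a toy version the boundary is read off directly by comparing the reset and rollover branches at each lattice node, a discrete analogue of the Optimal Exercise Boundary described above.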
2. Optimal threshold policy for opportunistic network coding under phase type arrivals. Gunasekara, Charith, 01 September 2016.
Network coding allows each node in a network to perform coding operations on data packets, improving the overall throughput of communication. However, network coding cannot be performed unless there are enough packets to be coded, so at times it may be advantageous to wait for packets to arrive.

We consider a scenario in which two wireless nodes, each with its own buffer, communicate via a single access point using network coding. The access point first pairs each data packet being sent from each node and then performs the network coding operation. Packets arriving at the access point that cannot be paired are instead loaded into one of the two buffers at the access point. In the case where one buffer is empty and the other is not, network coding is not possible. When this happens the access point must either wait for a network coding opportunity or transmit the unpaired packet without coding. Delaying packet transmission incurs an increased waiting cost but also allows an increase in the overall efficiency of wireless spectrum usage, and thus a decrease in packet transmission cost. Conversely, sending packets un-coded decreases the waiting cost but also decreases the overall efficiency of wireless spectrum usage. Hence, there is a trade-off between decreasing packet delay time and increasing the efficiency of wireless spectrum usage.
We show that the optimal waiting policy for this system with respect to total cost, under phase-type packet arrivals, is to have a separate buffer-size threshold that depends on the current phase of each arrival. We then show that the solution to this optimization problem can be obtained by treating it as a double-ended push-out queueing problem. We develop a new technique to keep track of the packet waiting time and the number of packets waiting in the double-ended push-out queue. We use the resulting queueing model to resolve the optimal threshold policy and then analyze the performance of the system using a numerical approach.
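As a rough illustration of this threshold structure, the following toy value iteration computes a wait-versus-send-uncoded policy over (buffer level, arrival phase) states. The two-phase modulated arrivals, costs, and dynamics below are illustrative assumptions, not the thesis's phase-type queueing model:

```python
import numpy as np

# Toy discounted MDP for the wait-vs-send-uncoded trade-off under
# Markov-modulated arrivals (a simple stand-in for phase-type arrivals).
Q_MAX, PHASES = 12, 2
LAM = 0.5                         # own-packet arrival probability
OPP = [0.7, 0.1]                  # opposing-arrival probability per phase
P_PH = np.array([[0.9, 0.1],      # phase transition matrix (assumed)
                 [0.2, 0.8]])
H, C_UN, C_CD = 0.05, 1.0, 1.1    # holding cost, uncoded cost, coded-pair cost
GAMMA = 0.98

V = np.zeros((Q_MAX + 1, PHASES))

def act_value(q, ph, send):
    """Immediate cost plus discounted expected next value."""
    cost = H * q
    if send and q > 0:
        cost += C_UN
        q -= 1
    exp_next = 0.0
    for ph2 in range(PHASES):
        for a_opp in (0, 1):                 # opposing arrival this slot?
            pr = P_PH[ph, ph2] * (OPP[ph] if a_opp else 1 - OPP[ph])
            q2, c2 = q, 0.0
            if a_opp and q2 > 0:             # coding opportunity: send pair
                q2, c2 = q2 - 1, C_CD
            elif a_opp:                      # nothing to pair with
                c2 = C_UN
            for a_own in (0, 1):             # own arrival this slot?
                pr2 = pr * (LAM if a_own else 1 - LAM)
                q3 = min(q2 + a_own, Q_MAX)
                exp_next += pr2 * (c2 + GAMMA * V[q3, ph2])
    return cost + exp_next

for _ in range(500):                          # value iteration
    V = np.array([[min(act_value(q, ph, True), act_value(q, ph, False))
                   for ph in range(PHASES)] for q in range(Q_MAX + 1)])

policy = np.array([[int(act_value(q, ph, True) < act_value(q, ph, False))
                    for ph in range(PHASES)] for q in range(Q_MAX + 1)])
print(policy)     # 1 = send uncoded; the switch point can differ per phase
```

In this toy setting the printed policy typically switches from wait to send at a buffer level that differs between the two phases, mirroring the phase-dependent thresholds established in the thesis.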
3. Qualitative analysis of synchronizing probabilistic systems. Shirmohammadi, Mahsa, 10 December 2014.
Markov decision processes (MDPs) are finite-state probabilistic systems with both strategic and random choices, and are hence well established as models of the interactions between a controller and its randomly responding environment. An MDP can be mathematically viewed as a one-and-a-half-player stochastic game played in rounds, where the controller chooses an action and the environment chooses a successor according to a fixed probability distribution.

There are two incomparable views on the behavior of an MDP once the strategic choices are fixed. In the traditional view, an MDP is a generator of sequences of states, called the state-outcome; the winning condition of the player is thus expressed as a set of desired sequences of states that are visited during the game, e.g. a Borel condition such as reachability. The computational complexity of the related decision problems and the memory requirements of winning strategies for state-outcome conditions are well studied.

Recently, MDPs have also been viewed as generators of sequences of probability distributions over states, called the distribution-outcome. We introduce synchronizing conditions defined on distribution-outcomes, which intuitively require that the probability mass accumulate in some (group of) state(s), possibly in the limit. A probability distribution is p-synchronizing if the probability mass is at least p in some state, and a sequence of probability distributions is always, eventually, weakly, or strongly p-synchronizing if, respectively, all, some, infinitely many, or all but finitely many distributions in the sequence are p-synchronizing. For each synchronizing mode, an MDP can be (i) sure winning if there is a strategy that produces a 1-synchronizing sequence; (ii) almost-sure winning if there is a strategy that produces a sequence that is, for all epsilon > 0, (1-epsilon)-synchronizing; (iii) limit-sure winning if for all epsilon > 0, there is a strategy that produces a (1-epsilon)-synchronizing sequence. We consider the problem of deciding whether an MDP is winning, for each synchronizing and winning mode: we establish matching upper and lower complexity bounds for the problems, as well as the memory requirements for optimal winning strategies.

As a further contribution, we study synchronization in probabilistic automata (PAs), which are a kind of MDP where controllers are restricted to word-strategies, i.e. they cannot observe the history of the system execution, only the number of choices made so far. The synchronizing languages of a PA are then the sets of synchronizing word-strategies: we establish the computational complexity of the emptiness and universality problems for all synchronizing languages in all winning modes.

We carry over the results for synchronizing problems from MDPs and PAs to two-player turn-based games and non-deterministic finite automata. Along with the main results, we establish new complexity results for alternating finite automata over a one-letter alphabet. In addition, we study different variants of synchronization for timed and weighted automata, as two instances of infinite-state systems.
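The distribution-outcome view is easy to state concretely. A minimal sketch, assuming a fixed memoryless strategy so that the MDP collapses to a Markov chain with an illustrative transition matrix:

```python
import numpy as np

# Under a fixed memoryless strategy an MDP induces a Markov chain, and the
# distribution-outcome is x_{t+1} = x_t P.  A distribution is
# p-synchronizing if some single state carries mass at least p.
P = np.array([[0.5, 0.5, 0.0],   # illustrative induced transition matrix
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
x = np.array([1.0, 0.0, 0.0])    # initial distribution

for t in range(10):
    print(t, x.round(3), "max mass = %.3f" % x.max())
    x = x @ P                    # propagate one round
```

Here the chain funnels probability mass into an absorbing state, so the sequence is (1-epsilon)-synchronizing from some round onward for every epsilon > 0; this is the flavour of behaviour behind the almost-sure and limit-sure winning modes.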
4. Proactive Planning through Active Policy Inference in Stochastic Environments. Poulin, Nolan, 01 May 2018.
In multi-agent Markov Decision Processes, a controllable agent must perform optimal planning in a dynamic and uncertain environment that includes another unknown and uncontrollable agent. Given a task specification for the controllable agent, its ability to complete the task can be impeded by an inaccurate model of the intent and behaviors of other agents. In this work, we introduce an active policy inference algorithm that allows a controllable agent to infer a policy of the environmental agent through interaction. Active policy inference is data-efficient and is particularly useful when data are time-consuming or costly to obtain. The controllable agent synthesizes an exploration-exploitation policy that incorporates the knowledge learned about the environment's behavior. Whenever possible, the agent also tries to elicit behavior from the other agent to improve the accuracy of the environmental model. This is done by mapping the uncertainty in the environmental model to a bonus reward, which helps elicit the most informative exploration, and allows the controllable agent to return to its main task as fast as possible. Experiments demonstrate the improved sample efficiency of active learning and the convergence of the policy for the controllable agents.
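A minimal sketch of the uncertainty-to-bonus mapping, assuming a Dirichlet count model of the other agent's behavior; the bonus form and all names here are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np
from collections import defaultdict

N_ENV_ACTIONS = 4      # assumed size of the environmental agent's action set

# Dirichlet(1,...,1) prior over the other agent's action choice per state.
counts = defaultdict(lambda: np.ones(N_ENV_ACTIONS))

def env_policy_estimate(s):
    c = counts[s]
    return c / c.sum()            # posterior mean of the other agent's policy

def bonus(s, beta=1.0):
    # Larger when the state has been observed rarely: a proxy for model
    # uncertainty that shrinks as evidence accumulates.
    return beta / np.sqrt(counts[s].sum())

def shaped_reward(s, r_task):
    return r_task + bonus(s)      # plan against task reward + info bonus

def observe(s, a_env):
    counts[s][a_env] += 1         # update after seeing the other agent act
```

Planning against `shaped_reward` steers the controllable agent toward states where the environmental model is still uncertain, and the vanishing bonus lets it return to the main task once the model is learned.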
5. Dynamic generalisation of continuous action spaces in reinforcement learning: a neurally inspired approach. Smith, Andrew James, January 2002.
This thesis is about the dynamic generalisation of continuous action spaces in reinforcement learning problems. The standard Reinforcement Learning (RL) account provides a principled and comprehensive means of optimising a scalar reward signal in a Markov Decision Process. However, the theory itself does not directly address the imperative issue of generalisation, which naturally arises as a consequence of large or continuous state and action spaces. A current thrust of research is aimed at fusing the generalisation capabilities of supervised (and unsupervised) learning techniques with RL theory; an example par excellence is Tesauro's TD-Gammon. Although much effort has gone into researching ways to represent and generalise over the input space, much less attention has been paid to the action space.

This thesis first considers the motivation for learning real-valued actions, and then proposes a set of key properties desirable in any candidate algorithm addressing generalisation of both input and action spaces: provision of adaptive and online generalisation, adherence to the standard theory with a central focus on estimating expected reward, provision for real-valued states and actions, and full support for a real-valued discounted reward signal. Of particular interest are issues pertaining to robustness in non-stationary environments, scalability, and efficiency for real-time learning in applications such as robotics. Since exploring the action space turns out to be a potentially costly process, the system should also be flexible enough to enable maximum reuse of learned actions.

A new approach is proposed which succeeds for the first time in addressing all of the key issues identified. The algorithm, which is based on the ubiquitous self-organising map, is analysed and compared with other techniques, including those based on the backpropagation algorithm. The investigation uncovers some important implications of the differences between these two approaches with respect to RL. In particular, the distributed representation of the multi-layer perceptron is judged to be something of a double-edged sword: it offers more sophisticated and more scalable generalising power, but potentially causes problems in dynamic or non-equiprobable environments and in tasks involving a highly varying input-output mapping.

The thesis concludes that the self-organising map can be used in conjunction with current RL theory to provide real-time dynamic representation and generalisation of continuous action spaces. The proposed model is shown to be reliable in non-stationary, unpredictable and noisy environments, and is judged to be unique in addressing and satisfying a number of desirable properties identified as important to a large class of RL problems.
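As a rough sketch of the general idea (not the thesis's exact algorithm), a one-dimensional self-organising map can hold action prototypes together with expected-reward estimates, with the neighbourhood update pulling nearby prototypes toward rewarded actions:

```python
import numpy as np

# SOM nodes store action prototypes and per-node expected-reward estimates.
# All parameters and the toy task below are illustrative assumptions.
rng = np.random.default_rng(0)
N = 20
actions = rng.uniform(-1, 1, N)       # action prototypes on a 1-D map
Q = np.zeros(N)                       # expected-reward estimate per node
ALPHA, ETA, SIGMA = 0.1, 0.2, 2.0     # Q step, SOM step, neighbourhood width

def select(eps=0.1):
    return int(rng.integers(N)) if rng.random() < eps else int(Q.argmax())

def update(winner, a_taken, r):
    Q[winner] += ALPHA * (r - Q[winner])            # learn expected reward
    dist = np.arange(N) - winner                    # map distance to winner
    nb = np.exp(-dist**2 / (2 * SIGMA**2))          # neighbourhood function
    actions[:] += ETA * nb * (a_taken - actions)    # pull prototypes over

# Toy continuous-action task: reward peaks at action 0.3.
for _ in range(2000):
    i = select()
    a = actions[i] + rng.normal(0, 0.05)            # local exploration
    update(i, a, -abs(a - 0.3))
print(actions[int(Q.argmax())])                     # should settle near 0.3
```

The map gives a localised representation: each node specialises on a region of the action space, which is one way to sidestep the interference problems the thesis attributes to distributed, backpropagation-based representations in non-stationary tasks.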
6. A Simulation Based Approximate Dynamic Programming Approach to Multi-class, Multi-resource Surgical Scheduling. Astaraky, Davood, 09 January 2013.
The thesis focuses on a model that addresses the patient scheduling step of the surgical scheduling process, determining the number of surgeries to perform in a given day. Specifically, given a master schedule that cyclically allocates total OR availability into daily blocks for each surgical specialty, we seek a scheduling policy for all surgeries that minimizes a combination of the lead time between patient request and surgery date, overtime in the ORs, and congestion in the wards. We cast the problem of generating optimal control strategies into the framework of a Markov Decision Process (MDP). An Approximate Dynamic Programming (ADP) approach is employed to solve the model, which would otherwise be intractable due to the size of the state space. We assess the performance and quality of the resulting policy through simulation, and we provide policy insights and conclusions.
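A compact sketch of the simulation-based ADP ingredient, assuming a linear value approximation over hand-picked features (queue length, expected overtime, ward census); the feature choice and update rule are illustrative, not the thesis's design:

```python
import numpy as np

# Linear value-function approximation updated along simulated trajectories,
# a standard ADP device for state spaces too large to enumerate.
w = np.zeros(3)                            # weights of the approximation

def features(state):
    queue, overtime, census = state        # illustrative state summary
    return np.array([queue, overtime, census], dtype=float)

def approx_value(state):
    return w @ features(state)

def td_update(state, cost, next_state, alpha=0.01, gamma=0.99):
    """One temporal-difference step along a simulated transition."""
    global w
    target = cost + gamma * approx_value(next_state)
    w += alpha * (target - approx_value(state)) * features(state)

# One illustrative update along a fabricated transition:
td_update((12, 3.0, 20), cost=5.0, next_state=(10, 2.0, 22))
print(w)
```

Scheduling decisions are then taken greedily against `approx_value` rather than against the exact, intractable value function.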
7. An average cost Markov decision process model to decide when to challenge a call in a tennis match. Nadimpalli, Vamsi Krishna, 16 February 2011.
In a standard tennis match each player has an unlimited opportunity to challenge an umpire's call, but if three incorrect challenges are made in a set he is not allowed to challenge any more in that set. If the set goes into a tie break, the limit on incorrect challenges increases by one. These limited incorrect challenges are not carried over from one set to another. This is thus a limited resource available to the player, and if he knows how to use it in the best possible way, there is scope for increasing his overall chances of winning a match. With the motive of gaining insight on when to challenge a call, we have modeled a single game in a tennis match as a Markov decision process. We have also studied the impact of variables such as the player's probability of winning a point, the player's perception of the challengeability of a call, and the proportion of challengeable calls on the decision-making process.
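A back-of-envelope sketch of the challenge decision, much simpler than the thesis's average-cost model of a full game; the call probability, point value, and belief grid are assumed numbers:

```python
import numpy as np
from functools import lru_cache

Q_CALL = 0.2                            # chance a point yields a challengeable call
G = 1.0                                 # value of overturning a call (one point)
BELIEFS = np.linspace(0.05, 0.95, 10)   # grid of perceived P(call is wrong)

@lru_cache(maxsize=None)
def V(n, c):
    """Expected future gain from challenges: n points and c strikes left."""
    if n == 0 or c == 0:
        return 0.0
    cont = V(n - 1, c)
    ev = (1 - Q_CALL) * cont
    for b in BELIEFS:                   # average over perceived beliefs
        # A successful challenge keeps the strike; a failed one uses it up.
        challenge = b * (G + V(n - 1, c)) + (1 - b) * V(n - 1, c - 1)
        ev += (Q_CALL / len(BELIEFS)) * max(cont, challenge)
    return ev

def should_challenge(n, c, b):
    challenge = b * (G + V(n - 1, c)) + (1 - b) * V(n - 1, c - 1)
    return challenge > V(n - 1, c)

# Compare the same marginal call with a full budget vs the last strike:
print(should_challenge(30, 3, 0.4), should_challenge(30, 1, 0.4))
```

In this model the belief needed to justify a challenge rises as strikes run out, so the same marginal call can be worth challenging with a full budget but not with the last remaining strike.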
8. Cross-layer adaptive transmission scheduling in wireless networks. Ngo, Minh Hanh, 05 1900.
A promising new approach to wireless network optimization is the cross-layer perspective. This thesis focuses on exploiting channel state information (CSI) from the physical layer for optimal transmission scheduling at the medium access control (MAC) layer. The first part of the thesis considers exploiting CSI via a distributed channel-aware MAC protocol. The MAC protocol is analysed using a centralized design approach and a non-cooperative game-theoretic approach. Structural results are obtained, and provably convergent stochastic approximation algorithms that can estimate the optimal transmission policies are proposed. In particular, in the game-theoretic MAC formulation, it is proved that the best-response transmission policies are threshold in the channel state and that there exists a Nash equilibrium at which every user deploys a threshold transmission policy. This threshold result leads to a particularly efficient stochastic-approximation-based adaptive learning algorithm and a simple distributed implementation of the MAC protocol. Simulations show that the channel-aware MAC protocols result in system throughputs that increase with the number of users.

The thesis also considers opportunistic transmission scheduling from the perspective of a single user, using Markov Decision Process (MDP) approaches. Both channel state information and channel memory are exploited for opportunistic transmission. First, a finite horizon MDP transmission scheduling problem is considered; the finite horizon formulation is suitable for short-term delay constraints. It is proved that the optimal transmission policy for this problem is threshold in the buffer occupancy state and the transmission time. This two-dimensional threshold structure substantially reduces the computational complexity required to compute and implement the optimal policy. Second, the opportunistic transmission scheduling problem is formulated as an infinite horizon average cost MDP with a constraint on the average waiting cost. An advantage of the infinite horizon formulation is that the optimal policy is stationary. Using Lagrangian dynamic programming and a supermodularity argument, it is proved that the stationary optimal transmission scheduling policy is a randomized mixture of two policies that are threshold in the buffer occupancy state. A stochastic approximation algorithm and a Q-learning based algorithm that can adaptively estimate the optimal transmission scheduling policies are then proposed.
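A small Q-learning sketch in the spirit of the adaptive algorithms above, over (buffer occupancy, channel state) with a two-state Gilbert-Elliott channel; the dynamics and costs are illustrative assumptions, not the thesis model:

```python
import numpy as np

# State = (buffer occupancy, channel quality); transmitting in a good
# channel is cheap, waiting incurs a holding cost per buffered packet.
rng = np.random.default_rng(2)
B_MAX, CHANNELS = 10, 2          # buffer sizes 0..B_MAX, channel in {bad, good}
TX_COST = [1.5, 0.5]             # per-packet transmit cost by channel state
HOLD, LAM = 0.1, 0.4             # holding cost per packet, arrival probability
P_CH = np.array([[0.8, 0.2],     # Gilbert-Elliott channel chain (assumed)
                 [0.3, 0.7]])
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

Q = np.zeros((B_MAX + 1, CHANNELS, 2))   # actions: 0 = wait, 1 = transmit

b, ch = 0, 0
for _ in range(200_000):
    a = int(rng.integers(2)) if rng.random() < EPS else int(Q[b, ch].argmin())
    cost = HOLD * b + (TX_COST[ch] if (a == 1 and b > 0) else 0.0)
    b2 = max(b - 1, 0) if a == 1 else b
    b2 = min(b2 + (rng.random() < LAM), B_MAX)      # random packet arrival
    ch2 = int(rng.random() < P_CH[ch, 1])           # channel transition
    Q[b, ch, a] += ALPHA * (cost + GAMMA * Q[b2, ch2].min() - Q[b, ch, a])
    b, ch = b2, ch2

print((Q[:, :, 1] < Q[:, :, 0]).astype(int))   # 1 where transmitting is preferred
```

With costs of this shape the learned greedy policy usually comes out threshold in the buffer state for each channel state, echoing the structural results proved in the thesis.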
9. Evolutionarily Stable Learning and Foraging Strategies. Cownden, Daniel, 01 February 2012.
This thesis examines a series of problems with the goal of better understanding the fundamental dilemma of whether to invest effort in obtaining information that may lead to better opportunities in the future, versus exploiting immediately available opportunities. In particular, this work investigates how this dilemma is affected by competition in an evolutionary setting. Achieving this requires both evolutionary game theory and Markov decision processes, i.e. stochastic dynamic programming. This thesis grows directly out of earlier work on the Social Learning Strategies Tournament. Although I cast the problem in the biological setting of optimal foraging theory, where it fills an obvious gap, this fundamental dilemma should also be of interest to economists and operations researchers, as well as those working in ecology, evolution and behaviour. Thesis (Ph.D., Mathematics & Statistics), Queen's University, 2012.
10. Refuse or reuse: managing the quality of returns in product recovery systems. Marshall, Sarah Elizabeth, January 2012.
Increasing legislative and societal pressures are forcing manufacturers to become environmentally conscious and take responsibility for the fate of their goods after they have been used by consumers. As a result, some manufacturers operate hybrid systems which produce new goods and recover used goods. Product recovery describes the process by which used products are returned to their manufacturers or sent to a specialised facility for recovery, before being sold on the original or a secondary market. The quality of the returned goods is a significant issue in product recovery systems, as it can affect both the type of recovery and the costs associated with it. Quality in product recovery systems has not been adequately studied, with many authors either ignoring the possibility of receiving lower quality returns or assuming they are disposed of rather than recovered. Such assumptions, however, ignore the possibility that the firm might be able to salvage value from lower quality returns by using them for parts or materials.

This thesis presents four models that investigate the importance of considering the quality of returns in the management of inventory in a product recovery system, by examining the cost-effectiveness of recovering both high quality and low quality returns.

The first model is a deterministic lot-sizing model of a product recovery system. It was found that performing both high and low quality recovery reduced the sensitivity of the optimal cost to operational restrictions on the choice of decision variables.

The second model is a discrete-time, periodic-review model formulated as a Markov decision process (MDP), and introduces uncertainty in demand, returns, and the quality of the returns. It was found that performing both types of recovery can lead to cost savings and better customer service for firms through an increased fill rate.

The third model addresses those industries where produced and recovered goods cannot be sold on the same market due to customers' perceptions and environmental legislation. Using an MDP formulation, the model examines a product recovery system in which produced and recovered goods are sold on separate markets, and investigates the profitability of offering two-way substitution between these markets. It was found that offering substitution can allow firms to increase both their profits and fill rates.

The fourth model examines the issue of separate markets and substitution in continuous time using a semi-Markov decision process. The continuous nature of the model allows a more detailed examination of the substitution decision. It was found that offering substitution can allow firms to increase their profit and, in some cases, also increase their fill rate. In some cases, production is performed less frequently when downward substitution can be offered, and recovery is performed less often when upward substitution can be offered.

The findings of this thesis could be used to help a firm that is currently recovering high quality returns assess the cost-effectiveness of also recovering lower quality returns. Recovering low-quality items, rather than disposing of them, may allow a firm to increase the amount it recycles. The findings highlight the importance of considering the quality of returns when managing a product recovery system, as they show that economic gains can be achieved by reusing rather than refusing low quality returns.
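As a concrete toy version of the refuse-or-reuse question, the following value-iteration sketch optimises, per period, whether to recover or dispose of a low-quality return; all states, costs, and dynamics are illustrative assumptions rather than any of the four thesis models:

```python
import numpy as np
from itertools import product

S_MAX = R_MAX = 5
P_DEM, P_RET = 0.6, 0.5                  # P(one demand), P(one low-quality return)
C_REC, C_DISP, C_LOST, C_HOLD = 0.8, 0.1, 2.0, 0.05
GAMMA = 0.95

V = np.zeros((S_MAX + 1, R_MAX + 1))     # value over (serviceable, returns)

def q_value(s, r, recover):
    """Cost-to-go of recovering vs disposing of one low-quality return."""
    cost = C_HOLD * (s + r)
    s2, r2 = s, r
    if r2 > 0:
        r2 -= 1
        if recover:                       # reuse: convert to a serviceable unit
            s2 = min(s2 + 1, S_MAX)
            cost += C_REC
        else:                             # refuse: dispose of the unit
            cost += C_DISP
    exp_next = 0.0
    for dem, ret in product((0, 1), (0, 1)):
        p = (P_DEM if dem else 1 - P_DEM) * (P_RET if ret else 1 - P_RET)
        s3, extra = s2, 0.0
        if dem:
            if s3 > 0:
                s3 -= 1                   # serve demand from stock
            else:
                extra = C_LOST            # lost sale
        r3 = min(r2 + ret, R_MAX)
        exp_next += p * (extra + GAMMA * V[s3, r3])
    return cost + exp_next

for _ in range(400):                      # value iteration
    V = np.array([[min(q_value(s, r, True), q_value(s, r, False))
                   for r in range(R_MAX + 1)] for s in range(S_MAX + 1)])

policy = np.array([[int(q_value(s, r, True) < q_value(s, r, False))
                    for r in range(R_MAX + 1)] for s in range(S_MAX + 1)])
print(policy)          # 1 where recovering the low-quality return is optimal
```

Whether the 1-entries dominate depends on how the recovery cost compares with the cost of lost sales, which is exactly the refuse-or-reuse comparison the thesis formalises with its MDP and semi-MDP models.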