91 |
Age of Information: Fundamentals, Distributions, and Applications. Abd-Elmagid, Mohamed Abd-Elaziz. 11 July 2023 (has links)
A typical model for real-time status update systems consists of a transmitter node that generates real-time status updates about some physical process(es) of interest and sends them through a communication network to a destination node. Such a model can be used to analyze the performance of a plethora of emerging Internet of Things (IoT)-enabled real-time applications including healthcare, factory automation, autonomous vehicles, and smart homes, to name a few. The performance of these applications highly depends upon the freshness of the information status at the destination node about its monitored physical process(es). Because of that, the main design objective of such real-time status update systems is to ensure timely delivery of status updates from the transmitter node to the destination node. To measure the freshness of information at the destination node, the Age of Information (AoI) has been introduced as a performance metric that accounts for the generation time of each status update (which was ignored by conventional performance metrics, specifically throughput and delay). Since then, there have been two main research directions in the AoI research area. The first direction aimed to analyze/characterize AoI in different queueing-theoretic models/disciplines, and the second direction was focused on the optimization of AoI in different communication systems that deal with time-sensitive information. However, prior queueing-theoretic analyses of AoI have mostly been limited to characterizing the average AoI, and prior studies developing AoI/age-aware scheduling/transmission policies have mostly ignored the energy constraints at the transmitter node(s).
Motivated by these limitations, this dissertation develops new queueing-theoretic methods that allow the characterization of the distribution of AoI in several classes of status updating systems, as well as novel AoI-aware scheduling policies, developed using tools from optimization theory and reinforcement learning, that account for the energy constraints at the transmitter nodes in the process of decision-making across several settings of communication networks.
The first part of this dissertation develops a stochastic hybrid system (SHS)-based general framework to facilitate the characterization of the distribution of AoI in several classes of real-time status updating systems. First, we study a general setting of status updating systems, where a set of source nodes provide status updates about some physical process(es) to a set of monitors. For this setting, the continuous state of the system is formed by the AoI/age processes at different monitors, the discrete state of the system is modeled using a finite-state continuous-time Markov chain, and the coupled evolution of the continuous and discrete states of the system is described by a piecewise linear SHS with linear reset maps. Using the notion of tensors, we derive a system of linear equations for the characterization of the joint moment generating function (MGF) of an arbitrary set of age processes in the network. Afterwards, we study a general setting of gossip networks in which a source node forwards its measurements (in the form of status updates) about some observed physical process to a set of monitoring nodes according to independent Poisson processes. Furthermore, each monitoring node sends status updates about its information status (about the process observed by the source) to the other monitoring nodes according to independent Poisson processes. For this setup, we develop SHS-based methods that allow the characterization of higher-order marginal/joint moments of the age processes in the network. Finally, our SHS-based framework is applied to derive the stationary marginal and joint MGFs for several queueing disciplines and gossip network topologies, using which we derive closed-form expressions for marginal/joint high-order statistics of age processes, such as the variance of each age process and the correlation coefficients between all possible pairwise combinations of age processes.
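The closed-form age statistics above come from the SHS machinery; as an independent sanity check, the average AoI of the classical FCFS M/M/1 update queue can be reproduced by direct simulation. The sketch below uses illustrative rates (arrival rate λ = 1, service rate μ = 2, for which the well-known FCFS result gives an average AoI of 1.75) and integrates the piecewise-linear age sawtooth exactly:

```python
import random

def simulate_aoi(lam=1.0, mu=2.0, n_updates=200_000, seed=7):
    """Simulate the AoI of an FCFS M/M/1 status-updating queue.

    Returns the time-averaged age and its variance, computed by
    exactly integrating the piecewise-linear age sawtooth."""
    rng = random.Random(seed)
    # Poisson arrivals and FCFS exponential service
    t_arr, t = [], 0.0
    for _ in range(n_updates):
        t += rng.expovariate(lam)
        t_arr.append(t)
    t_dep, free = [], 0.0
    for a in t_arr:
        free = max(free, a) + rng.expovariate(mu)
        t_dep.append(free)
    # Between consecutive departures the age grows linearly; at a
    # departure it drops to (departure time - generation time).
    area = area2 = horizon = 0.0
    for i in range(1, n_updates):
        a0 = t_dep[i - 1] - t_arr[i - 1]     # age just after departure i-1
        dt = t_dep[i] - t_dep[i - 1]
        a1 = a0 + dt                          # age just before departure i
        area += 0.5 * (a1 ** 2 - a0 ** 2)     # integral of age
        area2 += (a1 ** 3 - a0 ** 3) / 3.0    # integral of age squared
        horizon += dt
    mean = area / horizon
    return mean, area2 / horizon - mean ** 2
```

The same sample-path integration extends to higher moments, which is exactly the distributional information the MGF framework captures in closed form.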
In the second part of this dissertation, our analysis is focused on understanding the distributional properties of AoI in status updating systems powered by energy harvesting (EH). In particular, we consider a multi-source status updating system in which an EH-powered transmitter node has multiple sources generating status updates about several physical processes. The status updates are then sent to a destination node where the freshness of each status update is measured in terms of AoI. The status updates of each source and harvested energy packets are assumed to arrive at the transmitter according to independent Poisson processes, and the service time of each status update is assumed to be exponentially distributed. For this setup, we derive closed-form expressions for the MGF of AoI under several queueing disciplines at the transmitter, including non-preemptive and source-agnostic/source-aware preemptive-in-service strategies. The generality of our analysis is demonstrated by recovering several existing results as special cases. A key insight from our characterization of the distributional properties of AoI is that it is crucial to incorporate the higher moments of AoI in the implementation/optimization of status updating systems rather than just relying on its average (as has been mostly done in the existing literature on AoI).
In the third and final part of this dissertation, we employ AoI as a performance metric for several settings of communication networks, and develop novel AoI-aware scheduling policies using tools from optimization theory and reinforcement learning. First, we investigate the role of an unmanned aerial vehicle (UAV) as a mobile relay to minimize the average peak AoI for a source-destination pair. For this setup, we formulate an optimization problem to jointly optimize the UAV's flight trajectory as well as energy and service time allocations for packet transmissions. This optimization problem is subject to the UAV's mobility constraints and the total available energy constraints at the source node and UAV. In order to solve this non-convex problem, we propose an efficient iterative algorithm and establish its convergence analytically. A key insight obtained from our results is that the optimal design of the UAV's flight trajectory achieves significant performance gains especially when the available energy at the source node and UAV is limited and/or when the size of the update packet is large. Afterwards, we study a generic system setup for an IoT network in which radio frequency (RF)-powered IoT devices are sensing different physical processes and need to transmit their sensed data to a destination node. For this generic system setup, we develop a novel reinforcement learning-based framework that characterizes the optimal sampling policy for IoT devices with the objective of minimizing the long-term weighted sum of average AoI values in the network. Our analytical results characterize the structural properties of the age-optimal policy, and demonstrate that it has a threshold-based structure with respect to the AoI values for different processes. They further demonstrate that the structures of the age-optimal and throughput-optimal policies are different. 
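The threshold structure just described can be reproduced on a toy model. The sketch below (all parameters are illustrative, not from the dissertation) runs value iteration on a discretized AoI-minimization MDP in which each transmission attempt costs energy c and succeeds with probability p; the resulting optimal policy transmits exactly when the age exceeds a threshold:

```python
def age_optimal_policy(a_max=30, c=3.0, p=0.8, gamma=0.99, iters=2000):
    """Value iteration for a toy AoI-minimization MDP.

    State: current AoI a in {1..a_max} (capped). Actions:
      0 = stay idle  -> age grows by one; per-step age cost a
      1 = transmit   -> extra energy cost c; succeeds w.p. p (age resets to 1)
    Returns the optimal action for each age value."""
    V = [0.0] * (a_max + 1)
    for _ in range(iters):
        nV = V[:]
        for a in range(1, a_max + 1):
            up = min(a + 1, a_max)
            q_idle = a + gamma * V[up]
            q_tx = a + c + gamma * (p * V[1] + (1 - p) * V[up])
            nV[a] = min(q_idle, q_tx)
        V = nV
    policy = [0] * (a_max + 1)
    for a in range(1, a_max + 1):
        up = min(a + 1, a_max)
        q_idle = a + gamma * V[up]
        q_tx = a + c + gamma * (p * V[1] + (1 - p) * V[up])
        policy[a] = 1 if q_tx < q_idle else 0
    return policy
```

Because the value function is nondecreasing in the age, the advantage of transmitting grows with a, so the computed policy is monotone: idle below some threshold age, transmit above it.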
Finally, we analytically characterize the structural properties of the AoI-optimal joint sampling and updating policy for wireless powered communication networks while accounting for the costs of generating status updates in the process of decision-making. Our results demonstrate that the AoI-optimal joint sampling and updating policy has a threshold-based structure with respect to different system state variables. / Doctor of Philosophy / A typical model for real-time status update systems consists of a transmitter node that generates real-time status updates about some physical process(es) of interest and sends them through a communication network to a destination node. Such a model can be used to analyze the performance of a plethora of emerging Internet of Things (IoT)-enabled real-time applications including healthcare, factory automation, autonomous vehicles, and smart homes, to name a few. The performance of these applications highly depends upon the freshness of the information status at the destination node about its monitored physical process(es). Because of that, the main design objective of such real-time status update systems is to ensure timely delivery of status updates from the transmitter node to the destination node. To measure the freshness of information at the destination node, the Age of Information (AoI) has been introduced as a performance metric that accounts for the generation time of each status update (which was ignored by conventional performance metrics, specifically throughput and delay). Since then, there have been two main research directions in the AoI research area. The first direction aimed to analyze/characterize AoI in different queueing-theoretic models/disciplines, and the second direction was focused on the optimization of AoI in different communication systems that deal with time-sensitive information. 
However, prior queueing-theoretic analyses of AoI have mostly been limited to characterizing the average AoI, and prior studies developing AoI/age-aware scheduling/transmission policies have mostly ignored the energy constraints at the transmitter node(s). Motivated by these limitations, this dissertation first develops new queueing-theoretic methods that allow the characterization of the distribution of AoI in several classes of status updating systems. Afterwards, using tools from optimization theory and reinforcement learning, novel AoI-aware scheduling policies are developed for several settings of communication networks, including unmanned aerial vehicle (UAV)-assisted and radio frequency (RF)-powered communication networks, while accounting for the energy constraints at the transmitter nodes in the process of decision-making.
In the first part of this dissertation, a stochastic hybrid system (SHS)-based general framework is first developed to facilitate the characterization of the distribution of AoI in several classes of real-time status updating systems. Afterwards, this framework is applied to derive the stationary marginal and joint moment generating functions (MGFs) for several queueing disciplines and gossip network topologies, using which we derive closed-form expressions for marginal/joint high-order statistics of age processes, such as the variance of each age process and the correlation coefficients between all possible pairwise combinations of age processes.
In the second part of this dissertation, our analysis is focused on understanding the distributional properties of AoI in status updating systems powered by energy harvesting (EH). In particular, we consider a multi-source status updating system in which an EH-powered transmitter node has multiple sources generating status updates about several physical processes. The status updates are then sent to a destination node where the freshness of each status update is measured in terms of AoI. For this setup, we derive closed-form expressions for the MGF of AoI under several queueing disciplines at the transmitter. The generality of our analysis is demonstrated by recovering several existing results as special cases. A key insight from our characterization of the distributional properties of AoI is that it is crucial to incorporate the higher moments of AoI in the implementation/optimization of status updating systems rather than just relying on its average (as has been mostly done in the existing literature on AoI).
In the third and final part of this dissertation, we employ AoI as a performance metric for several settings of communication networks, and develop novel AoI-aware scheduling policies using tools from optimization theory and reinforcement learning. First, we investigate the role of a UAV as a mobile relay to minimize the average peak AoI for a source-destination pair. For this setup, we formulate an optimization problem to jointly optimize the UAV's flight trajectory as well as energy and service time allocations for packet transmissions. This optimization problem is subject to the UAV's mobility constraints and the total available energy constraints at the source node and UAV. A key insight obtained from our results is that the optimal design of the UAV's flight trajectory achieves significant performance gains especially when the available energy at the source node and UAV is limited and/or when the size of the update packet is large. Afterwards, we study a generic system setup for an IoT network in which RF-powered IoT devices are sensing different physical processes and need to transmit their sensed data to a destination node. For this generic system setup, we develop a novel reinforcement learning-based framework that characterizes the optimal sampling policy for IoT devices with the objective of minimizing the long-term weighted sum of average AoI values in the network. Our analytical results characterize the structural properties of the age-optimal policy, and demonstrate that it has a threshold-based structure with respect to the AoI values for different processes. They further demonstrate that the structures of the age-optimal and throughput-optimal policies are different. Finally, we analytically characterize the structural properties of the AoI-optimal joint sampling and updating policy for wireless powered communication networks while accounting for the costs of generating status updates in the process of decision-making. 
Our results demonstrate that the AoI-optimal joint sampling and updating policy has a threshold-based structure with respect to different system state variables.
|
92 |
Single and Multi-player Stochastic Dynamic Optimization. Saha, Subhamay. January 2013 (has links) (PDF)
In this thesis we investigate single and multi-player stochastic dynamic optimization problems. We consider both discrete and continuous time processes. In the multi-player setup we investigate zero-sum games with both complete and partial information. We study partially observable stochastic games with average cost criterion and the state process being a discrete time controlled Markov chain. The idea involved in studying this problem is to replace the original unobservable state variable with a suitable completely observable state variable. We establish the existence of the value of the game and also obtain optimal strategies for both players. We also study a continuous time zero-sum stochastic game with complete observation. In this case the state is a pure jump Markov process. We investigate the finite horizon total cost criterion. We characterise the value function via appropriate Isaacs equations. This also yields optimal Markov strategies for both players.
In the single player setup we investigate risk-sensitive control of continuous time Markov chains. We consider both finite and infinite horizon problems. For the finite horizon total cost problem and the infinite horizon discounted cost problem we characterise the value function as the unique solution of appropriate Hamilton-Jacobi-Bellman equations. We also derive optimal Markov controls in both cases. For the infinite horizon average cost case we show the existence of an optimal stationary control. We also give a value iteration scheme for computing the optimal control in the case of finite state and action spaces.
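For the finite state and action case mentioned above, a standard route to such a value iteration scheme is uniformization, which converts the controlled continuous-time chain into an equivalent discrete-time contraction. The sketch below implements the risk-neutral discounted-cost version (the risk-sensitive criterion of the thesis requires a multiplicative variant) and is illustrated on an assumed two-state repair model:

```python
def ctmc_discounted_vi(rates, cost_rate, alpha=0.5, tol=1e-9):
    """Discounted-cost control of a finite CTMC via uniformization.

    rates[s][a]     : dict {next_state: transition rate} under action a
    cost_rate[s][a] : running cost per unit time
    Uniformizing at rate Lam >= all total rates yields an equivalent
    discrete-time DP with contraction modulus Lam/(alpha+Lam) < 1."""
    S = len(rates)
    Lam = max(sum(tr.values()) for acts in rates for tr in acts.values()) + 1.0
    beta = Lam / (alpha + Lam)
    V = [0.0] * S
    while True:
        nV = [0.0] * S
        for s in range(S):
            best = float("inf")
            for a, tr in rates[s].items():
                stay = 1.0 - sum(tr.values()) / Lam   # self-loop probability
                ev = stay * V[s] + sum(q / Lam * V[sp] for sp, q in tr.items())
                best = min(best, cost_rate[s][a] / (alpha + Lam) + beta * ev)
            nV[s] = best
        if max(abs(nV[s] - V[s]) for s in range(S)) < tol:
            return nV
        V = nV
```

As a usage sketch: state 0 = machine working (fails at rate 2, no cost), state 1 = broken, with a slow repair (rate 1, cost rate 1) or fast repair (rate 4, cost rate 3); value iteration then returns the discounted value of each state under the optimal repair choice.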
Further we introduce a new class of stochastic processes which we call stochastic processes with "age-dependent transition rates". We give a rigorous construction of the process. We prove that under certain assumptions the process is Feller. We also compute the limiting probabilities for our process. We then study the controlled version of the above process. In this case we take the risk-neutral cost criterion. We solve the infinite horizon discounted cost problem and the average cost problem for this process. The crucial step in analysing these problems is to prove that the original control problem is equivalent to an appropriate semi-Markov decision problem. Then the value functions and optimal controls are characterised using this equivalence and the theory of semi-Markov decision processes (SMDP). The analysis of finite horizon problems differs from that of infinite horizon problems because in this case the idea of converting into an equivalent SMDP does not seem to work. So we deal with the finite horizon total cost problem by showing that our problem is equivalent to another appropriately defined discrete time Markov decision problem. This allows us to characterise the value function and to find an optimal Markov control.
|
93 |
Vers le vol à voile longue distance pour drones autonomes / Towards Vision-Based Autonomous Cross-Country Soaring for UAVs. Stolle, Martin Tobias. 03 April 2017 (has links)
Les petits drones à voilure fixe rendent services aux secteurs de la recherche, de l'armée et de l'industrie, mais souffrent toujours de portée et de charge utile limitées. Le vol thermique permet de réduire la consommation d'énergie. Cependant, sans télédétection d'ascendances, un drone ne peut bénéficier d'une ascendance qu'en la rencontrant par hasard. Dans cette thèse, un nouveau cadre pour le vol à voile longue distance autonome est élaboré, permettant à un drone planeur de localiser visuellement des ascendances sous-cumulus et d'en récolter l'énergie de manière efficace. S'appuyant sur le filtre de Kalman non parfumé, une méthode de vision monoculaire est établie pour l'estimation des paramètres d'ascendances. Sa capacité de fournir des estimations convergentes et cohérentes est évaluée par des simulations Monte Carlo. Les incertitudes de modèle, le bruit de traitement de l'image et les trajectoires de l'observateur peuvent dégrader ces estimés. Par conséquent, un deuxième axe de cette thèse est la conception d'un planificateur de trajectoire robuste basé sur des cartes d'ascendances. Le planificateur fait le compromis entre le temps de vol et le risque d'un atterrissage forcé dans les champs tout en tenant compte des incertitudes d'estimation dans le processus de prise de décision. Il est illustré que la charge de calcul du planificateur de trajectoire proposé est réalisable sur une plate-forme informatique peu coûteuse. Les algorithmes proposés d'estimation ainsi que de planification sont évalués conjointement dans un simulateur de vol à 6 axes, mettant en évidence des améliorations significatives par rapport aux vols à voile longue distance autonomes actuels. / Small fixed-wing Unmanned Aerial Vehicles (UAVs) provide utility to research, military, and industrial sectors at comparably reasonable cost, but still suffer from both limited operational ranges and payload capacities.
Thermal soaring flight for UAVs offers a significant potential to reduce the energy consumption. However, without remote sensing of updrafts, a glider UAV can only benefit from an updraft when encountering it by chance. In this thesis, a new framework for autonomous cross-country soaring is elaborated, enabling a glider UAV to visually localize sub-cumulus thermal updrafts and to efficiently gain energy from them. Relying on the Unscented Kalman Filter, a monocular vision-based method is established for remotely estimating sub-cumulus updraft parameters. Its capability of providing convergent and consistent state estimates is assessed relying on Monte Carlo simulations. Model uncertainties, image processing noise, and poor observer trajectories can degrade the estimated updraft parameters. Therefore, a second focus of this thesis is the design of a robust probabilistic path planner for map-based autonomous cross-country soaring. The proposed path planner balances between the flight time and the outlanding risk by taking into account the estimation uncertainties in the decision making process. The suggested updraft estimation and path planning algorithms are jointly assessed in a 6 Degrees Of Freedom simulator, highlighting significant performance improvements with respect to state of the art approaches in autonomous cross-country soaring, while it is also shown that the path planner is implementable on a low-cost computer platform.
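At the core of the Unscented Kalman Filter mentioned above is the unscented transform, which propagates a Gaussian belief through a nonlinearity via deterministic sigma points rather than linearization. A minimal scalar-output sketch (diagonal covariance for brevity; a full UKF also carries cross-covariances and measurement noise):

```python
import math

def unscented_transform(mean, var, f, kappa=1.0):
    """Propagate a Gaussian belief (diagonal covariance `var`) through a
    nonlinearity f using the symmetric sigma-point set of the UKF."""
    n = len(mean)
    w0 = kappa / (n + kappa)          # weight of the central point
    wi = 0.5 / (n + kappa)            # weight of each symmetric point
    sigmas = [list(mean)]
    for i in range(n):
        step = math.sqrt((n + kappa) * var[i])
        for sgn in (1.0, -1.0):
            p = list(mean)
            p[i] += sgn * step
            sigmas.append(p)
    ys = [f(p) for p in sigmas]
    weights = [w0] + [wi] * (2 * n)
    my = sum(w * y for w, y in zip(weights, ys))
    vy = sum(w * (y - my) ** 2 for w, y in zip(weights, ys))
    return my, vy                     # transformed mean and variance
```

For updraft estimation, f would be a measurement model, e.g. a hypothetical Gaussian updraft profile w(d) = W·exp(-(d/R)²) evaluated along the observer's trajectory (W, R hypothetical strength and radius parameters). The transform is exact for linear functions, which makes a convenient correctness check.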
|
94 |
Apprentissage Intelligent des Robots Mobiles dans la Navigation Autonome / Intelligent Mobile Robot Learning in Autonomous Navigation. Xia, Chen. 24 November 2015 (has links)
Les robots modernes sont appelés à effectuer des opérations ou tâches complexes et la capacité de navigation autonome dans un environnement dynamique est un besoin essentiel pour les robots mobiles. Dans l’objectif de soulager de la fastidieuse tâche de préprogrammer un robot manuellement, cette thèse contribue à la conception de commande intelligente afin de réaliser l’apprentissage des robots mobiles durant la navigation autonome. D’abord, nous considérons l’apprentissage des robots via des démonstrations d’experts. Nous proposons d’utiliser un réseau de neurones pour apprendre hors-ligne une politique de commande à partir de données utiles extraites d’expertises. Ensuite, nous nous intéressons à l’apprentissage sans démonstrations d’experts. Nous utilisons l’apprentissage par renforcement afin que le robot puisse optimiser une stratégie de commande pendant le processus d’interaction avec l’environnement inconnu. Un réseau de neurones est également incorporé et une généralisation rapide permet à l’apprentissage de converger en un certain nombre d’épisodes inférieur à la littérature. Enfin, nous étudions l’apprentissage par fonction de récompenses potentielles compte rendu des démonstrations d’experts optimaux ou non-optimaux. Nous proposons un algorithme basé sur l’apprentissage inverse par renforcement. Une représentation non-linéaire de la politique est désignée et la méthode du max-margin est appliquée permettant d’affiner les récompenses et de générer la politique de commande. Les trois méthodes proposées sont évaluées sur des robots mobiles afin de leurs permettre d’acquérir les compétences de navigation autonome dans des environnements dynamiques et inconnus / Modern robots are designed for assisting or replacing human beings to perform complicated planning and control operations, and the capability of autonomous navigation in a dynamic environment is an essential requirement for mobile robots. 
In order to alleviate the tedious task of manually programming a robot, this dissertation contributes to the design of intelligent robot control to endow mobile robots with a learning ability in autonomous navigation tasks. First, we consider the robot learning from expert demonstrations. A neural network framework is proposed as the inference mechanism to learn a policy offline from the dataset extracted from experts. Then we are interested in the robot self-learning ability without expert demonstrations. We apply reinforcement learning techniques to acquire and optimize a control strategy during the interaction process between the learning robot and the unknown environment. A neural network is also incorporated to allow a fast generalization, and it helps the learning to converge in far fewer episodes than traditional methods. Finally, we study the robot learning of the potential rewards underneath the states from optimal or suboptimal expert demonstrations. We propose an algorithm based on inverse reinforcement learning. A nonlinear policy representation is designed and the max-margin method is applied to refine the rewards and generate an optimal control policy. The three proposed methods have been successfully implemented on the autonomous navigation tasks for mobile robots in unknown and dynamic environments.
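As a self-contained stand-in for the self-learning scheme described above (the thesis uses neural-network function approximation; this sketch uses a plain tabular method on an assumed toy grid), Q-learning can teach an agent collision-free navigation from interaction alone:

```python
import random

def train_navigator(w=5, h=5, obstacles=frozenset({(2, 1), (2, 2), (2, 3)}),
                    goal=(4, 4), episodes=5000, seed=0):
    """Tabular Q-learning on a toy grid-navigation task.

    The robot starts at (0, 0); moving into a wall or obstacle leaves it
    in place; each step costs -1 and reaching the goal pays +10."""
    rng = random.Random(seed)
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    Q = {}
    def q(s, b): return Q.get((s, b), 0.0)
    alpha, gamma, eps = 0.3, 0.95, 0.2
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(100):
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(4)
            else:
                a = max(range(4), key=lambda b: q(s, b))
            nx, ny = s[0] + moves[a][0], s[1] + moves[a][1]
            if not (0 <= nx < w and 0 <= ny < h) or (nx, ny) in obstacles:
                nx, ny = s                      # blocked: stay put
            r = 10.0 if (nx, ny) == goal else -1.0
            best = max(q((nx, ny), b) for b in range(4))
            Q[(s, a)] = q(s, a) + alpha * (r + gamma * best - q(s, a))
            s = (nx, ny)
            if s == goal:
                break
    return Q
```

After training, following the greedy action in each state steers the robot around the obstacle column to the goal.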
|
95 |
A Markovian state-space framework for integrating flexibility into space system design decisions. Lafleur, Jarret Marshall. 16 December 2011 (has links)
The past decades have seen the state of the art in aerospace system design progress from a scope of simple optimization to one including robustness, with the objective of permitting a single system to perform well even in off-nominal future environments. Integrating flexibility, or the capability to easily modify a system after it has been fielded in response to changing environments, into system design represents a further step forward. One challenge in accomplishing this rests in that the decision-maker must consider not only the present system design decision, but also sequential future design and operation decisions. Despite extensive interest in the topic, the state of the art in designing flexibility into aerospace systems, and particularly space systems, tends to be limited to analyses that are qualitative, deterministic, single-objective, and/or limited to consider a single future time period.
To address these gaps, this thesis develops a stochastic, multi-objective, and multi-period framework for integrating flexibility into space system design decisions. Central to the framework are five steps. First, system configuration options are identified and costs of switching from one configuration to another are compiled into a cost transition matrix. Second, probabilities that demand on the system will transition from one mission to another are compiled into a mission demand Markov chain. Third, one performance matrix for each design objective is populated to describe how well the identified system configurations perform in each of the identified mission demand environments. The fourth step employs multi-period decision analysis techniques, including Markov decision processes (MDPs) from the field of operations research, to find efficient paths and policies a decision-maker may follow. The final step examines the implications of these paths and policies for the primary goal of informing initial system selection.
Overall, this thesis unifies state-centric concepts of flexibility from economics and engineering literature with sequential decision-making techniques from operations research. The end objective of this thesis' framework and its supporting analytic and computational tools is to enable selection of the next-generation space systems today, tailored to decision-maker budget and performance preferences, that will be best able to adapt and perform in a future of changing environments and requirements. Following extensive theoretical development, the framework and its steps are applied to space system planning problems of (1) DARPA-motivated multiple- or distributed-payload satellite selection and (2) NASA human space exploration architecture selection.
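The first four steps of the framework can be sketched end-to-end on a toy instance: a cost transition matrix over configurations, a mission-demand Markov chain, a performance matrix, and value iteration over the joint (configuration, mission) state. All numbers below are illustrative, not from the thesis case studies:

```python
def flexible_policy(C, P, perf, gamma=0.95, iters=500):
    """Value iteration over joint (configuration, mission) states.

    C[i][j]    : cost of switching from configuration i to configuration j
    P[m][m2]   : mission-demand Markov chain transition probability
    perf[j][m] : per-period shortfall cost of configuration j in mission m
    Returns the value table and the optimal next configuration per state."""
    nC, nM = len(C), len(P)
    V = [[0.0] * nM for _ in range(nC)]
    for _ in range(iters):
        nV = [[0.0] * nM for _ in range(nC)]
        for i in range(nC):
            for m in range(nM):
                nV[i][m] = min(
                    C[i][j] + perf[j][m]
                    + gamma * sum(P[m][m2] * V[j][m2] for m2 in range(nM))
                    for j in range(nC))
        V = nV
    policy = [[min(range(nC), key=lambda j: C[i][j] + perf[j][m]
                   + gamma * sum(P[m][m2] * V[j][m2] for m2 in range(nM)))
               for m in range(nM)] for i in range(nC)]
    return V, policy
```

With two configurations, each suited to one of two sticky mission demands and a switching cost of 5, the computed policy keeps the matched configuration and switches only when demand has moved to the other mission, which is the flexibility behavior the framework is designed to expose.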
|
96 |
Semi-Markov Processes In Dynamic Games And Finance. Goswami, Anindya. 02 1900 (has links)
Two different sets of problems are addressed in this thesis. The first one is on partially observed semi-Markov Games (POSMG) and the second one is on semi-Markov modulated financial market model.
In this thesis we study a partially observable semi-Markov game in the infinite time horizon. The study of a partially observable game (POG) involves three major steps: (i) construct an equivalent completely observable game (COG), (ii) establish the equivalence between the POG and the COG by showing that if the COG admits an equilibrium, so does the POG, (iii) study the equilibrium of the COG and find the corresponding equilibrium of the original partially observable problem.
In the infinite time horizon game problem there are two different payoff criteria: the discounted payoff criterion and the average payoff criterion. At first a partially observable semi-Markov decision process on a general state space with discounted cost criterion is studied. An optimal policy is shown to exist by considering a Shapley equation for the corresponding completely observable model. Next the discounted payoff problem is studied for the two-person zero-sum case. A saddle point equilibrium is shown to exist for this case. Then the variable sum game is investigated. For this case a Nash equilibrium strategy is obtained in the Markov class under suitable assumptions. Next the POSMG problem on a countable state space is addressed for the average payoff criterion. It is well known that under this criterion the game problem does not have a solution in general. To ensure a solution one needs some kind of ergodicity of the transition kernel. We find an appropriate ergodicity of the partially observed model which in turn induces a geometric ergodicity to the equivalent model. Using this we establish a solution of the corresponding average payoff optimality equation (APOE). Thus the value and a saddle point equilibrium are obtained for the original partially observable model. A value iteration scheme is also developed to compute the average value of the game.
Next we study a financial market model whose key parameters are modulated by semi-Markov processes. Two different problems are addressed under this market assumption. In the first one we show that this market is incomplete. In such an incomplete market we find the locally risk minimizing prices of exotic options in the Föllmer-Schweizer framework. In this model the stock prices are no longer Markov. Generally the stock price process is modeled as a Markov process because otherwise one may not get a PDE representation of the price of a contingent claim. To overcome this difficulty we find an appropriate Markov process which includes the stock price as a component and then find its infinitesimal generator. Using the Feynman-Kac formula we obtain a system of non-local partial differential equations satisfied by the option price functions in the mild sense. Next this system is shown to have a classical solution for given initial or boundary conditions.
This solution is then used to obtain a Föllmer-Schweizer decomposition of the option price. Thus we obtain the locally risk minimizing prices of different options. Furthermore we obtain an integral equation satisfied by the unique solution of this system. This enables us to compute the price of a contingent claim and find the risk minimizing hedging strategy numerically. Further we develop an efficient and stable numerical method to compute the prices.
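As a simplified numerical illustration of pricing under regime-modulated parameters (plain Monte Carlo under an assumed pricing measure, not the Föllmer-Schweizer minimal-martingale-measure computation of the thesis), a European call can be priced when volatility follows a two-state Markov chain:

```python
import math, random

def call_price_regime_switching(s0=100.0, k=100.0, r=0.02, T=1.0,
                                sigmas=(0.15, 0.35), q=(1.0, 2.0),
                                n_paths=10_000, n_steps=100, seed=1):
    """Monte Carlo price of a European call when the volatility is
    modulated by a two-state continuous-time Markov chain with holding
    rates q (Euler discretization of the regime switches)."""
    rng = random.Random(seed)
    dt = T / n_steps
    disc = math.exp(-r * T)
    total = 0.0
    for _ in range(n_paths):
        s, regime = s0, 0
        for _ in range(n_steps):
            if rng.random() < q[regime] * dt:   # regime switch
                regime = 1 - regime
            vol = sigmas[regime]
            z = rng.gauss(0.0, 1.0)
            s *= math.exp((r - 0.5 * vol * vol) * dt
                          + vol * math.sqrt(dt) * z)
        total += max(s - k, 0.0)
    return disc * total / n_paths
```

With these illustrative parameters the price lands between the two constant-volatility Black-Scholes prices, reflecting the mixture of regimes along each path.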
Besides this work on derivative pricing, the portfolio optimization problem in the semi-Markov modulated market is also studied in the thesis. We find the optimal portfolio selections by optimizing the expected utility of terminal wealth. We also obtain the optimal portfolio selections under a risk sensitive criterion for both finite and infinite time horizons.
|
97 |
Algorithms for Product Pricing and Energy Allocation in Energy Harvesting Sensor Networks. Sindhu, P R. January 2014 (has links) (PDF)
In this thesis, we consider stochastic systems which arise in different real-world application contexts. The first problem we consider is based on product adoption and pricing. A monopolist selling a product has to appropriately price the product over time in order to maximize the aggregated profit. The demand for a product is uncertain and is influenced by a number of factors, some of which are price, advertising, and product technology. We study the influence of price on the demand of a product and also how demand affects future prices. Our approach involves mathematically modelling the variation in demand as a function of price and current sales. We present a simulation-based algorithm for computing the optimal price path of a product for a given period of time. The algorithm we propose uses a smoothed-functional based performance gradient descent method to find a price sequence which maximizes the total profit over a planning horizon.
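A minimal version of such a smoothed-functional scheme can be sketched as follows, assuming a hypothetical linear demand model d_t = a - b*p_t + noise (so each period's optimal price is a/(2b)); the gradient of total profit over the horizon is estimated purely from noisy profit evaluations at Gaussian-perturbed price paths:

```python
import random

def smoothed_functional_ascent(horizon=5, a=20.0, b=2.0, delta=0.1,
                               iters=4000, seed=3):
    """Smoothed-functional stochastic gradient ascent on a price path.

    Hypothetical demand d_t = a - b*p_t + noise; profit = sum_t p_t * d_t.
    The two-sided smoothed-functional gradient estimate perturbs the whole
    price path by a Gaussian direction eta and uses only profit samples."""
    rng = random.Random(seed)

    def profit(prices):
        # one noisy simulation of total profit over the horizon
        return sum(p * max(0.0, a - b * p + rng.gauss(0.0, 0.1))
                   for p in prices)

    prices = [1.0] * horizon
    for k in range(iters):
        lr = 0.01 / (1.0 + k / 500.0)          # decaying step size
        eta = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
        up = [p + delta * e for p, e in zip(prices, eta)]
        dn = [p - delta * e for p, e in zip(prices, eta)]
        g = (profit(up) - profit(dn)) / (2.0 * delta)
        prices = [p + lr * g * e for p, e in zip(prices, eta)]
    return prices
```

Under this assumed demand model the iterates settle near the per-period optimum a/(2b) = 5 without ever computing an analytic gradient, which is the appeal of the smoothed-functional approach when demand is only observable through simulation.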
The second system we consider is in the domain of sensor networks. A sensor network is a collection of autonomous nodes, each of which senses the environment. Sensor nodes use energy for sensing and communication related tasks. We consider the problem of finding optimal energy sharing policies that maximize the network performance of a system comprising multiple sensor nodes and a single energy harvesting (EH) source. Nodes periodically sense a random field and generate data, which is stored in their respective data queues. The EH source harnesses energy from ambient energy sources and the generated energy is stored in a buffer. The nodes require energy for transmission of data, and they receive the energy for this purpose from the EH source. There is a need to efficiently share the stored energy in the EH source among the nodes in the system, in order to minimize the average delay of data transmission over the long run. We formulate this problem in the framework of average cost infinite-horizon Markov Decision Processes [3], [7] and provide algorithms for the same.
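A discrete-time caricature of this system (Bernoulli data arrivals and energy harvesting, one energy unit per transmission; all parameters illustrative) makes the role of the sharing policy concrete: granting the stored energy to the longest queue keeps the average backlog, and hence the delay via Little's law, low:

```python
import random

def simulate_energy_sharing(n_slots=100_000, lam=(0.3, 0.3), p_energy=0.8,
                            policy="lqf", seed=11):
    """Toy EH source feeding two sensor-node data queues.

    Each slot: a packet arrives at queue i w.p. lam[i]; an energy unit is
    harvested w.p. p_energy; the policy grants one stored unit to a node,
    which transmits one queued packet. Returns the long-run average total
    queue length (a delay proxy via Little's law)."""
    rng = random.Random(seed)
    q = [0, 0]
    battery = 0
    total = 0
    rr = 0
    for _ in range(n_slots):
        for i in (0, 1):
            if rng.random() < lam[i]:
                q[i] += 1
        if rng.random() < p_energy:
            battery += 1
        if battery > 0:
            if policy == "lqf":        # longest-queue-first sharing
                i = 0 if q[0] >= q[1] else 1
            else:                       # blind round-robin sharing
                i, rr = rr, 1 - rr
            if q[i] > 0:                # energy is spent only if data waits
                q[i] -= 1
                battery -= 1
        total += q[0] + q[1]
    return total / n_slots
```

Comparing the two policies on the same arrival statistics shows why the allocation decision matters: round-robin wastes transmission opportunities on empty queues, while the queue-aware policy does not.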
|
98 |
Integrated Production and Inventory Control of Assemble-To-Order Systems with Individual Components DemandLi, Zhi 03 September 2013 (has links)
Assemble-to-order (ATO) systems can be regarded as a multiple-resource allocation problem that involves production planning, requirement fulfilment, and inventory assignment. ATO is a popular strategy in manufacturing management. Due to the increasing complexity of today's manufacturing systems, the challenge for ATO systems is to efficiently manage component inventories and make optimal production and allocation decisions.
We study an ATO system with a single product that is assembled from multiple components. The system faces demand not only for the assembled product but also for the individual components. We consider the pure lost-sales case and the mixed lost-sales and backorders case, with exponential production times and Poisson demand. We formulate the problem as a Markov decision process (MDP) and consider it under two optimality criteria: discounted cost and average cost per period. We characterize the structure of the optimal policy and investigate the impact of different system parameters on it. We also present several static heuristic policies for the pure lost-sales and the mixed lost-sales and backorders cases. These static heuristics provide simple, yet effective approaches for controlling production and inventory allocation in ATO systems.
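A one-dimensional cousin of this model makes the structure concrete: a single make-to-stock component with lost sales, with invented rates and costs, uniformised into a discrete-time discounted DP. Value iteration recovers the familiar base-stock (threshold) form of the optimal production policy.

```python
import numpy as np

lam, mu, beta = 1.0, 1.5, 0.1   # demand rate, production rate, discount rate
h_cost, lost = 1.0, 10.0        # holding cost rate, lost-sale penalty
N = 20                          # inventory cap
Lam = lam + mu                  # uniformisation rate

V = np.zeros(N + 1)
for _ in range(2000):
    Vn = np.empty_like(V)
    for x in range(N + 1):
        dem = V[x - 1] if x > 0 else V[0] + lost   # demand epoch
        prod = min(V[min(x + 1, N)], V[x])         # produce or stay idle
        # Uniformised Bellman equation for continuous-time discounted cost.
        Vn[x] = (h_cost * x + lam * dem + mu * prod) / (beta + Lam)
    V = Vn

# Optimal decision: produce exactly when it lowers the value function.
produce = np.array([V[min(x + 1, N)] < V[x] for x in range(N + 1)])
```

The resulting `produce` vector is a block of ones followed by zeros, i.e. a base-stock policy: produce if and only if inventory is below a threshold.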
|
99 |
Algorithms For Stochastic Games And Service SystemsPrasad, H L 05 1900 (has links) (PDF)
This thesis is organized into two parts: the first covers our main area of research, the field of stochastic games, and the second our contributions in the area of service systems. We first provide an abstract of our work on stochastic games.
The field of stochastic games has been actively pursued over the last seven decades because of several important applications in oligopolistic economics. In the past, zero-sum stochastic games have been modelled and solved for Nash equilibria using the standard techniques of Markov decision processes. General-sum stochastic games, on the contrary, have posed difficulty as they cannot be reduced to Markov decision processes. Over the past few decades the quest for algorithms to compute Nash equilibria in general-sum stochastic games has intensified, and several important algorithms, such as the stochastic tracing procedure [Herings and Peeters, 2004], NashQ [Hu and Wellman, 2003], and FFQ [Littman, 2001], as well as generalised representations such as the optimization problem formulations for various reward structures [Filar and Vrieze, 1997], have been proposed. However, these algorithms either lack generality, or are intractable even for medium-sized problems, or both. In our venture towards algorithms for stochastic games, we start with a non-linear optimization problem and then design a simple gradient descent procedure for it. Though this procedure gives the Nash equilibrium for a sample terrain-exploration problem, we observe that, in general, this need not be the case. We characterize the necessary conditions and define the notion of a KKT-N point: a Karush-Kuhn-Tucker (KKT) point which corresponds to a Nash equilibrium. Thus, for a simple gradient-based algorithm to guarantee convergence to a Nash equilibrium, all KKT points of the optimization problem need to be KKT-N points, which restricts the applicability of such algorithms.
We then take a step back and look for a better characterization of those points of the optimization problem which correspond to Nash equilibria of the underlying game. As a result of this exploration, we derive two sets of necessary and sufficient conditions. The first set, the KKT-SP conditions, is inspired by the KKT conditions themselves and is obtained by breaking the main optimization problem down into several sub-problems and then applying the KKT conditions to each of them. The second set, the SG-SP conditions, is a simplified set of conditions which characterizes those Nash points more compactly. Using the KKT-SP and SG-SP conditions, we propose three algorithms, OFF-SGSP, ON-SGSP and DON-SGSP, which we show provide Nash equilibrium strategies for general-sum discounted stochastic games. Here OFF-SGSP is an off-line algorithm while ON-SGSP and DON-SGSP are on-line algorithms. In particular, we believe that DON-SGSP is the first decentralized on-line algorithm for general-sum discounted stochastic games. We show that both our on-line algorithms are computationally efficient. In fact, we show that DON-SGSP is not only applicable to multi-agent scenarios but also directly applicable to the single-agent case, i.e., MDPs (Markov Decision Processes).
The second part of the thesis focuses on formulating and solving the problem of minimizing the labour cost in service systems. We define the setting of service systems and then model the labour-cost problem as a constrained discrete-parameter Markov-cost process. This Markov process is parametrized by the number of workers in various shifts and with various skill levels. With the number of workers as optimization variables, we provide a detailed formulation of a constrained optimization problem in which the objective is the expected long-run average of the single-stage labour costs, and the main set of constraints are the expected long-run averages of aggregate SLAs (Service Level Agreements). For this constrained optimization problem, we provide two stochastic optimization algorithms, SASOC-SF-N and SASOC-SF-C, which use smoothed-functional approaches to estimate the gradient and perform gradient descent. SASOC-SF-N uses a Gaussian distribution for smoothing while SASOC-SF-C uses a Cauchy distribution. SASOC-SF-C is the first Cauchy-based smoothing algorithm that requires a fixed number (two) of simulations, independent of the number of optimization variables. We show that these algorithms provide an order of magnitude better performance than the existing industry-standard tool, OptQuest. We also show that SASOC-SF-C gives better overall performance.
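The constrained set-up can be caricatured by a two-timescale primal-dual scheme on a one-dimensional toy problem. This is not SASOC itself: the cost, the SLA-like constraint, the Gaussian smoothing, and all step sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(theta):      # noisy cost observation (stand-in for simulated labour cost)
    return (theta - 2.0) ** 2 + 0.1 * rng.standard_normal()

def sla_gap(theta):   # constraint g(theta) <= 0 with g = 3 - theta (invented SLA)
    return 3.0 - theta + 0.1 * rng.standard_normal()

theta, lam_mult = 0.0, 0.0
delta = 0.05
for k in range(1, 20001):
    a, b = 0.5 / k ** 0.6, 2.0 / (k + 100)   # theta on the faster timescale
    eta = rng.standard_normal()
    # Smoothed-functional gradient of the Lagrangian cost + lam * g:
    lag = lambda t: cost(t) + lam_mult * sla_gap(t)
    grad = eta * (lag(theta + delta * eta) - lag(theta - delta * eta)) / (2 * delta)
    theta -= a * grad                                    # primal descent
    lam_mult = max(0.0, lam_mult + b * sla_gap(theta))   # projected dual ascent
# The constrained minimum sits at theta = 3 with multiplier lam = 2.
```

The multiplier rises while the SLA constraint is violated and settles once the primal iterate satisfies it, which is the same primal-dual mechanism the constrained formulation above relies on.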
|
100 |
Feature Adaptation Algorithms for Reinforcement Learning with Applications to Wireless Sensor Networks And Road Traffic ControlPrabuchandran, K J January 2016 (has links) (PDF)
Many sequential decision-making problems under uncertainty arising in engineering, science and economics are modelled as Markov Decision Processes (MDPs). In the setting of MDPs, the goal is to find a state-dependent optimal sequence of actions that minimizes a certain long-term performance criterion. The standard dynamic programming approach to solving an MDP for the optimal decisions requires a complete model of the MDP and is computationally feasible only for MDPs with small state-action spaces. Reinforcement learning (RL) methods, on the other hand, are model-free, simulation-based approaches for solving MDPs. In many real-world applications, one is often faced with MDPs that have large state-action spaces and whose model is unknown, but whose outcomes can be simulated. In order to solve such large MDPs, one either resorts to the technique of function approximation in conjunction with RL methods or develops application-specific RL methods. A solution based on RL methods with function approximation comes with the associated problem of choosing the right features for approximation, while a solution based on application-specific RL methods relies primarily on exploiting the problem structure. In this thesis, we investigate the problem of choosing the right features for RL methods based on function approximation and develop novel RL algorithms that adaptively obtain the best features for approximation. Subsequently, we also develop problem-specific RL methods for applications arising in the areas of wireless sensor networks and road traffic control.
In the first part of the thesis, we consider the problem of finding the best features for value function approximation in reinforcement learning under the long-run discounted cost objective. We quantify the error in the approximation, for any given choice of features and approximation parameter, by the mean square Bellman error (MSBE) objective, and develop an online algorithm to optimize the MSBE.
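For a small known MDP the MSBE is an explicit quadratic, which makes the gradient step easy to sketch. The thesis's online algorithm works from samples instead; the chain, costs, and features below are invented for illustration.

```python
import numpy as np

# Fixed policy on a two-state chain (all numbers hypothetical).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])      # transition matrix
c = np.array([1.0, 2.0])        # per-stage costs
gamma = 0.9
Phi = np.array([[1.0], [2.0]])  # one linear feature per state

# MSBE(w) = || Phi w - (c + gamma * P Phi w) ||^2 = || A w - c ||^2
A = Phi - gamma * P @ Phi
w = np.zeros(1)
for _ in range(5000):
    w -= 0.01 * 2.0 * A.T @ (A @ w - c)   # exact gradient descent on the MSBE
```

In the online setting the same gradient is estimated from simulated transitions rather than from P and c.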
Subsequently, we propose the first online actor-critic scheme with adaptive bases to find a locally optimal (control) policy for an MDP under the weighted discounted cost objective. The actor performs gradient search in the space of policy parameters using simultaneous perturbation stochastic approximation (SPSA) gradient estimates. This gradient computation, however, requires estimates of the value function of the policy. The value function is approximated using a linear architecture, and its estimate is obtained from the critic. The error in the approximation of the value function, however, results in sub-optimal policies. We therefore obtain the best features by performing gradient descent on the Grassmannian of features to minimize an MSBE objective. We provide a proof of convergence of our control algorithm to a locally optimal policy and present numerical results illustrating its performance.
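The key property of SPSA is that a single random direction perturbs every coordinate at once, so the gradient estimate needs only two function evaluations regardless of dimension. A minimal sketch on an invented quadratic objective (not the actor-critic system itself):

```python
import numpy as np

rng = np.random.default_rng(2)

def J(theta):  # toy performance measure standing in for the policy's cost
    return float(np.sum((theta - 1.0) ** 2))

def spsa_gradient(J, theta, ck=0.1):
    # Rademacher perturbation of ALL coordinates simultaneously:
    # only two evaluations of J, whatever the dimension of theta.
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    return (J(theta + ck * delta) - J(theta - ck * delta)) / (2 * ck) / delta

theta = np.zeros(4)
for k in range(1, 1001):
    theta -= (0.1 / k ** 0.602) * spsa_gradient(J, theta)
```

By contrast, a coordinate-wise finite-difference scheme would need 2*dim evaluations per step, which is the cost SPSA avoids.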
In our next work, we develop an online actor-critic control algorithm with adaptive feature tuning for MDPs under the long-run average cost objective. In this setting, a gradient search in the policy parameters is performed using policy gradient estimates to improve the performance of the actor. The computation of the aforementioned gradient however requires estimates of the differential value function of the policy. In order to obtain good estimates of the differential value function, the critic adaptively tunes the features to obtain the best representation of the value function using gradient search in the Grassmannian of features. We prove that our actor-critic algorithm converges to a locally optimal policy. Experiments on two different MDP settings show performance improvements resulting from our feature adaptation scheme.
In the second part of the thesis, we develop problem specific RL solution methods for the two aforementioned applications. In both the applications, the size of the state-action space in the formulated MDPs is large. However, by utilizing the problem structure we develop scalable RL algorithms.
In the wireless sensor networks application, we develop RL algorithms to find optimal energy management policies (EMPs) for energy harvesting (EH) sensor nodes. First, we consider the case of a single EH sensor node and formulate the problem of finding an optimal EMP in the discounted cost MDP setting. We then propose two RL algorithms to maximize network performance. Through simulations, our algorithms are seen to outperform the algorithms in the literature. Our RL algorithms for the single EH sensor node do not scale when there are multiple sensor nodes. In our second work, we therefore consider the problem of finding optimal energy sharing policies that maximize the network performance of a system comprising multiple sensor nodes and a single energy harvesting (EH) source. We develop efficient energy sharing algorithms, namely Q-learning algorithms with exploration mechanisms based on the ε-greedy method as well as upper confidence bounds (UCB). We extend these algorithms by incorporating state and action space aggregation to tackle the state-action space explosion in the MDP. We also develop a cross-entropy-based method that incorporates policy parameterization in order to find near-optimal energy sharing policies. Through numerical experiments, we show that our algorithms yield energy sharing policies that outperform the heuristic greedy method.
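Tabular Q-learning with ε-greedy exploration can be sketched on an invented single-node battery toy (not the multi-node energy-sharing model): harvesting adds a unit of energy, transmitting spends one for a unit reward.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, eps, lr = 3, 2, 0.9, 0.1, 0.1   # energy levels 0..2

def step(e, a):
    # Hypothetical dynamics: action 1 = transmit (spend 1 energy, reward 1);
    # anything else (or an empty battery) harvests one unit, capped at nS-1.
    if a == 1 and e > 0:
        return e - 1, 1.0
    return min(e + 1, nS - 1), 0.0

Q = np.zeros((nS, nA))
e = 0
for _ in range(20000):
    a = int(rng.integers(nA)) if rng.random() < eps else int(Q[e].argmax())
    e2, r = step(e, a)
    Q[e, a] += lr * (r + gamma * Q[e2].max() - Q[e, a])   # Q-learning update
    e = e2
# The learned policy transmits whenever energy is available.
```

Swapping the ε-greedy draw for an optimism bonus of the UCB form (Q plus a term growing with log(t) over the visit count of the state-action pair) gives the flavour of the UCB-based variant mentioned above; this sketch shows only the ε-greedy case.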
In the context of road traffic control, optimal control of traffic lights at junctions, or traffic signal control (TSC), is essential for reducing the average delay experienced by road users. This problem is hard to solve when all the junctions in the road network are considered simultaneously. We therefore propose a decentralized multi-agent reinforcement learning (MARL) approach that treats each junction in the road network as a separate agent (controller) in order to obtain dynamic TSC policies, and we develop two approaches within it to minimize the average delay. In the first approach, each agent decides the signal duration of its phases in a round-robin (RR) manner using a multi-agent Q-learning algorithm. We show through simulations on VISSIM (a microscopic traffic simulator) that our round-robin MARL algorithms perform significantly better than both the standard fixed signal timing (FST) algorithm and the saturation balancing (SAT) algorithm on two real road networks. In the second approach, instead of optimizing the green-light duration, each agent optimizes the order of the phase sequence. We then employ our MARL algorithms by suitably changing the state-action space and cost structure of the MDP. We show through simulations on VISSIM that our non-round-robin MARL algorithms perform significantly better than the FST, SAT and round-robin MARL algorithms based on the first approach. On the other hand, our round-robin MARL algorithms are more practically viable as they conform with the psychology of road users.
|