121

Commande optimale (en Production et Stock) de Systèmes Assemble-To-Order (ATO) avec prise en compte de demandes en composants individuels / Integrated Production and Inventory Control of Assemble-To-Order Systems with Individual Components Demand

Li, Zhi 03 September 2013 (has links)
Les systèmes assemble-to-order (ATO) peuvent être considérés comme une affectation de ressources multiples qui induit planification de production, satisfaction des contraintes et affectation des stocks. Les systèmes ATO représentent une stratégie de logistique populaire utilisée en gestion de fabrication. En raison de la complexité croissante des systèmes de fabrication d'aujourd'hui, le défi pour les systèmes ATO est de gérer efficacement les stocks de composants et de trouver les décisions optimales de production et d'affectation. Nous étudions un système ATO avec un produit unique qui est assemblé à partir de plusieurs composants. Le système doit répondre à une demande non seulement du produit assemblé, mais aussi des composants individuels. Nous considérons le cas avec seulement des lost sales puis le cas mixte lost sales et backorders avec des temps de production suivant des lois de type exponentiel et une demande sous forme de loi de Poisson. Nous formulons le problème comme un Processus de décision markovien (MDP), et nous considérons deux critères d'optimalité qui sont le coût actualisé et le coût moyen par période. Nous caractérisons la structure de la politique optimale et étudions l'impact des différents paramètres du système sur cette politique. Nous présentons également plusieurs heuristiques pour le cas lost sales et le cas mixte lost sales et backorders. Ces heuristiques fournissent des méthodes simples, mais efficaces pour contrôler la production et l'affectation des stocks du système ATO. / Assemble-to-order (ATO) systems can be regarded as a multiple resource allocation that induces production planning, requirement fulfillment and inventory assignment. ATO is a popular strategy used in manufacturing management. Due to the increasing complexity of today's manufacturing systems, the challenge for ATO systems is to efficiently manage component inventories and make optimal production and allocation decisions. We study an ATO system with a single product which is assembled from multiple components. The system faces demand not only from the assembled product but also from the individual components. We consider the pure lost sales case and the mixed lost sales and backorders case with exponential production times and Poisson demand. We formulate the problem as a Markov decision process (MDP), and consider it under two optimality criteria: discounted cost and average cost per period. We characterize the structure of the optimal policy and investigate the impact of different system parameters on the optimal policy. We also present several static heuristic policies for the pure lost sales and the mixed lost sales and backorders cases. These static heuristics provide simple, yet effective approaches for controlling production and inventory allocation of the ATO system.
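The kind of Bellman recursion behind such joint production/allocation policies can be illustrated with a toy model. The sketch below is a simplified, uniformized lost-sales system with a single component type, not the multi-component model of the thesis; the inventory cap, rates, costs and discount factor are all illustrative assumptions.

```python
# Minimal sketch: value iteration for a toy lost-sales ATO-like system with one
# component type, after turning event rates into one-step probabilities as in a
# uniformized chain. All numbers below are assumptions, not thesis parameters.
import numpy as np

S = 20                                   # assumed cap on component inventory
mu, lam_p, lam_c = 1.0, 0.6, 0.3         # production / product-demand / component-demand rates
h, cp, cc = 1.0, 20.0, 8.0               # holding cost and lost-sale penalties (product, component)
gamma = 0.98                             # assumed discrete discount factor
Lam = mu + lam_p + lam_c                 # uniformization constant
p_mu, p_p, p_c = mu / Lam, lam_p / Lam, lam_c / Lam   # one-step event probabilities

V = np.zeros(S + 1)
for _ in range(5000):
    Vn = np.empty_like(V)
    for s in range(S + 1):
        # production decision: produce one more component (if room) or idle
        prod = min(V[min(s + 1, S)], V[s])
        # product demand: assemble and serve it (consumes a component) or lose the sale
        serve = V[s - 1] if s >= 1 else np.inf
        dem_p = min(serve, cp + V[s])
        # individual component demand: serve from stock or lose the sale
        dem_c = min(serve, cc + V[s])
        Vn[s] = h * s + gamma * (p_mu * prod + p_p * dem_p + p_c * dem_c)
    diff = np.max(np.abs(Vn - V))
    V = Vn
    if diff < 1e-9:
        break
```

The decisions at each state can be read off by comparing the terms inside each `min`, which is the sense in which the structure of a production and allocation policy would be examined in such a toy model.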
122

Algorithms For Stochastic Games And Service Systems

Prasad, H L 05 1900 (has links) (PDF)
This thesis is organized into two parts, one for my main area of research in the field of stochastic games, and the other for my contributions in the area of service systems. We first provide an abstract for my work in stochastic games. The field of stochastic games has been actively pursued over the last seven decades because of several of its important applications in oligopolistic economics. In the past, zero-sum stochastic games have been modelled and solved for Nash equilibria using the standard techniques of Markov decision processes. General-sum stochastic games, on the contrary, have posed difficulty as they cannot be reduced to Markov decision processes. Over the past few decades the quest for algorithms to compute Nash equilibria in general-sum stochastic games has intensified and several important algorithms such as the stochastic tracing procedure [Herings and Peeters, 2004], NashQ [Hu and Wellman, 2003], FFQ [Littman, 2001], etc., and their generalised representations such as the optimization problem formulations for various reward structures [Filar and Vrieze, 1997] have been proposed. However, they either lack generality or are intractable for even medium-sized problems, or both. In our venture towards algorithms for stochastic games, we start with a non-linear optimization problem and then design a simple gradient descent procedure for the same. Though this procedure gives the Nash equilibrium for a sample problem of terrain exploration, we observe that, in general, this need not be the case. We characterize the necessary conditions and define the notion of a KKT-N point. KKT-N points are those Karush-Kuhn-Tucker (KKT) points which correspond to Nash equilibria. Thus, for a simple gradient-based algorithm to guarantee convergence to a Nash equilibrium, all KKT points of the optimization problem need to be KKT-N points, which restricts the applicability of such algorithms. We then take a step back and look for a better characterization of those points of the optimization problem which correspond to Nash equilibria of the underlying game. As a result of this exploration, we derive two sets of necessary and sufficient conditions. The first set, the KKT-SP conditions, is inspired by the KKT conditions themselves and is obtained by breaking down the main optimization problem into several sub-problems and then applying the KKT conditions to each of those sub-problems. The second set, the SG-SP conditions, is a simplified set of conditions which characterizes those Nash points more compactly. Using both the KKT-SP and SG-SP conditions, we propose three algorithms, OFF-SGSP, ON-SGSP and DON-SGSP, which we show provide Nash equilibrium strategies for general-sum discounted stochastic games. Here OFF-SGSP is an off-line algorithm while ON-SGSP and DON-SGSP are on-line algorithms. In particular, we believe that DON-SGSP is the first decentralized on-line algorithm for general-sum discounted stochastic games. We show that both our on-line algorithms are computationally efficient. In fact, we show that DON-SGSP is not only applicable for multi-agent scenarios but is also directly applicable to the single-agent case, i.e., MDPs (Markov Decision Processes). The second part of the thesis focuses on formulating and solving the problem of minimizing the labour cost in service systems. We define the setting of service systems and then model the labour-cost problem as a constrained discrete-parameter Markov-cost process.
This Markov process is parametrized by the number of workers in various shifts and with various skill levels. With the number of workers as optimization variables, we provide a detailed formulation of a constrained optimization problem where the objective is the expected long-run average of the single-stage labour costs, and the main constraints are the expected long-run averages of aggregate SLAs (Service Level Agreements). For this constrained optimization problem, we provide two stochastic optimization algorithms, SASOC-SF-N and SASOC-SF-C, which use smoothed functional approaches to estimate the gradient and perform gradient descent for the aforementioned constrained optimization problem. SASOC-SF-N uses the Gaussian distribution for smoothing, while SASOC-SF-C uses the Cauchy distribution. SASOC-SF-C is the first Cauchy-based smoothing algorithm that requires a fixed number (two) of simulations, independent of the number of optimization variables. We show that these algorithms provide an order of magnitude better performance than the existing industry-standard tool, OptQuest. We also show that SASOC-SF-C gives better overall performance.
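As a rough illustration of the smoothed functional idea underlying such algorithms, the sketch below shows a two-simulation, Gaussian-smoothed gradient estimate followed by a projected descent step. `simulate_cost` is a hypothetical placeholder for one simulation of the long-run average labour cost, and the step sizes, smoothing parameter and box constraints are assumptions rather than the SASOC parameters.

```python
# Minimal sketch, not the SASOC implementation: two-simulation Gaussian
# smoothed-functional (SF) gradient estimation with projected descent.
import numpy as np

def sf_gradient(theta, simulate_cost, rng, beta=0.1):
    """Two-sided Gaussian smoothed-functional estimate of the gradient of J at theta."""
    eta = rng.standard_normal(theta.shape)        # Gaussian smoothing direction
    j_plus = simulate_cost(theta + beta * eta)    # first simulation
    j_minus = simulate_cost(theta - beta * eta)   # second simulation
    return eta * (j_plus - j_minus) / (2.0 * beta)

def project(theta, lo=0.0, hi=50.0):
    """Keep the (relaxed) worker counts inside simple box constraints (illustrative)."""
    return np.clip(theta, lo, hi)

def sasoc_like_descent(theta0, simulate_cost, steps=200, a=0.05, seed=0):
    theta = np.asarray(theta0, dtype=float)
    rng = np.random.default_rng(seed)
    for k in range(steps):
        g = sf_gradient(theta, simulate_cost, rng)
        theta = project(theta - a / (1 + k) ** 0.6 * g)   # diminishing step sizes
    return theta
```

Replacing the Gaussian draw with `rng.standard_cauchy(theta.shape)` would give a Cauchy-smoothed variant in the spirit of SASOC-SF-C; in both cases only two simulations are needed per iteration, regardless of the dimension of `theta`.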
123

Feature Adaptation Algorithms for Reinforcement Learning with Applications to Wireless Sensor Networks And Road Traffic Control

Prabuchandran, K J January 2016 (has links) (PDF)
Many sequential decision-making problems under uncertainty arising in engineering, science and economics are often modelled as Markov Decision Processes (MDPs). In the setting of MDPs, the goal is to find a state-dependent optimal sequence of actions that minimizes a certain long-term performance criterion. The standard dynamic programming approach to solve an MDP for the optimal decisions requires a complete model of the MDP and is computationally feasible only for small state-action MDPs. Reinforcement learning (RL) methods, on the other hand, are model-free, simulation-based approaches for solving MDPs. In many real world applications, one is often faced with MDPs that have large state-action spaces whose model is unknown but whose outcomes can be simulated. In order to solve such (large) MDPs, one either resorts to the technique of function approximation in conjunction with RL methods or develops application-specific RL methods. A solution based on RL methods with function approximation comes with the associated problem of choosing the right features for approximation, and a solution based on application-specific RL methods primarily relies on utilizing the problem structure. In this thesis, we investigate the problem of choosing the right features for RL methods based on function approximation as well as develop novel RL algorithms that adaptively obtain the best features for approximation. Subsequently, we also develop problem-specific RL methods for applications arising in the areas of wireless sensor networks and road traffic control. In the first part of the thesis, we consider the problem of finding the best features for value function approximation in reinforcement learning for the long-run discounted cost objective. We quantify the error in the approximation for any given feature and approximation parameter by the mean square Bellman error (MSBE) objective and develop an online algorithm to optimize the MSBE. Subsequently, we propose the first online actor-critic scheme with adaptive bases to find a locally optimal (control) policy for an MDP under the weighted discounted cost objective. The actor performs gradient search in the space of policy parameters using simultaneous perturbation stochastic approximation (SPSA) gradient estimates. This gradient computation, however, requires estimates of the value function of the policy. The value function is approximated using a linear architecture and its estimate is obtained from the critic. The error in approximation of the value function, however, results in sub-optimal policies. Thus, we obtain the best features by performing a gradient descent on the Grassmannian of features to minimize an MSBE objective. We provide a proof of convergence of our control algorithm to a locally optimal policy and show numerical results illustrating the performance of our algorithm. In our next work, we develop an online actor-critic control algorithm with adaptive feature tuning for MDPs under the long-run average cost objective. In this setting, a gradient search in the policy parameters is performed using policy gradient estimates to improve the performance of the actor. The computation of the aforementioned gradient, however, requires estimates of the differential value function of the policy. In order to obtain good estimates of the differential value function, the critic adaptively tunes the features to obtain the best representation of the value function using gradient search in the Grassmannian of features.
We prove that our actor-critic algorithm converges to a locally optimal policy. Experiments on two different MDP settings show performance improvements resulting from our feature adaptation scheme. In the second part of the thesis, we develop problem-specific RL solution methods for the two aforementioned applications. In both applications, the size of the state-action space in the formulated MDPs is large. However, by utilizing the problem structure we develop scalable RL algorithms. In the wireless sensor networks application, we develop RL algorithms to find optimal energy management policies (EMPs) for energy harvesting (EH) sensor nodes. First, we consider the case of a single EH sensor node and formulate the problem of finding an optimal EMP in the discounted cost MDP setting. We then propose two RL algorithms to maximize network performance. Through simulations, our algorithms are seen to outperform the algorithms in the literature. Our RL algorithms for the single EH sensor node do not scale when there are multiple sensor nodes. In our second work, we consider the problem of finding optimal energy sharing policies that maximize the network performance of a system comprising multiple sensor nodes and a single energy harvesting (EH) source. We develop efficient energy sharing algorithms, namely Q-learning algorithms with exploration mechanisms based on the ε-greedy method as well as the upper confidence bound (UCB). We extend these algorithms by incorporating state and action space aggregation to tackle state-action space explosion in the MDP. We also develop a cross-entropy-based method that incorporates policy parameterization in order to find near-optimal energy sharing policies. Through numerical experiments, we show that our algorithms yield energy sharing policies that outperform the heuristic greedy method. In the context of road traffic control, optimal control of traffic lights at junctions, or traffic signal control (TSC), is essential for reducing the average delay experienced by the road users. This problem is hard to solve when simultaneously considering all the junctions in the road network. So, we propose a decentralized multi-agent reinforcement learning (MARL) algorithm for solving this problem by considering each junction in the road network as a separate agent (controller) to obtain dynamic TSC policies. We propose two approaches to minimize the average delay. In the first approach, each agent decides the signal duration of its phases in a round-robin (RR) manner using the multi-agent Q-learning algorithm. We show through simulations over VISSIM (a microscopic traffic simulator) that our round-robin MARL algorithms perform significantly better than both the standard fixed signal timing (FST) algorithm and the saturation balancing (SAT) algorithm over two real road networks. In the second approach, instead of optimizing green light duration, each agent optimizes the order of the phase sequence. We then employ our MARL algorithms by suitably changing the state-action space and cost structure of the MDP. We show through simulations over VISSIM that our non-round-robin MARL algorithms perform significantly better than the FST, SAT and the round-robin MARL algorithms based on the first approach. On the other hand, our round-robin MARL algorithms are more practically viable, as they conform to the psychology of road users.
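For reference, the sketch below shows the standard tabular Q-learning update with ε-greedy exploration that the energy-sharing algorithms above build upon. The Gym-like `env.reset`/`env.step` interface, state/action counts and hyper-parameters are assumptions, not the thesis setup.

```python
# Minimal sketch, assuming a generic finite-MDP environment interface:
# tabular Q-learning with epsilon-greedy exploration.
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)            # assumed (state, reward, done) return
            # one-step Q-learning update
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```

A UCB-style variant would replace the ε-greedy selection with an argmax over `Q[s]` plus an exploration bonus that shrinks with the visit count of each state-action pair.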
124

Capturing continuous human movement on a linear network with mobile phone towers / Skattning av kontinuerlig mänsklig rörelse på ett linjärt nätverk med hjälp av mobiltelefon-master

Dejby, Jesper January 2017 (has links)
Anonymous Call Detail Records (CDRs) from mobile phone towers provide a unique opportunity to aggregate individual location data into overall human mobility patterns. Flowminder uses this data to improve the welfare of low- and middle-income countries. The movement patterns are studied through key measurements of mobility. This thesis seeks to evaluate the estimates of key measurements obtained with mobile phone towers through simulation of continuous human movement on a linear network. Simulation is performed with an agent-based approach. Spatial point processes are used to distribute continuous start points of the agents on the linear network. Each start point is then equipped with a mark: a path with an end point dependent on the start point. A path from the start point to the end point of an agent is modeled with a Markov Decision Process. The simulated human movement can then be captured with different types of mobile phone tower distributions realized from spatial point processes. The thesis initially considers homogeneous Poisson and Simple Sequential Inhibition (SSI) processes on a plane and then introduces local clusters (heterogeneity) with Matérn Cluster and SSI processes. The goal of the thesis is to investigate the effects of changes in mobile phone tower distribution and call frequency on the estimates of key measurements of mobility. The effects of call frequency are unclear and invite more detailed study. The results suggest that a decrease in the total number of towers generally worsens the estimates and that introducing local clusters also has a negative effect on the estimates. The presented methodology provides a new and flexible way to model continuous human movement along a linear network.
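As a rough sketch of the two baseline tower distributions mentioned above, the code below simulates tower locations on a unit-square window with a homogeneous Poisson process and with Simple Sequential Inhibition (SSI). The intensity, target count and inhibition radius are illustrative assumptions, not the parameter values used in the thesis.

```python
# Minimal sketch: homogeneous Poisson and SSI tower placements on the unit square.
import numpy as np

def poisson_towers(intensity, seed=0):
    """Homogeneous Poisson process: Poisson count, then uniform locations."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(intensity)            # number of towers in the window
    return rng.random((n, 2))             # uniform locations given the count

def ssi_towers(n_target, r, seed=0, max_tries=10000):
    """Simple Sequential Inhibition: accept points at least distance r apart."""
    rng = np.random.default_rng(seed)
    pts, tries = [], 0
    while len(pts) < n_target and tries < max_tries:
        tries += 1
        p = rng.random(2)
        if all(np.hypot(*(p - q)) >= r for q in pts):
            pts.append(p)
    return np.array(pts)
```

Clustered (heterogeneous) patterns such as the Matérn Cluster process would instead scatter daughter towers around Poisson-distributed parent points, which is what introduces the local clusters studied in the thesis.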
125

Learning dynamics and decision paradigms in social-ecological dilemmas

Barfuss, Wolfram 10 July 2019 (has links)
Kollektives Handeln ist erforderlich, um nachhaltige Entwicklungspfade in gekoppelten sozial-ökologischen Systemen zu erschließen, fernab von gefährlichen Kippelementen. Ohne anderen Modellierungsprinzipien ihren Nutzen abzuerkennen, schlägt diese Dissertation die Agent-Umwelt-Schnittstelle als die mathematische Grundlage für das Modellieren sozial-ökologischer Systeme vor. Zuerst erweitert diese Arbeit eine Methode aus der Literatur der statistischen Physik über Lerndynamiken, um einen deterministischen Grenzübergang von etablierten Verstärkungslernalgorithmen aus der Forschung zu künstlicher Intelligenz herzuleiten. Die resultierenden Lerndynamiken zeigen eine große Bandbreite verschiedener dynamischer Regime wie z.B. Fixpunkte, Grenzzyklen oder deterministisches Chaos. Zweitens werden die hergeleiteten Lerngleichungen auf eine neu eingeführte Umwelt, das Ökologische Öffentliche Gut, angewendet. Sie modelliert ein gekoppeltes sozial-ökologisches Dilemma und erweitert damit etablierte soziale Dilemmaspiele um ein ökologisches Kippelement. Bekannte theoretische und empirische Ergebnisse werden reproduziert und neuartige, qualitativ verschiedene Parameterregime aufgezeigt, darunter eines, in dem diese belohnungsoptimierenden Lern-Agenten es vorziehen, gemeinsam unter einem Kollaps der Umwelt zu leiden, anstatt in einer florierenden Umwelt zu kooperieren. Drittens stellt diese Arbeit das Optimierungsparadigma der Lern-Agenten in Frage. Die drei Entscheidungsparadigmen ökonomische Optimierung, Nachhaltigkeit und Sicherheit werden systematisch miteinander verglichen, während sie auf das Management eines umweltlichen Kippelements angewendet werden. Es wird gezeigt, dass kein Paradigma garantiert, Anforderungen anderer Paradigmen zu erfüllen, sowie dass das Fehlen eines Meisterparadigmas von besonderer Bedeutung für das Klimasystem ist, da dieses sich am Rand zwischen Parameterbereichen befinden kann, wo ökonomische Optimierung weder nachhaltig noch sicher wird. / Collective action is required to enter sustainable development pathways in coupled social-ecological systems, safely away from dangerous tipping elements. Without denying the usefulness of other model design principles, this thesis proposes the agent-environment interface as the mathematical foundation for the design of social-ecological system models. First, this work refines techniques from the statistical physics literature on learning dynamics to derive a deterministic limit of established reinforcement learning algorithms from artificial intelligence research. Illustrations of the resulting learning dynamics reveal a wide range of different dynamical regimes, such as fixed points, periodic orbits and deterministic chaos. Second, the derived multi-state learning equations are applied to a newly introduced environment, the Ecological Public Good. It models a coupled social-ecological dilemma, extending established repeated social dilemma games by an ecological tipping element. Known theoretical and empirical results are reproduced and novel qualitatively different parameter regimes are discovered, including one in which these reward-optimizing agents prefer to collectively suffer in environmental collapse rather than cooperating in a prosperous environment. Third, this thesis challenges the reward optimizing paradigm of the learning equations. It presents a novel formal comparison of the three decision paradigms of economic optimization, sustainability and safety for the governance of an environmental tipping element.
It is shown that no paradigm guarantees fulfilling requirements imposed by another paradigm. Further, the absence of a master paradigm is shown to be of special relevance for governing the climate system, since the latter may reside at the edge between parameter regimes where economic welfare optimization becomes neither sustainable nor safe.
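For intuition only, the sketch below integrates a well-known deterministic limit of Boltzmann Q-learning in a one-shot 2x2 game: replicator (selection) dynamics plus an entropy-driven exploration term. It is not the multi-state, social-ecological learning equations derived in the thesis, and the payoff matrix, learning rate, temperature and step size are all assumptions.

```python
# Minimal sketch: deterministic learning dynamics for two agents in a symmetric
# 2x2 game, integrated with a simple Euler scheme. Illustrative only.
import numpy as np

A = np.array([[3.0, 0.0],        # payoff for playing "cooperate" vs (C, D)
              [5.0, 1.0]])       # payoff for playing "defect" vs (C, D)  (Prisoner's Dilemma-like)
alpha, tau, dt = 1.0, 1.0, 0.01  # learning rate, exploration temperature, Euler step (assumed)

def drift(p, q):
    """Time derivative of one agent's mixed strategy p against opponent strategy q."""
    u = A @ q                                           # expected payoff of each own action
    exploit = p * (u - p @ u) / tau                     # replicator (selection) term
    explore = p * (np.sum(p * np.log(p)) - np.log(p))   # entropy-driven exploration term
    return alpha * (exploit + explore)

x = np.array([0.6, 0.4])         # agent 1: P(cooperate), P(defect)
y = np.array([0.4, 0.6])         # agent 2
for _ in range(5000):
    x, y = x + dt * drift(x, y), y + dt * drift(y, x)
```

Varying the payoffs or the temperature moves the trajectories between qualitatively different regimes (stable fixed points, cycling, chaotic behaviour in larger games), which is the kind of behaviour the abstract refers to.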
126

Learning in Partially Observable Markov Decision Processes

Sachan, Mohit 21 August 2013 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Learning in Partially Observable Markov Decision Processes (POMDPs) is motivated by the essential need to address a number of realistic problems. A number of methods exist for learning in POMDPs, but learning with a limited amount of information about the model of the POMDP remains a highly anticipated feature. Learning with minimal information is desirable in complex systems, as methods requiring complete information among decision makers are impractical due to the increase in problem dimensionality. In this thesis we address the problem of decentralized control of POMDPs with unknown transition probabilities and rewards. We suggest learning in POMDPs using a tree-based approach. States of the POMDP are guessed using this tree. Each node in the tree has an automaton in it and acts as a decentralized decision maker for the POMDP. The start state of the POMDP is known as the landmark state. Each automaton in the tree uses a simple learning scheme to update its action choice and requires minimal information. The principal result derived is that, without proper knowledge of transition probabilities and rewards, the automata tree of decision makers will converge to a set of actions that maximizes the long-term expected reward per unit time obtained by the system. The analysis is based on learning in sequential stochastic games and properties of ergodic Markov chains. Simulation results are presented to compare the long-term rewards of the system under different decision control algorithms.
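A linear reward-inaction (L_R-I) automaton is one standard example of the kind of simple, low-information update such tree nodes could use. The sketch below is an illustration of that classical scheme under assumed binary feedback and step size; it is not the thesis' exact learning rule.

```python
# Minimal sketch: a linear reward-inaction (L_R-I) learning automaton over a
# finite action set. Probabilities move toward the chosen action only on
# favourable feedback; unfavourable feedback leaves them unchanged.
import numpy as np

class LearningAutomaton:
    def __init__(self, n_actions, step=0.05, seed=0):
        self.p = np.full(n_actions, 1.0 / n_actions)   # action probabilities
        self.step = step
        self.rng = np.random.default_rng(seed)
        self.last = None

    def choose(self):
        self.last = int(self.rng.choice(len(self.p), p=self.p))
        return self.last

    def update(self, favourable):
        """L_R-I update: p_a <- p_a + step * (1 - p_a), others scaled by (1 - step)."""
        if favourable:
            self.p *= (1.0 - self.step)
            self.p[self.last] += self.step
```

The update needs only the automaton's own action and a binary environment response, which reflects the minimal-information requirement emphasised in the abstract.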
127

Dynamic Routing for Fuel Optimization in Autonomous Vehicles

Regatti, Jayanth Reddy 14 August 2018 (has links)
No description available.
128

Opportunistic Scheduling Using Channel Memory in Markov-modeled Wireless Networks

Murugesan, Sugumar 26 October 2010 (has links)
No description available.
129

An empirical study of stability and variance reduction in Deep Reinforcement Learning

Lindström, Alexander January 2024 (has links)
Reinforcement Learning (RL) is a branch of AI that deals with solving complex sequential decision making problems such as training robots, trading while following patterns and trends, optimal control of industrial processes, and more. These applications span various fields, including data science, factories, finance, and others [1]. The most popular RL algorithm today is Deep Q Learning (DQL), developed by a team at DeepMind, which successfully combines RL with Neural Networks (NN). However, combining RL and NN introduces challenges such as numerical instability and unstable learning due to high variance. Among others, these issues are due to the “moving target problem”. To mitigate this problem, the target network was introduced as a solution. However, using a target network slows down learning, vastly increases memory requirements, and adds overheads in running the code. In this thesis, we conduct an empirical study to investigate the importance of target networks. We conduct this empirical study for three scenarios. In the first scenario, we train agents in online learning. The aim here is to demonstrate that the target network can be removed after some point in time without negatively affecting performance. To evaluate this scenario, we introduce the concept of the stabilization point. In the second scenario, we pre-train agents before continuing to train them in online learning. For this scenario, we demonstrate the redundancy of the target network by showing that it can be completely omitted. In the third scenario, we evaluate a newly developed activation function called Truncated Gaussian Error Linear Unit (TGeLU). For this scenario, we train an agent in online learning and show that by using TGeLU as an activation function, we can completely remove the target network. Through the empirical study of these scenarios, we conjecture and verify that a target network has only transient benefits concerning stability. We show that it has no influence on the quality of the policy found. We also observed that variance was generally higher when using a target network in the later stages of training compared to cases where the target network had been removed. Additionally, during the investigation of the second scenario, we observed that the number of training iterations during pre-training affected the agent's performance in the online learning phase. This thesis provides a deeper understanding of how the target network affects the training process of DQL; some of these findings, particularly those surrounding variance reduction, are contrary to popular belief. Additionally, the results have provided insights into potential future work. This includes further exploring the benefits of the lower variance observed when removing the target network and conducting more efficient convergence analyses for the pre-training part of the second scenario.
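The quantity at the centre of this comparison is the one-step bootstrap target. The sketch below, in PyTorch-style code with assumed network and batch shapes, contrasts computing that target with a separate target network versus with the online network itself (the setting examined when the target network is removed); it is not the thesis' implementation.

```python
# Minimal sketch: DQL bootstrap targets with and without a target network.
import torch

def dql_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99,
                use_target_network=True):
    """rewards, dones: shape (batch,); next_states: shape (batch, obs_dim)."""
    with torch.no_grad():
        net = target_net if use_target_network else online_net   # the "moving target" case
        next_q = net(next_states).max(dim=1).values               # max_a Q(s', a)
        return rewards + gamma * (1.0 - dones) * next_q           # bootstrap target

def soft_update(target_net, online_net, tau=0.005):
    """Optional Polyak averaging of target parameters (a common variant)."""
    for tp, op in zip(target_net.parameters(), online_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * op.data)
```

Setting `use_target_network=False` after the stabilization point corresponds to the first scenario described above, where training continues with the online network providing its own targets.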
130

Un mécanisme constructiviste d'apprentissage automatique, d'anticipations pour des agents artificiels situés / A Constructivist Anticipatory Learning Mechanism for Situated Artificial Agents

Studzinski Perotto, Filipo 11 June 2010 (has links)
Cette recherche se caractérise, premièrement, par une discussion théorique sur le concept d'agent autonome, basée sur des éléments issus des paradigmes de l'Intelligence Artificielle Située et de l'Intelligence Artificielle Affective. Ensuite, cette thèse présente le problème de l'apprentissage de modèles du monde, en passant en revue la littérature concernant les travaux qui s'y rapportent. À partir de ces discussions, l'architecture CAES et le mécanisme CALM sont présentés. CAES (Coupled Agent-Environment System) constitue une architecture pour décrire des systèmes basés sur la dichotomie agent-environnement. Il définit l'agent et l'environnement comme deux systèmes partiellement ouverts, en couplage dynamique. Dans CAES, l'agent est composé de deux sous-systèmes, l'esprit et le corps, suivant les principes de la situativité et de la motivation intrinsèque. CALM (Constructivist Anticipatory Learning Mechanism) est un mécanisme d'apprentissage fondé sur l'approche constructiviste de l'Intelligence Artificielle. Il permet à un agent situé de construire un modèle du monde dans des environnements partiellement observables et partiellement déterministes, sous la forme d'un processus de décision markovien partiellement observable et factorisé (FPOMDP). Le modèle du monde construit est ensuite utilisé pour que l'agent puisse définir une politique d'action visant à améliorer sa propre performance. / This research is characterized, first, by a theoretical discussion on the concept of autonomous agent, based on elements taken from the Situated AI and the Affective AI paradigms. Secondly, this thesis presents the problem of learning world models, providing a bibliographic review of related work. From these discussions, the CAES architecture and the CALM mechanism are presented. The CAES (Coupled Agent-Environment System) is an architecture for describing systems based on the agent-environment dichotomy. It defines the agent and the environment as two partially open systems in dynamic coupling. In CAES, the agent is composed of two sub-systems, mind and body, following the principles of situativity and intrinsic motivation. CALM (Constructivist Anticipatory Learning Mechanism) is based on the constructivist approach to Artificial Intelligence. It allows a situated agent to build a model of the world in partially deterministic and partially observable environments, in the form of a Factored Partially Observable Markov Decision Process (FPOMDP). The constructed world model is then used by the agent to define a policy for action in order to improve its own performance.
