Global ETD Search

1	Adaptive Fuzzy Reinforcement Learning for Flock Motion Control Qu, Shuzheng 06 January 2022 The flock-guidance problem has a challenging structure in which multiple optimization objectives must be solved simultaneously. This usually requires different control approaches for the various objectives, such as guidance, collision avoidance, and cohesion. Guidance schemes in particular have long suffered from complex tracking-error dynamics. Furthermore, techniques based on linear feedback or output feedback strategies obtained at equilibrium conditions may not hold, or may degrade, when applied to uncertain dynamic environments. Relying on potential functions embedded within pre-tuned fuzzy inference architectures lacks robustness under dynamic disturbances. This thesis introduces two adaptive distributed approaches for the autonomous control of multi-agent systems. The first technique is built on an online fuzzy reinforcement learning Value Iteration scheme, which is precise and flexible. This distributed adaptive control system simultaneously targets a number of flocking objectives, namely: 1) tracking the leader, 2) keeping a safe distance from neighboring agents, and 3) reaching a velocity consensus among the agents. In addition to being resilient to dynamic disturbances, the algorithm requires only the agent's position as a feedback signal. The effectiveness of the proposed method is validated in two simulation scenarios and benchmarked against a similar technique from the literature. The second technique is an online fuzzy recursive least squares-based Policy Iteration control scheme, which uses a recursive least squares algorithm to estimate the weights in the leader-tracking subsystem, replacing the original reinforcement learning actor-critic scheme adopted in the first technique.
The recursive least squares algorithm demonstrates faster convergence of the approximation weights. The time-invariant communication graph used in the fuzzy reinforcement learning method is also improved by using time-varying graphs, which can smoothly guide the agents to a speed consensus. The fuzzy recursive least squares-based technique is simulated in several scenarios and benchmarked against the fuzzy reinforcement learning method. The scenarios are simulated in CoppeliaSim for better visualization and more realistic results. reinforcement multi-agent value iteration policy iteration
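As a rough illustration of the recursive least squares idea mentioned in this abstract, the sketch below estimates the weights of a linear model one sample at a time. It is a generic textbook update in plain Python, not the thesis's fuzzy control scheme. The function name `rls_update`, the forgetting factor `lam`, the initial covariance scale, and the synthetic data are all assumptions made for the example.

```python
def rls_update(w, P, x, y, lam=1.0):
    """One recursive least squares step for the linear model y ~ w . x.

    w   : current weight estimate (list of floats)
    P   : current inverse-correlation matrix (symmetric, list of lists)
    x, y: new input vector and observed output
    lam : forgetting factor (1.0 means no forgetting); an assumed default
    """
    n = len(w)
    # P x, reused for the gain and for the covariance update
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    gain = [Px[i] / denom for i in range(n)]
    # A priori prediction error with the old weights
    err = y - sum(w[i] * x[i] for i in range(n))
    w_new = [w[i] + gain[i] * err for i in range(n)]
    # P <- (P - gain (P x)^T) / lam, valid because P is symmetric
    P_new = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(n)]
             for i in range(n)]
    return w_new, P_new


# Hypothetical noise-free data generated from made-up true weights.
true_w = [2.0, -1.0]
w = [0.0, 0.0]
P = [[1000.0, 0.0], [0.0, 1000.0]]  # large initial P acts as a weak prior
for t in range(50):
    x = [1.0, 0.1 * t]
    y = true_w[0] * x[0] + true_w[1] * x[1]
    w, P = rls_update(w, P, x, y)

print(w)  # should be close to the true weights
```

Each step costs a fixed amount of work and needs no stored history, which is one reason such updates suit online estimation.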
2	Hierarchical Sampling for Least-Squares Policy Iteration Schwab, Devin 26 January 2016 No description available. Computer Science reinforcement learning MaxQ LSPI Least-Squares Policy Iteration
3	Numerical Methods for Pricing a Guaranteed Minimum Withdrawal Benefit (GMWB) as a Singular Control Problem Huang, Yiqing January 2011 Guaranteed Minimum Withdrawal Benefits (GMWB) have become popular riders on variable annuities. The pricing of a GMWB contract was originally formulated as a singular stochastic control problem, which results in a Hamilton-Jacobi-Bellman (HJB) Variational Inequality (VI). A penalty method can then be used to solve the HJB VI. We present a rigorous proof of convergence of the penalty method to the viscosity solution of the HJB VI, assuming the underlying asset follows a Geometric Brownian Motion. A direct control method is an alternative formulation for the HJB VI. We also extend the HJB VI to the case where the underlying asset follows a Poisson jump diffusion. The HJB VI is normally solved numerically by an implicit method, which gives rise to highly nonlinear discretized algebraic equations. The classic policy iteration approach works well for the Geometric Brownian Motion case. However, it is not efficient in some circumstances, such as when the underlying asset follows a Poisson jump diffusion process. We develop a combined fixed point policy iteration scheme which significantly increases the efficiency of solving the discretized equations. Sufficient conditions to ensure the convergence of the combined fixed point policy iteration scheme are derived for both the penalty method and the direct control method. The GMWB formulated as a singular control problem has a special structure, which results in a block matrix fixed point policy iteration converging about one order of magnitude faster than a full matrix fixed point policy iteration. Sufficient conditions for convergence of the block matrix fixed point policy iteration are derived.
Estimates for bounds on the penalty parameter (penalty method) and scaling parameter (direct control method) are obtained so that convergence of the iteration can be expected in the presence of round-off error. Computer Science
5	Elicitation and planning in Markov decision processes with unknown rewards / Elicitation et planification dans les processus décisionnel de MARKOV avec récompenses inconnues Alizadeh, Pegah 09 December 2016 Markov decision processes (MDPs) are models for solving sequential decision problems where a user interacts with the environment and adapts her policy by taking numerical reward signals into account. The solution of an MDP reduces to formulating the user's behavior in the environment with a policy function that specifies which action to choose in each situation. In many real-world decision problems, users have various preferences, and therefore the gains of actions on states differ and should be re-decoded for each user. In this dissertation, we are interested in solving MDPs for users with different preferences. We use a model named Vector-valued MDP (VMDP) with vector rewards. We propose a propagation-search algorithm that assigns a vector-value function to each policy and identifies each user with a preference vector on the existing set of preferences, where the preference vector satisfies the user's priorities. Since the user's preference vector is not known, we present several methods for solving VMDPs while approximating the user's preference vector.
We introduce two algorithms that reduce the number of queries needed to find the optimal policy of a user: 1) a propagation-search algorithm, where we propagate a set of possible optimal policies for the given MDP without knowing the user's preferences; 2) an interactive value iteration (IVI) algorithm on VMDPs, namely the Advantage-based Value Iteration (ABVI) algorithm, which uses clustering and regrouping of advantages. We also demonstrate how the ABVI algorithm works properly for two different types of users: confident and uncertain. We finally work on a minimax regret approximation method for finding the optimal policy with respect to the limited information about the user's preferences. All possible objectives in the system are simply bounded between an upper and a lower bound, while the system is not aware of the user's preferences among them. We propose a heuristic minimax regret approximation method for solving MDPs with unknown rewards that is faster and less complex than the existing methods in the literature. Markov decision process Vector-valued MDP Policy iteration Reward elicitation
6	Verifying Value Iteration and Policy Iteration in Coq Masters, David M. 01 June 2021 No description available. Computer Science Reinforcement Learning Software Verification Coq Value Iteration Policy Iteration
7	Itération sur les politiques optimiste et apprentissage du jeu de Tetris / Optimistic Policy Iteration and Learning the Game of Tetris Thiéry, Christophe 25 November 2010 This thesis studies policy iteration methods with linear approximation of the value function for large state space problems in the reinforcement learning context. We first introduce a unified algorithm that generalizes the main stochastic optimal control methods. We show the convergence of this unified algorithm to the optimal value function in the tabular case, and a performance bound in the approximate case, when the value function is estimated. We then extend the state of the art in second-order linear approximation algorithms by proposing a generalization of Least-Squares Policy Iteration (LSPI) (Lagoudakis and Parr, 2003). Our new algorithm, Least-Squares λ Policy Iteration (LSλPI), adds to LSPI an idea from λ-Policy Iteration (Bertsekas and Ioffe, 1996): the damped (or optimistic) evaluation of the value function, which reduces the variance of the estimate and so improves sample efficiency. Thus, LSλPI offers a tunable bias-variance trade-off that may improve the estimation of the value function and the performance of the policy obtained. In the second part, we study in depth the game of Tetris, a benchmark application that several works in the literature attempt to solve. Tetris is a difficult problem because of its structure and its large state space. We provide the first full review of the literature that covers reinforcement learning works, evolutionary methods that directly explore the policy space, and hand-tuned controllers.
We observe that reinforcement learning is currently less successful on this problem than direct policy search approaches such as the cross-entropy method (Szita and Lőrincz, 2006). We finally show how we built a controller that outperforms the previously best known controllers, and briefly discuss how it allowed us to win the Tetris event of the 2008 Reinforcement Learning Competition. Stochastic optimal control Reinforcement learning Dynamic programming Markov decision processes Least-Squares Policy Iteration λ-Policy Iteration Value function approximation Tetris Cross-entropy method
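For readers unfamiliar with the cross-entropy method named in this abstract, here is a minimal sketch of the basic loop: sample parameter vectors from a Gaussian, keep the best-scoring ones, and refit the Gaussian to them. It is a generic illustration on a made-up quadratic score, not the Tetris controller or the exact variant used in the cited work. The names `cross_entropy_search` and `toy_score`, the population sizes, and the seed are assumptions.

```python
import random


def cross_entropy_search(score, dim, iterations=20, population=50,
                         elite=10, init_std=5.0, seed=0):
    """Maximize `score` over a `dim`-dimensional parameter vector."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    for _ in range(iterations):
        # Sample candidate parameter vectors from an axis-aligned Gaussian
        samples = [[rng.gauss(mean[i], std[i]) for i in range(dim)]
                   for _ in range(population)]
        # Keep the highest-scoring candidates (the "elite" set)
        samples.sort(key=score, reverse=True)
        best = samples[:elite]
        # Refit mean and standard deviation to the elite set
        mean = [sum(s[i] for s in best) / elite for i in range(dim)]
        std = [(sum((s[i] - mean[i]) ** 2 for s in best) / elite) ** 0.5
               for i in range(dim)]
    return mean


def toy_score(theta):
    # Hypothetical stand-in for "average game score of the policy with
    # weights theta"; its maximum is at theta = (3, -1).
    return -((theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2)


best_theta = cross_entropy_search(toy_score, dim=2)
print(best_theta)
```

In a policy-search setting, `score` would run the policy and return its measured performance, so the method needs no value function at all.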
8	Itération sur les Politiques Optimiste et Apprentissage du Jeu de Tetris / Optimistic Policy Iteration and Learning the Game of Tetris Thiery, Christophe 25 November 2010 This thesis studies policy iteration methods in reinforcement learning with large state spaces and linear approximation of the value function. We first propose a unification of the main algorithms of stochastic optimal control. We show that this unified version converges to the optimal value function in the tabular case, and we give a performance guarantee for the case where the value function is estimated approximately. We then extend the state of the art in second-order linear approximation algorithms by proposing a generalization of Least-Squares Policy Iteration (LSPI) (Lagoudakis and Parr, 2003). Our new algorithm, Least-Squares λ Policy Iteration (LSλPI), adds to LSPI a concept from λ-Policy Iteration (Bertsekas and Ioffe, 1996): damped (or optimistic) evaluation of the value function, which reduces the variance of the estimate in order to improve sampling efficiency. LSλPI thus offers a tunable bias-variance trade-off that can improve the estimate of the value function and the quality of the resulting policy. In the second part, we look in detail at the game of Tetris, an application that several works in the literature have addressed. Tetris is a difficult problem because of its structure and its large state space. We offer the first complete review of the literature, covering reinforcement learning works as well as evolutionary techniques that directly explore the policy space and hand-tuned algorithms.
We find that reinforcement learning approaches currently perform worse on this problem than direct policy search techniques such as the cross-entropy method (Szita and Lőrincz, 2006). Finally, we explain how we developed a Tetris player that outperforms the best previously known algorithms and with which we won the Tetris event of the 2008 Reinforcement Learning Competition. stochastic optimal control reinforcement learning dynamic programming Markov decision processes Least-Squares Policy Iteration λ-Policy Iteration value function approximation bias-variance trade-off basis functions Tetris cross-entropy method
9	Regularization in reinforcement learning Farahmand, Amir-massoud Unknown Date No description available. Reinforcement Learning Machine Learning Statistical Learning Theory Sequential Decision-Making Problems Regularization Approximate Value/Policy Iteration Model Selection Regularized Least-Squares Regression Regularized Policy Iteration Regularized Fitted Q-Iteration Regularized LSTD Error Propagation
10	Úlohy stochastického dynamického programování: teorie a aplikace / Stochastic Dynamic Programming Problems: Theory and Applications. Lendel, Gabriel January 2012 Title: Stochastic Dynamic Programming Problems: Theory and Applications Author: Gabriel Lendel Department: Department of Probability and Mathematical Statistics Supervisor: Ing. Karel Sladký CSc. Supervisor's e-mail address: sladky@utia.cas.cz Abstract: In the present work we study Markov decision processes, which provide a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. We study iterative procedures for finding a policy that is optimal or nearly optimal with respect to the selected criteria. Specifically, we mainly examine the task of finding a policy that is optimal with respect to the total expected discounted reward or the average expected reward for discrete or continuous systems. We study policy iteration algorithms and approximate value iteration algorithms, and give a numerical analysis of specific problems. Keywords: Stochastic dynamic programming, Markov decision process, policy iteration, value iteration
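Several entries in this list rely on value iteration and policy iteration for the expected discounted reward criterion. The sketch below runs both on a tiny two-state MDP so the difference is concrete. It is a generic illustration and is not taken from any of the listed theses. The transition probabilities, rewards, discount factor, and all function names are made up for the example, and policy evaluation is done iteratively rather than by solving a linear system.

```python
# Hypothetical MDP: P[s][a] lists (probability, next_state) pairs and
# R[s][a] is the expected immediate reward. All numbers are invented.
P = {
    0: {0: [(0.9, 0), (0.1, 1)], 1: [(0.2, 0), (0.8, 1)]},
    1: {0: [(0.7, 0), (0.3, 1)], 1: [(0.1, 0), (0.9, 1)]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}}
GAMMA = 0.9  # discount factor


def q_value(V, s, a):
    """Expected discounted return of taking action a in state s, then following V."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def greedy_policy(V):
    """Policy that picks, in every state, the action with the best q-value."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


def value_iteration(tol=1e-10):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new


def evaluate_policy(policy, tol=1e-12):
    """Iterative policy evaluation: Bellman backup for a fixed policy."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: q_value(V, s, policy[s]) for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new


def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in P}
    while True:
        V = evaluate_policy(policy)
        improved = greedy_policy(V)
        if improved == policy:
            return policy, V
        policy = improved


V_star = value_iteration()
pi_star, V_pi = policy_iteration()
print(greedy_policy(V_star), pi_star)
```

Value iteration updates values directly and extracts a policy at the end, while policy iteration keeps an explicit policy and typically needs far fewer outer iterations, each of which is more expensive.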
