Global ETD Search

51	Discrete-time Concurrent Learning for System Identification and Applications: Leveraging Memory Usage for Good Learning Djaneye-Boundjou, Ouboti Seydou Eyanaa January 2017 No description available. Electrical Engineering Applied Mathematics Mathematics Engineering System identification Function approximation Learning Concurrent Learning
52	Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and an Application Lakshminarayanan, Chandrashekar January 2015 Problems involving optimal sequential decision making in uncertain dynamic systems arise in domains such as engineering, science and economics. Such problems can often be cast in the framework of a Markov Decision Process (MDP). Solving an MDP requires computing the optimal value function and the optimal policy. The ideas of dynamic programming (DP) and the Bellman equation (BE) are at the heart of solution methods. The three important exact DP methods are value iteration, policy iteration and linear programming. The exact DP methods compute the optimal value function and the optimal policy. However, they are inadequate in practice because the state space is often large, and one might have to resort to approximate methods that compute sub-optimal policies. Further, in certain cases, the system observations are known only in the form of noisy samples and we need to design algorithms that learn from these samples. In this thesis we study theoretical questions pertaining to approximate and learning algorithms, and also present an application of MDPs in the domain of crowd sourcing. Approximate Dynamic Programming (ADP) methods handle the issue of large state space by computing an approximate value function and/or a sub-optimal policy. In this thesis, we are concerned with conditions that result in provably good policies. Motivated by the limitations of the projected Bellman equation (PBE) in conventional linear algebra, we study the PBE in the (min, +) linear algebra. It is a well-known fact that deterministic optimal control problems with cost/reward criterion are (min, +)/(max, +) linear, and ADP methods have been developed for such systems in the literature. However, it is straightforward to show that infinite horizon discounted reward/cost MDPs are neither (min, +) nor (max, +) linear. 
We develop novel ADP schemes, namely the Approximate Q Iteration (AQI) and Variational Approximate Q Iteration (VAQI), where the approximate solution is a (min, +) linear combination of a set of basis functions whose span constitutes a subsemimodule. We show that the new ADP methods are convergent and we present a bound on the performance of the sub-optimal policy. The Approximate Linear Program (ALP) makes use of linear function approximation (LFA) and offers theoretical performance guarantees. Nevertheless, the ALP is difficult to solve due to the presence of a large number of constraints and in practice, a reduced linear program (RLP) is solved instead. The RLP has a tractable number of constraints sampled from the original constraints of the ALP. Though the RLP is known to perform well in experiments, theoretical guarantees are available only for a specific RLP obtained under idealized assumptions. In this thesis, we generalize the RLP to define a generalized reduced linear program (GRLP) which has a tractable number of constraints that are obtained as positive linear combinations of the original constraints of the ALP. The main contribution here is the novel theoretical framework developed to obtain error bounds for any given GRLP. Reinforcement Learning (RL) algorithms can be viewed as sample trajectory based solution methods for solving MDPs. Typically, RL algorithms that make use of stochastic approximation (SA) are iterative schemes taking small steps towards the desired value at each iteration. Actor-Critic algorithms form an important sub-class of RL algorithms, wherein the critic is responsible for policy evaluation and the actor is responsible for policy improvement. The actor and critic iterations have different step-size schedules; in particular, the step-sizes used by the actor updates generally have to be much smaller than those used by the critic updates. 
Such SA schemes that use different step-size schedules for different sets of iterates are known as multi-timescale stochastic approximation schemes. One of the most important conditions required to ensure the convergence of the iterates of a multi-timescale SA scheme is that the iterates need to be stable, i.e., they should be uniformly bounded almost surely. However, the conditions that imply the stability of the iterates in a multi-timescale SA scheme have not been well established. In this thesis, we provide verifiable conditions that imply stability of two-timescale stochastic approximation schemes. As an example, we also demonstrate that the stability of a widely used actor-critic RL algorithm follows from our analysis. Crowd sourcing (crowd) is a new mode of organizing work: tasks are split into groups of smaller chunks and outsourced to a large, distributed group of people in the form of an open call. Recently, crowd sourcing has become a major pool for human intelligence tasks (HITs) such as image labeling, form digitization, natural language processing, machine translation evaluation and user surveys. Large organizations/requesters are increasingly interested in crowd sourcing the HITs generated out of their internal requirements. Task starvation leads to huge variation in the completion times of the tasks posted to the crowd. This is an issue for frequent requesters desiring predictability in the completion times of tasks, specified in terms of the percentage of tasks completed within a stipulated amount of time. An important task attribute that affects the completion time of a task is its price. However, a pricing policy that does not take the dynamics of the crowd into account might fail to achieve the desired predictability in completion times. Here, we make use of the MDP framework to compute a pricing policy that achieves predictable completion times in simulations as well as real-world experiments. 
Dynamic Programming (DP) Markov Decision Process (MDP) Bellman Equation CBE Machine Learning Bellman Operator Crowdsourcing Approximate Linear Programming (ALP) Reinforcement Learning Stochastic Approximation Approximate Dynamic Programming (ADP) Approximate Linear Program Linear Function Approximation (LFA) Reduced Linear Program (RLP) Crowd Sourcing Computer Science and Automation
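Entry 52's abstract names value iteration as one of the exact DP methods. The following is a minimal illustrative sketch of textbook value iteration on a made-up two-state MDP; it is not taken from the thesis. The names `value_iteration`, `P` and `R`, the toy transition table and the discount factor 0.9 are all assumptions chosen for illustration.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Textbook value iteration.

    P[s][a] is a list of (probability, next_state) pairs (hypothetical layout).
    R[s][a] is the immediate reward for taking action a in state s.
    Returns the approximate optimal values and a greedy policy.
    """
    n = len(P)
    V = [0.0] * n
    for _ in range(max_iter):
        # One Bellman backup: Q(s, a) = R(s, a) + gamma * E[V(next state)]
        Q = [[R[s][a] + gamma * sum(p * V[t] for p, t in P[s][a])
              for a in range(len(P[s]))] for s in range(n)]
        V_new = [max(q) for q in Q]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the last computed Q-values
    policy = [max(range(len(Q[s])), key=Q[s].__getitem__) for s in range(n)]
    return V, policy


# Toy MDP (assumed): in each state, action 0 stays put and action 1 switches state.
# Only staying in state 1 pays a reward of 1.
P = [
    [[(1.0, 0)], [(1.0, 1)]],
    [[(1.0, 1)], [(1.0, 0)]],
]
R = [[0.0, 0.0], [1.0, 0.0]]

V, policy = value_iteration(P, R, gamma=0.9)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and moving there from state 0 is worth 0.9 * 10 = 9. The table-based sweep over every state is exactly what becomes infeasible for large state spaces, which is the motivation the abstract gives for approximate methods.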
53	Online Learning and Simulation Based Algorithms for Stochastic Optimization Lakshmanan, K January 2012 In many optimization problems, the relationship between the objective and parameters is not known. The objective function itself may be stochastic, such as a long-run average over some random cost samples. In such cases finding the gradient of the objective is not possible. It is in this setting that stochastic approximation algorithms are used. These algorithms use some estimates of the gradient and are stochastic in nature. Amongst gradient estimation techniques, the Simultaneous Perturbation Stochastic Approximation (SPSA) and Smoothed Functional (SF) schemes are widely used. In this thesis we have proposed a novel multi-timescale quasi-Newton based smoothed functional (QN-SF) algorithm for unconstrained as well as constrained optimization. The algorithm uses the smoothed functional scheme for estimating the gradient and the quasi-Newton method to solve the optimization problem. The algorithm is shown to converge with probability one. We have also provided here experimental results on the problem of optimal routing in a multi-stage network of queues. Policies like Join the Shortest Queue or Least Work Left assume knowledge of the queue length values, which can change rapidly or be hard to estimate. If the only information available is the expected end-to-end delay, as in our case, such policies cannot be used. The QN-SF based probabilistic routing algorithm uses only the total end-to-end delay for tuning the probabilities. We observe from the experiments that the QN-SF algorithm has better performance than the gradient and Jacobi versions of Newton based smoothed functional algorithms. Next we consider constrained routing in a similar queueing network. We extend the QN-SF algorithm to this case. We study the convergence behavior of the algorithm and observe that the constraints are satisfied at the point of convergence. 
We provide experimental results for the constrained routing setup as well. Next we study reinforcement learning algorithms, which are useful for solving Markov Decision Processes (MDPs) when precise information on transition probabilities is not known. When the state and action sets are very large, it is not possible to store all the state-action tuples. In such cases, function approximators like neural networks have been used. The popular Q-learning algorithm is known to diverge when used with linear function approximation due to the 'off-policy' problem. Hence developing learning algorithms that remain stable when used with function approximation is an important problem. We present in this thesis a variant of Q-learning with linear function approximation that is based on two-timescale stochastic approximation. The Q-value parameters for a given policy in our algorithm are updated on the slower timescale while the policy parameters themselves are updated on the faster scale. We perform a gradient search in the space of policy parameters. Since the objective function and hence the gradient are not analytically known, we employ the efficient one-simulation simultaneous perturbation stochastic approximation (SPSA) gradient estimates that employ Hadamard matrix based deterministic perturbations. Our algorithm has the advantage that, unlike Q-learning, it does not suffer from high oscillations due to the off-policy problem when using function approximators. Whereas it is difficult to prove convergence of regular Q-learning with linear function approximation because of the off-policy problem, we prove that our algorithm, which is on-policy, is convergent. Numerical results on a multi-stage stochastic shortest path problem show that our algorithm exhibits significantly better performance and is more robust as compared to Q-learning. Future work would be to compare it with other policy-based reinforcement learning algorithms. 
Finally, we develop an online actor-critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multistage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance in this setting and converges to a feasible point. Stochastic Approximation Algorithms Stochastic Optimization Markov Decision Process Reinforcement Learning Algorithm Queueing Networks Queuing Theory Online Q-Learning Algorithm Online Actor-Critic Algorithm Markov Decision Processes Q-learning Algorithm Linear Function Approximation Computer Science
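Entry 53's abstract refers to simultaneous perturbation gradient estimates. The sketch below shows the common textbook two-measurement SPSA form with random ±1 perturbations, under several stated assumptions: it is not the thesis's one-simulation Hadamard-matrix variant, nor its QN-SF algorithm. The function names, the constant gains `a` and `c`, and the noise-free quadratic test objective are hypothetical choices; practical SPSA typically uses decaying gains and noisy measurements.

```python
import random


def spsa_minimize(f, x0, a=0.1, c=0.1, iters=200, seed=0):
    """Minimize f using two-measurement SPSA with constant gains (illustrative)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        # Perturb every coordinate at once with independent +/-1 signs
        delta = [rng.choice((-1.0, 1.0)) for _ in x]
        x_plus = [xi + c * di for xi, di in zip(x, delta)]
        x_minus = [xi - c * di for xi, di in zip(x, delta)]
        # Only two evaluations of f are needed, whatever the dimension
        diff = f(x_plus) - f(x_minus)
        grad_est = [diff / (2.0 * c * di) for di in delta]
        x = [xi - a * gi for xi, gi in zip(x, grad_est)]
    return x


def quadratic(x):
    """Assumed test objective with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


x_opt = spsa_minimize(quadratic, [0.0, 0.0])
```

The point of the construction is that the gradient estimate costs two function evaluations per step regardless of the number of parameters, which is why such estimators suit settings where the objective is available only through simulation.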
54	Large state spaces and self-supervision in reinforcement learning Touati, Ahmed 08 1900 Reinforcement Learning (RL) is an agent-oriented learning paradigm concerned with learning by interacting with an uncertain environment. Combined with deep neural networks as function approximators, deep reinforcement learning (Deep RL) has recently made it possible to tackle highly complex tasks and has enabled artificial agents to master classic games like Go, play video games from pixels, and solve robotic control tasks. However, a closer look at these remarkable empirical successes reveals some fundamental limitations. First, it has been challenging to combine desirable features of RL algorithms, such as off-policy and multi-step learning, with function approximation in a way that leads to both stable and efficient algorithms in large state spaces. Moreover, Deep RL algorithms tend to be very sample inefficient due to the rudimentary exploration-exploitation strategies these approaches employ. 
Finally, they require an enormous amount of supervised data and end up producing a narrow agent able to solve only the task that it was trained on. In this thesis, we propose novel solutions to the problems of off-policy learning and the exploration-exploitation dilemma in large state spaces, as well as self-supervision in RL. On the topic of off-policy learning, we provide two contributions. First, for the problem of policy evaluation, we show that combining popular off-policy and multi-step learning methods with linear value function parameterization could lead to undesirable instability, and we derive a provably convergent variant of these methods. Second, for policy optimization, we propose to stabilize the policy improvement step through an off-policy divergence regularization that constrains the discounted state-action visitation induced by consecutive policies to be close to one another. Next, we study online learning in large state spaces and we focus on two structural assumptions to make the problem tractable: smooth and linear environments. For smooth environments, we propose an efficient online algorithm that actively learns an adaptive partitioning of the joint space by zooming in on more promising and frequently visited regions. For linear environments, we study a more realistic setting, where the environment is now allowed to evolve dynamically and even adversarially over time, but the total change is still bounded. To address this setting, we propose an efficient online algorithm based on weighted least squares value iteration. It uses exponential weights to smoothly forget data that are far in the past, which drives the agent to keep exploring to discover changes. Finally, beyond the classical RL setting, we consider an agent interacting with its environment without a reward signal. We propose to learn a pair of representations that map state-action pairs to some latent space. 
During the unsupervised phase, these representations are trained using reward-free interactions to encode long-range relationships between states and actions, via a predictive occupancy map. At test time, once a reward function is revealed, we show that the optimal policy for that reward is directly obtained from these representations, with no planning. This is a step towards building fully controllable agents. A common theme in the thesis is the design of provable RL algorithms that generalize. In the first and the second part, we deal with generalization in large state spaces either by linear function approximation or state aggregation. In the last part, we focus on generalization over reward functions and we propose a task-agnostic representation learning framework that is provably able to optimize any reward function. reinforcement learning Markov decision process artificial agent off-policy learning function approximation exploration-exploitation trade-off self-supervision generalization
55	Accelerated algorithms for temporal difference learning methods Rankawat, Anushree 12 1900 The central idea of this thesis is to understand the notion of acceleration in stochastic approximation (SA) algorithms. Specifically, we attempt to answer the question: How does acceleration naturally show up in SA algorithms? 
We adopt a dynamical systems approach and propose new accelerated methods for temporal difference (TD) learning with linear function approximation: Polyak TD(0) and Nesterov TD(0). In contrast to earlier works, our methods do not rely on viewing TD methods as gradient descent methods. We study the interplay between acceleration, stability, and convergence of the proposed accelerated methods in continuous time. To establish the convergence of the underlying dynamical system, we analyze continuous-time models of the proposed accelerated stochastic approximation methods by deriving the conservation law in a dilated coordinate system. We show that the underlying dynamical system of our proposed algorithms converges at an accelerated rate. This framework also provides us with recommendations for the choice of the damping parameters to obtain this convergent behavior. Finally, we discretize these convergent ODEs using two different discretization schemes, explicit Euler and symplectic Euler, and analyze their performance on small, linear prediction tasks. Temporal difference learning Stochastic Approximation Accelerated methods Momentum methods Reinforcement learning Approximate Dynamic Programming Function approximation Conservation laws Convergence rates Machine learning
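Entry 55's abstract builds on TD(0) with linear function approximation. As a point of reference, here is a minimal sketch of plain, unaccelerated TD(0); it is not the thesis's Polyak TD(0) or Nesterov TD(0), whose momentum terms are not reproduced here. The helper names, the deterministic two-state cycle, the one-hot features and the step size are all assumptions made so that the result can be checked by hand.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def td0_linear(features, transitions, gamma, alpha, steps, start=0):
    """Plain TD(0) policy evaluation with a linear value estimate theta . phi(s).

    features[s] is the feature vector phi(s).
    transitions[s] is a (reward, next_state) pair; the chain is deterministic
    here for simplicity, whereas in general transitions would be sampled.
    """
    theta = [0.0] * len(features[0])
    s = start
    for _ in range(steps):
        r, s_next = transitions[s]
        # Temporal-difference error for the observed transition
        td_error = r + gamma * dot(theta, features[s_next]) - dot(theta, features[s])
        # Semi-gradient update along the features of the current state
        theta = [w + alpha * td_error * x for w, x in zip(theta, features[s])]
        s = s_next
    return theta


# Assumed toy chain: state 0 -> state 1 (reward 1), state 1 -> state 0 (reward 0)
features = [[1.0, 0.0], [0.0, 1.0]]
transitions = [(1.0, 1), (0.0, 0)]

theta = td0_linear(features, transitions, gamma=0.5, alpha=0.1, steps=2000)
```

With discount 0.5 the true values solve V0 = 1 + 0.5 V1 and V1 = 0.5 V0, giving V0 = 4/3 and V1 = 2/3, and with one-hot features the weights converge to exactly these values.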
