Global ETD Search

1. Learning and planning with noise in optimization and reinforcement learning
Thomas, Valentin (06 1900)

Most modern machine learning algorithms incorporate a degree of randomness in their processes, which we will refer to as noise and which can ultimately affect the model's predictions. In this thesis, we take a closer look at learning and planning in the presence of noise for reinforcement learning and optimization algorithms.

The first two articles presented in this document focus on reinforcement learning in an unknown environment, specifically on how we can design algorithms that use the stochasticity of their policy and of the environment to their advantage. Our first contribution focuses on the unsupervised reinforcement learning setting. We show how an agent left alone in an unknown world without any specified goal can learn which aspects of the environment it can control independently of each other, while jointly learning a disentangled latent representation of these aspects, which we call factors of variation. The second contribution focuses on planning in continuous control tasks. By framing reinforcement learning as an inference problem, we borrow tools from the Sequential Monte Carlo literature to design a theoretically grounded and efficient algorithm for probabilistic planning using a learned model of the world. We show how the agent can leverage the uncertainty of the model to imagine a diverse set of solutions.

The following two contributions analyze the impact of gradient noise due to sampling in optimization algorithms. The third contribution examines the role of gradient noise in maximum likelihood estimation with stochastic gradient descent, exploring how the relationship between the structure of the gradient noise and the local curvature affects the generalization and convergence speed of the model. Our fourth contribution returns to the topic of reinforcement learning to analyze the impact of sampling noise on the policy gradient algorithm. We find that sampling noise can significantly impact the optimization dynamics and the policies discovered in on-policy reinforcement learning.

Keywords: Deep reinforcement learning, Reinforcement learning, Planning, Stochastic optimization, Generalization, Control as inference, Representation learning
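The third contribution above links the structure of gradient noise to local curvature in maximum likelihood estimation. As an illustration of one standard fact behind that kind of analysis (not the thesis's own method), the sketch below checks numerically that, for a well-specified model evaluated at the true parameters, the covariance of per-example gradients (the Fisher information) matches the Hessian of the expected negative log-likelihood. The logistic-regression setup, the true weights `w_true`, and all function names are hypothetical choices made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sample_data(w_true, n, rng):
    """Draw (x, y) pairs from a well-specified logistic model with x ~ N(0, I)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = 1.0 if rng.random() < sigmoid(dot(w_true, x)) else 0.0
        data.append((x, y))
    return data


def per_example_grad(w, x, y):
    """Gradient of one example's negative log-likelihood: (p - y) * x."""
    p = sigmoid(dot(w, x))
    return [(p - y) * xi for xi in x]


def gradient_noise_covariance(w, data):
    """Empirical covariance of per-example gradients (single-sample SGD noise)."""
    d, n = len(w), len(data)
    grads = [per_example_grad(w, x, y) for x, y in data]
    mean = [sum(g[i] for g in grads) / n for i in range(d)]
    return [[sum((g[i] - mean[i]) * (g[j] - mean[j]) for g in grads) / n
             for j in range(d)] for i in range(d)]


def nll_hessian(w, data):
    """Hessian of the average negative log-likelihood: mean of p(1 - p) x x^T."""
    d, n = len(w), len(data)
    h = [[0.0] * d for _ in range(d)]
    for x, _ in data:
        p = sigmoid(dot(w, x))
        for i in range(d):
            for j in range(d):
                h[i][j] += p * (1.0 - p) * x[i] * x[j] / n
    return h


rng = random.Random(0)
w_true = [1.0, -0.5]  # hypothetical true parameters
data = sample_data(w_true, 20_000, rng)
cov = gradient_noise_covariance(w_true, data)
hess = nll_hessian(w_true, data)
for row_c, row_h in zip(cov, hess):
    print([round(c, 3) for c in row_c], [round(h, 3) for h in row_h])
```

The two printed matrices should agree up to sampling error. A minibatch of size B drawn i.i.d. with replacement averages B such gradients, so its noise covariance is this matrix divided by B; away from the true parameters, or under model misspecification, the two matrices generally differ.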
2. Searching for Q*
Piché, Alexandre (04 1900)

The research in this thesis can be seen through the common lens of “Searching for Q*” and aims to highlight the effectiveness of combining deep reinforcement learning (RL) systems and search. Deep RL allows us to learn: 1) rich policies from which we can sample potential future actions, and 2) accurate Q-functions that allow the agent to evaluate the potential impact of its actions before taking them. Search allows the agent to use computation to improve its policy by evaluating multiple potential sequences of future actions and selecting the most promising one. In this thesis, we explore different ways to combine these two components so that they improve one another and yield stronger agents.

The first contribution of this thesis frames RL and planning as an inference problem. This framing enables us to leverage Sequential Monte Carlo techniques to approximate a distribution over the optimal planned trajectories. The second contribution highlights a connection between the target networks used in deep Q-learning and functional regularization, leading us to a more flexible and “proper” regularization of Q-functions. The third contribution simplifies the RL via supervised learning (RvS) problem by directly modeling the future return as a distribution, allowing the agent to sample returns conditioned on the current state on the fly, instead of treating the target return as an environment-specific hyperparameter. Finally, the fourth contribution proposes a novel iterative optimization algorithm for large language models based on self-evaluation and self-prompting, which reduces the model's hallucination rate without compromising its helpfulness.

Keywords: Deep reinforcement learning, Planning, Search, Q-learning, Control-as-inference
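The first contribution above frames planning as inference and uses Sequential Monte Carlo to approximate a distribution over optimal trajectories. The sketch below is a minimal bootstrap particle filter over action sequences under the control-as-inference view, where the target is the prior policy times exp(sum of rewards / temperature). It assumes a hypothetical known 1-D toy model (`toy_model`, `toy_reward`, `GOAL`) and a N(0, 1) prior policy; the thesis works with a learned world model, and this toy is not its algorithm.

```python
import math
import random

GOAL = 3.0  # hypothetical goal position for the toy task


def toy_model(s, a):
    """Hypothetical known dynamics: a point on a line is moved by the action."""
    return s + a


def toy_reward(s):
    """Hypothetical reward: negative squared distance to the goal."""
    return -(s - GOAL) ** 2


def smc_plan(s0, horizon, n_particles, rng, temperature=1.0):
    """Bootstrap SMC over action sequences.

    Target: prior(actions) * exp(sum_t r_t / temperature), prior policy N(0, 1).
    Returns the final particles as (last_state, action_sequence) pairs.
    """
    particles = [(s0, []) for _ in range(n_particles)]
    for _ in range(horizon):
        # Propose: extend each particle with an action sampled from the prior policy.
        proposed = []
        for s, actions in particles:
            a = rng.gauss(0.0, 1.0)
            proposed.append((toy_model(s, a), actions + [a]))
        # Weight: incremental "optimality" likelihood exp(r_t / temperature).
        log_w = [toy_reward(s) / temperature for s, _ in proposed]
        m = max(log_w)  # subtract the max for numerical stability
        weights = [math.exp(lw - m) for lw in log_w]
        # Resample: multinomial resampling, which resets the weights to uniform.
        particles = rng.choices(proposed, weights=weights, k=n_particles)
    return particles


rng = random.Random(0)
plans = smc_plan(s0=0.0, horizon=5, n_particles=500, rng=rng)
mean_final = sum(s for s, _ in plans) / len(plans)
mean_first_action = sum(actions[0] for _, actions in plans) / len(plans)
print(round(mean_final, 2), round(mean_first_action, 2))
```

Because the surviving particles are samples rather than a single optimum, the planner returns a set of plausible plans; one common choice, in a model-predictive loop, is to execute the first action of a sampled plan and replan at the next step.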
