Return to search

An empirical study of stability and variance reduction in DeepReinforcement Learning

Reinforcement Learning (RL) is a branch of AI that deals with solving complex sequential decision making problems such as training robots, trading while following patterns and trends, optimal control of industrial processes, and more. These applications span various fields, including data science, factories, finance, and others[1]. The most popular RL algorithm today is Deep Q Learning (DQL), developed by a team at DeepMind, which successfully combines RL with Neural Network (NN). However, combining RL and NN introduces challenges such as numerical instability and unstable learning due to high variance. Among others, these issues are due to the“moving target problem”. To mitigate this problem, the target network was introduced as a solution. However, using a target network slows down learning, vastly increases memory requirements, and adds overheads in running the code. In this thesis, we conduct an empirical study to investigate the importance of target networks. We conduct this empirical study for three scenarios. In the first scenario, we train agents in online learning. The aim here is to demonstrate that the target network can be removed after some point in time without negatively affecting performance. To evaluate this scenario, we introduce the concept of the stabilization point. In thesecond scenario, we pre-train agents before continuing to train them in online learning. For this scenario, we demonstrate the redundancy of the target network by showing that it can be completely omitted. In the third scenario, we evaluate a newly developed activation function called Truncated Gaussian Error Linear Unit (TGeLU). For thisscenario, we train an agent in online learning and show that by using TGeLU as anactivation function, we can completely remove the target network. Through the empirical study of these scenarios, we conjecture and verify that a target network has only transient benefits concerning stability. We show that it has no influence on the quality of the policy found. We also observed that variance was generally higher when using a target network in the later stages of training compared to cases where the target network had been removed. Additionally, during the investigation of the second scenario, we observed that the magnitude of training iterations during pre-training affected the agent’s performance in the online learning phase. This thesis provides a deeper understanding of how the target networkaffects the training process of DQL, some of them - surrounding variance reduction- are contrary to popular belief. Additionally, the results have provided insights into potential future work. These include further explore the benefits of lower variance observed when removing the target network and conducting more efficient convergence analyses for the pre-training part in the second scenario.

Identiferoai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kau-100210
Date January 2024
CreatorsLindström, Alexander
PublisherKarlstads universitet, Institutionen för matematik och datavetenskap (from 2013)
Source SetsDiVA Archive at Upsalla University
LanguageEnglish
Detected LanguageEnglish
TypeStudent thesis, info:eu-repo/semantics/bachelorThesis, text
Formatapplication/pdf
Rightsinfo:eu-repo/semantics/openAccess

Page generated in 0.003 seconds