1 |
Learning successful strategies in repeated general-sum games / Crandall, Jacob W., January 2005
Thesis (Ph.D.)--Brigham Young University. Dept. of Computer Science, 2005. / Includes bibliographical references (p. 163-168).
|
2 |
Limitations and extensions of the WoLF-PHC algorithm / Cook, Philip R., January 2007
Thesis (M.S.)--Brigham Young University. Dept. of Computer Science, 2007. / Includes bibliographical references (p. 93-101).
|
3 |
Reinforcement Learning-based Human Operator Decision Support Agent for Highly Transient Industrial Processes / Jianqi Ruan (18066763), 03 March 2024
<p dir="ltr"> Most industrial processes are not fully-automated. Although reference tracking can be handled by low-level controllers, initializing and adjusting the reference, or setpoint, values, are commonly tasks assigned to human operators. A major challenge that arises, though, is control policy variation among operators which in turn results in inconsistencies in the final product. In order to guide operators to pursue better and more consistent performance, researchers have explored the optimal control policy through different approaches. Although in different applications, researchers use different approaches, an accurate process model is still crucial to the approaches. However, for a highly transient process (e.g., the startup of a manufacturing process), modeling can be challenging and inaccurate, and approaches highly relying on a process model may not work well. One example is process startup in a twin-roll steel strip casting process and motivates this work. </p><p dir="ltr"><br></p><p dir="ltr"> In this dissertation, I propose three offline reinforcement learning (RL) algorithms which require the RL agent to learn a control policy from a fixed dataset that is pre-collected by human operators during operations of the twin-roll casting process. Compared to existing offline RL algorithms, the proposed algorithms focus on exploiting the best control policy used by human operators rather than exploring new control policies constrained by the existing policies. In addition, in existing offline RL algorithms, there is not enough consideration of the imbalanced dataset problem. In the second and the third proposed algorithms, I leverage the idea of cost sensitive learning to incentivize the RL agent to learn the most valuable control policy, rather than the most common one represented in the dataset. In addition, since the process model is not available, I propose a performance metric that does not require a process model or simulator for agent testing. The third proposed algorithm is compared with benchmark offline RL algorithms and achieves better and more consistent performance.</p>
|
4 |
Feature Adaptation Algorithms for Reinforcement Learning with Applications to Wireless Sensor Networks and Road Traffic Control / Prabuchandran, K J, January 2016
Many sequential decision-making problems under uncertainty arising in engineering, science and economics are modelled as Markov Decision Processes (MDPs). In the setting of MDPs, the goal is to find a state-dependent optimal sequence of actions that minimizes a certain long-term performance criterion. The standard dynamic programming approach to solving an MDP for the optimal decisions requires a complete model of the MDP and is computationally feasible only for small state-action MDPs. Reinforcement learning (RL) methods, on the other hand, are model-free, simulation-based approaches for solving MDPs. In many real-world applications, one is often faced with MDPs that have large state-action spaces, whose model is unknown but whose outcomes can be simulated. In order to solve such (large) MDPs, one either resorts to the technique of function approximation in conjunction with RL methods or develops application-specific RL methods. A solution based on RL methods with function approximation comes with the associated problem of choosing the right features for approximation, while a solution based on application-specific RL methods relies primarily on exploiting the problem structure. In this thesis, we investigate the problem of choosing the right features for RL methods based on function approximation and develop novel RL algorithms that adaptively obtain the best features for approximation. Subsequently, we also develop problem-specific RL methods for applications arising in the areas of wireless sensor networks and road traffic control.
In the first part of the thesis, we consider the problem of finding the best features for value function approximation in reinforcement learning under the long-run discounted cost objective. We quantify the error in the approximation, for any given set of features and approximation parameter, by the mean square Bellman error (MSBE) objective and develop an online algorithm to optimize the MSBE.
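For reference, a standard way to write such an objective under linear value function approximation and a discounted cost (a generic textbook form, not necessarily the thesis's exact formulation) is

$$\mathrm{MSBE}(\theta,\Phi)=\mathbb{E}_{s\sim\nu}\Big[\Big(V_{\theta}(s)-\mathbb{E}\big[g(s,\pi(s),s')+\gamma\,V_{\theta}(s')\,\big|\,s\big]\Big)^{2}\Big],\qquad V_{\theta}(s)=\theta^{\top}\phi(s),$$

where $g$ is the single-stage cost, $\gamma\in(0,1)$ the discount factor, $\nu$ a sampling distribution over states, and $\phi$ the feature map (the columns of $\Phi$) whose choice is optimized along with the parameter $\theta$.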
Subsequently, we propose the first online actor-critic scheme with adaptive bases to find a locally optimal (control) policy for an MDP under the weighted discounted cost objective. The actor performs a gradient search in the space of policy parameters using simultaneous perturbation stochastic approximation (SPSA) gradient estimates. This gradient computation, however, requires estimates of the value function of the policy. The value function is approximated using a linear architecture, and its estimate is obtained from the critic. The error in the approximation of the value function results in sub-optimal policies. Thus, we obtain the best features by performing gradient descent on the Grassmannian of features to minimize the MSBE objective. We provide a proof of convergence of our control algorithm to a locally optimal policy and show numerical results illustrating the performance of our algorithm.
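A minimal sketch of the generic two-measurement SPSA gradient estimator that such an actor update relies on is given below; the quadratic test function, step sizes and perturbation size are illustrative assumptions, not the thesis's setup.

```python
# Generic two-measurement SPSA gradient estimate (illustrative sketch only).
import numpy as np

def spsa_gradient(loss, theta: np.ndarray, c: float = 0.1, rng=None) -> np.ndarray:
    """Estimate grad loss(theta) from two loss evaluations.

    A random Rademacher (+/-1) perturbation Delta is applied to all coordinates
    simultaneously, so the cost is two evaluations regardless of dimension.
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    return (loss(theta + c * delta) - loss(theta - c * delta)) / (2.0 * c * delta)

# Tiny usage example on a quadratic "performance" function.
f = lambda th: float(np.sum((th - 3.0) ** 2))
theta = np.zeros(4)
for k in range(200):
    theta -= 0.05 * spsa_gradient(f, theta, c=0.1)
print("theta after SPSA descent:", np.round(theta, 2))  # approaches 3.0 per coordinate
```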
In our next work, we develop an online actor-critic control algorithm with adaptive feature tuning for MDPs under the long-run average cost objective. In this setting, a gradient search in the policy parameters is performed using policy gradient estimates to improve the performance of the actor. The computation of the aforementioned gradient, however, requires estimates of the differential value function of the policy. In order to obtain good estimates of the differential value function, the critic adaptively tunes the features to obtain the best representation of the value function using gradient search in the Grassmannian of features. We prove that our actor-critic algorithm converges to a locally optimal policy. Experiments on two different MDP settings show performance improvements resulting from our feature adaptation scheme.
In the second part of the thesis, we develop problem-specific RL solution methods for the two aforementioned applications. In both applications, the size of the state-action space in the formulated MDPs is large. However, by utilizing the problem structure, we develop scalable RL algorithms.
In the wireless sensor networks application, we develop RL algorithms to find optimal energy management policies (EMPs) for energy harvesting (EH) sensor nodes. First, we consider the case of a single EH sensor node and formulate the problem of finding an optimal EMP in the discounted cost MDP setting. We then propose two RL algorithms to maximize network performance. Through simulations, our algorithms are seen to outperform the algorithms in the literature. Our RL algorithms for the single EH sensor node do not scale when there are multiple sensor nodes. In our second work, we consider the problem of finding optimal energy sharing policies that maximize the network performance of a system comprising multiple sensor nodes and a single energy harvesting (EH) source. We develop efficient energy sharing algorithms, namely a Q-learning algorithm with exploration mechanisms based on the ε-greedy method as well as the upper confidence bound (UCB). We extend these algorithms by incorporating state and action space aggregation to tackle the state-action space explosion in the MDP. We also develop a cross-entropy-based method that incorporates policy parameterization in order to find near-optimal energy sharing policies. Through numerical experiments, we show that our algorithms yield energy sharing policies that outperform the heuristic greedy method.
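For readers unfamiliar with the exploration mechanisms mentioned, here is a generic tabular Q-learning loop with ε-greedy exploration; the toy "energy-sharing" transition and reward below are assumptions invented for the example and do not reflect the thesis's MDP.

```python
# Illustrative tabular Q-learning with epsilon-greedy exploration (generic form).
# The toy transition/reward model is an assumption made purely for the example.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3        # e.g., discretized energy levels x sharing decisions
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(s, a):
    """Toy dynamics: reward favors moderate sharing; next state drifts with the action."""
    reward = -abs(a - 1) + rng.normal(scale=0.1)
    return reward, (s + a) % n_states

s = 0
for t in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    r, s_next = step(s, a)
    # Standard Q-learning (off-policy temporal-difference) update.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("greedy action per state:", Q.argmax(axis=1))
```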
In the context of road traffic control, optimal control of traffic lights at junctions, or traffic signal control (TSC), is essential for reducing the average delay experienced by road users. This problem is hard to solve when simultaneously considering all the junctions in the road network. So, we propose a decentralized multi-agent reinforcement learning (MARL) algorithm for solving this problem by considering each junction in the road network as a separate agent (controller) to obtain dynamic TSC policies. We propose two approaches to minimize the average delay. In the first approach, each agent decides the signal duration of its phases in a round-robin (RR) manner using the multi-agent Q-learning algorithm. We show through simulations over VISSIM (a microscopic traffic simulator) that our round-robin MARL algorithms perform significantly better than both the standard fixed signal timing (FST) algorithm and the saturation balancing (SAT) algorithm over two real road networks. In the second approach, instead of optimizing the green light duration, each agent optimizes the order of the phase sequence. We then employ our MARL algorithms by suitably changing the state-action space and cost structure of the MDP. We show through simulations over VISSIM that our non-round-robin MARL algorithms perform significantly better than the FST, SAT and round-robin MARL algorithms based on the first approach. On the other hand, our round-robin MARL algorithms are more practically viable as they conform to the psychology of road users.
|
5 |
Simulation Based Algorithms For Markov Decision Process And Stochastic Optimization / Abdulla, Mohammed Shahid, 05 1900
In Chapter 2, we propose several two-timescale simulation-based actor-critic algorithms for the solution of infinite-horizon Markov Decision Processes (MDPs) with finite state space under the average cost criterion. On the slower timescale, all the algorithms perform a gradient search over the corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and averaged for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, a memory-efficient implementation using a feature-vector representation of the state space and TD(0) learning along the faster timescale is discussed. A three-timescale simulation-based algorithm for the solution of infinite-horizon discounted-cost MDPs via the Value Iteration approach is also proposed. An approximation of the Dynamic Programming operator T is applied to the value function iterates. A sketch of convergence explaining the dynamics of the algorithm using associated ODEs is presented. Numerical experiments with the proposed algorithms on rate-based flow control at a bottleneck node, using a continuous-time queueing model, are presented.
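A minimal sketch of TD(0) value estimation with a linear feature-vector representation, of the generic kind referred to for the faster timescale, is shown below; the random-walk dynamics, single-stage cost and random features are illustrative assumptions, not the chapter's model.

```python
# Generic TD(0) policy evaluation with a linear feature representation (sketch only).
import numpy as np

rng = np.random.default_rng(2)
n_states, n_features = 10, 4
Phi = rng.normal(size=(n_states, n_features))   # one feature vector per state
theta = np.zeros(n_features)                    # value approximation V(s) ~ Phi[s] @ theta
alpha, gamma = 0.05, 0.9

s = 0
for t in range(20000):
    s_next = (s + rng.choice([-1, 1])) % n_states   # toy random-walk dynamics
    cost = 1.0 if s_next == 0 else 0.0              # toy single-stage cost
    # TD(0) update: move theta along the feature vector by the temporal-difference error.
    td_error = cost + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * td_error * Phi[s]
    s = s_next

print("approximate values per state:", np.round(Phi @ theta, 2))
```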
Next, in Chapter 3, we develop three simulation-based algorithms for finite-horizon MDPs (FH-MDPs). The first algorithm is developed for finite state and compact action spaces, while the other two are for finite state and finite action spaces. Convergence analysis is briefly sketched. We then concentrate on methods to mitigate the curse of dimensionality that affects FH-MDPs severely, as there is one probability transition matrix per stage. Two parametrized actor-critic algorithms for FH-MDPs with compact action sets are proposed, the 'critic' in both algorithms learning the policy gradient. We show convergence w.p. 1 to a set satisfying the necessary conditions for constrained optima. Further, a third algorithm for stochastic control of stopping-time processes is presented. Numerical experiments with the proposed finite-horizon algorithms are shown for a problem of flow control in communication networks.
Towards stochastic optimization, in Chapter 4, we propose five algorithms that are variants of SPSA. The original one-measurement SPSA uses an estimate of the gradient of the objective function L containing an additional bias term not seen in two-measurement SPSA. We propose a one-measurement algorithm that eliminates this bias and has asymptotic convergence properties that make for easier comparison with two-measurement SPSA. The algorithm, under certain conditions, outperforms both forms of SPSA, with the only overhead being the storage of a single measurement. We also propose a similar algorithm that uses perturbations obtained from normalized Hadamard matrices. Convergence w.p. 1 of both algorithms is established. We extend measurement reuse to design three second-order SPSA algorithms, sketch the convergence analysis and present simulation results on an illustrative minimization problem. We then propose several stochastic approximation implementations for related algorithms in flow control of communication networks, beginning with a discrete-time implementation of Kelly's primal flow-control algorithm. Convergence with probability 1 is shown, even in the presence of communication delays and stochastic effects seen in link congestion indications. Two relevant enhancements are then pursued: (a) an implementation of the primal algorithm using second-order information, and (b) an implementation where edge routers rectify misbehaving flows. Discrete-time implementations of Kelly's dual algorithm and primal-dual algorithm are also proposed. Simulation results (a) verifying the proposed algorithms and (b) comparing stability properties with an algorithm in the literature are presented.
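For context, the textbook one- and two-measurement SPSA gradient estimates are

$$\hat g^{(1)}_i(\theta_k)=\frac{L(\theta_k+c_k\Delta_k)}{c_k\,\Delta_{k,i}},\qquad \hat g^{(2)}_i(\theta_k)=\frac{L(\theta_k+c_k\Delta_k)-L(\theta_k-c_k\Delta_k)}{2\,c_k\,\Delta_{k,i}},$$

with $\Delta_k$ a vector of i.i.d. symmetric $\pm 1$ perturbations. A first-order expansion of $L(\theta_k+c_k\Delta_k)$ shows that the one-measurement form carries an extra $L(\theta_k)/(c_k\Delta_{k,i})$ term that the two-measurement form cancels; this is the additional bias term referred to above (the chapter's bias-eliminating variant itself is not reproduced here).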
|
6 |
Machine Learning-based MAC Protocols for LoRa IoT Networks / Dayrene Frometa Fonseca, 24 June 2020
With the massive growth of the Internet of Things (IoT), novel wireless communication technologies have emerged to address the long-range, low-cost, and low-power consumption requirements of the IoT applications. In this context, the Low Power Wide Area Networks (LPWANs) have appeared, offering different solutions that meet the IoT applications requirements mentioned before. Among the existing LPWAN solutions, LoRaWAN has stood out for receiving significant attention from both industry and academia in recent years. Although LoRaWAN offers a compelling combination of long-range and low-power consumption data transmissions, it still faces several challenges in terms of reliability and scalability. However, due to its open-source nature and the flexibility of the modulation scheme it is based on (Long Range (LoRa) modulation allows the adjustment of spreading factors and transmit power), LoRaWAN also offers important possibilities for improvements. This thesis takes advantage of the appropriateness of
the Reinforcement Learning (RL) algorithms for solving decision-making tasks and uses them to dynamically adjust the transmission parameters of LoRaWAN end devices. The proposed system, called RL-LoRa, shows significant improvements in terms of reliability and scalability when compared with LoRaWAN. Specifically, it decreases the average Packet Error Ratio (PER) of LoRaWAN by 15 percent, which can further increase the network scalability.
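As a purely hypothetical sketch of the general idea (an end device learning which transmission parameters to use from delivery feedback), the snippet below runs an ε-greedy action-value rule over (spreading factor, transmit power) pairs; the parameter grid, simulated channel and reward are assumptions for illustration and are not RL-LoRa's design.

```python
# Hypothetical sketch: an end device picking a (spreading factor, transmit power)
# combination with a simple epsilon-greedy action-value rule driven by ACK feedback.
import itertools
import random

SPREADING_FACTORS = [7, 8, 9, 10, 11, 12]
TX_POWERS_DBM = [2, 5, 8, 11, 14]
ACTIONS = list(itertools.product(SPREADING_FACTORS, TX_POWERS_DBM))

q = {a: 0.0 for a in ACTIONS}     # running value estimate per parameter combination
n = {a: 0 for a in ACTIONS}
EPSILON = 0.1

def transmit(sf: int, power_dbm: int) -> bool:
    """Stand-in for a real uplink: higher SF/power -> higher simulated delivery odds."""
    return random.random() < min(0.95, 0.3 + 0.05 * (sf - 7) + 0.03 * power_dbm)

for packet in range(2000):
    action = random.choice(ACTIONS) if random.random() < EPSILON else max(q, key=q.get)
    sf, power = action
    # Reward trades off delivery against energy: an ACK earns 1, minus a power penalty.
    reward = (1.0 if transmit(sf, power) else 0.0) - 0.02 * power
    n[action] += 1
    q[action] += (reward - q[action]) / n[action]   # incremental mean update

best_sf, best_power = max(q, key=q.get)
print(f"preferred parameters after training: SF{best_sf}, {best_power} dBm")
```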
|
7 |
Autonomous Driving with Deep Reinforcement Learning / Zhu, Yuhua, 17 May 2023
The researcher developed an autonomous driving simulation by training an end-to-end policy model with deep reinforcement learning algorithms in the Gym-duckietown virtual environment. The control strategy of the model was designed for the lane-following task. Several reinforcement learning algorithms were implemented, and the SAC algorithm was chosen to train a non-end-to-end model that takes information provided by the environment, such as speed, as input, as well as an end-to-end model that takes images captured by the agent's front camera as input. In this thesis, the researcher compared the advantages and disadvantages of the two models using kinetic parameters in the environment and conducted a series of experiments on the control strategy of the end-to-end model to explore the effects of different environmental parameters or reward functions on the models. (A toy reward-shaping sketch in the spirit of Section 3.7 follows the table of contents below.)

CHAPTER 1 INTRODUCTION 1
1.1 AUTONOMOUS DRIVING OVERVIEW 1
1.2 RESEARCH QUESTIONS AND METHODS 3
1.2.1 Research Questions 3
1.2.2 Research Methods 4
1.3 PAPER STRUCTURE 5
CHAPTER 2 RESEARCH BACKGROUND 7
2.1 RESEARCH STATUS 7
2.2 THEORETICAL BASIS 8
2.2.1 Machine Learning 8
2.2.2 Deep Learning 9
2.2.3 Reinforcement Learning 11
2.2.4 Deep Reinforcement Learning 14
CHAPTER 3 METHOD 15
3.1 SIMULATION PLATFORM 16
3.2 CONTROL TASK 17
3.3 OBSERVATION SPACE 18
3.3.1 Information as Observation (Non-end-to-end) 19
3.3.2 Images as Observation (End-to-end) 20
3.4 ACTION SPACE 22
3.5 ALGORITHM 23
3.5.1 Mathematical Foundations 23
3.5.2 Policy Iteration 25
3.6 POLICY ARCHITECTURE 25
3.6.1 Network Architecture for Non-end-to-end Model 26
3.6.2 Network Architecture for End-to-end Model 28
3.7 REWARD SHAPING 29
3.7.1 Calculation of Speed-based Reward Function 30
3.7.2 Calculation of the reward function based on the position of the agent relative to the right lane 31
CHAPTER 4 TRAINING PROCESS 33
4.1 TRAINING PROCESS OF NON-END-TO-END MODEL 34
4.2 TRAINING PROCESS OF END-TO-END MODEL 35
CHAPTER 5 RESULT 38
CHAPTER 6 TEST AND EVALUATION 41
6.1 EVALUATION OF END-TO-END MODEL 43
6.1.1 Speed Tests in Two Scenarios 43
6.1.2 Lateral Deviation between the Agent and the Right Lane’s Centerline 44
6.1.3 Orientation Deviation between the Agent and the Right Lane’s Centerline 45
6.2 COMPARISON OF THE END-TO-END MODEL TO TWO BASELINES IN SIMULATION 46
6.2.1 Comparison with Non-end-to-end Baseline 47
6.2.2 Comparison with PD Baseline 51
6.3 TEST THE EFFECT OF DIFFERENT WEIGHTS ASSIGNMENTS ON THE END-TO-END MODEL 53
CHAPTER 7 CONCLUSION 57
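A toy reward-shaping sketch for lane following, in the spirit of the speed-based and lane-position-based terms listed in Sections 3.7.1 and 3.7.2, is shown below; the weight values and the exact combination of terms are assumptions for illustration, not the thesis's reward function.

```python
# Toy reward-shaping sketch for lane following. Weights and the use of lateral /
# orientation deviation terms are illustrative assumptions only.
def lane_following_reward(speed: float,
                          lateral_dev: float,
                          orientation_dev: float,
                          w_speed: float = 1.0,
                          w_lat: float = 10.0,
                          w_orient: float = 2.0) -> float:
    """Reward forward progress, penalize distance/angle from the right lane's centerline."""
    return w_speed * speed - w_lat * abs(lateral_dev) - w_orient * abs(orientation_dev)

# Example: a fast agent slightly left of the centerline and nearly parallel to it.
print(lane_following_reward(speed=0.8, lateral_dev=0.05, orientation_dev=0.1))
```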
|