91

Reinforcement Learning Techniques for Adaptive Routing in Telecommunication Networks with Irregular Traffic

HOCEINI, SAID 23 November 2004 (has links) (PDF)
The objective of this thesis is to propose algorithmic approaches for addressing the problem of adaptive routing (AR) in a communication network with irregular traffic. An analysis of existing algorithms led us to adopt the Q-Routing (QR) algorithm as our starting point; it relies on a reinforcement learning technique based on Markov models. The effectiveness of this type of routing depends strongly on information about the load and the nature of the traffic on the network. This information must be sufficient, relevant, and reflect the actual load of the network at the moment routing decisions are made. To remedy the drawbacks of QR-based techniques, we propose two AR algorithms. The first, called Q-Neural Routing, relies on a stochastic neural model to estimate and update the parameters needed for AR. To speed up convergence, a second approach, K-Shortest-Path Q-Routing, is proposed; it combines multipath routing with the QR algorithm, reducing the exploration space to the k best paths. Both proposed algorithms are validated and compared with traditional approaches using the OPNET simulation platform, and their effectiveness for adaptive routing is clearly demonstrated: unlike classical approaches, they take better account of the state of the network.
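The Q-Routing algorithm that the thesis takes as its starting point maintains, at each node, an estimate of the delivery time to every destination via each neighbour and updates it from that neighbour's own best estimate. Below is a minimal sketch of that classic tabular update; the table layout, learning rate, and delay quantities are illustrative assumptions, and this is not the author's Q-Neural or K-Shortest-Path variant.

```python
# Illustrative tabular Q-Routing (Boyan and Littman style) update, assuming
# Q[x][(d, y)] holds node x's estimated delivery time to destination d via neighbour y
# and `neighbours` maps each node to its neighbour list. Values are placeholders.
ETA = 0.5  # learning rate (assumed)

def q_routing_update(Q, neighbours, x, d, y, queue_delay, link_delay):
    """Update node x's estimate after sending a packet bound for d to neighbour y."""
    # Neighbour y's best remaining-time estimate towards destination d.
    t = min(Q[y][(d, z)] for z in neighbours[y])
    # Fresh sample of the total delivery time: queueing + transmission + remaining estimate.
    sample = queue_delay + link_delay + t
    Q[x][(d, y)] += ETA * (sample - Q[x][(d, y)])

def next_hop(Q, neighbours, x, d):
    """Greedy routing decision: the neighbour with the smallest estimated delivery time."""
    return min(neighbours[x], key=lambda y: Q[x][(d, y)])
```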
92

Derivation of high-frequency interest-rate trading models using reinforcement learning

Castro, Uirá Caiado de 24 August 2017 (has links)
The present study proposes the use of a reinforcement learning model to develop an interest rate trading strategy directly from historical high-frequency order book data. No assumption about market dynamics is made, but the approach requires creating a simulator with which the learning agent can interact to gain experience. Different variables related to the microstructure of the market are tested to compose the state of the environment. Functions based on P&L and/or on the consistency of the agent's order placement are tested to evaluate the actions taken. The results suggest some success in applying the proposed techniques to trading. However, obtaining consistently profitable strategies appears to depend heavily on the constraints placed on the learning task.
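The abstract describes an agent that learns by interacting with a simulator replaying historical order book data, with rewards built from P&L and/or the coherence of the agent's order placement. The loop below is only a hedged illustration of that setup using tabular Q-learning; the simulator interface, state features, action set, and hyperparameters are hypothetical placeholders, not the thesis' model.

```python
import random
from collections import defaultdict

# Illustrative agent-simulator loop; `simulator` is a hypothetical object replaying
# historical order-book data and returning a P&L-based reward each step.
ACTIONS = ["buy", "sell", "hold"]
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1   # assumed hyperparameters

def train(simulator, n_episodes=100):
    Q = defaultdict(float)                         # Q[(state, action)]
    for _ in range(n_episodes):
        state = simulator.reset()                  # e.g. discretised imbalance, spread, position
        done = False
        while not done:
            if random.random() < EPS:              # epsilon-greedy exploration
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = simulator.step(action)
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = next_state
    return Q
```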
93

Algorithms for Product Pricing and Energy Allocation in Energy Harvesting Sensor Networks

Sindhu, P R January 2014 (has links) (PDF)
In this thesis, we consider stochastic systems which arise in different real-world application contexts. The first problem we consider is based on product adoption and pricing. A monopolist selling a product has to appropriately price the product over time in order to maximize the aggregated profit. The demand for a product is uncertain and is influenced by a number of factors, some of which are price, advertising, and product technology. We study the influence of price on the demand of a product and also how demand affects future prices. Our approach involves mathematically modelling the variation in demand as a function of price and current sales. We present a simulation-based algorithm for computing the optimal price path of a product for a given period of time. The algorithm we propose uses a smoothed-functional performance gradient descent method to find a price sequence which maximizes the total profit over a planning horizon. The second system we consider is in the domain of sensor networks. A sensor network is a collection of autonomous nodes, each of which senses the environment. Sensor nodes use energy for sensing and communication related tasks. We consider the problem of finding optimal energy sharing policies that maximize the network performance of a system comprising multiple sensor nodes and a single energy harvesting (EH) source. Nodes periodically sense a random field and generate data, which is stored in their respective data queues. The EH source harnesses energy from ambient energy sources and the generated energy is stored in a buffer. The nodes require energy for transmission of data, and they receive the energy for this purpose from the EH source. The stored energy in the EH source needs to be shared efficiently among the nodes in the system in order to minimize the average delay of data transmission over the long run. We formulate this problem in the framework of average cost infinite-horizon Markov Decision Processes [3], [7] and provide algorithms for the same.
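The pricing part of the thesis uses a smoothed-functional gradient method to improve a price path from simulated profits. The snippet below is a hedged sketch of one such gradient step under a Gaussian smoothing perturbation; `simulate_profit`, the perturbation scale, and the step size are illustrative placeholders rather than the algorithm proposed in the thesis.

```python
import numpy as np

def sf_gradient_step(prices, simulate_profit, beta=0.1, step=0.01):
    """One smoothed-functional gradient ascent step on a price path (illustrative only)."""
    delta = np.random.standard_normal(prices.shape)       # Gaussian smoothing perturbation
    plus, base = simulate_profit(prices + beta * delta), simulate_profit(prices)
    grad_est = (plus - base) / beta * delta               # one-sided SF gradient estimate
    return prices + step * grad_est                       # ascend the estimated profit gradient
```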
94

Entity Control in a Strategic Game Based on Multi-Agent Systems / Strategic Game Based on Multiagent Systems

Knapek, Petr January 2019 (has links)
This thesis is focused on designing and implementing a system that adds learning and planning capabilities to agents designed for playing real-time strategy games like StarCraft. It explains the problems of controlling game entities and bots by computer and introduces some often-used solutions. Based on this analysis, a new system has been designed and implemented. It uses multi-agent systems to control the game, utilizes machine learning methods, and is capable of overcoming opponents and adapting to new challenges.
95

Reinforcement Learning for Grid Voltage Stability with FACTS

Oldeen, Joakim, Sharma, Vishnu January 2020 (has links)
With increased penetration of renewable energy sources, maintaining equilibrium between production and consumption in the world's electrical power systems (EPS) becomes more and more challenging. One way to increase stability and efficiency in an EPS is to use flexible alternating current transmission systems (FACTS). However, an EPS containing multiple FACTS devices with overlapping areas of influence can experience negative effects if the reference values they operate around are not updated with sufficient temporal resolution. The reference values are usually set manually by a system operator. The work in this master's thesis has investigated how three different reinforcement learning (RL) algorithms can be used to set reference values automatically, with higher temporal resolution than a system operator, with the aim of increasing voltage stability. The three RL algorithms – Q-learning, Deep Q-learning (DQN), and Twin-delayed deep deterministic policy gradient (TD3) – were implemented in Python together with a 2-bus EPS test network acting as the environment. The 2-bus EPS test network contains two FACTS devices: one for shunt compensation and one for series compensation. The results show that – with respect to reward – DQN was able to perform equally well or better than non-RL cases 98.3 % of the time on the simulation test set, while the corresponding values for TD3 and Q-learning were 87.3 % and 78.5 % respectively. DQN was able to achieve increased voltage stability on the test network, while TD3 showed similar results except during lower loading levels. Q-learning decreased voltage stability on a substantial portion of the test set, even compared to a case without FACTS devices. To help with continued research and possible future real-life implementation, a list of suggestions for future work has been established.
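For concreteness, the sketch below shows one way a discrete RL action could be decoded into joint reference-value adjustments for the shunt and series FACTS devices, with a reward that penalises deviation of bus voltages from nominal. The action grid, step sizes, and reward shape are assumptions for illustration; the thesis' environment and reward are not reproduced here.

```python
import numpy as np

SHUNT_STEPS = np.array([-0.02, 0.0, 0.02])    # assumed per-step change to the shunt reference (p.u.)
SERIES_STEPS = np.array([-0.02, 0.0, 0.02])   # assumed per-step change to the series reference (p.u.)

def apply_action(action_index, shunt_ref, series_ref):
    """Decode one discrete action (0..8) into a joint adjustment of both device references."""
    i, j = divmod(action_index, len(SERIES_STEPS))
    return shunt_ref + SHUNT_STEPS[i], series_ref + SERIES_STEPS[j]

def reward(bus_voltages_pu):
    """Higher reward the closer all bus voltages sit to 1.0 p.u. (illustrative stability proxy)."""
    return -float(np.sum(np.abs(np.asarray(bus_voltages_pu) - 1.0)))
```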
96

Not All Goals Are Created Equal : Evaluating Hockey Players in the NHL Using Q-Learning with a Contextual Reward Function

Vik, Jon January 2021 (has links)
Not all goals in the game of ice hockey are created equal: some goals increase the chances of winning more than others. This thesis investigates the result of constructing and using a reward function that takes this fact into consideration, instead of the common binary reward function. The two reward functions are used in a Markov Game model with value iteration. The data used to evaluate the hockey players is play-by-play data from the 2013-2014 season of the National Hockey League (NHL). Furthermore, overtime events, goalkeepers, and playoff games are excluded from the dataset. This study finds that the constructed reward is, in general, less correlated than the binary reward with the metrics points, time on ice, and star points. However, an increased correlation was found between the evaluated impact and time on ice for center players. Much of the discussion is devoted to the difficulty of validating the results from a player evaluation due to the lack of ground truth. One conclusion from this discussion is that future efforts must be made to establish consensus regarding how the success of a hockey player should be defined.
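The evaluation rests on propagating rewards for goals through a model estimated from play-by-play data. The sketch below is a single-agent simplification of that idea: plain value iteration over an event-state model, with the reward for a goal either binary or contextual. The state encoding, transition model, and win-probability gain are hypothetical placeholders, not the thesis' Markov Game formulation.

```python
def value_iteration(states, transitions, reward, gamma=1.0, n_sweeps=50):
    """transitions[s] is a list of (next_state, probability) pairs estimated from play-by-play data."""
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        V = {s: sum(p * (reward(s, s2) + gamma * V[s2]) for s2, p in transitions[s])
             for s in states}
    return V

def is_goal(state):
    """Placeholder: whether this event state represents a goal (hypothetical encoding)."""
    return state.endswith("goal")

def binary_reward(s, s2):
    return 1.0 if is_goal(s2) else 0.0            # every goal is worth the same

def contextual_reward(s, s2, win_prob_gain=0.3):
    """Weight a goal by how much it changes the scoring team's chance of winning (dummy constant here)."""
    return win_prob_gain if is_goal(s2) else 0.0
```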
97

Stabilizing Q-Learning for continuous control

Hui, David Yu-Tung 12 1900 (has links)
Deep Reinforcement Learning has produced decision makers that play Chess, Go, Shogi, Atari, and Starcraft with superhuman ability. However, unlike animals and humans, these algorithms struggle to navigate and control physical environments. Manipulating the physical world requires controlling continuous action spaces such as position, velocity, and acceleration, unlike the discrete action spaces of board and video games. Training deep neural networks for continuous control is unstable: agents struggle to learn and retain good behaviors, performance is high variance across hyperparameters, random seeds, and even multiple runs of the same task, and algorithms struggle to perform well outside the domains they have been developed in. This thesis finds principles behind the success of deep neural networks in other learning paradigms and examines their impact on reinforcement learning for continuous control. Chapter 1 explains how the maximum-entropy principle produces supervised and unsupervised learning loss functions and derives some regularizers used to stabilize deep networks from the training dynamics of deep learning. Chapter 2 provides a maximum-entropy justification for the form of actor-critic algorithms and finds a configuration of an actor-critic algorithm that trains most stably. Finally, Chapter 3 considers the training dynamics of deep reinforcement learning to propose two improvements to target and twin networks that improve stability and convergence. Experiments are performed within the DeepMind Control, MuJoCo, and Box2D ideal-physics simulators.
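Chapter 3's improvements concern target and twin networks; the sketch below only illustrates the standard constructs it starts from, namely Polyak-averaged target networks and a bootstrap target formed from the minimum of two critics (as in TD3 / clipped double Q-learning). Network sizes, tau, and gamma are assumed values, and this is not the thesis' proposed modification.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 8, 2, 0.99, 0.005   # assumed dimensions and coefficients

def make_critic():
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

q1, q2 = make_critic(), make_critic()
q1_targ, q2_targ = copy.deepcopy(q1), copy.deepcopy(q2)

def critic_target(reward, next_obs, next_act, done):
    """Bootstrap target: reward plus the discounted minimum of the two target critics."""
    with torch.no_grad():
        sa = torch.cat([next_obs, next_act], dim=-1)
        q_min = torch.min(q1_targ(sa), q2_targ(sa)).squeeze(-1)
        return reward + gamma * (1.0 - done) * q_min

def polyak_update(online, target):
    """Slowly track the online weights: theta_target <- (1 - tau) * theta_target + tau * theta."""
    with torch.no_grad():
        for p, p_targ in zip(online.parameters(), target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)
```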
98

Generation and Detection of Adversarial Attacks for Reinforcement Learning Policies

Drotz, Axel, Hector, Markus January 2021 (has links)
In this project we investigate the susceptibility of reinforcement learning (RL) algorithms to adversarial attacks. Adversarial attacks have been proven to be very effective at reducing the performance of deep learning classifiers, and recently have also been shown to reduce the performance of RL agents. The goal of this project is to evaluate adversarial attacks on agents trained using deep reinforcement learning (DRL), as well as to investigate how to detect these types of attacks. We first use DRL to solve two environments from OpenAI's gym module, namely CartPole and LunarLander, by using DQN and DDPG (DRL techniques). We then evaluate the performance of attacks, and finally we also train neural networks to detect attacks. The attacks were successful at reducing performance in both the LunarLander and CartPole environments. The attack detector was very successful at detecting attacks on the CartPole environment, but did not perform quite as well on LunarLander. We hypothesize that continuous action space environments may pose a greater difficulty for attack detectors to identify potential adversarial attacks. / Bachelor's thesis in electrical engineering 2021, KTH, Stockholm
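The abstract does not specify which attack was used; a common choice in this literature is a fast gradient sign method (FGSM) perturbation of the agent's observation, sketched below purely as an illustration. The Q-network interface and the perturbation budget are assumptions.

```python
import torch
import torch.nn as nn

EPSILON = 0.01   # assumed perturbation budget in observation space

def fgsm_attack(q_net: nn.Module, obs: torch.Tensor) -> torch.Tensor:
    """Perturb the observation so the action the agent would pick looks less attractive."""
    obs = obs.clone().requires_grad_(True)
    q_values = q_net(obs)
    best_action = q_values.argmax(dim=-1, keepdim=True)
    # Increasing this loss decreases the Q-value of the originally chosen action.
    loss = -q_values.gather(-1, best_action).sum()
    loss.backward()
    return (obs + EPSILON * obs.grad.sign()).detach()
```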
99

An empirical study of stability and variance reduction in Deep Reinforcement Learning

Lindström, Alexander January 2024 (has links)
Reinforcement Learning (RL) is a branch of AI that deals with solving complex sequential decision-making problems such as training robots, trading while following patterns and trends, optimal control of industrial processes, and more. These applications span various fields, including data science, factories, finance, and others [1]. The most popular RL algorithm today is Deep Q Learning (DQL), developed by a team at DeepMind, which successfully combines RL with Neural Networks (NN). However, combining RL and NN introduces challenges such as numerical instability and unstable learning due to high variance. Among others, these issues are due to the "moving target problem". To mitigate this problem, the target network was introduced as a solution. However, using a target network slows down learning, vastly increases memory requirements, and adds overhead in running the code. In this thesis, we conduct an empirical study to investigate the importance of target networks. We conduct this empirical study for three scenarios. In the first scenario, we train agents in online learning. The aim here is to demonstrate that the target network can be removed after some point in time without negatively affecting performance. To evaluate this scenario, we introduce the concept of the stabilization point. In the second scenario, we pre-train agents before continuing to train them in online learning. For this scenario, we demonstrate the redundancy of the target network by showing that it can be completely omitted. In the third scenario, we evaluate a newly developed activation function called Truncated Gaussian Error Linear Unit (TGeLU). For this scenario, we train an agent in online learning and show that by using TGeLU as an activation function, we can completely remove the target network. Through the empirical study of these scenarios, we conjecture and verify that a target network has only transient benefits concerning stability. We show that it has no influence on the quality of the policy found. We also observed that variance was generally higher when using a target network in the later stages of training compared to cases where the target network had been removed. Additionally, during the investigation of the second scenario, we observed that the number of training iterations during pre-training affected the agent's performance in the online learning phase. This thesis provides a deeper understanding of how the target network affects the training process of DQL; some of these findings, surrounding variance reduction, are contrary to popular belief. Additionally, the results have provided insights into potential future work. This includes further exploring the benefits of the lower variance observed when removing the target network, and conducting more efficient convergence analyses for the pre-training part of the second scenario.
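The question the thesis studies is whether the bootstrap target needs a separate, slowly updated network at all. The snippet below is a hedged sketch contrasting the two ways of forming the DQN target, with and without a target network; shapes and the discount factor are assumed, and the stabilization-point criterion itself is not shown.

```python
import torch
import torch.nn as nn

GAMMA = 0.99   # assumed discount factor

def td_target(reward, next_obs, done, q_net: nn.Module, target_net: nn.Module = None):
    """If target_net is None, the online network bootstraps its own target (no target network)."""
    bootstrap_net = target_net if target_net is not None else q_net
    with torch.no_grad():                     # the bootstrap term is a constant either way
        next_q = bootstrap_net(next_obs).max(dim=-1).values
    return reward + GAMMA * (1.0 - done) * next_q
```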
100

Resource Allocation for Sequential Decision Making Under Uncertainty : Studies in Vehicular Traffic Control, Service Systems, Sensor Networks and Mechanism Design

Prashanth, L A January 2013 (has links) (PDF)
A fundamental question in a sequential decision making setting under uncertainty is "how to allocate resources amongst competing entities so as to maximize the rewards accumulated in the long run?". The resources allocated may be either abstract quantities such as time or concrete quantities such as manpower. The sequential decision making setting involves one or more agents interacting with an environment to procure rewards at every time instant, and the goal is to find an optimal policy for choosing actions. Most of these problems involve multiple (infinite) stages and the objective function is usually a long-run performance objective. The problem is further complicated by the uncertainties in the system, for instance, the stochastic noise and partial observability in a single-agent setting or private information of the agents in a multi-agent setting. The dimensionality of the problem also plays an important role in the solution methodology adopted. Most real-world problems involve high-dimensional state and action spaces, and an important design aspect of the solution is the choice of knowledge representation. The aim of this thesis is to answer important resource allocation related questions in different real-world application contexts and, in the process, contribute novel algorithms to the theory as well. The resource allocation algorithms considered include those from stochastic optimization, stochastic control and reinforcement learning, and a number of new algorithms are developed as well. The application contexts selected encompass both single and multi-agent systems, abstract and concrete resources, and contain high-dimensional state and control spaces. The empirical results from the various studies performed indicate that the algorithms presented here perform significantly better than those previously proposed in the literature. Further, the algorithms presented here are also shown to theoretically converge, hence guaranteeing optimal performance. We now briefly describe the various studies conducted here to investigate problems of resource allocation under uncertainties of different kinds.

Vehicular Traffic Control: The aim here is to optimize the 'green time' resource of the individual lanes in road networks so as to maximize a certain long-term performance objective. We develop several reinforcement learning based algorithms for solving this problem. In the infinite horizon discounted Markov decision process setting, a Q-learning based traffic light control (TLC) algorithm that incorporates feature based representations and function approximation to handle large road networks is proposed, see Prashanth and Bhatnagar [2011b]. This TLC algorithm works with coarse information, obtained via graded thresholds, about the congestion level on the lanes of the road network. However, the graded threshold values used in the above Q-learning based TLC algorithm, as well as in several other graded threshold-based TLC algorithms that we propose, may not be optimal for all traffic conditions. We therefore also develop a new algorithm based on SPSA to tune the associated thresholds to the 'optimal' values (Prashanth and Bhatnagar [2012]). Our threshold tuning algorithm is online, incremental, with proven convergence to the optimal values of thresholds. Further, we also study average cost traffic signal control and develop two novel reinforcement learning based TLC algorithms with function approximation (Prashanth and Bhatnagar [2011c]). Lastly, we also develop a feature adaptation method for 'optimal' feature selection (Bhatnagar et al. [2012a]). This algorithm adapts the features in such a way as to converge to an optimal set of features, which can then be used in the algorithm.

Service Systems: The aim here is to optimize the 'workforce', the critical resource of any service system. However, adapting the staffing levels to the workloads in such systems is nontrivial, as the queue stability and aggregate service level agreement (SLA) constraints have to be complied with. We formulate this problem as a constrained hidden Markov process with a (discrete) worker parameter and propose simultaneous perturbation based simulation optimization algorithms for this purpose. The algorithms include both first order as well as second order methods and incorporate SPSA based gradient estimates in the primal, with dual ascent for the Lagrange multipliers. All the algorithms that we propose are online, incremental, and easy to implement. Further, they involve a certain generalized smooth projection operator, which is essential to project the continuous-valued worker parameter updates obtained from the SASOC algorithms onto the discrete set. We validate our algorithms on five real-life service systems and compare their performance with a state-of-the-art optimization toolkit, OptQuest. Being faster than OptQuest, our scheme is particularly suitable for adaptive labor staffing. Also, we observe that it guarantees convergence and finds better solutions than OptQuest in many cases.

Wireless Sensor Networks: The aim here is to allocate the 'sleep time' (resource) of the individual sensors in an intrusion detection application such that the energy consumption from the sensors is reduced, while keeping the tracking error to a minimum. We model this sleep–wake scheduling problem as a partially observed Markov decision process (POMDP) and propose novel RL-based algorithms, with both long-run discounted and average cost objectives, for solving this problem. All our algorithms incorporate function approximation and feature-based representations to handle the curse of dimensionality. Further, the feature selection scheme used in each of the proposed algorithms intelligently manages the energy cost and tracking cost factors, which in turn assists the search for the optimal sleeping policy. The results from the simulation experiments suggest that our proposed algorithms perform better than a recently proposed algorithm from Fuemmeler and Veeravalli [2008], Fuemmeler et al. [2011].

Mechanism Design: The setting here is of multiple self-interested agents with limited capacities, attempting to maximize their individual utilities, which often comes at the expense of the group's utility. The aim of the resource allocator here, then, is to efficiently allocate the resource (which the agents are contending for) and also maximize the social welfare via the 'right' transfer of payments. In other words, the problem is to find an incentive compatible transfer scheme following a socially efficient allocation. We present two novel mechanisms with progressively realistic assumptions about agent types, aimed at economic scenarios where agents have limited capacities. For the simplest case, where agent types consist of a unit cost of production and a capacity that does not change with time, we provide an enhancement to the static mechanism of Dash et al. [2007] that effectively deters misreport of the capacity type element by an agent to receive an allocation beyond its capacity, which thereby damages other agents. Our model incorporates an agent's preference to harm other agents through an additive factor in the agent's utility function, and the mechanism we propose achieves strategy-proofness by means of a novel penalty scheme. Next, we consider a dynamic setting where agent types evolve and the individual agents here again have a preference to harm others via capacity misreports. We show via a counterexample that the dynamic pivot mechanism of Bergemann and Välimäki [2010] cannot be directly applied in our setting with capacity-limited agents. We propose an enhancement to the mechanism of Bergemann and Välimäki [2010] that ensures truth-telling with respect to the capacity type element through a variable penalty scheme (in the spirit of the static mechanism). We show that each of our mechanisms is ex-post incentive compatible, ex-post individually rational, and socially efficient.
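Several of the schemes above (the threshold-tuning TLC algorithm and the SASOC staffing algorithms) are built on simultaneous perturbation stochastic approximation (SPSA). The snippet below is a minimal sketch of a standard two-simulation SPSA gradient step; the objective `simulate_cost`, the perturbation size, and the step size are illustrative placeholders, not the thesis' constrained or second-order variants.

```python
import numpy as np

def spsa_step(theta, simulate_cost, c=0.1, a=0.01):
    """One SPSA descent step: two noisy cost evaluations give a full gradient estimate."""
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)          # Bernoulli +/-1 perturbation
    y_plus, y_minus = simulate_cost(theta + c * delta), simulate_cost(theta - c * delta)
    g_hat = (y_plus - y_minus) / (2.0 * c) * (1.0 / delta)           # componentwise gradient estimate
    return theta - a * g_hat                                         # descend the estimated gradient
```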
