1 |
Aprendizado por reforço multiagente : uma avaliação de diferentes mecanismos de recompensa para o problema de aprendizado de rotas / Multiagent reinforcement learning : an evaluation of different reward mechanisms for the route learning problem. Grunitzki, Ricardo. January 2014.
This master's dissertation presents a study of the effects of different reward functions, applied in multiagent reinforcement learning, on the vehicle routing problem in traffic networks. Two reward functions that differ in the alignment of the numerical signal sent from the environment to the agent are addressed. The first, called the individual function, is aligned with the agent's (vehicle or driver) own utility and seeks to minimize that agent's travel time. The second, called difference rewards, is aligned with the system's global utility and aims to minimize the average travel time on the network (the average travel time of all drivers). Both approaches are applied to two vehicle routing scenarios that differ in the number of learning drivers, in network topology and, consequently, in level of complexity. The approaches are compared with three traffic assignment techniques from the literature. Results show that the reinforcement learning-based methods outperform the traffic assignment methods. Furthermore, aligning the reward function with the global utility provides a significant improvement over the individual function. However, in the scenario with the largest number of agents learning simultaneously, both approaches yield equivalent solutions.
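The abstract above does not give formulas, so the following is only a hedged sketch of the two reward signals it contrasts. It assumes a toy linear congestion model, and the names (`link_time`, `routes`) are illustrative, not the thesis's:

```python
from collections import Counter

def link_time(load, free_flow=1.0, slope=0.5):
    # Illustrative linear congestion model: travel time grows with link load.
    return free_flow + slope * load

def travel_times(routes):
    # Per-driver travel time given every driver's route choice.
    # `routes` maps driver id -> list of link ids.
    load = Counter(link for route in routes.values() for link in route)
    return {d: sum(link_time(load[l]) for l in route)
            for d, route in routes.items()}

def individual_reward(routes, driver):
    # Aligned with the driver's own utility: minus their own travel time.
    return -travel_times(routes)[driver]

def difference_reward(routes, driver):
    # Aligned with the global utility: G(z) - G(z_-i), where G is minus the
    # average travel time and z_-i is the system without driver i.
    times = travel_times(routes)
    g_with = -sum(times.values()) / len(times)
    rest = {d: r for d, r in routes.items() if d != driver}
    times_rest = travel_times(rest)
    g_without = -sum(times_rest.values()) / len(times_rest)
    return g_with - g_without
```

Under this toy model, a driver who joins a congested link receives a negative difference reward even though their individual reward looks no worse than anyone else's, which is the alignment effect the abstract describes.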
|
4 |
Deep reinforcement learning approach to portfolio management / Deep reinforcement learning metod för portföljförvaltning. Jama, Fuaad. January 2023.
This thesis evaluates a Deep Reinforcement Learning (DRL) approach to portfolio management on the Swedish stock market. The idea is to construct a portfolio that is adjusted daily using the DRL algorithm Proximal Policy Optimization (PPO) with a multilayer perceptron neural network. The input to the neural network is historical data in the form of open, high, and low prices. The portfolio is evaluated by its performance against the OMX Stockholm 30 index (OMXS30). Furthermore, three different approaches to optimization are studied, in that three different reward functions are used: the Sharpe ratio, cumulative reward (daily return), and a value-at-risk reward (daily return with a value-at-risk penalty). The historical data used is from the period 2010-01-01 to 2015-12-31, and the DRL approach is then tested on two time periods that represent different market conditions: 2016-01-01 to 2018-12-31 and 2019-01-01 to 2021-12-31. The results show that in the first test period all three methods (corresponding to the three reward functions) outperform the OMXS30 benchmark in both return and Sharpe ratio, while in the second test period none of the methods outperforms the OMXS30 index.
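The abstract names the three reward functions but not their exact formulas; the sketch below uses common textbook variants (risk-free rate taken as zero, empirical value-at-risk, and an assumed penalty weight), so it is an illustration rather than the thesis's implementation:

```python
import statistics

def cumulative_reward(daily_returns):
    # Daily-return reward: simply the latest portfolio return.
    return daily_returns[-1]

def sharpe_reward(daily_returns):
    # Sharpe ratio over the observed window; risk-free rate assumed zero.
    mean = statistics.mean(daily_returns)
    std = statistics.pstdev(daily_returns)
    return mean / std if std > 0 else 0.0

def var_reward(daily_returns, alpha=0.05, penalty=1.0):
    # Daily return minus a penalty proportional to the empirical
    # value-at-risk at level alpha (the penalty weight is an assumption).
    losses = sorted((-r for r in daily_returns), reverse=True)
    var = losses[int(alpha * len(losses))]  # upper-tail empirical loss
    return daily_returns[-1] - penalty * max(var, 0.0)
```

All three take the same return history, so an agent can be trained identically and only the scalar signal swapped, which is what makes the three-way comparison in the thesis possible.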
|
5 |
Designförslag på belöningsfunktioner för självkörande bilar i TORCS som inte krockar / Design suggestions for reward functions for self-driving cars in TORCS that do not crash. Andersson, Björn; Eriksson, Felix. January 2018.
This study used TORCS (The Open Racing Car Simulator), which is an interesting game in which to create self-driving cars because it provides nineteen different sensors that describe the environment to the agent. The problem for this study was to identify which of these sensors can be used in a reward function and how that reward function should be implemented. The study adopted a quantitative experimental method, with the research question: how can a reward function be designed so that an agent can maneuver in TORCS without crashing and at the same time achieve a consistent result? The quantitative experimental method was chosen because the authors had to design, implement, run experiments on, and evaluate the results of each reward function. A total of fifteen experiments were conducted over twelve reward functions on two different tracks: E-Track 5 (E-5) and Aalborg. Each of the twelve reward functions was tested in an experiment on E-5, and the three with the best results, Charlie, Foxtrot, and Juliette, were tested in an additional experiment on Aalborg, a more difficult track, to show whether a reward function can drive on more than one track and is therefore general. Juliette was the only reward function that completed a lap on both E-5 and Aalborg without crashing. Based on the experiments, the conclusion was drawn that Juliette satisfies the research question: it completes both tracks without crashing and, when it succeeds, it achieves a consistent result. The study therefore succeeded in designing and implementing a reward function that answers the research question.
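The abstract does not reproduce the Charlie, Foxtrot, or Juliette functions themselves. As a hedged illustration, a common TORCS-style shaping that combines the `angle`, `trackPos`, and longitudinal-speed sensors might look like the following; the weights and the crash penalty are assumptions, not values from the thesis:

```python
import math

def torcs_style_reward(speed_x, angle, track_pos):
    # Illustrative shaping, not the thesis's Juliette function:
    # reward forward progress along the track axis, penalize
    # misalignment with the track and drift from the centre line.
    if abs(track_pos) >= 1.0:  # trackPos outside [-1, 1] means off-track
        return -200.0          # assumed crash/off-track penalty
    return speed_x * (math.cos(angle)
                      - abs(math.sin(angle))
                      - abs(track_pos))
```

The point of such a shape is that the maximum reward is only reachable while driving fast, straight, and centred, so "do not crash" is encoded as both the hard penalty and the gradient toward the track axis.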
|
6 |
Not All Goals Are Created Equal : Evaluating Hockey Players in the NHL Using Q-Learning with a Contextual Reward Function. Vik, Jon. January 2021.
Not all goals in the game of ice hockey are created equal: some goals increase the chances of winning more than others. This thesis investigates the result of constructing and using a reward function that takes this fact into consideration, instead of the common binary reward function. The two reward functions are used in a Markov game model with value iteration. The data used to evaluate the hockey players is play-by-play data from the 2013-2014 season of the National Hockey League (NHL); overtime events, goalkeepers, and playoff games are excluded from the dataset. This study finds that the constructed reward is, in general, less correlated than the binary reward with the metrics points, time on ice, and star points. However, an increased correlation was found between the evaluated impact and time on ice for center players. Much of the discussion is devoted to the difficulty of validating the results of a player evaluation due to the lack of ground truth. One conclusion from this discussion is that future efforts must be made to establish consensus on how the success of a hockey player should be defined.
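As a hedged illustration of the contrast between the binary and the contextual reward, the sketch below values a goal by the change in the scoring team's win probability. The `win_prob` model here is a made-up logistic placeholder keyed only on the goal differential, whereas the thesis derives values from a learned Markov game model over richer context:

```python
import math

def binary_goal_reward(event):
    # Every goal is worth the same.
    return 1.0 if event["type"] == "goal" else 0.0

def contextual_goal_reward(event, win_prob):
    # A goal is worth the change in the scoring team's win probability,
    # which depends on the game context (here only the goal differential).
    if event["type"] != "goal":
        return 0.0
    before = event["context"]
    after = {**before, "goal_diff": before["goal_diff"] + 1}
    return win_prob(after) - win_prob(before)

def toy_win_prob(context):
    # Placeholder model: logistic in the goal differential.
    return 1.0 / (1.0 + math.exp(-context["goal_diff"]))
```

Under this toy model a tying goal is worth far more than a goal scored while already four goals ahead, which is exactly the "not all goals are created equal" premise.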
|