  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

Synthèse de comportements par apprentissages par renforcement parallèles : application à la commande d'un micromanipulateur plan

Laurent, Guillaume 18 December 2002 (has links) (PDF)
In microrobotics, controlling systems is difficult because the physical phenomena at the microscopic scale are complex. Reinforcement learning methods are an attractive approach because they make it possible to build a control strategy without a priori knowledge of the system. Given the large state spaces of the systems under study, we developed a parallel approach inspired both by behavior-based architectures and by reinforcement learning. This architecture, based on parallelizing the Q-learning algorithm, reduces the complexity of the system and speeds up learning. On a simple maze application the results are good, but the learning time is too long to consider controlling a real system. Q-learning was therefore replaced by the Dyna-Q algorithm, which we adapted to the control of non-deterministic systems by adding a history of the most recent transitions. This architecture, called parallel Dyna-Q, not only improves the convergence speed but also finds better control strategies. Experiments on the manipulation system show that learning then becomes possible in real time and without using simulation. The behavior-coordination function is effective when the obstacles are relatively far apart from one another. When they are not, this function can create local maxima that temporarily trap the system in a cycle. We therefore designed another coordination function that builds a more global model of the system from the transition model constructed by Dyna-Q. This new coordination function escapes local maxima very efficiently, provided that the matching function used by the architecture is robust.
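Not part of the abstract above, and not the parallel architecture it describes: the following is a minimal single-agent tabular Dyna-Q sketch in Python of the kind of update Dyna-Q refers to, one real Q-learning step followed by planning sweeps over a learned transition model. The environment interface (env.reset, env.actions, env.step) and all hyperparameter values are assumptions made for illustration.

    import random
    from collections import defaultdict

    def dyna_q(env, episodes=200, alpha=0.1, gamma=0.95, epsilon=0.1, n_planning=10):
        Q = defaultdict(float)      # Q[(state, action)]
        model = {}                  # model[(state, action)] = (reward, next_state, done)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                acts = env.actions(s)
                if random.random() < epsilon:
                    a = random.choice(acts)
                else:
                    a = max(acts, key=lambda x: Q[(s, x)])
                s2, r, done = env.step(s, a)
                # Direct reinforcement-learning update from the real transition
                best_next = 0.0 if done else max(Q[(s2, x)] for x in env.actions(s2))
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                # Learn the model, then plan with replayed (simulated) transitions
                model[(s, a)] = (r, s2, done)
                for _ in range(n_planning):
                    (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                    pbest = 0.0 if pdone else max(Q[(ps2, x)] for x in env.actions(ps2))
                    Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
                s = s2
        return Q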
72

Paramétrage Dynamique et Optimisation Automatique des Réseaux Mobiles 3G et 3G+

Nasri, Ridha 23 January 2009 (has links) (PDF)
Mobile radio telecommunications are currently evolving considerably in terms of the diversity of technologies and of the services provided to the end user. This diversity makes cellular networks more complex, and manual parameter-optimization operations are becoming increasingly complicated and costly. As a consequence, network operating costs rise correspondingly for operators. It is therefore essential to simplify and automate these tasks, which will reduce the resources devoted to manual network optimization. Moreover, by optimizing deployed mobile networks automatically in this way, it becomes possible to postpone network densification and the acquisition of new sites. Automatic, optimal parameter setting will thus also spread out, or even reduce, network investment and maintenance costs. This thesis introduces new methods for the automatic tuning (auto-tuning) of RRM (Radio Resource Management) algorithms in 3G and beyond-3G mobile networks. Auto-tuning is a process that uses control tools such as fuzzy-logic controllers and reinforcement learning. It adjusts the parameters of the RRM algorithms in order to adapt the network to traffic fluctuations. Auto-tuning operates as an optimal control loop driven by a controller fed with the network's quality indicators. To find the optimal network parameter setting, the controller maximizes a utility function, also called a reinforcement function. Four case studies are described in this thesis. First, auto-tuning of the radio resource allocation algorithm is presented. To give priority to users of the real-time service (voice), a guard band is reserved for them. However, when real-time traffic is low, it is important to make this resource available to other services. Auto-tuning therefore achieves an optimal trade-off between the quality perceived by each service, by adapting the reserved resources to the traffic of each service class. The second case is the automatic, dynamic optimization of the soft-handover parameters in UMTS. For soft-handover auto-tuning, a controller is implemented logically at the RNC and automatically adjusts the handover thresholds according to the radio load of each cell and of its neighbors. This approach balances the radio load between cells and thus implicitly increases network capacity. Simulations show that adapting the soft-handover thresholds in UMTS increases capacity by 30% compared with a fixed parameter setting. The auto-tuning approach for mobility in UMTS is then extended to LTE (3GPP Long Term Evolution) systems, but in that case the auto-tuning relies on a pre-built auto-tuning function. Adapting the handover margins in LTE smooths inter-cell interference and thus increases the throughput perceived by each user of the network. Finally, an adaptive mobility algorithm between the UMTS and WLAN technologies is proposed. The algorithm is governed by two thresholds, one responsible for handover from UMTS to WLAN and the other for handover in the opposite direction. Adapting these two thresholds allows optimal, joint exploitation of the resources available in the two technologies. Simulation results for a multi-system network also show a significant capacity gain.
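Not from the thesis: to make the control loop concrete, here is a hypothetical Python sketch of a Q-learning auto-tuner that nudges a single handover threshold up or down from a discretized load-imbalance indicator and a utility (reinforcement) signal. The state discretization, the action set and all names are assumptions, not the controllers actually designed in the thesis (which also rely on fuzzy logic).

    import random
    from collections import defaultdict

    class ThresholdTuner:
        def __init__(self, step=0.5, alpha=0.2, gamma=0.9, eps=0.1):
            self.alpha, self.gamma, self.eps = alpha, gamma, eps
            self.Q = defaultdict(float)
            self.actions = (-step, 0.0, +step)   # decrease, keep, increase the threshold

        def state(self, kpi):
            # Discretize a network quality indicator, e.g. the load imbalance
            # between a cell and its neighbours, into coarse buckets.
            return round(kpi["load_imbalance"], 1)

        def choose(self, s):
            if random.random() < self.eps:
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.Q[(s, a)])

        def update(self, s, a, reward, s_next):
            best = max(self.Q[(s_next, b)] for b in self.actions)
            self.Q[(s, a)] += self.alpha * (reward + self.gamma * best - self.Q[(s, a)])

    # One control iteration (kpi and utility() would come from the network or a simulator):
    #   s = tuner.state(kpi); a = tuner.choose(s); threshold += a
    #   ...observe new kpi...; tuner.update(s, a, utility(new_kpi), tuner.state(new_kpi))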
73

Melhoria na convergência do algoritmo Q-Learning na aplicação de sistemas tutores inteligentes

Paiva, Éverton de Oliveira 16 August 2016 (has links)
Dissertation (Professional Master's), Graduate Program in Education, Universidade Federal dos Vales do Jequitinhonha e Mucuri, 2016. Using computer systems as a complement to or replacement for the classroom is an increasingly common practice in education, and Intelligent Tutoring Systems (ITS) are one of these alternatives. It is therefore crucial to develop ITS capable of both teaching and learning relevant information about the student through artificial intelligence techniques. This learning occurs through direct, and generally slow, interaction between the ITS and the student. This dissertation introduces the Tabu Search and GRASP meta-heuristics with the purpose of accelerating that learning. An ITS simulator was developed to evaluate the performance of this change. Computer simulations were conducted to compare the performance of the traditional random exploration policy with the proposed Tabu Search and GRASP meta-heuristics. The results obtained from these simulations and the statistical tests applied strongly indicate that introducing suitable meta-heuristics into the exploration policy improves the performance of the learning algorithm in ITS.
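Not from the dissertation: one possible reading of "meta-heuristics in the exploration policy" is sketched below in Python, an epsilon-greedy Q-learning action selection whose random branch avoids recently tried actions through a short tabu list. The tabu-list length and the state/action representation are illustrative assumptions; the GRASP variant is not shown.

    import random
    from collections import defaultdict, deque

    Q = defaultdict(float)
    tabu = deque(maxlen=5)   # short-term memory of recent (state, action) pairs

    def select_action(state, actions, epsilon=0.2):
        if random.random() < epsilon:
            # Exploration branch: prefer actions not tried recently in this state
            candidates = [a for a in actions if (state, a) not in tabu] or list(actions)
            action = random.choice(candidates)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        tabu.append((state, action))
        return action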
74

Algoritmo Q-learning como estratégia de exploração e/ou explotação para metaheurísticas GRASP e algoritmo genético

Lima Júnior, Francisco Chagas de 20 March 2009 (has links)
Optimization techniques known as metaheuristics have achieved success in solving many problems classified as NP-hard. These methods use non-deterministic approaches that reach very good solutions which, however, do not guarantee finding the global optimum. Beyond the inherent difficulties related to the complexity that characterizes optimization problems, metaheuristics still face the exploration/exploitation dilemma, which consists of choosing between a greedy search and a wider exploration of the solution space. One way to guide such algorithms in their search for better solutions is to supply them with more knowledge of the problem through an intelligent agent able to recognize promising regions and to identify when the search direction should be diversified. This work therefore proposes the use of a reinforcement learning technique, the Q-learning algorithm, as an exploration/exploitation strategy for the GRASP (Greedy Randomized Adaptive Search Procedure) and Genetic Algorithm metaheuristics. The proposed GRASP metaheuristic uses Q-learning instead of the traditional greedy-random algorithm in the construction phase. This replacement aims to improve the quality of the initial solutions used in GRASP's local search phase, while also providing the metaheuristic with an adaptive memory mechanism that allows good past decisions to be reused and the repetition of bad decisions to be avoided. In the Genetic Algorithm, Q-learning was used to generate an initial population of high fitness and, after a given number of generations, whenever the diversity rate of the population falls below a certain limit L, it is also applied to supply one of the parents used by the genetic crossover operator. Another significant change in the hybrid genetic algorithm is a mutually cooperative interaction between the genetic operators and the Q-learning algorithm: Q-learning receives an additional update of its matrix of Q-values based on the current best solution of the Genetic Algorithm. The computational experiments presented in this thesis compare the results obtained with traditional versions of the GRASP and Genetic Algorithm metaheuristics against those obtained with the proposed hybrid methods. Both algorithms were applied successfully to the symmetric Traveling Salesman Problem, which was modeled as a Markov decision process.
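Not the thesis code: a rough Python sketch of a GRASP-style construction phase for the symmetric TSP in which the next city is chosen epsilon-greedily from Q-values over (current city, next city) pairs, instead of by the usual greedy-random rule, with the negative edge length as reward. Parameter values and the exact placement of the update are assumptions.

    import random
    from collections import defaultdict

    def construct_tour(dist, Q, start=0, epsilon=0.2, alpha=0.1, gamma=0.9):
        n = len(dist)
        tour, visited = [start], {start}
        current = start
        while len(tour) < n:
            candidates = [c for c in range(n) if c not in visited]
            if random.random() < epsilon:
                nxt = random.choice(candidates)
            else:
                nxt = max(candidates, key=lambda c: Q[(current, c)])
            reward = -dist[current][nxt]      # shorter edges give higher reward
            future = max((Q[(nxt, c)] for c in candidates if c != nxt), default=0.0)
            Q[(current, nxt)] += alpha * (reward + gamma * future - Q[(current, nxt)])
            tour.append(nxt)
            visited.add(nxt)
            current = nxt
        return tour

    # Usage sketch: Q = defaultdict(float); tour = construct_tour(dist_matrix, Q),
    # followed by the usual GRASP local-search phase (e.g. 2-opt) on the tour.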
75

Agentes-Q: um algoritmo de roteamento distribuído e adaptativo para redes de telecomunicações / Q-Agents: an adaptive and distributed routing algorithm for telecommunications networks

Karla Vittori 14 April 2000 (has links)
Telecommunications networks are responsible for transmitting information between source and destination points in a fast, secure and reliable way, providing low-cost, high-quality services. Among the several devices that take part in this process is the routing system, which selects the routes that messages traverse through the network and forwards them to the desired destination. Advances in the technologies used by telecommunications networks have created the need for new routing systems capable of dealing correctly with the situations faced by current networks. Hence, this research project developed an adaptive and distributed routing algorithm, the result of integrating three learning strategies and adding some extra mechanisms, with the goal of obtaining an algorithm that is efficient and robust to the many variations in network operating conditions. The approaches chosen were Q-learning, dual reinforcement learning and learning based on the collective behavior of ants. The algorithm developed was applied to two circuit-switching networks and its performance was compared with that of two algorithms based on ant colony behavior that have been applied successfully to the routing problem. The experiments covered real situations faced by telecommunications networks, such as variations in traffic patterns, load level and topology. In addition, tests were performed with noise present in the information used to select the routes traversed by calls. The proposed algorithm produced better results than the others, showing a higher capacity to adapt to the various situations considered. The experiments showed that new optimization mechanisms must be added to the proposed algorithm to improve its exploratory behavior under permanent variations in network load level and in the presence of noise in the data used for its tasks.
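Not taken from the thesis: the Q-learning ingredient of such routing algorithms often resembles a Q-routing style update, sketched below in Python, in which each node keeps estimates of the delivery time to a destination through each neighbour and updates them from the chosen neighbour's own best estimate plus the link delay. The dual-reinforcement and ant-inspired components of the proposed algorithm are not represented; data structures and names are assumptions.

    from collections import defaultdict

    # Q[node][(dest, neighbour)] estimates the delivery time to dest via neighbour.
    Q = defaultdict(lambda: defaultdict(float))

    def forward(node, dest, neighbours):
        # Pick the neighbour with the lowest estimated delivery time.
        return min(neighbours[node], key=lambda nb: Q[node][(dest, nb)])

    def q_routing_update(node, dest, nb, link_delay, neighbours, alpha=0.5):
        # The chosen neighbour reports its best remaining estimate to dest.
        remaining = min(Q[nb][(dest, nb2)] for nb2 in neighbours[nb]) if neighbours[nb] else 0.0
        target = link_delay + remaining
        Q[node][(dest, nb)] += alpha * (target - Q[node][(dest, nb)])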
76

Protocolo de Negociação Baseado em Aprendizagem-Q para Bolsa de Valores / Negotiation Protocol Based in Q-Learning for Stock Exchange

Cunha, Rafael de Souza 04 March 2013 (has links)
In this work, Multi-Agent System (MAS) technology is applied to the capital market, that is, to the stock exchange, specifically the Bolsa de Mercadorias e Futuros de São Paulo (BM&FBovespa). The research focuses mainly on the negotiation protocols involved and on the learning of the investor agents. In the competitive setting of the stock exchange, an agent that learns how to trade could become a differentiator for investors seeking ever higher profits. Decision-making based on historical data motivates other research in the same direction; here, however, a different approach was sought with regard to the representation of the states of the Q-learning algorithm. Reinforcement learning, and Q-learning in particular, has proved effective in environments with large amounts of historical data where decisions with positive outcomes should be rewarded. It is therefore possible to apply to the purchase and sale of shares an algorithm that rewards profit and punishes loss. Moreover, to achieve their goals the agents need to negotiate according to the specific protocols of the stock exchange. The rules of negotiation between agents that allow shares to be bought and sold were therefore also specified. Through the exchange of messages between agents it is possible to determine how trading will take place, and communication between them is made easier because the way it happens is standardized. Thus, given the specification of negotiation protocols based on Q-learning, this research provides the modeling of the intelligent agents and the learning and negotiation models required for the decision-making of the entities involved.
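Not part of the dissertation above: a minimal Python sketch of the reward idea the abstract describes, a tabular Q-learning trading step in which the reward is the change in portfolio value, so profit is rewarded and loss punished. The state encoding (signs of recent price moves), the single-asset setting and all names are illustrative assumptions.

    import random
    from collections import defaultdict

    ACTIONS = ("buy", "sell", "hold")
    Q = defaultdict(float)

    def encode_state(prices, window=5):
        recent = prices[-(window + 1):]
        return tuple(1 if b >= a else 0 for a, b in zip(recent, recent[1:]))

    def apply_action(action, position, cash, price):
        if action == "buy":
            position, cash = position + 1, cash - price
        elif action == "sell" and position > 0:
            position, cash = position - 1, cash + price
        return position, cash

    def q_update(s, a, reward, s_next, alpha=0.1, gamma=0.95):
        best = max(Q[(s_next, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best - Q[(s, a)])

    # Per time step t: choose epsilon-greedily from Q[(state, .)], apply the action, and use
    # reward = (cash + position * price[t+1]) - (previous cash + previous position * price[t]).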
77

卷積深度Q-學習之ETF自動交易系統 / Convolutional Deep Q-learning for ETF Automated Trading System

陳非霆, Chen, Fei-Ting Unknown Date (has links)
In this paper, we used a DCQN model, which combines reinforcement learning with a convolutional neural network, to train a trading system, in the hope that the system can decide on its own whether to buy or sell ETFs. Since ETFs are derivative financial products with high stability and high transaction fees, the system does not trade in real time; instead, it trades once every 20 trading days, predicting the trade from the data of those 20 trading days, with the aim of maximizing our future rewards. DQN is a reinforcement learning model that uses deep learning to predict the value of actions. By combining reinforcement learning's mechanism for updating action values with the strong learning capacity of deep learning, an artificial intelligence is obtained, and it achieved good results.
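Not the network from the paper: a rough PyTorch sketch of the general idea, a small convolutional Q-network that reads a 20-trading-day window of market features and outputs Q-values for three actions (for example buy, hold, sell). The number of features, the layer sizes and the action set are assumptions, and the standard DQN target shown in the comment is the textbook one, not necessarily the exact training setup used.

    import torch
    import torch.nn as nn

    class ConvQNet(nn.Module):
        def __init__(self, n_features=5, window=20, n_actions=3):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_features, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(), nn.Linear(32 * window, 64), nn.ReLU(), nn.Linear(64, n_actions)
            )

        def forward(self, x):          # x: (batch, n_features, window)
            return self.head(self.conv(x))

    # Standard DQN target for a transition (s, a, r, s'):
    #   target = r + gamma * max_a' Q_target(s', a')
    # with, e.g., nn.SmoothL1Loss() between Q(s)[a] and the target.
    net = ConvQNet()
    q_values = net(torch.randn(1, 5, 20))   # tensor of shape (1, 3)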
78

Distributed spectrum sensing and interference management for cognitive radios with low capacity control channels

Van Den Biggelaar, Olivier 05 October 2012 (has links)
Cognitive radios have been proposed as a new technology to counteract the spectrum scarcity issue and increase spectral efficiency. In cognitive radios, the sparsely used assigned frequency bands are opened to secondary users, provided that the interference induced on the primary licensees is negligible. Cognitive radios operate in two steps: the radios first sense the available frequency bands by detecting the presence of primary users, and then communicate using the bands identified as not in use by the primary users. In this thesis we investigate how to improve the efficiency of cognitive radio networks when multiple cognitive radios cooperate to sense the spectrum or control their interference. A major challenge in the design of cooperating devices lies in the need for exchange of information between these devices. Therefore, in this thesis we identify three specific types of control-information exchange whose efficiency can be improved. Specifically, we first study how cognitive radios can efficiently exchange sensing information with a coordinator node when the reporting channels are noisy. Then, we propose distributed learning algorithms that allocate the primary-network sensing times and the secondary transmission powers within the secondary network. Both distributed allocation algorithms minimize the need for information exchange compared with centralized allocation algorithms. Doctorate in Engineering Sciences.
79

[en] PESSIMISTIC Q-LEARNING: AN ALGORITHM TO CREATE BOTS FOR TURN-BASED GAMES / [pt] Q-LEARNING PESSIMISTA: UM ALGORITMO PARA GERAÇÃO DE BOTS DE JOGOS EM TURNOS

ADRIANO BRITO PEREIRA 25 January 2017 (has links)
This document presents a new reinforcement learning algorithm, Pessimistic Q-Learning. Our motivation is to solve the problem of generating bots able to play turn-based games and to contribute to better results through this extension of the Q-Learning algorithm. Pessimistic Q-Learning exploits the flexibility of the values computed by traditional Q-Learning without resorting to brute force. To measure the quality of the generated bot, we define quality as the sum of its potential to win and to draw a game. Our fundamental purpose is to generate good-quality bots for different games, so that the algorithm can be applied to families of turn-based games. We developed a framework called Wisebots and conducted experiments with scenarios from the traditional games TicTacToe, Connect-4 and CardPoints. Comparing the quality of Pessimistic Q-Learning with that of traditional Q-Learning, we observed gains of 0.8 per cent in TicTacToe, obtaining an algorithm that never loses, as well as gains of 35 per cent in Connect-4 and 27 per cent in CardPoints, raising both from the 50 to 60 per cent range to 90 to 100 per cent quality. These results illustrate the potential for improvement from using Pessimistic Q-Learning and suggest its application to various types of turn-based games.
80

Deep Reinforcement Learning for the Optimization of Combining Raster Images in Forest Planning

Wen, Yangyang January 2021 (has links)
Raster images represent the treatment options for how the forest will be cut. Economic benefits from cutting the forest are generated after a treatment is selected and executed. Existing raster images contain many small clusters, and this is the principal cause of overhead. If we can fully explore the relationships among the raster images and combine the old data sets with an optimization algorithm to generate a new raster image, the result will surpass the existing raster images and create higher economic benefits.

The question of this project is whether we can create a dynamic model in which the pixel being updated acts as an agent that selects options for an empty raster image in response to neighborhood environmental and landscape parameters. The project explores whether it is realistic to use deep reinforcement learning to generate new and superior raster images, and it aims to assess the feasibility, usefulness and effectiveness of deep reinforcement learning algorithms in optimizing the existing treatment options.

The problem was modeled as a Markov decision process in which the pixel to be updated is an agent of the empty raster image that determines the treatment option chosen for the current empty pixel. The project used a Deep Q-learning neural network model to calculate the Q-values, and a temporal-difference reinforcement learning algorithm was applied to predict future rewards and to update the model parameters.

After the modeling was completed, an experiment was set up to test the usefulness of the model, followed by a parameter-correlation experiment to test the correlation between the parameters and the benefit obtained by the model. Finally, the trained model was used to generate a larger raster image to test its effectiveness.
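Not from the thesis: a small PyTorch sketch of the temporal-difference training step such a model typically uses, where the state is the neighbourhood of the pixel being filled and the actions are treatment options. The 5x5 neighbourhood, the four treatment options, the network shape and all names are assumptions made for illustration.

    import torch
    import torch.nn as nn

    def td_target(reward, next_q_values, gamma=0.99, terminal=False):
        # r + gamma * max_a' Q(s', a'), with no bootstrap on the last pixel
        return reward if terminal else reward + gamma * next_q_values.max().item()

    # Tiny Q-network: scores 4 treatment options from a flattened 5x5 neighbourhood.
    q_net = nn.Sequential(nn.Flatten(), nn.Linear(5 * 5, 64), nn.ReLU(), nn.Linear(64, 4))
    opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

    def train_step(neigh, action, reward, next_neigh, terminal):
        # neigh / next_neigh: 5x5 tensors around the current / next empty pixel
        q = q_net(neigh.unsqueeze(0))[0, action]
        with torch.no_grad():
            target = td_target(reward, q_net(next_neigh.unsqueeze(0)), terminal=terminal)
        loss = nn.functional.smooth_l1_loss(q, torch.tensor(target))
        opt.zero_grad()
        loss.backward()
        opt.step()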
