91

MODEL-FREE ALGORITHMS FOR CONSTRAINED REINFORCEMENT LEARNING IN DISCOUNTED AND AVERAGE REWARD SETTINGS

Qinbo Bai (19804362) 07 October 2024 (has links)
Reinforcement learning (RL), which aims to train an agent to maximize its accumulated reward over time, has attracted much attention in recent years. Mathematically, RL is modeled as a Markov Decision Process, in which the agent interacts with the environment step by step. In practice, RL has been applied to autonomous driving, robotics, recommendation systems, and financial management. Although RL has been studied extensively in the literature, most proposed algorithms are model-based and require estimating the transition kernel. To this end, we study sample-efficient model-free algorithms under different settings.

Firstly, we propose a conservative stochastic primal-dual algorithm in the infinite-horizon discounted reward setting. The proposed algorithm converts the original problem from policy space to the occupancy-measure space, which makes the non-convex problem linear. We then advocate a randomized primal-dual approach to achieve O(ε^{-2}) sample complexity, which matches the lower bound.

However, the infinite-horizon average reward setting is more challenging, since the environment interaction never ends and cannot be reset, so the reward samples are no longer independent. To address this, we design an epoch-based policy-gradient algorithm. In each epoch, the whole trajectory is divided into multiple sub-trajectories separated by intervals that are long enough for the reward samples to be asymptotically independent. By controlling the lengths of the sub-trajectories and intervals, we obtain a good gradient estimator and prove that the proposed algorithm achieves an O(T^{3/4}) regret bound.
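
The occupancy-measure reformulation mentioned above can be made concrete with a toy example: for a known, finite MDP, the constrained problem becomes a linear program over occupancy measures. The sketch below (an invented 2-state, 2-action MDP with a single constraint) only illustrates that linearity; the thesis's algorithm is a sample-based primal-dual method that does not assume a known transition kernel.

```python
import numpy as np
from scipy.optimize import linprog

nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))                 # P[s, a, s']: transition kernel (made up)
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.3, 0.7]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # reward r(s, a)
g = np.array([[0.0, -1.0], [0.2, -0.5]])   # constraint utility g(s, a); require E[g] >= 0
mu = np.array([1.0, 0.0])                  # initial state distribution

# Variable: occupancy measure lam(s, a) >= 0, flattened to length nS * nA.
# Flow constraint: sum_a lam(s', a) = (1 - gamma) * mu(s') + gamma * sum_{s,a} P(s'|s,a) * lam(s,a)
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (1.0 if s == sp else 0.0) - gamma * P[s, a, sp]
b_eq = (1 - gamma) * mu

# Maximize r . lam  subject to  g . lam >= 0  (written as -g . lam <= 0).
res = linprog(c=-r.flatten(), A_ub=[-g.flatten()], b_ub=[0.0],
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
lam = res.x.reshape(nS, nA)
policy = lam / (lam.sum(axis=1, keepdims=True) + 1e-12)   # policy recovered from the occupancy measure
print("occupancy measure:\n", lam, "\npolicy:\n", policy)
```
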
92

Computing Quantiles in Markov Reward Models

Ummels, Michael, Baier, Christel 10 July 2014 (has links) (PDF)
Probabilistic model checking mainly concentrates on techniques for reasoning about the probabilities of certain path properties or the expected values of certain random variables. For quantitative system analysis, however, another interesting type of performance measure exists, namely quantiles. A typical quantile query takes as input a lower probability bound p ∈ ]0,1] and a reachability property. The task is then to compute the minimal reward bound r such that, with probability at least p, the target set is reached before the accumulated reward exceeds r. Quantiles are well known from mathematical statistics, but to the best of our knowledge they have not been addressed by the model-checking community so far. In this paper, we study the complexity of quantile queries for until properties in discrete-time finite-state Markov decision processes with nonnegative rewards on states. We show that qualitative quantile queries can be evaluated in polynomial time and present an exponential algorithm for the evaluation of quantitative quantile queries. For the special case of Markov chains, we show that quantitative quantile queries can be evaluated in pseudo-polynomial time.
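
For intuition, the pseudo-polynomial result for Markov chains can be sketched with a small dynamic program over reward budgets. The toy chain, and the assumption of strictly positive integer rewards on non-target states (which keeps the recursion acyclic; the paper treats the general nonnegative case), are illustrative only.

```python
import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.4, 0.0, 0.6],
              [0.0, 0.0, 1.0]])   # toy chain; state 2 is the (absorbing) target
rew = np.array([2, 1, 0])         # reward accumulated when a state is left
target, s0, p = {2}, 0, 0.9       # quantile query: lower probability bound p

def reach_prob(budget):
    """f[b, s] = probability of reaching the target from s with accumulated reward <= b."""
    f = np.zeros((budget + 1, len(rew)))
    for b in range(budget + 1):
        for s in range(len(rew)):
            if s in target:
                f[b, s] = 1.0
            elif rew[s] <= b:                  # positive rewards make the recursion acyclic
                f[b, s] = P[s] @ f[b - rew[s]]
    return f[budget, s0]

budget = 0
while reach_prob(budget) < p:     # smallest r with P(reach target before reward exceeds r) >= p
    budget += 1
print("quantile:", budget)
```
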
93

Optimisation des Systèmes Partiellement Observables dans les Réseaux Sans-fil : Théorie des jeux, Auto-adaptation et Apprentissage / Optimization of Partially Observable Systems in Wireless Networks : Game Theory, Self-adaptivity and Learning

Habachi, Oussama 28 September 2012 (has links)
Since delay-sensitive and bandwidth-intensive multimedia applications have emerged in the Internet, the demand for network resources has seen a steady increase during the last decade. Specifically, wireless networks have become pervasive and highly populated. These motivations are behind the problems considered in this dissertation. The topic of this PhD is the application of game theory, queueing theory and learning techniques in wireless networks under QoS constraints, especially in partially observable environments. We consider different layers of the protocol stack. In fact, we study Opportunistic Spectrum Access (OSA) at the Medium Access Control (MAC) layer through Cognitive Radio (CR) approaches. Thereafter, we focus on congestion control at the transport layer, and we develop congestion control mechanisms for the TCP protocol.
The roadmap of the research is as follows. Firstly, we focus on the MAC layer and seek optimal OSA strategies in CR networks. We consider that Secondary Users (SUs) take advantage of opportunities in licensed channels while ensuring a minimum level of QoS. In fact, SUs have the possibility to sense and access licensed channels, or to transmit their packets using a dedicated access (such as 3G). Therefore, an SU has two conflicting goals: seeking opportunities in licensed channels, at the cost of the energy spent sensing those channels, or transmitting over the dedicated channel without sensing, but with a higher transmission delay. We model the slotted and non-slotted systems using a queueing framework. Thereafter, we analyze the non-cooperative behavior of SUs, and we prove the existence of a Nash equilibrium (NE) strategy. Moreover, we measure the performance gap between the centralized and decentralized systems using the Price of Anarchy (PoA). Even though OSA at the MAC layer was deeply investigated in the last decade, the performance of SUs, such as energy consumption or Quality of Service (QoS) guarantees, was somewhat ignored. Therefore, we study OSA taking energy consumption and delay into account. We first consider one SU that opportunistically accesses licensed channels or transmits its packets through a dedicated channel. Due to partial spectrum sensing, the state of the spectrum is only partially observable. Therefore, we use the Partially Observable Markov Decision Process (POMDP) framework to design an optimal OSA policy for SUs. Specifically, we derive structural properties of the value function, and we prove that the optimal OSA policy has a threshold structure. Thereafter, we extend the model to the context of multiple SUs. We study the non-cooperative behavior of SUs and we prove the existence of an NE. Moreover, we highlight a paradox in this situation: more opportunities in the licensed spectrum may lead to worse performance for SUs. We then focus on spectrum management issues: we introduce a spectrum manager into the model, and we analyze the hierarchical game between the network manager and the SUs. Finally, we focus on the transport layer and study congestion control for wireless networks under QoS and Quality of Experience (QoE) constraints. Firstly, we propose a congestion control algorithm that takes application parameters and multimedia quality into account. In fact, we consider that network users maximize their expected multimedia quality by choosing the congestion control strategy. Since users do not observe the congestion status at bottleneck links, we use a POMDP framework to determine the optimal congestion control strategy. Thereafter, we consider a subjective measure of multimedia quality and propose a QoE-based congestion control algorithm, which adapts the congestion window size based on QoE feedback from receivers. Note that the proposed algorithms rely on learning methods in order to cope with the complexity of solving POMDP problems.
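
As a rough illustration of the partially observable ingredient, the snippet below maintains a secondary user's belief that a single licensed channel is idle when its occupancy evolves as a two-state Markov chain and sensing is noisy. The transition matrix and error probabilities are invented; the thesis builds threshold-structured OSA policies on top of this kind of belief, which is not reproduced here.

```python
import numpy as np

T = np.array([[0.8, 0.2],      # T[i, j] = P(next channel state j | current state i); 0 = idle, 1 = busy
              [0.3, 0.7]])
miss, false_alarm = 0.1, 0.05  # P(sense idle | busy), P(sense busy | idle)

def update(belief_idle, sensed_idle):
    """One slot: propagate the belief through the channel dynamics, then apply Bayes' rule."""
    prior = np.array([belief_idle, 1 - belief_idle]) @ T
    like = np.array([1 - false_alarm, miss]) if sensed_idle else np.array([false_alarm, 1 - miss])
    post = like * prior
    return post[0] / post.sum()

b = 0.5                        # initial belief that the channel is idle
for sensed in [True, True, False]:
    b = update(b, sensed)
    print(f"P(idle) = {b:.3f}")
```
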
94

Descoberta e reuso de políticas parciais probabilísticas no aprendizado por reforço. / Discovery and reuse of probabilistic partial policies in reinforcement learning.

Bonini, Rodrigo Cesar 21 November 2018 (has links)
Reinforcement learning is a successful yet slow technique for training autonomous agents. Option-based solutions can be used to accelerate learning and to transfer learned behaviors across tasks by encapsulating a partial policy. However, these options are commonly specific to a single task, do not take into account features shared between tasks, and may not correspond exactly to an optimal behavior when transferred to a different task. Careless transfer might therefore provide bad options to the agent, hampering the learning process. This work proposes a way to discover and probabilistically reuse learned object-oriented options, in order to enable better action choices for the agent across multiple different tasks. The experimental evaluation shows that the proposal is able to learn and successfully reuse options across different tasks.
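
One generic way to realize "probabilistic reuse" of transferred options is sketched below: the agent samples among previously learned options with probabilities that track how much return each option has yielded in the current task. The class, the softmax rule, and the update are illustrative assumptions, not the dissertation's object-oriented option-discovery algorithm.

```python
import math
import random

class OptionReuse:
    """Keeps a usefulness estimate per transferred option and samples among them."""

    def __init__(self, options, temperature=1.0):
        self.options = list(options)                        # names of previously learned options
        self.gain = {name: 0.0 for name in self.options}    # running average return per option
        self.uses = {name: 0 for name in self.options}
        self.temperature = temperature

    def choose_option(self):
        # Softmax over the estimated gains: better options are reused more often,
        # but every option keeps a nonzero probability of being tried.
        weights = [math.exp(self.gain[n] / self.temperature) for n in self.options]
        return random.choices(self.options, weights=weights, k=1)[0]

    def record_return(self, name, episode_return):
        # Incremental average of the return obtained while following the option.
        self.uses[name] += 1
        self.gain[name] += (episode_return - self.gain[name]) / self.uses[name]

reuse = OptionReuse(["go_to_door", "fetch_key", "avoid_pit"])   # hypothetical option names
for _ in range(5):
    opt = reuse.choose_option()
    reuse.record_return(opt, random.random())                   # stand-in for an episode's return
print(reuse.gain)
```
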
95

Tutor de ensino: módulo de agente de avaliação do comportamento de alunos no aprendizado em cursos de engenharia / Teaching tutor: evaluation agent module students behavior learning in engineering courses.

Santos, Valdomiro dos 15 June 2016 (has links)
Student behavior and academic performance in engineering programs is a fertile, interesting and growing field of research. This work presents the results of a stochastic analysis of student progress in 15 undergraduate programs offered by the Escola Politécnica of the University of São Paulo (EPUSP). To perform this analysis, an evaluation agent was developed based on the Markov Decision Process (MDP). This evaluation agent extracts partial observations of the current state of students' grades in the courses taken, enabling the identification of appropriate actions to autonomously model the student's future behavior. The algorithm estimates the effort representing the student's cognitive state from state/action pairs, computed from the grades obtained between 2000 and 2010. The period in which a student obtained a passing grade makes a temporal study of this event possible, allowing the use of data-clustering methods, such as hidden Markov models, to evaluate the behavior of students' grades throughout the engineering programs. The study groups students' grades into three levels in order to classify the behavior of those grades.
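
To make the hidden-Markov-model ingredient concrete, the snippet below evaluates the likelihood of a sequence of grade levels (grouped into three levels, as in the study) under a 3-state HMM using the forward algorithm. All probabilities are placeholders, not values estimated from EPUSP data.

```python
import numpy as np

A = np.array([[0.7, 0.2, 0.1],    # hidden-state transitions
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
B = np.array([[0.6, 0.3, 0.1],    # P(observed grade level | hidden state)
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
pi = np.array([1/3, 1/3, 1/3])    # initial hidden-state distribution

def sequence_likelihood(obs):
    """Forward algorithm: probability of an observed sequence (0=low, 1=medium, 2=high)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

print(sequence_likelihood([0, 1, 1, 2, 2]))   # e.g. a student climbing from low to high grades
```
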
96

Planejamento probabilístico sensível a risco com ILAO* e função utilidade exponencial / Probabilistic risk-sensitive planning with ILAO* and exponential utility function

Freitas, Elthon Manhas de 18 October 2018 (has links)
Markov Decision Processes (MDPs) have been used to solve sequential decision-making problems. In some problems, dealing with environmental risks to obtain a reliable result is more important than maximizing the expected average return. MDPs that deal with this type of problem are called risk-sensitive Markov decision processes (RSMDPs). Among the several variations of RSMDPs are works based on exponential utility that use a risk factor, which models the agent's risk attitude, either risk-prone or risk-averse. The algorithms in the literature for solving this type of RSMDP are inefficient compared to other MDP algorithms. This project presents a solution that can be used on larger problems, both by performing computations only on states relevant to reaching a set of goal states from an initial state and by allowing the processing of numbers whose exponents are too large for current computational environments. The experiments show that (i) the proposed algorithm is more efficient than state-of-the-art algorithms for RSMDPs; and (ii) the LogSumExp technique solves the problem of working with very large exponents in RSMDPs.
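
The LogSumExp idea can be illustrated with a small risk-sensitive value iteration under an exponential utility: the Bellman backup involves log E[exp(λ·V)], which the log-sum-exp trick evaluates without overflowing exp(). The tiny MDP, the cost values and the risk factor below are invented, and plain value iteration stands in for the ILAO*-style search the thesis actually uses.

```python
import numpy as np

def logsumexp(x, w):
    """log(sum(w * exp(x))) computed stably by factoring out the maximum."""
    m = np.max(x)
    return m + np.log(np.sum(w * np.exp(x - m)))

nS, nA = 3, 2
C = np.array([[1.0, 2.0], [1.5, 0.5], [0.0, 0.0]])   # cost per (state, action); state 2 is the goal
P = np.zeros((nS, nA, nS))
P[0, 0] = [0.1, 0.8, 0.1]; P[0, 1] = [0.0, 0.5, 0.5]
P[1, 0] = [0.2, 0.2, 0.6]; P[1, 1] = [0.0, 0.1, 0.9]
P[2, :, 2] = 1.0                                      # goal is absorbing and cost-free
lam = 1.0   # risk factor (> 0 = risk-averse for costs); must be small enough for the criterion to stay finite

V = np.zeros(nS)
for _ in range(200):                                  # risk-sensitive value iteration
    Q = np.empty((nS, nA))
    for s in range(nS):
        for a in range(nA):
            # Certainty-equivalent backup: C(s,a) + (1/lam) * log E[exp(lam * V(s'))]
            Q[s, a] = C[s, a] + logsumexp(lam * V, P[s, a]) / lam
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print("risk-sensitive values:", V)
```
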
97

Aprendizado por reforço em lote: um estudo de caso para o problema de tomada de decisão em processos de venda / Batch reinforcement learning: a case study for the problem of decision making in sales processes

Lacerda, Dênis Antonio 12 December 2013 (has links)
Probabilistic planning studies the sequential decision-making problems of an agent whose actions have probabilistic effects, modeled as a Markov decision process (MDP). Given the probabilistic state-transition function and the reward values of the actions, it is possible to determine an action policy (that is, a mapping from environment states to agent actions) that maximizes the expected accumulated reward (or minimizes the expected accumulated cost) of executing a sequence of actions. When the MDP model is not completely known, the best policy must be learned through the agent's interaction with the real environment, a process called reinforcement learning. However, in applications where experiments in the real environment are not allowed, such as sales operations, reinforcement learning can be performed on a sample of past experiences, a process called batch reinforcement learning. In this work, we study batch reinforcement learning techniques that learn from a history of past interactions stored in a process database, and we propose some ways to improve the existing algorithms. As a case study, we apply this technique to learning policies for the sales process of large-format printers, with the goal of building an action recommendation system for novice sellers.
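
A standard batch reinforcement-learning method that fits this description is fitted Q-iteration, sketched below over a hand-made log of transitions with a scikit-learn regressor. It is meant only as a generic illustration of learning from a fixed batch; the transition log, features and regressor choice are assumptions, not the dissertation's sales-process data or algorithm.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Logged transitions: (state_features, action, reward, next_state_features, done)
batch = [
    (np.array([0.1, 0.0]), 0, 0.0, np.array([0.3, 0.1]), False),
    (np.array([0.3, 0.1]), 1, 1.0, np.array([0.7, 0.4]), False),
    (np.array([0.7, 0.4]), 1, 5.0, np.array([1.0, 1.0]), True),
    (np.array([0.1, 0.0]), 1, 0.0, np.array([0.2, 0.0]), False),
]
actions, gamma, n_iters = [0, 1], 0.95, 20

S = np.array([np.append(s, a) for s, a, *_ in batch])   # (state, action) regression inputs
R = np.array([r for _, _, r, _, _ in batch])
S_next = [sn for *_, sn, _ in batch]
done = np.array([d for *_, d in batch])

model = None
for _ in range(n_iters):                                 # fitted Q-iteration
    if model is None:
        targets = R                                      # first iteration: Q = immediate reward
    else:
        q_next = np.array([[model.predict([np.append(sn, a)])[0] for a in actions]
                           for sn in S_next])
        targets = R + gamma * (1 - done) * q_next.max(axis=1)
    model = ExtraTreesRegressor(n_estimators=50).fit(S, targets)

def greedy_action(state):
    """Recommend the action with the highest fitted Q-value."""
    return max(actions, key=lambda a: model.predict([np.append(state, a)])[0])

print(greedy_action(np.array([0.3, 0.1])))
```
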
98

Combinatorial optimization and Markov decision process for planning MRI examinations / Planification des examens IRM à l'aide de processus de décision markovien et optimisation combinatoire

Geng, Na 29 April 2010 (has links)
This research is motivated by our collaboration with a large French university teaching hospital in order to reduce the Length of Stay (LoS) of stroke patients treated in the neurovascular department. Quick diagnosis is critical for stroke patients but relies on expensive and heavily used imaging facilities such as MRI (Magnetic Resonance Imaging) scanners. Therefore, it is very important for the neurovascular department to reduce the patient LoS by reducing their waiting times for imaging examinations. From the neurovascular department's perspective, this thesis proposes a new MRI examination reservation process in order to reduce patient waiting times without degrading the utilization of the MRI. The service provider, i.e., the imaging department, reserves each week a certain number of appropriately distributed contracted time slots (CTS) for the neurovascular department to ensure quick MRI examination of stroke patients. In addition to CTS, it is still possible for stroke patients to get MRI time slots through regular reservation (RTS). This thesis first proposes a stochastic programming model to simultaneously determine the contract decision, i.e., the number of CTS and their distribution, and the patient assignment policy that assigns patients to either CTS or RTS. To solve this problem, structural properties of the optimal patient assignment policy for a given contract are proved by an average-cost Markov decision process (MDP) approach. The contract is then determined by a Monte Carlo approximation approach and improved by local search. Computational experiments show that the proposed algorithms can efficiently solve the model. The new reservation process greatly reduces the average waiting time of stroke patients; at the same time, some CTS remain unused for lack of patients. To reduce the unused CTS, we further explore the possibility of advance cancellation of CTS. Structural properties of optimal control policies for one-day and two-day advance cancellation are established separately via an average-cost MDP approach, with appropriate modeling and advanced convexity concepts used in the control of queueing systems. Computational experiments show that appropriate advance cancellations of CTS greatly reduce the unused CTS with nearly the same waiting times.
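
To give a feel for the contract-evaluation step, the sketch below scores a candidate weekly CTS contract by Monte Carlo simulation, reporting average waiting time and unused contracted slots. The arrival rate, the RTS fallback delay, and the earliest-slot assignment rule are invented simplifications; the thesis instead combines such simulation-based evaluation with an MDP-derived assignment policy and local search over contracts.

```python
import random

def poisson(lam):
    """Knuth's method for sampling a Poisson count; adequate for small rates."""
    L, k, p = pow(2.718281828459045, -lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def evaluate_contract(cts_per_day, weeks=2000, arrival_rate=1.2, rts_delay=4.0):
    """cts_per_day: list of 7 ints, contracted MRI slots reserved on each weekday."""
    total_wait, unused, patients = 0.0, 0, 0
    for _ in range(weeks):
        free = list(cts_per_day)                 # remaining CTS per day this week
        for day in range(7):
            for _ in range(poisson(arrival_rate)):
                patients += 1
                # assign to the earliest later day that still has a contracted slot
                slot_day = next((d for d in range(day, 7) if free[d] > 0), None)
                if slot_day is not None:
                    free[slot_day] -= 1
                    total_wait += slot_day - day
                else:
                    total_wait += rts_delay      # fall back to a regular reservation
        unused += sum(free)
    return total_wait / patients, unused / weeks

print(evaluate_contract([1, 1, 1, 1, 1, 0, 0]))  # (average wait in days, unused CTS per week)
```
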
