121

Aprendizado por reforço em lote: um estudo de caso para o problema de tomada de decisão em processos de venda / Batch reinforcement learning: a case study for the problem of decision making in sales processes

Dênis Antonio Lacerda 12 December 2013 (has links)
Planejamento Probabilístico estuda os problemas de tomada de decisão sequencial de um agente, em que as ações possuem efeitos probabilísticos, modelados como um processo de decisão markoviano (Markov Decision Process - MDP). Dadas a função de transição de estados probabilística e os valores de recompensa das ações, é possível determinar uma política de ações (i.e., um mapeamento entre estado do ambiente e ações do agente) que maximiza a recompensa esperada acumulada (ou minimiza o custo esperado acumulado) pela execução de uma sequência de ações. Nos casos em que o modelo MDP não é completamente conhecido, a melhor política deve ser aprendida através da interação do agente com o ambiente real. Este processo é chamado de aprendizado por reforço. Porém, nas aplicações em que não é permitido realizar experiências no ambiente real, por exemplo, operações de venda, é possível realizar o aprendizado por reforço sobre uma amostra de experiências passadas, processo chamado de aprendizado por reforço em lote (Batch Reinforcement Learning). Neste trabalho, estudamos técnicas de aprendizado por reforço em lote usando um histórico de interações passadas, armazenadas em um banco de dados de processos, e propomos algumas formas de melhorar os algoritmos existentes. Como um estudo de caso, aplicamos esta técnica no aprendizado de políticas para o processo de venda de impressoras de grande formato, cujo objetivo é a construção de um sistema de recomendação de ações para vendedores iniciantes. / Probabilistic planning studies the problems of sequential decision-making of an agent, in which actions have probabilistic effects, and can be modeled as a Markov decision process (MDP). Given the transition probabilities and reward values of each action, it is possible to determine an action policy (in other words, a mapping between the state of the environment and the agent's actions) that maximizes the expected reward accumulated by executing a sequence of actions. In cases where the MDP model is not completely known, the best policy must be learned through the agent's interaction with the real environment. This process is called reinforcement learning. However, in applications where experiments in the real environment are not allowed, for example sales operations, reinforcement learning can be performed on a sample of past experiences. This process is called batch reinforcement learning. In this work, we study batch reinforcement learning techniques in which learning is done using a history of past interactions stored in a process database, and we propose some ways to improve existing algorithms. As a case study, we apply this technique to learn policies for the sales process of large-format printers, with the goal of building an action recommendation system for novice sellers.
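For readers who want a concrete picture of the batch setting described above, the following minimal sketch applies a Fitted-Q-Iteration-style backup to a fixed set of logged transitions. It is not the algorithm developed in the thesis; the tabular encoding and the toy batch are assumptions made purely for illustration.

```python
import numpy as np

def batch_q_iteration(transitions, n_states, n_actions, gamma=0.95, sweeps=50):
    """Tabular sketch in the spirit of Fitted Q-Iteration: the Bellman optimality
    backup is applied repeatedly, but only over a fixed batch of logged
    transitions (no interaction with the real environment)."""
    q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        targets = np.zeros_like(q)
        counts = np.zeros_like(q)
        for s, a, r, s_next, done in transitions:
            y = r if done else r + gamma * q[s_next].max()
            targets[s, a] += y
            counts[s, a] += 1
        seen = counts > 0
        q[seen] = targets[seen] / counts[seen]  # regression step collapses to a mean
    return q, q.argmax(axis=1)  # greedy policy to recommend to novice sellers

# toy batch of (state, action, reward, next_state, done) tuples from a log
batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 1, True), (0, 0, 0.2, 0, False)]
q_values, policy = batch_q_iteration(batch, n_states=2, n_actions=2)
print(policy)
```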
122

Deep Reinforcement Learning for the Optimization of Combining Raster Images in Forest Planning

Wen, Yangyang January 2021 (has links)
Raster images represent treatment options for how the forest will be cut; economic benefits are generated once a treatment is selected and executed. Existing raster images contain many small, fragmented clusters, which is the principal cause of overhead. If we can fully explore the relationships among the raster images and combine the old data sets with an optimization algorithm to generate a new raster image, the result can surpass the existing raster images and create higher economic benefits. The question of this project is whether we can create a dynamic model that treats the pixel being updated as an agent selecting options for an empty raster image in response to neighborhood environmental and landscape parameters. The project explores whether it is realistic to use deep reinforcement learning to generate new and superior raster images, and it aims to assess the feasibility, usefulness, and effectiveness of deep reinforcement learning algorithms in optimizing existing treatment options. The problem was modeled as a Markov decision process in which the pixel to be updated acts as an agent on the empty raster image and determines the treatment option for the current empty pixel. A deep Q-learning neural network was used to compute the Q values, and a temporal-difference reinforcement learning algorithm was applied to predict future rewards and update the model parameters. After the modeling was completed, a model-usefulness experiment tested the usefulness of the model, a parameter-correlation experiment tested the correlation between the parameters and the benefit of the model, and finally the trained model was used to generate a larger raster image to test its effectiveness.
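As a point of reference for the temporal-difference update mentioned in the abstract, here is a minimal tabular Q-learning sketch with an epsilon-greedy choice of treatment option. The thesis itself uses a deep Q network; the tabular form, state/action sizes, and numbers below are illustrative assumptions.

```python
import numpy as np

def td_q_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One temporal-difference (Q-learning) backup: move Q(s, a) toward the
    bootstrapped target r + gamma * max_a' Q(s_next, a')."""
    target = r if done else r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])
    return q

def epsilon_greedy(q, s, epsilon=0.1):
    """Pick a treatment option for the current pixel: explore with prob. epsilon."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(q.shape[1]))
    return int(np.argmax(q[s]))

# toy usage: 4 neighborhood "states", 3 candidate treatment options
q_table = np.zeros((4, 3))
a = epsilon_greedy(q_table, s=0)
q_table = td_q_update(q_table, s=0, a=a, r=1.5, s_next=1, done=False)
```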
123

On Non-Classical Stochastic Shortest Path Problems

Piribauer, Jakob 13 October 2021 (has links)
The stochastic shortest path problem lies at the heart of many questions in the formal verification of probabilistic systems. It asks to find a scheduler resolving the non-deterministic choices in a weighted Markov decision process (MDP) that minimizes or maximizes the expected accumulated weight before a goal state is reached. In the classical setting, it is required that the scheduler ensures that a goal state is reached almost surely. For the analysis of systems without guarantees on the occurrence of an event of interest (reaching a goal state), however, schedulers that miss the goal with positive probability are of interest as well. We study two non-classical variants of the stochastic shortest path problem that drop the restriction that the goal has to be reached almost surely. These variants ask for the optimal partial expectation, obtained by assigning weight 0 to paths not reaching the goal, and the optimal conditional expectation under the condition that the goal is reached, respectively. Both variants have only been studied in structures with non-negative weights. We prove that the decision versions of these non-classical stochastic shortest path problems in MDPs with arbitrary integer weights are at least as hard as the Positivity problem for linear recurrence sequences. This Positivity problem is an outstanding open number-theoretic problem, closely related to the famous Skolem problem. A decidability result for the Positivity problem would imply a major breakthrough in analytic number theory. The proof technique we develop can be applied to a series of further problems. In this way, we obtain Positivity-hardness results for problems addressing the termination of one-counter MDPs, the satisfaction of energy objectives, the satisfaction of cost constraints and the computation of quantiles, the conditional value-at-risk – an important risk measure – for accumulated weights, and the model-checking problem of frequency-LTL. Despite these Positivity-hardness results, we show that the optimal values for the non-classical stochastic shortest path problems can be achieved by weight-based deterministic schedulers and that the optimal values can be approximated in exponential time. In MDPs with non-negative weights, it is known that optimal partial and conditional expectations can be computed in exponential time. These results rely on the existence of a saturation point, a bound on the accumulated weight above which optimal schedulers can behave memorylessly. We improve the result for partial expectations by showing that the least possible saturation point can be computed efficiently. Further, we show that a simple saturation point also allows us to compute the optimal conditional value-at-risk for the accumulated weight in MDPs with non-negative weights. Moreover, we introduce the notions of long-run probability and long-run expectation addressing the long-run behavior of a system. These notions quantify the long-run average probability that a path property is satisfied on a suffix of a run and the long-run average expected amount of weight accumulated before the next visit to a target state, respectively. We establish considerable similarities of the corresponding optimization problems with non-classical stochastic shortest path problems. On the one hand, we show that the threshold problem for optimal long-run probabilities of regular co-safety properties is Positivity-hard via the Positivity-hardness of non-classical stochastic shortest path problems.
On the other hand, we show that optimal long-run expectations in MDPs with arbitrary integer weights and long-run probabilities of constrained reachability properties (a U b) can be computed in exponential time using the existence of a saturation point.
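To make the two non-classical objectives concrete, the hedged sketch below estimates the partial and the conditional expectation of the accumulated weight by simulation in a tiny weighted Markov chain (an MDP with the scheduler already fixed). The chain, its weights, and the step cutoff are invented for illustration and are not taken from the thesis.

```python
import random

# toy weighted Markov chain: transitions[state] = list of (probability, weight, next_state)
transitions = {
    "s0": [(0.6, 2, "goal"), (0.4, -1, "s1")],
    "s1": [(0.5, 3, "goal"), (0.5, 0, "sink")],   # "sink" never reaches the goal
}

def sample_run(start="s0", max_steps=100):
    state, total = start, 0
    for _ in range(max_steps):
        if state in ("goal", "sink"):
            break
        r = random.random()
        for p, w, nxt in transitions[state]:
            if r < p:
                total += w
                state = nxt
                break
            r -= p
    return state == "goal", total

def estimate(n=100_000):
    reached, partial_sum, cond_sum = 0, 0.0, 0.0
    for _ in range(n):
        ok, w = sample_run()
        if ok:
            reached += 1
            cond_sum += w
        partial_sum += w if ok else 0          # weight 0 for runs missing the goal
    partial = partial_sum / n                  # partial expectation
    conditional = cond_sum / max(reached, 1)   # conditional expectation given the goal is reached
    return partial, conditional

print(estimate())
```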
124

Cognitive Modeling for Human-Automation Interaction: A Computational Model of Human Trust and Self-Confidence

Katherine Jayne Williams (11517103) 22 November 2021 (has links)
Across a range of sectors, including transportation and healthcare, the use of automation to assist humans with increasingly complex tasks is also demanding that such systems are more interactive with human users. Given the role of cognitive factors in human decision-making during their interactions with automation, models enabling human cognitive state estimation and prediction could be used by autonomous systems to appropriately adapt their behavior. However, accomplishing this requires mathematical models of human cognitive state evolution that are suitable for algorithm design. In this thesis, a computational model of coupled human trust and self-confidence dynamics is proposed. The dynamics are modeled as a partially observable Markov decision process that leverages behavioral and self-report data as observations for estimation of the cognitive states. The use of an asymmetrical structure in the emission probability functions enables labeling and interpretation of the coupled cognitive states. The model is trained and validated using data collected from 340 participants. Analysis of the transition probabilities shows that the model captures nuanced effects, in terms of participants' decisions to rely on an autonomous system, that result as a function of the combination of their trust in the automation and self-confidence. Implications for the design of human-aware autonomous systems are discussed, particularly in the context of human trust and self-confidence calibration.
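For context on how a POMDP of this kind turns observations into cognitive-state estimates, below is a generic Bayes belief-update sketch. The two-level (trust, self-confidence) state space and all probability values are illustrative assumptions, not the trained model from the thesis.

```python
import numpy as np

# hypothetical coupled (trust, self-confidence) states, each low/high -> 4 states
states = ["low/low", "low/high", "high/low", "high/high"]

# T[a][s, s']: transition probabilities under action a (illustrative numbers)
T = {"rely": np.full((4, 4), 0.25), "not_rely": np.full((4, 4), 0.25)}
# O[a][s', o]: emission probabilities of observation o after action a (illustrative)
O = {"rely": np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]),
     "not_rely": np.array([[0.5, 0.5]] * 4)}

def belief_update(belief, action, observation):
    """Standard POMDP belief update: b'(s') is proportional to
    O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    predicted = T[action].T @ belief                 # predict next-state distribution
    updated = O[action][:, observation] * predicted  # weight by observation likelihood
    return updated / updated.sum()                   # normalize

b = np.full(4, 0.25)             # start uninformed over the 4 cognitive states
b = belief_update(b, "rely", 0)  # e.g. observed "relied on automation" (obs index 0)
print(dict(zip(states, b.round(3))))
```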
125

Leveraging Help Requests in POMDP Intelligent Tutors

Folsom-Kovarik, Jeremiah 01 January 2012 (has links)
Intelligent tutoring systems (ITSs) are computer programs that model individual learners and adapt instruction to help each learner differently. One way ITSs differ from human tutors is that few ITSs give learners a way to ask questions. When learners can ask for help, their questions have the potential to improve learning directly and also act as a new source of model data to help the ITS personalize instruction. Inquiry modeling gives ITSs the ability to answer learner questions and refine their learner models with an inexpensive new input channel. In order to support inquiry modeling, an advanced planning formalism is applied to ITS learner modeling. Partially observable Markov decision processes (POMDPs) differ from more widely used ITS architectures because they can plan complex action sequences in uncertain situations with machine learning. Tractability issues have previously precluded POMDP use in ITS models. This dissertation introduces two improvements, priority queues and observation chains, to make POMDPs scale well and encompass the large problem sizes that real-world ITSs must confront. A new ITS was created to support trainees practicing a military task in a virtual environment. The development of the Inquiry Modeling POMDP Adaptive Trainer (IMP) began with multiple formative studies on human and simulated learners that explored inquiry modeling and POMDPs in intelligent tutoring. The studies suggest the new POMDP representations will be effective in ITS domains having certain common characteristics. Finally, a summative study evaluated IMP’s ability to train volunteers in specific practice scenarios. IMP users achieved post-training scores averaging up to 4.5 times higher than users who practiced without support and up to twice as high as trainees who used an ablated version of IMP with no inquiry modeling. IMP’s implementation and evaluation helped explore questions about how inquiry modeling and POMDP ITSs work, while empirically demonstrating their efficacy.
126

Random Edge is not faster than Random Facet on Linear Programs / Random Edge är inte snabbare än Random Facet på linjära program

Hedblom, Nicole January 2023 (has links)
A Linear Program is a problem where the goal is to maximize a linear function subject to a set of linear inequalities. Geometrically, this can be rephrased as finding the highest point on a polyhedron. The Simplex method is a commonly used algorithm to solve Linear Programs. It traverses the vertices of the polyhedron, and in each step, it selects one adjacent better vertex and moves there. There can be multiple vertices to choose from, and therefore the Simplex method has different variants deciding how the next vertex is selected. One of the most natural variants is Random Edge, which in each step of the Simplex method uniformly at random selects one of the better adjacent vertices. It is interesting and non-trivial to study the complexity of variants of the Simplex method in the number of variables, d, and inequalities, N. In 2011, Friedmann, Hansen, and Zwick found a class of Linear Programs for which the Random Edge algorithm is subexponential with complexity 2^Ω(N^(1/4)), where d=Θ(N). Previously all known lower bounds were polynomial. We give an improved lower bound of 2^Ω(N^(1/2)), for Random Edge on Linear Programs where d=Θ(N). Another well studied variant of the Simplex method is Random Facet. It is upper bounded by 2^O(N^(1/2)) when d=Θ(N). Thus we prove that Random Edge is not faster than Random Facet on Linear Programs where d=Θ(N). Our construction is very similar to the previous construction of Friedmann, Hansen and Zwick. We construct a Markov Decision Process which behaves like a binary counter with linearly many levels and linearly many nodes on each level. The new idea is a new type of delay gadget which can switch quickly from 0 to 1 in some circumstances, leading to fewer nodes needed on each level of the construction. The key idea is that it is worth taking a large risk of getting a small negative reward if the potential positive reward is large enough in comparison. / Ett linjärt program är ett problem där målet är att maximiera en linjär funktion givet en mängd linjära olikheter. Geometriskt kan detta omformuleras som att hitta den högsta punkten på en polyeder. Simplexmetoden är en algoritm som ofta används för att lösa linjära program. Den besöker hörnen i polyedern, och i varje steg väljer den ett närliggande bättre hörn och flyttar dit. Det kan finnas flera hörn att välja mellan, och därför finns det olika varianter av simplexmetoden som bestämmer hur nästa hörn ska väljas. En av de mest naturliga varianterna är Random Edge, som i varje steg av simplexmetoden, uniformt slumpmässigt väljer ett av de närliggande bättre hörnen. Det är intressant och icke-trivialt att studera komplexiteten av olika varianter av simplexmetoden i antalet variabler, d, och olikheter N. År 2011 hittade Friedmann, Hansen och Zwick en familj av linjära program där Random Edge är subexponentiell med komplexitet 2^Ω(N^(1/4)), där d=Θ(N). Innan dess var alla kända undre gränser polynomiska. Vi ger en förbättrad undre gräns på 2^Ω(N^(1/2)), för Random Edge på linjära program där d=Θ(N). En annan välstuderad variant av simplexmetoden är Random Facet. Dess komplexitet har en övre gräns på 2^O(N^(1/2)) när d=Θ(N). Alltså bevisar vi att Random Edge inte är snabbare än Random Facet på linjära program där d=Θ(N). Vår konstruktion är väldigt lik den tidigare konstruktionen av Friedmann, Hansen och Zwick. Vi konstruerar en Markov-beslutsprocess som beter sig som en binär räknare med linjärt många nivåer och linjärt många noder på varje nivå. 
Den nya idén är en ny typ av försenings-multinod som kan byta snabbt från 0 till 1 i vissa fall, vilket leder till att det behövs färre noder på varje nivå av konstruktionen. Nyckelidén är att det är värt att ta en stor risk att få en liten negativ poäng om den potentiella positiva poängen är stor nog i jämförelse.
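To illustrate the Random Edge rule described above, the following toy sketch runs it on an abstract vertex-adjacency graph with an objective value per vertex: in each step it moves to a uniformly random strictly better neighbor. The graph and values are made up and unrelated to the lower-bound construction in the thesis.

```python
import random

# toy vertex-adjacency graph of a polytope with an objective value per vertex
neighbors = {
    "v0": ["v1", "v2"],
    "v1": ["v0", "v3"],
    "v2": ["v0", "v3"],
    "v3": ["v1", "v2"],
}
objective = {"v0": 0, "v1": 2, "v2": 1, "v3": 5}

def random_edge(start):
    """Random Edge pivot rule: while an improving neighbor exists,
    move to one chosen uniformly at random."""
    v = start
    path = [v]
    while True:
        better = [u for u in neighbors[v] if objective[u] > objective[v]]
        if not better:
            return path            # no improving neighbor: optimal vertex reached
        v = random.choice(better)  # the uniform random choice that defines the rule
        path.append(v)

print(random_edge("v0"))
```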
127

Risk-aware Autonomous Driving Using POMDPs and Responsibility-Sensitive Safety / POMDP-modellerad Riskmedveten Autonom Körning med Riskmått

Skoglund, Caroline January 2021 (has links)
Autonomous vehicles promise to play an important role in increasing the efficiency and safety of road transportation. Although several examples of autonomous vehicles have appeared on the road over the past years, ensuring the safety of an autonomous vehicle in an uncertain and dynamic environment remains a challenging problem. This thesis studies this problem by developing a risk-aware decision-making framework. The system that integrates the dynamics of the autonomous vehicle and the uncertain environment is modelled as a Partially Observable Markov Decision Process (POMDP). A risk measure is proposed based on the Responsibility-Sensitive Safety (RSS) distance, which quantifies the minimum distance to other vehicles required to ensure safety. This risk measure is incorporated into the reward function of the POMDP to achieve risk-aware decision making. The proposed risk-aware POMDP framework is evaluated in two case studies. In a single-lane car-following scenario, the ego vehicle successfully avoids a collision in an emergency event where the vehicle in front of it makes a full stop. In a merge scenario, the ego vehicle successfully enters the main road from a ramp with a satisfactory distance to other vehicles. In conclusion, the risk-aware POMDP framework realizes a trade-off between safety and usability by keeping a reasonable distance and adapting to other vehicles' behaviours. / Autonoma fordon förutspås spela en stor roll i framtiden med målen att förbättra effektivitet och säkerhet för vägtransporter. Men även om vi sett flera exempel av autonoma fordon ute på vägarna de senaste åren är frågan om hur säkerhet ska kunna garanteras ett utmanande problem. Det här examensarbetet har studerat denna fråga genom att utveckla ett ramverk för riskmedvetet beslutsfattande. Det autonoma fordonets dynamik och den oförutsägbara omgivningen modelleras med en partiellt observerbar Markov-beslutsprocess (POMDP från engelskans “Partially Observable Markov Decision Process”). Ett riskmått föreslås baserat på ett säkerhetsavstånd förkortat RSS (från engelskans “Responsibility-Sensitive Safety”) som kvantifierar det minsta avståndet till andra fordon för garanterad säkerhet. Riskmåttet integreras i POMDP-modellens belöningsfunktion för att åstadkomma riskmedvetna beteenden. Den föreslagna riskmedvetna POMDP-modellen utvärderas i två fallstudier. I ett scenario där det egna fordonet följer ett annat fordon på en enfilig väg visar vi att det egna fordonet kan undvika en kollision då det framförvarande fordonet bromsar till stillastående. I ett scenario där det egna fordonet ansluter till en huvudled från en ramp visar vi att detta görs med ett tillfredställande avstånd till andra fordon. Slutsatsen är att den riskmedvetna POMDP-modellen lyckas realisera en avvägning mellan säkerhet och användbarhet genom att hålla ett rimligt säkerhetsavstånd och anpassa sig till andra fordons beteenden.
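For reference, the sketch below computes the RSS minimum safe longitudinal distance as it is commonly stated in the RSS literature and turns it into a bounded risk term. The parameter values and the specific risk mapping are assumptions for illustration; the thesis's exact risk measure may differ.

```python
def rss_safe_longitudinal_distance(v_rear, v_front, rho=0.5,
                                   a_accel=2.0, b_min=4.0, b_max=8.0):
    """Commonly cited RSS minimum safe longitudinal distance: the rear vehicle may
    accelerate at most a_accel during the response time rho, then brakes with at
    least b_min, while the front vehicle may brake as hard as b_max.
    Speeds in m/s, accelerations in m/s^2; all values here are illustrative."""
    v_after_response = v_rear + rho * a_accel
    d = (v_rear * rho
         + 0.5 * a_accel * rho ** 2
         + v_after_response ** 2 / (2 * b_min)
         - v_front ** 2 / (2 * b_max))
    return max(d, 0.0)

def risk(actual_gap, v_rear, v_front):
    """One possible way to map the RSS distance to a bounded risk term for a POMDP
    reward function (an assumption for illustration, not the thesis's definition)."""
    d_min = rss_safe_longitudinal_distance(v_rear, v_front)
    return min(1.0, max(0.0, (d_min - actual_gap) / d_min)) if d_min > 0 else 0.0

print(rss_safe_longitudinal_distance(v_rear=20.0, v_front=15.0))
print(risk(actual_gap=25.0, v_rear=20.0, v_front=15.0))
```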
128

Geometry of Optimization in Markov Decision Processes and Neural Network-Based PDE Solvers

Müller, Johannes 07 June 2024 (has links)
This thesis is divided into two parts dealing with the optimization problems in Markov decision processes (MDPs) and different neural network-based numerical solvers for partial differential equations (PDEs). In Part I we analyze the optimization problem arising in (partially observable) Markov decision processes using tools from algebraic statistics and information geometry, which can be viewed as neighboring fields of applied algebra and differential geometry, respectively. Here, we focus on infinite horizon problems and memoryless stochastic policies. Markov decision processes provide a mathematical framework for sequential decision-making on which most current reinforcement learning algorithms are built. They formalize the task of optimally controlling the state of a system through appropriate actions. For fully observable problems, the action can be selected knowing the current state of the system. This case has been studied extensively and optimizing the action selection is known to be equivalent to solving a linear program over the (generalized) stationary distributions of the Markov decision process, which are also referred to as state-action frequencies. In Chapter 3, we study partially observable problems where an action must be chosen based solely on an observation of the current state, which might not fully reveal the underlying state. We characterize the feasible state-action frequencies of partially observable Markov decision processes by polynomial inequalities. In particular, the optimization problem in partially observable MDPs is described as a polynomially constrained linear objective program that generalizes the (dual) linear programming formulation of fully observable problems. We use this to study the combinatorial and algebraic complexity of this optimization problem and to upper bound the number of critical points over the individual boundary components of the feasible set. Furthermore, we show that our polynomial programming formulation can be used to effectively solve partially observable MDPs using interior point methods, numerical algebraic techniques, and convex relaxations. Gradient-based methods, including variants of natural gradient methods, have gained tremendous attention in the theoretical reinforcement learning community, where they are commonly referred to as (natural) policy gradient methods. In Chapter 4, we provide a unified treatment of a variety of natural policy gradient methods for fully observable problems by studying their state-action frequencies from the standpoint of information geometry. For a variety of NPGs and reward functions, we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Morimura and co-authors and Kakade by observing that these arise from the Hessian geometries of the entropy and conditional entropy, respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. We provide experimental evidence indicating that our predicted rates are essentially tight. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. 
This yields local quadratic convergence rates of these methods for step size equal to the inverse penalization strength, which recovers existing results as special cases. Part II addresses neural network-based PDE solvers that have recently experienced tremendous growth in popularity and attention in the scientific machine learning community. We focus on two approaches that represent the approximation of a solution of a PDE as the minimization over the parameters of a neural network: the deep Ritz method and physics-informed neural networks. In Chapter 5, we study the theoretical properties of the boundary penalty for these methods and obtain a uniform convergence result for the deep Ritz method for a large class of potentially nonlinear problems. For linear PDEs, we estimate the error of the deep Ritz method in terms of the optimization error, the approximation capabilities of the neural network, and the strength of the penalty. This reveals a trade-off in the choice of the penalization strength, where too little penalization allows large boundary values, and too strong penalization leads to a poor solution of the PDE inside the domain. For physics-informed networks, we show that when working with neural networks that have zero boundary values, the second derivatives of the solution are approximated as well, whereas otherwise only lower-order derivatives are approximated. In Chapter 6, we propose energy natural gradient descent, a natural gradient method with respect to second-order information in the function space, as an optimization algorithm for physics-informed neural networks and the deep Ritz method. We show that this method, which can be interpreted as a generalized Gauss-Newton method, mimics Newton’s method in function space except for an orthogonal projection onto the tangent space of the model. We show that for a variety of PDEs, natural energy gradients converge rapidly and approximations to the solution of the PDE are several orders of magnitude more accurate than gradient descent, Adam and Newton’s methods, even when these methods are given more computational time.
Contents:
Chapter 1. Introduction, 1; 1.1 Notation and conventions, 7
Part I. Geometry of Markov decision processes, 11
Chapter 2. Background on Markov decision processes, 12; 2.1 State-action frequencies, 19; 2.2 The advantage function and Bellman optimality, 23; 2.3 Rational structure of the reward and an explicit line theorem, 26; 2.4 Solution methods for Markov decision processes, 35
Chapter 3. State-action geometry of partially observable MDPs, 44; 3.1 The state-action polytope of fully observable systems, 45; 3.2 State-action geometry of partially observable systems, 54; 3.3 Number and location of critical points, 69; 3.4 Reward optimization in state-action space (ROSA), 83
Chapter 4. Geometry and convergence of natural policy gradient methods, 94; 4.1 Natural gradients, 96; 4.2 Natural policy gradient methods, 101; 4.3 Convergence of natural policy gradient flows, 107; 4.4 Locally quadratic convergence for regularized problems, 128; 4.5 Discussion and outlook, 131
Part II. Neural network-based PDE solvers, 133
Chapter 5. Theoretical analysis of the boundary penalty method for neural network-based PDE solvers, 134; 5.1 Presentation and discussion of the main results, 137; 5.2 Preliminaries regarding Sobolev spaces and neural networks, 146; 5.3 Proofs regarding uniform convergence for the deep Ritz method, 150; 5.4 Proofs of error estimates for the deep Ritz method, 156; 5.5 Proofs of implications of exact boundary values in residual minimization, 167
Chapter 6. Energy natural gradients for neural network-based PDE solvers, 174; 6.1 Energy natural gradients, 176; 6.2 Experiments, 183; 6.3 Conclusion and outlook, 192
Bibliography, 193
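Since natural-gradient-type updates appear in both parts of the thesis (natural policy gradients in Part I, energy natural gradients in Part II), a generic sketch of a damped natural gradient step may help fix ideas. The quadratic toy objective and the choice of Gram/Fisher matrix below are assumptions for illustration only.

```python
import numpy as np

def natural_gradient_step(theta, grad, gram, lr=0.5, damping=1e-6):
    """Generic natural gradient update: precondition the Euclidean gradient with
    (an estimate of) the metric/Gram/Fisher matrix G, i.e.
    theta_new = theta - lr * G^{-1} grad, computed by solving a linear system."""
    G = gram + damping * np.eye(len(theta))  # damping keeps the solve well-posed
    direction = np.linalg.solve(G, grad)
    return theta - lr * direction

# toy quadratic objective f(theta) = 0.5 * theta^T A theta; choosing G = A makes the
# natural gradient step coincide with Newton's step, so it converges in one iteration
A = np.array([[3.0, 0.5], [0.5, 1.0]])
theta = np.array([2.0, -1.5])
grad = A @ theta
theta = natural_gradient_step(theta, grad, gram=A, lr=1.0, damping=0.0)
print(theta)  # approximately [0, 0]
```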
129

Cognitive Networks: Foundations to Applications

Friend, Daniel 21 April 2009 (has links)
Fueled by the rapid advancement in digital and wireless technologies, the ever-increasing capabilities of wireless devices have placed upon us a tremendous challenge - how to put all of this capability to effective use. Individually, wireless devices have outpaced the ability of users to optimally configure them. Collectively, the complexity is far more daunting. Research in cognitive networks seeks to provide a solution to the difficulty of effectively using the expanding capabilities of wireless networks by embedding greater degrees of intelligence within the network itself. In this dissertation, we address some fundamental questions related to cognitive networks, such as "What is a cognitive network?" and "What methods may be used to design a cognitive network?" We relate cognitive networks to a common artificial intelligence (AI) framework, the multi-agent system (MAS). We also discuss the key elements of learning and reasoning, with the ability to learn being the primary differentiator for a cognitive network. Having discussed some of the fundamentals, we proceed to further illustrate the cognitive networking principle by applying it to two problems: multichannel topology control for dynamic spectrum access (DSA) and routing in a mobile ad hoc network (MANET). The multichannel topology control problem involves configuring secondary network parameters to minimize the probability that the secondary network will cause an outage to a primary user in the future. This requires the secondary network to estimate an outage potential map, essentially a spatial map of predicted primary user density, which must be learned using prior observations of spectral occupancy made by secondary nodes. Due to the complexity of the objective function, we provide a suboptimal heuristic and compare its performance against heuristics targeting power-based and interference-based topology control objectives. We also develop a genetic algorithm to provide reference solutions since obtaining optimal solutions is impractical. We show how our approach to this problem qualifies as a cognitive network. In presenting our second application, we address the role of network state observations in cognitive networking. Essentially, we need a way to quantify how much information is needed regarding the state of the network to achieve a desired level of performance. This question is applicable to networking in general, but becomes increasingly important in the cognitive network context because of the potential volume of information that may be desired for decision-making. In this case, the application is routing in MANETs. Current MANET routing protocols are largely adapted from routing algorithms developed for wired networks. Although optimal routing in wired networks is grounded in dynamic programming, the critical assumption that enables the use of dynamic programming for wired networks, namely static link costs and states, need not apply to MANETs. We present a link-level model of a MANET, which models the network as a stochastically varying graph that possesses the Markov property. We present the Markov decision process as the appropriate framework for computing optimal routing policies for such networks. We then proceed to analyze the relationship between optimal policy and link state information as a function of minimum distance from the forwarding node. The applications that we focus on are quite different, both in their models as well as their objectives.
This difference is intentional and significant because it disassociates the technology, i.e. cognitive networks, from the application of the technology. As a consequence, the versatility of the cognitive networks concept is demonstrated. Simultaneously, we are able to address two open problems and provide useful results, as well as a new perspective, on both multichannel topology control and MANET routing. This material is posted here with permission from the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Virginia Tech library's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this material, you agree to all provisions of the copyright laws protecting it. / Ph. D.
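The routing discussion above frames optimal forwarding as an MDP; as a point of reference, here is a minimal value-iteration sketch over a toy MDP with stochastic link states. The state and action sets, transition probabilities, and costs are invented for illustration and are not the dissertation's model.

```python
import numpy as np

# toy MDP: 4 states (e.g. node/link-state combinations, state 3 = destination),
# 2 actions (candidate next hops); P[a] is the transition matrix, C[a] the cost vector
P = {
    0: np.array([[0.1, 0.9, 0.0, 0.0],
                 [0.0, 0.2, 0.8, 0.0],
                 [0.0, 0.0, 0.3, 0.7],
                 [0.0, 0.0, 0.0, 1.0]]),
    1: np.array([[0.5, 0.0, 0.5, 0.0],
                 [0.0, 0.5, 0.0, 0.5],
                 [0.0, 0.0, 0.5, 0.5],
                 [0.0, 0.0, 0.0, 1.0]]),
}
C = {0: np.array([1.0, 1.0, 1.0, 0.0]), 1: np.array([2.0, 0.5, 0.5, 0.0])}

def value_iteration(gamma=0.95, tol=1e-8):
    """Compute the optimal expected discounted forwarding cost per state and the
    corresponding routing policy via the Bellman optimality backup."""
    v = np.zeros(4)
    while True:
        q = np.stack([C[a] + gamma * P[a] @ v for a in P])  # one row per action
        v_new = q.min(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmin(axis=0)
        v = v_new

values, policy = value_iteration()
print(values, policy)
```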
130

Parsimonious reasoning in reinforcement learning for better credit assignment

Ma, Michel 08 1900 (has links)
Le contenu de cette thèse explore la question de l’attribution de crédits à long terme dans l’apprentissage par renforcement du point de vue d’un biais inductif de parcimonie. Dans ce contexte, un agent parcimonieux cherche à comprendre son environnement en utilisant le moins de variables possible. Autrement dit, si l’agent est crédité ou blâmé pour un certain comportement, la parcimonie l’oblige à attribuer ce crédit (ou blâme) à seulement quelques variables latentes sélectionnées. Avant de proposer de nouvelles méthodes d’attribution parcimonieuse de crédits, nous présentons les travaux antérieurs relatifs à l’attribution de crédits à long terme en relation avec l’idée de sparsité. Ensuite, nous développons deux nouvelles idées pour l’attribution de crédits dans l’apprentissage par renforcement qui sont motivées par un raisonnement parcimonieux : une dans le cadre sans modèle et une pour l’apprentissage basé sur un modèle. Pour ce faire, nous nous appuyons sur divers concepts liés à la parcimonie issus de la causalité, de l’apprentissage supervisé et de la simulation, et nous les appliquons dans un cadre pour la prise de décision séquentielle. La première, appelée évaluation contrefactuelle de la politique, prend en compte les déviations mineures de ce qui aurait pu être compte tenu de ce qui a été. En restreignant l’espace dans lequel l’agent peut raisonner sur les alternatives, l’évaluation contrefactuelle de la politique présente des propriétés de variance favorables à l’évaluation des politiques. L’évaluation contrefactuelle de la politique offre également une nouvelle perspective sur la rétrospection, généralisant les travaux antérieurs sur l’attribution de crédits a posteriori. La deuxième contribution de cette thèse est un algorithme augmenté d’attention latente pour l’apprentissage par renforcement basé sur un modèle : Latent Sparse Attentive Value Gradients (LSAVG). En intégrant pleinement l’attention dans la structure d’optimisation de la politique, nous montrons que LSAVG est capable de résoudre des tâches de mémoire active que son homologue sans modèle a été conçu pour traiter, sans recourir à des heuristiques ou à un biais de l’estimateur original. / The content of this thesis explores the question of long-term credit assignment in reinforcement learning from the perspective of a parsimony inductive bias. In this context, a parsimonious agent looks to understand its environment through the least amount of variables possible. Alternatively, given some credit or blame for some behavior, parsimony forces the agent to assign this credit (or blame) to only a select few latent variables. Before proposing novel methods for parsimonious credit assignment, previous work relating to long-term credit assignment is introduced in relation to the idea of sparsity. Then, we develop two new ideas for credit assignment in reinforcement learning that are motivated by parsimonious reasoning: one in the model-free setting, and one for model-based learning. To do so, we build upon various parsimony-related concepts from causality, supervised learning, and simulation, and apply them to the Markov Decision Process framework. The first, called counterfactual policy evaluation, considers minor deviations of what could have been given what has been. By restricting the space in which the agent can reason about alternatives, counterfactual policy evaluation is shown to have favorable variance properties for policy evaluation.
Counterfactual policy evaluation also offers a new perspective on hindsight, generalizing previous work in hindsight credit assignment. The second contribution of this thesis is a latent attention augmented algorithm for model-based reinforcement learning: Latent Sparse Attentive Value Gradients (LSAVG). By fully integrating attention into the structure for policy optimization, we show that LSAVG is able to solve active memory tasks that its model-free counterpart was designed to tackle, without resorting to heuristics or biasing the original estimator.
