• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 46
  • 17
  • 15
  • 5
  • 4
  • 1
  • Tagged with
  • 111
  • 111
  • 111
  • 37
  • 28
  • 21
  • 19
  • 19
  • 18
  • 17
  • 17
  • 15
  • 14
  • 14
  • 13
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
71

Simulation Based Algorithms For Markov Decision Process And Stochastic Optimization

Abdulla, Mohammed Shahid 05 1900 (has links)
In Chapter 2, we propose several two-timescale simulation-based actor-critic algorithms for solution of infinite horizon Markov Decision Processes (MDPs) with finite state-space under the average cost criterion. On the slower timescale, all the algorithms perform a gradient search over corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and averaged for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, a memory efficient implementation using a feature-vector representation of the state-space and TD (0) learning along the faster timescale is discussed. A three-timescale simulation based algorithm for solution of infinite horizon discounted-cost MDPs via the Value Iteration approach is also proposed. An approximation of the Dynamic Programming operator T is applied to the value function iterates. A sketch of convergence explaining the dynamics of the algorithm using associated ODEs is presented. Numerical experiments on rate based flow control on a bottleneck node using a continuous-time queueing model are presented using the proposed algorithms. Next, in Chapter 3, we develop three simulation-based algorithms for finite-horizon MDPs (FHMDPs). The first algorithm is developed for finite state and compact action spaces while the other two are for finite state and finite action spaces. Convergence analysis is briefly sketched. We then concentrate on methods to mitigate the curse of dimensionality that affects FH-MDPs severely, as there is one probability transition matrix per stage. Two parametrized actor-critic algorithms for FHMDPs with compact action sets are proposed, the ‘critic’ in both algorithms learning the policy gradient. We show w.p1convergence to a set with the necessary condition for constrained optima. Further, a third algorithm for stochastic control of stopping time processes is presented. Numerical experiments with the proposed finite-horizon algorithms are shown for a problem of flow control in communication networks. Towards stochastic optimization, in Chapter 4, we propose five algorithms which are variants of SPSA. The original one measurement SPSA uses an estimate of the gradient of objective function L containing an additional bias term not seen in two-measurement SPSA. We propose a one-measurement algorithm that eliminates this bias, and has asymptotic convergence properties making for easier comparison with the two-measurement SPSA. The algorithm, under certain conditions, outperforms both forms of SPSA with the only overhead being the storage of a single measurement. We also propose a similar algorithm that uses perturbations obtained from normalized Hadamard matrices. The convergence w.p.1 of both algorithms is established. We extend measurement reuse to design three second-order SPSA algorithms, sketch the convergence analysis and present simulation results on an illustrative minimization problem. We then propose several stochastic approximation implementations for related algorithms in flow-control of communication networks, beginning with a discrete-time implementation of Kelly’s primal flow-control algorithm. Convergence with probability1 is shown, even in the presence of communication delays and stochastic effects seen in link congestion indications. Two relevant enhancements are then pursued :a) an implementation of the primal algorithm using second-order information, and b) an implementation where edge-routers rectify misbehaving flows. Also, discrete-time implementations of Kelly’s dual algorithm and primal-dual algorithm are proposed. Simulation results a) verifying the proposed algorithms and, b) comparing stability properties with an algorithm in the literature are presented.
72

Measuring and Influencing Sequential Joint Agent Behaviours

Raffensperger, Peter Abraham January 2013 (has links)
Algorithmically designed reward functions can influence groups of learning agents toward measurable desired sequential joint behaviours. Influencing learning agents toward desirable behaviours is non-trivial due to the difficulties of assigning credit for global success to the deserving agents and of inducing coordination. Quantifying joint behaviours lets us identify global success by ranking some behaviours as more desirable than others. We propose a real-valued metric for turn-taking, demonstrating how to measure one sequential joint behaviour. We describe how to identify the presence of turn-taking in simulation results and we calculate the quantity of turn-taking that could be observed between independent random agents. We demonstrate our turn-taking metric by reinterpreting previous work on turn-taking in emergent communication and by analysing a recorded human conversation. Given a metric, we can explore the space of reward functions and identify those reward functions that result in global success in groups of learning agents. We describe 'medium access games' as a model for human and machine communication and we present simulation results for an extensive range of reward functions for pairs of Q-learning agents. We use the Nash equilibria of medium access games to develop predictors for determining which reward functions result in turn-taking. Having demonstrated the predictive power of Nash equilibria for turn-taking in medium access games, we focus on synthesis of reward functions for stochastic games that result in arbitrary desirable Nash equilibria. Our method constructs a reward function such that a particular joint behaviour is the unique Nash equilibrium of a stochastic game, provided that such a reward function exists. This method builds on techniques for designing rewards for Markov decision processes and for normal form games. We explain our reward design methods in detail and formally prove that they are correct.
73

Reinforcement Learning for Parameter Control of Image-Based Applications

Taylor, Graham January 2004 (has links)
The significant amount of data contained in digital images present barriers to methods of learning from the information they hold. Noise and the subjectivity of image evaluation further complicate such automated processes. In this thesis, we examine a particular area in which these difficulties are experienced. We attempt to control the parameters of a multi-step algorithm that processes visual information. A framework for approaching the parameter selection problem using reinforcement learning agents is presented as the main contribution of this research. We focus on the generation of state and action space, as well as task-dependent reward. We first discuss the automatic determination of fuzzy membership functions as a specific case of the above problem. Entropy of a fuzzy event is used as a reinforcement signal. Membership functions representing brightness have been automatically generated for several images. The results show that the reinforcement learning approach is superior to an existing simulated annealing-based approach. The framework has also been evaluated by optimizing ten parameters of the text detection for semantic indexing algorithm proposed by Wolf et al. Image features are defined and extracted to construct the state space. Generalization to reduce the state space is performed with the fuzzy ARTMAP neural network, offering much faster learning than in the previous tabular implementation, despite a much larger state and action space. Difficulties in using a continuous action space are overcome by employing the DIRECT method for global optimization without derivatives. The chosen parameters are evaluated using metrics of recall and precision, and are shown to be superior to the parameters previously recommended. We further discuss the interplay between intermediate and terminal reinforcement.
74

Hierarchical reinforcement learning for spoken dialogue systems

Cuayáhuitl, Heriberto January 2009 (has links)
This thesis focuses on the problem of scalable optimization of dialogue behaviour in speech-based conversational systems using reinforcement learning. Most previous investigations in dialogue strategy learning have proposed flat reinforcement learning methods, which are more suitable for small-scale spoken dialogue systems. This research formulates the problem in terms of Semi-Markov Decision Processes (SMDPs), and proposes two hierarchical reinforcement learning methods to optimize sub-dialogues rather than full dialogues. The first method uses a hierarchy of SMDPs, where every SMDP ignores irrelevant state variables and actions in order to optimize a sub-dialogue. The second method extends the first one by constraining every SMDP in the hierarchy with prior expert knowledge. The latter method proposes a learning algorithm called 'HAM+HSMQ-Learning', which combines two existing algorithms in the literature of hierarchical reinforcement learning. Whilst the first method generates fully-learnt behaviour, the second one generates semi-learnt behaviour. In addition, this research proposes a heuristic dialogue simulation environment for automatic dialogue strategy learning. Experiments were performed on simulated and real environments based on a travel planning spoken dialogue system. Experimental results provided evidence to support the following claims: First, both methods scale well at the cost of near-optimal solutions, resulting in slightly longer dialogues than the optimal solutions. Second, dialogue strategies learnt with coherent user behaviour and conservative recognition error rates can outperform a reasonable hand-coded strategy. Third, semi-learnt dialogue behaviours are a better alternative (because of their higher overall performance) than hand-coded or fully-learnt dialogue behaviours. Last, hierarchical reinforcement learning dialogue agents are feasible and promising for the (semi) automatic design of adaptive behaviours in larger-scale spoken dialogue systems. This research makes the following contributions to spoken dialogue systems which learn their dialogue behaviour. First, the Semi-Markov Decision Process (SMDP) model was proposed to learn spoken dialogue strategies in a scalable way. Second, the concept of 'partially specified dialogue strategies' was proposed for integrating simultaneously hand-coded and learnt spoken dialogue behaviours into a single learning framework. Third, an evaluation with real users of hierarchical reinforcement learning dialogue agents was essential to validate their effectiveness in a realistic environment.
75

Algoritmos assíncronos de iteração de política para Processos de Decisão Markovianos com Probabilidades Intervalares / Asynchronous policy iteration algorithms for Bounded-parameter Markov Decision Processes

Reis, Willy Arthur Silva 02 August 2019 (has links)
Um Processo de Decisão Markoviano (MDP) pode ser usado para modelar problemas de decisão sequencial. No entanto, podem existir limitações na obtenção de probabilidades para modelagem da transição de estados ou falta de confiabilidade nas informações existentes sobre estas probabilidades. Um modelo menos restritivo e que pode resolver este problema é o Processo de Decisão Markoviano com Probabilidades Intervalares (BMDP), que permite a representação imprecisa das probabilidades de transição de estados e raciocínio sobre uma solução robusta. Para resolver BMDPs de horizonte infinito, existem os algoritmos síncronos de Iteração de Valor Intervalar e Iteração de Política Robusto, que são ineficientes quando o tamanho do espaço de estados é grande. Neste trabalho são propostos algoritmos assíncronos de Iteração de Política baseados no particionamento do espaço de estados em subconjuntos aleatórios (Robust Asynchronous Policy Iteration - RAPI) ou em componentes fortemente conexos (Robust Topological Policy Iteration - RTPI). Também são propostas formas de inicializar a função valor e a política dos algoritmos, de forma a melhorar a convergência destes. O desempenho dos algoritmos propostos é avaliado em comparação com o algoritmo de Iteração de Política Robusto para BMDPs para domínios de planejamento existentes e um novo domínio proposto. Os resultados dos experimentos realizados mostram que (i) quanto mais estruturado é o domínio, melhor é o desempenho do algoritmo RTPI; (ii) o uso de computação paralela no algoritmo RAPI possui um pequeno ganho computacional em relação à sua versão sequencial; e (iii) uma boa inicialização da função valor e política pode impactar positivamente o tempo de convergência dos algoritmos. / A Markov Decision Process (MDP) can be used to model sequential decision problems. However, there may be limitations in obtaining probabilities for state transition modeling or lack of reliability in existing information on these probabilities. A less restrictive model that can solve this problem is the Bounded-parameter Markov Decision Process (BMDP), which allows the imprecise representation of the transition probabilities and reasoning about a robust solution. To solve infinite horizon BMDPs, there are synchronous algorithms such as Interval Value Iteration and Robust Policy Iteration, which are inefficient for large state spaces. In this work, we propose new asynchronous Policy Iteration algorithms based on state space partitioning in random subsets (Robust Asynchronous Policy Iteration - RAPI) or in strongly connected components (Robust Topological Policy Iteration - RTPI). We also propose ways to initialize the value function and policy of the algorithms, in order to improve their convergence. The performance of the proposed algorithms is evaluated in comparison with the Robust Policy Iteration algorithm for BMDPs for existing planning domains and a proposed new domain. The results of the experiments show that (i) the more structured the domain, the better is the performance of the RTPI algorithm; (ii) the use of parallel computing in the RAPI algorithm has a small computational gain compared to its sequential version; and (iii) a good initialization of the value function and policy can positively impact the convergence time of the algorithms.
76

Policy Explanation and Model Refinement in Decision-Theoretic Planning

Khan, Omar Zia January 2013 (has links)
Decision-theoretic systems, such as Markov Decision Processes (MDPs), are used for sequential decision-making under uncertainty. MDPs provide a generic framework that can be applied in various domains to compute optimal policies. This thesis presents techniques that offer explanations of optimal policies for MDPs and then refine decision theoretic models (Bayesian networks and MDPs) based on feedback from experts. Explaining policies for sequential decision-making problems is difficult due to the presence of stochastic effects, multiple possibly competing objectives and long-range effects of actions. However, explanations are needed to assist experts in validating that the policy is correct and to help users in developing trust in the choices recommended by the policy. A set of domain-independent templates to justify a policy recommendation is presented along with a process to identify the minimum possible number of templates that need to be populated to completely justify the policy. The rejection of an explanation by a domain expert indicates a deficiency in the model which led to the generation of the rejected policy. Techniques to refine the model parameters such that the optimal policy calculated using the refined parameters would conform with the expert feedback are presented in this thesis. The expert feedback is translated into constraints on the model parameters that are used during refinement. These constraints are non-convex for both Bayesian networks and MDPs. For Bayesian networks, the refinement approach is based on Gibbs sampling and stochastic hill climbing, and it learns a model that obeys expert constraints. For MDPs, the parameter space is partitioned such that alternating linear optimization can be applied to learn model parameters that lead to a policy in accordance with expert feedback. In practice, the state space of MDPs can often be very large, which can be an issue for real-world problems. Factored MDPs are often used to deal with this issue. In Factored MDPs, state variables represent the state space and dynamic Bayesian networks model the transition functions. This helps to avoid the exponential growth in the state space associated with large and complex problems. The approaches for explanation and refinement presented in this thesis are also extended for the factored case to demonstrate their use in real-world applications. The domains of course advising to undergraduate students, assisted hand-washing for people with dementia and diagnostics for manufacturing are used to present empirical evaluations.
77

Reinforcement Learning for Parameter Control of Image-Based Applications

Taylor, Graham January 2004 (has links)
The significant amount of data contained in digital images present barriers to methods of learning from the information they hold. Noise and the subjectivity of image evaluation further complicate such automated processes. In this thesis, we examine a particular area in which these difficulties are experienced. We attempt to control the parameters of a multi-step algorithm that processes visual information. A framework for approaching the parameter selection problem using reinforcement learning agents is presented as the main contribution of this research. We focus on the generation of state and action space, as well as task-dependent reward. We first discuss the automatic determination of fuzzy membership functions as a specific case of the above problem. Entropy of a fuzzy event is used as a reinforcement signal. Membership functions representing brightness have been automatically generated for several images. The results show that the reinforcement learning approach is superior to an existing simulated annealing-based approach. The framework has also been evaluated by optimizing ten parameters of the text detection for semantic indexing algorithm proposed by Wolf et al. Image features are defined and extracted to construct the state space. Generalization to reduce the state space is performed with the fuzzy ARTMAP neural network, offering much faster learning than in the previous tabular implementation, despite a much larger state and action space. Difficulties in using a continuous action space are overcome by employing the DIRECT method for global optimization without derivatives. The chosen parameters are evaluated using metrics of recall and precision, and are shown to be superior to the parameters previously recommended. We further discuss the interplay between intermediate and terminal reinforcement.
78

Optimal Streaming Of Rate Adaptable Video

Gurses, Eren 01 June 2006 (has links) (PDF)
In this study, we study the dynamics of network adaptive video streaming and propose novel algorithms for rate distortion control in video streaming. While doing so, we maintain inter-protocol fairness with TCP (Transmission Control Protocol) that is the dominant transport protocol in the current Internet. The proposed algorithms are retransmission-based and necessitate the use of playback buffers in order to tolerate the extra latency introduced by retransmissions. In the first part, we propose a practical network-adaptive streaming scheme based on TCP transport and the idea of Selective Frame Discarding (SFD) that makes use of two-layer temporally scalable video. The efficacy of the SFD scheme is validated for playout buffer times in the order of seconds and therefore makes it suitable more for delay tolerant streaming applications. In the second part of the thesis, we propose an application layer rate-distortion control algorithm which provides Optimal Scheduling and Rate Control (OSRC) policies in the average reward sense in order to achieve efficient streaming of video. The Optimal Scheduling (OS) we propose maximizes the probability of successfully on time delivery according to a prespecified set of rate constraints, and different channel conditions by using Markov Decision Process (MDP) models. On the other hand optimal rate control (RC) is achieved by calculating the optimal rate constraint which minimizes the average distortion of a video streaming session by making use of the video distortion model derived for lossy channels and achievable success probabilities provided by the set of optimal schedules. For numerical examples, we focus on an equation-based TCP friendly rate control (TFRC) protocol where transport layer retransmissions are disabled and Fine Granular Scalable (FGS) coded video is used for improved rate adaptation capabilities but with an additional rate distortion penalty. The efficacy of the proposed OSRC algorithm is demonstrated by means of both analytical results and ns-2 simulations.
79

Sensor-based prognostics and structured maintenance policies for components with complex degradation

Elwany, Alaa H. 23 September 2009 (has links)
We propose a mathematical framework that integrates low-level sensory signals from monitoring engineering systems and their components with high-level decision models for maintenance optimization. Our objective is to derive optimal adaptive maintenance strategies that capitalize on condition monitoring information to update maintenance actions based upon the current state of health of the system. We refer to this sensor-based decision methodology as "sense-and-respond logistics". As a first step, we develop and extend degradation models to compute and periodically update the remaining life distribution of fielded components using in situ degradation signals. Next, we integrate these sensory updated remaining life distributions with maintenance decision models to; (1) determine, in real-time, the optimal time to replace a component such that the lost opportunity costs due to early replacements are minimized and system utilization is increased, and (2) sequentially determine the optimal time to order a spare part such that inventory holding costs are minimized while preventing stock outs. Lastly, we integrate the proposed degradation model with Markov process models to derive structured replacement and spare parts ordering policies. In particular, we show that the optimal maintenance policy for our problem setting is a monotonically non-decreasing control limit type policy. We validate our methodology using real-world data from monitoring a piece of rotating machinery using vibration accelerometers. We also demonstrate that the proposed sense-and-respond decision methodology results in better decisions and reduced costs compared to other traditional approaches.
80

Value of information and supply uncertainty in supply chains

Cheong, Tae Su 16 August 2011 (has links)
This dissertation focuses on topics related to the value of real-time information and/or to supply uncertainties due to uncertain lead-times and yields in supply chains. The first two of these topics address issues associated with freight transportation, while the remaining two topics are concerned with inventory replenishment. We first assess the value of dynamic tour determination for the traveling salesman problem (TSP). Given a network with traffic dynamics that can be modeled as a Markov chain, we present a policy determination procedure that optimally builds a tour dynamically. We then explore the potential for expected total travel cost reduction due to dynamic tour determination, relative to two a priori tour determination procedures. Second, we consider the situation where the decision to continue or abort transporting perishable freight from an origin to a destination can be made at intermediate locations, based on real-time freight status monitoring. We model the problem as a partially observed Markov decision process (POMDP) and develop an efficient procedure for determining an optimal policy. We determine structural characteristics of an optimal policy and upper and lower bounds on the optimal reward function. Third, we analyze a periodic review inventory control problem with lost sales and random yields and present conditions that guarantee the existence of an optimal policy having a so-called staircase structure. We make use of this structure to accelerate both value iteration and policy evaluation. Lastly, we examine a model of inventory replenishment where both lead time and supply qualities are uncertain. We model this problem as an MDP and show that the weighted sum of inventory in transit and inventory at the destination is a sufficient statistic, assuming that random shrinkage can occur from the origin to the supply system or destination, shrinkage is deterministic within the supply system and from the supply system to the destination, and no shrinkage occurs once goods reach the destination.

Page generated in 0.0823 seconds