Global ETD Search

71	The reinforcement learning method : A feasible and sustainable control strategy for efficient occupant-centred building operation in smart cities May, Ross January 2019 (has links) Over half of the world’s population lives in urban areas, a trend which is expected to only grow as we move further into the future. With this increasing trend in urbanisation, challenges are presented in the form of the management of urban infrastructure systems. As an essential infrastructure of any city, the energy system presents itself as one of the biggest challenges. As cities expand in population and economically, global energy consumption increases and as a result so do greenhouse gas (GHG) emissions. To achieve the 2030 Agenda’s sustainable development goal on energy (SDG 7), renewable energy and energy efficiency have been shown as key strategies for attaining SDG 7. As the largest contributor to climate change, the building sector is responsible for more than half of the global final energy consumption and GHG emissions. As people spend most of their time indoors, the demand for energy is made worse as a result of maintaining the comfort level of the indoor environment. However, the emergence of the smart city and the internet of things (IoT) offers the opportunity for the smart management of buildings. Focusing on the latter strategy towards attaining SDG 7, intelligent building control offers significant potential for saving energy while respecting occupant comfort (OC). Most intelligent control strategies, however, rely on complex mathematical models which require a great deal of expertise to construct thereby costing in time and money. Furthermore, if these are inaccurate then energy is wasted and the comfort of occupants is decreased. Moreover, any change in the physical environment such as retrofits result in obsolete models which must be re-identified to match the new state of the environment. This model-based approach seems unsustainable and so a new model-free alternative is proposed. One such alternative is the reinforcement learning (RL) method. This method provides a beautiful solution to accomplishing the tradeoff between energy efficiency and OC within the smart city and more importantly to achieving SDG 7. To address the feasibility of RL as a sustainable control strategy for efficient occupant-centred building operation, a comprehensive review of RL for controlling OC in buildings as well as a case study implementing RL for improving OC via a window system are presented. The outcomes of each seem to suggest RL as a feasible solution, however, more work is required in the form of addressing current open issues such as cooperative multi-agent RL (MARL) needed for multi-occupant/multi-zonal buildings. Markov decision processes Reinforcement learning Control Building Indoor comfort Occupant Building Technologies Husbyggnad Computer and Information Sciences Data- och informationsvetenskap
72	Simulation Based Algorithms For Markov Decision Process And Stochastic Optimization Abdulla, Mohammed Shahid 05 1900 (has links) In Chapter 2, we propose several two-timescale simulation-based actor-critic algorithms for solution of infinite horizon Markov Decision Processes (MDPs) with finite state-space under the average cost criterion. On the slower timescale, all the algorithms perform a gradient search over corresponding policy spaces using two different Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates. On the faster timescale, the differential cost function corresponding to a given stationary policy is updated and averaged for enhanced performance. A proof of convergence to a locally optimal policy is presented. Next, a memory efficient implementation using a feature-vector representation of the state-space and TD (0) learning along the faster timescale is discussed. A three-timescale simulation based algorithm for solution of infinite horizon discounted-cost MDPs via the Value Iteration approach is also proposed. An approximation of the Dynamic Programming operator T is applied to the value function iterates. A sketch of convergence explaining the dynamics of the algorithm using associated ODEs is presented. Numerical experiments on rate based flow control on a bottleneck node using a continuous-time queueing model are presented using the proposed algorithms. Next, in Chapter 3, we develop three simulation-based algorithms for finite-horizon MDPs (FHMDPs). The first algorithm is developed for finite state and compact action spaces while the other two are for finite state and finite action spaces. Convergence analysis is briefly sketched. We then concentrate on methods to mitigate the curse of dimensionality that affects FH-MDPs severely, as there is one probability transition matrix per stage. Two parametrized actor-critic algorithms for FHMDPs with compact action sets are proposed, the ‘critic’ in both algorithms learning the policy gradient. We show w.p1convergence to a set with the necessary condition for constrained optima. Further, a third algorithm for stochastic control of stopping time processes is presented. Numerical experiments with the proposed finite-horizon algorithms are shown for a problem of flow control in communication networks. Towards stochastic optimization, in Chapter 4, we propose five algorithms which are variants of SPSA. The original one measurement SPSA uses an estimate of the gradient of objective function L containing an additional bias term not seen in two-measurement SPSA. We propose a one-measurement algorithm that eliminates this bias, and has asymptotic convergence properties making for easier comparison with the two-measurement SPSA. The algorithm, under certain conditions, outperforms both forms of SPSA with the only overhead being the storage of a single measurement. We also propose a similar algorithm that uses perturbations obtained from normalized Hadamard matrices. The convergence w.p.1 of both algorithms is established. We extend measurement reuse to design three second-order SPSA algorithms, sketch the convergence analysis and present simulation results on an illustrative minimization problem. We then propose several stochastic approximation implementations for related algorithms in flow-control of communication networks, beginning with a discrete-time implementation of Kelly’s primal flow-control algorithm. Convergence with probability1 is shown, even in the presence of communication delays and stochastic effects seen in link congestion indications. Two relevant enhancements are then pursued :a) an implementation of the primal algorithm using second-order information, and b) an implementation where edge-routers rectify misbehaving flows. Also, discrete-time implementations of Kelly’s dual algorithm and primal-dual algorithm are proposed. Simulation results a) verifying the proposed algorithms and, b) comparing stability properties with an algorithm in the literature are presented. Markov Processes - Data Processing Algorithms Simulation Markov Decision Processes (MDPs) Finite Horizon Markov Decision Processes Stochastic Approximation - Algorithms Network Flow-Control FH-MDP Algorithms Stochastic Optimization Reinforcement Learning Algorithms Computational Mathematics
73	Measuring and Influencing Sequential Joint Agent Behaviours Raffensperger, Peter Abraham January 2013 (has links) Algorithmically designed reward functions can influence groups of learning agents toward measurable desired sequential joint behaviours. Influencing learning agents toward desirable behaviours is non-trivial due to the difficulties of assigning credit for global success to the deserving agents and of inducing coordination. Quantifying joint behaviours lets us identify global success by ranking some behaviours as more desirable than others. We propose a real-valued metric for turn-taking, demonstrating how to measure one sequential joint behaviour. We describe how to identify the presence of turn-taking in simulation results and we calculate the quantity of turn-taking that could be observed between independent random agents. We demonstrate our turn-taking metric by reinterpreting previous work on turn-taking in emergent communication and by analysing a recorded human conversation. Given a metric, we can explore the space of reward functions and identify those reward functions that result in global success in groups of learning agents. We describe 'medium access games' as a model for human and machine communication and we present simulation results for an extensive range of reward functions for pairs of Q-learning agents. We use the Nash equilibria of medium access games to develop predictors for determining which reward functions result in turn-taking. Having demonstrated the predictive power of Nash equilibria for turn-taking in medium access games, we focus on synthesis of reward functions for stochastic games that result in arbitrary desirable Nash equilibria. Our method constructs a reward function such that a particular joint behaviour is the unique Nash equilibrium of a stochastic game, provided that such a reward function exists. This method builds on techniques for designing rewards for Markov decision processes and for normal form games. We explain our reward design methods in detail and formally prove that they are correct. multi-agent systems multi-agent reinforcement learning decentralised systems resource allocation turn-taking Nash equilibria emergent behaviour reward functions reward design Markov decision processes Markov chains
74	Reinforcement Learning for Parameter Control of Image-Based Applications Taylor, Graham January 2004 (has links) The significant amount of data contained in digital images present barriers to methods of learning from the information they hold. Noise and the subjectivity of image evaluation further complicate such automated processes. In this thesis, we examine a particular area in which these difficulties are experienced. We attempt to control the parameters of a multi-step algorithm that processes visual information. A framework for approaching the parameter selection problem using reinforcement learning agents is presented as the main contribution of this research. We focus on the generation of state and action space, as well as task-dependent reward. We first discuss the automatic determination of fuzzy membership functions as a specific case of the above problem. Entropy of a fuzzy event is used as a reinforcement signal. Membership functions representing brightness have been automatically generated for several images. The results show that the reinforcement learning approach is superior to an existing simulated annealing-based approach. The framework has also been evaluated by optimizing ten parameters of the text detection for semantic indexing algorithm proposed by Wolf et al. Image features are defined and extracted to construct the state space. Generalization to reduce the state space is performed with the fuzzy ARTMAP neural network, offering much faster learning than in the previous tabular implementation, despite a much larger state and action space. Difficulties in using a continuous action space are overcome by employing the DIRECT method for global optimization without derivatives. The chosen parameters are evaluated using metrics of recall and precision, and are shown to be superior to the parameters previously recommended. We further discuss the interplay between intermediate and terminal reinforcement. Systems Design reinforcement learning artificial neural networks image processing computer vision text detection artificial intelligence machine learning parameter control optimization Markov decision processes fuzzy ARTMAP
75	Hierarchical reinforcement learning for spoken dialogue systems Cuayáhuitl, Heriberto January 2009 (has links) This thesis focuses on the problem of scalable optimization of dialogue behaviour in speech-based conversational systems using reinforcement learning. Most previous investigations in dialogue strategy learning have proposed flat reinforcement learning methods, which are more suitable for small-scale spoken dialogue systems. This research formulates the problem in terms of Semi-Markov Decision Processes (SMDPs), and proposes two hierarchical reinforcement learning methods to optimize sub-dialogues rather than full dialogues. The first method uses a hierarchy of SMDPs, where every SMDP ignores irrelevant state variables and actions in order to optimize a sub-dialogue. The second method extends the first one by constraining every SMDP in the hierarchy with prior expert knowledge. The latter method proposes a learning algorithm called 'HAM+HSMQ-Learning', which combines two existing algorithms in the literature of hierarchical reinforcement learning. Whilst the first method generates fully-learnt behaviour, the second one generates semi-learnt behaviour. In addition, this research proposes a heuristic dialogue simulation environment for automatic dialogue strategy learning. Experiments were performed on simulated and real environments based on a travel planning spoken dialogue system. Experimental results provided evidence to support the following claims: First, both methods scale well at the cost of near-optimal solutions, resulting in slightly longer dialogues than the optimal solutions. Second, dialogue strategies learnt with coherent user behaviour and conservative recognition error rates can outperform a reasonable hand-coded strategy. Third, semi-learnt dialogue behaviours are a better alternative (because of their higher overall performance) than hand-coded or fully-learnt dialogue behaviours. Last, hierarchical reinforcement learning dialogue agents are feasible and promising for the (semi) automatic design of adaptive behaviours in larger-scale spoken dialogue systems. This research makes the following contributions to spoken dialogue systems which learn their dialogue behaviour. First, the Semi-Markov Decision Process (SMDP) model was proposed to learn spoken dialogue strategies in a scalable way. Second, the concept of 'partially specified dialogue strategies' was proposed for integrating simultaneously hand-coded and learnt spoken dialogue behaviours into a single learning framework. Third, an evaluation with real users of hierarchical reinforcement learning dialogue agents was essential to validate their effectiveness in a realistic environment. 020
76	Algoritmos assíncronos de iteração de política para Processos de Decisão Markovianos com Probabilidades Intervalares / Asynchronous policy iteration algorithms for Bounded-parameter Markov Decision Processes Reis, Willy Arthur Silva 02 August 2019 (has links) Um Processo de Decisão Markoviano (MDP) pode ser usado para modelar problemas de decisão sequencial. No entanto, podem existir limitações na obtenção de probabilidades para modelagem da transição de estados ou falta de confiabilidade nas informações existentes sobre estas probabilidades. Um modelo menos restritivo e que pode resolver este problema é o Processo de Decisão Markoviano com Probabilidades Intervalares (BMDP), que permite a representação imprecisa das probabilidades de transição de estados e raciocínio sobre uma solução robusta. Para resolver BMDPs de horizonte infinito, existem os algoritmos síncronos de Iteração de Valor Intervalar e Iteração de Política Robusto, que são ineficientes quando o tamanho do espaço de estados é grande. Neste trabalho são propostos algoritmos assíncronos de Iteração de Política baseados no particionamento do espaço de estados em subconjuntos aleatórios (Robust Asynchronous Policy Iteration - RAPI) ou em componentes fortemente conexos (Robust Topological Policy Iteration - RTPI). Também são propostas formas de inicializar a função valor e a política dos algoritmos, de forma a melhorar a convergência destes. O desempenho dos algoritmos propostos é avaliado em comparação com o algoritmo de Iteração de Política Robusto para BMDPs para domínios de planejamento existentes e um novo domínio proposto. Os resultados dos experimentos realizados mostram que (i) quanto mais estruturado é o domínio, melhor é o desempenho do algoritmo RTPI; (ii) o uso de computação paralela no algoritmo RAPI possui um pequeno ganho computacional em relação à sua versão sequencial; e (iii) uma boa inicialização da função valor e política pode impactar positivamente o tempo de convergência dos algoritmos. / A Markov Decision Process (MDP) can be used to model sequential decision problems. However, there may be limitations in obtaining probabilities for state transition modeling or lack of reliability in existing information on these probabilities. A less restrictive model that can solve this problem is the Bounded-parameter Markov Decision Process (BMDP), which allows the imprecise representation of the transition probabilities and reasoning about a robust solution. To solve infinite horizon BMDPs, there are synchronous algorithms such as Interval Value Iteration and Robust Policy Iteration, which are inefficient for large state spaces. In this work, we propose new asynchronous Policy Iteration algorithms based on state space partitioning in random subsets (Robust Asynchronous Policy Iteration - RAPI) or in strongly connected components (Robust Topological Policy Iteration - RTPI). We also propose ways to initialize the value function and policy of the algorithms, in order to improve their convergence. The performance of the proposed algorithms is evaluated in comparison with the Robust Policy Iteration algorithm for BMDPs for existing planning domains and a proposed new domain. The results of the experiments show that (i) the more structured the domain, the better is the performance of the RTPI algorithm; (ii) the use of parallel computing in the RAPI algorithm has a small computational gain compared to its sequential version; and (iii) a good initialization of the value function and policy can positively impact the convergence time of the algorithms. Asynchronous policy iteration Iteração de política assíncrono Planejamento probabilístico Probabilistic planning
77	Policy Explanation and Model Refinement in Decision-Theoretic Planning Khan, Omar Zia January 2013 (has links) Decision-theoretic systems, such as Markov Decision Processes (MDPs), are used for sequential decision-making under uncertainty. MDPs provide a generic framework that can be applied in various domains to compute optimal policies. This thesis presents techniques that offer explanations of optimal policies for MDPs and then refine decision theoretic models (Bayesian networks and MDPs) based on feedback from experts. Explaining policies for sequential decision-making problems is difficult due to the presence of stochastic effects, multiple possibly competing objectives and long-range effects of actions. However, explanations are needed to assist experts in validating that the policy is correct and to help users in developing trust in the choices recommended by the policy. A set of domain-independent templates to justify a policy recommendation is presented along with a process to identify the minimum possible number of templates that need to be populated to completely justify the policy. The rejection of an explanation by a domain expert indicates a deficiency in the model which led to the generation of the rejected policy. Techniques to refine the model parameters such that the optimal policy calculated using the refined parameters would conform with the expert feedback are presented in this thesis. The expert feedback is translated into constraints on the model parameters that are used during refinement. These constraints are non-convex for both Bayesian networks and MDPs. For Bayesian networks, the refinement approach is based on Gibbs sampling and stochastic hill climbing, and it learns a model that obeys expert constraints. For MDPs, the parameter space is partitioned such that alternating linear optimization can be applied to learn model parameters that lead to a policy in accordance with expert feedback. In practice, the state space of MDPs can often be very large, which can be an issue for real-world problems. Factored MDPs are often used to deal with this issue. In Factored MDPs, state variables represent the state space and dynamic Bayesian networks model the transition functions. This helps to avoid the exponential growth in the state space associated with large and complex problems. The approaches for explanation and refinement presented in this thesis are also extended for the factored case to demonstrate their use in real-world applications. The domains of course advising to undergraduate students, assisted hand-washing for people with dementia and diagnostics for manufacturing are used to present empirical evaluations. Decision-Theoretic Planning Markov Decision Processes Sequential Decision Making Reasoning under Uncertainty Bayesian Networks Policy Explanation Parameter Learning Model Refinement Computer Science
78	Reinforcement Learning for Parameter Control of Image-Based Applications Taylor, Graham January 2004 (has links) The significant amount of data contained in digital images present barriers to methods of learning from the information they hold. Noise and the subjectivity of image evaluation further complicate such automated processes. In this thesis, we examine a particular area in which these difficulties are experienced. We attempt to control the parameters of a multi-step algorithm that processes visual information. A framework for approaching the parameter selection problem using reinforcement learning agents is presented as the main contribution of this research. We focus on the generation of state and action space, as well as task-dependent reward. We first discuss the automatic determination of fuzzy membership functions as a specific case of the above problem. Entropy of a fuzzy event is used as a reinforcement signal. Membership functions representing brightness have been automatically generated for several images. The results show that the reinforcement learning approach is superior to an existing simulated annealing-based approach. The framework has also been evaluated by optimizing ten parameters of the text detection for semantic indexing algorithm proposed by Wolf et al. Image features are defined and extracted to construct the state space. Generalization to reduce the state space is performed with the fuzzy ARTMAP neural network, offering much faster learning than in the previous tabular implementation, despite a much larger state and action space. Difficulties in using a continuous action space are overcome by employing the DIRECT method for global optimization without derivatives. The chosen parameters are evaluated using metrics of recall and precision, and are shown to be superior to the parameters previously recommended. We further discuss the interplay between intermediate and terminal reinforcement. Systems Design reinforcement learning artificial neural networks image processing computer vision text detection artificial intelligence machine learning parameter control optimization Markov decision processes fuzzy ARTMAP
79	Optimal Streaming Of Rate Adaptable Video Gurses, Eren 01 June 2006 (has links) (PDF) In this study, we study the dynamics of network adaptive video streaming and propose novel algorithms for rate distortion control in video streaming. While doing so, we maintain inter-protocol fairness with TCP (Transmission Control Protocol) that is the dominant transport protocol in the current Internet. The proposed algorithms are retransmission-based and necessitate the use of playback buffers in order to tolerate the extra latency introduced by retransmissions. In the first part, we propose a practical network-adaptive streaming scheme based on TCP transport and the idea of Selective Frame Discarding (SFD) that makes use of two-layer temporally scalable video. The efficacy of the SFD scheme is validated for playout buffer times in the order of seconds and therefore makes it suitable more for delay tolerant streaming applications. In the second part of the thesis, we propose an application layer rate-distortion control algorithm which provides Optimal Scheduling and Rate Control (OSRC) policies in the average reward sense in order to achieve efficient streaming of video. The Optimal Scheduling (OS) we propose maximizes the probability of successfully on time delivery according to a prespecified set of rate constraints, and different channel conditions by using Markov Decision Process (MDP) models. On the other hand optimal rate control (RC) is achieved by calculating the optimal rate constraint which minimizes the average distortion of a video streaming session by making use of the video distortion model derived for lossy channels and achievable success probabilities provided by the set of optimal schedules. For numerical examples, we focus on an equation-based TCP friendly rate control (TFRC) protocol where transport layer retransmissions are disabled and Fine Granular Scalable (FGS) coded video is used for improved rate adaptation capabilities but with an additional rate distortion penalty. The efficacy of the proposed OSRC algorithm is demonstrated by means of both analytical results and ns-2 simulations.
80	Sensor-based prognostics and structured maintenance policies for components with complex degradation Elwany, Alaa H. 23 September 2009 (has links) We propose a mathematical framework that integrates low-level sensory signals from monitoring engineering systems and their components with high-level decision models for maintenance optimization. Our objective is to derive optimal adaptive maintenance strategies that capitalize on condition monitoring information to update maintenance actions based upon the current state of health of the system. We refer to this sensor-based decision methodology as "sense-and-respond logistics". As a first step, we develop and extend degradation models to compute and periodically update the remaining life distribution of fielded components using in situ degradation signals. Next, we integrate these sensory updated remaining life distributions with maintenance decision models to; (1) determine, in real-time, the optimal time to replace a component such that the lost opportunity costs due to early replacements are minimized and system utilization is increased, and (2) sequentially determine the optimal time to order a spare part such that inventory holding costs are minimized while preventing stock outs. Lastly, we integrate the proposed degradation model with Markov process models to derive structured replacement and spare parts ordering policies. In particular, we show that the optimal maintenance policy for our problem setting is a monotonically non-decreasing control limit type policy. We validate our methodology using real-world data from monitoring a piece of rotating machinery using vibration accelerometers. We also demonstrate that the proposed sense-and-respond decision methodology results in better decisions and reduced costs compared to other traditional approaches. Replacement models Spare parts inventory Maintenance policies Markov decision processes Reliability Degradation models Machinery Maintenance and repair Plant performance Monitoring Reliability (Engineering)

Search results