Global ETD Search

151	Strojové učení ve strategických hrách / Machine Learning in Strategic Games Vlček, Michael January 2018 Machine learning is spearheading progress in artificial intelligence toward competing with human opponents in strategy games such as chess, Go, and poker. The branch of machine learning that shows the most promising results in playing strategy games is reinforcement learning. The next milestone for current research is the computer game StarCraft II, which surpasses the previous games in complexity and represents a potential new breakthrough in this field. The paper analyses the problem and proposes a solution that combines the reinforcement learning algorithm A2C with the hyperparameter optimization method PBT, which could be a step forward from the current state of the art.
152	Quality of Service Aware Mechanisms for (Re)Configuring Data Stream Processing Applications on Highly Distributed Infrastructure / Mécanismes prenant en compte la qualité de service pour la (re)configuration d’applications de traitement de flux de données sur une infrastructure hautement distribuée Da Silva Veith, Alexandre 23 September 2019 (has links) Une grande partie de ces données volumineuses ont plus de valeur lorsqu'elles sont analysées rapidement, au fur et à mesure de leur génération. Dans plusieurs scénarios d'application émergents, tels que les villes intelligentes, la surveillance opérationnelle de grandes infrastructures et l'Internet des Objets (Internet of Things), des flux continus de données doivent être traités dans des délais très brefs. Dans plusieurs domaines, ce traitement est nécessaire pour détecter des modèles, identifier des défaillances et pour guider la prise de décision. Les données sont donc souvent rassemblées et analysées par des environnements logiciels conçus pour le traitement de flux continus de données. Ces environnements logiciels pour le traitement de flux de données déploient les applications sous-la forme d'un graphe orienté ou de dataflow. Un dataflow contient une ou plusieurs sources (i.e. capteurs, passerelles ou actionneurs); opérateurs qui effectuent des transformations sur les données (e.g., filtrage et agrégation); et des sinks (i.e., éviers qui consomment les requêtes ou stockent les données). Nous proposons dans cette thèse un ensemble de stratégies pour placer les opérateurs dans une infrastructure massivement distribuée cloud-edge en tenant compte des caractéristiques des ressources et des exigences des applications. En particulier, nous décomposons tout d'abord le graphe d'application en identifiant quelques comportements tels que des forks et des joints, puis nous le plaçons dynamiquement sur l'infrastructure. 
Des simulations et un prototype prenant en compte plusieurs paramètres d'application démontrent que notre approche peut réduire la latence de bout en bout de plus de 50% et aussi améliorer d'autres métriques de qualité de service. L'espace de recherche de solutions pour la reconfiguration des opérateurs peut être énorme en fonction du nombre d'opérateurs, de flux, de ressources et de liens réseau. De plus, il est important de minimiser le coût de la migration tout en améliorant la latence. Dans des travaux antérieurs, Reinforcement Learning (RL) et Monte-Carlo Tree Search (MCTS) ont été utilisés pour résoudre les problèmes liés aux grands nombres d’actions et d’états de recherche. Nous modélisons le problème de reconfiguration d'applications sous la forme d'un processus de décision de Markov (MDP) et étudions l'utilisation des algorithmes RL et MCTS pour concevoir des plans de reconfiguration améliorant plusieurs métriques de qualité de service. / A large part of this big data is most valuable when analysed quickly, as it is generated. In several emerging application scenarios, such as smart cities, operational monitoring of large infrastructure, and the Internet of Things (IoT), continuous data streams must be processed within very short delays. In multiple domains, data streams need to be processed to detect patterns, identify failures, and gain insights. Data is often gathered and analysed by Data Stream Processing Engines (DSPEs). A DSPE commonly structures an application as a directed graph or dataflow. A dataflow has one or multiple sources (e.g., sensors, gateways, or actuators); operators that perform transformations on the data (e.g., filtering); and sinks (i.e., queries that consume or store the data). Most complex operator transformations store information about previously received data as new data is streamed in. A dataflow can also have stateless operators that consider only the current data. 
Traditionally, Data Stream Processing (DSP) applications were conceived to run in clusters of homogeneous resources or on the cloud. In a cloud deployment, the whole application is placed on a single cloud provider to benefit from virtually unlimited resources. This approach allows for elastic DSP applications that can allocate additional resources or release idle capacity on demand at runtime to match the application requirements. We introduce a set of strategies to place operators onto cloud and edge while considering the characteristics of resources and meeting the requirements of applications. In particular, we first decompose the application graph by identifying behaviours such as forks and joins, and then dynamically split the dataflow graph across edge and cloud. Comprehensive simulations and a real testbed considering multiple application settings demonstrate that our approach can reduce end-to-end latency by more than 50% and also improve other QoS metrics. The solution search space for operator reassignment can be enormous depending on the number of operators, streams, resources and network links. Moreover, it is important to minimise the cost of migration while improving latency. Reinforcement Learning (RL) and Monte-Carlo Tree Search (MCTS) have been used to tackle problems with large action and state spaces, performing at human level or better in games such as Go. We model the application reconfiguration problem as a Markov Decision Process (MDP) and investigate the use of RL and MCTS algorithms to devise reconfiguration plans that improve QoS metrics. 
Mécanismes Qualité de service (re)configuration Infrastructure hautement distribuée Internet des Objets Réseau edge-cloud Théorie des files d'attente Processus de décision de Markov Reinforcement Learning Mechanisms Quality of Service (re)configuration Data Stream Processing Applications Highly Distributed Infrastructure Internet of Things Edge-cloud infrastructure Queueing theory Markov Decision Process Reinforcement Learning
153	Network Utility Maximization Based on Information Freshness Cho-Hsin Tsai (12225227) 20 April 2022 It is predicted that there will be 41.6 billion IoT devices by 2025, which has kindled new interest in the timing coordination between sensors and controllers, i.e., how to use the waiting time to improve performance. Sun et al. showed that a controller can strictly improve the data freshness, the so-called Age-of-Information (AoI), via careful scheduling designs. The optimal waiting policy for the sensor side was later characterized in the context of remote estimation. The first part of this work develops the jointly optimal sensor/controller waiting policy. It generalizes the above two important results in that not only do we consider joint sensor/controller designs, but we also assume random delay in both the forward and feedback directions. The second part of the work revisits and significantly strengthens the seminal results of Sun et al. on the following fronts: (i) when designing the optimal offline schemes with full knowledge of the delay distributions, a new fixed-point-based method is proposed with a quadratic convergence rate; (ii) when the distributional knowledge is unavailable, two new low-complexity online algorithms are proposed, which provably attain the optimal average AoI penalty; and (iii) the online schemes also admit a modular architecture, which allows the designer to upgrade certain components to handle additional practical challenges. Two such upgrades are proposed: (iii.1) the AoI penalty function incurred at the destination is unknown to the source node and must also be estimated on the fly, and (iii.2) the unknown delay distribution is Markovian instead of i.i.d. 
With the exponential growth of interconnected IoT devices and the increasing risk of excessive resource consumption in mind, the third part of this work derives an optimal joint cost-and-AoI minimization solution for multiple coexisting source-destination (S-D) pairs. The results admit a new AoI-market-price-based interpretation and are applicable to the setting of (i) general heterogeneous AoI penalty functions and Markov delay distributions for each S-D pair, and (ii) a general network cost function of the aggregate throughput of all S-D pairs. In each part of this work, extensive simulation is used to demonstrate the superior performance of the proposed schemes. The discussion of analytical as well as numerical results sheds some light on designing practical network utility maximization protocols. Computer Engineering Information Systems Coding and Information Theory Networking and Communications Information Engineering and Theory Age-of-information (AoI) Data freshness Information freshness Remote estimation Online algorithm Fixed-point equation Stochastic approximation Stochastic control Information update system Markov decision process (MDP) Network utility maximization Information theory Wireless networking Networking Communication systems Communication theory
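The Age-of-Information metric that entry 153 optimizes can be illustrated with a minimal sketch. It only computes the time-average age for a given list of updates; it does not implement any of the waiting policies from the thesis. The function name `average_age`, the `(generated_at, delivered_at)` representation, and the assumption that a fresh update is already at the destination at time 0 are all hypothetical choices made for this illustration.

```python
def average_age(updates, horizon):
    """Time-average Age-of-Information over [0, horizon].

    updates: list of (generated_at, delivered_at) pairs sorted by delivery time.
    Assumption: an update generated at time 0 is already at the destination
    at time 0, so the age starts at 0.
    """
    area = 0.0       # integral of the age curve so far
    last_gen = 0.0   # generation time of the freshest delivered update
    t = 0.0          # current time
    for generated_at, delivered_at in updates:
        if delivered_at > horizon:
            break
        # Between deliveries the age grows linearly, so each piece is a trapezoid.
        area += ((t - last_gen) + (delivered_at - last_gen)) * (delivered_at - t) / 2.0
        t = delivered_at
        # On delivery the age drops to (now - generation time of that update).
        last_gen = max(last_gen, generated_at)
    # Remaining piece from the last delivery up to the horizon.
    area += ((t - last_gen) + (horizon - last_gen)) * (horizon - t) / 2.0
    return area / horizon


if __name__ == "__main__":
    # Two updates, each delivered one time unit after it was generated.
    print(average_age([(1.0, 2.0), (3.0, 4.0)], horizon=6.0))
```

In this picture, a waiting policy changes the generation times, and therefore the shape of the sawtooth whose area is averaged here.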
154	Large state spaces and self-supervision in reinforcement learning Touati, Ahmed 08 1900 (has links) L'apprentissage par renforcement (RL) est un paradigme d'apprentissage orienté agent qui s'intéresse à l'apprentissage en interagissant avec un environnement incertain. Combiné à des réseaux de neurones profonds comme approximateur de fonction, l'apprentissage par renforcement profond (Deep RL) nous a permis récemment de nous attaquer à des tâches très complexes et de permettre à des agents artificiels de maîtriser des jeux classiques comme le Go, de jouer à des jeux vidéo à partir de pixels et de résoudre des tâches de contrôle robotique. Toutefois, un examen plus approfondi de ces remarquables succès empiriques révèle certaines limites fondamentales. Tout d'abord, il a été difficile de combiner les caractéristiques souhaitables des algorithmes RL, telles que l'apprentissage hors politique et en plusieurs étapes, et l'approximation de fonctions, de manière à obtenir des algorithmes stables et efficaces dans de grands espaces d'états. De plus, les algorithmes RL profonds ont tendance à être très inefficaces en raison des stratégies d'exploration-exploitation rudimentaires que ces approches emploient. Enfin, ils nécessitent une énorme quantité de données supervisées et finissent par produire un agent étroit capable de résoudre uniquement la tâche sur laquelle il est entrainé. Dans cette thèse, nous proposons de nouvelles solutions aux problèmes de l'apprentissage hors politique et du dilemme exploration-exploitation dans les grands espaces d'états, ainsi que de l'auto-supervision dans la RL. En ce qui concerne l'apprentissage hors politique, nous apportons deux contributions. 
Tout d'abord, pour le problème de l'évaluation des politiques, nous montrons que la combinaison des méthodes populaires d'apprentissage hors politique et à plusieurs étapes avec une paramétrisation linéaire de la fonction de valeur pourrait conduire à une instabilité indésirable, et nous dérivons une variante de ces méthodes dont la convergence est prouvée. Deuxièmement, pour l'optimisation des politiques, nous proposons de stabiliser l'étape d'amélioration des politiques par une régularisation de divergence hors politique qui contraint les distributions stationnaires d'états induites par des politiques consécutives à être proches les unes des autres. Ensuite, nous étudions l'apprentissage en ligne dans de grands espaces d'états et nous nous concentrons sur deux hypothèses structurelles pour rendre le problème traitable : les environnements lisses et linéaires. Pour les environnements lisses, nous proposons un algorithme en ligne efficace qui apprend activement un partitionnement adaptatif de l'espace commun en zoomant sur les régions les plus prometteuses et fréquemment visitées. Pour les environnements linéaires, nous étudions un cadre plus réaliste, où l'environnement peut maintenant évoluer dynamiquement et même de façon antagoniste au fil du temps, mais le changement total est toujours limité. Pour traiter ce cadre, nous proposons un algorithme en ligne efficace basé sur l'itération de valeur des moindres carrés pondérés. Il utilise des poids exponentiels pour oublier doucement les données qui sont loin dans le passé, ce qui pousse l'agent à continuer à explorer pour découvrir les changements. Enfin, au-delà du cadre classique du RL, nous considérons un agent qui interagit avec son environnement sans signal de récompense. Nous proposons d'apprendre une paire de représentations qui mettent en correspondance les paires état-action avec un certain espace latent. 
Pendant la phase non supervisée, ces représentations sont entraînées en utilisant des interactions sans récompense pour encoder les relations à longue portée entre les états et les actions, via une carte d'occupation prédictive. Au moment du test, lorsqu'une fonction de récompense est révélée, nous montrons que la politique optimale pour cette récompense est directement obtenue à partir de ces représentations, sans aucune planification. Il s'agit d'une étape vers la construction d'agents entièrement contrôlables. Un thème commun de la thèse est la conception d'algorithmes RL prouvables et généralisables. Dans la première et la deuxième partie, nous traitons de la généralisation dans les grands espaces d'états, soit par approximation de fonctions linéaires, soit par agrégation d'états. Dans la dernière partie, nous nous concentrons sur la généralisation sur les fonctions de récompense et nous proposons un cadre d'apprentissage non-supervisé de représentation qui est capable d'optimiser toutes les fonctions de récompense. / Reinforcement Learning (RL) is an agent-oriented learning paradigm concerned with learning by interacting with an uncertain environment. Combined with deep neural networks as function approximators, deep reinforcement learning (Deep RL) allowed recently to tackle highly complex tasks and enable artificial agents to master classic games like Go, play video games from pixels, and solve robotic control tasks. However, a closer look at these remarkable empirical successes reveals some fundamental limitations. First, it has been challenging to combine desirable features of RL algorithms, such as off-policy and multi-step learning with function approximation in a way that leads to both stable and efficient algorithms in large state spaces. Moreover, Deep RL algorithms tend to be very sample inefficient due to the rudimentary exploration-exploitation strategies these approaches employ. 
Finally, they require an enormous amount of supervised data and end up producing a narrow agent able to solve only the task that it was trained on. In this thesis, we propose novel solutions to the problems of off-policy learning and exploration-exploitation dilemma in large state spaces, as well as self-supervision in RL. On the topic of off-policy learning, we provide two contributions. First, for the problem of policy evaluation, we show that combining popular off-policy and multi-step learning methods with linear value function parameterization could lead to undesirable instability, and we derive a provably convergent variant of these methods. Second, for policy optimization, we propose to stabilize the policy improvement step through an off-policy divergence regularization that constrains the discounted state-action visitation induced by consecutive policies to be close to one another. Next, we study online learning in large state spaces and we focus on two structural assumptions to make the problem tractable: smooth and linear environments. For smooth environments, we propose an efficient online algorithm that actively learns an adaptive partitioning of the joint space by zooming in on more promising and frequently visited regions. For linear environments, we study a more realistic setting, where the environment is now allowed to evolve dynamically and even adversarially over time, but the total change is still bounded. To address this setting, we propose an efficient online algorithm based on weighted least squares value iteration. It uses exponential weights to smoothly forget data that are far in the past, which drives the agent to keep exploring to discover changes. Finally, beyond the classical RL setting, we consider an agent interacting with its environments without a reward signal. We propose to learn a pair of representations that map state-action pairs to some latent space. 
During the unsupervised phase, these representations are trained using reward-free interactions to encode long-range relationships between states and actions, via a predictive occupancy map. At test time, once a reward function is revealed, we show that the optimal policy for that reward is directly obtained from these representations, with no planning. This is a step towards building fully controllable agents. A common theme in the thesis is the design of provable RL algorithms that generalize. In the first and the second part, we deal with generalization in large state spaces either by linear function approximation or state aggregation. In the last part, we focus on generalization over reward functions and we propose a task-agnostic representation learning framework that is provably able to solve all reward functions. reinforcement learning Markov decision process artificial agent off-policy learning function approximation exploration-exploitation trade-off self-supervision generalization apprentissage par renforcement processus de décision Markovien agent artificiel apprentissage hors-politique approximation de fonction compromis exploration-exploitation auto-supervision généralisation
155	Utilisation des communications Device-to-Device pour améliorer l'efficacité des réseaux cellulaires / Use of Device-to-Device communications for efficient cellular networks Ibrahim, Rita 04 February 2019 (has links) Cette thèse étudie les communications directes entre les mobiles, appelées communications D2D, en tant que technique prometteuse pour améliorer les futurs réseaux cellulaires. Cette technologie permet une communication directe entre deux terminaux mobiles sans passer par la station de base. La modélisation, l'évaluation et l'optimisation des différents aspects des communications D2D constituent les objectifs fondamentaux de cette thèse et sont réalisés principalement à l'aide des outils mathématiques suivants: la théorie des files d'attente, l'optimisation de Lyapunov et les processus de décision markovien partiellement observable POMDP. Les résultats de cette étude sont présentés en trois parties. Dans la première partie, nous étudions un schéma de sélection entre mode cellulaire et mode D2D. Nous dérivons les régions de stabilité des scénarios suivants: réseaux cellulaires purs et réseaux cellulaires où les communications D2D sont activées. Une comparaison entre ces deux scénarios conduit à l'élaboration d'un algorithme de sélection entre le mode cellulaire et le mode D2D qui permet d'améliorer la capacité du réseau. Dans la deuxième partie, nous développons un algorithme d'allocation de ressources des communications D2D. Les utilisateurs D2D sont en mesure d'estimer leur propre qualité de canal, cependant la station de base a besoin de recevoir des messages de signalisation pour acquérir cette information. Sur la base de cette connaissance disponibles au niveau des utilisateurs D2D, une approche d'allocation des ressources est proposée afin d'améliorer l'efficacité énergétique des communications D2D. La version distribuée de cet algorithme s'avère plus performante que celle centralisée. 
Dans le schéma distribué des collisions peuvent se produire durant la transmission de l'état des canaux D2D ; ainsi un algorithme de réduction des collisions est élaboré. En outre, la mise en œuvre des algorithmes centralisé et distribué dans un réseau cellulaire, type LTE, est décrite en détails. Dans la troisième partie, nous étudions une politique de sélection des relais D2D mobiles. La mobilité des relais représente un des principaux défis que rencontre toute stratégie de sélection de relais. Le problème est modélisé par un processus contraint de décision markovien partiellement observable qui prend en compte le dynamisme des relais et vise à trouver la politique de sélection de relais qui optimise la performance du réseau cellulaire sous des contraintes de coût. / This thesis considers Device-to-Device (D2D) communications as a promising technique for enhancing future cellular networks. Modeling, evaluating and optimizing D2D features are the fundamental goals of this thesis and are mainly achieved using the following mathematical tools: queuing theory, Lyapunov optimization and Partially Observable Markov Decision Processes (POMDPs). The findings of this study are presented in three parts. In the first part, we investigate a D2D mode selection scheme. We derive the queuing stability regions of two scenarios: pure cellular networks and D2D-enabled cellular networks. Comparing the two scenarios leads us to a D2D vs cellular mode selection design that improves the capacity of the network. In the second part, we develop a D2D resource allocation algorithm. We observe that D2D users are able to estimate their local Channel State Information (CSI); however, the base station needs some signaling exchange to acquire this information. Based on the D2D users' knowledge of their local CSI, we provide an energy-efficient resource allocation framework that shows how distributed scheduling outperforms centralized scheduling. 
In the distributed approach, collisions may occur between the different CSI reports; thus, we propose a collision reduction algorithm. Moreover, we give a detailed description of how both the centralized and distributed algorithms can be implemented in practice. In the third part, we propose a mobile relay selection policy in a D2D relay-aided network. Relay mobility is a crucial challenge for any strategy that selects the optimal D2D relays. The problem is formulated as a constrained POMDP which captures the dynamism of the relays and aims to find the optimal relay selection policy that maximizes the performance of the network under cost constraints. Réseaux Cellulaires Sélection de mode de communication Allocation des ressources Sélection des relais Théorie des files d'attente Optimisation Lyapunov Device-to-Device (D2D) communications Cellular Networks Mode selection Resource Allocation Relay selection Queuing theory Lyapunov optimization
156	On choice models in the context of MDPs Mohammadpour, Sobhan 10 1900 (has links) Cette thèse se penche sur les modèles de choix, des distributions sur des ensembles d'alternatives. Les modèles de choix sur les processus décisionnels de Markov (MDP) peuvent décomposer de très grands espaces alternatifs en procédures étape par étape conçues pour non seulement combattre la malédiction de la dimensionnalité mais aussi pour mieux refléter la dynamique sous-jacente. La première partie est consacrée à l'estimation du temps de trajet dans le cadre de la modélisation du choix de chemin. Les modèles de choix de chemin sont des modèles de choix sur l'ensemble des chemins utilisés pour modéliser le flux de circulation. Intuitivement, le temps de trajet est l'une des caractéristiques les plus importantes lors du choix des chemins, mais les temps de trajet ne sont pas toujours connus. En revanche, le cadre classique suppose que ces deux étapes sont séquentielles, car les temps de trajet des arcs font partie de l'entrée du processus d'estimation du choix de chemin. Pourtant, les interdépendances complexes signifient que ce modèle de choix de chemin peut complémenter toute observation lors de l'estimation des temps de trajet. Nous construisons un modèle statistique pour l'estimation du temps de trajet et proposons de marginaliser les caractéristiques non observées. En utilisant ces idées, nous montrons que nous sommes capables d'apprendre des modèles de choix de chemin sans observer de chemins réels et à différentes granularités. La deuxième partie se concentre sur les échecs des MDP régularisés et comment la régularisation peut avoir des effets secondaires inattendus, tels que la divergence dans les chemins stochastiques les plus courts ou des fonctions de valeur déraisonnablement grandes. Les MDP régularisés ne sont rien d'autre qu'une application des modèles de choix aux MDP. 
Ils sont utilisés dans l'apprentissage par renforcement (RL) pour obtenir, entre autres choses, un modèle de choix sur les trajectoires possibles pour l'apprentissage par renforcement inverse, transférer des connaissances préalables au modèle, ou obtenir des politiques qui exploitent tous les objectifs dans l'environnement. Ces effets secondaires sont exacerbés dans les espaces d'action dépendants de l'état. Comme mesure d'atténuation, nous introduisons deux transformations potentielles, et nous évaluons leur performance sur un problème de conception de médicaments. / This thesis delves into choice models, distributions on sets of alternatives. Choice models on Markov decision processes (MDPs) can break down very large alternative spaces into step-by-step procedures designed not only to tackle the curse of dimensionality but also to better reflect the underlying dynamics. The first part is devoted to travel time estimation as part of path choice modeling. Path choice models are choice models on the set of paths used to model traffic flow. Intuitively, travel time is one of the more important features when choosing paths, yet travel times are not always known. In contrast, the classical setting assumes that these two steps are sequential, as arc travel times are part of the input of the path choice estimation process. Yet the intricate interdependences mean that the path choice model can complement any observation when estimating travel times. We build a statistical model for travel time estimation and propose marginalizing the unobserved features. Using these ideas, we show that we are able to learn path choice models without observing actual paths and at different granularities. The second part focuses on the failings of regularized MDPs and how regularization may have unexpected side effects, such as divergence in stochastic shortest paths or unreasonably large value functions. Regularized MDPs are nothing but an application of choice models to MDPs. 
They are used in reinforcement learning (RL) to get, among other things, a choice model on possible trajectories for inverse reinforcement learning, transfer prior knowledge to the model, or to get policies that exploit all goals in the environment. These side effects are exacerbated in state-dependent action spaces. As a mitigation, we introduce two potential transformations, and we benchmark their performance on a drug design problem. Estimation du temps de trajet Route choice modeling Path choice models Modèles de choix de chemin Modélisation du choix d’itinéraire Maximum entropy reinforcement learning Regularized Markov decision process Travel time estimation
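The side effects of regularization mentioned in entry 156 can be seen on a toy problem. The sketch below is an illustration under stated assumptions, not the thesis's model: it assumes entropy regularization with temperature 1, so the Bellman backup becomes a log-sum-exp over actions, and it uses a hypothetical single state with two actions, a self-loop and an exit to a terminal state of value 0. The name `soft_value` is made up for this example.

```python
import math


def soft_value(loop_reward, exit_reward=0.0, iters=10_000):
    """Iterate the entropy-regularized (log-sum-exp) backup for one state.

    Two actions are available: a self-loop with reward `loop_reward`, and an
    exit with reward `exit_reward` leading to a terminal state of value 0.
    Assumption: temperature 1, no discounting.
    """
    v = 0.0
    for _ in range(iters):
        # Soft backup: V = log( exp(r_loop + V) + exp(r_exit + 0) )
        v = math.log(math.exp(loop_reward + v) + math.exp(exit_reward))
    return v


if __name__ == "__main__":
    # Costly loop: converges to -log(1 - exp(-1)), a positive value even
    # though no reward in the problem is positive.
    print(soft_value(-1.0))
    # Nearly free loop: the fixed point -log(1 - exp(-0.01)) is much larger.
    print(soft_value(-0.01))
    # Free loop: no fixed point exists; the iterate grows as log(iters + 1).
    print(soft_value(0.0, iters=1_000))
```

With the ordinary max backup the value of this state would be 0 whenever the loop reward is non-positive, so the growth seen here comes entirely from the regularization term.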
157	ENABLING RIDE-SHARING IN ON-DEMAND AIR SERVICE OPERATIONS THROUGH REINFORCEMENT LEARNING Apoorv Maheshwari (11564572) 22 November 2021 The convergence of various technological and operational advancements has renewed interest in On-Demand Air Service (ODAS) as a viable mode of transportation. ODAS enables an end-user to be transported in an aircraft between their desired origin and destination at their preferred time without advance notice. Industry, academia, and government organizations are collaborating to create technology solutions suited for large-scale implementation of this mode of transportation. Market studies suggest that reducing the vehicle operating cost per passenger is one of the biggest enablers of this market. To enable ODAS, an ODAS operator controls a fleet of aircraft that are deployed across a set of nodes (e.g., airports, vertiports) to satisfy end-user transportation requests. There is a gap in the literature for a tractable, online methodology that can enable ride-sharing in on-demand operations while maintaining a publicly acceptable level of service (such as low waiting time). The need for an approach that not only supports a dynamic-stochastic formulation but can also handle uncertainty with unknowable properties drives me towards the field of Reinforcement Learning (RL). In this work, a novel two-layer hierarchical RL framework is proposed that can distribute a fleet of aircraft across a nodal network as well as perform real-time scheduling for an ODAS operator. The top layer of the framework - the Fleet Distributor - is modeled as a Partially Observable Markov Decision Process whereas the lower layer - the Trip Request Manager - is modeled as a Semi-Markov Decision Process. This framework is successfully demonstrated and assessed through various studies for a hypothetical ODAS operator in the Chicago region. 
This approach provides a new way of solving fleet distribution and scheduling problems in aviation. It also bridges the gap between the state-of-the-art RL advancements and node-based transportation network problems. Moreover, this work provides a non-proprietary approach to reasonably model ODAS operations that can be leveraged by researchers and policy makers. Knowledge representation and reasoning Operations research Advanced Air Mobility reinforcement learning artificial intelligence operations strategy scheduling Aerospace and defense industry Aeronautics. machine learning-based Air Transportation system Ride-sharing Markov decision process (MDP) uncertainty and fluctuations Aerospace Engineering Operations Research
158	Belief-aided Robust Control for Remote Electrical Tilt Optimization Jönsson, Jack January 2021 Remote Electrical Tilt (RET) is a method for configuring antenna downtilt in base stations to optimize mobile network performance. Reinforcement Learning (RL) is an approach to automating the process by letting an agent learn an optimal control strategy and adapt to the dynamic environment. Applying RL in the real world comes with challenges: for the RET problem, there are performance requirements, and the system is only partially observable because exogenous factors induce noise in the observations. This thesis proposes a solution method that models the problem as a Partially Observable Markov Decision Process (POMDP). The hidden states are modeled as a high-level representation of situations requiring one of the possible actions: uptilt, downtilt, or no change. From this model, a Bayesian Neural Network (BNN) is trained to predict an observation model, relating observed Key Performance Indicators (KPIs) to the hidden states. The observation model is used for estimating belief state probabilities of each hidden state, from which the control action is chosen through a restrictive threshold policy. Experiments comparing the method to a baseline Deep Q-network (DQN) agent show that the method reaches the same average performance increase as the baseline while outperforming it in two metrics important for robust and safe control behaviour: the worst-case minimum reward increase and the average reward increase per number of tilt actions. / Fjärrstyrning av Elektrisk Lutning (FEL) är en metod för att reglera lutningen av antenner i basstationer för att optimera prestandan i ett mobilnätverk. Förstärkande Inlärning (FI) används som metod för att automatisera processen genom att låta en agent lära sig en optimal strategi för reglering och anpassa sig till den dynamiska miljön. 
Att tillämpa FI i ett verkligt scenario innebär utmaningar, för FEL specifikt finns det krav på en viss nivå av prestanda samt endast en delvis observerbarhet av systemet på grund av externa faktorer som orsakar brus i observationerna. I detta arbete föreslås en metod för att hantera detta genom att modellera problemet som en Delvis Observerbar Markovprocess (DOM). De dolda tillstånden modelleras för att representera situationer där var och en av de möjliga aktionerna behövs, det vill säga att luta antennen upp, ner eller inte ändra på lutningen. Utifrån denna modellering så tränas ett Bayesiskt Neuralt Nätverk (BNN) för att estimera en observationsmodel som kopplar observerade nyckeltal till de dolda tillstånden. Denna observationsmodel används för att estimera sannolikheten att vardera dolt tillstånd är det rätta. Utifrån dessa sannolikheter så görs valet av aktion genom ett tröskelvärde på sannolikheterna. Genom experiment som jämför metoden med en standardimplementering av en agent baserad på ett Djupt Qnätverk (DQN) visas att metoden har samma prestation när det kommer till en medelnivå på prestandaökning i nätverket. Metoden överträffar dock standardmetoden i två andra mätvärden som är viktiga ur aspekten säker och robust reglering, minimumvärdet på prestandaökningen samt medelökningen av prestandan per antal up- och nerlutningar som används. Mobile Network Optimization Remote Electrical Tilt Robust Control Reinforcement Learning Belief State Estimation Bayesian Neural Network Optimering av Mobilnätverk Fjärrstyrning av Elektrisk Lutning Robust Reglering Delvis Observerbar Markovprocess Tillståndsestimering Bayesiskt Neuralt Nätverk Elektroteknik och elektronik
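The belief estimation and threshold policy described in entry 158 can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the fixed likelihood vector (which in the thesis would come from the trained BNN observation model), the threshold value of 0.9, and the simplification that the hidden state does not change between observations are all assumptions made here for illustration.

```python
from typing import List, Sequence

# Hidden states, each named after the action it calls for (as in the abstract).
ACTIONS = ("uptilt", "downtilt", "no_change")


def update_belief(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """Bayes update of the belief over hidden states.

    `likelihood[i]` is the probability of the current observation given
    hidden state i. Assumption: a static hidden state, so no transition
    model is applied between observations.
    """
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0.0:
        # Observation impossible under every state: keep the prior unchanged.
        return list(prior)
    return [u / total for u in unnormalized]


def threshold_policy(belief: Sequence[float], threshold: float = 0.9) -> str:
    """Change the tilt only when the belief in a tilt state is high enough.

    The threshold (hypothetical value) makes the policy restrictive:
    when the agent is unsure, it falls back to "no_change".
    """
    best = max(range(len(belief)), key=lambda i: belief[i])
    if ACTIONS[best] != "no_change" and belief[best] >= threshold:
        return ACTIONS[best]
    return "no_change"


# Usage: two consecutive observations that both favour "uptilt".
uniform_prior = [1 / 3, 1 / 3, 1 / 3]
obs_likelihood = [0.8, 0.1, 0.1]  # hypothetical observation-model output
belief_1 = update_belief(uniform_prior, obs_likelihood)
belief_2 = update_belief(belief_1, obs_likelihood)
action_1 = threshold_policy(belief_1)
action_2 = threshold_policy(belief_2)
```

In this sketch a single observation leaves the belief in "uptilt" below the threshold, so no tilt change is made; only after a second consistent observation does the policy act, which is the conservative behaviour a restrictive threshold is meant to produce.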
159	Integrating Maintenance Planning and Production Scheduling: Making Operational Decisions with a Strategic Perspective Aramon Bajestani, Maliheh 16 July 2014 (has links) In today's competitive environment, the importance of continuous production, quality improvement, and fast delivery has forced production and delivery processes to become highly reliable. Keeping equipment in good condition through maintenance activities can ensure a more reliable system. However, maintenance leads to a temporary reduction in capacity that could otherwise be utilized for production. Therefore, the coordination of maintenance and production is important to guarantee good system performance. The central thesis of this dissertation is that integrating maintenance and production decisions increases efficiency by ensuring high-quality production, effective resource utilization, and on-time deliveries. Firstly, we study the problem of integrated maintenance and production planning where machines are preventively maintained in the context of a periodic review production system with uncertain yield. Our goal is to provide insight into the optimal maintenance policy, increasing the number of finished products. Specifically, we prove the conditions that guarantee that the optimal maintenance policy is of threshold type. Secondly, we address the problem of integrated maintenance planning and production scheduling where machines are correctively maintained in the context of a dynamic aircraft repair shop. To solve the problem, we view the dynamic repair shop as successive static repair scheduling sub-problems over shorter periods. Our results show that the approach that uses logic-based Benders decomposition to solve the static sub-problems, schedules over a longer horizon, and quickly adjusts the schedule increases the utilization of aircraft in the long term. 
Finally, we tackle the problem of integrated maintenance planning and production scheduling where machines are preventively maintained in the context of a multi-machine production system. Depending on the deterioration process of the machines, we design decomposed techniques that deal with the stochastic and combinatorial challenges in different, coupled stages. Our results demonstrate that the integrated approaches decrease the total maintenance and lost production cost, maximizing on-time deliveries. We also prove sufficient conditions that guarantee the monotonicity of the optimal maintenance policy in both the machine state and the number of customer orders. Within these three contexts, this dissertation demonstrates that integrated maintenance and production decision-making increases process efficiency, yielding high-quality products in a timely manner. Scheduling Maintenance Optimization Production Planning Decomposition Logic-based Benders Decomposition Repair Shop Scheduling Dynamic Scheduling Constraint Programming Random Yield Periodic Review Production System Rescheduling Mixed Integer Programming Machine Deterioration Flowshop Scheduling Markov Decision Process Threshold Maintenance Policy Hybrid Optimization Minimizing the Number of Tardy Jobs Maintenance Capacity Limit Decomposed, but Coupled Algorithms Yield Management 0796 0546
