21 |
Risk-aware Autonomous Driving Using POMDPs and Responsibility-Sensitive Safety / POMDP-modellerad Riskmedveten Autonom Körning med Riskmått
Skoglund, Caroline (January 2021)
Autonomous vehicles promise to play an important role in increasing the efficiency and safety of road transportation. Although several autonomous vehicles have appeared on the road in recent years, ensuring their safety in an uncertain and dynamic environment remains a challenging problem. This thesis studies this problem by developing a risk-aware decision-making framework. The system that integrates the dynamics of an autonomous vehicle and the uncertain environment is modelled as a Partially Observable Markov Decision Process (POMDP). A risk measure is proposed based on the Responsibility-Sensitive Safety (RSS) distance, which quantifies the minimum distance to other vehicles needed to ensure safety. This risk measure is incorporated into the reward function of the POMDP to achieve risk-aware decision making. The proposed risk-aware POMDP framework is evaluated in two case studies. In a single-lane car-following scenario, the ego vehicle successfully avoids a collision in an emergency event where the vehicle in front of it comes to a full stop. In the merge scenario, the ego vehicle successfully enters the main road from a ramp with a satisfactory distance to other vehicles. In conclusion, the risk-aware POMDP framework realizes a trade-off between safety and usability by keeping a reasonable distance and adapting to other vehicles' behaviours.
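The abstract does not spell out the risk measure, so the following is only a minimal sketch of how an RSS-style distance could feed a reward term. It uses the longitudinal safe-distance formula as commonly stated in the RSS literature; the function names, the linear penalty, and all parameter values are assumptions for illustration, not the thesis's implementation.

```python
def rss_min_gap(v_rear, v_front, rho, a_max_accel, b_min_brake, b_max_brake):
    """RSS-style minimum safe longitudinal gap in metres.

    v_rear, v_front: speeds of the rear and front vehicle (m/s)
    rho: response time of the rear vehicle (s)
    a_max_accel: worst-case acceleration of the rear vehicle during rho (m/s^2)
    b_min_brake: braking the rear vehicle is guaranteed to apply afterwards (m/s^2)
    b_max_brake: hardest braking the front vehicle may apply (m/s^2)
    """
    v_rear_after = v_rear + rho * a_max_accel  # rear speed after the response time
    gap = (v_rear * rho
           + 0.5 * a_max_accel * rho ** 2
           + v_rear_after ** 2 / (2.0 * b_min_brake)
           - v_front ** 2 / (2.0 * b_max_brake))
    return max(gap, 0.0)  # a negative value means any gap is safe


def risk_penalty(actual_gap, min_gap, weight=1.0):
    """Hypothetical reward term: zero when safe, linear penalty below min_gap."""
    return -weight * max(min_gap - actual_gap, 0.0)


if __name__ == "__main__":
    # Assumed example values: both vehicles at 20 m/s, 1 s response time.
    gap = rss_min_gap(20.0, 20.0, 1.0, 2.0, 4.0, 8.0)
    print(gap, risk_penalty(50.0, gap))
```

A term like `risk_penalty` could be added to a POMDP reward so that the planner trades progress against staying outside the RSS distance; the weighting is a design choice the abstract does not specify.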
|
22 |
Deep Reinforcement Learning for Autonomous Highway Driving Scenario
Pradhan, Neil (January 2021)
We present an autonomous driving agent in a simulated highway driving scenario with vehicles such as cars and trucks moving with stochastically variable velocity profiles. The focus of the simulated environment is to test tactical decision making in highway driving scenarios. An agent (vehicle) that maintains an optimal range of velocity benefits both energy efficiency and the environment. In order to maintain an optimal range of velocity, in this thesis work I propose two novel reward structures: (a) a Gaussian reward structure and (b) an exponential rise and fall reward structure. I trained two deep reinforcement learning agents, one with each reward structure, to study their differences and evaluate their performance based on a set of parameters that are most relevant in highway driving scenarios. The algorithm implemented in this thesis work is a double-dueling deep Q-network with a prioritized experience replay buffer. Experiments were performed by adding noise to the inputs, simulating a Partially Observable Markov Decision Process, in order to compare the reliability of the different reward structures. A velocity occupancy grid was found to be better than a binary occupancy grid as input for the algorithm. Furthermore, a methodology for generating fuel-efficient policies is discussed and demonstrated with an example.
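The abstract names a Gaussian reward structure without giving its form. One plausible reading, sketched below, is a reward that peaks at a target velocity and falls off smoothly on either side; the target velocity, the width `sigma`, and the function name are hypothetical.

```python
import math


def gaussian_velocity_reward(v, v_target=25.0, sigma=3.0):
    """Reward in (0, 1] that peaks when the velocity v equals v_target.

    v_target and sigma (both in m/s) are assumed values for illustration.
    """
    return math.exp(-((v - v_target) ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    for v in (19.0, 22.0, 25.0, 28.0, 31.0):
        print(v, round(gaussian_velocity_reward(v), 4))
```

A narrow `sigma` rewards only velocities very close to the target, while a wide one tolerates a broader band; the thesis's actual parameterisation may differ.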
|
23 |
System Availability Maximization and Residual Life Prediction under Partial Observations
Jiang, Rui (10 January 2012)
Many real-world systems experience deterioration with usage and age, which often leads to low product quality, high production cost, and low system availability. Most previous maintenance and reliability models in the literature do not incorporate condition monitoring information for decision making, which often results in poor failure prediction for partially observable deteriorating systems. For that reason, the development of fault prediction and control schemes using condition-based maintenance techniques has received considerable attention in recent years. This research presents a new framework for predicting failures of a partially observable deteriorating system using Bayesian control techniques. A time series model is fitted to a vector observation process representing partial information about the system state. Residuals, which are indicative of system deterioration, are then calculated using the fitted model. The deterioration process is modeled as a 3-state continuous-time homogeneous Markov process. States 0 and 1 are not observable, representing healthy (good) and unhealthy (warning) system operational conditions, respectively. Only the failure state 2 is assumed to be observable. Preventive maintenance can be carried out at any sampling epoch, and corrective maintenance is carried out upon system failure. The form of the optimal control policy that maximizes the long-run expected average availability per unit time has been investigated. It has been proved that a control limit policy is optimal for decision making. The model parameters have been estimated using the Expectation Maximization (EM) algorithm. The optimal Bayesian fault prediction and control scheme, considering long-run average availability maximization along with a practical statistical constraint, has been proposed and compared with the age-based replacement policy. The optimal control limit and sampling interval are calculated in the semi-Markov decision process (SMDP) framework.
Another Bayesian fault prediction and control scheme has been developed based on the average run length (ARL) criterion. Comparisons with traditional control charts are provided. Formulae for the mean residual life and the distribution function of system residual life have been derived in explicit forms as functions of a posterior probability statistic. The advantage of the Bayesian model over the well-known 2-parameter Weibull model in system residual life prediction is shown. The methodologies are illustrated using simulated data, real data obtained from the spectrometric analysis of oil samples collected from transmission units of heavy hauler trucks in the mining industry, and vibration data from a planetary gearbox machinery application.
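To illustrate the idea of a control limit policy driven by a posterior probability statistic, here is a much-simplified sketch. It assumes a scalar residual with a normal distribution in each hidden state and a fixed per-interval probability of moving from the healthy to the warning state; the thesis works with vector residuals and a continuous-time model, so every name and number below is an assumption.

```python
import math


def normal_pdf(y, mean, std):
    """Density of a normal distribution at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def update_warning_probability(prior, y, p01=0.05, mean0=0.0, mean1=2.0, std=1.0):
    """Posterior probability of the hidden warning state after residual y.

    prior: probability of the warning state at the previous sampling epoch
    p01:   assumed probability of a healthy -> warning move per interval
    """
    predicted = prior + (1.0 - prior) * p01                 # prediction step
    like1 = predicted * normal_pdf(y, mean1, std)          # warning-state evidence
    like0 = (1.0 - predicted) * normal_pdf(y, mean0, std)  # healthy-state evidence
    return like1 / (like0 + like1)                         # Bayes' rule


def should_maintain(posterior, control_limit=0.6):
    """Control limit policy: maintain once the posterior reaches the limit."""
    return posterior >= control_limit


if __name__ == "__main__":
    posterior = 0.0
    for residual in (0.1, 0.4, 1.8, 2.5, 2.2):  # made-up residuals
        posterior = update_warning_probability(posterior, residual)
        print(round(posterior, 3), should_maintain(posterior))
```

In this sketch the control limit is a fixed number; in the abstract it is the quantity optimised, together with the sampling interval, in the SMDP framework.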
|
25 |
Optimal Control Problems In Communication Networks With Information Delays And Quality Of Service Constraints
Kuri, Joy (02 1900)
In this thesis, we consider optimal control problems arising in high-speed integrated communication networks with Quality of Service (QOS) constraints. Integrated networks are expected to carry a large variety of traffic sources with widely varying traffic characteristics and performance requirements. Broadly, the traffic sources fall into two categories: (a) real-time sources with specified performance criteria, like small end to end delay and loss probability (sources of this type are referred to as Type 1 sources below), and (b) sources that do not have stringent performance criteria and do not demand performance guarantees from the network - the so-called Best Effort Type sources (these are referred to as Type 2 sources below). From the network's point of view, Type 2 sources are much more "controllable" than Type 1 sources, in the sense that the Type 2 sources can be dynamically slowed down, stopped or speeded up depending on traffic congestion in the network, while for Type 1 sources, the only control action available in case of congestion is packet dropping. Carrying sources of both types in the same network concurrently while meeting the performance objectives of Type 1 sources is a challenge and raises the question of equitable sharing of resources. The objective is to carry as much Type 2 traffic as possible without sacrificing the performance requirements of Type 1 traffic.
We consider simple models that capture this situation. Consider a network node through which two connections pass, one each of Types 1 and 2. One would like to maximize the throughput of the Type 2 connection while ensuring that the Type 1 connection's performance objectives are met. This can be set up as a constrained optimization problem that, however, is very hard to solve. We introduce a parameter b that represents the "cost" of buffer occupancy by Type 2 traffic. Since buffer space is limited and shared, a queued Type 2 packet means that a buffer position is not available for storing a Type 1 packet; to discourage the Type 2 connection from hogging the buffer, the cost parameter b is introduced, while a reward for each Type 2 packet coming into the buffer encourages the Type 2 connection to transmit at a high rate.
Using standard on-off models for the Type 1 sources, we show how values can be assigned to the parameter b; the value depends on the characteristics of the Type 1 connection passing through the node, i.e., whether it is a Variable Bit Rate (VBR) video connection, a Constant Bit Rate (CBR) connection, etc. Our approach gives concrete networking significance to the parameter b, which has long been considered an abstract parameter in reward-penalty formulations of flow control problems (for example, [Stidham '85]).
Having seen how to assign values to b, we focus on the Type 2 connection next. Since Type 2 connections do not have strict performance requirements, it is possible to defer transmitting a Type 2 packet, if the conditions downstream so warrant. This leads to the question: what is the "best" transmission policy for Type 2 packets? Decisions to transmit or not must be based on congestion conditions downstream; however, the network state that is available at any instant gives information that is old, since feedback latency is an inherent feature of high speed networks. Thus the problem is to identify the best transmission policy under delayed feedback information.
We study this problem in the framework of Markov Decision Theory. With appropriate assumptions on the arrivals, service times and scheduling discipline at a network node, we formulate our problem as a Partially Observable Controlled Markov Chain (PO-CMC). We then give an equivalent formulation of the problem in terms of a Completely Observable Controlled Markov Chain (CO-CMC) that is easier to deal with. Using Dynamic Programming and Value Iteration, we identify structural properties of an optimal transmission policy when the delay in obtaining feedback information is one time slot. For both discounted and average cost criteria, we show that the optimal policy has a two-threshold structure, with the threshold on the observed queue length depending on whether a Type 2 packet was transmitted in the last slot or not.
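The two-threshold structure described above can be pictured as a small decision rule. The sketch below is only an illustration: the threshold values, their ordering, and the function name are hypothetical, and the abstract does not state how the thresholds are computed.

```python
def transmit_type2(observed_queue_length, transmitted_last_slot,
                   threshold_if_sent=3, threshold_if_idle=5):
    """Decide whether to transmit a Type 2 packet in the current slot.

    observed_queue_length: downstream queue length, observed one slot late
    transmitted_last_slot: True if a Type 2 packet was sent in the last slot
    The two thresholds are assumed values; a lower threshold after a
    transmission reflects that the sent packet is not yet visible in the
    delayed observation.
    """
    if transmitted_last_slot:
        threshold = threshold_if_sent
    else:
        threshold = threshold_if_idle
    # Transmit only while the delayed observation is below the threshold.
    return observed_queue_length < threshold


if __name__ == "__main__":
    for queue in range(7):
        print(queue, transmit_type2(queue, True), transmit_type2(queue, False))
```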
For an observation delay k ≥ 2, the Value Iteration technique does not yield results. We use the structure of the problem to provide computable upper and lower bounds to the optimal value function. A study of these bounds yields information about the structure of the optimal policy for this problem. We show that for appropriate values of the parameters of the problem, depending on the number of transmissions in the last k steps, there is an "upper cut off" number such that if the observed queue length is greater than or equal to this number, the optimal action is to not transmit. Since the number of transmissions in the last k steps is between 0 and k, both inclusive, we have a stack of (k + 1) upper cut off values. We conjecture that these (k + 1) values are thresholds and that the optimal policy for this problem has a (k + 1)-threshold structure.
So far it has been assumed that the parameters of the problem are known at the transmission control point. In reality, they are usually not known and change over time. Thus, one needs an adaptive transmission policy that keeps track of and adjusts to changing network conditions. We show that the information structure in our problem admits a simple adaptive policy that performs reasonably well in a quasi-static traffic environment.
Up to this point, the models we have studied correspond to a single hop in a virtual connection. We consider the multiple hop problem next. A basic matter of interest here is whether one should have end to end or hop by hop controls. We develop a sample path approach to answer this question. It turns out that depending on the relative values of the b parameter in the transmitting node and its downstream neighbour, sometimes end to end controls are preferable while at other times hop by hop controls are preferable.
Finally, we consider a routing problem in a high speed network where feedback information is delayed, as usual. As before, we formulate the problem in the framework of Markov Decision Theory and apply Value Iteration to deduce structural properties of an optimal control policy. We show that for both discounted and average cost criteria, the optimal policy for an observation delay of one slot is Join the Shortest Expected Queue (JSEQ), a natural and intuitively satisfactory extension of the well-known Join the Shortest Queue (JSQ) policy that is optimal when there is no feedback delay (see, for example, [Weber 78]). However, for an observation delay of more than one slot, we show that the JSEQ policy is not optimal. Determining the structure of the optimal policy for a delay k ≥ 2 appears to be very difficult using the Value Iteration approach; we explore some likely policies by simulation.
|
26 |
Semi-Markov Processes In Dynamic Games And Finance
Goswami, Anindya (02 1900)
Two different sets of problems are addressed in this thesis. The first concerns partially observed semi-Markov games (POSMG) and the second concerns a semi-Markov modulated financial market model.
In this thesis we study a partially observable semi-Markov game over an infinite time horizon. The study of a partially observable game (POG) involves three major steps: (i) construct an equivalent completely observable game (COG), (ii) establish the equivalence between POG and COG by showing that if COG admits an equilibrium, POG does so, (iii) study the equilibrium of COG and find the corresponding equilibrium of the original partially observable problem.
For an infinite time horizon game there are two different payoff criteria: the discounted payoff criterion and the average payoff criterion. First, a partially observable semi-Markov decision process on a general state space with the discounted cost criterion is studied. An optimal policy is shown to exist by considering Shapley's equation for the corresponding completely observable model. Next, the discounted payoff problem is studied for the two-person zero-sum case. A saddle point equilibrium is shown to exist for this case. Then the variable sum game is investigated. For this case the Nash equilibrium strategy is obtained in the Markov class under a suitable assumption. Next, the POSMG problem on a countable state space is addressed for the average payoff criterion. It is well known that under this criterion the game problem does not have a solution in general. To ensure a solution one needs some kind of ergodicity of the transition kernel. We find an appropriate ergodicity of the partially observed model which in turn induces a geometric ergodicity in the equivalent model. Using this we establish a solution of the corresponding average payoff optimality equation (APOE). Thus the value and a saddle point equilibrium are obtained for the original partially observable model. A value iteration scheme is also developed to find the average value of the game.
Next we study the financial market model whose key parameters are modulated by semi-Markov processes. Two different problems are addressed under this market assumption. In the first one we show that this market is incomplete. In such an incomplete market we find the locally risk minimizing prices of exotic options in the Föllmer-Schweizer framework. In this model the stock prices are no longer Markov. Generally the stock price process is modeled as a Markov process because otherwise one may not get a PDE representation of the price of a contingent claim. To overcome this difficulty we find an appropriate Markov process which includes the stock price as a component and then find its infinitesimal generator. Using the Feynman-Kac formula we obtain a system of non-local partial differential equations satisfied by the option price functions in the mild sense. Next this system is shown to have a classical solution for given initial or boundary conditions.
Then this solution is used to obtain a Föllmer-Schweizer decomposition of the option price. Thus we obtain the locally risk minimizing prices of different options. Furthermore we obtain an integral equation satisfied by the unique solution of this system. This enables us to compute the price of a contingent claim and find the risk minimizing hedging strategy numerically. Further we develop an efficient and stable numerical method to compute the prices.
Besides this work on derivative pricing, the portfolio optimization problem in a semi-Markov modulated market is also studied in the thesis. We find the optimal portfolio selections by optimizing the expected utility of terminal wealth. We also obtain the optimal portfolio selections under the risk sensitive criterion for both finite and infinite time horizons.
|
27 |
Learning in Partially Observable Markov Decision Processes
Sachan, Mohit (21 August 2013)
Indiana University-Purdue University Indianapolis (IUPUI) / Learning in Partially Observable Markov Decision Processes (POMDPs) is motivated by the need to address a number of realistic problems. A number of methods exist for learning in POMDPs, but learning with a limited amount of information about the POMDP model remains highly sought after. Learning with minimal information matters in complex systems, where methods that require complete information sharing among decision makers become impractical as the problem dimensionality grows.
In this thesis we address the problem of decentralized control of POMDPs with unknown transition probabilities and reward. We suggest learning in POMDP using a tree based approach. States of the POMDP are guessed using this tree. Each node in the tree has an automaton in it and acts as a decentralized decision maker for the POMDP. The start state of POMDP is known as the landmark state. Each automaton in the tree uses a simple learning scheme to update its action choice and requires minimal information. The principal result derived is that, without proper knowledge of transition probabilities and rewards, the automata tree of decision makers will converge to a set of actions that maximizes the long term expected reward per unit time obtained by the system. The analysis is based on learning in sequential stochastic games and properties of ergodic Markov chains. Simulation results are presented to compare the long term rewards of the system under different decision control algorithms.
|
28 |
Scheduling in Wireless Networks with Limited and Imperfect Channel Knowledge
Ouyang, Wenzhuo (18 August 2014)
No description available.
|
29 |
Opportunistic Scheduling Using Channel Memory in Markov-modeled Wireless Networks
Murugesan, Sugumar (26 October 2010)
No description available.
|
30 |
Uma arquitetura de Agentes BDI para auto-regulação de Trocas Sociais em Sistemas Multiagentes Abertos / Self-Regulation of Personality-Based Social Exchanges in Open Multiagent Systems
Gonçalves, Luciano Vargas (31 March 2009)
The study and development of systems to control interactions in multiagent systems is an open problem in Artificial Intelligence. Piaget's system of social exchange values is a social approach that provides a foundation for modeling interactions between agents. Interactions are seen as service exchanges between pairs of agents, with an evaluation of the services performed or received, that is, the investments and profits in the exchange, and of the credits and debits to be charged or received, respectively, in future exchanges. The agents may perform this evaluation in different ways, since they may have different exchange personality traits. Over the course of an exchange process, the different ways of evaluating profits and losses may cause disequilibrium in the exchange balances, where some agents accumulate profits and others accumulate losses. To solve the exchange equilibrium problem, we use Partially Observable Markov Decision Processes (POMDPs) to help the agent decide on actions that can lead to the equilibrium of the social exchanges. Each agent has its own internal process to evaluate its current balance of the results of the exchange process with the other agents by observing its internal state; by also observing its partner's exchange behavior, it is able to deliberate on the best action to perform in order to reach equilibrium in the exchanges. In an open multiagent system, a mechanism is needed to recognize the different personality traits in order to build the POMDPs that manage the exchanges between pairs of agents. This recognition task is performed by Hidden Markov Models (HMMs), which, starting from models of known personality traits, can approximate the personality traits of new partners just by analyzing observations of the agents' behavior in exchanges. The aim of this work is to develop a hybrid agent architecture for the self-regulation of social exchanges between personality-based agents in an open multiagent system. The architecture is based on the BDI (Beliefs, Desires, Intentions) architecture, where the agent plans are obtained from optimal policies of POMDPs, which model personality traits that are recognized by HMMs. To evaluate the proposed approach, simulations were carried out with both known and new personality traits.
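As an illustration of recognising a partner's personality trait with HMMs, the sketch below scores an observation sequence under one HMM per known trait using the standard forward algorithm and picks the best-scoring trait. The trait names, the two-state models, and the binary observation encoding are all hypothetical; the abstract does not describe the models actually used.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Probability of an observation sequence under a discrete HMM.

    initial[s]: probability of starting in state s
    transition[p][s]: probability of moving from state p to state s
    emission[s][o]: probability of emitting symbol o in state s
    """
    states = range(len(initial))
    alpha = [initial[s] * emission[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in states) * emission[s][obs]
            for s in states
        ]
    return sum(alpha)


def recognize_trait(observations, trait_models):
    """Return the name of the trait model with the highest likelihood."""
    return max(
        trait_models,
        key=lambda name: forward_likelihood(observations, *trait_models[name]),
    )


# Hypothetical models; observation 0 = "gave more than received",
# observation 1 = "received more than gave".
STICKY = [[0.9, 0.1], [0.1, 0.9]]
TRAIT_MODELS = {
    "generous": ([0.5, 0.5], STICKY, [[0.9, 0.1], [0.8, 0.2]]),
    "selfish": ([0.5, 0.5], STICKY, [[0.2, 0.8], [0.1, 0.9]]),
}

if __name__ == "__main__":
    print(recognize_trait([0, 0, 0, 1, 0], TRAIT_MODELS))
```

For long observation sequences the raw forward probabilities underflow, so a practical version would rescale `alpha` at each step or work in log space.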
|