321
Troca dinâmica de versões de componentes de programas no modelo de objetos / Dynamic replacement of program component versions in the object model
Haetinger, Werner (January 1998)
Software maintenance is a reality in every computing system, creating the need for new versions that alter existing functionality or add new features. Real-time systems, in particular, often cannot be taken offline to install a new version. Such systems call for the replacement of components, represented by functions, procedures, modules or objects, while the program or system is executing. Moreover, once a new version is in place, the component must not fail, or it may compromise the delivery of its services. Hence the importance of new software maintenance techniques that do not harm availability and reliability. The approach proposed here uses a reflective architecture, combined with techniques typical of the fault tolerance domain, to separate the replacement and validation activities from the functionality executed by the component itself. The work presents several system scenarios that can benefit from dynamic component replacement and addresses various facets of the replacement problem. The proposal is supported by a case study implemented in the Java programming language using its different computational reflection protocols.
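A minimal sketch of the mechanism this entry describes, assuming a Java dynamic proxy as the reflective layer (the thesis's own protocols and class names are not given here, so everything below is illustrative): a meta-level handler intercepts every call and forwards it to whichever version is currently installed, so clients never observe the swap.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical service interface; the real components could be any object.
interface Service {
    String respond(String request);
}

// Meta-level handler: intercepts every call and forwards it to the
// currently installed version, so versions can be swapped at runtime.
class SwappableComponent implements InvocationHandler {
    private final AtomicReference<Service> current = new AtomicReference<>();

    SwappableComponent(Service initial) { current.set(initial); }

    void replaceVersion(Service newVersion) {
        // Validation of the new version would run here before the swap.
        current.set(newVersion);
    }

    @Override
    public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
        return m.invoke(current.get(), args);
    }
}

public class DynamicReplacementDemo {
    public static void main(String[] args) {
        SwappableComponent handler = new SwappableComponent(req -> "v1:" + req);
        Service client = (Service) Proxy.newProxyInstance(
                Service.class.getClassLoader(),
                new Class<?>[] { Service.class }, handler);

        System.out.println(client.respond("ping"));   // served by v1
        handler.replaceVersion(req -> "v2:" + req);   // online replacement
        System.out.println(client.respond("ping"));   // served by v2
    }
}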
322
Adaptation en ligne de mécanismes de tolérance aux fautes par une approche à composants ouverts / On-line fault tolerance mechanisms adaptation based on open component models
Pareaud, Thomas (27 January 2009)
On-line adaptation of fault tolerance software strengthens system dependability by taking operational conditions and the environment into account. Such adaptation requires new design techniques. This work aims to understand and master the impact that modifying fault tolerance software in operation has on the system's functionality, in particular its side effects on services and dependability properties. The proposed approach introduces a reflective component-based architecture together with two models of the software: a structural (architectural) model, which reflects the content of the software in terms of state and algorithms and is used to compute and apply modifications at runtime, and a behavioural model, which describes the expected correct behaviour in operation and is used to determine the states in which modifications can be applied consistently, to drive the system into such a state, and to keep it there. The work shows that, thanks to on-line manipulation and execution-control capabilities, fault tolerance mechanisms can be modified at runtime in a controlled manner while preserving correctness properties.
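As a hedged illustration of the behavioural-model idea, the sketch below (all names invented, not taken from the thesis) uses a read-write lock to drive a component into a quiescent state before a fault tolerance mechanism is swapped: ordinary calls take the read lock, so the adaptation can only proceed once no call is in flight.

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy adaptive container: regular calls take the read lock, so an
// adaptation (write lock) can only proceed when no call is in flight,
// i.e. when the component is in a quiescent state.
class AdaptiveFaultToleranceMechanism {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile Runnable mechanism;

    AdaptiveFaultToleranceMechanism(Runnable initial) { mechanism = initial; }

    // Functional path: executes under the current mechanism.
    void invoke() {
        lock.readLock().lock();
        try {
            mechanism.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Adaptation path: blocks until quiescence, then swaps atomically.
    void adapt(Runnable newMechanism) {
        lock.writeLock().lock();   // waits for all in-flight calls to drain
        try {
            mechanism = newMechanism;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

public class AdaptationDemo {
    public static void main(String[] args) {
        AdaptiveFaultToleranceMechanism ft = new AdaptiveFaultToleranceMechanism(
                () -> System.out.println("duplex replication"));
        ft.invoke();
        ft.adapt(() -> System.out.println("checkpoint/rollback"));
        ft.invoke();
    }
}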
323
Uma arquitetura otimizada para a detecção de falhas em grades computacionais / A failure detection architecture optimized for grid computing platforms
Fernando Tarlá Cardoso Lemos (07 November 2012)
In distributed platforms, failure detection is an essential building block for a wide range of fault tolerance strategies, such as restoring the state of distributed applications through checkpointing and message logging. Detection, however, often depends on reliable communication between the processing nodes and the failure detection modules. In hierarchical computing grids with limited connectivity, direct communication between nodes and detection modules is often impossible. Another complicating factor is the geographic dispersion of the institutions and resources that make up the grid, and the resulting use of wide-area networks to connect them. This dissertation presents a failure detection architecture for distributed platforms, optimized for hierarchical computing grids and taking their restrictions and requirements into account. The architecture, named GFDA (Grid Fault Detection Architecture), is structured into modules that detect failures affecting the computing nodes available on the grid, modules that detect failures affecting the distributed applications, and modules that collect, process and forward the failure and recovery notifications issued by the detection modules. The dissertation presents implementation details, a verification of the architecture's correct operation, and results obtained by running components of the architecture on a computer cluster simulated with virtual machines. Techniques for optimizing the quality of the failure detection service are proposed, and their results are compared with those of traditional approaches. The techniques implemented in GFDA for processing failure and recovery notifications, together with the redundancy introduced in the messages exchanged between detection modules, yield positive results even under adverse connectivity conditions. It is concluded that the GFDA architecture contributes a viable solution for failure detection in hierarchical computing grids with connectivity restrictions between computing nodes.
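The abstract does not detail how GFDA's node-level detection works; as background, a minimal push-style heartbeat detector of the kind such architectures typically build on could look like the following (class names and the timeout value are illustrative assumptions):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal push-style heartbeat detector: nodes report heartbeats,
// and a node is suspected once its heartbeat is older than a timeout.
class HeartbeatFailureDetector {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatFailureDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    void onHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    boolean isSuspected(String nodeId) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }
}

public class DetectorDemo {
    public static void main(String[] args) throws InterruptedException {
        HeartbeatFailureDetector detector = new HeartbeatFailureDetector(200);
        detector.onHeartbeat("node-1");
        System.out.println("suspected? " + detector.isSuspected("node-1")); // false
        Thread.sleep(300);  // no heartbeat arrives within the timeout
        System.out.println("suspected? " + detector.isSuspected("node-1")); // true
    }
}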
324
Análise da associação dos protocolos de roteamento AODV e DSR com o algoritmo Gossip, sistema de Quorum e com um novo algoritmo de economia de energia, PWSave / Analysis of the association of the AODV and DSR routing protocols with the Gossip algorithm, a Quorum system and a new energy-saving algorithm, PWSave
Renata Lopes Rosa (15 July 2009)
This work studies the implementation of a Quorum system associated with the epidemic Gossip algorithm, the implementation of a new power saving algorithm, PWSave, and the AODV routing protocol, in scenarios of a mobile ad hoc network with and without failures. The work was carried out in a simulation environment, because a mathematical model of the association of Gossip, Quorum and PWSave across the 80 nodes chosen for the simulation would be far more complex and time-consuming, having to cover every environment variable of this set of solutions for each node in the network. The programming routines of the simulation environment, using loops for repetitive tasks, allow experiments to be performed faster and with a lower probability of error. The studies [1] and [2] showed, respectively, that solutions based on the epidemic Gossip algorithm and on the Quorum data-sharing system give favorable results for ad hoc networks with high mobility. [1] presents a scenario very close to the one implemented in this work, applying the Gossip algorithm to the Ad-Hoc On-Demand Distance Vector (AODV) routing protocol; the parameters analyzed were the same: route requests (RREQ), packet loss, throughput and latency. The simulated results show a decrease in the number of RREQs in the ad hoc network, while the other parameters measured in the simulation environment are little affected. According to [2], the Quorum system increases the resilience and throughput of the network and lowers the overhead caused by distributing information across the ad hoc network. Associating the Gossip algorithm with the Quorum system considerably reduced RREQs and packet loss, but energy consumption, an important factor in ad hoc and sensor networks, showed no improvement. An additional solution was therefore added to Gossip and Quorum: a new power saving algorithm, named PWSave, implemented in the Glomosim simulator with the AODV routing protocol. PWSave puts to sleep network nodes that are not processing information; while asleep, nodes cannot exchange data or help build network routes. PWSave combined with Gossip and the Quorum system reduces energy consumption by close to 10% compared with the Gossip-Quorum association without PWSave. The simulation results show that the association of Gossip, Quorum and PWSave reduces the number of RREQs and the packet loss rate without significantly degrading throughput or latency, while providing considerable energy savings.
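For context, the core of gossip routing as applied to AODV in [1] is a probabilistic rebroadcast rule: a node forwards each previously unseen RREQ with probability p instead of always flooding it. A generic sketch follows (the probability value and names are illustrative, not the thesis's):

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Generic gossip rebroadcast rule for route requests: forward each
// previously unseen RREQ with probability p instead of always flooding.
class GossipForwarder {
    private final double p;                  // gossip probability, e.g. 0.65
    private final Set<Long> seen = new HashSet<>();
    private final Random rng = new Random();

    GossipForwarder(double p) { this.p = p; }

    boolean shouldForward(long rreqId) {
        if (!seen.add(rreqId)) return false; // duplicate RREQ: drop
        return rng.nextDouble() < p;         // fresh RREQ: gossip coin flip
    }
}

public class GossipDemo {
    public static void main(String[] args) {
        GossipForwarder node = new GossipForwarder(0.65);
        int forwarded = 0;
        for (long id = 0; id < 1000; id++) {
            if (node.shouldForward(id)) forwarded++;
        }
        // Roughly 65% of distinct RREQs get rebroadcast, cutting overhead.
        System.out.println("forwarded " + forwarded + " of 1000 RREQs");
    }
}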
325
ADC: ambiente para experimentação e avaliação de protocolos de difusão confiável / Reliable broadcast protocols experimentation and evaluation environment (ADC)
Barcelos, Patricia Pitthan de Araujo (January 1996)
A recent trend in computing systems is to distribute computation among several physical processors, which leads to two kinds of systems: tightly coupled and loosely coupled. This work focuses on the latter, popularly known as distributed systems. According to [BAB 86], a distributed system can be defined as a set of autonomous processors that share no memory, have no access to global clocks, and communicate only by message exchange. The intrinsic requirements of distributed systems include reliability and availability, which has led to growing interest in fault tolerance techniques whose goal is to keep the distributed system consistent even when failures occur. One fault tolerance technique widely used in distributed systems is reliable broadcast, a software redundancy technique in which a processor disseminates a value to the other processors of a distributed system that is subject to failures [BAB 85]. Because it is a basic communication technique, many fault tolerance procedures build on reliable broadcast. This work describes the implementation of a support environment for distributed systems named ADC (Ambiente para Experimentação e Avaliação de Protocolos de Difusão Confiável), an environment for experimenting with and evaluating reliable broadcast protocols. The environment uses reliable broadcast to obtain agreement among all failure-free members of the system. This agreement, known as consensus, is reached through consensus algorithms, which aim to provide the degree of reliability that distributed systems demand. ADC was developed on SUN workstations (SunOS) using HetNOS [BAA 93], a heterogeneous network operating system developed at UFRGS, and its implementation builds on a study of reliable broadcast protocols [BAR 94]. With ADC it is possible to simulate the execution of reliable broadcast protocols by applying models proposed for them, extract results from the execution, and analyze them, the analysis resting mainly on performance, reliability and complexity parameters. Both the implementation of ADC and the analysis of the proposed model were carried out using some of the reliable broadcast protocols available in the literature. The main goal of the environment is experimentation: verifying how the theory of distributed systems relates to practice when a software redundancy technique, reliable broadcast, is used. The environment makes it possible to determine parameters such as the number of broadcast messages exchanged between processes, the number of retransmission messages sent, and the total number of messages emitted during the processing of the model, which together support a consistent analysis of reliable broadcast protocols.
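As an illustration of the kind of protocol ADC experiments with (this is a textbook-style eager reliable broadcast, not a protocol from the dissertation): every process relays a message the first time it delivers it, so even if the sender crashes mid-broadcast, any message that reaches one correct process reaches all of them.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory sketch of eager reliable broadcast: every process relays a
// message the first time it delivers it, so if the sender crashes after
// reaching at least one correct process, all correct processes deliver.
class Process {
    final int id;
    final boolean crashed;
    final Set<String> delivered = new HashSet<>();

    Process(int id, boolean crashed) { this.id = id; this.crashed = crashed; }

    void receive(String msg, List<Process> all) {
        if (crashed || !delivered.add(msg)) return;  // drop duplicates
        for (Process p : all) {
            if (p != this) p.receive(msg, all);      // relay on first delivery
        }
    }
}

public class ReliableBroadcastDemo {
    public static void main(String[] args) {
        List<Process> group = new ArrayList<>();
        for (int i = 0; i < 4; i++) group.add(new Process(i, i == 2)); // p2 faulty

        group.get(0).receive("m1", group);  // p0 broadcasts m1

        for (Process p : group) {
            System.out.println("p" + p.id + " delivered: " + p.delivered);
        }
        // All failure-free processes deliver m1; the crashed one delivers nothing.
    }
}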
326
Compiler-Assisted Software Fault Tolerance for Microcontrollers
Bohman, Matthew Kendall (01 March 2018)
Commercial off-the-shelf (COTS) microcontrollers can be useful for non-critical processing on spaceborne platforms. Many of these microprocessors are inexpensive and consume little power. However, the software running on these processors is vulnerable to radiation upsets, which can cause unpredictable program execution or corrupt data. Space missions cannot allow these errors to interrupt functionality or destroy gathered data, so several techniques have been developed to reduce the effect of such upsets. Some proposed techniques involve altering the processor hardware, which is impossible for a COTS device. Alternatively, the software running on the microcontroller can be modified to detect or correct data corruption. Several approaches to software mitigation have been proposed: some take advantage of advanced architectural features, others modify software by hand, and still others focus on specific microarchitectures. However, these approaches do not consider the limited resources of microcontrollers and are difficult to use across multiple platforms. This thesis explores fully automated software-based mitigation to improve the reliability of microcontrollers and microcontroller software in a high-radiation environment, and discusses several difficulties associated with automating software protection in the compilation step. Previous mitigation techniques are examined, resulting in the creation of COAST (COmpiler-Assisted Software fault Tolerance), a tool that automatically applies software protection techniques to user code. Hardened code has been verified by a fault injection campaign; the mean work to failure increased, on average, by 21.6x. When tested in a neutron beam, the neutron cross sections of programs decreased by an average of 23x, and the average mean work to failure increased by 5.7x.
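As a hedged illustration of the kind of transformation such a tool automates (the abstract does not list COAST's exact rules, and the names below are invented): duplication with comparison runs each computation twice on duplicated data and compares the copies before the result is used, turning silent data corruption into a detected error.

// Hand-written sketch of duplication-with-comparison, the style of
// transformation a compiler-assisted tool can apply automatically: every
// computation runs twice on duplicated data, and the copies are compared
// before the result is used. Names and structure here are illustrative.
public class DwcSketch {
    static class SilentDataCorruption extends RuntimeException {
        SilentDataCorruption(String m) { super(m); }
    }

    static int protectedAdd(int a, int b) {
        int a2 = a, b2 = b;          // duplicated operands
        int r1 = a + b;              // original computation
        int r2 = a2 + b2;            // shadow computation
        if (r1 != r2) {              // compare before use
            throw new SilentDataCorruption("mismatch: " + r1 + " vs " + r2);
        }
        return r1;
    }

    public static void main(String[] args) {
        System.out.println(protectedAdd(2, 3));  // 5, computed and checked twice
    }
}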
327
A Fault-Tolerant Alternative to Lockstep Triple Modular Redundancy
Baldwin, Andrew Lockett (01 January 2012)
Semiconductor manufacturing defects adversely affect yield and reliability, and manufacturers expend vast resources to reduce defects within their processes. As the minimum feature size gets smaller, defects become increasingly difficult to prevent. A defect can change the behavior of a logic circuit, resulting in a fault. Manufacturers and designers may improve yield, reliability, and profitability by using design techniques that make products robust even in the presence of faults. Triple modular redundancy (TMR) is a fault-tolerant technique commonly used to mask faults by voting on the outcomes of three processing elements (PEs); it is effective at masking errors as long as no more than a single processing element is faulty. Time distributed voting (TDV) is proposed as an active fault-tolerant technique that addresses the shortcomings of TMR in the presence of multiple faulty processing elements. A faulty PE may not be incorrect 100% of the time, and when a faulty element generates correct results, a majority is formed with the healthy PE. TDV observes voting outcomes over time to make a statistical decision about whether a PE is healthy or faulty. In simulation, fault coverage is extended to 98.6% of multiple-faulty-PE cases. As an active fault-tolerant technique, TDV identifies faulty PEs so that action may be taken to replace or disable them in the system. TDV may benefit semiconductor manufacturers by improving yield and reliability even as fault frequency increases.
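A toy rendering of the TDV idea as the abstract describes it, with an invented disagreement threshold and window (the thesis's actual statistical decision rule may differ): keep majority voting per round, count how often each PE disagrees with the majority, and flag a PE as faulty once its disagreement rate crosses the threshold.

// Illustrative sketch of time distributed voting (TDV): per-round TMR
// majority voting plus per-PE disagreement counters. The threshold is an
// invented parameter; the sketch assumes at most one PE disagrees per round.
public class TdvSketch {
    static final double FAULT_THRESHOLD = 0.2;  // assumed disagreement rate
    static final int[] disagreements = new int[3];
    static int rounds = 0;

    // Classic TMR majority vote over three PE outputs.
    static int vote(int[] outputs) {
        rounds++;
        int majority = (outputs[0] == outputs[1] || outputs[0] == outputs[2])
                ? outputs[0] : outputs[1];
        for (int pe = 0; pe < 3; pe++) {
            if (outputs[pe] != majority) disagreements[pe]++;
        }
        return majority;
    }

    static boolean isFaulty(int pe) {
        return rounds > 0 && (double) disagreements[pe] / rounds > FAULT_THRESHOLD;
    }

    public static void main(String[] args) {
        // PE 2 is intermittently faulty: wrong in 3 of 10 rounds.
        int[][] history = {
            {7,7,7},{7,7,0},{7,7,7},{7,7,9},{7,7,7},
            {7,7,7},{7,7,1},{7,7,7},{7,7,7},{7,7,7}
        };
        for (int[] round : history) vote(round);
        for (int pe = 0; pe < 3; pe++) {
            System.out.println("PE" + pe + " faulty? " + isFaulty(pe));
        }
        // Only PE2 exceeds the threshold and is flagged for replacement.
    }
}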
328
Fault tree analysis for automotive pressure sensor assembly lines
Antony, Albin (January 2006)
Thesis (M.S.), State University of New York at Binghamton, Thomas J. Watson School of Engineering and Applied Science, Systems Science and Industrial Engineering Department, 2006. Includes bibliographical references.
329
Analysis and Development of Error-Job Mapping and Scheduling for Network-on-Chips with Homogeneous Processors
Karlsson, Erik (January 2010)
Due to the increased complexity of today's computer systems, manufactured in recent semiconductor technologies, and the fact that these technologies are more liable to soft errors (non-permanent errors), it is inherently difficult to ensure that such systems are and will remain error-free. Depending on the application, a soft error can have serious consequences for the system. It is therefore important to detect soft errors as early as possible, recover from the erroneous state, and maintain correct operation. An entire research area, known as fault tolerance, is devoted to proposing, implementing and analyzing techniques that can detect and recover from these errors. The drawback of fault tolerance is that it usually introduces some overhead: redundant hardware, which increases the cost of the system, or a time overhead that degrades system performance. A main concern when applying fault tolerance is therefore to minimize the imposed overhead while the system still delivers correct, error-free operation. This thesis analyzes one well-known fault tolerance technique, Rollback-Recovery with Checkpointing (RRC), which detects and recovers from errors by taking and storing checkpoints during the execution of a job. A job can thus be thought of as divided into a number of execution segments, with a checkpoint taken after each segment. The technique requires the job to be executed concurrently on two processors. At each checkpoint, both processors exchange data containing enough information about the job's state, and the exchanged data are compared. If the data differ, an error has been detected in the previous execution segment, which is therefore re-executed. If the data are the same, no error has been detected and the data are stored as a safe point from which the job can later be restarted. Exchanging data between processors thus introduces a time overhead that increases the average execution time of a job, i.e. the average time required for a given job to complete. The overhead depends on the number of links the data exchange has to traverse after each execution segment and on the number of execution segments the job needs. The number of links traversed after each execution segment is twice the distance between the two processors executing the job concurrently; however, this holds only if all links are fully functional, since a link failure can force a longer communication route. Even when all links are fully functional, the number of execution segments depends on the error-free probabilities of the processors, which can vary between processors. The choice of processors therefore affects the total number of links the communication has to traverse: choosing two processors with higher error-free probabilities that are further apart increases the distance but decreases the number of execution segments, which can result in a lower overhead. By carefully determining the mapping for a given job, one can decrease the overhead and hence the average execution time.

Since it is very common to have more jobs than available resources, it is important not only to find a good mapping that decreases the average execution time of the whole system, but also a good order of execution for a given set of jobs (scheduling). This thesis proposes several mapping and scheduling algorithms that aim to reduce the average execution time in a fault-tolerant multiprocessor System-on-Chip that uses a Network-on-Chip as its underlying interconnect architecture, so that the fault tolerance technique (RRC) can perform efficiently.
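A back-of-the-envelope model of the trade-off described above (my own simplification, not the thesis's cost model): with n checkpoints, each segment costs T/n plus a data exchange across 2*d links, and a segment is retried until both processors agree.

// Toy model of RRC average execution time: a job of length T is split into
// n segments; after each segment the two processors exchange and compare
// state across 2*distance links, costing costPerLink per link; a segment
// that hits an error on either processor is re-executed. All parameter
// values below are invented for illustration.
public class RrcModel {
    static double avgExecutionTime(double T, int n, int distance,
                                   double costPerLink, double errorFreeProb) {
        double segment = T / n + 2 * distance * costPerLink; // compute + compare
        // Probability a segment passes on both processors in one attempt:
        double pass = errorFreeProb * errorFreeProb;
        double expectedAttempts = 1.0 / pass;  // mean of a geometric distribution
        return n * segment * expectedAttempts;
    }

    public static void main(String[] args) {
        // Nearby but less reliable pair vs. distant but more reliable pair.
        System.out.printf("near, flaky : %.1f%n",
                avgExecutionTime(1000, 10, 1, 2.0, 0.90));  // ~1284.0
        System.out.printf("far, stable : %.1f%n",
                avgExecutionTime(1000, 10, 4, 2.0, 0.99));  // ~1183.6
        // Here the distant mapping wins: the extra link cost is offset by
        // fewer re-executed segments, as the thesis's mapping argument notes.
    }
}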
330
Scheduling and Optimization of Fault-Tolerant Embedded Systems
Izosimov, Viacheslav (January 2006)
Safety-critical applications have to function correctly even in the presence of faults. This thesis deals with techniques for tolerating the effects of transient and intermittent faults. Re-execution, software replication, and rollback recovery with checkpointing are used to provide the required level of fault tolerance. These techniques are considered in the context of distributed real-time systems with non-preemptive static cyclic scheduling.

Safety-critical applications have strict time and cost constraints, which means that not only must faults be tolerated but the constraints must also be satisfied. Hence, efficient system design approaches that take fault tolerance into consideration are required.

The thesis proposes several design optimization strategies and scheduling techniques that take fault tolerance into account. The design optimization tasks addressed include, among others, process mapping, fault tolerance policy assignment, and checkpoint distribution.

Dedicated scheduling techniques and mapping optimization strategies are also proposed to handle customized transparency requirements associated with processes and messages. By providing fault containment, transparency can potentially improve the testability and debuggability of fault-tolerant applications.

The efficiency of the proposed scheduling techniques and design optimization strategies is evaluated in extensive experiments conducted on a number of synthetic applications and a real-life example. The experimental results show that considering fault tolerance during system-level design optimization is essential when designing cost-effective fault-tolerant embedded systems.
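As a hedged aside on checkpoint distribution: a common first-order estimate balances checkpoint overhead n*O against re-execution cost k*E/n, giving roughly sqrt(k*E/O) checkpoints. The sketch below uses invented parameters and is not necessarily the cost model used in the thesis.

// First-order estimate of a good checkpoint count for rollback recovery
// tolerating k faults: splitting a task of length E into n segments costs
// n*O in checkpoint overhead, while each of up to k recoveries re-executes
// one segment of length E/n. Minimizing n*O + k*E/n over n gives
// n ~ sqrt(k*E/O). Textbook-style estimate, invented parameter values.
public class CheckpointCount {
    static int optimalCheckpoints(double execTime, double overhead, int k) {
        int n = (int) Math.round(Math.sqrt(k * execTime / overhead));
        return Math.max(n, 1);
    }

    public static void main(String[] args) {
        // Task of 500 time units, checkpoint overhead 5, tolerating 2 faults.
        int n = optimalCheckpoints(500, 5, 2);
        System.out.println("checkpoints: " + n);  // ~14
        double worstCase = 500 + n * 5 + 2 * (500.0 / n);
        System.out.println("worst-case length: " + worstCase);  // ~641.4
    }
}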