Global ETD Search

311	Análise da associação dos protocolos de roteamento AODV e DSR com o algoritmo Gossip, sistema de Quorum e com um novo algoritmo de economia de energia, PWSave. / Association analisys of the routing protocols AODV and DSR with Gossip, Quorum system and a new algorithm, PWSave. Rosa, Renata Lopes 15 July 2009 (has links) Este trabalho estuda a implementação do sistema de Quorum associado ao algoritmo epidêmico Gossip, a implementação de um novo algoritmo de economia de energia - o PWSave - e o protocolo de roteamento AODV em um cenário com e sem falhas de uma rede ad hoc com mobilidade. Optou-se por implementar este trabalho em um ambiente de simulação, dado que a modelagem matemática da associação do Gossip, Quorum e PWSave com os 80 nós - quantidade de nós escolhida para o ambiente de simulação - apresentaria maior complexidade e demora ao abranger todas as variáveis de ambiente desse conjunto de soluções para cada nó presente na rede. A rotina de programação - com o uso de loops para os trabalhos repetitivos - presente no ambiente de simulação permite que os experimentos sejam efetuados mais rapidamente e com menor probabilidade de erros. Os estudos [1], [2] demonstraram, respectivamente, que soluções abrangendo o algoritmo epidêmico Gossip e o sistema de compartilhamento de dados Quorum apresentam resultados favoráveis para uma rede ad hoc com alta mobilidade. Em [1] é apresentado um cenário muito próximo ao implementado neste trabalho, com a utilização do algoritmo Gossip ao protocolo de roteamento Ad-Hoc On-Demand Distance Vector (AODV). Os parâmetros analisados foram os mesmos, a saber: routes requests (RREQ), perda de pacote, vazão e latência. Os resultados do cenário simulado mostram uma diminuição no número de RREQs em uma rede ad hoc, e os demais parâmetros, medidos no ambiente de simulação, são pouco afetados. De acordo com [2] constata-se que há um aumento da resiliência e da vazão da rede e uma menor sobrecarga causada pela distribuição da informação na rede ad hoc pelo sistema de Quorum. A associação do algoritmo Gossip com o sistema de Quorum resultou em uma diminuição considerável de RREQs e perda de pacotes, mas o parâmetro de consumo de energia, que deve ser um fator importante em uma rede ad hoc e/ou uma rede sensor, não apresentou nenhuma melhora. Portanto, foi implementada uma solução adicional ao Gossip e ao Quorum, com o desenvolvimento de um novo algoritmo de economia de energia denominado de PWSave, no simulador Glomosim com o protocolo de roteamento AODV. O PWSave é responsável pelo adormecimento dos nós da rede que não estejam processando informações, ou seja, os nós, no momento do adormecimento, não poderão trocar dados ou auxiliar na formação de rotas da rede. O PWSave associado ao Gossip e ao sistema de Quorum apresenta resultados que refletem ma diminuição no consumo de energia próxima a 10% em comparação com a solução da associação do Gossip com o sistema de Quorum sem a implementação de PWSave. Os resultados da simulação mostram que a associação de Gossip, Quorum e PWSave acarreta uma redução no número de RREQs e na taxa de perda de pacotes sem degradar muito características de fluxo e latência, além de propiciar uma considerável economia no consumo de energia. / This work studies the implementation of the Quorum system associated with the Gossip algorithm, the implementation of a new power saving algorithm - the PWSave - and the routing protocol AODV in a scenario with and without failures of an ad hoc network with mobility. It has been chosen to implement this work in an environment of simulation, because the mathematical modeling of the association of Gossip, Quorum and PWSave with 80 nodes - number of nodes that has been chosen for the simulation environment - would present a higher complexity and delay to address all environment variables of the solutions set for each node present in the network. The programming routine - with the use of loops for the repetitive works - present in the simulation environment allows the experiments to be performed faster and with less probability of errors. The studies [1], [2] have shown, respectively, that solutions covering the Gossip epidemic algorithm and the system for sharing data Quorum show favorable results for an ad hoc network with high mobility. In [1] is presented a scenario very close to that implemented in this work, using the Gossip algorithm associated to the routing protocol Ad-Hoc On-Demand Distance Vector (AODV). The parameters analyzed were the same: routes requests (RREQ), packet loss, throughput and latency. The simulated scenario results show a decrease in the number of RREQs in an ad hoc network, and the other parameters, measured in the simulation environment, are little afected. According to [2] it is noted that there is an increase in the resilience and throughput of the network and a lower overload caused by the distribution of the information in the ad hoc network by the Quorum system. The association of the Gossip algorithm with the Quorum system resulted in a considerable decrease of RREQs and packet loss, but the parameter of energy consumption, which is an important factor in an ad hoc network and/or a sensor network, shows no improvement. Therefore, an additional solution was associated to the Gossip and to the Quorum, with the development of a new power saving algorithm named PWSave, in the simulator Glomosim with the routing protocol AODV. The PWSave is responsable for the sleeping state of the network nodes when they are not processing information: the nodes at the time of sleep cannot exchange data or assist in the building of network routes. The PWSave associated with the Gossip and Quorum system provides a decrease of the energy consumption close to 10% compared to the association solution of the Gossip with the Quorum system without the PWSave implementation. The results of simulation show that the association of the Gossip, Quorum and PWSave produces a reduction in the number of RREQs and in the rate of packets loss without degrading much the throughput and latency characteristics, providing a considerable energy consumption economy. Ad hoc networks Fault tolerance Gossip Gossip Protocolo de roteamento Quorum system Rede ad hoc Routing protocol Sistema de Quorum Tolerância a falha
312	Um framework para coordenação do tratamento de exceções em sistemas tolerantes a falhas / A framework for exception handling coordination in fault-tolerant systems Pereira, David Paulo 09 March 2007 (has links) A adoção em larga escala de redes de computadores e gerenciadores de banco de dados contribuiu para o surgimento de sistemas de informação complexos. Atualmente, estes sistemas tornaram-se elementos essenciais na vida das pessoas, dando suporte a processos de negócio e serviços corporativos indispensáveis à sociedade, como automação bancária e telefonia. A utilização de componentes na estruturação destes sistemas promove maior qualidade e flexibilidade ao produto e agiliza o processo de desenvolvimento. Entretanto, para que estes benefícios sejam totalmente observados, é fundamental que os provedores de componentes de prateleira projetem especificações precisas, completas e consistentes. Geralmente, as especificações omitem ou negligenciam o comportamento dos componentes nas situações de falha. Desta forma, a utilização de componentes não confiáveis, cujos comportamentos não podem ser inteiramente previstos, compromete seriamente o projeto de sistemas tolerantes a falhas. Uma estratégia para a especificação de componentes tolerantes a falhas é informar a ocorrência de erros através de exceções e realizar a recuperação dos mesmos por rotinas de tratamento correspondentes. A especificação deve separar claramente o comportamento normal do excepcional, destinado à recuperação do erro. Entretanto, em sistemas concorrentes e distribuídos, a especificação apenas deste tratamento local não é suficiente. Uma exceção pode ser lançada em decorrência de erros sistêmicos (i.e. problemas de rede) que afetam todo o sistema. Assim, determinadas exceções devem ser tratadas em nível arquitetural, envolvendo os demais componentes no tratamento. O modelo conceitual de ações Atômicas Coordenadas (ações CA - Coordinated Atomic actions), bastante aplicado na estruturação de sistemas tolerantes a falhas, define um mecanismo geral para a coordenação do tratamento excepcional dos componentes, que cooperam na execução das atividades e competem por recursos compartilhados. Portanto, o modelo de ações CA oferece uma solução potencialmente viável para a especificação do tratamento de exceções em nível arquitetural. Este trabalho propõe um framework para a especificação do tratamento de exceções em nível arquitetural, baseando-se no modelo de aninhamento de ações CA e utilizando a linguagem orientada a eventos CSP (Communicating Sequential Processes). Sua principal característica é prover um protocolo padronizado para a coordenação do tratamento de exceções, que envolve a cooperação dos componentes do sistema. Além disso, é apresentada uma estratégia para a verificação formal dos sistemas na ferramenta FDR (Failure Divergence Refinement), com base no modelo de refinamento por rastros. / The widespread scale adoption of computer networks and database management systems has contributed to the arising of complex information systems. Nowadays, these systems have become essential aspects in the everyday life, supporting business processes and indispensable enterprise services to society such as banking automation and telephony. The usage of components in structuring of these systems promotes higher quality and flexibility to the product and accelerates the software development process. However, in order to fully observe the benefits it is essential that the suppliers of these COTS (commercial off-the-shelf) design precise, complete and consistent specifications. Generally, the specifications omit or neglect the behavior of these components in exceptional situations. Therefore, the usage of untrustworthy components whose behavior cannot be entirely foreseen seriously compromise the design of fault-tolerant systems. One of the strategies used for the specification of fault-tolerant components is to inform the occurrence of errors through exceptions and make its recovering by the correspondent exception handling routines. The specification should separate clearly the normal behavior from the exceptional one, specially designed for error recovery. However, in concurrent and distributed systems, specification of local exception handling is not enough. An exception could be raised as a result of systemic errors (i.e. network errors) which affect the entire system, thus specific types of exceptions should be treated at an architectural level involving all the other components in this handling activity. The conceptual model of Coordinated Atomic (CA) actions, often applied in the structuring of fault-tolerant systems, defines a general mechanism for coordination of exception handling with components that cooperate while executing activities and compete for shared resources. Therefore, the model of CA actions offers a perfectly viable solution for the specification of exception handling at an architectural level. This work proposes a framework for the specification of exception handling at an architectural level, based on the nesting model of CA actions and using the event-oriented language CSP (Communicating Sequential Processes). Its main characteristic is to provide a standardized protocol for coordination of exception handling that involves the cooperation of system components. Moreover, it is presented a formal strategy for system verification using the FDR (Failure Divergence Refinement) tool, based on the traces refinement model. ações atômicas coordenadas coordinated atomic actions csp csp exception handling fault-tolerance formal methods métodos formais tolerância a falhas tratamento de exceções
313	Detecção de faltas: uma abordagem baseada no comportamento de processos / Fault detection an approach based on process behavior Pereira, Cássio Martini Martins 25 March 2011 (has links) A diminuição no custo de computadores pessoais tem favorecido a construção de sistemas computacionais complexos, tais como aglomerados e grades. Devido ao grande número de recursos existentes nesses sistemas, a probabilidade de que faltas ocorram é alta. Uma abordagem que auxilia a tornar sistemas mais robustos na presença de faltas é a detecção de sua ocorrência, a fim de que processos possam ser reiniciados em estados seguros, ou paralisados em estados que não ofereçam riscos. Abordagens comumente adotadas para detecção seguem, basicamente, três tipos de estratégias: as baseadas em mensagens de controle, em estatística e em aprendizado de máquina. No entanto, elas tipicamente não consideram o comportamento de processos ao longo do tempo. Observando essa limitação nas pesquisas relacionadas, este trabalho apresenta uma abordagem para medir a variação no comportamento de processos ao longo do tempo, a fim de que mudanças inesperadas sejam detectadas. Essas mudanças são consideradas, no contexto deste trabalho, como faltas, as quais representam transições indesejadas entre estados de um processo e podem levá-lo a processamento incorreto, fora de sua especificação. A proposta baseia-se na estimação de cadeias de Markov que representam estados visitados por um processo durante sua execução. Variações nessas cadeias são utilizadas para identificar faltas. A abordagem proposta é comparada à técnica de aprendizado de máquina Support Vector Machines, bem como à técnica estatística Auto-Regressive Integrated Moving Average. Essas técnicas foram escolhidas para comparação por estarem entre as mais empregadas na literatura. Experimentos realizados mostraram que a abordagem proposta possui, com erro \'alfa\' = 1%, um F-Measure maior do que duas vezes o alcançado pelas outras técnicas. Realizou-se também um estudo adicional de predição de faltas. Nesse sentido, foi proposta uma técnica preditiva baseada na reconstrução do comportamento observado do sistema. A avaliação da técnica mostrou que ela pode aumentar em até uma ordem de magnitude a disponibilidade (em horas) de um sistema / The cost reduction for personal computers has enabled the construction of complex computational systems, such as clusters and grids. Because of the large number of resources available on those systems, the probability that faults may occur is high. An approach that helps to make systems more robust in the presence of faults is their detection, in order to restart or stop processes in safe states. Commonly adopted approaches for detection basically follow one of three strategies: the one based on control messages, on statistics or on machine learning. However, they typically do not consider the behavior of processes over time. Observing this limitation in related researches, this work presents an approach to measure the level of variation in the behavior of processes over time, so that unexpected changes are detected. These changes are considered, in the context of this work, as faults, which represent undesired transitions between process states and may cause incorrect processing, outside the specification. The approach is based on the estimation of Markov Chains that represent states visited by a process during its execution. Variations in these chains are used to identify faults. The approach is compared to the machine learning technique Support Vector Machines, as well as to the statistical technique Auto-Regressive Integrated Moving Average. These techniques have been selected for comparison because they are among the ones most employed in the literature. Experiments conducted have shown that the proposed approach has, with error \'alpha\'= 1%, an F-Measure higher than twice the one achieved by the other techniques. A complementary study has also been conducted about fault prediction. In this sense, a predictive approach based on the reconstruction of system behavior was proposed. The evaluation of the technique showed that it can provide up to an order of magnitude greater availability of a system in terms of uptime hours Agrupamento Aprendizado de máquina Clustering Detecção de faltas Fault detection Fault prediction Fault tolerance Machine learning Predição de faltas Tolerância a faltas
314	Frame-level redundancy scrubbing technique for SRAM-based FPGAs / Técnica de correção usando a redudância a nível de quadro para FPGAs baseados em SRAM Seclen, Jorge Lucio Tonfat January 2015 (has links) Confiabilidade é um parâmetro de projeto importante para aplicações criticas tanto na Terra como também no espaço. Os FPGAs baseados em memoria SRAM são atrativos para implementar aplicações criticas devido a seu alto desempenho e flexibilidade. No entanto, estes FPGAs são susceptíveis aos efeitos da radiação tais como os erros transientes na memoria de configuração. Além disso, outros efeitos como o envelhecimento (aging) ou escalonamento da tensão de alimentação (voltage scaling) incrementam a sensibilidade à radiação dos FPGAs. Nossos resultados experimentais mostram que o envelhecimento e o escalonamento da tensão de alimentação podem aumentar ao menos duas vezes a susceptibilidade de FPGAs baseados em SRAM a erros transientes. Estes resultados são inovadores porque estes combinam três efeitos reais que acontecem em FPGAs baseados em SRAM. Os resultados podem guiar aos projetistas a prever os efeitos dos erros transientes durante o tempo de operação do dispositivo em diferentes níveis de tensão. A correção da memoria usando a técnica de scrubbing é um método efetivo para corrigir erros transientes em memorias SRAM, mas este método impõe custos adicionais em termos de área e consumo de energia. Neste trabalho, nos propomos uma nova técnica de scrubbing usando a redundância interna a nível de quadros chamada FLR- scrubbing. Esta técnica possui mínimo consumo de energia sem comprometer a capacidade de correção. Como estudo de caso, a técnica foi implementada em um FPGA de tamanho médio Xilinx Virtex-5, ocupando 8% dos recursos disponíveis e consumindo seis vezes menos energia que um circuito corretor tradicional chamado blind scrubber. Além, a técnica proposta reduz o tempo de reparação porque evita o uso de uma memoria externa como referencia. E como outra contribuição deste trabalho, nos apresentamos os detalhes de uma plataforma de injeção de falhas múltiplas que permite emular os erros transientes na memoria de configuração do FPGA usando reconfiguração parcial dinâmica. Resultados de campanhas de injeção são apresentados e comparados com experimentos de radiação acelerada. Finalmente, usando a plataforma de injeção de falhas proposta, nos conseguimos analisar a efetividade da técnica FLR-scrubbing. Nos também confirmamos estes resultados com experimentos de radiação acelerada. / Reliability is an important design constraint for critical applications at ground-level and aerospace. SRAM-based FPGAs are attractive for critical applications due to their high performance and flexibility. However, they are susceptible to radiation effects such as soft errors in the configuration memory. Furthermore, the effects of aging and voltage scaling increment the sensitivity of SRAM-based FPGAs to soft errors. Experimental results show that aging and voltage scaling can increase at least two times the susceptibility of SRAM-based FPGAs to Soft Error Rate (SER). These findings are innovative because they combine three real effects that occur in SRAM-based FPGAs. Results can guide designers to predict soft error effects during the lifetime of devices operating at different power supply voltages. Memory scrubbing is an effective method to correct soft errors in SRAM memories, but it imposes an overhead in terms of silicon area and energy consumption. In this work, it is proposed a novel scrubbing technique using internal frame redundancy called Frame-level Redundancy Scrubbing (FLRscrubbing) with minimum energy consumption overhead without compromising the correction capabilities. As a case study, the FLR-scrubbing controller was implemented on a mid-size Xilinx Virtex-5 FPGA device, occupying 8% of available slices and consumes six times less energy per scrubbed frame than a classic blind scrubber. Also, the technique reduces the repair time by avoiding the use of an external golden memory for reference. As another contribution, this work presents the details of a Multiple Fault Injection Platform that emulates the configuration memory upsets of an FPGA using dynamic partial reconfiguration. Results of fault injection campaigns are presented and compared with accelerated ground-level radiation experiments. Finally, using our proposed fault injection platform it was possible to analyze the effectiveness of the FLR-scrubbing technique. Accelerated radiation tests confirmed these results. Microeletrônica Fpga Circuitos digitais SRAM-based FPGA Soft error Memory scrubbing Reliability Single event upsets Fault tolerance Microelectronics
315	FITT : fault injection test tool to validate safety communication protocols / FITT : a fault injection tool to validate safety communication protocols / Uma ferramenta de injeção de falhas para validar protocolos de comunicação seguros Dobler, Rodrigo Jaureguy January 2016 (has links) Protocolos de comunicação seguros são essenciais em ambientes de automação industrial, onde falhas não detectadas na comunicação de dispositivos podem provocar danos irreparáveis à vida ou ao meio-ambiente. Esses protocolos seguros devem ser desenvolvidos de acordo com alguma norma de segurança, como a IEC 61508. Segundo ela, faz parte do processo de implementação destes protocolos, a escolha de técnicas adequadas de validação, entre elas a injeção de falhas, a qual deve considerar um modelo de falhas apropriado ao ambiente de operação do protocolo. Geralmente, esses ambientes são caracterizados pela existência de diversas formas de interferência elétrica e eletromagnética, as quais podem causar falhas nos sistemas eletrônicos existentes. Nos sistemas de comunicação de dados, isto pode levar a destruição do sinal de dados e causar estados de operação equivocados nos dispositivos. Assim, é preciso utilizar uma técnica de injeção de falhas que permita simular os tipos de erros de comunicação que podem ocorrer nos ambientes industriais. Dessa forma, será possível verificar o comportamento dos mecanismos de tolerância falhas na presença de falhas e assegurar o seu correto funcionamento. Para esta finalidade, este trabalho apresenta o desenvolvimento do injetor de falhas FITT para validação de protocolos de comunicação seguros. Esta ferramenta foi desenvolvida para ser utilizada com o sistema operacional Linux. O injetor faz uso do PF_RING, um módulo para o Kernel do Linux, que é responsável por realizar a comunicação direta entre as interfaces de rede e o injetor de falhas. Assim os pacotes não precisam passar pelas estruturas do Kernel do Linux, evitando que atrasos adicionais sejam inseridos no processo de recebimento e envio de mensagens. As funções de falhas desenvolvidas seguem o modelo de falhas de comunicação descrito na norma IEC 61508. Esse modelo é composto pelos erros de repetição, perda, inserção, sequência incorreta, endereçamento, corrupção de dados, atraso, mascaramento e falhas de memória em switches. / Safe communication protocols are essential in industrial automation environments, where undetected failures in the communication of devices can cause irreparable damage to life or to the environment. These safe protocols must be developed according to some safety standard, like IEC 61508. According to it, part of the process of implementing these protocols is to select appropriate techniques for validation, including the fault injection, which should consider an appropriate fault model for the operating environment of the protocol. Generally, these environments are characterized by the existence of various forms of electric and electromagnetic interference, which can cause failures in existing electronic systems. In data communication systems, this can lead to the destruction of the data signal and cause erroneous operation states in the devices. Thus, it is necessary to use a fault injection technique that allows simulating the types of communication errors that may occur in industrial environments. So, it will be possible to verify the behavior of the fault tolerance mechanisms in the presence of failures and ensure its correct functioning. For this purpose, this work presents the development of FITT fault injector for validation of safety communication protocols. This tool was developed to be used with Linux operating system. The fault injector makes use of PF_RING, a module for the Linux Kernel and that is responsible to perform the direct communication between the network interfaces and the fault injector. Thus the packages do not need to go through the Linux Kernel structures, avoiding additional delays to be inserted into the process of receiving and sending messages. The developed fault injection functions follow the communication fault model described in the IEC61508 standard, composed by the errors of repetition, loss, insertion, incorrect sequence, addressing, data corruption, delay, masking and memory failures within switches. The fault injection tests applied with this model allow to properly validate the fault tolerance mechanisms of safety protocols. Tolerancia : Falhas Protocolos : Comunicacao ABNT Fault injection Safety communication protocols Validation Errors detection Fault tolerance mechanisms IEC61508 Tests
316	Resilient regular expression matching on FPGAs with fast error repair / Avaliação resiliente de expressões regulares em FPGAs com rápida correção de erros Leipnitz, Marcos Tomazzoli January 2017 (has links) O paradigma Network Function Virtualization (NFV) promete tornar as redes de computadores mais escaláveis e flexíveis, através do desacoplamento das funções de rede de hardware dedicado e fornecedor específico. No entanto, funções de rede computacionalmente intensivas podem ser difíceis de virtualizar sem degradação de desempenho. Neste contexto, Field-Programmable Gate Arrays (FPGAs) têm se mostrado uma boa opção para aceleração por hardware de funções de rede virtuais que requerem alta vazão, sem se desviar do conceito de uma infraestrutura NFV que visa alta flexibilidade. A avaliação de expressões regulares é um mecanismo importante e computacionalmente intensivo, usado para realizar Deep Packet Inpection, que pode ser acelerado por FPGA para atender aos requisitos de desempenho. Esta solução, no entanto, apresenta novos desafios em relação aos requisitos de confiabilidade. Particularmente para FPGAs baseados em SRAM, soft errors na memória de configuração são uma ameaça de confiabilidade significativa. Neste trabalho, apresentamos um mecanismo de tolerância a falhas abrangente para lidar com falhas de configuração na funcionalidade de módulos de avaliação de expressões regulares baseados em FPGA. Além disso, é introduzido um mecanismo de correção de erros que considera o posicionamento desses módulos no FPGA para reduzir o tempo de reparo do sistema, melhorando a confiabilidade e a disponibilidade. Os resultados experimentais mostram que a taxa de falha geral e o tempo de reparo do sistema podem ser reduzidos em 95% e 90%, respectivamente, com custos de área e performance admissíveis. / The Network Function Virtualization (NFV) paradigm promises to make computer networks more scalable and flexible by decoupling the network functions (NFs) from dedicated and vendor-specific hardware. However, network and compute intensive NFs may be difficult to virtualize without performance degradation. In this context, Field-Programmable Gate Arrays (FPGAs) have been shown to be a good option for hardware acceleration of virtual NFs that require high throughput, without deviating from the concept of an NFV infrastructure which aims at high flexibility. Regular expression matching is an important and compute intensive mechanism used to perform Deep Packet Inspection, which can be FPGA-accelerated to meet performance constraints. This solution, however, introduces new challenges regarding dependability requirements. Particularly for SRAM-based FPGAs, soft errors on the configuration memory are a significant dependability threat. In this work we present a comprehensive fault tolerance mechanism to deal with configuration faults on the functionality of FPGA-based regular expression matching engines. Moreover, a placement-aware scrubbing mechanism is introduced to reduce the system repair time, improving the system reliability and availability. Experimental results show that the overall failure rate and the system mean time to repair can be reduced in 95% and 90%, respectively, with manageable area and performance costs. Microeletrônica Tolerancia : Falhas Fpga Field-Programmable Gate Array Repair Time Fault-Tolerance Regular Expression Matching Network Function Virtualization
317	Fault Tolerance in Linear Algebraic Methods using Erasure Coded Computations Xuejiao Kang (5929862) 16 January 2019 (has links) <p>As parallel and distributed systems scale to hundreds of thousands of cores and beyond, fault tolerance becomes increasingly important -- particularly on systems with limited I/O capacity and bandwidth. Error correcting codes (ECCs) are used in communication systems where errors arise when bits are corrupted silently in a message. Error correcting codes can detect and correct erroneous bits. Erasure codes, an instance of error correcting codes that deal with data erasures, are widely used in storage systems. An erasure code addsredundancy to the data to tolerate erasures. </p> <p><br> </p> <p>In this thesis, erasure coded computations are proposed as a novel approach to dealing with processor faults in parallel and distributed systems. We first give a brief review of traditional fault tolerance methods, error correcting codes, and erasure coded storage. The benefits and challenges of erasure coded computations with respect to coding scheme, fault models and system support are also presented.</p> <p><br> </p> <p>In the first part of my thesis, I demonstrate the novel concept of erasure coded computations for linear system solvers. Erasure coding augments a given problem instance with redundant data. This augmented problem is executed in a fault oblivious manner in a faulty parallel environment. In the event of faults, we show how we can compute the true solution from potentially fault-prone solutions using a computationally inexpensive procedure. The results on diverse linear systems show that our technique has several important advantages: (i) as the hardware platform scales in size and in number of faults, our scheme yields increasing improvement in resource utilization, compared to traditional schemes; (ii) the proposed scheme is easy to code as the core algorithm remains the same; (iii) the general scheme is flexible to accommodate a range of computation and communication trade-offs. </p> <p><br> </p> <p>We propose a new coding scheme for augmenting the input matrix that satisfies the recovery equations of erasure coding with high probability in the event of random failures. This coding scheme also minimizes fill (non-zero elements introduced by the coding block), while being amenable to efficient partitioning across processing nodes. Our experimental results show that the scheme adds minimal overhead for fault tolerance, yields excellent parallel efficiency and scalability, and is robust to different fault arrival models and fault rates.</p> <p><br> </p> <p>Building on these results, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point during execution, we only solve a system with the same size as the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate the observed faults. We present, in details, the augmentation process, the parallel formulation, and the performance of our method. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance.</p> <p><br> </p> <p>Based on the promising results for linear system solvers, we apply the concept of erasure coded computation to eigenvalue problems, which arise in many applications including machine learning and scientific simulations. Erasure coded computation is used to design a fault tolerant eigenvalue solver. The original eigenvalue problem is reformulated into a generalized eigenvalue problem defined on appropriate augmented matrices. We present the augmentation scheme, the necessary conditions for augmentation blocks, and the proofs of equivalence of the original eigenvalue problem and the reformulated generalized eigenvalue problem. Finally, we show how the eigenvalues can be derived from the augmented system in the event of faults. </p> <p><br> </p> <p>We present detailed experiments, which demonstrate the excellent convergence properties of our fault tolerant TraceMin eigensolver in the average case. In the worst case where the row-column pairs that have the most impact on eigenvalues are erased, we present a novel scheme that computes the augmentation blocks as the computation proceeds, using the estimates of leverage scores of row-column pairs as they are computed by the iterative process. We demonstrate low overhead, excellent scalability in terms of the number of faults, and the robustness to different fault arrival models and fault rates for our method.</p> <p><br> </p> <p>In summary, this thesis presents a novel approach to fault tolerance based on erasure coded computations, demonstrates it in the context of important linear algebra kernels, and validates its performance on a diverse set of problems on scalable parallel computing platforms. As parallel systems scale to hundreds of thousands of processing cores and beyond, these techniques present the most scalable fault tolerant mechanisms currently available.</p><br> Computation Theory and Mathematics Distributed Computing Numerical Computation Fault Tolerance Linear Solver Eigenvalue Solver Erasure Coding Parallel Computing
318	Reconfiguração no t-node em caso de falhas / Reconfiguration on the t-node machine under fault Nunes, Raul Ceretta January 1993 (has links) Procedimentos de reconfiguração são usados em diversos sistemas para isolar módulos falhos e recuperar o sistema após a ocorrência de erros. Em ambientes multiprocessadores, onde existe redundância implícita de nodos processadores, vários algoritmos de reconfiguração já foram propostos. Entretanto a maior parte destes algoritmos destina-se a topologias específicas bastante exploradas como, por exemplo, arquiteturas na forma de arrays e árvores. Neste trabalho é apresentada uma estratégia de detecção/reconfiguração para tolerar falhas na máquina T-NODE. Esta máquina possui uma arquitetura multiprocessadora fracamente acoplada, que tem como processador base o transputer. Sua arquitetura de interconexão é definida pelo usuário; a organização de barramentos implementada com base em uma chave crossbar, a qual permite uma variada e fácil gama de opções. Assim, os algoritmos tradicionais de reconfiguração não se aplicam pois são excessivamente restritivos. A análise da arquitetura e do software de baixo nível existentes para a T-NODE revelou recursos praticamente inexistentes a nível de controle de falhas nos processadores e erros no processamento. Mesmo considerando-se que o principal objetivo desta máquina é a obtenção de alto desempenho, é possível implementar procedimentos que melhorem suas características de confiabilidade. Neste estudo é apresentada uma maneira de melhorar o nível de tolerância a falhas da máquina de modo que ela possa ser usada em tarefas mais exigentes do ponto de vista de confiabilidade, sem perda excessiva de desempenho. A estratégia definda usa a técnica de redundância dinâmica com detecção de falhas on-line e recuperação do sistema através do isolamento da falha por reconfiguração e conseqüente reinicialização do sistema. A validação da estratégia foi feita pela construção de um protótipo utilizando a linguagem OCCAM2 e um processador transputer conectado ao barramento de um microcomputador PC. No protótipo foram implementados três processos distintos: o testador, o supervisor e o reconfigurador. Estes processos têm respectivamente, as funções de testar os nodos processadores, supervisionar os resultados dos testes e reconfigurar o sistema quando da ocorrência de uma falha. / In many systems, reconfiguration strategies are used to remove failed components and to recuperate system from the resulting errors. Various reconfiguration algorithms have been proposed with the goal of covering faults in multiprocessing systems, but most of them support only specific architecture styles, as arrays or trees. In this study, a reconfiguration algorithm is proposed whose goal is to tolerate faults in the T-NODE machine. The T-NODE is a loosed coupled, multiprocessor machine based on transputers. The analysis of the architecture and of the system software existing for the T-NODE has shown that, in practice, there were not special resources aiming to control processor faults and processing errors. Even considering that the main goal of this machine is processing with high performance, it is possible to implement alternative procedures which result in better reliability characteristics. By other way, the interconnection architecture of this machine is defined by the user; its bus organization implemented with the aid of a crossbar switch allows choices among several possibilities. Consequently, traditional algorithms do not apply because they are too restrictive. Therefore, the research here related aims to improve the fault-tolerance parameters of this machine without changing significantly its original performance. The strategy here presented uses a dynamic redundancy technique with on-line fault detection; system recovery is get by logically isolating the faulty module, reconfiguring the others and restarting the system. The validation of the strategy has been done with the construction of a prototype using the OCCAM2 language and a transputer processor connected to the bus of a microcomputer (PC). Three different processes have been implemented in the prototype: the tester, the supervisior and the reconfigurator. These processes have respectively the functions of: testing the processing nodes, to supervise tests results and to reconfigure the system under fault occurrence. Arquitetura de computadores Tolerancia : Falhas Processamento paralelo Transputer T-node Arquiteturas paralelas Reconfiguracao Reconfiguration Transputer T-NODE Parallel architecture Fault tolerance
319	Abordagem de teoria dos jogos evolucionários para modelagem de aplicações de live streaming em redes peer-to-peer / Evolutionary game theory approach for modeling live streaming applications over peer-to-peer networks Watanabe, Sandra Satyko Guimarães January 2010 (has links) Existe um interesse crescente do mercado por aplicações de multimídia em streaming via rede. Particularmente, as aplicações de live streaming que utilizam a tecnologia de redes P2P para a disseminação de conteúdo têm sido alvo de grande atenção. Aplicações como PPLive e PPStream provam que as aplicações de live streaming em redes P2P são uma realidade com relação à tecnologia atual. Os sistemas de live streaming fornecem um serviço de multicast no nível de aplicação para transmissões ao vivo na Internet. Essas aplicações de live streaming, quando executadas em redes P2P, têm potencial para serem altamente robustas, escaláveis e adaptativas devido à redundância e não dependência de recursos particulares dentre os nodos participantes. Porém, para fazer uso de todas as vantagens disponíveis, a aplicação deve contornar alguns desafios: i) manter a qualidade de playback mesmo com a inerente dinamicidade das redes P2P; ii) impedir que nodos incorretos escondam ações maliciosas atrás do anonimato que existe em P2P; iii) manter a taxa de upload dos nodos participantes da aplicação em um nível aceitável. A taxa de upload dos nodos é muito importante porque a aplicação de live streaming em P2P é uma aplicação cooperativa. Desta forma, esperase que todo novo usuário ajude a aplicação retransmitindo pacotes para outros usuários, mantendo, desta forma, a capacidade global de upload do sistema. Infelizmente, manter a cooperação em live streaming não é uma tarefa trivial, visto que cada nodo enfrenta o dilema social do interesse próprio (individualmente é melhor explorar a cooperação dos outros usuários sem reciprocidade) versus a cooperação para com o grupo. A principal contribuição deste trabalho consiste na apresentação de um modelo matemático baseado em Teoria dos Jogos Evolucionários, cujo objetivo é ajudar a compreender as aplicações de live streaming em redes P2P e os fatores que influenciam o seu correto funcionamento. Como contribuição secundária, este trabalho fornece uma análise estatística do comportamento do download e upload observado nestas aplicações. A análise estatística mostra que existe um decaimento da variância temporal de download e upload nas aplicações de live streaming, e que tal decaimento segue uma lei de potência. Os resultados evolucionários do modelo indicam que, se a queda do índice de satisfação dos usuários com a taxa de download for suave, e se a redução da satisfação devido ao custo de upload for insignificante, então existe um ambiente propício para que a cooperação entre os nodos cresça. De forma inversa, se a queda do índice de satisfação dos usuários com a taxa de download for abrupta, e a redução da satisfação devido ao custo de upload for significativa, então existe um ambiente propício para proliferação de nodos oportunistas. A realização e descrição desta pesquisa é composta de quatro etapas principais: i) a delimitação do cenário de live streaming e a definição do jogo para modelagem; ii) a definição do conjunto de estratégias e da função de utilidade; iii) a criação do modelo; iv) a análise do modelo e a apresentação dos resultados de simulação. A análise do modelo abrange três fases: i) análise estatística e comparação das características de download e upload dos dois simuladores utilizados; ii) avaliação do modelo de Teoria dos Jogos Evolucionários através de simulações; e iii) análise dos resultados evolucionários gerados pelo simulador de Teoria dos Jogos Evolucionários. / There is a growing interest in the market for networked multimedia applications. Live streaming applications that use the technology of P2P networks for distribution of live content have specially been the subject of great attention. Applications such as PPLive and PPStrem demonstrate that P2P live streaming applications are already possible with our present technology. Live streaming systems provide a multicast service in the application level for live broadcasts to users through the Internet. These systems executing in P2P networks have the potential to be highly robust, scalable and adaptive due to the characteristics of these scenarios. However, to take advantage of these potential properties, they must overcome some challenges: i) to maintain the playback quality even with the inherent dynamics of P2P networks; ii) to prevent that incorrect peers hide malicious behavior behind their anonymity; iii) to maintain the upload contribution of peers at acceptable levels. The upload contribution of peers is highly important because live streaming applications are cooperative applications. Therefore, every new user must help the application forwarding packets to other users, thereby maintaining the global upload capacity of the system. Unfortunately, the maintenance of cooperation in live streaming system is not a trivial task, since each node faces the social dilemma of self-interest (individually is always better to explore the cooperation of other users without reciprocity) versus cooperation to the group. The main contribution of this dissertation is the presentation of a mathematical model based on Evolutionary Game Theory, whose goal is to help understanding live streaming P2P applications and the factors that influence their correct operation. As a secondary contribution, this work provides a statistical analysis of download and upload behaviors of peers in live streaming P2P systems. The statistical analysis indicates that there is a decay in the download and upload variances, and that this decay follows a power law. The evolutionary results of the model indicate that, if the satisfaction of users with the download rate is smooth, and the reduction of satisfaction due to the upload cost is negligible, then there is a favorable environment for the growth of cooperation. Conversely, if the satisfaction of users with the download rate is abrupt, and the reduction of satisfaction due to the upload cost is significant, then there is a favorable environment to the proliferation of opportunistic nodes. The realization and description of this research is composed of four main steps: i) the definition of the live streaming scenario and the definition of the game to model this scenario; ii) the definition of the strategy set and of the utility function; iii) the suggestion of a model; iv) the analysis of the proposed model and the presentation of obtained results. The model analysis comprehends three phases: i) the statistical analysis and the comparison of the characteristics of download and upload of the two simulators used in this work; ii) the evaluation of the Evolutionary Game Theory model through simulation; and iii) the analysis of the results generated by the Evolutionary Game Theory simulator. Sistemas distribuidos P2P Teoria : Jogos Tolerancia : Falhas Peer-to-peer networks Live streaming Evolutionary game theory Fault tolerance
320	Reintegração de servidores em sistemas distribuídos / Reintegration of failed server in distributed systems Pasin, Marcia January 1998 (has links) Sistemas distribuídos representam uma plataforma ideal para implementação de sistemas computacionais com alta confiabilidade e disponibilidade devido a redundância fornecida por um grande número de estações interligadas. Falhas de um servidor podem ser contornadas pela reconfiguração do sistema. Entretanto falhas em seqüência que afetem múltiplas estações comprometem não apenas o desempenho do sistema, mas também a continuidade do serviço e sua confiabilidade. Assim, servidores falhos, que tenham sido isolados do sistema, devem ser reintegrados tão logo quanto possível para não comprometer a disponibilidade do sistema computacional. Este trabalho trata da atualização do estado de servidores e da troca de informação que o servidor recuperado realiza para integrar-se aos demais membros do sistema através de um procedimento chamado reintegração do servidor. E assumido um ambiente distribuído que garante alta confiabilidade em aplicações convencionais através da técnica de replicação de arquivos. O servidor a ser reintegrado faz parte de um grupo de replicação e volta a participar ativamente do grupo tão logo seja reintegrado. Para tanto, considera-se a estratégia de replicação por copia primaria e um sistema distribuído experimental, compatível com o NFS, desenvolvido na UFRGS para aplicar a reintegração de servidor. Os métodos de atualização de arquivos para a reintegração do servidor foram implementadas no ambiente UNIX. / Distributed systems are an ideal platform to develop high reliable computer applications due to the redundancy supplied by a great number of interconnected workstations. Failed stations can be masked reconfiguring the system. However, sequential faults, that affect multiple stations, not just decrease the performance of the system, but also affect the continuity of the service and its reliability. Thus, failed stations working as servers, that have been isolated from the system, should be reintegrated as soon as possible to not impair the system availability. This work is exactly about methods to update the state of failed servers. It deals also with the change of information that the recovered server accomplishes to be integrated to the other members of the service group through a process called reintegration of server. It is assumed a distributed environment that guarantees high reliability in conventional applications through replication of files. The server to be reintegrated is part of a replication group and it participates actively of the service group again as soon as it is reintegrated. Our approach is based on a primary copy. The file actualization methods to the reintegration of server were implemented in an UNIX environment. To illustrate our approach we will describe how the integration of repaired server can be made a fault-tolerant system. The experimental distributed system, compatible with NFS, was designed at the UFRGS. Confiabilidade : Computadores Tolerancia : Falhas Reintegracao : Servidores Sistemas operacionais distribuidos Distributed file systems Fault tolerance Reintegration of failed server

Search results