Global ETD Search

211	Applying dual core lockstep in embedded processors to mitigate radiation induced soft errors / Aplicando dual core lockstep em processadores embarcados para mitigar falhas transientes induzidas por radiação Oliveira, Ádria Barros de January 2017 (has links) Os processadores embarcados operando em sistemas de segurança ou de missão crítica não podem falhar. Qualquer falha neste tipo de aplicação pode levar a consequências inaceitáveis, como risco de vida ou danos à propriedade ou ao meio ambiente. Os sistemas embarcados que operam em aplicações aeroespaciais são sucetíveis à falhas transientes induzidas por radiação. Entretanto, os efeitos de radiação também podem ser observados ao nível do solo. Falhas transientes afetam os processadores modificando os valores armazenados em elementos de memória, tais como registradores e memória de dados. Essas falhas podem levar o processador a executar incorretamente a aplicação, provocando erros na saída ou travamentos no sistema. Os avanços recentes em processadores embarcados concistem na integração de processadores hard-core e FPGAs. Tais dispositivos, comumente chamados de Sistemas-em-Chip Totalmente Programáveis (APSoCs), também são sucetíveis aos efeitos de radiação. Com objetivo de minimizar esse problema de tolerância a falhas, este trabalho apresenta um Dual-Core LockStep (DCLS) como uma técnica de tolerância para mitigar falhas induzidas por radiação que afetam processadores embarcados em APSoCs. Lockstep é um método baseado em redundância usado para detectar e corrigir falhas transientes. O DCLS proposto é implementado em um processador ARM Cortex-A9 hard-core embarcado no APSoC Zynq-7000. A eficiência da abordagem implementada foi validada tanto em aplicações executando em bare-metal como no sistema operacional FreeRTOS. Experimentos com íons pesados e emulação de falhas por injeção foram executados para analisar a sucetibilidade do sistema a inversão de bits. Os resultados obtidos mostram que a abordagem é capaz de diminuir a seção de choque do sistema com uma alta taxa de proteção. O sistema DCLS mitigou com sucesso até 78% das falhas injetadas. Otimizações de software também foram avaliadas para uma melhor compreenção dos trade-offs entre desempenho e confiabilidade. Através da análise de diferentes partições de software, observou-se que o tempo de execução de um bloco da aplicação deve ser muito maior que o tempo de verificação para que se obtenha menor impacto em desempenho. A avaliação de otimizações de compilador demonstrou que utilizar o nível O3 aumenta a vulnerabilidade da aplicação à falhas transientes. Como o O3 requer o uso de mais registradores que os otros níveis de otimização, o sistema se torna mais sucetível à falhas. Por outro lado, os resultados dos experimentos de radiação apontam que a aplicação compilada com nível O3 obtém maior Carga de Trabalho Média Entre Falhas (MWBF). Como a aplicação executa mais rápido, mais dados são computados corretamente antes da ocorrência de um erro. / The embedded processors operating in safety- or mission-critical systems are not allowed to fail. Any failure in such applications could lead to unacceptable consequences as life risk or significant damage to property or environment. Concerning faults originated by the radiation-induced soft errors, the embedded systems operating in aerospace applications are particularly susceptible. However, the radiation effects can also be observed at ground level. Soft errors affect processors by modifying values stored in memory elements, such as registers and data memory. These faults may lead the processor to execute an application incorrectly, generating output errors or leading hangs and crashes in the system. The recent advances in embedded systems concern the integration of hard-core processors and FPGAs. Such devices, called All Programmable System-on-Chip (APSoC), are also susceptible to radiation effects. Aiming to address this fault tolerance problem this work presents a Dual-Core LockStep (DCLS) as a fault tolerance technique to mitigate radiation-induced faults affecting processors embedded into APSoCs. Lockstep is a method based on redundancy used to detect and correct soft errors. The proposed DCLS is implemented in a hard-core ARM Cortex-A9 embedded into a Zynq-7000 APSoC. The approach efficiency was validated not only on applications running in baremetal but also on top of FreeRTOS systems. Heavy ions experiments and fault injection emulation were performed to analyze the system susceptibility to bit-flips. The obtained results show that the approach is able to decrease the system cross section with a high rate of protection. The DCLS system successfully mitigated up to 78% of the injected faults. Software optimizations were also evaluated to understand the trade-offs between performance and reliability better. By the analysis of different software partitions, it was observed that the execution time of an application block must to be much longer than the verification time to achieve fewer performance penalties. The compiler optimizations assessment demonstrate that using O3 level increases the application vulnerability to soft errors. Because O3 handles more registers than other optimizations, the system is more susceptible to faults. On the other hand, results from radiation experiments show that O3 level provides a higher Mean Workload Between Failures (MWBF). As the application runs faster, more data are correctly computed before an error occurrence. Microeletrônica Sistemas embarcados Embedded Processors Reliability Fault Injection Radiation Experiments Soft Errors Lockstep Fault Tolerance
212	Adaptive and polymorphic VLIW processor to dynamically balance performance, energy consumption, and fault tolerance / Processador VLIW adaptativo e polimórfico para equilibrar de forma dinâmica o desempenho, o consumo de energia e a tolerância a falhas Sartor, Anderson Luiz January 2018 (has links) Ao se projetar um novo processador, o desempenho não é mais o único objetivo de otimização. Reduzir o consumo de energia também é essencial, pois, enquanto a maior parte dos dispositivos embarcados depende fortemente de bateria, os processadores de propósito geral (GPPs) são restringidos pelos limites da energia térmica de projeto (TDP – thermal design power). Além disso, devido à evolução da tecnologia, a taxa de falhas transientes tem aumentado nos processadores modernos, o que afeta a confiabilidade de sistemas tanto no espaço quanto no nível do mar. Adicionalmente, a maioria dos processadores homogêneos e heterogêneos tem um design fixo, o que limita a adaptação em tempo de execução. Nesse cenário, nós propomos dois designs de processadores que são capazes de realizar o trade-off entre esses eixos de acordo com a aplicação alvo e os requisitos do sistema. Ambos designs baseiam-se em um mecanismo de duplicação de instruções com rollback que detecta e corrige falhas, um módulo de power gating para reduzir o consumo de energia das unidades funcionais. O primeiro é chamado de processador adaptativo e usa thresholds, definidos em tempo de projeto, para adaptar a execução da aplicação Adicionalmente, ele controla o ILP da aplicação para criar mais oportunidade de duplicação e de power gating. O segundo design é chamado processador polimórfico e ele avalia (em tempo de execução) a melhor configuração de hardware a ser usada para cada aplicação. Ele também explora o hardware disponível para maximizar o número de aplicações que são executadas em paralelo. Para a versão adaptativa usando uma configuração orientada a otimização de energia, é possível, em média, economizar 37,2% de energia com um overhead de apenas 8,2% em performance, mantendo baixos níveis de defeito, quando comparado a um design tolerante a falhas. Para a versão polimórfica, os resultados mostram que a reconfiguração dinâmica do processador é capaz de adaptar eficientemente o hardware ao comportamento da aplicação, de acordo com os requisitos especificados pelo designer, chegando a 94.88% do resultado de um processador oráculo quando o trade-off entre os três eixos é considerado. Por outro lado, a melhor configuração estática apenas atinge 28.24% do resultado do oráculo. / Performance is no longer the only optimization goal when designing a new processor. Reducing energy consumption is also mandatory: while most of the embedded devices are heavily dependent on battery power, General-Purpose Processors (GPPs) are being pulled back by the limits of Thermal Design Power (TDP). Moreover, due to technology scaling, soft error rate (i.e., transient faults) has been increasing in modern processors, which affects the reliability of both space and ground-level systems. In addition, most traditional homogeneous and heterogeneous processors have a fixed design, which limits its runtime adaptability. Therefore, they are not able to cope with the changing application behavior when one considers the axes of fault tolerance, performance, and energy consumption altogether. In this context, we propose two processor designs that are able to trade-off these three axes according to the application at hand and system requirements. Both designs rely on an instruction duplication with rollback mechanism that can detect and correct errors and a power gating module to reduce the energy consumption of the functional units The former design, called adaptive processor, uses thresholds defined at design time to allow runtime adaptation of the application’s execution and controls the application’s Instruction-Level Parallelism (ILP) to create more slots for duplication or power gating. The latter design (polymorphic processor) takes the former one step further by dynamically reconfiguring the hardware and evaluating different processor configurations for each application, and it also exploits the available pipelanes to maximize the number of applications that are executed concurrently. For the adaptive processor using an energy-oriented configuration, it is possible, on average, to reduce energy consumption by 37.2% with an overhead of only 8.2% in performance, while maintaining low levels of failure rate, when compared to a fault-tolerant design. For the polymorphic processor, results show that the dynamic reconfiguration of the processor is able to efficiently match the hardware to the behavior of the application, according to the requirements of the designer, achieving 94.88% of the result of an oracle processor when the trade-off between the three axes is considered. On the other hand, the best static configuration only achieves 28.24% of the oracle’s result. Tolerância a falhas Processamento paralelo Adaptive processor Fault tolerance Energy consumption Performance VLIW
213	Automated design flow for applying triple modular redundancy in complex semi-custom digital integrated circuits / Fluxo de projeto automatizado para aplicar redundância modular tripla em circuitos semicustomizados complexos Benites, Luis Alberto Contreras January 2018 (has links) Os efeitos de radiação têm sido um dos problemas mais sérios em aplicações militares e espaciais. Mas eles também são uma preocupação crescente em tecnologias modernas, mesmo para aplicações comerciais no nível do solo. A proteção dos circuitos integrados contra os efeitos da radiação podem ser obtidos através do uso de processos de fabricação aprimorados e de estratégias em diferentes estágios do projeto do circuito. A técnica de TMR é bem conhecida e amplamente empregada para mascarar falhas únicas sem detectálas. No entanto, o projeto de circuitos TMR não é automatizado por ferramentas EDA comerciais e até mesmo eles podem remover parcial ou totalmente a lógica redundante. Por outro lado, existem várias ferramentas que podem ser usadas para implementar a técnica de TMR em circuitos integrados, embora a maioria delas sejam ferramentas comerciais licenciadas, convenientes apenas para dispositivos específicos, ou com uso restrito por causa do regime ITAR. O presente trabalho pretende superar esses incovenientes, para isso uma metodologia é proposta para automatizar o projeto de circuitos TMR utilizando um fluxo de projeto comercial. A abordagem proposta utiliza um netlist estruturado para implementar automaticamente os circuitos TMR em diferentes níveis de granularidade de redundância para projetos baseados em células e FPGA. A otimização do circuito TMR resultante também é aplicada com base na abordagem do dimensionamento de portas lógicas. Além disso, a verificação do circuito TMR implementado é baseada na verificação de equivalência e garante sua funcionalidade correta e sua capacidade de tolerancia a falhas simples. Experimentos com um circuito derivado de HLS e uma descrição ofuscada do soft-core ARM Cortex-M0 foram realizados para mostrar o uso e as vantagens do fluxo de projeto proposto. Diversas questões relacionadas à remoção da lógica redundante implementada foram encontradas, bem como o impacto no incremento de área causado pelos votadores de maioria. Além disso, a confiabilidade de diferentes implementações de TMR do soft core ARM sintetizado em FPGA foi avaliada usando campanhas de injeção de falhas emuladas. Como resultado, foi reforçado o nível de alta confiabilidade da implemntação com mais fina granularidade, mesmo na presença de até 10 falhas acumuladas, e a menor capacidade de mitigação correspondente à replicação de flip-flops apenas. / Radiation effects have been one of the most serious issues in military and space applications. But they are also an increasing concern in modern technologies, even for commercial applications at the ground level. Protection or hardening of integrated circuits against radiation effects can be obtained through the use of enhanced fabrication processes and strategies at different stages of the circuit design. The triple modular redundancy (TMR) technique is a widely and well-known technique employed to mask single faults without detecting them. However, the design of TMR circuits is not automated by commercial electronic design automation (EDA) tools and even they can remove partially or totally the redundant logic. On the other hand, there are several tools that can be used to implement the TMR technique in integrated circuits, although most of them are licensed commercial tools, convenient only for specific devices, or with restricted use because of the International Traffic in Arms Regulations (ITAR) regimen. The present work intends to overcome these issues so a methodology is proposed to automate the design of TMR circuits using a commercial design flow. The proposed approach uses a structured netlist to implement automatically TMR circuits at different granularity levels of redundancy for cell-based and field-programmable gate array (FPGA) designs. Optimization of the resulting TMR circuit is also applied based on the gate sizing approach. Moreover, verification of the implemented TMR circuit is based on equivalence checking, and guarantee its correct functionality and its fault-tolerant capability against soft errors. Experiments with an high-level synthesis (HLS)-derived circuit and an obfuscated description of the ARM Cortex-M0 soft-core are performed to show the use and the advantages of the proposed design flow. Several issues related to the removal of the implemented redundant logic were found as well as the impact in the increment of area caused by the majority voters. Furthermore, the reliability of different TMR implementations of the ARM soft-core synthesized in FPGA was evaluated using emulated-simulation fault injection campaigns. As a result, it was reinforced the high-reliability level of the finest granularity implementation even in the presence of up to 10 accumulated faults and the poorest mitigation capacity corresponding to the replication of flip-flops solely. Microeletrônica Tolerancia : Falhas Fault tolerance FPGA ASIC Semicustom design flow Equivalence checking TMR Radiation effects
214	Designing and evaluating hybrid techniques to detect transient faults in processors embedded in FPGAs / Desenvolvendo e Avaliando técnicas híbridas para detectar falhas transientes em processadores embarcados em FPGAs / Entwurf und auswertung von hybrid-techniken zur erkennung von transienten fehlern in FPGA eingebetteten prozessoren Azambuja, José Rodrigo Furlanetto de January 2013 (has links) Der aktuelle Stand der Technologie bringt schnellere und kleinere Bausteine für die Herstellung von integrierten Schaltungen mit sich, die während sie effizienter sind auch anfälliger für Strahlung werden. Kleinere Abmessungen der Transistoren, höhere Integrationsdichte, geringere Versorgungsspannungen und höhere Betriebsfrequenzen sind einige der Charakteristika, die energiegeladene Partikel zu einer Herausforderung machen, wenn man integrierte Schaltungen in rauen Umgebungen einsetzt. Diese Art der Partikel hat einen sehr großen Einfluss auf Prozessoren, die in einer solchen Umgebung eingesetzt werden. Sowohl die Ausführung des Programms, welche durch fehlerhafte Sprünge in der Programmsequenz beeinflusst wird, als auch Daten, die in speichernden Elementen wie Programmspeicher, Datenspeicher oder in Registern abgelegt sind, werden verfälscht. Um solche Prozessorsysteme abzusichern, wird in der Literatur Fehlertoleranz empfohlen, welche die Systemperformanz verringert, einen größeren Flächenverbrauch mit sich bringt und das System dennoch nicht komplett schützen kann. Diese Fehlertoleranz kann sowohl durch software- als auch durch hardwarebasierte Ansätze umgesetzt werden. In diesem Zusammenhang schlagen wir eine Kombination aus Hardware- und Software- Lösung vor, welche die Systemperformanz nur sehr wenig beeinflusst und den zusätzlichen Speicheraufwand minimiert. Diese Hybrid-Technologie zielt darauf ab, alle Fehler in einem System zu finden. Fünf solcher Techniken werden beschrieben und erklärt, zwei der vorgestellten Techniken sind bekannte Software-Lösungen, die anderen drei sind neue Hybrid-Lösungen, um alle transienten Effekte von Strahlung in Prozessoren erkennen zu können. Diese unterschiedlichen Ansätze werden anhand ihrer Ausführungszeit, Programm-, Datenspeicher, Flächenvergrößerung und Taktfrequenz analysiert und ausgewertet. Um die Effizienz und die Machbarkeit des vorgeschlagenen Ansatzes verifizieren zu können, werden Fehlerinjektionstests sowohl durch Simulation als auch durch Bestrahlungsexperimente in unterschiedlichen Positionen mit einer Cobalt-60 Quelle durchgeführt. Die Ergebnisse des vorgeschlagenen Ansatzes verbessern den Stand der Technik durch die Bereitstellung einer höheren Fehlererkennungsrate bei sehr geringer negativer Beeinflussung der Performanz und des Speicherverbrauchs. / Os recentes avanços tecnológicos proporcionaram dispositivos menores e mais rápidos para a fabricação de circuitos que, apesar de mais eficientes, se tornaram mais sensíveis aos efeitos de radiação. Menores dimensões de transistores, mais densidade de integração, tensões de alimentação mais baixas e frequências de operação mais altas são algumas das características que tornaram partículas energizadas um problema, quando lidando com sistemas integrados em ambientes severos. Estes tipos de partículas tem uma grande influencia em processadores funcionando em tais ambientes, afetando tanto o fluxo de execução do programa ao causar desvios incorretos, bem como os dados armazenados em elementos de memória, como memórias de dados e programas e registradores. A fim de proteger sistemas processados, técnicas de tolerância a falhas foram propostas na literatura usando propostas baseadas em hardware, software, que diminuem o desempenho do sistema, aumentam a sua área e não são capazes de proteger totalmente o sistema destes efeitos. Neste contexto, propomos a combinação de técnicas baseadas em hardware e software para criar técnicas híbridas orientadas a detectar todas as falhas que afetam o sistema, com baixa degradação de desempenho e aumento de memória. Cinco técnicas são apresentadas e descritas em detalhes, das quais duas são conhecidas técnicas baseadas puramente em software e três são técnicas híbridas novas, para detectar todos os tipos de efeitos transientes causados pela radiação em processadores. As técnicas são avaliadas de acordo com o aumento no tempo de execução, no uso das memórias de dados e programa e de área, e degradação da frequência de operação. Para verificar a eficiência e aplicabilidade das técnicas propostas, campanhas de injeção de falhas são realizadas ao se simular a injeção de falhas e realizar experimentos de irradiação em diferentes localidades com nêutron e fontes de Cobalto-60. Os resultados mostraram que as técnicas propostas aprimoraram o estado da arte ao fornecer altas taxas de detecção de falhas com baixas penalidades em degradação de desempenho e aumento de memória. / Recent technology advances have provided faster and smaller devices for manufacturing circuits that while more efficient have become more sensitive to the effects of radiation. Smaller transistor dimensions, higher density integration, lower voltage supplies and higher operating frequencies are some of the characteristics that make energized particles an issue when dealing with integrated circuits in harsh environments. These types of particles have a major influence in processors working in such environments, affecting both the program’s execution flow by causing incorrect jumps in the program, and the data stored in memory elements, such as data and program memories, and registers. In order to protect processor systems, fault tolerance techniques have been proposed in literature using hardware-based and software-based approaches, which decrease the system’s performance, increase its area, and are not able to fully protect the system against such effects. In this context, we proposed a combination of hardware- and software-based techniques to create hybrid techniques aimed at detecting all the faults affecting the system, at low performance degradation and memory overhead. Five techniques are presented and described in detail, from which two are known software-based only techniques and three are new hybrid techniques, to detect all kinds of transient effects caused by radiation in processors. The techniques are evaluated according to execution time, program and data memories, and area overhead and operating frequency degradation. To verify the effectiveness and the feasibility of the proposed techniques, fault injection campaigns are performed by injecting faults by simulation and performing irradiation experiments in different locations with neutrons and a Cobalt-60 sources. Results have shown that the proposed techniques improve the state-of-the-art by providing high fault detection rates at low penalties on performance degradation and memory overhead. Fehlertoleranz Strahlungseffekte Prozessoren Microeletrônica Tolerancia : Falhas Fpga Fault tolerance Radiation effects Processors
215	Extensão do suporte para simulação de defeitos em algoritmos distribuídos utilizando o Neko / Extension to support failures in distributed algorithm simulation using Neko Rodrigues, Luiz Antonio January 2006 (has links) O estudo e desenvolvimento de sistemas distribuídos é uma tarefa que demanda grande esforço e recursos. Por este motivo, a pesquisa em sistemas deste tipo pode ser auxiliada com o uso de simuladores, bem como por meio da emulação. A vantagem de se usar simuladores é que eles permitem obter resultados bastante satisfatórios sem causar impactos indesejados no mundo real e, conseqüentemente, evitando desperdícios de recursos. Além disto, testes em larga escala podem ser controlados e reproduzidos. Neste sentido, vem sendo desenvolvido desde 2000 um framework para simulação de algoritmos distribuídos denominado Neko. Por meio deste framework, algoritmos podem ser simulados em uma única máquina ou executados em uma rede real utilizando-se o mesmo código nos dois casos. Entretanto, através de um estudo realizado sobre os modelos de defeitos mais utilizados na literatura, verificou-se que o Neko é ainda bastante restrito nesta área. A única classe de defeito abordada, lá referida como colapso, permite apenas o bloqueio temporário de mensagens do processo. Assim, foram definidos mecanismos para a simulação das seguintes classes de defeitos: omissão de mensagens, colapso de processo, e alguns defeitos de rede tais como quebra de enlace, perda de mensagens e particionamento. A implementação foi feita em Java e as alterações necessárias no Neko estão documentadas no texto. Para dar suporte aos mecanismos de simulação de defeitos, foram feitas alterações no código fonte de algumas classes do framework, o que exige que a versão original seja alterada para utilizar as soluções. No entanto, qualquer aplicação desenvolvida anteriormente para a versão original poderá ser executada normalmente independente das modificações efetuadas. Para testar e validar as propostas e soluções desenvolvidas foram utilizados estudos de caso. Por fim, para facilitar o uso do Neko foi gerado um documento contendo informações sobre instalação, configuração e principais mecanismos disponíveis no simulador, incluindo o suporte a simulação de defeitos desenvolvido neste trabalho. / The study and development of distributed systems is a task that demands great effort and resources. For this reason, the research in systems of this type can be assisted by the use of simulators, as well as by means of the emulation. The advantage of using simulators is that, in general, they allow to get acceptable results without causing harming impacts in the real world and, consequently, preventing wastefulness of resources. Moreover, tests on a large scale can be controlled and reproduced. In this way, since 2000, a framework for the simulation of distributed algorithms called Neko has been developed. By means of this framework, algorithms can be simulated in a single machine or executed in a real network, using the same code in both cases. However, studying the most known and used failure models developed having in mind distributed systems, we realized that the support offered by Neko for failure simulation was too restrictive. The only developed failure class, originally named crash, allowed only a temporary blocking of process’ messages. Thus, mechanisms for the simulation of the following failure classes were defined in the present work: omission of messages, crash of processes, and some network failures such as link crash, message drop and partitioning. The implementation was developed in Java and the necessary modifications in Neko are registered in this text. To give support to the mechanisms for failure simulation, some changes were carried out in the source code of some classes of the framework, what means that the original version should be modified to use the proposed solutions. However, all legacy applications, developed for the original Neko version, keep whole compatibility and can be executed without being affected by the new changes. In this research, some case studies were used to test and validate the new failure classes. Finally, with the aim to facilitate the use of Neko, a document about the simulator, with information on how to install, to configure, the main available mechanisms and also on the developed support for failure simulation, was produced. Tolerancia : Falhas Sistemas distribuidos Simulação computacional Fault tolerance Neko Distributed systems Simulation
216	Designing fault tolerant NoCs to improve reliability on SoCs / Projeto de NoCs tolerantes a falhas para o aumento da confiabilidade em SoCs Frantz, Arthur Pereira January 2007 (has links) Com a redução das dimensões dos dispositivos nas tecnologias sub-micrônicas foi possível um grande aumento no número de IP cores integrados em um mesmo chip e consequentemente novas arquiteturas de comunicação são usadas bucando atingir os requisitos de desempenho e potência. As redes intra-chip (Networks-on-Chip) foram propostas como uma plataforma alternativa de comunicação capaz de prover interconexões e comunicação entre os cores de um mesmo chip, tratando questões como desempenho, consumo de energia e reusabilidade para grandes sistemas integrados. Por outro lado, a mesma evolução tecnológica dos processos nanométricos reduziu drasticamente a confiabilidade de circuitos integrados, tornando dispositivos e interconexões mais sensíveis a novos tipos de falhas. Erros podem ser gerados por variações no processo de fabricação ou mesmo pela susceptibilidade do projeto, quando este opera em um ambiente hostil. Na comunicação de NoCs as duas principais fontes de erros são falhas de crosstalk e soft errors. No passado, se assumia que interconexões não poderiam ser afetadas por soft errors, por não possuirem circuitos seqüenciais. Porém, quando NoCs são usadas, buffers e circuitos seqüenciais estão presentes nos roteadores e, consequentemente, podem ocorrer soft errors entre a fonte e o destino da comunicação, provocando erros. Técnicas de tolerância a falhas, que tem sido aplicadas em circuitos em geral, podem ser usadas para proteger roteadores contra bit-flips. Neste cenário, este trabalho inicia com a avaliação dos efeitos de soft errors e falhas de crosstalk em uma arquitetura de NoC, através de simulação de injeção de falhas, analisando detalhadamente o impacto de tais falhas no roteador. Os resultados mostram que os efeitos dessas falhas na comunicação do SoC podem ser desastrosos, levando a perda de pacotes e travamento ou indisponibilidade do sistema. Então é proposta e avaliada a aplicação de um conjunto de técnicas de tolerância a falhas em roteadores, possibilitando diminuir os soft errors e falhas de crosstalk no nível de hardware. Estas técnicas propostas foram baseadas em códigos de correção de erros e redundância de hardware. Resultados experimentais mostram que estas técnicas podem obter zero erros com 50% a menos de overhead de área, quando comparadas com a duplicação simples. Entretanto, algumas dessas técnicas têm um grande consumo de potência, pois toda essas técnicas são baseadas na adição de hardware redundante. Considerando que as técnicas de proteção baseadas em software também impõe um considerável overhead na comunicação devido à retransmissão, é proposto o uso de técnicas mistas de hardware e software, que podem oferecer um nível de proteção satisfatório, baseado na análise do ambiente onde o sistema irá operar (soft error rate), fatores relativos ao projeto e fabricação (variações de atraso em interconexões, pontos susceptíveis a crosstalk), a probabilidade de uma falha gerar um erro em um roteador, a carga de comunicação e os limites de potência e energia suportados. / As the technology scales down into deep sub-micron domain, more IP cores are integrated in the same die and new communication architectures are used to meet performance and power constraints. Networks-on-Chip have been proposed as an alternative communication platform capable of providing interconnections and communication among onchip cores, handling performance, energy consumption and reusability issues for large integrated systems. However, the same advances to nanometric technologies have significantly reduced reliability in mass-produced integrated circuits, increasing the sensitivity of devices and interconnects to new types of failures. Variations at the fabrication process or even the susceptibility of a design under a hostile environment might generate errors. In NoC communications the two major sources of errors are crosstalk faults and soft errors. In the past, it was assumed that connections cannot be affected by soft errors because there was no sequential circuit involved. However, when NoCs are used, buffers and sequential circuits are present in the routers, consequently, soft errors can occur between the communication source and destination provoking errors. Fault tolerant techniques that once have been applied in integrated circuits in general can be used to protect routers against bit-flips. In this scenario, this work starts evaluating the effects of soft errors and crosstalk faults in a NoC architecture by performing fault injection simulations, where it has been accurate analyzed the impact of such faults over the switch service. The results show that the effect of those faults in the SoC communication can be disastrous, leading to loss of packets and system crash or unavailability. Then it proposes and evaluates a set of fault tolerant techniques applied at routers able to mitigate soft errors and crosstalk faults at the hardware level. Such proposed techniques were based on error correcting codes and hardware redundancy. Experimental results show that using the proposed techniques one can obtain zero errors with up to 50% of savings in the area overhead when compared to simple duplication. However some of these techniques are very power consuming because all the tolerance is based on adding redundant hardware. Considering that softwarebased mitigation techniques also impose a considerable communication overhead due to retransmission, we then propose the use of mixed hardware-software techniques, that can develop a suitable protection scheme driven by the analysis of the environment that the system will operate in (soft error rate), the design and fabrication factors (delay variations in interconnects, crosstalk enabling points), the probability of a fault generating an error in the router, the communication load and the allowed power or energy budget. Microeletrônica SoC Deteccao : Erros Networks-on-chip Fault tolerance Soft errors Crosstalk
217	Implementação de um mecanismo de recuperação por retorno para a ferramenta ourgrid / Implementation of a rollback recovery mechanism for ourGrid toolkit Silva, Hélio Antônio Miranda da January 2007 (has links) A computação em grid (ou computação em grade) emergiu como uma área de pesquisa importante por permitir o compartilhamento de recursos computacionais geograficamente distribuídos entre vários usuários. Contudo, a heterogeneidade e a dinâmica do comportamento dos recursos em ambientes de grid tornam complexos o desenvolvimento e a execução de aplicações. OurGrid é uma plataforma de software que procura contornar estas dificuldades: além de permitir a execução de aplicações distribuídas em ambientes de computação em grid, oferece e gerencia um esquema de troca de favores entre usuários. Neste esquema, instituições (ou usuários) que possuam recursos ociosos podem oferecê-los a outros que deles necessitem. Quanto mais um domínio oferecer recursos ao grid, mais será favorecido quando precisar, ou seja, terá prioridade mais alta quando requisitar máquinas ao grid. O software MyGrid é o principal componente do OurGrid. É através dele que o usuário interage com o grid, submetendo e gerenciando suas aplicações. No modelo de execução do MyGrid, as tarefas são lançadas por um nó central que coordena todo o escalonamento de tarefas que serão executadas no grid. Este nó apresenta uma fragilidade caracterizada na literatura como "ponto único de falhas", pois seu colapso faz com que os resultados do processamento corrente sejam perdidos. Isto pode significar horas ou, até mesmo, dias de processamento perdido, dependendo das aplicações. Visando suprir esta deficiência, este trabalho descreve o funcionamento e a implementação de um mecanismo de checkpointing (ou salvamento de estado), usado como base para a recuperação por retorno, que permite ao sistema voltar a um estado consistente, minimizando a perda de dados, após uma falha no nó central do MyGrid. Assim, ele salva, de forma estável, o estado da aplicação (estruturas de dados e informações de controle imprescindíveis) capaz de restaurar o sistema após o colapso, oferecendo uma alternativa à sua característica de ponto único de falhas. Os checkpoints são obtidos e salvos a cada mudança de estado do escalonador de tarefas do nó central. A eficiência do mecanismo de recuperação é comprovada através de experimentos que exercitam este mecanismo em cenários com diferentes características, visando validar e avaliar o impacto real no desempenho do MyGrid. / The grid computing has emerged as an important research area because it allows sharing geographically distributed computing resources among several users. However, resources in a grid are highly heterogeneous and dynamic, turning complex the development and the execution of applications. OurGrid is a software platform that intends to reduce these difficulties. Besides allowing the execution of distributed applications in grid environments, it offers and gives support to an exchange of favors between users. In this way, institutions (or users) that have idle resources can offer them to other users. The more resources a domain offers to the grid, the more it will be favored when in need. It will have higher priority when requesting machines to grid. MyGrid software is the main component of OurGrid: it constitutes the interface for user interaction as well as application submission and management. In the execution model of MyGrid, tasks are launched by a central node (home-machine), which manages the scheduling of tasks to be executed in the grid. This node constitutes a "single point of failure", because its crash causes the loss of results of the previous processing. Depending on the particular applications, this loss can be the result of hours or days of processing time. This dissertation aims to reduce the consequences of this problem offering an alternative to the single point of failure: here is proposed and implemented a checkpointing mechanism, used as basis for the rollback recovery. Checkpoints are taken synchronously with the state changes of the scheduler on the central node. After a failure affecting the home-machine of MyGrid, the system recovers information on the state of the application (data structures and essential control information) and results of previous computation, saved in stable storage, minimizing the loss of data. The efficiency of the recovery mechanism and its impact over MyGrid are evaluated through experiments that exercise this mechanism in scenarios with different characteristics. Computação móvel Tolerancia : Falhas Processamento distribuido Grid computation Fault tolerance Rollback-recovery Checkpointing OurGrid
218	Three different techniques to cope with radiation effects and component variability in future technologies Schüler, Erik January 2007 (has links) Existe um consenso de que os transistores CMOS irão em breve ultrapassar a barreira nanométrica, permitindo a inclusão de um enorme número desses componentes em uma simples pastilha de silício, mais ainda do que a grande densidade de integração vista atualmente. Entretanto, também tem sido afirmado que este desenvolvimento da tecnologia trará juntamente conseqüências indesejáveis em termos de confiabilidade. Neste trabalho, três aspectos da evolução tecnológica serão enfatizados: redução do tamanho dos transistores, aumento da freqüência de relógio e variabilidade de componentes analógicos. O primeiro aspecto diz respeito à ocorrência de Single Event Upsets (SEU), uma vez que a carga armazenada nos nós dos circuitos é cada vez menor, tornando o circuito mais suscetível a esses tipos de eventos, principalmente devido à incidência de radiação. O segundo aspecto é também relacionado ao choque de partículas radioativas no circuito. Neste caso, dado que o período de relógio tem se tornado menor, os Single Event Transients (SET) podem ser capturados por um latch, e interpretado como uma inversão de estado em um determinado bit. Finalmente, o terceiro aspecto lida com a variabilidade de componentes analógicos, a qual tende a aumentar a distância entre o projeto e o teste analógico e o digital. Pensando nesses três problemas, foram propostas três diferentes soluções para lidar com eles. Para o problema do SEU, um novo paradigma foi proposto: ao invés do uso de redundância de hardware ou software, um esquema de redundância de sinal foi proposto através de uso de sinais modulados em sigma-delta. No caso do SET, foi proposta uma solução para o esquema de Triple Modular Redundancy (TMR), onde o votador digital é substituído por um analógico, reduzindo assim as chances de ocorrência de SET. Para concluir, para a variabilidade de componentes analógicos, foi proposto um filtro de sinal misto no qual os componentes analógicos críticos são substituídos por partes digitais, permitindo um esquema de teste completamente digital, uma fácil substituição de partes defeituosas e um aumento de produtividade. / It has been a consensus that CMOS transistor gate length will soon overcome the nanometric barrier, allowing the inclusion of a huge number of these devices on a single die, even more than the enormous integration density shown these days. Nevertheless, it has also been claimed that this technology development will bring undesirable consequences as well, for what regards reliability. In this work, three aspects of technology evolution will be emphasized: transistor size shrinking, clock frequency increase and analog components variability. The first aspect concerns the occurrence of Single Event Upsets (SEU), since the charge stored in the circuit nodes becomes ever smaller, making the circuit more susceptible to this kind of events, mainly due to radiation incidence. The second aspect is also related to the hit of radiation particles in the circuit. In this case, since clock period becomes smaller, Single Event Transients (SET) may cross the entire circuit and can possibly be latched and interpreted as a state inversion of a certain bit. Finally, the third aspect deals with the analog components variability, which tends to increase the gap between the analog and digital design and test. Thinking about these three problems, we have proposed three different solutions to deal with them. To the SEU problem, a new paradigm has been proposed: instead of hardware or software redundancy, a signal redundancy approach has been proposed through the use of sigma-delta modulated signals. In the SET case, we have proposed a solution for the Triple Modular Redundancy (TMR) approach, where the digital voter is substituted by an analog one, thus reducing the chances of SET occurrence. To conclude, for the analog components variability, we have proposed a mixed-signal filter solution where critical analog components are substituted by digital parts, allowing a complete digital test approach, an easy faulty parts replacement and yield increase. Transistores Tolerância a falhas Confiabilidade SEU SET Components variability Reliability Fault tolerance Yield
219	Integrando injeção de falhas ao perfil UML 2.0 de testes / Integrating fault injection to the UML 2.0 testing profile Gerchman, Júlio January 2008 (has links) Mecanismos de tolerância a falhas são implementados em sistemas computacionais para atingir níveis de dependabilidade mais elevados. O teste desses mecanismos é essencial para validar seu funcionamento e demonstrar sua eficácia. Uma técnica de teste usada nesse caso é a injeção de falhas: uma simulação ou protótipo funcional é executado em um ambiente onde falhas são artificialmente emuladas e o sistema monitorado de forma a entender seu comportamento, bem como avaliar a eficiência da implementação dos mecanismos de tolerância. Descrever as atividades de teste usando modelos é útil para a documentação do sistema. O Perfil UML 2.0 de Testes (U2TP) é uma linguagem padronizada para a descrição de modelos de testes, possibilitando a representação de ambientes e atividades de verificação e validação. No entanto, U2TP não oferece elementos para suportar técnicas de injeção de falhas. Este trabalho apresenta U2TP-FI, uma extensão do Perfil UML 2.0 de Testes para a descrição de atividades de teste que usem técnicas de injeção de falhas. U2TP-FI é uma linguagem de modelagem que oferece elementos para representar as falhas a serem emuladas em um ambiente de teste, descrevendo os parâmetros que regem seu comportamento, suas condições de ativação e suas relações com os componentes do sistema. O estabelecimento dessa linguagem permite uma melhor visualização da atividade, um melhor projeto do teste e uma fácil documentação do projeto. Além disso, possibilita a criação de ferramentas para automação do processo de injeção de falhas. Como prova de conceito para demonstrar a viabilidade da proposta, foram desenvolvidos usando U2TP-FI modelos de teste para a injeção de falhas em aplicações usando injetores existentes. Ferramentas de transformação de modelos foram aplicadas para gerar de forma automatizada artefatos a serem usados na atividade, como cargas de falhas e relatórios. / Computer systems use fault tolerance mechanisms to reach higher dependability levels. Testing those mechanisms is essential for the validation of their proper operation and for the verification of their effectiveness. Fault injection is a technique for testing fault tolerance mechanisms: a simulation or a functional prototype of the system is executed in a testbed environment where faults are artificially emulated. Monitoring its behavior enables the validation of the implementation and the evaluation of the efficiency of the fault tolerance mechanisms. It is useful for documenting the system to describe the test activities using models. The UML 2.0 Testing Profile is a standard language to create test models, enabling the test engineer to describe the environment, data, components, validations and other elements of the activity. However, U2TP does not offer elements that support fault injection techniques. This work presents U2TP-FI, an extension of the UML 2.0 Testing Profile to model test activities that use fault injection techniques. U2TP-FI is a modeling language offering elements to represent the faults to be emulated on a test environment, describing the parameters which govern their behavior, the activation conditions and the relations between them and the system components. Using this language allows a better visualization of the test activity, a better test project and an easier project documentation. Besides, it enables the development of automation tools for the fault injection process. As a proof of concept to demonstrate the viability of the proposal, U2TP-FI was used to create test models for applications using existing fault injectors. Model transformation tools were applied to automatically generate test artifacts such as faultloads and reports. Processamento distribuido Injecao : Falhas Fault tolerance Fault injection Software testing Software engineering
220	Dealing with radiation induced long duration transient faults in future technologies / Lidando com falhas transitórias de longa duração provocadas por radiação em tecnologias futuras Lisboa, Carlos Arthur Lang January 2009 (has links) Com a evolução da tecnologia, dispositivos menores e mais rápidos ficam disponíveis para a fabricação de circuitos que, embora sejam mais eficientes, são mais sensíveis aos efeitos da radiação. A alta densidade, ao reduzir a distância entre dispositivos vizinhos, torna possível a ocorrência de múltiplas perturbações como resultado da colisão de uma única partícula. A alta velocidade, ao reduzir os ciclos de relógio dos circuitos, faz com que os pulsos transientes durem mais do que um ciclo. Todos estes fatos impedem o uso de diversas técnicas de mitigação existentes, baseadas em redundância temporal, e tornam necessário o desenvolvimento de técnicas inovadoras para fazer frente a este novo e desafiador cenário. Esta tese inicia com a análise da evolução da duração de pulsos transitórios nas diferentes tecnologias que dá suporte à previsão de que transitórios de longa duração (TLDs) irão afetar sistemas fabricados usando tecnologias futuras e mostra que diversas técnicas de mitigação baseadas em redundância temporal existentes não serão capazes de lidar com os TLDs devido à enorme sobrecarga que elas imporiam ao desempenho. Ao mesmo tempo, as técnicas baseadas em redundância temporal, embora sejam capazes de lidar com TLDs, ainda impõem penalidades muito elevadas em termos de área e energia, o que as torna inadequadas para uso em algumas áreas de aplicação, como as de sistemas portáteis e embarcados. Como uma alternativa para enfrentar estes desafios impostos aos projetistas pelas tecnologias futuras, é proposto o desenvolvimento de técnicas de mitigação com baixa sobrecarga, atuando em níveis de abstração distintos. Exemplos de novas técnicas de baixo custo atuando nos níveis de circuito, algoritmo e arquitetura são apresentados e avaliados. Atuando em nível de algoritmo, uma alternativa de baixo custo para verificação de multiplicação de matrizes é proposta e avaliada, mostrando-se que ela oferece uma boa solução para este problema específico, com uma enorme redução no custo de recomputação quando um erro em um elemento da matriz produto é detectado. Para generalizar esta idéia, o uso de invariantes de software na detecção de erros transitórios durante a execução é sugerido como outra técnica de baixo custo, e é mostrado que esta oferece alta capacidade de detecção de falhas, sendo, portanto, uma boa candidata para uso de maneira complementar com outras técnicas no desenvolvimento de software tolerante a falhas transitórias. Como exemplo de uma técnica em nível de arquitetura, é proposta e avaliada uma melhoria da clássica técnica de lockstep com checkpoint e rollback, mostrando uma redução significativa no número de operações de escrita necessárias para um checkpoint. Finalmente, como um exemplo de técnica de baixo custo baseada em redundância espacial, é proposto e avaliado o uso de código de Hamming na proteção de lógica combinacional, um problema ainda em aberto no projeto de sistemas usando tecnologias futuras. / As the technology evolves, faster and smaller devices are available for manufacturing circuits that, while more efficient, are more sensitive to the effects of radiation. The high transistor density, reducing the distance between neighbor devices, makes possible the occurrence of multiple upsets caused by a single particle hit. The achievable high speed, reducing the clock cycles of circuits, leads to transient pulses lasting longer than one cycle. All those facts preclude the use of several existing soft error mitigation techniques based on temporal redundancy, and require the development of innovative fault tolerant techniques to cope with this challenging new scenario. This thesis starts with the analysis of the transient width scaling across technologies, a fact that supports the prediction that long duration transients (LDTs) will affect systems manufactured using future technologies, and shows that several existing mitigation techniques based on temporal redundancy will not be able to cope with LDTs, due to the huge performance overhead that they would impose. At the same time, space redundancy based techniques, despite being able to deal with LDTs, still impose very high area and power penalties, making them inadequate for use in some application areas, such as portable and embedded systems. As an alternative to face those challenges imposed to designers by future technologies, the development of low overhead mitigation techniques, working at different abstraction levels, is proposed. Examples of new low cost techniques working at the circuit, algorithm, and architecture levels are presented and evaluated. Working at the algorithm level, a low cost verification algorithm for matrix multiplication is proposed and evaluated, showing that it provides a good solution for this specific problem, with dramatic reduction in the cost of recomputation when an error in one of the product matrix elements is detected. In order to generalize this idea, the use of software invariants to detect soft errors at runtime is suggested as a low cost technique, and shown to provide high fault detection capability, being a good candidate for use in a complementary fashion in the development of software tolerant to transient faults. As an example of architecture level technique, the improvement of the classic lockstep with checkpoint and rollback technique is proposed and evaluated, showing significant reduction in the number of write operations required for checkpoints. Finally, as an example of low cost space redundancy technique at circuit level, the use of Hamming coding to protect combinational logic, an open issue in the design of systems using future technologies, is proposed and evaluated through its application to a set of arithmetic and benchmark circuits. Microeletrônica Deteccao : Erros Tolerancia : Falhas Fault tolerance Radiation effects Low cost techniques

Search results