Spelling suggestions: "subject:"multiinjection"" "subject:"earlyinjection""
61 |
High Speed Clock GlitchingDesiraju, Santosh 18 February 2015 (has links)
No description available.
|
62 |
Methods for Securing the Integrity of FPGA ConfigurationsWebb, James Braxton 18 October 2006 (has links)
As Field Programmable Gate Arrays (FPGAs) continue to become integral parts of embedded systems, it is imperative to consider their security. While much of the research in this field is oriented toward the protection of the intellectual property contained in the FPGA's configuration, the protection of the design's integrity from malicious attack against the configuration is critical to the operation of the system. Methods for attacking the configuration are semi-invasive attacks, such as fault injection, and data tampering of incoming partial bitstreams.
This thesis introduces methods for securing the integrity of an FPGA's configuration. The design and implementation is discussed for a system that consists of three parts. The first subsystem monitors the running configuration. The second subsystem authenticates partial bistreams that may be used for repairing the configuration from malicious alterations during run-time. The third subsystem indicates if the system itself succumbs to a malicious attack. The system is implemented on-chip, allowing the FPGA to effectively secure itself from attack. / Master of Science
|
63 |
Test et évaluation de la robustesse de la couche fonctionnelle d'un robot autonome / Test and Evaluation of the Robustness of the Functional Layer of an Autonomous RobotChu, Hoang-Nam 01 September 2011 (has links)
La mise en oeuvre de systèmes autonomes nécessite le développement et l'utilisation d'architectures logicielles multi-couches qui soient adaptées. Typiquement, une couche fonctionnelle renferme des modules en charge de commander les éléments matériels du système et de fournir des services élémentaires. Pour être robuste, la couche fonctionnelle doit être dotée de mécanismes de protection vis-à-vis de requêtes erronées ou inopportunes issues de la couche supérieure. Nous présentons une méthodologie pour tester la robustesse de ces mécanismes. Nous définissons un cadre général pour évaluer la robustesse d'une couche fonctionnelle par la caractérisation de son comportement vis-à-vis de requêtes inopportunes. Nous proposons également un environnement de validation basé sur l'injection de fautes dans le logiciel de commande d'un robot simulé. Un grand nombre de cas de tests est généré automatiquement par la mutation d'une séquence de requêtes valides. Les statistiques descriptives des comportements en présence de requêtes inopportunes sont analysées afin d'évaluer la robustesse du système sous test. / The implementation of autonomous systems requires the development and the using of multi-layer software architecture. Typically, a functional layer contains several modules that control the material of the system and provide elementary services. To be robust, the functional layer must be implemented with protection mechanisms with respect to erroneous or inopportune requests sent from the superior layer. We present a methodology for robustness testing these mechanisms. We define a general framework to evaluate the robustness of a functional layer by characterizing its behavior with respect to inappropriate requests. We also propose an validation environment based on fault injection in the control software of a simulated robot. A great number of test cases is generated automatically by the mutation of a sequence of valid requests. The descriptive statistics of the behaviors in the presence of inappropriate requests are analyzed in order to evaluate the robustness of the system under test.
|
64 |
Hardening strategies for HPC applications / Estratégias de enrobustecimento para aplicações PADOliveira, Daniel Alfonso Gonçalves de January 2017 (has links)
A confiabilidade de dispositivos de Processamentos de Alto Desempenho (PAD) é uma das principais preocupações dos supercomputadores hoje e para a próxima geração. De fato, o alto número de dispositivos em grandes centros de dados faz com que a probabilidade de ter pelo menos um dispositivo corrompido seja muito alta. Neste trabalho, primeiro avaliamos o problema realizando experimentos de radiação. Os dados dos experimentos nos dão uma taxa de erro realista de dispositivos PAD. Além disso, avaliamos um conjunto representativo de algoritmos que derivam entendimentos gerais de algoritmos paralelos e a confiabilidade de abordagens de programação. Para entender melhor o problema, propomos uma nova metodologia para ir além da quantificação do problema. Qualificamos o erro avaliando a importância de cada execução corrompida por meio de um conjunto dedicado de métricas. Mostramos que em relação a computação imprecisa, a simples detecção de incompatibilidade não é suficiente para avaliar e comparar a sensibilidade à radiação de dispositivos e algoritmos PAD. Nossa análise quantifica e qualifica os efeitos da radiação na saída das aplicações, correlacionando o número de elementos corrompidos com sua localidade espacial. Também fornecemos o erro relativo médio (em nível do conjunto de dados) para avaliar a magnitude do erro induzido pela radiação. Além disso, desenvolvemos um injetor de falhas, CAROL-FI, para entender melhor o problema coletando informações usando campanhas de injeção de falhas, o que não é possível através de experimentos de radiação. Injetamos diferentes modelos de falha para analisar a sensitividade de determinadas aplicações. Mostramos que partes de aplicações podem ser classificadas com diferentes criticalidades. As técnicas de mitigação podem então ser relaxadas ou enrobustecidas com base na criticalidade de partes específicas da aplicação. Este trabalho também avalia a confiabilidade de seis arquiteturas diferentes, variando de dispositivos PAD a embarcados, com o objetivo de isolar comportamentos dependentes de código e arquitetura. Para esta avaliação, apresentamos e discutimos experimentos de radiação que abrangem um total de mais de 352.000 anos de exposição natural e análise de injeção de falhas com base em um total de mais de 120.000 injeções. Por fim, as estratégias de ECC, ABFT e de duplicação com comparação são apresentadas e avaliadas em dispositivos PAD por meio de experimentos de radiação. Apresentamos e comparamos a melhoria da confiabilidade e a sobrecarga imposta das soluções de enrobustecimento selecionadas. Em seguida, propomos e analisamos o impacto do enrobustecimento seletivo para algoritmos de PAD. Realizamos campanhas de injeção de falhas para identificar as variáveis de código-fonte mais críticas e apresentamos como selecionar os melhores candidatos para maximizar a relação confiabilidade/sobrecarga. / HPC device’s reliability is one of the major concerns for supercomputers today and for the next generation. In fact, the high number of devices in large data centers makes the probability of having at least a device corrupted to be very high. In this work, we first evaluate the problem by performing radiation experiments. The data from the experiments give us realistic error rate of HPC devices. Moreover, we evaluate a representative set of algorithms deriving general insights of parallel algorithms and programming approaches reliability. To understand better the problem, we propose a novel methodology to go beyond the quantification of the problem. We qualify the error by evaluating the criticality of each corrupted execution through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. Furthermore, we designed a homemade fault-injector, CAROL-FI, to understand further the problem by collecting information using fault injection campaigns that is not possible through radiation experiments. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions. This work also evaluates the reliability behaviors of six different architectures, ranging from HPC devices to embedded ones, with the aim to isolate code- and architecturedependent behaviors. For this evaluation, we present and discuss radiation experiments that cover a total of more than 352,000 years of natural exposure and fault-injection analysis based on a total of more than 120,000 injections. Finally, Error-Correcting Code, Algorithm-Based Fault Tolerance, and Duplication With Comparison hardening strategies are presented and evaluated on HPC devices through radiation experiments. We present and compare both the reliability improvement and imposed overhead of the selected hardening solutions. Then, we propose and analyze the impact of selective hardening for HPC algorithms. We perform fault-injection campaigns to identify the most critical source code variables and present how to select the best candidates to maximize the reliability/overhead ratio.
|
65 |
FITT : fault injection test tool to validate safety communication protocols / FITT : a fault injection tool to validate safety communication protocols / Uma ferramenta de injeção de falhas para validar protocolos de comunicação segurosDobler, Rodrigo Jaureguy January 2016 (has links)
Protocolos de comunicação seguros são essenciais em ambientes de automação industrial, onde falhas não detectadas na comunicação de dispositivos podem provocar danos irreparáveis à vida ou ao meio-ambiente. Esses protocolos seguros devem ser desenvolvidos de acordo com alguma norma de segurança, como a IEC 61508. Segundo ela, faz parte do processo de implementação destes protocolos, a escolha de técnicas adequadas de validação, entre elas a injeção de falhas, a qual deve considerar um modelo de falhas apropriado ao ambiente de operação do protocolo. Geralmente, esses ambientes são caracterizados pela existência de diversas formas de interferência elétrica e eletromagnética, as quais podem causar falhas nos sistemas eletrônicos existentes. Nos sistemas de comunicação de dados, isto pode levar a destruição do sinal de dados e causar estados de operação equivocados nos dispositivos. Assim, é preciso utilizar uma técnica de injeção de falhas que permita simular os tipos de erros de comunicação que podem ocorrer nos ambientes industriais. Dessa forma, será possível verificar o comportamento dos mecanismos de tolerância falhas na presença de falhas e assegurar o seu correto funcionamento. Para esta finalidade, este trabalho apresenta o desenvolvimento do injetor de falhas FITT para validação de protocolos de comunicação seguros. Esta ferramenta foi desenvolvida para ser utilizada com o sistema operacional Linux. O injetor faz uso do PF_RING, um módulo para o Kernel do Linux, que é responsável por realizar a comunicação direta entre as interfaces de rede e o injetor de falhas. Assim os pacotes não precisam passar pelas estruturas do Kernel do Linux, evitando que atrasos adicionais sejam inseridos no processo de recebimento e envio de mensagens. As funções de falhas desenvolvidas seguem o modelo de falhas de comunicação descrito na norma IEC 61508. Esse modelo é composto pelos erros de repetição, perda, inserção, sequência incorreta, endereçamento, corrupção de dados, atraso, mascaramento e falhas de memória em switches. / Safe communication protocols are essential in industrial automation environments, where undetected failures in the communication of devices can cause irreparable damage to life or to the environment. These safe protocols must be developed according to some safety standard, like IEC 61508. According to it, part of the process of implementing these protocols is to select appropriate techniques for validation, including the fault injection, which should consider an appropriate fault model for the operating environment of the protocol. Generally, these environments are characterized by the existence of various forms of electric and electromagnetic interference, which can cause failures in existing electronic systems. In data communication systems, this can lead to the destruction of the data signal and cause erroneous operation states in the devices. Thus, it is necessary to use a fault injection technique that allows simulating the types of communication errors that may occur in industrial environments. So, it will be possible to verify the behavior of the fault tolerance mechanisms in the presence of failures and ensure its correct functioning. For this purpose, this work presents the development of FITT fault injector for validation of safety communication protocols. This tool was developed to be used with Linux operating system. The fault injector makes use of PF_RING, a module for the Linux Kernel and that is responsible to perform the direct communication between the network interfaces and the fault injector. Thus the packages do not need to go through the Linux Kernel structures, avoiding additional delays to be inserted into the process of receiving and sending messages. The developed fault injection functions follow the communication fault model described in the IEC61508 standard, composed by the errors of repetition, loss, insertion, incorrect sequence, addressing, data corruption, delay, masking and memory failures within switches. The fault injection tests applied with this model allow to properly validate the fault tolerance mechanisms of safety protocols.
|
66 |
Teste integrado de software e hardware : reusando casos de teste de software em teste de microprocessadores / Integrated test of software and hardware: reusing software test cases to test of microprocessorMeirelles, Paulo Roberto Miranda January 2008 (has links)
Sistemas embarcados estão mais complexos e são cada vez mais utilizados em contextos que exigem muitos recursos computacionais. Isso significa que o hardware embarcado pode ser composto por vários processadores, memórias, partes reconfiguráveis e ASIPs integrados em um único silício. Adicionalmente, o software embarcados pode conter muitas rotinas de programação executadas sob restrição de processamento e memória. Esse cenário estabelece uma forte dependência entre o hardware e o software embarcado. Portanto, o teste de um sistema embarcado compreende o teste do hardware e do software. Neste contexto, a reutilização de procedimentos e estruturas de teste é um caminho para se reduzir o tempo de desenvolvimento e execução dos testes. Neste trabalho é apresentado um método de teste integrado de hardware e software. Nesse método, casos de teste desenvolvidos para testar o software embarcado também são usados para testar o seu processador. Comparou-se os custos e cobertura de falhas do método proposto com técnicas de auto-teste funcional. Os resultados experimentais demonstraram que foi possível reduzir os custos de aplicação e geração do teste do sistema usando um método de teste integrado de software e hardware. / Embedded Systems are more complexity. Nowadays, they are used in context that requires computational resources. This means an embedded hardware may be compound of several processors, memories, reconfigurable parts, and ASICs integrated in a single die. Additionally, an embedded software has a lot of programming procedures, which is under processing and memory constraints. This scenario provides a stronger connection between hardware and software. Therefore, the test of an embedded system is the test of both, hardware and software. In this context, reuse of testing structures and procedures is one way to reduce the test development time and execution. This work presents an integrated test of software and software method. In this method, test cases developed to test the embedded software are also used to test its processor. We compared the costs and fault coverage of our proposed method with techniques of functional self-test. The experimental results show that it is possible to reduce the implementation and test generation costs using an integrated test of software and hardware.
|
67 |
Compiler-Assisted Software Fault Tolerance for MicrocontrollersBohman, Matthew Kendall 01 March 2018 (has links)
Commercial off-the-shelf (COTS) microcontrollers can be useful for non-critical processing on spaceborne platforms. Many of these microprocessors are inexpensive and consume little power. However, the software running on these processors is vulnerable to radiation upsets, which can cause unpredictable program execution or corrupt data. Space missions cannot allow these errors to interrupt functionality or destroy gathered data. As a result, several techniques have been developed to reduce the effect of these upsets. Some proposed techniques involve altering the processor hardware, which is impossible for a COTS device. Alternately, the software running on the microcontroller can be modified to detect or correct data corruption. There have been several proposed approaches for software mitigation. Some take advantage of advanced architectural features, others modify software by hand, and still others focus their techniques on specific microarchitectures. However, these approaches do not consider the limited resources of microcontrollers and are difficult to use across multiple platforms. This thesis explores fully automated software-based mitigation to improve the reliability of microcontrollers and microcontroller software in a high radiation environment. Several difficulties associated with automating software protection in the compilation step are also discussed. Previous mitigation techniques are examined, resulting in the creation of COAST (COmpiler-Assisted Software fault Tolerance), a tool that automatically applies software protection techniques to user code. Hardened code has been verified by a fault injection campaign; the mean work to failure increased, on average, by 21.6x. When tested in a neutron beam, the neutron cross sections of programs decreased by an average of 23x, and the average mean work to failure increased by 5.7x.
|
68 |
Conception et prototypage d'architectures robustes de tags RFID UHF / Design and prototyping of robust architectures for UHF RFID TagsAbdelmalek, Omar 20 October 2016 (has links)
Les systèmes RFID sont de plus en plus utilisés dans des applications critiques fonctionnant dans des environnements perturbés (ferroviaire, aéronautique, chaînes de production ou agroalimentaire) ou dans des applications où la sécurité est essentielle (identification, lutte contre la contrefaçon). Pourtant, ces systèmes faibles coûts, initialement conçus pour des applications de masse non critiques, sont peu robustes par nature. Pour les applications critiques, les défaillances des puces RFID peuvent avoir des conséquences catastrophiques ou créer des failles de sécurité importantes. Ces défaillances peuvent avoir des origines nombreuses : par exemple, des origines matérielles dues au vieillissement naturel des circuits intégrés ou à des attaques (optiques, électromagnétiques, en tension). Il est donc d'usage dans les applications critiques d'accroître la robustesse des systèmes RFID par la mise en œuvre de redondance matérielle. Cependant cette redondance accroît le coût du déploiement des systèmes RFID ainsi que la complexité des protocoles et middleware associés. L'amélioration de la robustesse des tags permet de grandement limiter cette redondance. L'objectif de la thèse est d'accroitre la robustesse des tags UHF passifs en proposant et validant de nouvelles architectures numériques de puces RFID robustes à la fois aux défaillances et aux attaques matérielles. Les approches de durcissement des circuits intégrés étudient généralement leur robustesse par simulation et ce de manière indépendante à la validation de leur conception. La méthode la plus courante afin de valider la robustesse d'un circuit repose sur l'injection de fautes par simulation. Pour les puces RFID, ce type d'approche par simulation est problématique car les performances des puces dépendent de nombreux paramètres difficilement modélisables globalement. En effet, le fonctionnement d'un tag dépend de son environnement électromagnétique, du nombre de tags présents dans le système, des protocoles mis en œuvre. Aussi, nous avons développé une méthodologie basée sur le prototypage permettant d'éviter des simulations complexes et chronophages. La puce RFID prototype est alors implantée dans un FPGA. Ainsi, dès la phase de conception, cette puce peut être validée fonctionnellement dans un environnement réel. De plus, en utilisant différentes techniques d'instrumentation permettant l'injection de fautes dans les circuits numériques sur FPGA, il est alors possible d'analyser l'effet sur l'ensemble du système des fautes injectées dans le tag. Dans cette thèse, dans un premier temps, le prototype fonctionnel d'un tag RFID a été développé. Dans un second temps, ce prototype a été instrumenté pour pouvoir réaliser des injections de fautes en ligne ou hors ligne. Ensuite, le comportement du système RFID en présence de fautes dans ce tag RFID a été évalué. L'analyse des effets de ces fautes sur le système a permis de proposer, de mettre en œuvre et de valider de nouvelles architectures numériques de tags RFID robustes. Ce nouvel environnement de prototypage et d'injection de fautes a également permis de démontrer les effets de nouvelles attaques contre les systèmes RFID reposant sur l'insertion de tags fautifs ou malveillants dans les systèmes. Enfin, cette approche a permis d'évaluer les méthodes de détection des tags fautifs. / RFID tags are more and more used for critical applications within harsh environments (aeronautics, railways) or for secure applications such as identification, countermeasure against counterfeiting. However, such low cost systems, initially designed for non-critical applications with a high volume, are not robust by themselves. For critical applications, a malfunction of RFID chip may have serious consequences or induce a severe security breach for hackers. Dysfunctions can have many origins: for instance, hardware issues can be due to aging effects or can also be due to hackers attack such as optical or electromagnetic fault injection. It is thus a common practice for critical applications to increase the robustness of RFID system. The main purpose of this PhD Thesis is to increase UHF tags robustness by proposing new digital architectures of RFID chips which would be resilient against both hardware attacks and natural defects.Usual design techniques for robustness IC improvement consist in evaluating the design robustness by simulation and to do this independently of the design validation. The main technique for robustness evaluation is the simulation based faults injection. Within the RFID context such an approach only based on simulation has several drawbacks. In fact, simulations often are inaccurate because the system behavior relies on several parameters such as the global electromagnetic environment, the number of tags present in the reader field, the RFID protocol parameters.The purposes of this PhD are to develop a design method dedicated to RFID system based on hardware prototyping in order to avoid time consuming simulations and then to evaluate the design within a real environment.The hardware prototyping based on FPGA allows the design to be validated in a real environment. Moreover, using instrumentation techniques for fault injection within FPGA , it will be then possible to analyze the effects of faulty tags on the global system in terms of safety and security and then to propose countermeasures.In this thesis an FPGA based emulation platform called RFIM has been developed. This platform is compliant to EPC C1 Gen2 RFID standard. The RFID tag emulator has been validated functionally in a real environment. The RFIM platform uses the instrumentation technique for injecting faults in the digital tag circuit. Through fault injection campaigns RFIM platform can analyze the effect on the entire system of the faults injected into the tag, and ten validate new robust digital architectures.The RFIM platform has been used to demonstrate the effects of further attacks against RFID systems based on the insertion of faulty or malicious tag that contains a hardware Trojan. Finally, RFIM platform helps to develop countermeasures against the fault effects. These countermeasures have been implemented and tested in a real RFID environment with several tags and reader.
|
69 |
Analyse de code et processus d'évaluation des composants sécurisés contre l'injection de faute / Code analysis and evaluation process for vulnerability detection against fault injection on secure hardwareDureuil, Louis 12 October 2016 (has links)
Dans le domaine des cartes à puce, les analyses de vulnérabilité demandent d’être à la pointe de l’art en termes d’attaques et de techniques de protection. Une attaque classique est l’injection de fautes, réalisée au niveau matériel notamment par des techniques laser. Pour anticiper les impacts possibles de ce type d'attaque, certaines analyses sont menées au niveau logiciel. Il est donc fortement d’actualité de pouvoir définir des critères et proposer des outils automatiques permettant d’évaluer la robustesse d’une application à ce type d’attaque, d’autant plus que les techniques d’attaques matérielles permettent maintenant d’enchaîner plusieurs attaques (spatiales ou temporelles) au cours d’une exécution. En effet, des travaux de recherche récents évaluent l'impact des contre-mesures face à ce type d'attaque[1], ou tentent de modéliser les injections de faute au niveau C[2]. Le sujet de thèse proposé s'inscrit dans cette problématique, avec néanmoins la particularité novatrice de s'intéresser au couplage des analyses statique et dynamique dans le cas des injections de fautes effectuées au niveau binaire. Un des objectifs de la thèse est d'offrir un cadre paramétrable permettant de simuler des attaques par faute telles qu'elles peuvent être réalisées par le laboratoire CESTI-LETI au niveau matériel. Il faudra donc proposer un modèle intermédiaire générique permettant de spécifier des contraintes réelles comme par exemple les différents types de mémoires (RAM, EEPROM, ROM), qui peuvent induire des fautes permanentes ou volatiles. Concilier les analyses statiques du code et l'injection de fautes dynamiques devra permettre de maîtriser la combinatoire des exécutions et de guider l'analyse à l'aide de patterns d'attaques. À ce titre, on sera amené à proposer une taxonomie des attaques et de nouvelles modélisations d'attaques. Il faudra également adapter les outils d'analyse statique aux conséquences de l'injection dynamique de fautes, qui peut modifier profondément le code en changeant l'interprétation des instructions, ce qui a un effet similaire à la génération de code à l'exécution. Ce sujet de thèse s'inscrit dans la stratégie d'innovation du CESTI-LETI et pourra aboutir à un vérificateur automatique de code utilisable par les évaluateurs du CESTI-LETI. [1] A. Séré, J-L. Lanet et J. Iguchi-Cartigny. « Evaluation of Countermeasures Against Fault Attacks on Smart Cards ». en. In : International Journal of Security and Its Applications 5.2 (2011). [2] Xavier Kauffmann-Tourkestansky. « Analyses sécuritaires de code de carte à puce sous attaques physiques simulées ». Français. THESE. Université d’Orléans, nov. 2012. url : http://tel.archives-ouvertes.fr/tel-00771273. / Vulnerability detections for smart cards require state of the art methods both to attack and to protect the secure device. A typical type of attack is fault injection, most notably performed by means of laser techniques. To prevent some of the consequences of this kind of attacks, several analyses are conducted at the software level. Being able to define criteria and to propose automated tools that can survey the robustness of an application to fault injection is thus nowadays a hot topic, even more so since the hardware attack techniques allow today an attacker to perform several attacks in a single software execution. Indeed, recent research works evaluate the effectiveness of counter-measures against fault injection[1], or attempt to develop models of fault injection at the C level[2]. This thesis project addresses the issue of multiple faults injection, albeit by adding the distinctive aspect of static and dynamic analysis interaction in a context of binary-level fault injection. An objective of the thesis is to achieve a configurable framework to simulate fault injections in the way they are currently performed by the CESTI-LETI laboratory on the actual hardware. To do so we will develop a generic intermediate model that will allow us to specify hardware constraints, such as the various kinds of memories (RAM, EEPROM, ROM), whose different properties can induce either permanent or volatile faults. Combining the static code analysis with dynamic fault injections should prevent the combinatory explosion of the executiions while attack patterns will guide the analysis. A taxonomy of attacks and new attack modelisations could emerge from this work. An adaption of the tools for static analysis is also required, because dynamic fault injection can deeply change the code by modifying the interpretation of the instructions, in a similar manner to dynamic compilation. This thesis project falls within the CESTI-LETI's innovation strategy, et could lead to an automated code verifier that could be used by the CESTI-LETI evaluation specialists. [1] A. Séré, J-L. Lanet et J. Iguchi-Cartigny. « Evaluation of Countermeasures Against Fault Attacks on Smart Cards ». en. In : International Journal of Security and Its Applications 5.2 (2011). [2] Xavier Kauffmann-Tourkestansky. « Analyses sécuritaires de code de carte à puce sous attaques physiques simulées ». Français. THESE. Université d’Orléans, nov. 2012. url : http://tel.archives-ouvertes.fr/tel-00771273.
|
70 |
Hardening strategies for HPC applications / Estratégias de enrobustecimento para aplicações PADOliveira, Daniel Alfonso Gonçalves de January 2017 (has links)
A confiabilidade de dispositivos de Processamentos de Alto Desempenho (PAD) é uma das principais preocupações dos supercomputadores hoje e para a próxima geração. De fato, o alto número de dispositivos em grandes centros de dados faz com que a probabilidade de ter pelo menos um dispositivo corrompido seja muito alta. Neste trabalho, primeiro avaliamos o problema realizando experimentos de radiação. Os dados dos experimentos nos dão uma taxa de erro realista de dispositivos PAD. Além disso, avaliamos um conjunto representativo de algoritmos que derivam entendimentos gerais de algoritmos paralelos e a confiabilidade de abordagens de programação. Para entender melhor o problema, propomos uma nova metodologia para ir além da quantificação do problema. Qualificamos o erro avaliando a importância de cada execução corrompida por meio de um conjunto dedicado de métricas. Mostramos que em relação a computação imprecisa, a simples detecção de incompatibilidade não é suficiente para avaliar e comparar a sensibilidade à radiação de dispositivos e algoritmos PAD. Nossa análise quantifica e qualifica os efeitos da radiação na saída das aplicações, correlacionando o número de elementos corrompidos com sua localidade espacial. Também fornecemos o erro relativo médio (em nível do conjunto de dados) para avaliar a magnitude do erro induzido pela radiação. Além disso, desenvolvemos um injetor de falhas, CAROL-FI, para entender melhor o problema coletando informações usando campanhas de injeção de falhas, o que não é possível através de experimentos de radiação. Injetamos diferentes modelos de falha para analisar a sensitividade de determinadas aplicações. Mostramos que partes de aplicações podem ser classificadas com diferentes criticalidades. As técnicas de mitigação podem então ser relaxadas ou enrobustecidas com base na criticalidade de partes específicas da aplicação. Este trabalho também avalia a confiabilidade de seis arquiteturas diferentes, variando de dispositivos PAD a embarcados, com o objetivo de isolar comportamentos dependentes de código e arquitetura. Para esta avaliação, apresentamos e discutimos experimentos de radiação que abrangem um total de mais de 352.000 anos de exposição natural e análise de injeção de falhas com base em um total de mais de 120.000 injeções. Por fim, as estratégias de ECC, ABFT e de duplicação com comparação são apresentadas e avaliadas em dispositivos PAD por meio de experimentos de radiação. Apresentamos e comparamos a melhoria da confiabilidade e a sobrecarga imposta das soluções de enrobustecimento selecionadas. Em seguida, propomos e analisamos o impacto do enrobustecimento seletivo para algoritmos de PAD. Realizamos campanhas de injeção de falhas para identificar as variáveis de código-fonte mais críticas e apresentamos como selecionar os melhores candidatos para maximizar a relação confiabilidade/sobrecarga. / HPC device’s reliability is one of the major concerns for supercomputers today and for the next generation. In fact, the high number of devices in large data centers makes the probability of having at least a device corrupted to be very high. In this work, we first evaluate the problem by performing radiation experiments. The data from the experiments give us realistic error rate of HPC devices. Moreover, we evaluate a representative set of algorithms deriving general insights of parallel algorithms and programming approaches reliability. To understand better the problem, we propose a novel methodology to go beyond the quantification of the problem. We qualify the error by evaluating the criticality of each corrupted execution through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications’ output correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. Furthermore, we designed a homemade fault-injector, CAROL-FI, to understand further the problem by collecting information using fault injection campaigns that is not possible through radiation experiments. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions. This work also evaluates the reliability behaviors of six different architectures, ranging from HPC devices to embedded ones, with the aim to isolate code- and architecturedependent behaviors. For this evaluation, we present and discuss radiation experiments that cover a total of more than 352,000 years of natural exposure and fault-injection analysis based on a total of more than 120,000 injections. Finally, Error-Correcting Code, Algorithm-Based Fault Tolerance, and Duplication With Comparison hardening strategies are presented and evaluated on HPC devices through radiation experiments. We present and compare both the reliability improvement and imposed overhead of the selected hardening solutions. Then, we propose and analyze the impact of selective hardening for HPC algorithms. We perform fault-injection campaigns to identify the most critical source code variables and present how to select the best candidates to maximize the reliability/overhead ratio.
|
Page generated in 0.091 seconds