Spelling suggestions: "subject:"transient faults"" "subject:"ransient faults""
11 |
Surveillance comportementale de systèmes et logiciels embarqués par signature disjointe / Behavioral monitoring for embedded systems and software by disjoint signature analysisBergaoui, Selma 06 June 2013 (has links)
Les systèmes critiques, parmi lesquels les systèmes embarqués construits autour d'un microprocesseur mono-cœur exécutant un logiciel d'application, ne sont pas à l'abri d'interférences naturelles ou malveillantes qui peuvent provoquer des fautes transitoires. Cette thèse porte sur des protections qui peuvent être implantées pour détecter les effets de telles fautes transitoires sans faire d'hypothèses sur la multiplicité des erreurs générées. De plus, ces erreurs peuvent être soit des erreurs de flot de contrôle soit des erreurs sur les données. Une nouvelle méthode de vérification de flot de contrôle est tout d'abord proposée. Elle permet de vérifier, sans modifier le système initial, que les instructions du programme d'application sont lues sans erreur et dans le bon ordre. Les erreurs sur les données sont également prises en compte par une extension de la vérification de flot de contrôle. La méthode proposée offre un bon compromis entre les différents surcoûts, le temps de latence de détection et la couverture des erreurs. Les surcoûts peuvent aussi être ajustés aux besoins de l'application. La méthode est mise en œuvre sur un prototype, construit autour d'un microprocesseur Sparc v8. Les fonctions d'analyse de criticité développées dans le cadre de la méthodologie proposée sont également utilisées pour évaluer l'impact des options de compilation sur la robustesse intrinsèque du logiciel d'application. / Critical systems, including embedded systems built around a single core microprocessor running a software application, can be the target of natural or malicious interferences that may cause transient faults. This work focuses on protections that can be implemented to detect the effects of such transient faults without any assumption about the multiplicity of generated errors. In addition, those errors can be either control flow errors or data errors. A new control flow checking method is first proposed. It monitors, without modifying the original system, that the instructions of the microprocessor application program are read without error and in the proper order. Data errors are also taken into account by an extension of the control flow checking. The proposed method offers a good compromise between overheads, latency detection and errors coverage. Trade-offs can also be tuned according to the application constraints. The methodology is demonstrated on a prototype built around a Sparc v8 microprocessor. Criticality evaluation functions developed in the frame of the proposed methodology are also used to evaluate the impact of compilation options on the intrinsic robustness of the application software.
|
12 |
Determining One-Shot Control Criteria in Western North American Power Grid with Swarm OptimizationGregory Vaughan (6615489) 10 June 2019 (has links)
The power transmission network is stretched thin in Western North America. When generators or substations fault, the resultant cascading failures can diminish transmission capabilities across wide regions of the continent. This thesis examined several methods of<br><div>determining one-shot controls based on frequency decline in electrical generators to reduce the effect of one or more phase faults and tripped generators. These methods included criteria based on indices calculated from frequency measured at the controller location. These indices included criteria based on local modes and the rate of change of frequency.</div><br>This thesis primarily used particle swarm optimization (PSO) with inertia to determine a well-adapted set of parameters. The parameters included up to three thresholds for indices calculated from frequency. The researchers found that the best method for distinguishing between one or more phase faults used thresholds on two Fourier indices. Future lines of research regarding one-shot controls were considered.<br><div><br></div><div>A method that distinguished nearby tripped generators from one or more phase faults and load change events was proposed. This method used a moving average, a negative<br></div>threshold for control, and a positive threshold to reject control. The negative threshold for the moving average is met frequently during any large transient event. An additional index must be used to distinguish loss of generation events. This index is the maximum value of the moving average up to the present time and it is good for distinguishing loss of<br>generation events from transient swings caused by other events.<br><br><div>This thesis further demonstrated how well a combination of controls based on both rate of change of frequency and local modes reduces instability of the network as determined by both a reduction in RMSGA and control efficiency at any time after the events.</div><br>This thesis found that using local modes is generally useful to diagnose and apply one-shot controls when instability is caused by one or more phase faults, while when disconnected generators or reduced loads cause instability in the system, the local modes did not distinguish between loss of generation capacity events and reduced load events. Instead, differentiating based on the rate of change of frequency and an initial upward deflection of frequency or an initial downward deflection of frequency did distinguish between these types of events.
|
13 |
Selective software-implemented hardware fault tolerance tecnhiques to detect soft errors in processors with reduced overhead / Técnicas seletivas de tolerência a falhas em software com custo reduzido para detectar erros causados por falhas transientes em processadoresChielle, Eduardo January 2016 (has links)
A utilização de técnicas de tolerância a falhas em software é uma forma de baixo custo para proteger processadores contra soft errors. Contudo, elas causam aumento no tempo de execução e utilização de memória. Em consequência disso, o consumo de energia também aumenta. Sistemas que operam com restrição de tempo ou energia podem ficar impossibilitados de utilizar tais técnicas. Por esse motivo, este trabalho propoe técnicas de tolerância a falhas em software com custos no desempenho e memória reduzidos e cobertura de falhas similar a técnicas presentes na literatura. Como detecção é menos custoso que correção, este trabalho foca em técnicas de detecção. Primeiramente, um conjunto de técnicas de dados baseadas em regras de generalização, chamada VAR, é apresentada. As técnicas são baseadas nesse conjunto generalizado de regras para permitir uma investigação exaustiva, em termos de confiabilidade e custos, de diferentes variações de técnicas. As regras definem como a técnica duplica o código e insere verificadores. Cada técnica usa um diferente conjunto de regras. Então, uma técnica de controle, chamada SETA, é introduzida. Comparando SETA com uma técnica estado-da-arte, SETA é 11.0% mais rápida e ocupa 10.3% menos posições de memória. As técnicas de dados mais promissoras são combinadas com a técnica de controle com o objetivo de proteger tanto os dados quanto o fluxo de controle da aplicação alvo. Para reduzir ainda mais os custos, métodos para aplicar seletivamente as técnicas propostas foram desenvolvidos. Para técnica de dados, em vez de proteger todos os registradores, somente um conjunto de registradores selecionados é protegido. O conjunto é selecionado com base em uma métrica que analisa o código e classifica os registradores por sua criticalidade. Para técnicas de controle, há duas abordagens: (1) remover verificadores de blocos básicos, e (2) seletivamente proteger blocos básicos. As técnicas e suas versões seletivas são avaliadas em termos de tempo de execução, tamanho do código, cobertura de falhas, e o Mean Work to Failure (MWTF), o qual é uma métrica que mede o compromisso entre cobertura de falhas e tempo de execução. Resultados mostram redução dos custos sem diminuição da cobertura de falhas, e para uma pequena redução na cobertura de falhas foi possível significativamente reduzir os custos. Por fim, uma vez que a avaliação de todas as possíveis combinações utilizando métodos seletivos toma muito tempo, este trabalho utiliza um método para extrapolar os resultados obtidos por simulação com o objetivo de encontrar os melhores parâmetros para a proteção seletiva e combinada de técnicas de dados e de controle que melhorem o compromisso entre confiabilidade e custos. / Software-based fault tolerance techniques are a low-cost way to protect processors against soft errors. However, they introduce significant overheads to the execution time and code size, which consequently increases the energy consumption. System operation with time or energy restrictions may not be able to make use of these techniques. For this reason, this work proposes software-based fault tolerance techniques with lower overheads and similar fault coverage to state-of-the-art software techniques. Once detection is less costly than correction, the work focuses on software-based detection techniques. Firstly, a set of data-flow techniques called VAR is proposed. The techniques are based on general building rules to allow an exhaustive assessment, in terms of reliability and overheads, of different technique variations. The rules define how the technique duplicates the code and insert checkers. Each technique uses a different set of rules. Then, a control-flow technique called SETA (Software-only Error-detection Technique using Assertions) is introduced. Comparing SETA with a state-of-the-art technique, SETA is 11.0% faster and occupies 10.3% fewer memory positions. The most promising data-flow techniques are combined with the control-flow technique in order to protect both dataflow and control-flow of the target application. To go even further with the reduction of the overheads, methods to selective apply the proposed software techniques have been developed. For the data-flow techniques, instead of protecting all registers, only a set of selected registers is protected. The set is selected based on a metric that analyzes the code and rank the registers by their criticality. For the control-flow technique, two approaches are taken: (1) removing checkers from basic blocks: all the basic blocks are protected by SETA, but only selected basic blocks have checkers inserted, and (2) selectively protecting basic blocks: only a set of basic blocks is protected. The techniques and their selective versions are evaluated in terms of execution time, code size, fault coverage, and Mean Work To Failure (MWTF), which is a metric to measure the trade-off between fault coverage and execution time. Results show that was possible to reduce the overheads without affecting the fault coverage, and for a small reduction in the fault coverage it was possible to significantly reduce the overheads. Lastly, since the evaluation of all the possible combinations for selective hardening of every application takes too much time, this work uses a method to extrapolate the results obtained by simulation in order to find the parameters for the selective combination of data and control-flow techniques that are probably the best candidates to improve the trade-off between reliability and overheads.
|
14 |
Transient-fault robust systems exploiting quasi-delay insensitive asynchronous circuits / Sistemas robustos a falhas transientes explorando circuitos assíncronos quase-insensíveis aos atrasosBastos, Rodrigo Possamai January 2010 (has links)
Os circuitos integrados recentes baseados em tecnologias nanoeletrônicas estão significativamente mais vulneráveis a falhas transientes. Os erros gerados são assim também mais críticos do que eram antes. Esta tese apresenta uma nova virtude em termos de confiabilidade dos circuitos assíncronos quase-insensíveis aos atrasos (QDI): a sua grande habilidade natural para mitigar falhas transientes de longa duração, que são severas em circuitos síncronos modernos. Uma metodologia para avaliar comparativamente os efeitos de falhas transientes tanto em circuitos síncronos como em circuitos assíncronos QDI é apresentada. Além disso, um método para obter a habilidade de mitigação de falhas transientes dos elementos de memória de circuitos QDI (ou seja, os C-elements) é também proposto. Por fim, técnicas de mitigação são sugeridas para aumentar ainda mais a atenuação de falhas transientes por parte dos Celements e, por consequência, também a robustez dos sistemas assíncronos QDI. / Recent deep-submicron technology-based ICs are significantly more vulnerable to transient faults. The arisen errors are thus also more critical than they have ever been before. This thesis presents a further novel benefit of the Quasi-Delay Insensitive (QDI) asynchronous circuits in terms of reliability: their strong natural ability to mitigate longduration transient faults that are severe in modern synchronous circuits. A methodology to evaluate comparatively the transient-fault effects on synchronous and QDI asynchronous circuits is presented. Furthermore, a method to obtain the transient-fault mitigation ability of the QDI circuits’ memory elements (i.e., the C-elements) is also proposed. Finally, mitigation techniques are suggested to increase even more the Celements’ transient-fault attenuation, and thus also the QDI asynchronous systems’ robustness.
|
15 |
Selective software-implemented hardware fault tolerance tecnhiques to detect soft errors in processors with reduced overhead / Técnicas seletivas de tolerência a falhas em software com custo reduzido para detectar erros causados por falhas transientes em processadoresChielle, Eduardo January 2016 (has links)
A utilização de técnicas de tolerância a falhas em software é uma forma de baixo custo para proteger processadores contra soft errors. Contudo, elas causam aumento no tempo de execução e utilização de memória. Em consequência disso, o consumo de energia também aumenta. Sistemas que operam com restrição de tempo ou energia podem ficar impossibilitados de utilizar tais técnicas. Por esse motivo, este trabalho propoe técnicas de tolerância a falhas em software com custos no desempenho e memória reduzidos e cobertura de falhas similar a técnicas presentes na literatura. Como detecção é menos custoso que correção, este trabalho foca em técnicas de detecção. Primeiramente, um conjunto de técnicas de dados baseadas em regras de generalização, chamada VAR, é apresentada. As técnicas são baseadas nesse conjunto generalizado de regras para permitir uma investigação exaustiva, em termos de confiabilidade e custos, de diferentes variações de técnicas. As regras definem como a técnica duplica o código e insere verificadores. Cada técnica usa um diferente conjunto de regras. Então, uma técnica de controle, chamada SETA, é introduzida. Comparando SETA com uma técnica estado-da-arte, SETA é 11.0% mais rápida e ocupa 10.3% menos posições de memória. As técnicas de dados mais promissoras são combinadas com a técnica de controle com o objetivo de proteger tanto os dados quanto o fluxo de controle da aplicação alvo. Para reduzir ainda mais os custos, métodos para aplicar seletivamente as técnicas propostas foram desenvolvidos. Para técnica de dados, em vez de proteger todos os registradores, somente um conjunto de registradores selecionados é protegido. O conjunto é selecionado com base em uma métrica que analisa o código e classifica os registradores por sua criticalidade. Para técnicas de controle, há duas abordagens: (1) remover verificadores de blocos básicos, e (2) seletivamente proteger blocos básicos. As técnicas e suas versões seletivas são avaliadas em termos de tempo de execução, tamanho do código, cobertura de falhas, e o Mean Work to Failure (MWTF), o qual é uma métrica que mede o compromisso entre cobertura de falhas e tempo de execução. Resultados mostram redução dos custos sem diminuição da cobertura de falhas, e para uma pequena redução na cobertura de falhas foi possível significativamente reduzir os custos. Por fim, uma vez que a avaliação de todas as possíveis combinações utilizando métodos seletivos toma muito tempo, este trabalho utiliza um método para extrapolar os resultados obtidos por simulação com o objetivo de encontrar os melhores parâmetros para a proteção seletiva e combinada de técnicas de dados e de controle que melhorem o compromisso entre confiabilidade e custos. / Software-based fault tolerance techniques are a low-cost way to protect processors against soft errors. However, they introduce significant overheads to the execution time and code size, which consequently increases the energy consumption. System operation with time or energy restrictions may not be able to make use of these techniques. For this reason, this work proposes software-based fault tolerance techniques with lower overheads and similar fault coverage to state-of-the-art software techniques. Once detection is less costly than correction, the work focuses on software-based detection techniques. Firstly, a set of data-flow techniques called VAR is proposed. The techniques are based on general building rules to allow an exhaustive assessment, in terms of reliability and overheads, of different technique variations. The rules define how the technique duplicates the code and insert checkers. Each technique uses a different set of rules. Then, a control-flow technique called SETA (Software-only Error-detection Technique using Assertions) is introduced. Comparing SETA with a state-of-the-art technique, SETA is 11.0% faster and occupies 10.3% fewer memory positions. The most promising data-flow techniques are combined with the control-flow technique in order to protect both dataflow and control-flow of the target application. To go even further with the reduction of the overheads, methods to selective apply the proposed software techniques have been developed. For the data-flow techniques, instead of protecting all registers, only a set of selected registers is protected. The set is selected based on a metric that analyzes the code and rank the registers by their criticality. For the control-flow technique, two approaches are taken: (1) removing checkers from basic blocks: all the basic blocks are protected by SETA, but only selected basic blocks have checkers inserted, and (2) selectively protecting basic blocks: only a set of basic blocks is protected. The techniques and their selective versions are evaluated in terms of execution time, code size, fault coverage, and Mean Work To Failure (MWTF), which is a metric to measure the trade-off between fault coverage and execution time. Results show that was possible to reduce the overheads without affecting the fault coverage, and for a small reduction in the fault coverage it was possible to significantly reduce the overheads. Lastly, since the evaluation of all the possible combinations for selective hardening of every application takes too much time, this work uses a method to extrapolate the results obtained by simulation in order to find the parameters for the selective combination of data and control-flow techniques that are probably the best candidates to improve the trade-off between reliability and overheads.
|
16 |
Transient-fault robust systems exploiting quasi-delay insensitive asynchronous circuits / Sistemas robustos a falhas transientes explorando circuitos assíncronos quase-insensíveis aos atrasosBastos, Rodrigo Possamai January 2010 (has links)
Os circuitos integrados recentes baseados em tecnologias nanoeletrônicas estão significativamente mais vulneráveis a falhas transientes. Os erros gerados são assim também mais críticos do que eram antes. Esta tese apresenta uma nova virtude em termos de confiabilidade dos circuitos assíncronos quase-insensíveis aos atrasos (QDI): a sua grande habilidade natural para mitigar falhas transientes de longa duração, que são severas em circuitos síncronos modernos. Uma metodologia para avaliar comparativamente os efeitos de falhas transientes tanto em circuitos síncronos como em circuitos assíncronos QDI é apresentada. Além disso, um método para obter a habilidade de mitigação de falhas transientes dos elementos de memória de circuitos QDI (ou seja, os C-elements) é também proposto. Por fim, técnicas de mitigação são sugeridas para aumentar ainda mais a atenuação de falhas transientes por parte dos Celements e, por consequência, também a robustez dos sistemas assíncronos QDI. / Recent deep-submicron technology-based ICs are significantly more vulnerable to transient faults. The arisen errors are thus also more critical than they have ever been before. This thesis presents a further novel benefit of the Quasi-Delay Insensitive (QDI) asynchronous circuits in terms of reliability: their strong natural ability to mitigate longduration transient faults that are severe in modern synchronous circuits. A methodology to evaluate comparatively the transient-fault effects on synchronous and QDI asynchronous circuits is presented. Furthermore, a method to obtain the transient-fault mitigation ability of the QDI circuits’ memory elements (i.e., the C-elements) is also proposed. Finally, mitigation techniques are suggested to increase even more the Celements’ transient-fault attenuation, and thus also the QDI asynchronous systems’ robustness.
|
17 |
Selective software-implemented hardware fault tolerance tecnhiques to detect soft errors in processors with reduced overhead / Técnicas seletivas de tolerência a falhas em software com custo reduzido para detectar erros causados por falhas transientes em processadoresChielle, Eduardo January 2016 (has links)
A utilização de técnicas de tolerância a falhas em software é uma forma de baixo custo para proteger processadores contra soft errors. Contudo, elas causam aumento no tempo de execução e utilização de memória. Em consequência disso, o consumo de energia também aumenta. Sistemas que operam com restrição de tempo ou energia podem ficar impossibilitados de utilizar tais técnicas. Por esse motivo, este trabalho propoe técnicas de tolerância a falhas em software com custos no desempenho e memória reduzidos e cobertura de falhas similar a técnicas presentes na literatura. Como detecção é menos custoso que correção, este trabalho foca em técnicas de detecção. Primeiramente, um conjunto de técnicas de dados baseadas em regras de generalização, chamada VAR, é apresentada. As técnicas são baseadas nesse conjunto generalizado de regras para permitir uma investigação exaustiva, em termos de confiabilidade e custos, de diferentes variações de técnicas. As regras definem como a técnica duplica o código e insere verificadores. Cada técnica usa um diferente conjunto de regras. Então, uma técnica de controle, chamada SETA, é introduzida. Comparando SETA com uma técnica estado-da-arte, SETA é 11.0% mais rápida e ocupa 10.3% menos posições de memória. As técnicas de dados mais promissoras são combinadas com a técnica de controle com o objetivo de proteger tanto os dados quanto o fluxo de controle da aplicação alvo. Para reduzir ainda mais os custos, métodos para aplicar seletivamente as técnicas propostas foram desenvolvidos. Para técnica de dados, em vez de proteger todos os registradores, somente um conjunto de registradores selecionados é protegido. O conjunto é selecionado com base em uma métrica que analisa o código e classifica os registradores por sua criticalidade. Para técnicas de controle, há duas abordagens: (1) remover verificadores de blocos básicos, e (2) seletivamente proteger blocos básicos. As técnicas e suas versões seletivas são avaliadas em termos de tempo de execução, tamanho do código, cobertura de falhas, e o Mean Work to Failure (MWTF), o qual é uma métrica que mede o compromisso entre cobertura de falhas e tempo de execução. Resultados mostram redução dos custos sem diminuição da cobertura de falhas, e para uma pequena redução na cobertura de falhas foi possível significativamente reduzir os custos. Por fim, uma vez que a avaliação de todas as possíveis combinações utilizando métodos seletivos toma muito tempo, este trabalho utiliza um método para extrapolar os resultados obtidos por simulação com o objetivo de encontrar os melhores parâmetros para a proteção seletiva e combinada de técnicas de dados e de controle que melhorem o compromisso entre confiabilidade e custos. / Software-based fault tolerance techniques are a low-cost way to protect processors against soft errors. However, they introduce significant overheads to the execution time and code size, which consequently increases the energy consumption. System operation with time or energy restrictions may not be able to make use of these techniques. For this reason, this work proposes software-based fault tolerance techniques with lower overheads and similar fault coverage to state-of-the-art software techniques. Once detection is less costly than correction, the work focuses on software-based detection techniques. Firstly, a set of data-flow techniques called VAR is proposed. The techniques are based on general building rules to allow an exhaustive assessment, in terms of reliability and overheads, of different technique variations. The rules define how the technique duplicates the code and insert checkers. Each technique uses a different set of rules. Then, a control-flow technique called SETA (Software-only Error-detection Technique using Assertions) is introduced. Comparing SETA with a state-of-the-art technique, SETA is 11.0% faster and occupies 10.3% fewer memory positions. The most promising data-flow techniques are combined with the control-flow technique in order to protect both dataflow and control-flow of the target application. To go even further with the reduction of the overheads, methods to selective apply the proposed software techniques have been developed. For the data-flow techniques, instead of protecting all registers, only a set of selected registers is protected. The set is selected based on a metric that analyzes the code and rank the registers by their criticality. For the control-flow technique, two approaches are taken: (1) removing checkers from basic blocks: all the basic blocks are protected by SETA, but only selected basic blocks have checkers inserted, and (2) selectively protecting basic blocks: only a set of basic blocks is protected. The techniques and their selective versions are evaluated in terms of execution time, code size, fault coverage, and Mean Work To Failure (MWTF), which is a metric to measure the trade-off between fault coverage and execution time. Results show that was possible to reduce the overheads without affecting the fault coverage, and for a small reduction in the fault coverage it was possible to significantly reduce the overheads. Lastly, since the evaluation of all the possible combinations for selective hardening of every application takes too much time, this work uses a method to extrapolate the results obtained by simulation in order to find the parameters for the selective combination of data and control-flow techniques that are probably the best candidates to improve the trade-off between reliability and overheads.
|
18 |
Transient-fault robust systems exploiting quasi-delay insensitive asynchronous circuits / Sistemas robustos a falhas transientes explorando circuitos assíncronos quase-insensíveis aos atrasosBastos, Rodrigo Possamai January 2010 (has links)
Os circuitos integrados recentes baseados em tecnologias nanoeletrônicas estão significativamente mais vulneráveis a falhas transientes. Os erros gerados são assim também mais críticos do que eram antes. Esta tese apresenta uma nova virtude em termos de confiabilidade dos circuitos assíncronos quase-insensíveis aos atrasos (QDI): a sua grande habilidade natural para mitigar falhas transientes de longa duração, que são severas em circuitos síncronos modernos. Uma metodologia para avaliar comparativamente os efeitos de falhas transientes tanto em circuitos síncronos como em circuitos assíncronos QDI é apresentada. Além disso, um método para obter a habilidade de mitigação de falhas transientes dos elementos de memória de circuitos QDI (ou seja, os C-elements) é também proposto. Por fim, técnicas de mitigação são sugeridas para aumentar ainda mais a atenuação de falhas transientes por parte dos Celements e, por consequência, também a robustez dos sistemas assíncronos QDI. / Recent deep-submicron technology-based ICs are significantly more vulnerable to transient faults. The arisen errors are thus also more critical than they have ever been before. This thesis presents a further novel benefit of the Quasi-Delay Insensitive (QDI) asynchronous circuits in terms of reliability: their strong natural ability to mitigate longduration transient faults that are severe in modern synchronous circuits. A methodology to evaluate comparatively the transient-fault effects on synchronous and QDI asynchronous circuits is presented. Furthermore, a method to obtain the transient-fault mitigation ability of the QDI circuits’ memory elements (i.e., the C-elements) is also proposed. Finally, mitigation techniques are suggested to increase even more the Celements’ transient-fault attenuation, and thus also the QDI asynchronous systems’ robustness.
|
19 |
Efficient Fault Tolerance In Chip Multiprocessors Using Critical Value ForwardingSubramanyan, Pramod 06 1900 (has links) (PDF)
Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to transient faults, wear-out related permanent faults and process variations. Decreasing CMOS reliability implies that high-availability systems which were previously restricted to the domain of mainframe computers or specially designed fault-tolerant systems may be come important for the commodity market as well. In this thesis we tackle the problem of enabling efficient, low cost and configurable fault-tolerance using Chip Multiprocessors (CMPs).
Our work studies architectural fault detection methods based on redundant execution, specifically focusing on “leader-follower” architectures. In such architectures redundant execution is performed on two cores/threads of a CMP. One thread acts as the leading thread while the other acts as the trailing thread. The leading thread assists the execution of the trailing thread by forwarding the results of its execution. These forwarded results are used as predictions in the trailing thread and help improve its performance. In this thesis, we introduce a new form of execution assistance called critical value forwarding. Critical value forwarding uses heuristics to identify instructions on the critical path of execution and forwards the results of these instructions to the trailing core. The advantage of critical value forwarding is that it provides much of the speed up obtained by forwarding all values at a fraction of the bandwidth cost.
We propose two architectures to exploit the idea of critical value forwarding. The first of these operates the trailing core at lower voltage/frequency levels in order to provide energy-efficient redundant execution. In this context, we also introduce algorithms to dynamically adapt the voltage/frequency level of the trailing core based on program behavior. Our experimental evaluation shows that this proposal consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a mean performance overhead of about 1%. We compare our proposal to two previous energy-efficient fault-tolerant CMP proposals and find that our proposal delivers higher energy-efficiency and lower performance degradation than both while providing a similar level of fault coverage.
Our second proposal uses the idea of critical value forwarding to improve fault-tolerant CMP throughput. This is done by using coarse-grained multithreading to mul-tiplex trailing threads on a single core. Our evaluation shows that this architecture delivers 9–13% higher throughput than previous proposals, including one configuration that uses simultaneous multithreading(SMT) to multiplex trailing threads. Since this proposal increases fault-tolerant CMP throughput by executing multiple threads on a single core, it comes at a modest cost in single-threaded performance, a mean slowdown between11–14%.
|
20 |
Low Overhead Soft Error Mitigation MethodologiesPrasanth, V January 2012 (has links) (PDF)
CMOS technology scaling is bringing new challenges to the designers in the form of new failure modes. The challenges include long term reliability failures and particle strike induced random failures. Studies have shown that increasingly, the largest contributor to the device reliability failures will be soft errors. Due to reliability concerns, the adoption of soft error mitigation techniques is on the increase. As the soft error mitigation techniques are increasingly adopted, the area and performance overhead incurred in their implementation also becomes pertinent. This thesis addresses the problem of providing low cost soft error mitigation.
The main contributions of this thesis include, (i) proposal of a new delayed capture methodology for low overhead soft error detection, (ii) adopting Error Control Coding (ECC) for delayed capture methodology for correction of single event upsets, (iii) analyzing the impact of different derating factors to reduce the hardware overhead incurred by the above implementations, and (iv) proposal for hardware software co-design for reliability based upon critical component identification determined by the application executing on the hardware (as against standalone hardware analysis).
This thesis first surveys existing soft error mitigation techniques and their associated limitations. It proposes a new delayed capture methodology as a low overhead soft error detection technique. Delayed capture methodology is an enhancement of the Razor flip-flop methodology. In the delayed capture methodology, the parity for a set of flip-flops is calculated at their inputs and outputs. The input parity is latched on a second clock, which is delayed with respect to the functional clock by more than the soft error pulse width. It requires an extra flip-flop for each set of flip-flops. On the other hand, in the Razor flip-flop methodology an additional flip-flop is required for every functional flip-flop. Due to the skew in the clocks, either the parity flip-flop or the functional flip-flop will capture the effect of transient, and hence by comparing the output parity and latched input parity an error can be detected. Fault injection experiments are performed to evaluate the bneefits and limitations of the proposed approach.
The limitations include soft error detection escapes and lack of error correction capability. Different cases of soft error detection escapes are analyzed. They are attributed mainly to a Single Event Upset (SEU) causing multiple flip-flops within a group to be in error. The error space due to SEUs is analyzed and an intelligent flip-flop grouping method using graph theoretic formulations is proposed such that no SEU can cause multiple flip-flops within a group to be in error. Once the error occurs, leaving the correction aspects to the application may not be desirable. The proposed delayed capture methodology is extended to replace parity codes with codes having higher redundancy to enable correction. The hardware overhead due to the proposed methodology is analyzed and an area savings of about 15% is obtained when compared to an existing soft error mitigation methodology with equivalent coverage.
The impact of different derating factors in determining the hardware overhead due to the soft error mitigation methodology is then analyzed. We have considered electrical derating and timing derating information for the evaluation purpose. The area overhead of the circuit with implementation of delayed capture methodology, considering different derating factors standalone and in combination is then analyzed. Results indicate that in different circuits, either a combination of these derating factors yield optimal results, or each of them considered standalone. This is due to the dependency of the solution on the heuristic nature of the algorithms used. About 23% area savings are obtained by employing these derating factors for a more optimal grouping of flip-flops.
A new paradigm of hardware software co-design for reliability is finally proposed. This is based on application derating in which the application / firmware code is profiled to identify the critical components which must be guarded from soft errors. This identification is based on the ability of the application software to tolerate certain errors in hardware. An algorithm to identify critical components in the control logic based on fault injection is developed. Experimental results indicated that for a safety critical automotive application, only 12% of the sequential logic elements were found to be critical. This approach provides a framework for investigating how software methods can complement hardware methods, to provide a reduced hardware solution for soft error mitigation.
|
Page generated in 0.0927 seconds