Return to search

Low Overhead Soft Error Mitigation Methodologies

CMOS technology scaling is bringing new challenges to the designers in the form of new failure modes. The challenges include long term reliability failures and particle strike induced random failures. Studies have shown that increasingly, the largest contributor to the device reliability failures will be soft errors. Due to reliability concerns, the adoption of soft error mitigation techniques is on the increase. As the soft error mitigation techniques are increasingly adopted, the area and performance overhead incurred in their implementation also becomes pertinent. This thesis addresses the problem of providing low cost soft error mitigation.
The main contributions of this thesis include, (i) proposal of a new delayed capture methodology for low overhead soft error detection, (ii) adopting Error Control Coding (ECC) for delayed capture methodology for correction of single event upsets, (iii) analyzing the impact of different derating factors to reduce the hardware overhead incurred by the above implementations, and (iv) proposal for hardware software co-design for reliability based upon critical component identification determined by the application executing on the hardware (as against standalone hardware analysis).
This thesis first surveys existing soft error mitigation techniques and their associated limitations. It proposes a new delayed capture methodology as a low overhead soft error detection technique. Delayed capture methodology is an enhancement of the Razor flip-flop methodology. In the delayed capture methodology, the parity for a set of flip-flops is calculated at their inputs and outputs. The input parity is latched on a second clock, which is delayed with respect to the functional clock by more than the soft error pulse width. It requires an extra flip-flop for each set of flip-flops. On the other hand, in the Razor flip-flop methodology an additional flip-flop is required for every functional flip-flop. Due to the skew in the clocks, either the parity flip-flop or the functional flip-flop will capture the effect of transient, and hence by comparing the output parity and latched input parity an error can be detected. Fault injection experiments are performed to evaluate the bneefits and limitations of the proposed approach.
The limitations include soft error detection escapes and lack of error correction capability. Different cases of soft error detection escapes are analyzed. They are attributed mainly to a Single Event Upset (SEU) causing multiple flip-flops within a group to be in error. The error space due to SEUs is analyzed and an intelligent flip-flop grouping method using graph theoretic formulations is proposed such that no SEU can cause multiple flip-flops within a group to be in error. Once the error occurs, leaving the correction aspects to the application may not be desirable. The proposed delayed capture methodology is extended to replace parity codes with codes having higher redundancy to enable correction. The hardware overhead due to the proposed methodology is analyzed and an area savings of about 15% is obtained when compared to an existing soft error mitigation methodology with equivalent coverage.
The impact of different derating factors in determining the hardware overhead due to the soft error mitigation methodology is then analyzed. We have considered electrical derating and timing derating information for the evaluation purpose. The area overhead of the circuit with implementation of delayed capture methodology, considering different derating factors standalone and in combination is then analyzed. Results indicate that in different circuits, either a combination of these derating factors yield optimal results, or each of them considered standalone. This is due to the dependency of the solution on the heuristic nature of the algorithms used. About 23% area savings are obtained by employing these derating factors for a more optimal grouping of flip-flops.
A new paradigm of hardware software co-design for reliability is finally proposed. This is based on application derating in which the application / firmware code is profiled to identify the critical components which must be guarded from soft errors. This identification is based on the ability of the application software to tolerate certain errors in hardware. An algorithm to identify critical components in the control logic based on fault injection is developed. Experimental results indicated that for a safety critical automotive application, only 12% of the sequential logic elements were found to be critical. This approach provides a framework for investigating how software methods can complement hardware methods, to provide a reduced hardware solution for soft error mitigation.

Identiferoai:union.ndltd.org:IISc/oai:etd.ncsi.iisc.ernet.in:2005/3232
Date January 2012
CreatorsPrasanth, V
ContributorsSingh, Virendra, Parekhji, Rubin
Source SetsIndia Institute of Science
Languageen_US
Detected LanguageEnglish
TypeThesis
RelationG25445

Page generated in 0.0147 seconds