Error resilience has become as important as power and performance in modern computing architecture. There are various sources of errors that can paralyze real-world computing systems. Of particular interest to this dissertation are single-event errors. They can be the results of energetic particle strike or abrupt power outage that corrupts the program states leading to system failures. Specifically, energetic particle strike is the major cause of soft error while abrupt power outage can result in memory inconsistency in the nonvolatile memory systems.
Unfortunately, existing techniques to handle those single-event errors are either resource consuming (e.g., hardware approaches) or heavy-weight (e.g., software approaches). To address this problem, this dissertation identifies idempotent processing as an alternative recovery technique to handle the system failures in an efficient and low-cost manner. Then, this dissertation first proposes to design and develop a compiler-directed lightweight methodology which leverages idempotent processing and the state-of-the-art sensor-based detection to achieve soft error resilience at low-cost. This dissertation also introduces a lightweight soft error tolerant hardware design that redefines idempotent processing where the idempotent regions can be created, verified and recovered from the processor's point of view. Furthermore, this dissertation proposes a series of compiler optimizations that significantly reduce the hardware and runtime overhead of the idempotent processing. Lastly, this dissertation proposes a failure-atomic system integrated with idempotent processing to resolve another type of single-event error, i.e., failure-induced memory inconsistency in the nonvolatile memory systems. / Ph. D. / Our computing systems are vulnerable to different kinds of errors. All these errors can potentially crash real-world computing systems. This dissertation specifically addresses the challenges of single-event errors. Single-event errors can be caused by energetic particle strikes or abrupt power outage that can corrupt the program states leading to system failures. Unfortunately, existing techniques to handle those single-event errors are expensive in terms of hardware/software. To address this problem, this dissertation leverages an interesting property called idempotence in the program. A region of code is idempotent if and only if it always generates the same output whenever the program jumps back to the region entry from any execution point within the region. Thus, we can leverage the idempotent property as a low-cost recovery technique to recover the system failures by jumping back to the beginning of the region where the errors occur. This dissertation proposes solutions to incorporate the idempotent property for resilience against those single-event errors. Furthermore, this dissertation introduces a series of optimization techniques with compiler and hardware support to improve the efficiency and overheads for error resilience. We believe that our proposed techniques in this dissertation can inspire researchers for future error resilience research.
Identifer | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/84526 |
Date | 08 August 2018 |
Creators | Liu, Qingrui |
Contributors | Electrical and Computer Engineering, Jung, Changhee, Zeng, Haibo, Ravindran, Binoy, Schaumont, Patrick R., Min, Chang Woo |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |
Page generated in 0.0023 seconds