Some effective approaches on designing fault-tolerant digital circuits and systems

Continued technology scaling of integrated circuits (ICs) has brought various benefits to modern life. Smaller ICs have made everyday devices smaller and cheaper. However, higher wear-out, stress effects and varied operating environments have shortened and severely limited device lifetimes. A possible solution is to introduce fault tolerance into the system, providing resilience towards the faults that normally occur due to these effects. The main challenge is to provide an adequate increase in reliability without imposing much overhead. To this end, this thesis presents several hardware and software approaches that improve the reliability of a system and provide resilience towards transient and permanent faults. We observed that the multiple-faults aware placement strategy improves the lifetime reliability of digital circuits by lowering the error rate. We proposed several improvements to this strategy to achieve faster processing and higher reliability. These improvements are classified as hardware-level approaches to achieving tolerance of multiple faults in digital circuits. An analytical method based on Signal Probability Reliability Analysis (SPRA) is proposed that overcomes the long simulation time needed to profile pairs of cells/gates. This method runs one order of magnitude faster than the original simulation approach. We also proposed applying a Hill Climbing strategy after Simulated Annealing to reduce the wire length observed in the original design. Experimental results show that this method can reduce the wire length by up to 61%. We also proposed a novel optimisation algorithm that reduces the error rate by manipulating the available spaces to separate the 'bad pairs' in the circuit. We investigated the levels of 'bad pair' considered in the optimisation algorithm and found that, with two categories of 'bad pairs', the error rate is reduced by up to 23% with little simulation time overhead.
Checkpointing has been used for decades as one of the primary software-level approaches for mitigating the effect of transient faults in a system. We studied the effectiveness of checkpointing with respect to the lifetime reliability of a system, rather than merely as a means of providing fault tolerance. We proposed a novel checkpointing mechanism, the Lifetime Reliability-Aware Checkpointing Mechanism (LRAC), that is capable not only of tolerating transient faults but also of migrating the task to a spare host whenever a permanent fault occurs or is expected to occur. We observed that this incurs approximately 12% time overhead, only when faults occur, even when the fault rate is as high as 10^-3. Despite this overhead, the approach does not fail to meet the hard deadlines of the tasks being executed.
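
The idea of running Hill Climbing after Simulated Annealing can be illustrated with a toy sketch. This is not the thesis's placement algorithm: it assumes a one-dimensional row of slots, two-pin nets, and a plain wire-length cost with no fault awareness, and every function name and parameter below is hypothetical. It only shows why a greedy pass after annealing can never increase the wire length and may reduce it further.

```python
import math
import random

def wire_length(order, nets):
    """Total wire length: sum of slot distances between connected cells."""
    pos = {cell: slot for slot, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def simulated_annealing(order, nets, rng, temp=5.0, cooling=0.995, steps=2000):
    """Random pairwise swaps; a worse swap is kept with probability exp(-delta/temp)."""
    current = list(order)
    cost = wire_length(current, nets)
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = wire_length(current, nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost  # accept the swap
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        temp = max(temp * cooling, 1e-3)  # cool down, with a floor (assumed schedule)
    return current

def hill_climb(order, nets):
    """Greedy refinement: keep only strictly improving swaps until none is left."""
    current = list(order)
    cost = wire_length(current, nets)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            for j in range(i + 1, len(current)):
                current[i], current[j] = current[j], current[i]
                new_cost = wire_length(current, nets)
                if new_cost < cost:
                    cost = new_cost
                    improved = True
                else:
                    current[i], current[j] = current[j], current[i]  # undo the swap
    return current

# Toy instance: 12 cells and 20 random two-pin nets (illustrative data only).
rng = random.Random(0)
cells = list(range(12))
nets = [(rng.randrange(12), rng.randrange(12)) for _ in range(20)]
start = cells[:]
rng.shuffle(start)

annealed = simulated_annealing(start, nets, rng)
refined = hill_climb(annealed, nets)
print(wire_length(start, nets), wire_length(annealed, nets), wire_length(refined, nets))
```

Because the greedy pass accepts only strictly improving swaps, its result is never longer than the annealed placement it starts from. How much it gains in practice depends on the annealing schedule and the circuit, so the figures printed here say nothing about the 61% reported above.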

Identifier oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:702217
Date January 2016
Creators Bandan, Mohamad Imran bin
Publisher University of Bristol
Source Sets Ethos UK
Detected Language English
Type Electronic Thesis or Dissertation
