Global ETD Search

1	Efficient Fault Tolerance In Chip Multiprocessors Using Critical Value Forwarding Subramanyan, Pramod 06 1900 (has links) (PDF) Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to transient faults, wear-out related permanent faults and process variations. Decreasing CMOS reliability implies that high-availability systems which were previously restricted to the domain of mainframe computers or specially designed fault-tolerant systems may be come important for the commodity market as well. In this thesis we tackle the problem of enabling efficient, low cost and configurable fault-tolerance using Chip Multiprocessors (CMPs). Our work studies architectural fault detection methods based on redundant execution, specifically focusing on “leader-follower” architectures. In such architectures redundant execution is performed on two cores/threads of a CMP. One thread acts as the leading thread while the other acts as the trailing thread. The leading thread assists the execution of the trailing thread by forwarding the results of its execution. These forwarded results are used as predictions in the trailing thread and help improve its performance. In this thesis, we introduce a new form of execution assistance called critical value forwarding. Critical value forwarding uses heuristics to identify instructions on the critical path of execution and forwards the results of these instructions to the trailing core. The advantage of critical value forwarding is that it provides much of the speed up obtained by forwarding all values at a fraction of the bandwidth cost. We propose two architectures to exploit the idea of critical value forwarding. The first of these operates the trailing core at lower voltage/frequency levels in order to provide energy-efficient redundant execution. In this context, we also introduce algorithms to dynamically adapt the voltage/frequency level of the trailing core based on program behavior. Our experimental evaluation shows that this proposal consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a mean performance overhead of about 1%. We compare our proposal to two previous energy-efficient fault-tolerant CMP proposals and find that our proposal delivers higher energy-efficiency and lower performance degradation than both while providing a similar level of fault coverage. Our second proposal uses the idea of critical value forwarding to improve fault-tolerant CMP throughput. This is done by using coarse-grained multithreading to mul-tiplex trailing threads on a single core. Our evaluation shows that this architecture delivers 9–13% higher throughput than previous proposals, including one configuration that uses simultaneous multithreading(SMT) to multiplex trailing threads. Since this proposal increases fault-tolerant CMP throughput by executing multiple threads on a single core, it comes at a modest cost in single-threaded performance, a mean slowdown between11–14%. Fault Tolerant Computing Microprocessors Chip Multiprocessors (CMPs) Microarchitecture Energy-efficient Architecture Transient Faults Permanent Faults Redundant Execution Fault Tolerance CMOS Computer Science
2	Power Efficient Last Level Cache For Chip Multiprocessors Mandke, Aparna 01 1900 (has links) (PDF) The number of processor cores and on-chip cache size has been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch-off the over-allocated cache so as to reduce leakage power consumed by it. A large cache offers non-uniform access latency to different cores present on a CMP and such a cache is called “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches and with a single application executing on a uniprocessor. Our ideas of power optimized caches are applicable to any memory technology and architecture for which the difference of leakage power in the on-state and off-state of on-chip cache bank is significant. Switching off the last level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and large dispersed NUCA cache. Hence, to determine cache requirement on a CMP, first we propose a new highly accurate method to estimate working set size of an application, which we call “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our ideas of adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher EDP savings than that obtained with average memory access latency and cache miss ratio heuristics on a static NUCA platform (SNUCA), respectively. Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the rest L2 slices and switch-off mapped L2 slices. The L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with lesser number of threads than available cores during their execution. In such applications L2 slices which are farther to those threads are switched-off and mapped on-to L2 slices which are located nearer to those threads. By using nearer L2 slices with the help of remapped technology, some applications show improved execution time apart from reduction in leakage power consumption in NUCA caches. To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using the genetic algorithms. We formulate this problem as a energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% than that obtained with the near-optimal remap configuration. Energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (DNUCA). The suitability of cache access policy depends on data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify data sharing properties of an application and use them to predict a more suitable cache access policy among SNUCA and DNUCA for an application. Processor Architecture Chip Multiprocessors (CMPs) Cache Memory Cache (Computers) Genetic Algorithms Leakage Power Optimization Working Set Size Optimization Near Optimal Remap Configuration Thread Contention Predictors On-Chip Cache Cache (Computers) Architecture Non-Uniform Cache Architecture (NUCA) Computer Science

Search results

Efficient Fault Tolerance In Chip Multiprocessors Using Critical Value Forwarding

Power Efficient Last Level Cache For Chip Multiprocessors