Global ETD Search

21	Design of Low-Power Reduction-Trees in Parallel Multipliers Oskuii, Saeeid Tahmasbi January 2008 (has links) <p>Multiplications occur frequently in digital signal processing systems, communication systems, and other application specific integrated circuits. Multipliers, being relatively complex units, are deciding factors to the overall speed, area, and power consumption of digital computers. The diversity of application areas for multipliers and the ubiquity of multiplication in digital systems exhibit a variety of requirements for speed, area, power consumption, and other specifications. Traditionally, speed, area, and hardware resources have been the major design factors and concerns in digital design. However, the design paradigm shift over the past decade has entered dynamic power and static power into play as well.</p><p>In many situations, the overall performance of a system is decided by the speed of its multiplier. In this thesis, parallel multipliers are addressed because of their speed superiority. Parallel multipliers are combinational circuits and can be subject to any standard combinational logic optimization. However, the complex structure of the multipliers imposes a number of difficulties for the electronic design automation (EDA) tools, as they simply cannot consider the multipliers as a whole; i.e., EDA tools have to limit the optimizations to a small portion of the circuit and perform logic optimizations. On the other hand, multipliers are arithmetic circuits and considering arithmetic relations in the structure of multipliers can be extremely useful and can result in better optimization results. The different structures obtained using the different arithmetically equivalent solutions, have the same functionality but exhibit different temporal and physical behavior. The arithmetic equivalencies are used earlier mainly to optimize for area, speed and hardware resources.</p><p>In this thesis a design methodology is proposed for reducing dynamic and static power dissipation in parallel multiplier partial product reduction tree. Basically, using the information about the input pattern that is going to be applied to the multiplier (such as static probabilities and spatiotemporal correlations), the reduction tree is optimized. The optimization is obtained by selecting the power efficient configurations by searching among the permutations of partial products for each reduction stage. Probabilistic power estimation methods are introduced for leakage and dynamic power estimations. These estimations are used to lead the optimizers to minimum power consumption. Optimization methods, utilizing the arithmetic equivalencies in the partial product reduction trees, are proposed in order to reduce the dynamic power, static power, or total power which is a combination of dynamic and static power. The energy saving is achieved without any noticeable area or speed overhead compared to random reduction trees. The optimization algorithms are extended to include spatiotemporal correlations between primary inputs. As another extension to the optimization algorithms, the cost function is considered as a weighted sum of dynamic power and static power. This can be extended further to contain speed merits and interconnection power. Through a number of experiments the effectiveness of the optimization methods are shown. The average number of transitions obtained from simulation is reduced significantly (up to 35% in some cases) using the proposed optimizations.</p><p>The proposed methods are in general applicable on arbitrary multi-operand adder trees. As an example, the optimization is applied to the summation tree of a class of elementary function generators which is implemented using summation of weighted bit-products. Accurate transistor-level power estimations show up to 25% reduction in dynamic power compared to the original designs.</p><p>Power estimation is an important step of the optimization algorithm. A probabilistic gate-level power estimator is developed which uses a novel set of simple waveforms as its kernel. The transition density of each circuit node is estimated. This power estimator allows to utilize a global glitch filtering technique that can model the removal of glitches in more detail. It produces error free estimates for tree structured circuits. For circuits with reconvergent fanout, experimental results using the ISCAS85 benchmarks show that this method generally provides significantly better estimates of the transition density compared to previous techniques.</p> parallel multiplier multi-operand adder partial product reduction tree arithmetic equivalency progressive design low power transition activity dynamic power static or leakage power total power wallace dadda full-adder half-adder combinational logic gate-level Electrical engineering Elektroteknik
22	Design of Low-Power Reduction-Trees in Parallel Multipliers Oskuii, Saeeid Tahmasbi January 2008 (has links) Multiplications occur frequently in digital signal processing systems, communication systems, and other application specific integrated circuits. Multipliers, being relatively complex units, are deciding factors to the overall speed, area, and power consumption of digital computers. The diversity of application areas for multipliers and the ubiquity of multiplication in digital systems exhibit a variety of requirements for speed, area, power consumption, and other specifications. Traditionally, speed, area, and hardware resources have been the major design factors and concerns in digital design. However, the design paradigm shift over the past decade has entered dynamic power and static power into play as well. In many situations, the overall performance of a system is decided by the speed of its multiplier. In this thesis, parallel multipliers are addressed because of their speed superiority. Parallel multipliers are combinational circuits and can be subject to any standard combinational logic optimization. However, the complex structure of the multipliers imposes a number of difficulties for the electronic design automation (EDA) tools, as they simply cannot consider the multipliers as a whole; i.e., EDA tools have to limit the optimizations to a small portion of the circuit and perform logic optimizations. On the other hand, multipliers are arithmetic circuits and considering arithmetic relations in the structure of multipliers can be extremely useful and can result in better optimization results. The different structures obtained using the different arithmetically equivalent solutions, have the same functionality but exhibit different temporal and physical behavior. The arithmetic equivalencies are used earlier mainly to optimize for area, speed and hardware resources. In this thesis a design methodology is proposed for reducing dynamic and static power dissipation in parallel multiplier partial product reduction tree. Basically, using the information about the input pattern that is going to be applied to the multiplier (such as static probabilities and spatiotemporal correlations), the reduction tree is optimized. The optimization is obtained by selecting the power efficient configurations by searching among the permutations of partial products for each reduction stage. Probabilistic power estimation methods are introduced for leakage and dynamic power estimations. These estimations are used to lead the optimizers to minimum power consumption. Optimization methods, utilizing the arithmetic equivalencies in the partial product reduction trees, are proposed in order to reduce the dynamic power, static power, or total power which is a combination of dynamic and static power. The energy saving is achieved without any noticeable area or speed overhead compared to random reduction trees. The optimization algorithms are extended to include spatiotemporal correlations between primary inputs. As another extension to the optimization algorithms, the cost function is considered as a weighted sum of dynamic power and static power. This can be extended further to contain speed merits and interconnection power. Through a number of experiments the effectiveness of the optimization methods are shown. The average number of transitions obtained from simulation is reduced significantly (up to 35% in some cases) using the proposed optimizations. The proposed methods are in general applicable on arbitrary multi-operand adder trees. As an example, the optimization is applied to the summation tree of a class of elementary function generators which is implemented using summation of weighted bit-products. Accurate transistor-level power estimations show up to 25% reduction in dynamic power compared to the original designs. Power estimation is an important step of the optimization algorithm. A probabilistic gate-level power estimator is developed which uses a novel set of simple waveforms as its kernel. The transition density of each circuit node is estimated. This power estimator allows to utilize a global glitch filtering technique that can model the removal of glitches in more detail. It produces error free estimates for tree structured circuits. For circuits with reconvergent fanout, experimental results using the ISCAS85 benchmarks show that this method generally provides significantly better estimates of the transition density compared to previous techniques. parallel multiplier multi-operand adder partial product reduction tree arithmetic equivalency progressive design low power transition activity dynamic power static or leakage power total power wallace dadda full-adder half-adder combinational logic gate-level Electrical engineering Elektroteknik
23	Power Efficient Last Level Cache For Chip Multiprocessors Mandke, Aparna 01 1900 (has links) (PDF) The number of processor cores and on-chip cache size has been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch-off the over-allocated cache so as to reduce leakage power consumed by it. A large cache offers non-uniform access latency to different cores present on a CMP and such a cache is called “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches and with a single application executing on a uniprocessor. Our ideas of power optimized caches are applicable to any memory technology and architecture for which the difference of leakage power in the on-state and off-state of on-chip cache bank is significant. Switching off the last level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and large dispersed NUCA cache. Hence, to determine cache requirement on a CMP, first we propose a new highly accurate method to estimate working set size of an application, which we call “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our ideas of adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher EDP savings than that obtained with average memory access latency and cache miss ratio heuristics on a static NUCA platform (SNUCA), respectively. Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the rest L2 slices and switch-off mapped L2 slices. The L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with lesser number of threads than available cores during their execution. In such applications L2 slices which are farther to those threads are switched-off and mapped on-to L2 slices which are located nearer to those threads. By using nearer L2 slices with the help of remapped technology, some applications show improved execution time apart from reduction in leakage power consumption in NUCA caches. To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using the genetic algorithms. We formulate this problem as a energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% than that obtained with the near-optimal remap configuration. Energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (DNUCA). The suitability of cache access policy depends on data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify data sharing properties of an application and use them to predict a more suitable cache access policy among SNUCA and DNUCA for an application. Processor Architecture Chip Multiprocessors (CMPs) Cache Memory Cache (Computers) Genetic Algorithms Leakage Power Optimization Working Set Size Optimization Near Optimal Remap Configuration Thread Contention Predictors On-Chip Cache Cache (Computers) Architecture Non-Uniform Cache Architecture (NUCA) Computer Science
24	Power Efficient Last Level Cache for Chip Multiprocessors Mandke, Aparna January 2013 (has links) (PDF) The number of processor cores and on-chip cache size has been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch-off the over-allocated cache so as to reduce leakage power consumed by it. A large cache offers non-uniform access latency to different cores present on a CMP and such a cache is called “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches and with a single application executing on a uniprocessor. Our ideas of power optimized caches are applicable to any memory technology and architecture for which the difference of leakage power in the on-state and off-state of on-chip cache bank is significant. Switching off the last level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and large dispersed NUCA cache. Hence, to determine cache requirement on a CMP, first we propose a new highly accurate method to estimate working set size of an application, which we call “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our ideas of adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher EDP savings than that obtained with average memory access latency and cache miss ratio heuristics on a static NUCA platform (SNUCA), respectively. Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the rest L2 slices and switch-off mapped L2 slices. The L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with lesser number of threads than available cores during their execution. In such applications L2 slices which are farther to those threads are switched-off and mapped on-to L2 slices which are located nearer to those threads. By using nearer L2 slices with the help of remapped technology, some applications show improved execution time apart from reduction in leakage power consumption in NUCA caches. To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using the genetic algorithms. We formulate this problem as a energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% than that obtained with the near-optimal remap configuration. Energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (DNUCA). The suitability of cache access policy depends on data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify data sharing properties of an application and use them to predict a more suitable cache access policy among SNUCA and DNUCA for an application. Processor Architecture Chip Multiprocessor Cache Memory Cache Genetic Algorithms Leakage Power Optimization Working Set Size Optimization Near Optimal Remap Configuration Thread Contention Predictors On-Chip Cache Cache Architecture NUCA SNUCA Non-Uniform Cache Architecture Computer Science

Page generated in 0.0494 seconds