Spelling suggestions: "subject:"size aptimization"" "subject:"size anoptimization""
11 |
Correlation-based Cross-layer Communication in Wireless Sensor NetworksVuran, Mehmet Can 09 July 2007 (has links)
Wireless sensor networks (WSN) are event based systems that rely on the collective effort of densely deployed sensor nodes continuously observing a physical phenomenon. The spatio-temporal correlation between the sensor observations and the cross-layer design advantages are significant and unique to the design of WSN. Due to the high density in the network topology, sensor observations are highly correlated in the space domain. Furthermore, the nature of the energy-radiating physical phenomenon constitutes the temporal correlation between each consecutive observation of a sensor node. This unique characteristic of WSN can be exploited through a cross-layer design of communication functionalities to improve energy efficiency of the network.
In this thesis, several key elements are investigated to capture and exploit the correlation in the WSN for the realization of advanced efficient communication protocols. A theoretical framework is developed to capture the spatial and temporal correlations in WSN and to enable the development of efficient communication protocols. Based on this framework, spatial Correlation-based Collaborative Medium Access Control (CC-MAC) protocol is described, which exploits the spatial correlation in the WSN in order to achieve efficient medium access. Furthermore, the cross-layer module (XLM), which melts common protocol layer functionalities into a cross-layer module for resource-constrained sensor nodes, is developed. The cross-layer analysis of error control in WSN is then presented to enable a comprehensive comparison of error control schemes for WSN. Finally, the cross-layer packet size optimization framework is described.
|
12 |
Power Efficient Last Level Cache For Chip MultiprocessorsMandke, Aparna 01 1900 (has links) (PDF)
The number of processor cores and on-chip cache size has been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch-off the over-allocated cache so as to reduce leakage power consumed by it. A large cache offers non-uniform access latency to different cores present on a CMP and such a cache is called “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches and with a single application executing on a uniprocessor. Our ideas of power optimized caches are applicable to any memory technology and architecture for which the difference of leakage power in the on-state and off-state of on-chip cache bank is significant.
Switching off the last level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and large dispersed NUCA cache. Hence, to determine cache requirement on a CMP, first we propose a new highly accurate method to estimate working set size of an application, which we call “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our ideas of adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher EDP savings than that obtained with average memory access latency and cache miss ratio heuristics on a static NUCA platform (SNUCA), respectively.
Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the rest L2 slices and switch-off mapped L2 slices. The L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with lesser number of threads than available cores during their execution. In such applications L2 slices which are farther to those threads are switched-off and mapped on-to L2 slices which are located nearer to those threads. By using nearer L2 slices with the help of remapped technology, some applications show improved execution time apart from reduction in leakage power consumption in NUCA caches.
To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using the genetic algorithms. We formulate this problem as a energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% than that obtained with the near-optimal remap configuration.
Energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (DNUCA). The suitability of cache access policy depends on data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify data sharing properties of an application and use them to predict a more suitable cache access policy among SNUCA and DNUCA for an application.
|
13 |
Power Efficient Last Level Cache for Chip MultiprocessorsMandke, Aparna January 2013 (has links) (PDF)
The number of processor cores and on-chip cache size has been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch-off the over-allocated cache so as to reduce leakage power consumed by it. A large cache offers non-uniform access latency to different cores present on a CMP and such a cache is called “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches and with a single application executing on a uniprocessor. Our ideas of power optimized caches are applicable to any memory technology and architecture for which the difference of leakage power in the on-state and off-state of on-chip cache bank is significant.
Switching off the last level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and large dispersed NUCA cache. Hence, to determine cache requirement on a CMP, first we propose a new highly accurate method to estimate working set size of an application, which we call “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our ideas of adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher EDP savings than that obtained with average memory access latency and cache miss ratio heuristics on a static NUCA platform (SNUCA), respectively.
Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the rest L2 slices and switch-off mapped L2 slices. The L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with lesser number of threads than available cores during their execution. In such applications L2 slices which are farther to those threads are switched-off and mapped on-to L2 slices which are located nearer to those threads. By using nearer L2 slices with the help of remapped technology, some applications show improved execution time apart from reduction in leakage power consumption in NUCA caches.
To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using the genetic algorithms. We formulate this problem as a energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% than that obtained with the near-optimal remap configuration.
Energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (DNUCA). The suitability of cache access policy depends on data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify data sharing properties of an application and use them to predict a more suitable cache access policy among SNUCA and DNUCA for an application.
|
14 |
Performance-Aware Code Size Optimization of Generic Functions through Automatic Implementation of Dynamic Dispatch / Prestandamedveten kodstorleksoptimering av generiska funktioner genom automatisk tillämpning av dynamic dispatchHärnqvist, Ivar January 2022 (has links)
Monomorphization and dynamic dispatch are two common techniques for implementing polymorphism in statically typed programming languages. Function templates in C++ use the former technique to enable algorithms written as generic functions to be efficiently reused with multiple different data types by producing a separate function instantiation for each invocation that uses a unique permutation of argument types. This avoids the overhead of indirection associated with dynamic dispatch and allows the generated code of each instantiation to be optimized by the compiler for its specific concrete types, which typically yields great improvements in runtime performance over any dynamic approach. The disadvantage of this implementation, compared to the type-erased generics found in many other programming languages, is that careless over-use of templates with many different argument types can lead to an excessive amount of redundant code being generated for the same function. This increase in code size may increase the binary size of the final program and reduce the amount of useful code that can fit into the processor's instruction cache during execution, reducing code locality and thereby potentially reducing performance. Monomorphization can also increase compilation time due to the increase in generated code that needs to be compiled and optimized. This thesis presents a heuristic-based approach to generic programming that allows function templates to be automatically converted to use dynamic dispatch in scenarios where the resulting negative impact on runtime performance is predicted to be low. The thesis project includes the development of a proof of concept plugin for the Clang compiler frontend that can be used to compile existing C++ projects with the conversions applied. The design of a heuristic function for determining whether a given function template should use monomorphization or dynamic dispatch based on statically known metrics is proposed based on the results of an experiment. This heuristic is shown to achieve a small general improvement in program size across a set of open-source C++ projects when they are compiled using the plugin. The key findings from the experiment and from the development of the plugin are summarized with a general strategy for how the approach can be integrated into the design of future programming languages to promote more extensive use of generic programming in performance-sensitive code while avoiding regressions in program size and compilation time.
|
Page generated in 0.135 seconds