• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 6
  • 5
  • 2
  • 2
  • Tagged with
  • 23
  • 23
  • 9
  • 5
  • 5
  • 5
  • 4
  • 4
  • 4
  • 4
  • 4
  • 3
  • 3
  • 3
  • 3
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Power Efficient Last Level Cache For Chip Multiprocessors

Mandke, Aparna 01 1900 (has links) (PDF)
The number of processor cores and on-chip cache size has been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch-off the over-allocated cache so as to reduce leakage power consumed by it. A large cache offers non-uniform access latency to different cores present on a CMP and such a cache is called “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches and with a single application executing on a uniprocessor. Our ideas of power optimized caches are applicable to any memory technology and architecture for which the difference of leakage power in the on-state and off-state of on-chip cache bank is significant. Switching off the last level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and large dispersed NUCA cache. Hence, to determine cache requirement on a CMP, first we propose a new highly accurate method to estimate working set size of an application, which we call “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our ideas of adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher EDP savings than that obtained with average memory access latency and cache miss ratio heuristics on a static NUCA platform (SNUCA), respectively. Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the rest L2 slices and switch-off mapped L2 slices. The L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with lesser number of threads than available cores during their execution. In such applications L2 slices which are farther to those threads are switched-off and mapped on-to L2 slices which are located nearer to those threads. By using nearer L2 slices with the help of remapped technology, some applications show improved execution time apart from reduction in leakage power consumption in NUCA caches. To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using the genetic algorithms. We formulate this problem as a energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% than that obtained with the near-optimal remap configuration. Energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (DNUCA). The suitability of cache access policy depends on data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify data sharing properties of an application and use them to predict a more suitable cache access policy among SNUCA and DNUCA for an application.
22

Power Efficient Last Level Cache for Chip Multiprocessors

Mandke, Aparna January 2013 (has links) (PDF)
The number of processor cores and on-chip cache size has been increasing on chip multiprocessors (CMPs). As a result, leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch-off the over-allocated cache so as to reduce leakage power consumed by it. A large cache offers non-uniform access latency to different cores present on a CMP and such a cache is called “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform access latency caches and with a single application executing on a uniprocessor. Our ideas of power optimized caches are applicable to any memory technology and architecture for which the difference of leakage power in the on-state and off-state of on-chip cache bank is significant. Switching off the last level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and large dispersed NUCA cache. Hence, to determine cache requirement on a CMP, first we propose a new highly accurate method to estimate working set size of an application, which we call “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our ideas of adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus avoids network access unlike other commonly used heuristics such as average memory access latency and cache miss ratio. Our implementation gives 25% and 19% higher EDP savings than that obtained with average memory access latency and cache miss ratio heuristics on a static NUCA platform (SNUCA), respectively. Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the rest L2 slices and switch-off mapped L2 slices. The L2 slice includes all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with lesser number of threads than available cores during their execution. In such applications L2 slices which are farther to those threads are switched-off and mapped on-to L2 slices which are located nearer to those threads. By using nearer L2 slices with the help of remapped technology, some applications show improved execution time apart from reduction in leakage power consumption in NUCA caches. To estimate the maximum possible gains that can be obtained using the remap policy, we statically determine the near-optimal remap configuration using the genetic algorithms. We formulate this problem as a energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% than that obtained with the near-optimal remap configuration. Energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (DNUCA). The suitability of cache access policy depends on data sharing properties of a multi-threaded application. Hence, we propose three indices to quantify data sharing properties of an application and use them to predict a more suitable cache access policy among SNUCA and DNUCA for an application.
23

Algorithm And Architecture Design for Real-time Face Recognition

Mahale, Gopinath Vasanth January 2016 (has links) (PDF)
Face recognition is a field of biometrics that deals with identification of subjects based on features present in the images of their faces. The factors that make face recognition popular and favorite as compared to other biometric methods are easier operation and ability to identify subjects without their knowledge. With these features, face recognition has become an integral part of the present day security systems, targeting a smart and secure world. There are various factors that de ne the performance of a face recognition system. The most important among them are recognition accuracy of algorithm used and time taken for recognition. Recognition accuracy of the face recognition algorithm gets affected by changes in pose, facial expression and illumination along with occlusions in the images. There have been a number of algorithms proposed to enable recognition under these ambient changes. However, it has been hard to and a single algorithm that can efficiently recognize faces in all the above mentioned conditions. Moreover, achieving real time performance for most of the complex face recognition algorithms on embedded platforms has been a challenge. Real-time performance is highly preferred in critical applications such as identification of crime suspects in public. As available software solutions for FR have significantly large latency in recognizing individuals, they are not suitable for such critical real-time applications. This thesis focuses on real-time aspect of FR, where acceleration of the algorithms is achieved by means of parallel hardware architectures. The major contributions of this work are as follows. We target to design a face recognition system that can identify at most 30 faces in each frame of video at 15 frames per second, which amounts to 450 recognitions per second. In addition, we target to achieve good recognition accuracy along with scalability in terms of database size and input image resolutions. To design a system with these specifications, as a first step, we explore algorithms in literature and come up with a hybrid face recognition algorithm. This hybrid algorithm shows good recognition accuracy on face images with changes in illumination, pose and expressions, and also with occlusions. In addition the computations in the algorithm are modular in nature which are suitable for real-time realizations through parallel processing. The face recognition system consists of a face detection module to detect faces in the input image, which is followed by a face recognition module to identify the detected faces. There are well established algorithms and architectures for face detection in literature which can perform detection at 15 frames per second on video frames. Detected faces of different sizes need to be scaled to the size specified by the face recognition module. To meet the real-time constraints, we propose a hardware architecture for real-time bi-cubic convolution interpolation with dynamic scaling factors. To recognize the resized faces in real-time, a scalable parallel pipelined architecture is designed for the hybrid algorithm which can perform 450 recognitions per second on a database containing grayscale images of at most 450 classes on Virtex 6 FPGA. To provide flexibility and programmability, we extend this design to REDEFINE, a multi-core massively parallel reconfigurable architecture. In this design, we come up with FR specific programmable cores termed Scalable Unit for Region Evaluation (SURE) capable of performing modular computations in the hybrid face recognition algorithm. We replicate SUREs in each tile of REDEFINE to construct a face recognition module termed REDEFINE for Face Recognition using SURE Homogeneous Cores (REFRESH). There is a need to learn new unseen faces on-line in practical face recognition systems. Considering this, for real-time on-line learning of unseen face images, we design tiny processors termed VOP, Processor for Vector Operations. VOPs function as coprocessors to process elements under each tile of REDEFINE to accelerate micro vector operations appearing in the synaptic weight computations. We also explore deep neural networks which operate similar to the processing in human brain and capable of working on very large face databases. We explore the field of Random matrix theory to come up with a solution for synaptic weight initialization in deep neural networks for better classification . In addition, we perform design space exploration of hardware architecture for deep convolution networks and conclude with directions for future work.

Page generated in 0.0681 seconds