1 |
Design of the Execution-driven Simulation Environment for Hyper-scalar Architecture
Su, Ding-Siang 21 August 2008
Driven by microprocessor research and advances in VLSI manufacturing technology, high-performance computing has moved toward multi-core architectures. However, current multi-core architectures follow the symmetric multiprocessor (SMP) concept. In a traditional SMP design, only data links exist between processor cores, so a single thread can be handled by only a single core. This limits core utilization, and single-thread performance cannot improve.
This paper proposes a scalable chip multiprocessor architecture called Hyper-scalar. The principal characteristic of the architecture is "design the interconnect control mechanisms for instructions in the multi-core". Several single scalar processor cores in the Hyper-scalar architecture can be dynamically grouped into an n-way superscalar accelerator, called an accelerator group, to improve instruction-level parallelism. Hyper-scalar combines the advantages of superscalar and multithreaded architectures; hence, it can not only enhance single-threaded performance by using accelerator groups but also support multithreaded applications.
Based on the ARM instruction set, the paper analyzes how to create interactive control mechanisms for instructions across cores and how to enhance single-thread performance in the Hyper-scalar architecture. The analysis is divided into four parts: register flow, memory flow, instruction control flow, and chopping of multi-cycle instructions. In the register flow part, instructions issued into the processor are given dependence tags that resolve the dependences among all issued instructions. All instructions can exchange data through the virtual shared register file (VSRF) mechanism, and each instruction is executed only when its operands are available. In the memory flow part, we solve the dependence problem with a simple technique: executing the instructions in instruction order. In the instruction control flow part, we use speculative execution to improve performance, so instructions can execute out of order beyond the basic block. Finally, because the ARM instruction set contains some multi-cycle instructions, the Hyper-scalar framework chops them into several one-cycle instructions to further enhance performance.
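The register-flow idea above (an instruction executes only once its operands are available) can be illustrated with a small scheduling sketch. This is not the thesis's SystemC model: the `Instr` class, the `schedule` function and the greedy slot-filling policy are hypothetical names and assumptions made for illustration, and register renaming through dependence tags is assumed to remove all hazards except true read-after-write dependences.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str          # label used in the result
    dest: str          # destination register
    srcs: tuple        # source registers (may be empty)
    latency: int = 1   # cycles until the result is available


def schedule(instrs, width):
    """Assign each instruction the earliest cycle in which all of its
    source operands are ready and one of `width` issue slots is free."""
    ready_at = {}      # register -> cycle at which its value becomes available
    slots_used = {}    # cycle -> number of instructions already issued
    issue_cycle = {}
    for ins in instrs:  # walk in program order
        # Earliest cycle allowed by the true data dependences.
        cycle = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        # Skip cycles whose issue slots are all taken.
        while slots_used.get(cycle, 0) >= width:
            cycle += 1
        slots_used[cycle] = slots_used.get(cycle, 0) + 1
        issue_cycle[ins.name] = cycle
        ready_at[ins.dest] = cycle + ins.latency
    return issue_cycle


program = [
    Instr("i1", "r1", ()),
    Instr("i2", "r2", ("r1",)),
    Instr("i3", "r3", ()),
    Instr("i4", "r4", ("r2", "r3")),
]
```

With `width=2` (two cores grouped together), `i3` can issue alongside `i1` because it does not depend on it, so the program finishes one cycle earlier than with `width=1`.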
The simulation model is written in SystemC, a C++-based modeling language that provides a hardware-oriented simulation platform, and the MediaBench suite is selected for the experiments. On average, the Hyper-scalar architecture accelerates single-threaded performance by 50% to 300% using 2 to 8 cores.
|
2 |
Performance evaluation of Cache-Based Multi-Core Architectures with Networks-on-Chip
Rajkumar, Robin Kingsley 01 December 2012
Multi-core architectures are the future of high-performance computing and are already ubiquitous: what was a vision some twenty years ago is now a reality, with most personal computers and laptops running on multi-core processors. However, as the number of cores continues to scale, there will be serious throughput and performance issues related to the network topologies used to connect the cores. Among the network topologies under consideration in modern multi-core systems, the 'Mesh' topology is widely used. In terms of performance, the 'Point to Point' topology would outperform all other topologies such as Crossbar, Mesh and Torus. The 'Point to Point' topology does incur additional expense, because more links are needed to connect each core to every other core in the network. This implementation cost is the reason it is not preferred in industry for general-use systems. For research purposes, however, it serves as the best alternative to the 'Mesh' for higher speed in computer systems. The characteristics of the tasks executing on the cores will also have a significant impact on topology performance. So, as multi-core chips scale from 10 to 1000 cores per chip and more, selecting the right network topology is important. Another interesting factor to consider is the effect of the cache on these multi-core systems for each of these topologies. Cache coherency is and will be a major cause of throughput decrease as cores scale. In our work, we use the Modified-Exclusive-Shared-Invalid (MESI) cache coherency protocol for all of the network topologies considered. In this thesis, we investigate the effect of varying cache parameters, namely the sizes of the L1 instruction cache, L1 data cache and L2 cache and their respective associativities, on each network topology.
Various combinations of all these four parameters were considered as we ran experiments. We use the gem5 Computer Architecture Simulator for running our experiments with 4 core models. For benchmark purposes, we use the SPLASH-2 set of 'High Performance Computing' benchmarks. A benchmark is assigned to each core. We also observe the effects of running benchmarks with similar characteristics on all cores versus running a set of different benchmarks, while keeping all other parameters constant. Through our results, we attempt to give researchers and the industry at large a better view of the advantages and disadvantages of, and the relationship between, multi-cores, the cache and network topologies for multi-core systems.
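The cost argument against the 'Point to Point' topology can be made concrete by counting links. The sketch below is only an illustration under simple assumptions (one bidirectional link per connected pair, a rectangular 2D mesh without wrap-around); the function names are hypothetical and nothing here comes from the gem5 configuration used in the thesis.

```python
def point_to_point_links(cores):
    """Links needed when every core is wired directly to every other core."""
    return cores * (cores - 1) // 2


def mesh_links(rows, cols):
    """Links in a rows x cols 2D mesh: horizontal links plus vertical links."""
    return rows * (cols - 1) + cols * (rows - 1)
```

For the 4-core configuration mentioned above, a fully connected network needs 6 links against 4 for a 2 x 2 mesh, which is a small difference. At 64 cores the gap grows to 2016 links against 112 for an 8 x 8 mesh, since the point-to-point count grows quadratically with the number of cores.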
|
3 |
Design and Implementation of Multi-core Support for an Embedded Real-time Operating System for Space Applications
Zhang, Wei January 2015
Nowadays, multi-core processors are widely used in embedded applications because of their higher performance and lower power consumption. However, the complexity of multi-core architectures makes it considerably challenging to extend a single-core version of a real-time operating system to support a multi-core platform. This thesis documents the design and implementation of a multi-core version of RODOS, an embedded real-time operating system developed by the German Aerospace Center and the University of Würzburg, on a dual-core platform. Two possible models are proposed: Symmetric Multiprocessing and Asymmetric Multiprocessing. To prevent collisions during the initialization of global components, a new multi-core boot loader is created so that each core boots up in a proper manner. A working version of multi-core RODOS is implemented that can run tasks on a multi-core platform. Several test cases are applied, and they show a performance improvement of around 180% for the multi-core version of RODOS compared with the same tasks running on the original RODOS. Deadlock-free communication and synchronization APIs are provided to let parallel applications share data and messages in a safe manner.
|
4 |
Multicore Optimized Real-Time Protocol for Power Control Networks
Naveed, Muhammad January 2012
Technology today is changing at a fast pace. The growth of computers and telecommunications over the past three decades has been extraordinary. We are now at the point where all technologies related to communication and data transfer are converging on a common platform. A number of different methods are available for data communication or data transfer. The important factor in all communication setups is to satisfy user demands reliably and at low cost. The area of interest for this thesis is future energy substations and windmills. To keep things straightforward and to explore the different options and capabilities, the focus is on designing and implementing a new energy protocol called Energy Real Time Protocol (eRTP), based on Iyad Real Time Protocol (iRTP) [2]. The protocol is designed to meet the requirements of power and energy networks in terms of sending energy parameters, optionally together with VoIP data, among power stations at different locations. Keeping in mind the importance of transferring energy parameters in real time, the presented protocol is built upon small individual algorithms/modules designed for multi-core architecture. Each module is supposed to be processed by an individual core/processor in parallel.
|
5 |
Stratégie de placement et d'ordonnancement de taches logicielles pour architectures reconfigurables sous contrainte énergétique / Mapping and scheduling strategy of OS tasks into reconfigurable architectures under energy constraint
Gammoudi, Aymen 26 June 2018
The design of embedded real-time systems is developing more and more with the increasing integration of critical functionalities for monitoring applications, particularly in the biomedical, environmental and home automation fields. The development of these systems faces various challenges, particularly in terms of minimizing energy consumption. Managing such fully autonomous embedded devices requires solving various problems related to the amount of energy available in the battery, to the real-time scheduling of tasks that must be executed before their deadlines, to the reconfiguration scenarios, especially when tasks are added, and to the communication constraint that governs message exchange between cores, so as to ensure lasting autonomy until the next recharge while maintaining an acceptable quality of service for the processing system. To address this problem, we propose in this work a new strategy for the placement and scheduling of tasks to execute real-time applications on an architecture containing heterogeneous cores. In this thesis, we have chosen to tackle this problem incrementally in order to deal progressively with problems related to real-time, energy and communication constraints.
First, we are particularly interested in the scheduling of tasks on a single-core architecture. We propose a new scheduling strategy based on grouping tasks in packs, so that the new task parameters can be computed easily in order to restore system feasibility. We then extend it to address task scheduling on a homogeneous multi-core architecture. Finally, this is extended in turn to reach the main objective, which is the scheduling of tasks for heterogeneous architectures. The idea is to gradually take into account increasingly complex execution constraints. We formalize the proposed strategy as an optimization problem using integer linear programming (ILP), and we compare the proposed solutions with the optimal results provided by the CPLEX solver and with other state-of-the-art algorithms. In addition, validation by simulation shows that the proposed strategies yield an appreciable gain with respect to the criteria considered important in embedded systems, in particular the cost of communication between cores and the rejection rate of new tasks.
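The abstract does not state which feasibility test it uses when tasks are added. As a minimal sketch, assume periodic tasks with deadlines equal to their periods, scheduled by EDF on a single core; under that assumption (which is ours, not the thesis's) the task set is feasible exactly when total utilization does not exceed 1. The function names and the `(wcet, period)` tuple layout are hypothetical.

```python
from fractions import Fraction


def utilization(tasks):
    """Total utilization of a list of (wcet, period) pairs, computed exactly."""
    return sum(Fraction(wcet, period) for wcet, period in tasks)


def feasible_after_adding(tasks, new_task):
    """EDF test for implicit-deadline periodic tasks on one core:
    the set stays feasible if and only if utilization stays <= 1."""
    return utilization(tasks + [new_task]) <= 1


current = [(1, 4), (2, 5)]   # utilization 1/4 + 2/5 = 13/20
```

Adding a task `(1, 3)` keeps utilization at 59/60 and is accepted, while adding `(2, 5)` would raise it to 21/20. In the second case a reconfiguration strategy would have to adjust task parameters or reject the task, which is the situation the pack-based approach above is meant to handle.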
|
6 |
Contrôle distribué pour les systèmes multi-cœurs auto-adaptatifs / Distributed Control for Self-adaptatif Multi-Core Architectures
Mansouri, Imen 30 November 2011
Regular architectures embedding several processing elements are increasingly used in embedded systems. They require careful design to avoid high power consumption and to improve their flexibility. This thesis deals with energy optimization mechanisms for large-scale architectures; to cope with technological variability and changes in the application context, optimization is performed at run-time. The target design implements in-situ features to collect physical information about the circuit, such as its degree of degradation, and to monitor the application workload and the resulting power consumption. For workload monitoring, we use activity counters connected at the architecture level to a set of critical signals.
We developed an automated method to place these features optimally with minimal area overhead. The collected information is then used together with a power model to estimate the dissipated power and to drive the appropriate optimization process. The optimal frequency for each core is set by a distributed controller based on consensus theory and replicated in each core. The resulting settings aim to reduce the power of the whole system while fulfilling the real-time constraints of the running application. The scheme needs to be fully distributed to guarantee the scalability of the control, and hence its feasibility, as the number of cores scales.
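To give a feel for what "fully distributed" means here, the sketch below shows plain average consensus, a standard building block of consensus theory. It is not the controller from the thesis: the function names, the ring of four cores, the step size and the idea of averaging a per-core workload estimate are all assumptions for illustration. Each core only ever reads the values of its direct neighbours.

```python
def consensus_step(values, neighbors, step):
    """One synchronous round: each core moves toward its neighbours' values."""
    return [
        x + step * sum(values[j] - x for j in neighbors[i])
        for i, x in enumerate(values)
    ]


def run_consensus(values, neighbors, step=0.25, rounds=100):
    """Iterate the local update; on a connected, symmetric graph with a
    small enough step, every core converges to the global average."""
    for _ in range(rounds):
        values = consensus_step(values, neighbors, step)
    return values


# Four cores in a ring; neighbors[i] lists the cores adjacent to core i.
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]
local_estimates = [1.0, 2.0, 3.0, 6.0]   # hypothetical per-core measurements
agreed = run_consensus(local_estimates, ring)
```

Every entry of `agreed` ends up at the average of the initial estimates (3.0) without any core seeing the whole system, which is why this style of update scales with the number of cores.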
|
7 |
Deployment of mixed criticality and data driven systems on multi-cores architectures / Déploiement de systèmes à flots de données en criticité mixte pour architectures multi-coeurs
Medina, Roberto 30 January 2019
Nowadays, the design of modern safety-critical systems is pushing towards the integration of multiple system components onto a single shared computation platform. Mixed-criticality systems in particular allow critical components with a high degree of confidence (i.e. low probability of failure) to share computation resources with less critical or non-critical components without requiring software isolation mechanisms (as opposed to partitioned systems).
Traditionally, safety-critical systems have been conceived using models of computation like data-flow graphs and real-time scheduling to obtain logical and temporal correctness. Nonetheless, resources given to data-flow representations and real-time scheduling techniques are based on worst-case analysis, which often leads to under-utilization of the computation capacity. The allocated resources are not always completely used. This under-utilization becomes more noticeable on multi-core architectures, where the difference between best-case and worst-case performance is more significant.
The mixed-criticality execution model proposes a solution to this problem. To allocate resources efficiently while ensuring safe execution of the most critical components, resources are allocated according to the operational mode the system is in. As long as sufficient processing capabilities are available to respect deadlines, the system remains in a 'low-criticality' operational mode. If the system demand increases, critical components are prioritized to meet their deadlines, their computation resources are increased, and less critical or non-critical components are potentially penalized. The system is then said to transition to a 'high-criticality' operational mode.
Yet incorporating mixed-criticality aspects into the data-flow model of computation is a very difficult problem, as it requires defining new scheduling methods capable of handling precedence constraints and variations in timing budgets.
Although mixed-criticality scheduling has been well studied for single-core and multi-core platforms, the problem of data dependencies on multi-core platforms has rarely been considered. Existing methods lead to poor resource usage, which contradicts the main purpose of mixed-criticality. For this reason, our first objective focuses on designing new efficient scheduling methods for data-driven mixed-criticality systems. We define a meta-heuristic producing scheduling tables for all operational modes of the system. These tables are proven to be correct, i.e. when the system demand increases, critical components will never miss a deadline. Two implementations based on existing preemptive global algorithms were developed to gain in schedulability and resource usage. In some cases these implementations schedule more than 60% of systems compared to existing approaches.
While the mixed-criticality model claims that critical and non-critical components can share the same computation platform, the interruption of non-critical components degrades their availability significantly. This is a problem since non-critical components need to deliver a minimum service guarantee. In fact, recent works in mixed-criticality have recognized this limitation. For this reason, we define methods to evaluate the availability of non-critical components. To our knowledge, our evaluations are the first ones capable of quantifying availability. We also propose enhancements compatible with our scheduling methods, limiting the impact that critical components have on non-critical ones. These enhancements are evaluated using probabilistic automata and show a considerable improvement in availability, e.g. improvements of over 2% in a context where increases of the order of 10^-9 are significant.
Our contributions have been integrated into an open-source framework. This tool also provides an unbiased generator used to perform evaluations of scheduling methods for data-driven mixed-criticality systems.
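The two operational modes described above can be sketched in a few lines. This is an illustrative model only, not the thesis's scheduler: the `Task` fields, the budget-overrun trigger for the mode switch, and the choice to drop non-critical tasks entirely in the high-criticality mode are assumptions borrowed from the classic dual-criticality task model. The thesis itself proposes enhancements precisely to avoid penalizing non-critical tasks this harshly.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    critical: bool
    budget_lo: int   # time budget assumed in the low-criticality mode
    budget_hi: int   # larger budget granted in the high-criticality mode


def next_mode(mode, task, executed):
    """Assumed trigger: switch to HI when a critical task runs past
    its low-criticality budget; otherwise keep the current mode."""
    if mode == "LO" and task.critical and executed > task.budget_lo:
        return "HI"
    return mode


def runnable(tasks, mode):
    """Tasks (with the budget they receive) that may run in the given mode.
    In HI mode this sketch keeps only critical tasks, with larger budgets."""
    if mode == "LO":
        return [(t.name, t.budget_lo) for t in tasks]
    return [(t.name, t.budget_hi) for t in tasks if t.critical]


tasks = [Task("control", True, 2, 5), Task("logging", False, 3, 3)]
```

In this sketch an overrun of `control` moves the system to the HI mode, where `logging` no longer runs. Measuring how often and for how long that happens is one way to think about the availability of non-critical components discussed in the abstract.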
|
8 |
Parallel Heart Analysis Algorithms Utilizing Multi-core for Optimized Medical Data Exchange over Voice and Data Networks
Karim, Fazal January 2011
In today's research and market, IT applications for health care are attracting great interest from both IT and medical researchers. Cardiovascular diseases (CVDs) are considered the largest cause of death for both men and women regardless of ethnic background. More efficient treatments and, most importantly, efficient methods of cardiac diagnosis for examining heart diseases are desired. Electrocardiography (ECG) is an essential method used to diagnose heart diseases. However, diagnosing a cardiovascular disease by eye from the 12-lead ECG printout of an ECG machine might seriously impair analysis accuracy. To meet this challenge in today's ECG analysis methodology, a more reliable solution that can analyze large amounts of patient data in real time is desired. The software solution presented in this article aims to reduce the risk involved in diagnosing CVDs by eye, by processing large-scale patient data in real time at the patient's location and sending the required results or summary to the doctor or nurse. Keeping in mind the importance of real-time analysis of patient data, the software system is built upon small individual algorithms/modules designed for multi-core architecture, where each module is supposed to be processed by an individual core/processor in parallel. All input and output processes of the analysis system are automated, which reduces the operator's interaction with the system and thus reduces cost. The outputs of the processing are summarized into smaller files in both ASCII and binary formats to meet the requirement of exchanging the data over voice and data networks.
|
9 |
Multi-Core Memory System Design : Developing and using Analytical Models for Performance Evaluation and Enhancements
Dwarakanath, Nagendra Gulur January 2015
Memory system design is increasingly influencing modern multi-core architectures from both performance and power perspectives. Both main memory latency and bandwidth have improved at a rate that is slower than the increase in processor core count and speed. Off-chip memory, primarily built from DRAM, has received significant attention in terms of architecture and design for higher performance. These performance improvement techniques include sophisticated memory access scheduling, use of multiple memory controllers, mitigating the impact of DRAM refresh cycles, and so on. At the same time, new non-volatile memory technologies have become increasingly viable in terms of performance and energy. These alternative technologies offer different performance characteristics as compared to traditional DRAM.
With the advent of 3D stacking, on-chip memory in the form of 3D stacked DRAM has opened up avenues for addressing the bandwidth and latency limitations of off-chip memory. Stacked DRAM is expected to offer abundant capacity — 100s of MBs to a few GBs — at higher bandwidth and lower latency. Researchers have proposed to use this capacity as an extension to main memory, or as a large last-level DRAM cache. When leveraged as a cache, stacked DRAM provides opportunities and challenges for improving cache hit rate, access latency, and off-chip bandwidth.
Thus, designing off-chip and on-chip memory systems for multi-core architectures is complex, compounded by the myriad architectural, design and technological choices, combined with the characteristics of application workloads. Applications have inherent spatial locality and access parallelism that influence the memory system response in terms of latency and bandwidth.
In this thesis, we construct an analytical model of the off-chip main memory system to comprehend this diverse space and to study the impact of memory system parameters and workload characteristics from latency and bandwidth perspectives. Our model, called ANATOMY, uses a queuing network formulation of the memory system parameterized with workload characteristics to obtain a closed form solution for the average miss penalty experienced by the last-level cache. We validate the model across a wide variety of memory configurations on four-core, eight-core and sixteen-core architectures. ANATOMY is able to predict memory latency with average errors of 8.1%, 4.1% and 9.7% over quad-core, eight-core and sixteen-core configurations respectively. Further, ANATOMY identifies better performing design points accurately, thereby allowing architects and designers to explore the more promising design points in greater detail. We demonstrate the extensibility and applicability of our model by exploring a variety of memory design choices such as the impact of clock speed, benefit of multiple memory controllers, the role of banks and channel width, and so on. We also demonstrate ANATOMY's ability to capture architectural elements such as memory scheduling mechanisms and impact of DRAM refresh cycles. In all of these studies, ANATOMY provides insight into sources of memory performance bottlenecks and is able to quantitatively predict the benefit of redressing them.
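The abstract does not give ANATOMY's equations, so the following is only a textbook stand-in showing what a closed-form queuing estimate looks like. It assumes a single M/M/1 queue (Poisson arrivals, exponential service), for which the mean time a request spends in the system is 1 / (service rate - arrival rate). The function name and the example rates are hypothetical.

```python
def mm1_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue, in the time unit of the rates.
    Generic queuing-theory formula, not the ANATOMY model itself."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


# One memory controller serving all requests, versus the same load
# split evenly across two controllers (rates in requests per time unit).
single = mm1_latency(arrival_rate=0.9, service_rate=1.0)
split = mm1_latency(arrival_rate=0.45, service_rate=1.0)
```

Even this crude model shows why latency rises sharply near saturation and why spreading requests across multiple memory controllers helps, which is the kind of design question the abstract says the full model is used to explore.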
An insight from the model suggests that the provisioning of multiple small row-buffers in each DRAM bank achieves better performance than the traditional one (large) row-buffer per bank design. Multiple row-buffers also enable newer performance improvement opportunities such as intra-bank parallelism between data transfers and row activations, and smart row-buffer allocation schemes based on workload demand. Our evaluation (both using the analytical model and detailed cycle-accurate simulation) shows that the proposed DRAM re-organization achieves significant speed-up as well as energy reduction.
Next we examine the role of on-chip stacked DRAM caches at improving performance by reducing the load on off-chip main memory. We extend ANATOMY to cover DRAM caches. ANATOMY-Cache takes into account all the key parameters/design issues governing DRAM cache organization, namely where the cache metadata is stored and accessed, the role of cache block size and set associativity, and the impact of block size on row-buffer hit rate and off-chip bandwidth. Yet the model is kept simple and provides a closed form solution for the average miss penalty experienced by the last-level SRAM cache. ANATOMY-Cache is validated against detailed architecture simulations and shown to have latency estimation errors of 10.7% and 8.8% on average in quad-core and eight-core configurations respectively. An interesting insight from the model suggests that under high load, it is better to bypass the congested DRAM cache and leverage the available idle main memory bandwidth. We use this insight to propose a refresh reduction mechanism that virtually eliminates refresh overhead in DRAM caches. We implement a low-overhead hardware mechanism to record accesses to recent DRAM cache pages and refresh only these pages. Older cache pages are considered invalid and serviced from the (idle) main memory. This technique achieves average refresh reduction of 90% with resulting memory energy savings of 9% and overall performance improvement of 3.7%.
Finally, we propose a new DRAM cache organization that achieves higher cache hit rate, lower latency and lower off-chip bandwidth demand. Called the Bi-Modal Cache, our cache organization brings three independent improvements together: (i) it enables parallel tag and data accesses, (ii) it eliminates a large fraction of tag accesses entirely by use of a novel way locator and (iii) it improves cache space utilization by organizing the cache sets as a combination of some big blocks (512B) and some small blocks (64B). The Bi-Modal Cache reduces hit latency by use of the way locator and parallel tag and data accesses. It improves hit rate by leveraging the cache capacity efficiently – blocks with low spatial reuse are allocated in the cache at 64B granularity thereby reducing both wasted off-chip bandwidth as well as cache internal fragmentation. Increased cache hit rate leads to reduction in off-chip bandwidth demand. Through detailed simulations, we demonstrate that the Bi-Modal Cache achieves overall performance improvement of 10.8%, 13.8% and 14.0% in quad-core, eight-core and sixteen-core workloads respectively over an aggressive baseline.
|
10 |
Algorithm And Architecture Design for Real-time Face Recognition. Mahale, Gopinath Vasanth. January 2016
Face recognition is a field of biometrics that deals with identification of subjects based on features present in the images of their faces. The factors that make face recognition popular and preferred over other biometric methods are its ease of operation and its ability to identify subjects without their knowledge. With these features, face recognition has become an integral part of present-day security systems, targeting a smart and secure world.
There are various factors that define the performance of a face recognition system. The most important among them are the recognition accuracy of the algorithm used and the time taken for recognition. The recognition accuracy of a face recognition algorithm is affected by changes in pose, facial expression and illumination, along with occlusions in the images. A number of algorithms have been proposed to enable recognition under these ambient changes. However, it has been hard to find a single algorithm that can efficiently recognize faces in all the above-mentioned conditions. Moreover, achieving real-time performance for most of the complex face recognition algorithms on embedded platforms has been a challenge. Real-time performance is highly preferred in critical applications such as identification of crime suspects in public. As available software solutions for face recognition (FR) have significantly large latency in recognizing individuals, they are not suitable for such critical real-time applications. This thesis focuses on the real-time aspect of FR, where acceleration of the algorithms is achieved by means of parallel hardware architectures.
The major contributions of this work are as follows. We aim to design a face recognition system that can identify up to 30 faces in each frame of video at 15 frames per second, which amounts to 450 recognitions per second. In addition, we aim to achieve good recognition accuracy along with scalability in terms of database size and input image resolution. To design a system with these specifications, as a first step, we explore algorithms in the literature and come up with a hybrid face recognition algorithm. This hybrid algorithm shows good recognition accuracy on face images with changes in illumination, pose and expression, and also with occlusions. In addition, the computations in the algorithm are modular, which makes them suitable for real-time realization through parallel processing.
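The throughput target follows from simple arithmetic, shown here as a short sketch. The function name is hypothetical, and the per-recognition time budget assumes recognitions are processed one after another, which a pipelined or parallel design need not do.

```python
def recognition_budget(max_faces_per_frame=30, frames_per_second=15):
    """Return (recognitions per second, milliseconds available per recognition)."""
    rate = max_faces_per_frame * frames_per_second
    # Budget per recognition if they were handled strictly sequentially.
    return rate, 1000.0 / rate


rate, budget_ms = recognition_budget()
print(rate, round(budget_ms, 2))
```

With the figures from the text this gives 450 recognitions per second, or a little over 2 ms per face under the sequential assumption.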
The face recognition system consists of a face detection module to detect faces in the input image, followed by a face recognition module to identify the detected faces. There are well-established algorithms and architectures for face detection in the literature which can perform detection at 15 frames per second on video frames. Detected faces of different sizes need to be scaled to the size specified by the face recognition module. To meet the real-time constraints, we propose a hardware architecture for real-time bi-cubic convolution interpolation with dynamic scaling factors. To recognize the resized faces in real time, a scalable parallel pipelined architecture is designed for the hybrid algorithm, which can perform 450 recognitions per second on a database containing grayscale images of at most 450 classes on a Virtex 6 FPGA. To provide flexibility and programmability, we extend this design to REDEFINE, a multi-core massively parallel reconfigurable architecture. In this design, we come up with FR-specific programmable cores termed Scalable Unit for Region Evaluation (SURE), capable of performing the modular computations in the hybrid face recognition algorithm. We replicate SUREs in each tile of REDEFINE to construct a face recognition module termed REDEFINE for Face Recognition using SURE Homogeneous Cores (REFRESH).
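As a software reference for the face-scaling step, bi-cubic convolution interpolation can be sketched in a few lines. This is not the proposed hardware architecture: the kernel parameter `a = -0.5`, the border clamping and the pixel-centre alignment are assumptions chosen for the sketch, and the scale factor is simply derived per call from the input and output sizes.

```python
import math


def cubic_kernel(x, a=-0.5):
    """Cubic convolution kernel; a = -0.5 is an assumed, commonly used value."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def sample_1d(values, pos):
    """Interpolate `values` at real position `pos`, clamping at the borders."""
    base = math.floor(pos)
    total = 0.0
    for k in range(-1, 3):                      # four neighbouring taps
        idx = min(max(base + k, 0), len(values) - 1)
        total += values[idx] * cubic_kernel(pos - (base + k))
    return total


def resize(image, out_h, out_w):
    """Resize a grayscale image (list of rows) with two separable 1-D passes."""
    in_h, in_w = len(image), len(image[0])
    sy, sx = in_h / out_h, in_w / out_w         # per-face scale factors
    # Horizontal pass: resample every input row to the output width.
    tmp = [[sample_1d(row, (j + 0.5) * sx - 0.5) for j in range(out_w)]
           for row in image]
    # Vertical pass: resample every column of the intermediate image.
    out = []
    for i in range(out_h):
        y = (i + 0.5) * sy - 0.5
        out.append([sample_1d([tmp[r][j] for r in range(in_h)], y)
                    for j in range(out_w)])
    return out
```

Because the scale factors are computed from each face's size, every detected face can be brought to the fixed input size of the recognition module with the same routine.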
Practical face recognition systems need to learn new, unseen faces on-line. Considering this, for real-time on-line learning of unseen face images, we design tiny processors termed VOP, Processor for Vector Operations. VOPs function as coprocessors to process elements under each tile of REDEFINE to accelerate micro vector operations appearing in the synaptic weight computations. We also explore deep neural networks, which operate similarly to processing in the human brain and are capable of working on very large face databases. We explore the field of random matrix theory to come up with a solution for synaptic weight initialization in deep neural networks for better classification. In addition, we perform design space exploration of hardware architecture for deep convolution networks and conclude with directions for future work.
|