31

Consolidating Automotive Real-Time Applications on Many-Core Platforms

Becker, Matthias January 2017 (has links)
Automotive systems have transitioned from basic transportation utilities to sophisticated systems. The rapid growth in functionality comes with a steep increase in software complexity, manifesting both in a surge in the number of functions and in the complexity of existing ones. To cope with this transition, current trends shift away from today’s distributed architectures towards integrated architectures, where previously distributed functionality is consolidated on fewer, more powerful computers. This can ease the integration process, reduce hardware complexity, and ultimately save costs. One promising hardware platform for these powerful embedded computers is the many-core processor. A many-core processor hosts a vast number of compute cores, partitioned into tiles that are connected by a Network-on-Chip. These natural partitions can provide exclusive execution spaces for different applications, since most resources are not shared among them. Hence, natural building blocks towards temporally and spatially separated execution spaces exist as a result of the hardware architecture. In addition to traditional task-local deadlines, automotive applications are often subject to timing constraints on the data propagation through a chain of semantically related tasks. Such requirements pose challenges to the system designer, as they can only be verified after system synthesis (i.e. very late in the design process). In this thesis, we present methods that transform complex timing constraints on the data propagation delay into precedence constraints between individual jobs. An execution framework for the clusters of the many-core processor is proposed that allows access to cluster-external memory while avoiding contention on shared resources by design. A partitioning and configuration of the Network-on-Chip provides isolation between the different applications and reduces the access time from the clusters to external memory. Moreover, methods that facilitate the verification of data propagation delays in each development step are provided.
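A hedged illustration of the constraint transformation this abstract describes: the Python sketch below unrolls a cause-effect chain of periodic tasks into jobs over one hyperperiod, pairs each producing job with the first consuming job that can read its output, and checks the resulting data age against an end-to-end deadline. The implicit-deadline, first-reader model and the task parameters are simplifying assumptions for illustration, not Becker's actual algorithm.

```python
import math
from functools import reduce

def hyperperiod(periods):
    # Least common multiple of all task periods.
    return reduce(lambda a, b: a * b // math.gcd(a, b), periods)

def chain_to_precedences(periods, e2e_deadline):
    """Derive job-level precedence pairs for a cause-effect chain of
    implicit-deadline periodic tasks, flagging instances whose data age
    exceeds the end-to-end deadline. Purely illustrative."""
    pairs, violations = [], []
    for j in range(hyperperiod(periods) // periods[0]):
        release = j * periods[0]            # release of the producing job
        done = (j + 1) * periods[0]         # producer finishes by its deadline
        chain_jobs = [(0, j)]
        for i in range(1, len(periods)):
            k = math.ceil(done / periods[i])  # first reader released at/after 'done'
            done = (k + 1) * periods[i]       # that reader finishes by its deadline
            pairs.append((chain_jobs[-1], (i, k)))  # job-level precedence constraint
            chain_jobs.append((i, k))
        if done - release > e2e_deadline:
            violations.append((chain_jobs, done - release))
    return pairs, violations

# Hypothetical 3-task chain, periods 5/10/20 ms, 40 ms end-to-end bound.
prec, bad = chain_to_precedences([5, 10, 20], 40)
```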
32

Design of a Distributed Transactional Memory for Many-core systems

Trigonakis, Vasileios January 2011 (has links)
The emergence of multi-/many-core systems has signified an increasing need for parallel programming. Transactional Memory (TM) is a promising programming paradigm for creating concurrent applications. To date, the design of Distributed TM (DTM) tailored to non-coherent many-core architectures is largely unexplored. This thesis addresses the topic by analysing, designing, and implementing a DTM system suitable for low-latency message-passing platforms. The resulting system, named SC-TM, the Single-Chip Cloud TM, is a fully decentralized and scalable DTM, implemented on Intel’s SCC processor: a 48-core ’concept vehicle’ created by Intel Labs as a platform for many-core software research. SC-TM is one of the first fully decentralized DTMs that guarantees starvation-freedom and the first to use an actual pluggable Contention Manager (CM) to ensure liveness. Finally, this thesis introduces three completely decentralized CMs: Offset-Greedy, a decentralized version of Greedy; Wholly, which relies on the number of completed transactions; and FairCM, which makes use of the effective transactional time. The evaluation showed that the latter outperformed the other two.
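The pluggable Contention Manager is the liveness mechanism named above. As a rough sketch of what such a plug-in boundary can look like (class and method names are invented for illustration, not SC-TM's actual API), a Greedy-style manager aborts the younger of two conflicting transactions, so the oldest transaction always makes progress:

```python
import time
from abc import ABC, abstractmethod

class ContentionManager(ABC):
    """Pluggable policy deciding which of two conflicting transactions
    must abort; swapping the subclass swaps the liveness strategy."""
    @abstractmethod
    def resolve(self, attacker, victim):
        """Return the transaction that must abort."""

class GreedyCM(ContentionManager):
    """Greedy: the younger transaction (larger start timestamp) aborts,
    so every transaction eventually becomes the oldest and finishes."""
    def resolve(self, attacker, victim):
        return attacker if attacker.start_ts > victim.start_ts else victim

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.start_ts = time.monotonic_ns()  # age used by Greedy-style CMs

cm = GreedyCM()
t1, t2 = Transaction(1), Transaction(2)
loser = cm.resolve(attacker=t2, victim=t1)  # t2 started later, so t2 aborts
```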
33

System modeling of a reconfigurable RF interconnect architecture for many-cores

Brière, Alexandre 08 December 2017 (has links)
The growing number of cores on a single chip goes hand in hand with an increase in communications. Moreover, the variety of applications running on the chip causes spatial and temporal heterogeneity in those communications. To address these issues, we present in this thesis a dynamically reconfigurable interconnect based on Radio Frequency (RF) for intra-chip communication. The use of RF provides higher bandwidth while minimizing latency, and dynamic reconfiguration of the interconnect allows the many-core chip to adapt to the variability of applications and communications. We present the rationale for choosing RF over the other emerging technologies in this domain, optics and 3D integration; the detailed architecture of the network and of a chip implementing it; and the evaluation of its feasibility and performance. During the evaluation phase we showed that for a Chip Multiprocessor (CMP) of 1,024 tiles, our solution yields a performance gain of 13%. One advantage of this RF interconnect is the ability to broadcast at no additional cost compared to point-to-point communication, opening new perspectives, notably for memory coherence management.
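The broadcast claim at the end is the interesting part; a toy comparison under the usual assumptions (a conventional NoC emulates broadcast with one unicast per destination, while a shared RF channel reaches every tuned tile in one transmission) makes the difference concrete:

```python
def unicast_messages_for_broadcast(n_tiles):
    """On a conventional NoC, a broadcast is typically emulated with
    one unicast per destination tile."""
    return n_tiles - 1

def rf_messages_for_broadcast(n_tiles):
    """On a shared RF medium, every tile tuned to the channel receives
    the same transmission, so a single message suffices."""
    return 1

n = 1024  # the tile count evaluated in the thesis
print(unicast_messages_for_broadcast(n), rf_messages_for_broadcast(n))  # 1023 vs 1
```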
34

A Dynamically Configurable Discrete Event Simulation Framework for Many-Core System-on-Chips

Barnes, Christopher J. January 2010 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Industry trends indicate that many-core heterogeneous processors will be the next-generation answer to Moore's law and reduced power consumption. Thus, both academia and industry are focused on the challenges presented by many-core heterogeneous processor designs. In many cases, researchers use discrete event simulators to research and validate new computer architecture innovations. However, there is a lack of dynamically configurable discrete event simulation environments for the testing and development of many-core heterogeneous processors. To fulfill this need, we present Mhetero, a retargetable framework for cycle-accurate simulation of heterogeneous many-core processors, along with cycle-accurate simulation of their associated network-on-chip communication infrastructure. Mhetero is the result of research into dynamically configurable and highly flexible simulation tools with which users are free to produce custom instruction sets and communication methods in a highly modular design environment. In this thesis, we discuss our approach to dynamically configurable discrete event simulation and present several experiments performed using the framework to illustrate how Mhetero, and similarly constructed simulators, may be used for future innovations.
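Mhetero's internals are not shown in the abstract; as a minimal, generic sketch of the discrete event simulation kernel that such frameworks are built around (names and structure here are illustrative, not Mhetero's API):

```python
import heapq

class DiscreteEventSim:
    """Minimal discrete event simulation kernel: a time-ordered event
    queue and a loop that pops the earliest event and runs its handler."""
    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0  # tie-breaker so equal timestamps never compare handlers

    def schedule(self, delay, handler, *args):
        heapq.heappush(self._queue, (self.now + delay, self._seq, handler, args))
        self._seq += 1

    def run(self, until):
        while self._queue and self._queue[0][0] <= until:
            self.now, _, handler, args = heapq.heappop(self._queue)
            handler(*args)

# Toy use: a core raising an interrupt 10 cycles after a NoC packet arrives.
sim = DiscreteEventSim()
sim.schedule(5, lambda: sim.schedule(10, lambda: print("interrupt @", sim.now)))
sim.run(until=100)  # prints: interrupt @ 15
```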
35

Towards Many-Core Processor Simulation on Cloud Computing Platforms

Schmidt, James Michael 23 August 2011 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Interest in and need for many-core systems have grown steadily in recent years, and industry trends are leading many-core systems to become ever larger and more complex. Because of these realities, it is important to researchers, academia, and industry that the design of these many-core systems be straightforward and comprehensive. There is a need for a many-core simulator that is simple for students to use and learn from, dynamic and capable of emulating large systems for researchers, and flexible with fast turnover for industry designers. At the same time, as many-core systems and their design have become popular and complex, the long-standing field of cloud computing has become more prevalent and feasible to use. Cloud computing platforms such as Windows Azure allow easy access to resources that in the past were simply not available to ordinary users. Large tasks can be performed in SaaS cloud computing models and accessed from a small, lightweight device using nothing more than a web browser. As a solution to the needs of designing future many-core systems, we present a many-core simulator on the Azure cloud computing platform, called the M3C Simulator. It is targeted at teaching, research, and industry, and as such needs to be easy to use, flexible, and powerful; the cloud computing service model meets all these needs. This thesis discusses the overall design of the M3C Simulator and how it leverages cloud computing resources, its simple-to-use and easy-to-understand interface layout, and its software design, including program flow and dynamic compilation.
36

High-Performance Sparse Matrix-Multi Vector Multiplication on Multi-Core Architecture

Singh, Kunal 15 August 2018 (has links)
No description available.
37

Variation Aware Energy-Efficient Methodologies for Homogeneous Many-core Designs

Srivastav, Meeta S. 30 January 2015 (has links)
Earlier designs were driven by the goal of achieving higher performance, but lately energy efficiency has emerged as an even more important design principle. Strong demand from consumer electronics drives research into low-power and energy-efficient methodologies. Moreover, with the exponential increase in the number of transistors on a chip and further technology scaling, variability in the design is now of greater concern: variations can make the design unreliable or cause it to suffer from sub-optimal performance. Through the work in this thesis, we present a multi-dimensional investigation into the design of variation-aware, energy-efficient systems. Our overarching methodology is to use system-level decisions to mitigate undesired effects originating from device-level and circuit-level issues. We first look into the impact of process variation (PV) on energy-efficient, scalable-throughput many-core DSP systems. In our proposed methodology, we leverage the benefits of aggressive voltage scaling (VS) to obtain energy efficiency while compensating for the loss in performance by exploiting the parallelism present in various DSP designs. We demonstrate that this methodology consumes 8% - 77% less power than simple dynamic VS over different workload environments. Later, we show that judicious system-level decisions, namely the number of cores and their operating voltage, can greatly mitigate the effects of PV and consequently improve the energy efficiency of the design. We also present our analysis of the impact of aging on the proposed methodology. To validate our system-level approach, design details of a prototype chip fabricated in a 90nm technology node and its findings are also presented. The chip consists of 8 homogeneous FIR cores, which are capable of running from near-threshold to nominal voltages. In the 20-chip population, we observe 7% variation in speed among all the cores at nominal voltage (0.9V) and 26% at near-threshold voltage (0.55V), as well as 54% variation in the power consumption characteristics of the cores. The chip measurement results show that our methodology of judiciously selecting the cores and their operating voltage can yield 6.27% - 28.15% more energy savings for various workload environments compared to globally voltage-scaled systems. Furthermore, we present the impact of temperature variations on the energy efficiency of these systems. We also study the problem of voltage variations in integrated circuits. We first present the characteristics of dynamic voltage noise as measured on a 28nm FPGA. We then propose a fully digital on-chip sensor that can detect fast voltage transients and alert the system to a voltage emergency. A traditional approach to mitigating this problem is to use safety guardbands; we demonstrate that our sensor system is 6% - 27.5% more power efficient than the traditional approach. / Ph. D.
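A back-of-the-envelope sketch of the core trade-off, using the classic dynamic-power relation P = C·V²·f and the common first-order assumption that frequency scales roughly linearly with voltage above threshold (the parameter values are hypothetical, not measurements from the thesis):

```python
def dynamic_power(c_eff, v, f):
    """Classic CMOS dynamic power: P = C_eff * V^2 * f."""
    return c_eff * v * v * f

# First-order illustration with hypothetical core parameters.
C, V_nom, F_nom = 1.0, 0.9, 1.0e9
V_low = 0.55                      # near-threshold point mentioned in the abstract
F_low = F_nom * (V_low / V_nom)   # ~0.61x nominal frequency, assuming f ~ V

p_one_nominal = dynamic_power(C, V_nom, F_nom)
p_two_scaled = 2 * dynamic_power(C, V_low, F_low)  # two cores share the work

throughput_gain = 2 * F_low / F_nom          # ~1.22x single-core throughput
power_ratio = p_two_scaled / p_one_nominal   # ~0.46x single-core power
print(throughput_gain, power_ratio)
```

Under these assumptions, two voltage-scaled cores deliver more throughput than one nominal core at under half the power, which is the intuition behind trading frequency for parallelism.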
38

Transforming and Optimizing Irregular Applications for Parallel Architectures

Zhang, Jing 12 February 2018 (has links)
Parallel architectures, including multi-core processors, many-core processors, and multi-node systems, have become commonplace, as it is no longer feasible to improve single-core performance by increasing its operating clock frequency. Furthermore, to keep up with the exponentially growing desire for more computational power, the number of cores/nodes in parallel architectures has continued to increase dramatically. On the other hand, many applications in well-established and emerging fields, such as bioinformatics, social network analysis, and graph processing, exhibit increasing irregularities in memory access, control flow, and communication patterns. While multiple techniques have been introduced into modern parallel architectures to tolerate these irregularities, many irregular applications still execute poorly on current parallel architectures, because their irregularities exceed the capabilities of these techniques. It is therefore critical to resolve irregularities in applications for parallel architectures. However, this is a very challenging task, as the irregularities are dynamic, and hence unknown until runtime. To optimize irregular applications, many approaches have been proposed to improve data locality and reduce irregularities through computational and data transformations. However, two major drawbacks in these existing approaches prevent them from achieving optimal performance. First, these approaches use local optimizations that exploit data locality and regularity within a single loop or kernel; in many applications, however, there is hidden locality across loops or kernels. Second, these approaches use "one-size-fits-all" methods that treat all irregular patterns equally and resolve them with a single method; yet many irregular applications have complex irregularities that are mixtures of different types and need differentiated optimizations. To overcome these two drawbacks, we propose a general methodology that includes a taxonomy of irregularities to help us analyze the irregular patterns in an application, and a set of adaptive transformations to reorder data and computation based on the characteristics of the application and architecture. By extending our adaptive data-reordering transformation on a single node, we propose a data-partitioning framework to resolve the load-imbalance problem of irregular applications on multi-node systems. Unlike existing frameworks, which use "one-size-fits-all" methods to partition the input data by a single property, our framework provides a set of operations to transform the input data by multiple properties and generates the desired data-partitioning code by composing these operations into a workflow. / Ph. D. / Irregular applications, which present unpredictable and irregular patterns of data accesses and computation, are increasingly important in well-established and emerging fields, such as biological data analysis, social network analysis, and machine learning, for dealing with large datasets. On the other hand, current parallel processors, such as multi-core CPUs (central processing units), GPUs (graphics processing units), and computer clusters (i.e., groups of connected computers), are designed for regular applications and execute irregular applications poorly. It is therefore critical to optimize irregular applications for parallel processors. However, this is a very challenging task, as the irregular patterns are dynamic, and hence unknown until application execution.
To overcome this challenge, we propose a general methodology that includes a taxonomy of irregularities to help us analyze the irregular patterns in an application, and a set of adaptive transformations to reorder data and computation, exploring hidden regularities based on the characteristics of the application and processor. We apply our methodology to several important and complex irregular applications as case studies to demonstrate that it is effective and efficient.
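To make the data-reordering idea concrete, here is a toy transformation in the same spirit (the functions are invented for illustration and are far simpler than the thesis's adaptive transformations): vertices are sorted by degree so concurrently processed items do similar work, then dealt round-robin across partitions to balance load:

```python
def reorder_by_degree(adjacency):
    """Toy data-reordering transformation: order graph vertices by
    decreasing degree so items processed together do similar amounts
    of work (one way to reduce control-flow irregularity)."""
    return sorted(adjacency, key=lambda neighbors: len(neighbors), reverse=True)

def partition_round_robin(items, n_parts):
    """Toy partitioner: deal the sorted items round-robin so each part
    receives a similar mix of heavy and light items."""
    parts = [[] for _ in range(n_parts)]
    for i, item in enumerate(items):
        parts[i % n_parts].append(item)
    return parts

# Hypothetical adjacency lists for a 6-vertex graph, split across 2 workers.
graph = [[1, 2, 3], [0], [0, 3], [0, 2, 4, 5], [3], [3]]
balanced = partition_round_robin(reorder_by_degree(graph), 2)
```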
39

Contribution to automatic parallelization: a model of a parallelizing many-core processor

Porada, Katarzyna 14 November 2017 (has links)
The pursuit of faster and more powerful machines started with the first computers. After exhausting increases in clock frequency, manufacturers turned to another solution and began placing multiple cores on a chip. Today's computational model is based on OS threads, exploited through different languages offering parallel constructs. However, multithreaded programming remains a delicate art, because parallel computation split into threads suffers from a major flaw: it is non-deterministic. Nonetheless, it is possible to compute in a parallel yet deterministic way if we replace the thread model with a model built on the partial order of dependencies. In this thesis, we present an alternative architectural model exploiting the Instruction Level Parallelism (ILP) naturally present in programs. We propose many techniques to remove most of the architectural dependencies, which yields an ILP that grows with the execution length. The ILP reached this way is sufficient to feed several thousand cores. With the serializing architectural dependencies eliminated, the ILP can be exploited far better than in current architectures. A VHDL design at the RTL level has been implemented to measure the benefits of the architecture. Synthesis results for a processor ranging from 2 to 64 cores show that the speed of the proposed hardware remains constant and its area grows linearly with the number of cores, proving that the proposed interconnect model is scalable.
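The quantity driving this design is the ILP exposed by the dependence partial order. A small sketch (hypothetical trace and helper, not the thesis's measurement infrastructure) estimates it as instruction count divided by the dependence graph's critical-path length:

```python
def ilp(num_instructions, deps):
    """Estimate instruction-level parallelism as total instructions
    divided by the critical-path length of the dependence graph.
    deps maps an instruction index to the indices it depends on;
    indices are assumed to be in topological (program) order."""
    depth = [1] * num_instructions
    for i in range(num_instructions):
        for d in deps.get(i, []):
            depth[i] = max(depth[i], depth[d] + 1)
    return num_instructions / max(depth)

# Toy trace: 6 instructions, where instruction 2 depends on 0 and 1, etc.
print(ilp(6, {2: [0, 1], 3: [2], 4: [2], 5: [3, 4]}))  # 6 / 4 = 1.5
```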
40

Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium

January 2012 (has links)
In continuation of a successful series of events, the 4th Many-core Applications Research Community (MARC) Symposium took place at the HPI in Potsdam on December 8 and 9, 2011. Over 60 researchers from different fields presented their work on many-core hardware architectures, their programming models, and the resulting research questions for the upcoming generation of heterogeneous parallel systems.
