Global ETD Search

31	Methods for Creating and Exploiting Data Locality Wallin, Dan January 2006 (has links) The gap between processor speed and memory latency has led to the use of caches in the memory systems of modern computers. Programs must use the caches efficiently and exploit data locality for maximum performance. Multiprocessors, built from many processing units, are becoming commonplace not only in large servers but also in smaller systems such as personal computers. Multiprocessors require careful data locality optimizations since accesses from other processors can lead to invalidations and false sharing cache misses. This thesis explores hardware and software approaches for creating and exploiting temporal and spatial locality in multiprocessors. We propose the capacity prefetching technique, which efficiently reduces the number of cache misses but avoids false sharing by distinguishing between cache lines involved in communication from non-communicating cache lines at run-time. Prefetching techniques often lead to increased coherence and data traffic. The new bundling technique avoids one of these drawbacks and reduces the coherence traffic in multiprocessor prefetchers. This is especially important in snoop-based systems where the coherence bandwidth is a scarce resource. Most of the studies have been performed on advanced scientific algorithms. This thesis demonstrates that a cc-NUMA multiprocessor, with hardware data migration and replication optimizations, efficiently exploits the temporal locality in such codes. We further present a method of parallelizing a multigrid Gauss-Seidel partial differential equation solver, which creates temporal locality at the expense of increased communication. Our conclusion is that on modern chip multiprocessors, it is more important to optimize algorithms for data locality than to avoid communication, since communication can take place using a shared cache. data locality temporal locality spatial locality prefetching cache cache behavior cache coherence snooping protocols partial differential equation shared-memory multiprocessor chip multiprocessor simulation Computer engineering Datorteknik
32	Système distribué à adressage global et cohérence logicielle pourl’exécution d’un modèle de tâche à flot de données / Distributed runtime system with global address space and software cache coherence for a data-flow task model Gindraud, François 11 January 2018 (has links) Les architectures distribuées sont fréquemment utilisées pour le calcul haute performance (HPC). Afin de réduire la consommation énergétique, certains fabricants de processeurs sont passés d’architectures multi-cœurs en mémoire partagée aux MPSoC. Les MPSoC (Multi-Processor System On Chip) sont des architectures incluant un système distribué dans une puce.La programmation des architectures distribuées est plus difficile que pour les systèmes à mémoire partagée, principalement à cause de la nature distribuée de la mémoire. Une famille d’outils nommée DSM (Distributed Shared Memory) a été développée pour simplifier la programmation des architectures distribuées. Cette famille inclut les architectures NUMA, les langages PGAS, et les supports d’exécution distribués pour graphes de tâches. La stratégie utilisée par les DSM est de créer un espace d’adressage global pour les objets du programme, et de faire automatiquement les transferts réseaux nécessaires lorsque ces objets sont utilisés. Les systèmes DSM sont très variés, que ce soit par l’interface fournie, les fonctionnalités, la sémantique autour des objets globalement adressables, le type de support (matériel ou logiciel), ...Cette thèse présente un nouveau système DSM à support logiciel appelé Givy. Le but de Givy est d’exécuter sur des MPSoC (MPPA) des programmes sous la forme de graphes de tâches dynamiques, avec des dépendances de flot de données (data-flow ). L’espace d’adressage global (GAS) de Givy est indexé par des vrais pointeurs, contrairement à de nombreux autres systèmes DSM à support logiciel : les pointeurs bruts du langage C sont valides sur tout le système distribué. Dans Givy, les objets globaux sont les blocs de mémoire fournis par malloc(). Ces blocs sont répliqués entre les nœuds du système distribué, et sont gérés par un protocole de cohérence de cache logiciel nommé Owner Writable Memory. Le protocole est capable de déplacer ses propres métadonnées, ce qui devrait permettre l’exécution efficace de programmes irréguliers. Le modèle de programmation impose de découper le programme en tâches créées dynamiquement et annotées par leurs accès mémoire. Ces annotations sont utilisées pour générer les requêtes au protocole de cohérence, ainsi que pour fournir des informations à l’ordonnanceur de tâche (spatial et temporel).Le premier résultat de cette thèse est l’organisation globale de Givy. Une deuxième contribution est la formalisation du protocole Owner Writable Memory. Le troisième résultat est la traduction de cette formalisation dans le langage d’un model checker (Cubicle), et les essais de validation du protocole. Le dernier résultat est la réalisation et explication détaillée du sous-système d’allocation mémoire : le choix de pointeurs bruts en tant qu’index globaux nécessite une intégration forte entre l’allocateur mémoire et le protocole de cohérence de cache. / Distributed systems are widely used in HPC (High Performance Computing). Owing to rising energy concerns, some chip manufacturers moved from multi-core CPUs to MPSoC (Multi-Processor System on Chip), which includes a distributed system on one chip.However distributed systems – with distributed memories – are hard to program compared to more friendly shared memory systems. A family of solutions called DSM (Distributed Shared Memory) systems has been developed to simplify the programming of distributed systems. DSM systems include NUMA architectures, PGAS languages, and distributed task runtimes. The common strategy of these systems is to create a global address space of some kind, and automate network transfers on accesses to global objects. DSM systems usually differ in their interfaces, capabilities, semantics on global objects, implementation levels (hardware / software), ...This thesis presents a new software DSM system called Givy. The motivation of Givy is to execute programs modeled as dynamic task graphs with data-flow dependencies on MPSoC architectures (MPPA). Contrary to many software DSM, the global address space of Givy is indexed by real pointers: raw C pointers are made global to the distributed system. Givy global objects are memory blocks returned by malloc(). Data is replicated across nodes, and all these copies are managed by a software cache coherence protocol called Owner Writable Memory. This protocol can relocate coherence metadata, and thus should help execute irregular applications efficiently. The programming model cuts the program into tasks which are annotated with memory accesses, and created dynamically. Memory annotations are used to drive coherence requests, and provide useful information for scheduling and load-balancing.The first contribution of this thesis is the overall design of the Givy runtime. A second contribution is the formalization of the Owner Writable Memory coherence protocol. A third contribution is its translation in a model checker language (Cubicle), and correctness validation attempts. The last contribution is the detailed allocator subsystem implementation: the choice of real pointers for global references requires a tight integration between memory allocator and coherence protocol. Système distribué Protocole de cohérence de cache Support d'exécution Multi-Coeurs Modèle mémoire Distributed systems Cache coherence protocol Runtime Manycore Memory model 004
33	Performance Analysis of Complex Shared Memory Systems Molka, Daniel 10 March 2017 (has links) Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to which extend the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy as well as waiting times caused by memory accesses is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which can be used to identify the events that provide a reasonable assessment for the utilization of the respective component and the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node leads to complex systems with vastly different performance characteristics of memory accesses depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations. info:eu-repo/classification/ddc/004 ddc:004
34	High-Performance Network-on-Chip Design for Many-Core Processors Wang, Boqian January 2020 (has links) With the development of on-chip manufacturing technologies and the requirements of high-performance computing, the core count is growing quickly in Chip Multi/Many-core Processors (CMPs) and Multiprocessor System-on-Chip (MPSoC) to support larger scale parallel execution. Network-on-Chip (NoC) has become the de facto solution for CMPs and MPSoCs in addressing the communication challenge. In the thesis, we tackle a few key problems facing high-performance NoC designs. For general-purpose CMPs, we encompass a full system perspective to design high-performance NoC for multi-threaded programs. By exploring the cache coherence under the whole system scenario, we present a smart communication service called Advance Virtual Channel Reservation (AVCR) to provide a highway to target packets, which can greatly reduce their contention delay in NoC. AVCR takes advantage of the fact that we can know or predict the destination of some packets ahead of their arrival at the Network Interface (NI). Exploiting the time interval before a packet is ready, AVCR establishes an end-to-end highway from the source NI to the destination NI. This highway is built up by reserving the Virtual Channel (VC) resources ahead of the target packet transmission and offering priority service to flits in the reserved VC in the wormhole router, which can avoid the target packets’ VC allocation and switch arbitration delay. Besides, we also propose an admission control method in NoC with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node using the network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node where the admission control will be applied. For application-specific MPSoCs, we focus on developing high-performance NoC and NI compatible with the common AMBA AXI4 interconnect protocol. To offer the possibility of utilizing the AXI4 based processors and peripherals in the on-chip network based system, we propose a whole system architecture solution to make the AXI4 protocol compatible with the NoC based communication interconnect in the many-core system. Due to possible out-of-order transmission in the NoC interconnect, which conflicts with the ordering requirements specified by the AXI4 protocol, in the first place, we especially focus on the design of the transaction ordering units, realizing a high-performance and low cost solution to the ordering requirements. The microarchitectures and the functionalities of the transaction ordering units are also described and explained in detail for ease of implementation. Then, we focus on the NI and the Quality of Service (QoS) support in NoC. In our design, the NI is proposed to make the NoC architecture independent from the AXI4 protocol via message format conversion between the AXI4 signal format and the packet format, offering high flexibility to the NoC design. The NoC based communication architecture is designed to support high-performance multiple QoS schemes. The NoC system contains Time Division Multiplexing (TDM) and VC subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags and the NI is responsible for traffic distribution between two subnetworks. Besides, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during packets’ round-trip transfer in NoC. / Med utvecklingen av tillverkningsteknologi av on-chip och kraven på högpresterande da-toranläggning växer kärnantalet snabbt i Chip Multi/Many-core Processors (CMPs) ochMultiprocessor Systems-on-Chip (MPSoCs) för att stödja större parallellkörning. Network-on-Chip (NoC) har blivit den de facto lösningen för CMP:er och MPSoC:er för att mötakommunikationsutmaningen. I uppsatsen tar vi upp några viktiga problem med hög-presterande NoC-konstruktioner.Allmänna CMP:er omfattas ett fullständigt systemperspektiv för att design högprester-ande NoC för flertrådad program. Genom att utforska cachekoherensen under hela system-scenariot presenterar vi en smart kommunikationstjänst, AVCR (Advance Virtual ChannelReservation) för att tillhandahålla en motorväg till målpaket, vilket i hög grad kan min-ska deras förseningar i NoC. AVCR utnyttjar det faktum att vi kan veta eller förutsägadestinationen för vissa paket före deras ankomst till nätverksgränssnittet (Network inter-face, NI). Genom att utnyttja tidsintervallet innan ett paket är klart, etablerar AVCRen ände till ände motorväg från källan NI till destinationen NI. Denna motorväg byggsupp genom att reservera virtuell kanal (Virtual Channel, VC) resurser före målpaket-söverföringen och erbjuda prioriterade tjänster till flisar i den reserverade VC i wormholerouter. Dessutom föreslår vi också en tillträdeskontrollmetod i NoC med en centraliseradartificiellt neuronät (Artificial Neural Network, ANN) tillträdeskontroll, som kan förbättrasystemets prestanda genom att förutsäga den mest lämpliga injektionshastigheten för varjenod via nätverksprestationsinformationen. I onlinekontrollprocessen används en förbehan-dlingsenhet på data för att förenkla ANN-arkitekturen och göra förutsägningsresultatenmer korrekta. Baserat på den förbehandlade informationen bestämmer ANN-prediktornkontrollstrategin och sänder den till varje nod där tillträdeskontrollen kommer att tilläm-pas.För applikationsspecifika MPSoC:er fokuserar vi på att utveckla högpresterande NoCoch NI kompatibla med det gemensamma AMBA AXI4 protokoll. För att erbjuda möj-ligheten att använda AXI4-baserade processorer och kringutrustning i det on-chip baseradenätverkssystemet föreslår vi en hel systemarkitekturlösning för att göra AXI4 protokolletkompatibelt med den NoC-baserade kommunikation i det multikärnsystemet. På grundav den out-of-order överföring i NoC, som strider mot ordningskraven som anges i AXI4-protokollet, fokuserar vi i första hand på utformningen av transaktionsordningsenheterna,för att förverkliga en hög prestanda och låg kostnad-lösning på ordningskraven. Sedanfokuserar vi på NI och Quality of Service (QoS)-stödet i NoC. I vår design föreslås NI attgöra NoC-arkitekturen oberoende av AXI4-protokollet via meddelandeformatkonverteringmellan AXI4 signalformatet och paketformatet, vilket erbjuder NoC-designen hög flexi-bilitet. Den NoC-baserade kommunikationsarkitekturen är utformad för att stödja fleraQoS-schema med hög prestanda. NoC-systemet innehåller Time-Division Multiplexing(TDM) och VC-subnät för att tillämpa flera QoS-scheman på AXI4-signaler med olikaQoS-taggar och NI ansvarar för trafikdistribution mellan två subnät. Dessutom tillämpasen QoS-arvsmekanism i slav-sidan NI för att stödja QoS under paketets tur-returöverföringiNoC / <p>QC 20201008</p> Network-on-Chip Chip Multi/Many-core Processors Multiprocessor System-on-Chip High-Performance Computing Cache Coherence Virtual Channel Reservation Admission Control Artificial Neural Network AXI4 Quality of Service Network-on-Chip Chip Multi/Many-core Processors Multiprocessor Sys-tem on a Chip High-Performance Computing Cache Coherence Virtual Channel Reser-vation Admission Control Artificial Neural Network AXI4 Quality of Servic Computer Engineering Datorteknik Computer Systems Datorsystem
35	Efficient and Scalable Cache Coherence for Many-Core Chip Multiprocessors Ros Bardisa, Alberto 24 September 2009 (has links) La nueva tendencia para aumentar el rendimiento de los futuroscomputadores son los multiprocesadores en un solo chip (CMPs). Seespera que en un futuro cercano salgan al mercado CMPs con decenas deprocesadores. Hoy en dï¿½a, la mejor manera de mantener la coherencia decache en estos sistemas es mediante los protocolos basados endirectorio. Sin embargo, estos protocolos tienen dos grandesproblemas: una gran sobrecarga de memoria y una alta latencia de losfallos de cache.Esta tesis se ha centrado en estos problemas claves para la eficienciay escalabilidad del CMP. En primer lugar, se ha presentado unaorganizaciï¿½n de directorios escalable. En segundo lugar, se hanpropuesto los protocolos de coherencia directa, que evitan laindirecciï¿½n al nodo home y, por tanto, reducen el tiempo de ejecuciï¿½nde las aplicaciones. Por ï¿½ltimo, se ha desarrollado una polï¿½tica demapeo para caches compartidas pero fï¿½sicamente distribuidas, quereduce la latencia de acceso y garantiza una distribuciï¿½n uniforme delos datos con el fin de reducir su tasa de fallos. Esto se traducefinalmente en un menor tiempo de ejecuciï¿½n para las aplicaciones. / Chip multiprocessors (CMPs) constitute the new trend for increasingthe performance of future computers. In the near future, chips withtens of cores will become more popular. Nowadays, directory-basedprotocols constitute the best alternative to keep cache coherence inlarge-scale systems. Nevertheless, directory-based protocols have twoimportant issues that prevent them from achieving better scalability:the directory memory overhead and the long cache miss latencies.This thesis focuses on these key issues. The first proposal is ascalable distributed directory organization that copes with the memoryoverhead of directory-based protocols. The second proposal presentsthe direct coherence protocols, which are aimed at avoiding theindirection problem of traditional directory-based protocols and,therefore, they improve applications' performance. Finally, a novelmapping policy for distributed caches is presented. This policyreduces the long access latency while lessening the number of off-chipaccesses, leading to improvements in applications' execution time. directory protocols scalability cache coherence Chip multiprocessors NUCA caches latencia de acceso coherencia directa indirecciï¿½n protocolos de directorio escalabilidad coherencia de cache Multiprocesadores en un solo chip indirection direct coherence access latency NUCA caches Arquitectura de computadores 004
36	Projeto e implementa??o de uma plataforma MP-SoC usando SystemC Rego, Rodrigo Soares de Lima S? 19 May 2006 (has links) Made available in DSpace on 2014-12-17T15:47:57Z (GMT). No. of bitstreams: 1 RodrigoSLSR.pdf: 1278461 bytes, checksum: ac21fe12bc1ce120cf688ba59e4bf754 (MD5) Previous issue date: 2006-05-19 / This work presents the concept, design and implementation of a MP-SoC platform, named STORM (MP-SoC DirecTory-Based PlatfORM). Currently the platform is composed of the following modules: SPARC V8 processor, GPOP processor, Cache module, Memory module, Directory module and two different modles of Network-on-Chip, NoCX4 and Obese Tree. All modules were implemented using SystemC, simulated and validated, individually or in group. The modules description is presented in details. For programming the platform in C it was implemented a SPARC assembler, fully compatible with gcc s generated assembly code. For the parallel programming it was implemented a library for mutex managing, using the due assembler s support. A total of 10 simulations of increasing complexity are presented for the validation of the presented concepts. The simulations include real parallel applications, such as matrix multiplication, Mergesort, KMP, Motion Estimation and DCT 2D / Este trabalho apresenta o conceito, desenvolvimento e implementa??o de uma plataforma MP-SoC, batizada STORM (MP-SoC DirecTory-Based PlatfORM). A plataforma atualmente ? composta pelos seguintes m?dulos: processador SPARC V8, processador GPOP, m?dulo de Cache, m?dulo de Mem?ria, m?dulo de Diret?rio e dois diferentes modelos de Network-on-Chip, a NoCX4 e a ?rvore Obesa. Todos os m?dulos foram implementados usando a linguagem SystemC, simulados e validados, tanto separadamente quanto em conjunto. A descri??o dos m?dulos ? apresentada em detalhes. Para a programa??o da plataforma usando C foi implementado um montador SPARC, totalmente compat?vel com o c?digo assembly gerado pelo compilador gcc. Para a programa??o concorrente foi implementada uma biblioteca de fun??es para gerenciamento de mutexes, com o devido suporte por parte do montador. S?o apresentadas 10 simula??es do sistema, de complexidade crescente, para valida??o de todos os conceitos apresentados. As simula??es incluem aplica??es paralelas reais, como a multiplica??o de matrizes, Mergesort, KMP, Estima??o de Movimento e DCT 2D System-on-Chip Network-on-Chip Projeto baseado em plataforma Coer?ncia de cache Diret?rio SPARC ?rvore obesa Processamento paralelo System-on-Chip Network-on-Chip Platform-based design Cache coherence Directory SPARC Obese tree Parallel processing

Page generated in 0.0552 seconds