Global ETD Search

1	On the simulation and design of manycore CMPs Thompson, Christopher Callum January 2015 (has links) The progression of Moore’s Law has resulted in both embedded and performance computing systems which use an ever increasing number of processing cores integrated in a single chip. Commercial systems are now available which provide hundreds of cores, and academics have proposed architectures for up to 1024 cores. Embedded multicores are increasingly popular as it is easier to guarantee hard-realtime constraints using individual cores dedicated for tasks, than to use traditional time-multiplexed processing. However, finding the optimal hardware configuration to meet these requirements at minimum cost requires extensive trial and error approaches to investigate the design space. This thesis tackles the problems encountered in the design of these large scale multicore systems by first addressing the problem of fast, detailed micro-architectural simulation. Initially addressing embedded systems, this work exploits the lack of hardware cache-coherence support in many deeply embedded systems to increase the available parallelism in the simulation. Then, through partitioning the NoC and using packet counting and cycle skipping reduces the amount of computation required to accurately model the NoC interconnect. In combination, this enables simulation speeds significantly higher than the state of the art, while maintaining less error, when compared to real hardware, than any similar simulator. Simulation speeds reach up to 370MIPS (Million (target) Instructions Per Second), or 110MHz, which is better than typical FPGA prototypes, and approaching final ASIC production speeds. This is achieved while maintaining an error of only 2.1%, significantly lower than other similar simulators. The thesis continues by scaling the simulator past large embedded systems up to 64-1024 core processors, adding support for coherent architectures using the same packet counting techniques along with low overhead context switching to enable the simulation of such large systems with stricter synchronisation requirements. The new interconnect model was partitioned to enable parallel simulation to further improve simulation speeds in a manner which did not sacrifice any accuracy. These innovations were leveraged to investigate significant novel energy saving optimisations to the coherency protocol, processor ISA, and processor micro-architecture. By introducing a new instruction, with the name wait-on-address, the energy spent during spin-wait style synchronisation events can be significantly reduced. This functions by putting the core into a low-power idle state while the cache line of the indicated address is monitored for coherency action. Upon an update or invalidation (or traditional timer or external interrupts) the core will resume execution, but the active energy of running the core pipeline and repeatedly accessing the data and instruction caches is effectively reduced to static idle power. The thesis also shows that existing combined software-hardware schemes to track data regions which do not require coherency can adequately address the directory-associativity problem, and introduces a new coherency sharer encoding which reduces the energy consumed by sharer invalidations when sharers are grouped closely together, such as would be the case with a system running many tasks with a small degree of parallelism in each. The research concludes by using the extremely fast simulation speeds developed to produce a large set of training data, collecting various runtime and energy statistics for a wide range of embedded applications on a huge diverse range of potential MPSoC designs. This data was used to train a series of machine learning based models which were then evaluated on their capacity to predict performance characteristics of unseen workload combinations across the explored MPSoC design space, using only two sample simulations, with promising results from some of the machine learning techniques. The models were then used to produce a ranking of predicted performance across the design space, and on average Random Forest was able to predict the best design within 89% of the runtime performance of the actual best tested design, and better than 93% of the alternative design space. When predicting for a weighted metric of energy, delay and area, Random Forest on average produced results within 93% of the optimum result. In summary this thesis improves upon the state of the art for cycle accurate multicore simulation, introduces novel energy saving changes the the ISA and microarchitecture of future multicore processors, and demonstrates the viability of machine learning techniques to significantly accelerate the design space exploration required to bring a new manycore design to market. 004
2	Exploration d'architecture d'accélérateurs à mémoire distribuée / Design space exploration of distributed-memory accelerators Busseuil, Rémi 04 December 2012 (has links) Bien que le développement actuel d'accélérateurs se concentre principalement sur la création de puces Multiprocesseurs (MPSoC) hétérogènes, c'est-à-dire composés de processeurs spécialisées, de nombreux acteurs de la microélectronique s'intéressent au développement d'un autre type de MPSoC, constitué d'une grille de processeurs identiques. Ces MPSoC homogènes, bien que composés de processeurs énergétiquement moins efficaces, possèdent une programmabilité et une flexibilité plus importante que les MPSoC hétérogènes, ce qui favorise notamment l'adaptation du système à la charge demandée, et offre un espace de solutions de configuration potentiellement plus vaste et plus simple à contrôler. C'est dans ce contexte que s'inscrit cette thèse, en exposant la création d'une architecture MPSoC homogène scalable (c'est-à-dire dont la mise à l'échelle des performances est linéaire), ainsi que le développement de différents systèmes d'adaptation et de programmation sur celle-ci.Cette architecture, constituée d'une grille de processeurs de type MicroBlaze, possédant chacun sa propre mémoire, au sein d'un Réseau sur Puce 2D, a été développée conjointement avec un système d'exploitation temps réel (RTOS) spécialisé et modulaire. Grâce à la création d'une pile de communication complexe, plusieurs mécanismes d'adaptation ont été mis en œuvre : une migration de tâche « avec redirection de données », permettant de diminuer l'impact de cette migration avec des applications de type flux de données, ainsi qu'un mécanisme dit « d'exécution distante ». Ce dernier consiste non plus à migrer le code instruction d'une mémoire à une autre, mais de conserver le code dans sa mémoire initiale et de le faire exécuter par un processeur distinct. Les différentes expériences réalisées avec ce mécanisme ont permis de souligner la meilleure réactivité de celui-ci face à la migration de tâche, tout en possédant des performances d'adaptation plus faible.Ce dernier mécanisme a conduit naturellement à la création d'un modèle de programmation de type « mémoire partagée » au sein de l'architecture. La mise en place de ce dernier nécessitait la création d'un mécanisme de cohérence mémoire, qui a été réalisé de façon matérielle/logicielle et scalable par l'intermédiaire du développement de la librairie PThread. Les performances ainsi obtenues mettent en évidence les avantages d'un MPSoC homogène tout en utilisant une programmation « classique » de type multiprocesseur. / Although the accelerators market is dominated by heterogeneous MultiProcessor Systems-on-Chip (MPSoC), i.e. with different specialized processors, a growing interest is put on another type of MPSoC, composed by an array of identical processors. Even if these processors achieved lower performance to power ratio, the better flexibility and programmability of these homogeneous MPSoC allow an easier adaptation to the load, and offer a wider space of configurations. In this context, this thesis exposes the development of a scalable homogeneous MPSoC – i.e. with linear performance scaling – and different kind of adaptive mechanisms and programming model on it.This architecture is based on an array of MicroBlaze-like processors, each having its own memory, and connected through a 2D NoC. A modular RTOS was build on top of it. Thanks to a complex communication stack, different adaptive mechanisms were made: a “redirected data” task migration mechanism, reducing the impact of the migration mechanism for data-flow applications, and a “remote execution” mechanism. Instead of migrate the instruction code from a memory to another, this last consists in only migrate the execution, keeping the code in its initial memory. The different experiments shows faster reactivity but lower performance of this mechanism compared to migration.This development naturally led to the creation of a shared memory programming model. To achieve this, a scalable hardware/software memory consistency and cache coherency mechanism has been made, through the PThread library development. Experiments show the advantage of using NoC based homogeneous MPSoC with a brand programming model. Cohérence de cache Mémoire Distribuée MPSoC Cache coherency Distributed memory MPSoC
3	Cache Coherency for Symmetric Multiprocessor Systems on Programmable Chips Hung, Austin January 2004 (has links) Rapid progress in the area of Field-Programmable Gate Arrays (FPGAs) has led to the availability of softcore processors that are simple to use, and can enable the development of a fully working system in minutes. This has lead to the enormous popularity of System-On-Programmable-Chip (SOPC) computing platforms. These softcore processors, while relatively simple compared to their leading-edge hardcore counterparts, are often designed with a number of advanced performance-enhancing features, such as instruction and data caches. Moreover, they are designed to be used in a uniprocessor or uncoupled multiprocessor architecture, and not in a tightly-coupled multiprocessing architecture. As a result, traditional cache-coherency protocols are not suitable for use with such systems. This thesis describes a system for enforcing cache coherency on symmetric multiprocessing (SMP) systems using softcore processors. A hybrid protocol that incorporates hardware and software to enforce cache coherency is presented. Electrical & Computer Engineering FPGA SOPC SMP cache coherency
4	Cache Coherency for Symmetric Multiprocessor Systems on Programmable Chips Hung, Austin January 2004 (has links) Rapid progress in the area of Field-Programmable Gate Arrays (FPGAs) has led to the availability of softcore processors that are simple to use, and can enable the development of a fully working system in minutes. This has lead to the enormous popularity of System-On-Programmable-Chip (SOPC) computing platforms. These softcore processors, while relatively simple compared to their leading-edge hardcore counterparts, are often designed with a number of advanced performance-enhancing features, such as instruction and data caches. Moreover, they are designed to be used in a uniprocessor or uncoupled multiprocessor architecture, and not in a tightly-coupled multiprocessing architecture. As a result, traditional cache-coherency protocols are not suitable for use with such systems. This thesis describes a system for enforcing cache coherency on symmetric multiprocessing (SMP) systems using softcore processors. A hybrid protocol that incorporates hardware and software to enforce cache coherency is presented. Electrical & Computer Engineering FPGA SOPC SMP cache coherency
5	Αρχιτεκτονική προσομοίωση σε επεξεργαστικές μονάδες υψηλού βαθμού παραλληλίας Στρίκος, Νικόλαος 11 January 2011 (has links) Η πρόσφατη εξάπλωση που είδε το μοντέλο της παράλληλης επεξεργασίας στους μικροεπεξεργαστές γενικής χρήσης με την εισαγωγή περισσότερων από έναν πυρήνες εντός του ολοκληρωμένου κυκλώματος έφερε νέες απαιτήσεις στις μεθόδους προσομοίωσης που παραδοσιακά χρησιμοποιήθηκαν για την εξερεύνηση νέων αρχιτεκτονικών. Στην εργασία αυτή προτείνεται ένα πλαίσιο και ένα προγραμματιστικό μοντέλο που κάνει χρήση της αρχιτεκτονικής υψηλού βαθμού παραλληλίας CUDA για να επιτύχει επιτάχυνση στην αρχιτεκτονική προσομοίωση πρωτοκόλλων συνοχής κρυφής μνήμης. / The recent adoption of the parallel computing model in general-use microprocessors with the inclusion of more than one cores in the IC has raised new demands for the simulation methodologies that have been traditionally used. In this work, a framework and a programming model are proposed that make use of the highly parallel CUDA platform to accelerate architectural simulation of cache coherency protocols. Μικροεπεξεργαστές Κρυφή μνήμη 005.275 GPU CUDA Cache coherency protocols Parallel simulation
6	Efficient synchronization and communication in many-core chip multiprocessors Abellán Miguel, José Luis 21 December 2012 (has links) En esta tesis hemos identificado tres de los mayores cuellos de botella para el rendimiento y escalabilidad de las arquitecturas many-core CMP de memoria compartida. En particular, los mecanismos de sincronización de barrera y cerrojo cuando presentan alta contención, así como los protocolos hardware de coherencia de caché en el mantenimiento de la coherencia del uso de bloques memoria compartidos en una jerarquía de memoria. Para paliar estas deficiencias y aprovechar más el rendimiento de estas arquitecturas, hemos propuesto tres mecanismos hardware: GBarrier, para un mecanismo de barreras eficiente; GLock, para un manejo justo y eficiente de la contención en el acceso a las secciones críticas protegidas por cerrojos; y ECONO, un protocolo de coherencia muy simple que aporta gran eficiencia a bajo costo. La tesis concluye que nuestras propuestas resuelven de manera eficiente los problemas de rendimiento derivados de implementaciones ineficientes para sincronización y coherencia en arquitecturas many-core CMP. / In this thesis we have identified three of the major problems that restrict efficiency and scalability in future shared-memory tiled many-core CMPs. In particular, the synchronization operations of barriers and locks under highly-contended scenarios, and the hardware-based cache coherence protocols when dealing with the maintenance of coherence of all memory blocks across all levels of a memory hierarchy. To alleviate such performance bottlenecks in order to harness the computational power of such systems, we have proposed three hardware-based mechanisms: GBarrier, a very efficient barrier mechanism; GLock, an efficient and fair mechanism to implement highly-contended locks; and ECONO, a simple and efficient hardware coherence protocol. In light of our performance results obtained in this thesis, we can affirm that our proposals represent a step forward towards the resolution of the challenges that many-core CMP architectures will pose to computer architects. sincronización hardware hardware synchronization coherencia de cache cache coherency multiprocesadores chip multiprocessors programación paralela parallel programming Arquitectura de computadores 62
7	Conception d'une architecture extensible pour le calcul massivement parallèle / Designing a scalable architecture for massively parallel computing Kaci, Ania 14 December 2016 (has links) En réponse à la demande croissante de performance par une grande variété d’applications (exemples : modélisation financière, simulation sub-atomique, bio-informatique, etc.), les systèmes informatiques se complexifient et augmentent en taille (nombre de composants de calcul, mémoire et capacité de stockage). L’accroissement de la complexité de ces systèmes se traduit par une évolution de leur architecture vers une hétérogénéité des technologies de calcul et des modèles de programmation. La gestion harmonieuse de cette hétérogénéité, l’optimisation des ressources et la minimisation de la consommation constituent des défis techniques majeurs dans la conception des futurs systèmes informatiques.Cette thèse s’adresse à un domaine de cette complexité en se focalisant sur les sous-systèmes à mémoire partagée où l’ensemble des processeurs partagent un espace d’adressage commun. Les travaux porteront essentiellement sur l’implémentation d’un protocole de cohérence de cache et de consistance mémoire, sur une architecture extensible et sur la méthodologie de validation de cette implémentation.Dans notre approche, nous avons retenu les processeurs 64-bits d’ARM et des co-processeurs génériques (GPU, DSP, etc.) comme composants de calcul, les protocoles de mémoire partagée AMBA/ACE et AMBA/ACE-Lite ainsi que l’architecture associée « CoreLink CCN » comme solution de départ. La généralisation et la paramètrisation de cette architecture ainsi que sa validation dans l’environnement de simulation Gem5 constituent l’épine dorsale de cette thèse.Les résultats obtenus à la fin de la thèse, tendent à démontrer l’atteinte des objectifs fixés / In response to the growing demand for performance by a wide variety of applications (eg, financial modeling, sub-atomic simulation, bioinformatics, etc.), computer systems become more complex and increase in size (number of computing components, memory and storage capacity). The increased complexity of these systems results in a change in their architecture towards a heterogeneous computing technologies and programming models. The harmonious management of this heterogeneity, resource optimization and minimization of consumption are major technical challenges in the design of future computer systems.This thesis addresses a field of this complexity by focusing on shared memory subsystems where all processors share a common address space. Work will focus on the implementation of a cache coherence and memory consistency on an extensible architecture and methodology for validation of this implementation.In our approach, we selected processors 64-bit ARM and generic co-processor (GPU, DSP, etc.) as components of computing, shared memory protocols AMBA / ACE and AMBA / ACE-Lite and associated architecture "CoreLink CCN" as a starting solution. Generalization and parameterization of this architecture and its validation in the simulation environment GEM5 are the backbone of this thesis.The results at the end of the thesis, tend to demonstrate the achievement of objectives Systèmes informatiques Mémoire partagée Cohérence de cache Consistance mémoire Modélisation niveau transactionnel Réseaux d’interconnexion Computing systems Shared memory Cache coherency Memory consistancy Transactional level modeling Interconnection networks

1

Page generated in 0.0664 seconds