Spelling suggestions: "subject:"chip multiprocessor""
11 |
UnSync: A Soft Error Resilient Redundant CMP ArchitectureJanuary 2011 (has links)
abstract: Reducing device dimensions, increasing transistor densities, and smaller timing windows, expose the vulnerability of processors to soft errors induced by charge carrying particles. Since these factors are inevitable in the advancement of processor technology, the industry has been forced to improve reliability on general purpose Chip Multiprocessors (CMPs). With the availability of increased hardware resources, redundancy based techniques are the most promising methods to eradicate soft error failures in CMP systems. This work proposes a novel customizable and redundant CMP architecture (UnSync) that utilizes hardware based detection mechanisms (most of which are readily available in the processor), to reduce overheads during error free executions. In the presence of errors (which are infrequent), the always forward execution enabled recovery mechanism provides for resilience in the system. The inherent nature of UnSync architecture framework supports customization of the redundancy, and thereby provides means to achieve possible performance-reliability trade-offs in many-core systems. This work designs a detailed RTL model of UnSync architecture and performs hardware synthesis to compare the hardware (power/area) overheads incurred. It then compares the same with those of the Reunion technique, a state-of-the-art redundant multi-core architecture. This work also performs cycle-accurate simulations over a wide range of SPEC2000, and MiBench benchmarks to evaluate the performance efficiency achieved over that of the Reunion architecture. Experimental results show that, UnSync architecture reduces power consumption by 34.5% and improves performance by up to 20% with 13.3% less area overhead, when compared to Reunion architecture for the same level of reliability achieved. / Dissertation/Thesis / M.S. Computer Science 2011
|
12 |
Implementation of a Hardware-Optimized MPI Library for the SCMP MultiprocessorPoole, Jeffrey Hyatt 16 August 2004 (has links)
As time progresses, computer architects continue to create faster and more complex microprocessors using techniques such as out-of-order execution, branch prediction, dynamic scheduling, and predication. While these techniques enable greater performance, they also increase the complexity and silicon area of the design. This creates larger development and testing times. The shrinking feature sizes associated with newer technology increase wire resistance and signal propagation delays, further complicating large designs. One potential solution is the Single-Chip Message-Passing (SCMP) Parallel Computer, developed at Virginia Tech. SCMP makes use of an architecture where a number of simple processors are tiled across a single chip and connected by a fast interconnection network. The system is designed to take advantage of thread-level parallelism and to keep wire traces short in preparation for even smaller integrated circuit feature sizes.
This thesis presents the implementation of the MPI (Message-Passing Interface) communications library on top of SCMP's hardware communication support. Emphasis is placed on the specific needs of this system with regards to MPI. For example, MPI is designed to operate between heterogeneous systems; however, in the SCMP environment such support is unnecessary and wastes resources. The SCMP network is also designed such that messages can be sent with very low latency, but with cooperative multitasking it is difficult to assure a timely response to messages. Finally, the low-level network primitives have no support for send operations that occur before the receiver is prepared and that functionality is necessary for MPI support. / Master of Science
|
13 |
High Performance Applications for the Single-Chip Message-Passing Parallel ComputerDickenson, William Wesley 05 May 2004 (has links)
Computer architects continue to push the limits of modern microprocessors. By using techniques such as out-of-order execution, branch prediction, and dynamic scheduling, designers have found ways to speed execution. However, growing architectural complexity has led to unsustained development and testing times. Shrinking feature sizes are causing increased wire resistances and signal propagation, thereby limiting a design's scalability. Indeed, the method of exploiting instruction-level parallelism (ILP) within applications is reaching a point of diminishing returns.
One approach to the aforementioned challenges is the Single-Chip Message-Passing (SCMP) Parallel Computer, developed at Virginia Tech. SCMP is a unique, tiled architecture aimed at thread-level parallelism (TLP). Identical cores are replicated across the chip, and global wire traces have been eliminated. The nodes are connected via a 2-D grid network and each contains a local memory bank.
This thesis presents the design and analysis of three high-performance applications for SCMP. The results show that the architecture proves itself as a formidable opponent to several current systems. / Master of Science
|
14 |
Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip MultiprocessorsLodde, Mario 04 February 2014 (has links)
La jerarquía de caches y la red en el chip (NoC) son dos componentes clave de los chip multiprocesadores (CMPs). La mayoría del trafico en la NoC se debe a mensajes que las caches envían según lo que establece el protocolo de coherencia. La cantidad de trafico, el porcentaje de mensajes cortos y largos y el patrón de trafico en general varían dependiendo de la geometría de las caches y del protocolo de coherencia. La arquitectura de la NoC y la jerarquía de caches están de hecho firmemente acopladas, y estos dos componentes deben ser diseñados y evaluados conjuntamente para estudiar como el variar uno afecta a las prestaciones del otro. Además, cada componente debe ajustarse a los requisitos y a las oportunidades del otro, y al revés. Normalmente diferentes clases de mensajes se envían por diferentes redes virtuales o por NoCs con diferente ancho de banda, separando mensajes largos y cortos. Sin embargo, otra clasificación de los mensajes se puede hacer dependiendo del tipo de información que proveen: algunos mensajes, como las peticiones de datos, necesitan campos para almacenar información (dirección del bloque, tipo de petición, etc.); otros, como los mensajes de reconocimiento (ACK), no proporcionan ninguna información excepto por el ID del nodo destino: solo proveen una información de tipo temporal, en el sentido que la recepción de un ACK indica que el nodo fuente ha recibido el mensaje al que está contestando con el ACK y completado todas las operaciones determinadas por el protocolo de coherencia. Esta segunda clase de mensaje no necesita de mucho ancho de banda: la latencia es mucho mas importante, dado que el nodo destino esta típicamente bloqueado esperando la recepción de ellos.
En este trabajo de tesis se desarrolla una red dedicada para trasmitir la segunda clase de mensajes; la red es muy sencilla y rápida, y permite la entrega de los ACKs con una latencia de pocos ciclos de reloj. Reduciendo la latencia y el trafico en la NoC debido a los ACKs, es posible:
-acelerar la fase de invalidación en fase de escritura en un sistema que usa un protocolo de coherencia basado en directorios
-mejorar las prestaciones de un protocolo de coerencia basado en broadcast, hasta llegar a prestaciones comparables con las de un protocolo de directorios pero sin el coste de área debido a la necesidad de almacenar el directorio
-implementar un mapeado dinámico de bloques a las caches de ultimo nivel de forma eficiente, con el objetivo de acercar cuanto al máximo los bloques a los cores que los utilizan
El objetivo final es obtener un co-diseño de NoC y jerarquía de caches que minimice los problemas de escalabilidad de los protocolos de coherencia. Como gran objetivo final, se pretende la implementación de un CMP con ubicación dinámica de los recursos de cache y red, tal que estos recursos se puedan particionar de forma eficiente e independiente para asignar diferentes particiones a diferentes aplicaciones en un entorno virtualizado. / Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325
|
15 |
Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose ProcessorsJanuary 2018 (has links)
abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.
Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.
Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.
Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.
In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2018
|
16 |
Hardware Techniques for High-Performance Transactional Memory in Many-Core Chip Multiprocessors / Técnicas Hardware para Sistemas de Memoria Transaccional de Alto Rendimiento en Procesadores MultinúcleoTitos Gil, José Rubén 08 November 2011 (has links)
Esta tesis investiga la implementación hardware eficiente de los sistemas de memoria transaccional (HTM) en un chip multiprocesador escalable (CMP), identificando aspectos que limitan el rendimiento y proponiendo técnicas que solventan dichas patologías. Las contribuciones de la tesis son varios diseños HTM complementarios que alcanzan un rendimiento robusto y evitan comportamientos patológicos, mediante la introducción de flexibilidad y adaptabilidad, sin que dichas técnicas apenas supongan un incremento en la complejidad del sistema global. Esta disertación considera tanto sistemas HTM de política ansiosa como aquellos diseñados bajo el enfoque perezoso, y afrontamos las sobrecargas en el rendimiento que son inherentes a cada política.
Quizá la contribución más relevante de esta tesis es ZEBRA, un sistema HTM de política híbrida que adapta su comportamiento en función de las características dinámicas de la carga de trabajo. / This thesis focuses on the hardware mechanisms that provide optimistic concurrency control with guarantees of atomicity and isolation, with the intent of achieving high-performance across a variety of workloads, at a reasonable cost in terms of design complexity.
This thesis identifies key inefficiencies that impact the performance of several hardware implementations of TM, and proposes mechanisms to overcome such limitations. In this dissertation we consider both eager and lazy approaches to HTM system design, and address important sources of overhead that are inherent to each policy. This thesis presents a hybrid-policy, adaptable HTM system that combines the advantages of both eager and lazy approaches in a low complexity design.
Furthermore, this thesis investigates the overheads of the simpler, fixed-policy HTM designs that leverage a distributed directory-based coherence protocol to detect data races over a scalable interconnect, and develops solutions that address some performance degrading factors.
|
17 |
Efficient synchronization and communication in many-core chip multiprocessorsAbellán Miguel, José Luis 21 December 2012 (has links)
En esta tesis hemos identificado tres de los mayores cuellos de botella para el rendimiento y escalabilidad de las arquitecturas many-core CMP de memoria compartida. En particular, los mecanismos de sincronización de barrera y cerrojo cuando presentan alta contención, así como los protocolos hardware de coherencia de caché en el mantenimiento de la coherencia del uso de bloques memoria compartidos en una jerarquía de memoria. Para paliar estas deficiencias y aprovechar más el rendimiento de estas arquitecturas, hemos propuesto tres mecanismos hardware: GBarrier, para un mecanismo de barreras eficiente; GLock, para un manejo justo y eficiente de la contención en el acceso a las secciones críticas protegidas por cerrojos; y ECONO, un protocolo de coherencia muy simple que aporta gran eficiencia a bajo costo. La tesis concluye que nuestras propuestas resuelven de manera eficiente los problemas de rendimiento derivados de implementaciones ineficientes para sincronización
y coherencia en arquitecturas many-core CMP. / In this thesis we have identified three of the major problems that restrict efficiency and scalability in future shared-memory tiled many-core CMPs. In particular, the synchronization operations of barriers and locks under highly-contended scenarios, and the hardware-based cache coherence protocols when dealing with the maintenance of coherence of all memory blocks across all levels of a memory hierarchy. To alleviate such performance bottlenecks in order to harness the computational power of such systems, we have proposed three hardware-based mechanisms: GBarrier, a very efficient barrier mechanism; GLock, an efficient and fair mechanism to implement highly-contended locks; and ECONO, a simple and efficient hardware coherence protocol. In light of our performance results obtained in this thesis, we can affirm that our proposals represent a step forward towards the resolution of the challenges that many-core CMP architectures will pose to computer architects.
|
18 |
A Hardware Framework for Yield and Reliability Enhancement in Chip MultiprocessorsPan, Abhisek 01 January 2009 (has links) (PDF)
Device reliability and manufacturability have emerged as dominant concerns in end-of-road CMOS devices. Today an increasing number of hardware failures are attributed to device reliability problems that cause partial system failure or shutdown. Also maintaining an acceptable manufacturing yield is seen as challenge because of smaller feature sizes, process variation, and reduced headroom for burn-in tests. In this project we investigate a hardware-based scheme for improving yield and reliability of a homogeneous chip multiprocessor (CMP). The proposed solution involves a hardware framework that enables us to utilize the redundancies inherent in a multi-core system to keep the system operational in face of partial failures due to hard faults (faults due to manufacturing defects or permanent faults developed during system lifetime). A micro-architectural modification allows a faulty core in a multiprocessor system to use another core as a coprocessor to service any instruction that the former cannot execute correctly by itself. This service is accessed to improve yield and reliability, but at the cost of some loss of performance. In order to quantify this loss we have used a cycle-accurate architectural simulator to simulate the performance of dual-core and quad-core systems with one or more cores sustaining partial failure. Simulation studies indicate that when a large and sparingly-used unit such as a floating point unit fails in a core, even for a floating point intensive benchmark, we can continue to run the faulty core with as little as 10% performance impact and minimal area overhead. Incorporating this recovery mechanism entails some modifications in the microprocessor micro-architecture. The modifications are also described here through a simplified model of a superscalar processor.
|
19 |
Διαχείριση κοινών πόρων σε πολυπύρηνους επεξεργαστέςΑλεξανδρής, Φωκίων 27 June 2012 (has links)
Οι σύγχρονες τάσεις της Επιστήμης Σχεδιασμού των Υπολογιστικών Συστημάτων έχουν υιοθετήσει την χρήση των Κρυφών Μνημών ή Μνημών Cache, αποβλέποντας στην απόκρυψη της Καθυστέρησης της Κύριας Μνήμης των Συστημάτων (Memory Latency) και την γεφύρωση του χάσματος της απόδοσης του Επεξεργαστή και της Κύριας Μνήμης (Processor – Memory Performance Gap). Οι Μνήμες Cache έτσι έχουν αποκτήσει αδιαμφισβήτητα πρωτεύοντα ρόλο στην Ιεραρχία Μνήμης των Ηλεκτρονικών Υπολογιστών.
Οι νέες τάσεις Σχεδιασμού ανέδειξαν την Έννοια του Παραλληλισμού σε πρωτεύοντα ρόλο. Αρχικά διερευνήθηκε ο Παραλληλισμός Επιπέδου Εντολών, ωστόσο η αύξηση της Απόδοσης των Υπολογιστών σύντομα έφτασε ένα μέγιστο. Την τελευταία δεκαετία το κέντρο του ενδιαφέροντος των σχεδιαστών έχει και πάλι μετατοπιστεί, καθώς ένας νέος τύπος Επεξεργαστών έχει εισέλθει στο προσκήνιο, οι Πολυπύρηνοι Επεξεργαστές, ή όπως είναι αλλιώς γνωστοί on-chip Multiprocessors (CMP). Αυτές οι εξελίξεις, σε συνδυασμό με την ολοένα αυξανόμενη πολυπλοκότητα της “συμπεριφοράς” των εκτελούμενων Εφαρμογών, ώθησαν το σχεδιαστικό ενδιαφέρον προς την εκμετάλλευση ενός νεοσύστατου τύπου Παραλληλισμού. Ο Παραλληλισμός Επιπέδου Μνήμης ή Memory Level Parallelism (MLP) αποτελεί τα τελευταία χρόνια, το πλέον ισχυρό μέσο αύξησης της απόδοσης των Υπολογιστικών Συστημάτων και μαζί με τους Πολυπύρηνους Επεξεργαστές θα κυριαρχήσει στο προσκήνιο των εξελίξεων τα επόμενα χρόνια.
Σκοπός της παρούσας Διπλωματικής Εργασίας είναι η ανάπτυξη ενός Στατιστικού – Πιθανοτικού Μοντέλου για μελέτη και πρόβλεψη των φαινομένων που αναπτύσσονται σε Μνήμες Cache, στις οποίες αποθηκεύονται δεδομένα από εκτελούμενες Εφαρμογές, με έντονο Παραλληλισμό Επιπέδου Μνήμης. Θα οριστεί ένας Εκτιμητής του Φόρτου που επιβάλλεται στο Σύστημα, από φαινόμενα Παραλληλισμού Επιπέδου Μνήμης (MLP). Στην συνέχεια, με βάση το Μοντέλο που αναπτύσσουμε, θα διερευνηθεί ένα ικανοποιητικό σύνολο Εφαρμογών, και θα εξαχθεί μια Εκτίμηση – Πρόβλεψη για τον Φόρτο (MLP) του Συστήματος. Εφόσον οι Προβλέψεις μας κριθούν επιτυχής, το Μοντέλο Πρόβλεψης Φόρτου MLP που αναπτύξαμε, μπορεί να αποτελέσει χρήσιμο Εργαλείο στα χέρια των Σχεδιαστών που ασχολούνται με την αύξηση της Απόδοσης των Σύγχρονων Υπολογιστικών Συστημάτων. / -
|
20 |
Διαχείριση κοινόχρηστων πόρων σε πολυεπεξεργαστικά συστήματα ενός ολοκληρωμένουΠετούμενος, Παύλος 06 October 2011 (has links)
Στην παρούσα διατριβή προτείνονται μέθοδοι διαχείρισης των κοινόχρηστων πόρων σε υπολογιστικά συστήματα όπου πολλαπλοί επεξεργαστές μοιράζονται το ίδιο ολοκληρωμένο (Chip Multiprocessors – CMPs). Ενώ μέχρι πρόσφατα ο σχεδιασμός ενός υπολογιστικού συστήματος στόχευε στην ικανοποίηση των απαιτήσεων μόνο μίας εφαρμογής ανά χρονική περίοδο, τώρα πια απαιτείται και η εξισορρόπηση των απαιτήσεων διαφορετικών εφαρμογών που ανταγωνίζονται για την κατοχή των ίδιων πόρων. Σε πολλές περιπτώσεις, όμως, αυτό δεν αρκεί από μόνο του. Ακόμη και αν επιτευχθεί κάποιος ιδανικός διαμοιρασμός του πόρου, αν δεν βελτιστοποιηθεί ο τρόπος με τον οποίο χρησιμοποιούν οι επεξεργαστές τον κοινόχρηστο πόρο, δεν θα καταφέρει να εξυπηρετήσει ικανοποιητικά το αυξημένο φορτίο. Για να αντιμετωπιστούν τα προβλήματα που πηγάζουν από τον διαμοιρασμό των κοινόχρηστων πόρων, στην παρούσα εργασία προτείνονται τρεις εναλλακτικοί μηχανισμοί διαχείρισης.
Η πρώτη μεθοδολογία εισάγει μία νέα θεωρητική μοντελοποίηση του διαμοιρασμού της κρυφής μνήμης, η οποία μπορεί να χρησιμοποιηθεί παράλληλα με την εκτέλεση των προγραμμάτων που διαμοιράζονται την κρυφή μνήμη. Η μεθοδολογία αξιοποιεί στην συνέχεια αυτήν την μοντελοποίηση, για να ελέγξει τον διαμοιρασμό της κρυφής μνήμης και να επιτύχει δικαιοσύνη στο πως κατανέμεται ο χώρος της κρυφής μνήμης μεταξύ
των επεξεργαστών.
Η δεύτερη μεθοδολογία παρουσιάζει μία νέα τεχνική για την πρόβλεψη της τοπικότητας των προσπελάσεων της κρυφής μνήμης. Καθώς η τοπικότητα είναι η βασική παράμετρος που καθορίζει την χρησιμότητα των δεδομένων της κρυφής μνήμης, χρησιμοποιώντας αυτήν την τεχνική πρόβλεψης μπορούν να οδηγηθούν μηχανισμοί διαχείρισης που βελτιώνουν την αξιοποίηση του χώρου της κρυφής μνήμης. Στα πλαίσια της μεθοδολογίας παρουσιάζουμε έναν τέτοιο μηχανισμό, ο οποίος στοχεύει στην ελαχιστοποίηση των αστοχιών της κρυφής μνήμης μέσω μίας νέας πολιτικής αντικατάστασης.
Η τελευταία μεθοδολογία που παρουσιάζεται είναι μία μεθοδολογία για την μείωση της κατανάλωσης ενέργειας της ουράς εντολών, που είναι μία από τις πιο ενεργειακά απαιτητικές δομές του επεξεργαστή. Στα πλαίσια της μεθοδολογίας, δείχνεται ότι το κλειδί για την αποδοτική μείωση της κατανάλωσης ενέργειας της ουράς εντολών βρίσκεται στην αλληλεπίδραση της με το υποσύστημα μνήμης. Με βάση αυτό το συμπέρασμα, παρουσιάζουμε έναν νέο μηχανισμό δυναμικής διαχείρισης του μεγέθους της ουράς εντολών, ο οποίος συνδυάζει επιθετική μείωση της κατανάλωσης ενέργειας του επεξεργαστή με διατήρηση της υψηλής απόδοσής του. / This dissertation proposes methodologies for the management of shared resources in chip multi-processors (CMP). Until recently, the design of a computing system had to satisfy the computational and storage needs of a single program during each time period. Now instead, the designer has to balance the, perhaps conflicting, needs of multiple programs competing for the same resources. But, in many cases, even this is not enough. Even if we could invent a perfect way to manage sharing, without optimizing the way that each processor uses the shared resource, the resource could not deal efficiently with the increased load. In order to handle the negative effects of resource sharing, this dissertation proposes three management mechanisms.
The first one introduces a novel theoretical model of the sharing of the shared cache, which can be used at run-time. Furthermore, out methodology uses the model to control sharing and to achieve a sense of justice in the way the cache is shared among the processors.
Our second methodology presents a new technique for predicting the locality of cache accesses. Since locality determines, almost entirely, the usefulness of cache data, our technique can be used to drive any management mechanism which strives to improve the efficiency of the cache. As part of our methodology, we present such a mechanism, a new cache replacement policy which tries to minimize cache misses by near-optimal replacement decisions.
The last methodology presented in this dissertation, targets the energy consumption of the processor. To that end, our methodology shows that the key to reducing the power consumption of the Issue Queue, without disproportional performance degradation, lies at the interaction of the Issue Queue with the memory subsystem: as long as the management of the Issue Queue doesn’t reduce the utilization of the memory subsystem, the effects of the management on the processor’s performance will be minimal. Based on this conclusion, we introduce a new mechanism for dynamically resizing the Issue Queue, which achieves aggressive downsizing and energy savings with almost no performance degradation.
|
Page generated in 0.0682 seconds