1.
A Hybrid Network-on-Chip and Segmented Bus Architecture for Large Caches
Velayutham, Chandru, 20 April 2009 (has links)
No description available.
2.
Methods for managing cache memories with non-uniform access times
Αβραμόπουλος, Γεώργιος (Avramopoulos, Georgios), 20 September 2010 (has links)
This work is a study of the operation of cache memories, using a specific cache structure.
Its goal is to study cache memories whose access time is non-uniform across the different physical locations of the cache surface (NUCA). The objective of these cache memories is to place the most frequently used data in the positions (blocks) closest to the processor, which have fewer wire interconnections and therefore the smallest access time. Whenever this is feasible, the data used most often are accessed in the least possible amount of time.
For this purpose we chose an already proposed mechanism, which we analyzed extensively. The selection was not random: we chose a mechanism whose logic differs from the general NUCA concept, its main difference being that it completely decouples the management of the tag array from that of the data array, contrary to general-purpose NUCA memories.
Apart from studying this structure's operation as originally proposed, we introduced prediction into the tag and data array management to see how it affects performance and whether we can achieve some performance improvement.
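To make the decoupled tag/data idea above concrete, the sketch below (a minimal illustration, not the thesis's actual mechanism) models a cache whose tag array holds pointers into a separately managed data array, so a hot block can be promoted to a faster data slot by rewriting only its pointer. All class names, slot counts and latencies are hypothetical.

```python
# Minimal sketch (assumed structure, not the thesis's exact mechanism): the
# tag array maps a block address to a slot in a separately managed data
# array; "near" slots model low-latency storage close to the processor, and
# a hot block is promoted by rewriting only its tag-array pointer.

class DecoupledNUCA:
    def __init__(self, near_slots=4, far_slots=12):
        self.tags = {}                                           # block address -> data-slot index
        self.slot_latency = [2] * near_slots + [8] * far_slots   # cycles (illustrative)
        self.free_slots = list(range(near_slots + far_slots))
        self.access_count = {}                                   # block address -> accesses seen

    def access(self, addr):
        """Return the access latency for addr, allocating a data slot on a miss."""
        self.access_count[addr] = self.access_count.get(addr, 0) + 1
        if addr not in self.tags:                        # miss: allocate any free slot
            if not self.free_slots:
                victim = min(self.tags, key=lambda a: self.access_count[a])
                self.free_slots.append(self.tags.pop(victim))
            self.tags[addr] = self.free_slots.pop()
        slot = self.tags[addr]
        # Promotion: once a block looks hot, move it to a faster slot by
        # updating its pointer only; tag and data management stay decoupled.
        if self.access_count[addr] > 2 and self.slot_latency[slot] > 2:
            for s in self.free_slots:
                if self.slot_latency[s] < self.slot_latency[slot]:
                    self.free_slots.remove(s)
                    self.free_slots.append(slot)
                    self.tags[addr] = slot = s
                    break
        return self.slot_latency[slot]

if __name__ == "__main__":
    cache = DecoupledNUCA()
    for addr in [0x10, 0x20, 0x10, 0x10, 0x30, 0x10]:
        print(hex(addr), "->", cache.access(addr), "cycles")
```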
3.
Adaptive Shared Cache Migration Policy
Bien-aise, Hemsley, 20 July 2010 (has links)
No description available.
4.
ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures
Mummidi, Chandra Sekhar, 20 October 2021 (has links)
An increasing number of cores in chip multiprocessors (CMPs) results in increasing traffic to the last-level cache (LLC). Without a commensurate increase in LLC bandwidth, such traffic cannot be sustained, resulting in a loss of performance. Further, as the number of cores increases, it is necessary to scale up the LLC size; otherwise, the LLC miss rate will rise, again resulting in a loss of performance. Unfortunately, for a unified LLC with uniform cache access time, access latency increases with cache size, resulting in performance loss. Previously, researchers have proposed partitioning the cache into multiple smaller caches interconnected by a communication network, which increases aggregate cache bandwidth but causes non-uniform access latency. Such a cache architecture is called a non-uniform cache architecture (NUCA). While NUCA addresses the LLC bandwidth issue, partitioning by itself does not address the access latency problem. Consequently, researchers have previously considered data placement techniques to improve access latency. However, earlier data placement work did not account for the frequency with which specific memory references are accessed. A major reason is that the access frequency of all memory references is difficult to track. In this research, we present a hardware-assisted solution called ACTION (Adaptive Cache Block Migration) to track the access frequency of individual memory references and prioritize their placement closer to the affine core. The ACTION mechanism implements cache block migration when there is a detectable change in access frequencies due to a change in program phase. To keep the hardware overhead low, ACTION counts access references in the LLC stream using a simple and approximate method, and uses simple algorithms for placement and migration. We tested ACTION on a 4-core CMP with a 5x5 mesh LLC network implementing a partitioned D-NUCA, against workloads exhibiting distinct asymmetry in cache block access frequency. Our simulation results indicate that ACTION can improve CMP performance by as much as 8% over state-of-the-art (SOTA) solutions.
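The abstract does not give ACTION's exact counting hardware, so the following is only a hedged sketch of the general idea it describes: small, approximate (saturating) per-block access counters that trigger migration of hot blocks toward the bank nearest the requesting (affine) core, with a decay step standing in for phase-change handling. All names, thresholds and the bank model are assumptions.

```python
# Hedged sketch of frequency-guided block migration (not ACTION's actual
# hardware): small saturating counters approximate per-block access
# frequency, and a block that crosses a hotness threshold is migrated to
# the bank nearest its requesting (affine) core.  A decay step stands in
# for phase-change handling.

from collections import defaultdict

MAX_COUNT = 15        # assumed saturating-counter width (4 bits)
HOT_THRESHOLD = 8
BANKS = 4             # toy model: one near bank per core, bank id == core id

class FrequencyMigrator:
    def __init__(self):
        self.counters = defaultdict(int)   # block -> saturating access count
        self.bank_of = {}                  # block -> bank currently holding it

    def on_access(self, core, block):
        # Approximate counting: saturate instead of keeping exact counts.
        self.counters[block] = min(MAX_COUNT, self.counters[block] + 1)
        self.bank_of.setdefault(block, hash(block) % BANKS)   # static initial placement
        # Migrate once the block looks hot and is not yet in the affine core's bank.
        if self.counters[block] >= HOT_THRESHOLD and self.bank_of[block] != core:
            self.bank_of[block] = core
        return self.bank_of[block]

    def new_phase(self):
        # Halve every counter so stale frequencies fade after a phase change.
        for block in list(self.counters):
            self.counters[block] //= 2

if __name__ == "__main__":
    llc = FrequencyMigrator()
    for _ in range(10):
        llc.on_access(core=2, block=0xABC0)
    print("block 0xABC0 now resides in bank", llc.bank_of[0xABC0])  # expected: 2
```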
5.
Efficient and Scalable Cache Coherence for Many-Core Chip Multiprocessors
Ros Bardisa, Alberto, 24 September 2009 (has links)
Chip multiprocessors (CMPs) constitute the new trend for increasing the performance of future computers. In the near future, chips with tens of cores will become more popular. Nowadays, directory-based protocols constitute the best alternative for keeping cache coherence in large-scale systems. Nevertheless, directory-based protocols have two important issues that prevent them from achieving better scalability: the directory memory overhead and the long cache miss latencies.
This thesis focuses on these key issues. The first proposal is a scalable distributed directory organization that copes with the memory overhead of directory-based protocols. The second proposal presents the direct coherence protocols, which are aimed at avoiding the indirection problem of traditional directory-based protocols and, therefore, improve applications' performance. Finally, a novel mapping policy for shared but physically distributed caches is presented. This policy reduces the long access latency and ensures a uniform distribution of the data, lessening the miss rate and the number of off-chip accesses and leading to improvements in applications' execution time.
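As an illustration of what a mapping policy for shared but physically distributed caches can look like (a generic sketch, not the thesis's actual policy), the code below contrasts conventional address-interleaved bank selection with a locality-aware variant that prefers a bank near the first requester while capping per-bank load to keep data evenly distributed. The mesh size, capacity and tie-breaking are assumptions.

```python
# Hedged sketch of block-to-bank mapping in a distributed shared LLC on a
# 4x4 mesh (illustrative only, not the thesis's policy): 'interleaved' is
# the conventional address-based mapping; 'locality_aware' prefers a bank
# close to the first requester but caps per-bank load so data stays
# roughly evenly distributed.

MESH = 4                       # 4x4 tiles, one LLC bank per tile
CAPACITY = 100                 # assumed blocks per bank before spilling
load = [0] * (MESH * MESH)     # blocks currently mapped to each bank

def interleaved(block_addr):
    """Classic static mapping: low-order block-address bits select the bank."""
    return block_addr % (MESH * MESH)

def hops(bank_a, bank_b):
    ax, ay = bank_a % MESH, bank_a // MESH
    bx, by = bank_b % MESH, bank_b // MESH
    return abs(ax - bx) + abs(ay - by)   # mesh hop count

def locality_aware(block_addr, first_requester_tile):
    """Place near the first requester, spilling outward when a bank fills up."""
    candidates = sorted(range(MESH * MESH),
                        key=lambda bank: hops(bank, first_requester_tile))
    for bank in candidates:
        if load[bank] < CAPACITY:
            load[bank] += 1
            return bank
    return interleaved(block_addr)       # everything full: fall back

if __name__ == "__main__":
    print("interleaved bank    :", interleaved(0x12345))
    print("locality-aware bank :", locality_aware(0x12345, first_requester_tile=5))
```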
6.
Memory hierarchy in embedded multiprocessor systems built around networks on chip
Belhadj Amor, Hela, 05 October 2017 (has links)
Multi/many-core parallel systems that deliver high computing power at low energy cost are nowadays a reality. However, exploiting the performance of these architectures depends on how efficiently the system manages data accesses. The aim of our work is to improve the efficiency of these accesses by exploiting the characteristics of the hardware architecture.
In a first part, we propose a new cache hierarchy organization that maximizes the use of the storage space available at each level. This solution, based on non-uniform cache access (NUCA) architectures, supports inter- and intra-level transfers within the hierarchy. It requires a cache coherence protocol that suits its specifications.
The transfer of data within the hierarchy is, of course, also a determinant of system performance. In a second part, we consider the specific communication needs of the protocol. We propose a virtualized network as an ad-hoc communication medium to manage coherence traffic at lower cost. It links the caches of the same level to support intra-level transfers, which are a specificity of our protocol, in order to reduce the average access latency.
7.
Managing dynamic non-uniform cache architectures
Lira Rueda, Javier, 25 November 2011 (has links)
Researchers from both academia and industry agree that future CMPs will accommodate large shared on-chip last-level caches.
However, the exponential increase in multicore processor cache sizes accompanied by growing on-chip wire delays make it difficult
to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Access (NUCA) designs have been
proposed to address this situation. A NUCA cache divides the whole cache memory into smaller banks that are distributed along the
chip and can be accessed independently. Response time in a NUCA cache depends not only on the latency of the bank itself,
but also on the time required to reach the bank that holds the requested data and to send it back to the core. Banks located
next to the cores therefore have lower access latencies than banks that are further away, which mitigates
the effect of the cache's internal wires.
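A toy latency model makes the previous paragraph concrete: in a banked NUCA on a mesh, the total access latency is roughly the bank's own latency plus the round-trip network time, which grows with the hop distance between the requesting core and the bank. The numbers below are illustrative, not taken from the thesis.

```python
# Illustrative NUCA access-latency model (numbers are assumptions, not taken
# from the thesis): total latency = request traversal to the bank + bank
# access time + reply traversal back to the core.

BANK_LATENCY = 4    # cycles to access a bank
HOP_LATENCY = 2     # cycles per router/link hop on the on-chip mesh

def nuca_access_latency(core_xy, bank_xy):
    hops = abs(core_xy[0] - bank_xy[0]) + abs(core_xy[1] - bank_xy[1])
    return 2 * hops * HOP_LATENCY + BANK_LATENCY   # request + reply trips

if __name__ == "__main__":
    core = (0, 0)
    print("near bank (0, 1):", nuca_access_latency(core, (0, 1)), "cycles")
    print("far bank  (7, 7):", nuca_access_latency(core, (7, 7)), "cycles")
```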
These cache architectures have traditionally been classified, based on their placement decisions, as static (S-NUCA) or dynamic (D-NUCA).
In this thesis, we have focused on D-NUCA as it exploits the dynamic features that NUCA caches offer, like data migration.
The flexibility that D-NUCA provides, however, raises new challenges that make managing this kind of cache architecture in CMP systems harder. We have identified these new challenges and tackled them from the point of view of the four NUCA policies: replacement, access, placement and migration.
First, we focus on the challenges introduced by the replacement policy in D-NUCA. Data migration causes the most frequently accessed data blocks to concentrate in the banks that are closer to the processors. This creates big differences in the average usage rate of the NUCA banks: the banks close to the processors are the most accessed, while the banks further away are accessed far less often. Upon a replacement in a particular bank of the NUCA cache, the probability that the evicted data block will be reused by the program differs depending on whether its last location in the NUCA cache was a bank close to the processors or not. The decentralized nature of NUCA, however, prevents a NUCA bank from knowing that another bank is constantly evicting data blocks that are later reused. We propose three different techniques to deal with the replacement policy, The Auction being the most successful one.
Then, we deal with the challenges in the access policy. As data blocks can be mapped to multiple banks within the NUCA cache, finding the requested data in a D-NUCA cache is a difficult task. In addition, data can move freely between these banks, so the search scheme must look up all banks where the requested data block can be mapped to ascertain whether it is in the NUCA cache or not. We have proposed HK-NUCA, a search scheme that uses home knowledge to effectively reduce the average number of messages injected into the on-chip network to satisfy a memory request.
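The abstract only states that HK-NUCA uses home knowledge to reduce search messages; the sketch below shows one plausible reading of that idea (not the thesis's exact protocol): each block's home bank tracks which banks may currently hold it, so a lookup asks the home first and then multicasts only to those candidate banks instead of broadcasting to all of them.

```python
# Hedged sketch of a home-knowledge search (one plausible reading of the
# idea, not HK-NUCA's actual protocol): a block's home bank remembers which
# banks may hold it, so a miss in the local bank triggers a small multicast
# to those candidates instead of a broadcast to every bank.

NUM_BANKS = 16

class HomeDirectory:
    def __init__(self):
        self.possible_banks = {}           # block -> banks that may hold a copy

    def home_of(self, block):
        return block % NUM_BANKS           # assumed static home mapping

    def record_placement(self, block, bank):
        self.possible_banks.setdefault(block, set()).add(bank)

    def lookup(self, block, local_bank, bank_contents):
        """Return (bank holding the block or None, messages sent on the network)."""
        messages = 0
        if block in bank_contents[local_bank]:     # local check, no network message
            return local_bank, messages
        messages += 1                              # ask the home bank for its knowledge
        for bank in self.possible_banks.get(block, set()):
            messages += 1                          # multicast only to candidate banks
            if block in bank_contents[bank]:
                return bank, messages
        return None, messages                      # not on chip: go to off-chip memory

if __name__ == "__main__":
    banks = [set() for _ in range(NUM_BANKS)]
    directory = HomeDirectory()
    banks[9].add(0xBEEF)
    directory.record_placement(0xBEEF, 9)
    # 2 messages versus NUM_BANKS for a blind broadcast search.
    print(directory.lookup(0xBEEF, local_bank=3, bank_contents=banks))
```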
With regard to the placement policy, this thesis shows the implementation of a hybrid NUCA cache. We have proposed a novel
placement policy that accommodates both memory technologies, SRAM and eDRAM, in a single NUCA cache.
Finally, in order to deal with the migration policy in D-NUCA caches, we propose The Migration Prefetcher, a technique that anticipates data migrations using knowledge acquired from the access history.
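As a hedged illustration of history-based migration prefetching (an assumed mechanism, not the thesis's design), the sketch below uses a small correlation table that remembers which block tends to follow another; when a block is touched, its predicted follower is migrated toward the requesting core's bank before it is needed.

```python
# Hedged sketch of history-based migration prefetching (an assumed
# mechanism, not the thesis's design): a small correlation table remembers
# which block tends to be accessed after another, so touching block A
# migrates its predicted follower toward the requesting core's bank before
# it is actually needed.

class MigrationPrefetcher:
    def __init__(self, banks=16):
        self.banks = banks
        self.follows = {}        # block -> block observed to be accessed next
        self.last_block = None
        self.bank_of = {}        # block -> bank currently holding it

    def on_access(self, core_bank, block):
        self.bank_of.setdefault(block, hash(block) % self.banks)
        if self.last_block is not None:
            self.follows[self.last_block] = block      # learn the correlation
        predicted = self.follows.get(block)
        if predicted is not None and predicted in self.bank_of:
            self.bank_of[predicted] = core_bank        # migrate it ahead of its access
        self.last_block = block
        return self.bank_of[block]

if __name__ == "__main__":
    prefetcher = MigrationPrefetcher()
    for block in [1, 2, 1, 2, 1]:
        prefetcher.on_access(core_bank=3, block=block)
    print("block 2 now resides in bank", prefetcher.bank_of[2])   # pulled next to the core
```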
Summarizing, in this thesis we propose different techniques to efficiently manage future D-NUCA cache architectures on CMPs. We demonstrate the effectiveness of our techniques in dealing with the challenges introduced by D-NUCA caches. Our techniques outperform existing solutions in the literature, and are in most cases more energy efficient.
8.
Power Efficient Last Level Cache For Chip Multiprocessors
Mandke, Aparna, 01 1900 (has links) (PDF)
The number of processor cores and the on-chip cache size have been increasing in chip multiprocessors (CMPs). As a result, the leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch off the over-allocated cache so as to reduce the leakage power it consumes. A large cache offers non-uniform access latency to the different cores present on a CMP, and such a cache is called a “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform-access-latency caches with a single application executing on a uniprocessor. Our ideas for power-optimized caches are applicable to any memory technology and architecture for which the difference in leakage power between the on-state and off-state of an on-chip cache bank is significant.
Switching off the last-level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and the large, dispersed NUCA cache. Hence, to determine the cache requirement on a CMP, we first propose a new, highly accurate method to estimate the working set size of an application, which we call the “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus, unlike other commonly used heuristics such as average memory access latency and cache miss ratio, avoids network accesses. Our implementation gives 25% and 19% higher EDP savings than those obtained with the average memory access latency and cache miss ratio heuristics, respectively, on a static NUCA (SNUCA) platform.
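The abstract does not spell out how TWSS tags blocks, so the following is only an illustrative sketch of the approach it suggests: mark blocks the first time they are touched within a sampling interval, take the count of marked blocks as the working set size, and enable only as many ways as that estimate requires, switching the rest off to save leakage. The interval handling, set/way counts and decision rule are assumptions.

```python
# Illustrative sketch of working-set-driven way allocation (an assumed
# mechanism, not the thesis's exact TWSS hardware): a per-block "touched"
# bit is set on first use in a sampling interval; the number of touched
# blocks estimates the working set size, and only enough cache ways to
# cover that estimate stay powered on for the next interval.

SETS = 1024          # assumed number of cache sets
MAX_WAYS = 16        # assumed maximum associativity

class AdaptiveWays:
    def __init__(self):
        self.touched = set()          # blocks seen in the current interval
        self.enabled_ways = MAX_WAYS  # start fully powered

    def on_access(self, block):
        # Models setting the block's "tagged" bit on first touch.
        self.touched.add(block)

    def end_interval(self):
        wss_blocks = len(self.touched)           # estimated working set size
        ways_needed = -(-wss_blocks // SETS)     # ceil(blocks / sets)
        self.enabled_ways = max(1, min(MAX_WAYS, ways_needed))
        self.touched.clear()                     # start a new sampling interval
        return self.enabled_ways

if __name__ == "__main__":
    cache = AdaptiveWays()
    for block in range(3000):                    # a small working set
        cache.on_access(block)
    print("ways enabled for next interval:", cache.end_interval())  # -> 3
```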
Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the remaining L2 slices and switch off the mapped slices. An L2 slice comprises all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with fewer threads than available cores. In such applications, L2 slices that are farther from those threads are switched off and mapped onto L2 slices located nearer to them. By using nearer L2 slices through remapping, some applications show improved execution time in addition to reduced leakage power consumption in NUCA caches.
To estimate the maximum possible gains obtainable with the remap policy, we statically determine the near-optimal remap configuration using genetic algorithms, formulating the search as an energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% of those obtained with the near-optimal remap configuration.
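Since the abstract frames the static search as an energy-delay-product minimization solved with genetic algorithms, here is a compact, generic GA sketch of that formulation; the chromosome encoding (which L2 slice each switched-off slice maps onto) and the EDP cost model are placeholders, not the thesis's actual ones.

```python
# Generic GA sketch for the remap search as described: a chromosome assigns
# each L2 slice either to itself (kept on) or to another slice (switched off
# and remapped); fitness is a placeholder energy-delay-product (EDP) model.

import random

SLICES = 16
random.seed(0)  # reproducible toy run

def edp(config):
    """Placeholder EDP model (not the thesis's): switching slices off saves
    leakage with diminishing returns, while remapped accesses add delay."""
    off = sum(1 for i, target in enumerate(config) if target != i)
    energy = 0.5 + 0.5 * (0.8 ** off)   # leakage energy of the active slices
    delay = 1.0 + 0.02 * off            # extra hops to reach remapped slices
    return energy * delay

def random_config():
    # Each gene: the slice an access to slice i is served by (i == kept on).
    return [i if random.random() < 0.5 else random.randrange(SLICES)
            for i in range(SLICES)]

def crossover(a, b):
    cut = random.randrange(1, SLICES)
    return a[:cut] + b[cut:]

def mutate(config, rate=0.1):
    return [random.randrange(SLICES) if random.random() < rate else gene
            for gene in config]

def genetic_search(pop_size=40, generations=50):
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=edp)                    # lower EDP is fitter
        elite = population[:pop_size // 4]
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return min(population, key=edp)

if __name__ == "__main__":
    best = genetic_search()
    print("near-optimal remap config:", best)
    print("EDP estimate:", round(edp(best), 3))
```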
The energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (SNUCA and DNUCA). The suitability of a cache access policy depends on the data-sharing properties of a multi-threaded application. Hence, we propose three indices to quantify the data-sharing properties of an application and use them to predict the more suitable cache access policy, SNUCA or DNUCA, for that application.
9.
Power Efficient Last Level Cache for Chip Multiprocessors
Mandke, Aparna, January 2013 (has links) (PDF)
The number of processor cores and the on-chip cache size have been increasing in chip multiprocessors (CMPs). As a result, the leakage power dissipated in the on-chip cache has become very significant. We explore various techniques to switch off the over-allocated cache so as to reduce the leakage power it consumes. A large cache offers non-uniform access latency to the different cores present on a CMP, and such a cache is called a “Non-Uniform Cache Architecture (NUCA)”. Past studies have explored techniques to reduce leakage power for uniform-access-latency caches with a single application executing on a uniprocessor. Our ideas for power-optimized caches are applicable to any memory technology and architecture for which the difference in leakage power between the on-state and off-state of an on-chip cache bank is significant.
Switching off the last-level shared cache on a CMP is a challenging problem due to concurrently executing threads/processes and the large, dispersed NUCA cache. Hence, to determine the cache requirement on a CMP, we first propose a new, highly accurate method to estimate the working set size of an application, which we call the “tagged working set size estimation (TWSS)” method. This method has a negligible hardware storage overhead of 0.1% of the cache size. The use of TWSS is demonstrated by adaptively adjusting cache associativity. Our adaptable associative cache is scalable with respect to the number of cores present on a CMP. It uses information available locally in a tile on a tiled CMP and thus, unlike other commonly used heuristics such as average memory access latency and cache miss ratio, avoids network accesses. Our implementation gives 25% and 19% higher EDP savings than those obtained with the average memory access latency and cache miss ratio heuristics, respectively, on a static NUCA (SNUCA) platform.
Cache misses increase with reduced cache associativity. Hence, we also propose to map some of the L2 slices onto the remaining L2 slices and switch off the mapped slices. An L2 slice comprises all L2 banks in a tile. We call this technique the “remap policy”. Some applications execute with fewer threads than available cores. In such applications, L2 slices that are farther from those threads are switched off and mapped onto L2 slices located nearer to them. By using nearer L2 slices through remapping, some applications show improved execution time in addition to reduced leakage power consumption in NUCA caches.
To estimate the maximum possible gains obtainable with the remap policy, we statically determine the near-optimal remap configuration using genetic algorithms, formulating the search as an energy-delay product minimization problem. Our dynamic remap policy implementation gives energy-delay savings within an average of 5% of those obtained with the near-optimal remap configuration.
The energy-delay product can also be minimized by improving execution time, which depends mainly on the static and dynamic NUCA access policies (SNUCA and DNUCA). The suitability of a cache access policy depends on the data-sharing properties of a multi-threaded application. Hence, we propose three indices to quantify the data-sharing properties of an application and use them to predict the more suitable cache access policy, SNUCA or DNUCA, for that application.