Spelling suggestions: "subject:"cache memories"" "subject:"vache memories""
1 |
Αποτίμηση των τεχνικών μείωσης της δυναμικής ισχύος σε κρυφές μνήμες στο περιβάλλον UnisimΓάκη, Μαρία 09 February 2009 (has links)
Οι μνήμες αποτελούν την κύρια ανησυχία σε αρχιτεκτονικές χαμηλής κατανάλωσης και υψηλών ταχυτήτων. Σε έναν επεξεργαστή SOC (System on chip)περιορίζουν τις περισσότερες φορές την ταχύτητα και αποτελούν το κύριο μέρος της κατανάλωσης ενέργειας.Διάφορες τεχνικές έχουν προταθεί για τη μείωση της ισχύος σε κρυφές μνήμες. Στην παρούσα εργασία παρουσιάζονται τεχνικές μείωσης της δυναμικής ισχύος με μείωση της παράλληλης διακοπτικής δραστηριότητας. Οι τεχνικές αυτές αναπτύχθηκαν στη κρυφή μνήμη του Cellsim εξομοιωτή, βασισμένο στη λογική του Unisim εξομοιωτή, ενώ όλα τα ενεργειακά αποτελέσματα εξήχθησαν με το εργαλείο Cacti.
Η εφαρμογή των συγκεκριμένων τεχνικών επέφερε σημαντικές μειώσεις στην κατανάλωση της δυναμικής ισχύος της κρυφής μνήμης για όλα τα μετροπρογράμματα που χρησιμοποιήθηκαν. / Memories are the main concern in low power and high speed architecture. In a Soc (System on chip) processor memories most of the time limit the speed and become the main part of power consumption. Various techniques have been proposed for power reduction in cache memories. In this thesis are presented different power reduction techniques by reducing parallel switching activity. These techniques were developped in the cache memory of Cellsim emulator, based on the logic of Unisim emulator, while all power results were extracted with the Cacti tool.
The application of the specified techniques brought serious reductions in power consumption of cache memory for all the benchmarks that were used.
|
2 |
Μείωση της κατανάλωσης ισχύος σε διασυνδετικά μέσα εντός ολοκληρωμένου χρησιμοποιώντας τεχνικές φιλτραρίσματος / Reduction of power consumption in on-chip interconnection networks with filtering techniquesΟικονόμου, Ιωάννης 23 January 2012 (has links)
Η πρόοδος της τεχνολογίας CMOS δίνει τη δυνατότητα σχεδιασμού φθηνών, πολυπύρηνων, κοινής μνήμης, ενσωματωμένων επεξεργαστών. Ωστόσο, η υποστήριξη της συνάφειας της κρυφής μνήμης με κάποια μέθοδο που παρουσιάζει καλή κλιμάκωση απαιτεί σημαντική προσπάθεια. Τα πρωτόκολλα υποκλοπής παρέχουν μία λύση εύκολη στο σχεδιασμό, όμως είναι απαιτητικά σε εύρος ζώνης και κατανάλωση. Επιπλέον, η κλιμάκωσή τους είναι περιορισμένη όταν χρησιμοποιούνται σε αρτηρίες. Τα πρωτόκολλα που κάνουν χρήση ευρετηρίου, ειδικά τα κατανεμημένα, επιφέρουν μικρότερη επιβάρυνση στο δίκτυο. Απαιτούν όμως ελεγκτές ευρετηρίων οι οποίοι είναι δύσκολοι στο σχεδιασμό και καταναλώνουν πολύτιμη μνήμη, επιφάνεια και κατανάλωση εντός του ολοκληρωμένου, κάνοντάς τη λύση αυτή ακατάλληλη για ενσωματωμένα πολυπύρηνα συστήματα.
Στην εργασία αυτή, παρουσιάζουμε ένα μηχανισμό διατήρησης της συνάφειας ο οποίος παρουσιάζει καλή κλιμάκωση, και βασίζεται σε απλά πρωτόκολλα υποκλοπής, πάνω όμως σε ένα ιεραρχικό δίκτυο σημείο προς σημείο. Για να μειωθούν δραματικά τα μηνύματα που στέλνονται με ευρεία εκπομπή, προτείνουμε τα Χρονολογικά Φίλτρα, μια λύση βασισμένη στα φίλτρα Bloom. Σε αντίθεση με προηγούμενες προσεγγίσεις, τα Χρονολογικά Φίλτρα (Temporal Filters - TF) είναι εφοδιασμένα με ένα μοναδικό χαρακτηριστικό: την ικανότητα να σβήνουν τα περιεχόμενά τους σε συγχρονισμό - αλλά χωρίς να επικοινωνούν - με τις κρυφές μνήμες. Τα Χρονολογικά Φίλτρα και οι κρυφές μνήμες σβήνουν τα περιεχόμενά τους βασισμένα στις ενέργειες που γίνονται για τη διατήρηση της συνάφειας, παρέχοντας ασφαλές φιλτράρισμα ορισμένων μηνυμάτων του πρωτοκόλλου συνάφειας. Με τον τρόπο αυτό, ξεπερνάμε το πρόβλημα της αφαίρεσης στοιχείων των φίλτρων Bloom, χωρίς τη χρήση επιπλέον μετρητών, μηνυμάτων ή σημάτων, όπως έχουν προταθεί σε προηγούμενες εργασίες. Όλα τα παραπάνω γίνονται χωρίς καμία τροποποίηση των πρωτοκόλλων συνάφειας της κρυφής μνήμης. Ως αποτέλεσμα, η λύση που προτείνεται στην εργασία αυτή, χρησιμοποιεί μικρές δομές που μπορούν να ενσωματωθούν εύκολα στους μεταγωγείς του μέσου διασύνδεσης.
Για την αποτίμηση των μηχανισμών που προτείνουμε, χρησιμοποιήθηκε το περιβάλλον προσομοίωσης GEMS - για να μοντελοποιηθούν πολυπύρηνοι επεξεργαστές εντός ολοκληρωμένου με 8 και 16 πυρήνες, με ιδιωτικές κρυφές μνήμες πρώτου και δευτέρου επιπέδου - και η σουίτα μετροπρογραμμάτων SPLASH-2. Τα Χρονολογικά Φίλτρα αποδείχτηκαν ικανά να μειώσουν έως και κατά 74.7\% (κατά μέσο όρο) τα μηνύματα στο μέσο διασύνδεσης. Επιπλέον, τα Χρονολογικά Φίλτρα προσφέρουν τη δυνατότητα μείωσης της στατικής κατανάλωσης, καθώς χρησιμοποιείται η τεχνική Decay στις κρυφές μνήμες. / Advances in CMOS technology are enabling the design of inexpensive, multicore, shared-memory, embedded processors. However, supporting cache coherence in a scalable fashion in these architectures requires considerable effort. Snoop protocols provide an easy-to-design solution but they are greedy bandwidth and power consumers. In addition, their scalability is limited over a broadcast bus. Scalable directory protocols, especially distributed ones, remedy the bandwidth overhead but require hard-to-design directory controllers that consume precious on-chip storage, area, and power, rendering the solution unattractive for embedded multicores.
In this work we advocate a scalable coherence solution based on simple broadcast snooping protocols but over a scalable hierarchical point-to-point network. To dramatically cut down on broadcasts we propose Temporal Filtering, a solution based on Bloom filters - a storage-efficient memory structure. In contrast to previous approaches, Temporal Filters (TFs) are equipped with a unique characteristic: the ability to self-clean their contents in concert - but without communicating - with caches. Both TFs and caches decay their contents based on coherence activity, guaranteeing the correctness of coherence filtering. In this way, we overcome the problem of entry removal in the Bloom filters without the need of extra counters, messages, or even extra signals as in previous work and, more importantly, without requiring changes in the underlying cache snoop protocols. As a result, our solution utilizes frugal single-bit structures that can be easily integrated into network switches.
For our evaluation we use GEMS to model a 8- and 16-core CMP with private L1/L2 caches of various sizes, and the SPLASH-2 suite. TFs are proven able to reduce the 74.7\% (arithmetic average) of the network messages. In addition, TFs offer also leakage saving opportunities since cache decay is also applied in private caches.
|
3 |
Ανάπτυξη τεχνικής αύξησης της αξιοπιστίας των κρυφών μνημών πρώτου επιπέδου βασισμένη στη χωρική τοπικότητα των μπλοκ μνήμηςΜαυρόπουλος, Μιχαήλ 16 May 2014 (has links)
Στην παρούσα διπλωματική εργασία θα ασχοληθούμε με το πρόβλημα της αξιοπιστίας των κρυφών μνημών δεδομένων και εντολών πρώτου επιπέδου. Η υψηλή πυκνότητα ολοκλήρωσης και η υψηλή συχνότητα λειτουργίας των σύγχρονων ολοκληρωμένων κυκλωμάτων έχει οδηγήσει σε σημαντικά προβλήματα αξιοπιστίας, που οφείλονται είτε στην κατασκευή, είτε στη γήρανση των ολοκληρωμένων κυκλωμάτων. Στην παρούσα εργασία γίνεται αρχικά μια αποτίμηση της μείωσης της απόδοσης των κρυφών μνημών πρώτου επιπέδου όταν εμφανίζονται μόνιμα σφάλματα για διαφορετικές τεχνολογίες ολοκλήρωσης. Στη συνέχεια παρουσιάζεται μια νέα τεχνική αντιμετώπισης της επίδρασης των σφαλμάτων, η οποία βασίζεται στη πρόβλεψη της χωρικής τοπικότητας των μπλοκ μνήμης που εισάγονται στις κρυφές μνήμες πρώτου επιπέδου. Η αξιολόγηση της εν λόγω τεχνικής γίνεται με τη χρήση ενός εξομοιωτή σε επίπεδο αρχιτεκτονικής. / In this thesis we will work on the problem of reliability of first-level data and instruction cache memories. Technology scaling improvement is affecting the reliability of ICs due to increases in static and dynamic variations as well as wear out failures. First of all, in this work we try to estimate the impact of permanent faults in first level faulty caches. Then we propose a methodology to mitigate this negative impact of defective bits. Out methodology based on prediction of spatial locality of the incoming blocks to cache memory. Finally using cycle accurate simulation we showcase that our approach is able to offer significant benefits in cache performance.
|
4 |
Διαχείριση κρυφής μνήμης επεξεργαστών με πρόβλεψηΣπηλιωτακάρας, Αθανάσιος 11 May 2010 (has links)
Στον διαρκώς μεταβαλλόμενο τομέα της αρχιτεκτονικής των υπολογιστών, τα τελευταία 30 τουλάχιστον χρόνια οι αλλαγές έρχονται με εκθετικό ρυθμό. Οι κρυφές μνήμες αποτελούν πλέον το κέντρο του ενδιαφέροντος, αφού οι επεξεργαστές γίνονται ολοένα και ταχύτεροι, ολοένα και αποδοτικότεροι, αλλά τα κυκλώματα μνήμης αδυνατούν να τους ακολουθήσουν. Το επιστημονικό αυτό πεδίο στρέφεται πλέον σε έξυπνες λύσεις που έχουν ως στόχο την μείωση του κόστους επικοινωνίας μεταξύ των δύο υποσυστημάτων. Οι τρόποι διαχείρισης της κρυφής μνήμης αποτελούν έκφανση της πραγματικότητας αυτής και ένα από τα βασικότερα μέρη της είναι οι αλγόριθμοι αντικατάστασης.
Η μελέτη εστιάζει στη σχέση ανάμεσα σε δύο, ήδη εφαρμοσμένων, νέων πολιτικών αντικατάστασης, καθώς και το βαθμό στον οποίο μπορεί να υπάρξει συγχώνευση τους σε μία καινούργια. Οι νέοι αλγόριθμοι που μελετάμε είναι ο αλγόριθμος αντικατάστασης IbRdPrediction (Instruction-based Reuse-Distance Prediction – Πρόβλεψης απόστασης επαναχρησιμοποίησης βασισμένης σε εντολή) και ο αλγόριθμος MLP-Aware (Memory level parallelism aware – επίγνωσης επιπέδου παραλληλισμού μνήμης). Εξετάζουμε κατά πόσο είναι δυνατόν να δημιουργηθεί ένας νέος μηχανισμός πρόβλεψης βασισμένος σε εντολη (instruction-based) που να λαμβάνει υπόψιν του τα χαρακτηριστικά του παραλληλισμού επιπέδου μνήμης (MLP) και κατα πόσο βελτιώνει τις ήδη υπάρχουσες τεχνικές ως προς την απόδοση του συστήματος. / In the continiously altering field of computer architecture, changes occur with exponential rate the last 30 years. Cache memories have become the pole of interest, as processors are growing all faster, all efficient, but memory circuits fail to follow them. The scientific community is now turning to clever solutions which aim to limit the two subsytem communication cost. Cache management consists the expression of this reality, and one of its most basic parts is cache replacement algorithms.
The thesis focuses on the relation between two, already applied, recent replacement policies, and the degree in which their coalescence in a new policy can exist. We study the IbRdPrediction (Instruction-based Reuse-Distance Prediction) replacement algorithm and the MLP-Aware (Memory level parallelism aware) replacement algorithm. We thoroughly examine if it is possible to create a novel prediction mecahnism, based on instruction, that takes into account the MLP ((Memory level parallelism) characteristics, and how much it improves the existing techniques concerning system performance.
|
5 |
Συμπίεση με πρόγνωση αποστάσεων επαναχρησιμοποίησης σε κρυφές μνήμες δευτέρου επιπέδουΣταυρόπουλος, Νικόλαος 03 October 2011 (has links)
Η αλματώδης αύξηση της ταχύτητας του επεξεργαστή δημιούργησε ένα χάσμα μεταξύ αυτού και της κύριας μνήμης. Η αρχιτεκτονική υπολογιστών καλείται να δώσει λύση στο πρόβλημα αυτό εφαρμόζοντας νέες τεχνικές στην ιεραρχία μνημών. Να αποκρύψει δηλαδή αυτή την καθυστέρηση έχοντας όμως περιορισμούς στην σχεδίαση ως προς τον χώρο και την κατανάλωση. Για τον λόγο αυτό προτείνουμε μια νέα τεχνική που συνδυάζει συμπίεση και πρόγνωση αποστάσεων επαναχρησιμοποίησης. Η συμπίεση αυξάνει την αποθηκευτική δυνατότητα της μνήμης και η πρόγνωση αποστάσεων επαναχρησιμοποίησης βοηθά στην σωστή επιλογή του μπλοκ προς συμπίεση.
Η διπλωματική εργασία έχει ως στόχο την διερεύνηση του μοντέλου συμπίεσης με αλγόριθμο (FPC) και πρόγνωσης βάση εντολής αποστάσεων επαναχρησιμοποίησης (IbRDP) σε κρυφές μνήμες δευτέρου επιπέδου, ως προς την βελτιστοποίηση που μπορεί να επιφέρει στην ταχύτητα εκτέλεσης των προγραμμάτων καθώς και σε άλλες παραμέτρους.
Διερευνήθηκαν διάφορα μοντέλα και στο βέλτιστο μοντέλο επετεύχθησαν σημαντικές αυξήσεις στην ταχύτητα εκτέλεσης των μετροπρογραμμάτων (16% αύξηση γεωμετρικού μέσου IPC στο 1ΜΒ) ενώ μόνο ένα μετροπρόγραμμα παρουσίασε έντονη μείωση της τάξης του 17 %. / the gap of speed between CPU and main memory is a problem than need to be solved by proposing new techniques on cache hierarchies, so the delay of fetching data from the main memory will be eliminated.
We propose a new techinque of compression and reuse distance prediction. This compression will increase the capacity of L2 cache memory and the reuse distance prediction will find the most appropriate block to compress
The thesis aims to search the combinational model of compression (FPC) and Reuse distance Predictor (IbRDP)on L2 cache memories.
Several models have been simulated and the optimal model had increased execution speed of benchmarks (16% improvement in geometric mean IPC 1MB) while only one bencmark reduced its execution speed by 17%.
|
6 |
Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose ProcessorsJanuary 2018 (has links)
abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions.
Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%.
Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications.
Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future.
In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2018
|
7 |
Increasing energy efficiency of processor caches via line usage predictors / Aumentando a eficiência energética da memória cache de processadores através de preditores de uso de linhas da cacheAlves, Marco Antonio Zanata January 2014 (has links)
O consumo de energia se torna cada vez mais importante para a arquitetura de processadores, onde o número de cores dentro de um mesmo chip está aumentando mas o total de energia disponível se mantém no mesmo nível ou até mesmo se reduz. Assim, técnicas para economizar energia, tais como opções de escala de frequência e desligamento automático de subsistemas, estão sendo usadas para manter a troca entre energia e desempenho. Para se obter alto desempenho, os atuais Chip Multiprocessors (CMPs) integram grandes memórias cache a fim de reduzir a latência média para acesso a memória principal, através da alocação do conjunto de dados da aplicação dentro do chip. Essas memórias cache tem sido projetadas tradicionalmente para explorar a localidade temporal usando políticas de substituição inteligentes e localidade espacial buscando todos os dados da linha da cache após uma falta de dados. Entretanto, estudos recentes mostraram que o número de sub-blocos dentro da linha da memória cache, que são realmente usados, costuma ser baixo, sendo que, os sub-blocos que são usados recebem poucos acessos antes de se tornarem mortos (isto é, nunca mais são acessados). Além disso, muitas da linhas da memória cache permanecem ligadas por longos períodos de tempo, mesmo que os dados não sejam usados novamente ou são inválidos. Para linhas de cache modificadas, a memória cache aguarda até que a linha seja expulsa para que esta seja gravada (write-back) de volta no próximo nível de memória. Essas escritas competem com as requisições de leitura (demanda do processador e prébusca da cache), aumentando a pressão no controlador de memória. Por essas razões, a eficiência energética e o desempenho das memórias cache não são ideais. Essa tese propõe a aplicação de preditores de uso de linhas da cache para aumentar a eficiência energética das memórias cache. São propostos os mecanismos Dead Sub-Block Predictor (DSBP) e Dead Line and Early Write-Back Predictor (DEWP) para permitir economia de energia sem que haja degradação do desempenho. DSBP é usado para prever quais sub-blocos da linha da cache serão usados e quantas vezes eles serão acessados de forma a trazer para a cache apenas os sub-blocos úteis e desliga-los após eles serem acessados pelo número de vezes previsto. DEWP prevê linhas de cache mortas assim que elas recebem o último acesso, desligando essas linhas. As linhas sujas são escalonadas para sofrerem write-back após a última operação de escrita, aumentando o potencial de salvar energia, reduzindo também a pressão no controlador de memória. Ambos os mecanismos propostos também reduzem a poluição nas memórias cache, dando prioridade para a expulsão de linhas mortas, melhorando as atuais políticas de substituição. Embora cada mecanismo apresentado seja capaz de funcionar separadamente dentro do sistema, ambos os mecanismos podem também ser misturados em uma mesma hierarquia de cache. Essa implementação mista é interessante pois a granularidade de sub-bloco é preferível para níveis de cache próximos do processador, onde as linhas de memória cache são expulsas rapidamente, enquanto o último nível de cache tende a usar toda a linha antes da sua expulsão. Com o intuito de avaliar os mecanismos propostos, é apresentado o Simulator of Non- Uniform Cache Architectures (SiNUCA). Esse simulador de microarquitetura com precisão de ciclos é validado em termos de desempenho e consumo de energia através da comparação com um processador real. Os resultados de desempenho foram obtidos executando aplicações das cargas de trabalho single-threaded do conjunto SPEC-CPU2006 e aplicações multi-threaded dos conjuntos SPEC-OMP2001 e NAS-NPB. Os resultados relativos a energia foram obtidos integrando o SiNUCA com as ferramentas de modelagem Multi-core Power, Area, and Timing (McPAT) e CACTI. Quando aplicados os mecanismos em todos os níveis de memória cache, observou-se em média uma redução de 36% no consumo de energia usando o DSBP, 25% usando o DEWP e 37% quando usou-se o DSBP nos níveis L1 e L2 e o DEWP no último nível. Todas essas reduções causaram uma perda desprezível de desempenho de menos de 4% em média. / Energy consumption is becoming more important for processor architectures, where the number of cores inside the chip is increasing and the total power budget is kept at the same level or even reduced. Thus, energy saving techniques such as frequency scaling options and automatic shutdown of sub-systems are being used to maintain the trade-off between power and performance. To deliver high performance, current Chip Multiprocessors (CMPs) integrate large caches in order to reduce the average memory access latency by allocating the applications’ working set on-chip. These cache memories have traditionally been designed to exploit temporal locality by using smart replacement policies, and spatial locality by fetching entire cache lines from memory on a cache miss. However, recent studies have shown that the number of sub-blocks within a line that are actually used is often low, and those sub-blocks that are used are accessed only a few times before becoming dead (that is, never accessed again). Additionally, many of the cache lines remain powered for a long period of time even if the data is not used again, or is invalid. For modified cache lines, the cache memory waits until the line is evicted to perform the write-back to next memory level. These write-backs compete with read requests (processor demand and cache prefetch), increasing the pressure on the memory controller. For these reasons, the energy efficiency and performance of cache memories are not ideal. This thesis introduces cache line usage predictors to increase the energy efficiency of cache memories. We propose the Dead Sub-Block Predictor (DSBP) and Dead Line and Early Write-Back Predictor (DEWP) mechanisms to enable energy savings without performance degradation. DSBP is used to predict which sub-blocks of a cache line will be actually accessed and how many times they will be used in order to bring into the cache only those sub-blocks that are necessary, and power them off after they are accessed the predicted number of times. DEWP predicts dead lines as soon as they receive the last access, and turns off these lines. Dirty lines are scheduled for write-back after the last write operation occurs, increasing the energy savings potential and also reducing the pressure on the memory controller. Both proposed mechanisms also reduce pollution in cache memories by prioritizing dead lines for eviction in the existing replacement policy. Although each introduced mechanism is capable of performing separately inside a system, both mechanisms can also be mixed in the same cache hierarchy. This mixed implementation is interesting because the sub-block granularity is more suitable for cache levels closer to the processor, where the cache lines are quickly evicted, while the Last- Level Cache (LLC) tends to use the whole cache line before its eviction. In order to evaluate our proposed mechanisms, we introduce the Simulator of Non- Uniform Cache Architectures (SiNUCA). This cycle-accurate microarchitecture simulator is validated in terms of performance and energy consumption by comparing it to a real processor. Our performance results were obtained executing single-threaded applications from SPEC-CPU2006 and multi-threaded applications from SPEC-OMP2001 and NASNPB benchmark suites. The energy related results were obtained by integrating SiNUCA with the Multi-core Power, Area, and Timing (McPAT) framework and the CACTI power modeling tool. When applying our mechanisms on all the cache levels, we observe on average a 36% energy reduction for DSBP, 25% energy reduction using DEWP and an average reduction of 37% in the energy consumption applying DSBP on L1 and L2 and DEWP on the LLC. All these reductions caused a negligible performance loss of less than 4% on average.
|
8 |
Increasing energy efficiency of processor caches via line usage predictors / Aumentando a eficiência energética da memória cache de processadores através de preditores de uso de linhas da cacheAlves, Marco Antonio Zanata January 2014 (has links)
O consumo de energia se torna cada vez mais importante para a arquitetura de processadores, onde o número de cores dentro de um mesmo chip está aumentando mas o total de energia disponível se mantém no mesmo nível ou até mesmo se reduz. Assim, técnicas para economizar energia, tais como opções de escala de frequência e desligamento automático de subsistemas, estão sendo usadas para manter a troca entre energia e desempenho. Para se obter alto desempenho, os atuais Chip Multiprocessors (CMPs) integram grandes memórias cache a fim de reduzir a latência média para acesso a memória principal, através da alocação do conjunto de dados da aplicação dentro do chip. Essas memórias cache tem sido projetadas tradicionalmente para explorar a localidade temporal usando políticas de substituição inteligentes e localidade espacial buscando todos os dados da linha da cache após uma falta de dados. Entretanto, estudos recentes mostraram que o número de sub-blocos dentro da linha da memória cache, que são realmente usados, costuma ser baixo, sendo que, os sub-blocos que são usados recebem poucos acessos antes de se tornarem mortos (isto é, nunca mais são acessados). Além disso, muitas da linhas da memória cache permanecem ligadas por longos períodos de tempo, mesmo que os dados não sejam usados novamente ou são inválidos. Para linhas de cache modificadas, a memória cache aguarda até que a linha seja expulsa para que esta seja gravada (write-back) de volta no próximo nível de memória. Essas escritas competem com as requisições de leitura (demanda do processador e prébusca da cache), aumentando a pressão no controlador de memória. Por essas razões, a eficiência energética e o desempenho das memórias cache não são ideais. Essa tese propõe a aplicação de preditores de uso de linhas da cache para aumentar a eficiência energética das memórias cache. São propostos os mecanismos Dead Sub-Block Predictor (DSBP) e Dead Line and Early Write-Back Predictor (DEWP) para permitir economia de energia sem que haja degradação do desempenho. DSBP é usado para prever quais sub-blocos da linha da cache serão usados e quantas vezes eles serão acessados de forma a trazer para a cache apenas os sub-blocos úteis e desliga-los após eles serem acessados pelo número de vezes previsto. DEWP prevê linhas de cache mortas assim que elas recebem o último acesso, desligando essas linhas. As linhas sujas são escalonadas para sofrerem write-back após a última operação de escrita, aumentando o potencial de salvar energia, reduzindo também a pressão no controlador de memória. Ambos os mecanismos propostos também reduzem a poluição nas memórias cache, dando prioridade para a expulsão de linhas mortas, melhorando as atuais políticas de substituição. Embora cada mecanismo apresentado seja capaz de funcionar separadamente dentro do sistema, ambos os mecanismos podem também ser misturados em uma mesma hierarquia de cache. Essa implementação mista é interessante pois a granularidade de sub-bloco é preferível para níveis de cache próximos do processador, onde as linhas de memória cache são expulsas rapidamente, enquanto o último nível de cache tende a usar toda a linha antes da sua expulsão. Com o intuito de avaliar os mecanismos propostos, é apresentado o Simulator of Non- Uniform Cache Architectures (SiNUCA). Esse simulador de microarquitetura com precisão de ciclos é validado em termos de desempenho e consumo de energia através da comparação com um processador real. Os resultados de desempenho foram obtidos executando aplicações das cargas de trabalho single-threaded do conjunto SPEC-CPU2006 e aplicações multi-threaded dos conjuntos SPEC-OMP2001 e NAS-NPB. Os resultados relativos a energia foram obtidos integrando o SiNUCA com as ferramentas de modelagem Multi-core Power, Area, and Timing (McPAT) e CACTI. Quando aplicados os mecanismos em todos os níveis de memória cache, observou-se em média uma redução de 36% no consumo de energia usando o DSBP, 25% usando o DEWP e 37% quando usou-se o DSBP nos níveis L1 e L2 e o DEWP no último nível. Todas essas reduções causaram uma perda desprezível de desempenho de menos de 4% em média. / Energy consumption is becoming more important for processor architectures, where the number of cores inside the chip is increasing and the total power budget is kept at the same level or even reduced. Thus, energy saving techniques such as frequency scaling options and automatic shutdown of sub-systems are being used to maintain the trade-off between power and performance. To deliver high performance, current Chip Multiprocessors (CMPs) integrate large caches in order to reduce the average memory access latency by allocating the applications’ working set on-chip. These cache memories have traditionally been designed to exploit temporal locality by using smart replacement policies, and spatial locality by fetching entire cache lines from memory on a cache miss. However, recent studies have shown that the number of sub-blocks within a line that are actually used is often low, and those sub-blocks that are used are accessed only a few times before becoming dead (that is, never accessed again). Additionally, many of the cache lines remain powered for a long period of time even if the data is not used again, or is invalid. For modified cache lines, the cache memory waits until the line is evicted to perform the write-back to next memory level. These write-backs compete with read requests (processor demand and cache prefetch), increasing the pressure on the memory controller. For these reasons, the energy efficiency and performance of cache memories are not ideal. This thesis introduces cache line usage predictors to increase the energy efficiency of cache memories. We propose the Dead Sub-Block Predictor (DSBP) and Dead Line and Early Write-Back Predictor (DEWP) mechanisms to enable energy savings without performance degradation. DSBP is used to predict which sub-blocks of a cache line will be actually accessed and how many times they will be used in order to bring into the cache only those sub-blocks that are necessary, and power them off after they are accessed the predicted number of times. DEWP predicts dead lines as soon as they receive the last access, and turns off these lines. Dirty lines are scheduled for write-back after the last write operation occurs, increasing the energy savings potential and also reducing the pressure on the memory controller. Both proposed mechanisms also reduce pollution in cache memories by prioritizing dead lines for eviction in the existing replacement policy. Although each introduced mechanism is capable of performing separately inside a system, both mechanisms can also be mixed in the same cache hierarchy. This mixed implementation is interesting because the sub-block granularity is more suitable for cache levels closer to the processor, where the cache lines are quickly evicted, while the Last- Level Cache (LLC) tends to use the whole cache line before its eviction. In order to evaluate our proposed mechanisms, we introduce the Simulator of Non- Uniform Cache Architectures (SiNUCA). This cycle-accurate microarchitecture simulator is validated in terms of performance and energy consumption by comparing it to a real processor. Our performance results were obtained executing single-threaded applications from SPEC-CPU2006 and multi-threaded applications from SPEC-OMP2001 and NASNPB benchmark suites. The energy related results were obtained by integrating SiNUCA with the Multi-core Power, Area, and Timing (McPAT) framework and the CACTI power modeling tool. When applying our mechanisms on all the cache levels, we observe on average a 36% energy reduction for DSBP, 25% energy reduction using DEWP and an average reduction of 37% in the energy consumption applying DSBP on L1 and L2 and DEWP on the LLC. All these reductions caused a negligible performance loss of less than 4% on average.
|
9 |
Increasing energy efficiency of processor caches via line usage predictors / Aumentando a eficiência energética da memória cache de processadores através de preditores de uso de linhas da cacheAlves, Marco Antonio Zanata January 2014 (has links)
O consumo de energia se torna cada vez mais importante para a arquitetura de processadores, onde o número de cores dentro de um mesmo chip está aumentando mas o total de energia disponível se mantém no mesmo nível ou até mesmo se reduz. Assim, técnicas para economizar energia, tais como opções de escala de frequência e desligamento automático de subsistemas, estão sendo usadas para manter a troca entre energia e desempenho. Para se obter alto desempenho, os atuais Chip Multiprocessors (CMPs) integram grandes memórias cache a fim de reduzir a latência média para acesso a memória principal, através da alocação do conjunto de dados da aplicação dentro do chip. Essas memórias cache tem sido projetadas tradicionalmente para explorar a localidade temporal usando políticas de substituição inteligentes e localidade espacial buscando todos os dados da linha da cache após uma falta de dados. Entretanto, estudos recentes mostraram que o número de sub-blocos dentro da linha da memória cache, que são realmente usados, costuma ser baixo, sendo que, os sub-blocos que são usados recebem poucos acessos antes de se tornarem mortos (isto é, nunca mais são acessados). Além disso, muitas da linhas da memória cache permanecem ligadas por longos períodos de tempo, mesmo que os dados não sejam usados novamente ou são inválidos. Para linhas de cache modificadas, a memória cache aguarda até que a linha seja expulsa para que esta seja gravada (write-back) de volta no próximo nível de memória. Essas escritas competem com as requisições de leitura (demanda do processador e prébusca da cache), aumentando a pressão no controlador de memória. Por essas razões, a eficiência energética e o desempenho das memórias cache não são ideais. Essa tese propõe a aplicação de preditores de uso de linhas da cache para aumentar a eficiência energética das memórias cache. São propostos os mecanismos Dead Sub-Block Predictor (DSBP) e Dead Line and Early Write-Back Predictor (DEWP) para permitir economia de energia sem que haja degradação do desempenho. DSBP é usado para prever quais sub-blocos da linha da cache serão usados e quantas vezes eles serão acessados de forma a trazer para a cache apenas os sub-blocos úteis e desliga-los após eles serem acessados pelo número de vezes previsto. DEWP prevê linhas de cache mortas assim que elas recebem o último acesso, desligando essas linhas. As linhas sujas são escalonadas para sofrerem write-back após a última operação de escrita, aumentando o potencial de salvar energia, reduzindo também a pressão no controlador de memória. Ambos os mecanismos propostos também reduzem a poluição nas memórias cache, dando prioridade para a expulsão de linhas mortas, melhorando as atuais políticas de substituição. Embora cada mecanismo apresentado seja capaz de funcionar separadamente dentro do sistema, ambos os mecanismos podem também ser misturados em uma mesma hierarquia de cache. Essa implementação mista é interessante pois a granularidade de sub-bloco é preferível para níveis de cache próximos do processador, onde as linhas de memória cache são expulsas rapidamente, enquanto o último nível de cache tende a usar toda a linha antes da sua expulsão. Com o intuito de avaliar os mecanismos propostos, é apresentado o Simulator of Non- Uniform Cache Architectures (SiNUCA). Esse simulador de microarquitetura com precisão de ciclos é validado em termos de desempenho e consumo de energia através da comparação com um processador real. Os resultados de desempenho foram obtidos executando aplicações das cargas de trabalho single-threaded do conjunto SPEC-CPU2006 e aplicações multi-threaded dos conjuntos SPEC-OMP2001 e NAS-NPB. Os resultados relativos a energia foram obtidos integrando o SiNUCA com as ferramentas de modelagem Multi-core Power, Area, and Timing (McPAT) e CACTI. Quando aplicados os mecanismos em todos os níveis de memória cache, observou-se em média uma redução de 36% no consumo de energia usando o DSBP, 25% usando o DEWP e 37% quando usou-se o DSBP nos níveis L1 e L2 e o DEWP no último nível. Todas essas reduções causaram uma perda desprezível de desempenho de menos de 4% em média. / Energy consumption is becoming more important for processor architectures, where the number of cores inside the chip is increasing and the total power budget is kept at the same level or even reduced. Thus, energy saving techniques such as frequency scaling options and automatic shutdown of sub-systems are being used to maintain the trade-off between power and performance. To deliver high performance, current Chip Multiprocessors (CMPs) integrate large caches in order to reduce the average memory access latency by allocating the applications’ working set on-chip. These cache memories have traditionally been designed to exploit temporal locality by using smart replacement policies, and spatial locality by fetching entire cache lines from memory on a cache miss. However, recent studies have shown that the number of sub-blocks within a line that are actually used is often low, and those sub-blocks that are used are accessed only a few times before becoming dead (that is, never accessed again). Additionally, many of the cache lines remain powered for a long period of time even if the data is not used again, or is invalid. For modified cache lines, the cache memory waits until the line is evicted to perform the write-back to next memory level. These write-backs compete with read requests (processor demand and cache prefetch), increasing the pressure on the memory controller. For these reasons, the energy efficiency and performance of cache memories are not ideal. This thesis introduces cache line usage predictors to increase the energy efficiency of cache memories. We propose the Dead Sub-Block Predictor (DSBP) and Dead Line and Early Write-Back Predictor (DEWP) mechanisms to enable energy savings without performance degradation. DSBP is used to predict which sub-blocks of a cache line will be actually accessed and how many times they will be used in order to bring into the cache only those sub-blocks that are necessary, and power them off after they are accessed the predicted number of times. DEWP predicts dead lines as soon as they receive the last access, and turns off these lines. Dirty lines are scheduled for write-back after the last write operation occurs, increasing the energy savings potential and also reducing the pressure on the memory controller. Both proposed mechanisms also reduce pollution in cache memories by prioritizing dead lines for eviction in the existing replacement policy. Although each introduced mechanism is capable of performing separately inside a system, both mechanisms can also be mixed in the same cache hierarchy. This mixed implementation is interesting because the sub-block granularity is more suitable for cache levels closer to the processor, where the cache lines are quickly evicted, while the Last- Level Cache (LLC) tends to use the whole cache line before its eviction. In order to evaluate our proposed mechanisms, we introduce the Simulator of Non- Uniform Cache Architectures (SiNUCA). This cycle-accurate microarchitecture simulator is validated in terms of performance and energy consumption by comparing it to a real processor. Our performance results were obtained executing single-threaded applications from SPEC-CPU2006 and multi-threaded applications from SPEC-OMP2001 and NASNPB benchmark suites. The energy related results were obtained by integrating SiNUCA with the Multi-core Power, Area, and Timing (McPAT) framework and the CACTI power modeling tool. When applying our mechanisms on all the cache levels, we observe on average a 36% energy reduction for DSBP, 25% energy reduction using DEWP and an average reduction of 37% in the energy consumption applying DSBP on L1 and L2 and DEWP on the LLC. All these reductions caused a negligible performance loss of less than 4% on average.
|
10 |
Διαχείριση κοινών πόρων σε πολυπύρηνους επεξεργαστέςΑλεξανδρής, Φωκίων 27 June 2012 (has links)
Οι σύγχρονες τάσεις της Επιστήμης Σχεδιασμού των Υπολογιστικών Συστημάτων έχουν υιοθετήσει την χρήση των Κρυφών Μνημών ή Μνημών Cache, αποβλέποντας στην απόκρυψη της Καθυστέρησης της Κύριας Μνήμης των Συστημάτων (Memory Latency) και την γεφύρωση του χάσματος της απόδοσης του Επεξεργαστή και της Κύριας Μνήμης (Processor – Memory Performance Gap). Οι Μνήμες Cache έτσι έχουν αποκτήσει αδιαμφισβήτητα πρωτεύοντα ρόλο στην Ιεραρχία Μνήμης των Ηλεκτρονικών Υπολογιστών.
Οι νέες τάσεις Σχεδιασμού ανέδειξαν την Έννοια του Παραλληλισμού σε πρωτεύοντα ρόλο. Αρχικά διερευνήθηκε ο Παραλληλισμός Επιπέδου Εντολών, ωστόσο η αύξηση της Απόδοσης των Υπολογιστών σύντομα έφτασε ένα μέγιστο. Την τελευταία δεκαετία το κέντρο του ενδιαφέροντος των σχεδιαστών έχει και πάλι μετατοπιστεί, καθώς ένας νέος τύπος Επεξεργαστών έχει εισέλθει στο προσκήνιο, οι Πολυπύρηνοι Επεξεργαστές, ή όπως είναι αλλιώς γνωστοί on-chip Multiprocessors (CMP). Αυτές οι εξελίξεις, σε συνδυασμό με την ολοένα αυξανόμενη πολυπλοκότητα της “συμπεριφοράς” των εκτελούμενων Εφαρμογών, ώθησαν το σχεδιαστικό ενδιαφέρον προς την εκμετάλλευση ενός νεοσύστατου τύπου Παραλληλισμού. Ο Παραλληλισμός Επιπέδου Μνήμης ή Memory Level Parallelism (MLP) αποτελεί τα τελευταία χρόνια, το πλέον ισχυρό μέσο αύξησης της απόδοσης των Υπολογιστικών Συστημάτων και μαζί με τους Πολυπύρηνους Επεξεργαστές θα κυριαρχήσει στο προσκήνιο των εξελίξεων τα επόμενα χρόνια.
Σκοπός της παρούσας Διπλωματικής Εργασίας είναι η ανάπτυξη ενός Στατιστικού – Πιθανοτικού Μοντέλου για μελέτη και πρόβλεψη των φαινομένων που αναπτύσσονται σε Μνήμες Cache, στις οποίες αποθηκεύονται δεδομένα από εκτελούμενες Εφαρμογές, με έντονο Παραλληλισμό Επιπέδου Μνήμης. Θα οριστεί ένας Εκτιμητής του Φόρτου που επιβάλλεται στο Σύστημα, από φαινόμενα Παραλληλισμού Επιπέδου Μνήμης (MLP). Στην συνέχεια, με βάση το Μοντέλο που αναπτύσσουμε, θα διερευνηθεί ένα ικανοποιητικό σύνολο Εφαρμογών, και θα εξαχθεί μια Εκτίμηση – Πρόβλεψη για τον Φόρτο (MLP) του Συστήματος. Εφόσον οι Προβλέψεις μας κριθούν επιτυχής, το Μοντέλο Πρόβλεψης Φόρτου MLP που αναπτύξαμε, μπορεί να αποτελέσει χρήσιμο Εργαλείο στα χέρια των Σχεδιαστών που ασχολούνται με την αύξηση της Απόδοσης των Σύγχρονων Υπολογιστικών Συστημάτων. / -
|
Page generated in 0.0514 seconds