1

Mechanisms to improve the efficiency of hardware data prefetchers

Díaz, Pedro January 2011 (has links)
A well-known performance bottleneck in computer architecture is the so-called memory wall: the huge disparity between on-chip and off-chip access latencies. Historically, the operating frequency of processors has increased at a steady pace, while most past advances in memory technology have been in density, not speed. Nowadays, the trend of ever-increasing processor operating frequencies has been replaced by an increasing number of CPU cores per chip. This will continue to exacerbate the memory wall problem, as several cores now have to compete for off-chip data access. As multi-core systems pack more and more cores, the access latency observed by each core is expected to keep increasing. Although the causes of the memory wall have changed, it is, and will continue to be in the near future, a very significant challenge in computer architecture design.

Prefetching has been an important technique for amortizing the effect of the memory wall. With prefetching, data or instructions that are expected to be used in the near future are speculatively moved up in the memory hierarchy, where the access latency is smaller. This dissertation focuses on hardware data prefetching at the last cache level before memory (last-level cache, LLC). Prefetching at the LLC usually offers the best performance increase, as this is where the disparity between hit and miss latencies is largest. Hardware prefetchers operate by examining the miss address stream generated by the cache and identifying patterns and correlations between the misses. Most prefetchers divide the global miss stream into several sub-streams according to some pre-specified criteria, a process known as localization. The benefits of localization are well established: it increases the accuracy of the predictions and helps filter out spurious, non-predictable misses. However, localization has one important drawback: since the misses are classified into different sub-streams, important chronological information is lost. As a consequence, most localizing prefetchers issue prefetches in an untimely manner, fetching data too far in advance and thereby polluting the cache.

The first part of this thesis proposes a new class of prefetchers based on the novel concept of Stream Chaining. With Stream Chaining, the prefetcher tries to reconstruct the chronological information lost in the process of localization, while at the same time keeping its benefits. We describe two novel Stream Chaining prefetching algorithms based on two state-of-the-art localizing prefetchers, PC/DC and C/DC, and show that both issue prefetches in a more timely manner than their non-chaining counterparts, increasing performance by as much as 55% (10% on average) on a suite of sequential benchmarks while consuming roughly the same amount of memory bandwidth.

To hide the effects of the memory wall, hardware prefetchers are usually configured to prefetch as much data as possible. However, a highly aggressive prefetcher can harm performance: prefetching accuracy, cache pollution, and memory bandwidth consumption all have to be taken into account. This is especially important in multi-core systems, where each core typically has its own prefetching engine and competition for memory access is high.
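As a concrete illustration of the localization described above, the following is a minimal Python sketch of a PC/DC-style localizing prefetcher; the class name, history length, and prefetch degree are illustrative assumptions, not the dissertation's parameters.

    from collections import defaultdict, deque

    class PCDCSketch:
        """Toy PC/DC-style localizing prefetcher: the global miss stream is
        split into per-PC sub-streams, and deltas between consecutive miss
        addresses within a sub-stream are correlated to predict the next
        addresses to prefetch."""

        def __init__(self, history=8, degree=2):
            self.streams = defaultdict(lambda: deque(maxlen=history))
            self.degree = degree

        def on_miss(self, pc, address):
            stream = self.streams[pc]          # localization: one stream per PC
            stream.append(address)
            if len(stream) < 3:
                return []
            addrs = list(stream)
            deltas = [b - a for a, b in zip(addrs, addrs[1:])]
            last = deltas[-1]
            # Delta correlation: find an earlier occurrence of the latest
            # delta and replay the deltas that followed it as predictions.
            prefetches, addr = [], address
            for i in range(len(deltas) - 1):
                if deltas[i] == last:
                    for d in deltas[i + 1:i + 1 + self.degree]:
                        addr += d
                        prefetches.append(addr)
                    break
            return prefetches

For a constant-stride stream this degenerates to classic stride prefetching; the chaining prefetchers proposed in the thesis additionally link the sub-streams so that the global chronological order of misses can be approximated at prefetch time.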
Several prefetch throttling and filtering mechanisms have been proposed to maximize the effect of prefetching in multi-core systems. The general strategy behind these heuristics is to promote prefetches that are more likely to be used and cause less interference. Traditionally, these methods operate at the source level, i.e., directly on the prefetch engine they are assigned to control. In multi-core systems, all prefetches are aggregated in a FIFO-like data structure called the Prefetch Request Queue (PRQ), where they wait to be dispatched to memory. The second part of this thesis shows that a traditional FIFO PRQ does not promote timely prefetching and usually hinders part of the performance benefits achieved by throttling heuristics. We propose a novel approach to prefetch aggressiveness control in multi-cores that performs throttling at the PRQ (i.e., global) level, using knowledge of the metrics of all prefetchers and of the global state of the PRQ. To do this, we introduce the Resizable Prefetching Heap (RPH), a data structure modeled after a binary heap that promotes timely dispatch of prefetches as well as fairness in the distribution of prefetching bandwidth. The RPH is designed as a drop-in replacement for traditional FIFO PRQs. We compare our proposal against a state-of-the-art source-level throttling algorithm (HPAC) in an 8-core system. Unlike previous research, we evaluate both multiprogrammed and multithreaded (parallel) workloads, using a modern prefetching algorithm (C/DC). Our experimental results show that RPH-based throttling increases the performance benefits obtained by HPAC by as much as 148% (53.8% on average) in multiprogrammed workloads and as much as 237% (22.5% on average) in parallel benchmarks, while consuming roughly the same amount of memory bandwidth. When comparing the speedup over fixed-degree prefetching, RPH increases the average speedup of HPAC from 7.1% to 10.9% in multiprogrammed workloads, and from 5.1% to 7.9% in parallel benchmarks.
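A minimal sketch of the heap-based PRQ idea follows, assuming dispatch priority is derived from each prefetcher's measured accuracy; the real RPH's priority function, resizing policy, and fairness mechanism are not specified here, and all names are illustrative.

    import heapq, itertools

    class ResizablePrefetchHeapSketch:
        """Drop-in-style replacement for a FIFO prefetch request queue:
        requests are dispatched in priority order (here, by the issuing
        prefetcher's accuracy) instead of strict arrival order."""

        def __init__(self, capacity):
            self.capacity = capacity       # "resizable": can shrink under contention
            self.heap = []
            self.seq = itertools.count()   # tie-breaker keeps FIFO order within a priority

        def push(self, core_id, address, accuracy):
            if len(self.heap) >= self.capacity:
                return False               # full: back-pressure the prefetcher
            # heapq is a min-heap, so negate accuracy to pop the best request first.
            heapq.heappush(self.heap, (-accuracy, next(self.seq), core_id, address))
            return True

        def dispatch(self):
            # Called whenever a memory-channel slot frees up.
            return heapq.heappop(self.heap)[2:] if self.heap else None

Ordering dispatch by a global priority rather than arrival time is what lets timely, accurate prefetches overtake stale ones queued earlier by an over-aggressive core.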
2

A Branch-Directed Data Cache Prefetching Technique for Inorder Processors

Panda, Reena December 2011 (has links)
The increasing gap between processor and main memory speeds has become a serious bottleneck to further improvement in system performance. Data prefetching techniques have been proposed to hide the performance impact of such long memory latencies, but most currently proposed data prefetchers predict future memory accesses based on current memory misses, which limits the opportunity that can be exploited to guide prefetching. In this thesis, we propose a branch-directed data prefetcher that uses the high prediction accuracy of current-generation branch predictors to predict the future basic-block trace the program will execute, and issues prefetches for all the identified memory instructions contained therein. We also propose a novel technique for generating prefetch addresses by exploiting the correlation between the addresses generated by memory instructions and the values of the corresponding source registers at prior branch instances. We evaluate the impact of our prefetcher using a cycle-accurate simulation of an in-order processor on the M5 simulator. The evaluation shows that the branch-directed prefetcher improves performance on a set of 18 SPEC CPU2006 benchmarks by an average of 38.789% over a no-prefetching implementation and 2.148% over a system that employs a Spatial Memory Streaming prefetcher.
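The register-correlation idea can be illustrated with a toy table; the structure below (one learned offset per branch/load pair) is an assumption for illustration, not the thesis's exact design.

    class BranchDirectedSketch:
        """Toy model of the proposed correlation: learn, per branch, the
        offset between a source-register value observed at the branch and
        the address a downstream load later generates; on the next instance
        of the branch, predict address = current register value + offset."""

        def __init__(self):
            self.table = {}   # branch_pc -> {load_pc: offset}

        def train(self, branch_pc, load_pc, reg_value, load_address):
            self.table.setdefault(branch_pc, {})[load_pc] = load_address - reg_value

        def on_branch(self, branch_pc, reg_value):
            # Prefetch addresses for loads in the predicted basic-block trace.
            return [reg_value + off for off in self.table.get(branch_pc, {}).values()]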
3

"Uso de caches na Web - Influência das políticas de substituição de objetos" / The influence of objects replacement policies in web caches

Oliveira, Jacqueline Augusto de 17 May 2004 (has links)
This work analyses the influence of object replacement policies on Web caches. This is done by investigating the policies found in the literature, through a workload characterization study, a performance evaluation, and a comparison of the policies' usage. The evaluation of the policies uses a Web cache simulator. During the research, a new policy, called MeMoExP, was developed; it uses the concept of a moving exponential average to optimize the HR (hit ratio) and BHR (byte hit ratio) metrics. The simulations show that MeMoExP follows the same tendency as the FBR policy, which the literature regards as efficient. Finally, some considerations about the ideas presented in the dissertation are discussed, along with the contributions of this research and some proposals for future work.
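The abstract gives only the core idea, a moving exponential average used to improve HR and BHR, so the sketch below is a loose interpretation: each object's score is an EMA of its access recency, and the lowest-scoring object is evicted. The scoring formula and every identifier are assumptions, not the actual MeMoExP policy.

    class EmaCacheSketch:
        """Web-cache replacement driven by an exponential moving average:
        each object's score blends its previous score with the current
        access time; on insertion into a full cache, evict lowest score."""

        def __init__(self, capacity_bytes, alpha=0.3):
            self.capacity, self.alpha = capacity_bytes, alpha
            self.used, self.clock = 0, 0
            self.objects = {}   # url -> [size, score]

        def access(self, url, size):
            self.clock += 1
            if url in self.objects:            # hit: EMA update of the score
                entry = self.objects[url]
                entry[1] = self.alpha * self.clock + (1 - self.alpha) * entry[1]
                return True
            while self.used + size > self.capacity and self.objects:
                victim = min(self.objects, key=lambda u: self.objects[u][1])
                self.used -= self.objects[victim][0]
                del self.objects[victim]
            if self.used + size <= self.capacity:   # object may be too big to cache
                self.objects[url] = [size, self.alpha * self.clock]
                self.used += size
            return False

Weighting eviction by object size as well as score would target BHR in addition to HR; the simulator-based comparison against FBR described in the abstract would be the way to tune such a trade-off.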
4

Auto-Determination of Cache/TLB parameters

Kommanaboina, Kishor Yadav 23 August 2013 (has links)
No description available.
5

Active management of Cache resources

Ramaswamy, Subramanian 08 July 2008 (has links)
This dissertation addresses two sets of challenges facing processor design as the industry enters the deep sub-micron region of semiconductor design. The first set of challenges relates to the memory bottleneck: as the focus shifts from scaling processor frequency to scaling the number of cores, performance growth demands increasing die area, and scaling the number of cores places a concurrent area demand in the form of larger caches. While on-chip caches occupy 50-60% of die area and consume 20-30% of the energy expended on-chip, their performance and energy efficiencies are less than 15% and 1%, respectively, for a range of benchmarks. The second set of challenges is posed by transistor leakage and process variation (inter-die and intra-die) at future technology nodes: leakage power is anticipated to increase exponentially and to sharply lower defect-free yield with successive technology generations. For performance scaling to continue, cache efficiencies have to improve significantly, and this thesis proposes and evaluates a broad family of such improvements.

The dissertation first contributes a model for cache efficiencies and finds them to be extremely low: performance efficiencies of less than 15% and energy efficiencies on the order of 1%. Studying the sources of inefficiency leads to a framework for efficiency improvement based on two interrelated strategies. The approach to improving energy efficiency primarily relies on sizing the cache to match the application's memory footprint during a program phase while powering down all remaining cache sets; importantly, the sized cache is fully functional, with no references to inactive sets. Improving performance efficiency primarily relies on cache shaping, i.e., changing the placement function and thereby the manner in which memory shares the cache. Sizing and shaping are applied at different phases of the design cycle: (i) post-manufacturing and offline, (ii) at compile time, and (iii) at run time. This thesis proposes and explores techniques at each phase, collectively realizing a repertoire of techniques for future memory system designers. The techniques combine hardware and software mechanisms and are demonstrated to provide substantive improvements with modest overheads.
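A minimal sketch of the two knobs, assuming sizing restricts lookups to an active prefix of the sets and shaping swaps the bit field used as the set index; the dissertation's actual placement functions and resizing mechanisms are more sophisticated, and the names are illustrative.

    class ResizableShapedCacheSketch:
        """Two orthogonal knobs on a set-indexed cache: "sizing" maps all
        lookups into the first active_sets sets (the rest can be
        power-gated), and "shaping" changes which address bits form the
        set index, altering how memory shares the cache."""

        def __init__(self, total_sets, block_bits=6):
            self.total_sets = total_sets
            self.active_sets = total_sets   # sizing knob
            self.index_shift = block_bits   # shaping knob

        def resize(self, active_sets):
            assert 0 < active_sets <= self.total_sets
            self.active_sets = active_sets  # remaining sets may be powered down

        def reshape(self, index_shift):
            self.index_shift = index_shift  # use a different bit field as the index

        def set_index(self, address):
            # Every lookup lands in the active region, so the sized cache is
            # fully functional with no references to inactive sets.
            return (address >> self.index_shift) % self.active_sets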
6

Analytic cache modelling of numerical programs

Harper, John Stuart January 1999 (has links)
No description available.
7

Operating System Techniques for Reducing Processor State Pollution

Soares, Livio 31 August 2012 (has links)
Application performance on modern processors has become increasingly dictated by the use of on-chip structures such as caches and look-aside buffers. The hierarchical (multi-level) design of processor structures, the ubiquity of multicore processor architectures, and the increasing relative cost of accessing memory have all contributed to this condition. Our thesis is that operating systems should provide services and mechanisms for applications to utilize on-chip processor structures more efficiently. As such, this dissertation demonstrates how the operating system can improve the processor efficiency of applications through specific techniques. Two operating system services are investigated: (1) improving secondary and last-level cache utilization through a run-time cache-filtering technique, and (2) improving the processor efficiency of system-intensive applications through a new exception-less system call mechanism. With the first mechanism, we introduce the concept of a software pollute buffer and show that it can be used effectively at run time, with assistance from commodity hardware performance counters, to reduce pollution of secondary on-chip caches. With the second mechanism, we decouple application and operating system execution, showing the benefits of reduced interference in processor components such as the first-level instruction and data caches, second-level caches, and branch predictor. We show that exception-less system calls are particularly effective on modern multicore processors, and we explore two ways for applications to use them. The first, which is completely transparent to the application, uses multi-threading to hide the asynchronous communication between the operating system kernel and the application. In the second, applications directly use the exception-less system call interface by following an event-driven architecture.
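A hedged user-space sketch of the exception-less idea follows: instead of trapping into the kernel on every call, the application posts syscall requests to a shared queue that a separate worker drains asynchronously. Here an ordinary Python thread stands in for the kernel-side execution of the real design; all names are illustrative.

    import os, threading, queue

    requests = queue.Queue()

    def syscall_worker():
        # Stand-in for kernel-side threads draining the shared request queue.
        while True:
            op, args, done = requests.get()
            done['result'] = op(*args)   # executed without a per-call trap
            done['event'].set()

    def submit(op, *args):
        done = {'event': threading.Event()}
        requests.put((op, args, done))
        return done                      # caller keeps running; no synchronous trap

    threading.Thread(target=syscall_worker, daemon=True).start()

    # Usage: post a write, overlap other work, then collect the result.
    ticket = submit(os.write, 1, b"asynchronous 'system call'\n")
    ticket['event'].wait()

Batching many such requests before the worker runs is what reduces the interference in instruction caches, data caches, and the branch predictor that the abstract describes.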
8

Improving Caches in Consolidated Environments

Koller, Ricardo 24 July 2012 (has links)
Memory (cache, DRAM, and disk) is in charge of providing data and instructions to a computer's processor. To maximize performance, the speeds of the memory and the processor should be equal; however, memory that always matches the speed of the processor is prohibitively expensive. Computer hardware designers have managed to drastically lower the cost of the system through memory caches, sacrificing some performance. A cache is a small piece of fast memory that stores popular data so it can be accessed faster. Modern computers have evolved into a hierarchy of caches, where each memory level is the cache for a larger and slower memory level immediately below it. Thus, by using caches, manufacturers are able to store terabytes of data at the cost of the cheapest memory while achieving speeds close to that of the fastest one.

The most important decision in managing a cache is what data to store in it; failing to make good decisions can lead to performance overheads and over-provisioning. Surprisingly, caches choose data to store based on policies that have not changed in principle for decades, while computing paradigms have changed radically, leading to two noticeably different trends. First, caches are now consolidated across hundreds or even thousands of processes. Second, caching is being employed at new levels of the storage hierarchy due to the availability of high-performance flash-based persistent media. This brings four problems. First, as the workloads sharing a cache increase, it is more likely that they contain duplicated data. Second, consolidation creates contention for caches which, if not managed carefully, translates to wasted space and sub-optimal performance. Third, as contended caches are shared by more workloads, administrators need to carefully estimate specific per-workload requirements across the entire memory hierarchy in order to meet per-workload performance goals. Finally, current cache write policies are unable to simultaneously provide performance and consistency guarantees for the new levels of the storage hierarchy.

We addressed these problems by modeling their impact and by proposing solutions for each of them. First, we measured and modeled the amount of duplication at the buffer cache level and the contention in real production systems. Second, we created a unified model of workload cache usage under contention, to be used by administrators for provisioning or by process schedulers to decide which processes to run together. Third, we proposed methods for removing cache duplication and for eliminating the space wasted by contention. Finally, we proposed a technique to improve the consistency guarantees of write-back caches while preserving their performance benefits.
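As one illustration, removing duplication in a consolidated buffer cache can be sketched by indexing blocks by content hash so that different workloads share a single copy; the structure and names below are illustrative assumptions, not the dissertation's implementation.

    import hashlib

    class DedupBufferCacheSketch:
        """Blocks are stored once, keyed by content hash; (disk, block)
        locations from different consolidated workloads can point at the
        same cached copy, reclaiming the space duplication would waste."""

        def __init__(self):
            self.by_location = {}   # (disk_id, block_no) -> content digest
            self.by_content = {}    # digest -> [data, refcount]

        def insert(self, disk_id, block_no, data):
            digest = hashlib.sha256(data).hexdigest()
            if digest in self.by_content:
                self.by_content[digest][1] += 1    # duplicate: share the copy
            else:
                self.by_content[digest] = [data, 1]
            self.by_location[(disk_id, block_no)] = digest

        def lookup(self, disk_id, block_no):
            digest = self.by_location.get((disk_id, block_no))
            return None if digest is None else self.by_content[digest][0]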
