Global ETD Search

1	PERFORMANCE EVALUATION OF AN ENHANCED POPULARITY-BASED WEB PREFETCHING TECHNIQUE Sharma, Mayank January 2006 (has links) No description available. Computer Science web prefetching proxy server prefetching popularity based prefetching
2	Mechanisms to improve the efficiency of hardware data prefetchers Díaz, Pedro January 2011 (has links) A well-known performance bottleneck in computer architecture is the so-called memory wall. This term refers to the huge disparity between on-chip and off-chip access latencies. Historically speaking, the operating frequency of processors has increased at a steady pace, while most past advances in memory technology have been in density, not speed. Nowadays, the trend for ever increasing processor operating frequencies has been replaced by an increasing number of CPU cores per chip. This will continue to exacerbate the memory wall problem, as several cores now have to compete for off-chip data access. As multi-core systems pack more and more cores, it is expected that the access latency as observed by each core will continue to increase. Although the causes of the memory wall have changed, it is, and will continue to be in the near future, a very significant challenge in terms of computer architecture design. Prefetching has been an important technique to amortize the effect of the memory wall. With prefetching, data or instructions that are expected to be used in the near future are speculatively moved up in the memory hierarchy, where the access latency is smaller. This dissertation focuses on hardware data prefetching at the last cache level before memory (last level cache, LLC). Prefetching at the LLC usually offers the best performance increase, as this is where the disparity between hit and miss latencies is the largest. Hardware prefetchers operate by examining the miss address stream generated by the cache and identifying patterns and correlations between the misses. Most prefetchers divide the global miss stream into several sub-streams, according to some pre-specified criteria. This process is known as localization.
The benefits of localization are well established: it increases the accuracy of the predictions and helps filter out spurious, non-predictable misses. However, localization has one important drawback: since the misses are classified into different sub-streams, important chronological information is lost. A consequence of this is that most localizing prefetchers issue prefetches in an untimely manner, fetching data too far in advance. This behavior promotes data pollution in the cache. The first part of this thesis proposes a new class of prefetchers based on the novel concept of Stream Chaining. With Stream Chaining, the prefetcher tries to reconstruct the chronological information lost in the process of localization, while at the same time keeping its benefits. We describe two novel Stream Chaining prefetching algorithms based on two state-of-the-art localizing prefetchers: PC/DC and C/DC. We show how both prefetchers issue prefetches in a more timely manner than their non-chaining counterparts, increasing performance by as much as 55% (10% on average) on a suite of sequential benchmarks, while consuming roughly the same amount of memory bandwidth. In order to hide the effects of the memory wall, hardware prefetchers are usually configured to aggressively prefetch as much data as possible. However, a highly aggressive prefetcher can have negative effects on performance. Factors such as prefetching accuracy, cache pollution and memory bandwidth consumption have to be taken into account. This is especially important in the context of multi-core systems, where typically each core has its own prefetching engine and there is high competition for accessing memory. Several prefetch throttling and filtering mechanisms have been proposed to maximize the effect of prefetching in multi-core systems. The general strategy behind these heuristics is to promote prefetches that are more likely to be used and cause less interference.
Traditionally, these methods operate at the source level, i.e., directly in the prefetch engine they are assigned to control. In multi-core systems all prefetches are aggregated in a FIFO-like data structure called the Prefetch Request Queue (PRQ), where they wait to be dispatched to memory. The second part of this thesis shows that a traditional FIFO PRQ does not promote a timely prefetching behavior and usually hinders part of the performance benefits achieved by throttling heuristics. We propose a novel approach to prefetch aggressiveness control in multi-cores that performs throttling at the PRQ (i.e., global) level, using global knowledge of the metrics of all prefetchers and information about the global state of the PRQ. To do this, we introduce the Resizable Prefetching Heap (RPH), a data structure modeled after a binary heap that promotes timely dispatch of prefetches as well as fairness in the distribution of prefetching bandwidth. The RPH is designed as a drop-in replacement of traditional FIFO PRQs. We compare our proposal against a state-of-the-art source-level throttling algorithm (HPAC) in an 8-core system. Unlike previous research, we evaluate both multiprogrammed and multithreaded (parallel) workloads, using a modern prefetching algorithm (C/DC). Our experimental results show that RPH-based throttling increases the throttling performance benefits obtained by HPAC by as much as 148% (53.8% average) in multiprogrammed workloads and as much as 237% (22.5% average) in parallel benchmarks, while consuming roughly the same amount of memory bandwidth. When comparing the speedup over fixed degree prefetching, RPH increased the average speedup of HPAC from 7.1% to 10.9% in multiprogrammed workloads, and from 5.1% to 7.9% in parallel benchmarks. 004 prefetching ; caches ; hardware
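The contrast between a FIFO request queue and a heap-ordered one can be illustrated with a small sketch. This is not the RPH design from the thesis: the priority key (a per-core prefetch accuracy estimate), the class names and the sample requests are all assumptions made for illustration, and resizing and fairness are not modelled.

```python
import heapq
from collections import deque
from itertools import count


class FifoPRQ:
    """Baseline queue: prefetch requests leave in arrival order."""

    def __init__(self):
        self._queue = deque()

    def push(self, core, addr, accuracy):
        # Accuracy is ignored; only arrival order matters.
        self._queue.append((core, addr))

    def pop(self):
        return self._queue.popleft()


class HeapPRQ:
    """Heap-ordered queue: requests from more accurate prefetchers leave first.

    Using per-core accuracy as the priority key is an assumption of this sketch.
    """

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: keeps arrival order among equal keys

    def push(self, core, addr, accuracy):
        # heapq is a min-heap, so negate accuracy to pop the highest first.
        heapq.heappush(self._heap, (-accuracy, next(self._seq), core, addr))

    def pop(self):
        _, _, core, addr = heapq.heappop(self._heap)
        return core, addr


# Hypothetical requests: (core id, address, accuracy of that core's prefetcher).
requests = [(0, 0x100, 0.20), (1, 0x200, 0.90), (0, 0x140, 0.20), (2, 0x300, 0.60)]

fifo, heap = FifoPRQ(), HeapPRQ()
for request in requests:
    fifo.push(*request)
    heap.push(*request)

fifo_order = [fifo.pop() for _ in requests]
heap_order = [heap.pop() for _ in requests]
print("FIFO order:", fifo_order)
print("Heap order:", heap_order)
```

In the FIFO queue the low-accuracy requests from core 0 are dispatched ahead of later, more promising ones; in the heap they are dispatched last.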
3	A Branch-Directed Data Cache Prefetching Technique for Inorder Processors Panda, Reena 2011 December 1900 (has links) The increasing gap between processor and main memory speeds has become a serious bottleneck towards further improvement in system performance. Data prefetching techniques have been proposed to hide the performance impact of such long memory latencies. But most of the currently proposed data prefetchers predict future memory accesses based on current memory misses. This limits the opportunity that can be exploited to guide prefetching. In this thesis, we propose a branch-directed data prefetcher that uses the high prediction accuracies of current-generation branch predictors to predict a future basic block trace that the program will execute and issues prefetches for all the identified memory instructions contained therein. We also propose a novel technique to generate prefetch addresses by exploiting the correlation between the addresses generated by memory instructions and the values of the corresponding source registers at prior branch instances. We evaluate the impact of our prefetcher by using a cycle-accurate simulation of an inorder processor on the M5 simulator. The results of the evaluation show that the branch-directed prefetcher improves the performance on a set of 18 SPEC CPU2006 benchmarks by an average of 38.789% over a no-prefetching implementation and 2.148% over a system that employs a Spatial Memory Streaming prefetcher. Caches Prefetching Inorder Processors
4	Dynamic Memory Optimization using Pool Allocation and Prefetching Zhao, Qin, Rabbah, Rodric, Wong, Weng Fai 01 1900 (has links) Heap memory allocation plays an important role in modern applications. Conventional heap allocators, however, generally ignore the underlying memory hierarchy of the system, favoring instead a low runtime overhead and fast response times. Unfortunately, with little concern for the memory hierarchy, the data layout may exhibit poor spatial locality, and degrade cache performance. In this paper, we describe a dynamic heap allocation scheme called pool allocation. The strategy aims to improve cache performance by inspecting memory allocation requests, and allocating memory from appropriate heap pools as dictated by the requesting context. The advantages are twofold. First, by pooling together data with a common context, we expect to improve spatial locality, as data fetched to the caches will contain fewer items from different contexts. If the allocation patterns are closely matched to the traversal patterns, the end result is faster memory performance. Second, by pooling heap objects, we expect access patterns to exhibit more regularity, thus creating more opportunities for data prefetching. Our dynamic memory optimizer exploits the increased regularity to insert prefetch instructions at runtime. The optimizations are implemented in DynamoRIO, a dynamic optimization framework. We evaluate the work using various benchmarks, and measure a 17% speedup over gcc -O3 on an Athlon MP, and a 13% speedup on a Pentium 4. / Singapore-MIT Alliance (SMA) Dynamic optimization Locality Prefetching
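The idea of allocating from a pool chosen by the requesting context can be sketched as a toy address simulation. This is not the paper's implementation and uses no DynamoRIO interfaces; the `PoolAllocator` class, the pool size and the context names are hypothetical.

```python
class PoolAllocator:
    """Bump allocator with one contiguous pool per allocation context."""

    def __init__(self, pool_size=4096):
        self.pool_size = pool_size
        self.pools = {}      # context -> [base address, bytes used]
        self.next_base = 0   # base address handed to the next new pool

    def alloc(self, context, size):
        # Create a pool the first time a context is seen.
        if context not in self.pools:
            self.pools[context] = [self.next_base, 0]
            self.next_base += self.pool_size
        pool = self.pools[context]
        if pool[1] + size > self.pool_size:
            raise MemoryError("pool exhausted (growth is not modelled)")
        addr = pool[0] + pool[1]
        pool[1] += size
        return addr


allocator = PoolAllocator()
list_nodes, tree_nodes = [], []

# Interleaved requests from two hypothetical call sites.
for _ in range(4):
    list_nodes.append(allocator.alloc("list_insert", 16))
    tree_nodes.append(allocator.alloc("tree_insert", 32))

print("list nodes:", list_nodes)
print("tree nodes:", tree_nodes)
```

Even though the requests are interleaved, objects from the same context end up adjacent and at a constant stride, which is the kind of regularity a prefetcher can exploit.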
5	Prefetching for complex memory access patterns Ainsworth, Sam January 2018 (has links) Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernable pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed. This thesis makes three contributions. I first contribute an automated software prefetching compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3x geometric mean speedup on the most complex processors, and 2.7x on the simplest. I also provide an analysis of when and why this is likely to be successful, which data structures to target, and how to schedule software prefetches well. Then I introduce a hardware solution, the configurable graph prefetcher. This uses the example of breadth-first search on graph workloads to motivate how a hardware prefetcher armed with data-structure knowledge can avoid the instruction overheads, inflexibility and limited latency tolerance of software prefetching. The configurable graph prefetcher sits at the L1 cache and observes memory accesses, which can be configured by a programmer to be aware of a limited number of different data access patterns, achieving 2.3x geometric mean speedup on graph workloads on an out-of-order core. My final contribution extends the hardware used for the configurable graph prefetcher to make an event-triggered programmable prefetcher, using a set of very small micro-controller-sized programmable prefetch units (PPUs) to cover a wide set of workloads.
I do this by developing a highly parallel programming model that can be used to issue prefetches, thus allowing high-throughput prefetching with low power and area overheads of only around 3%, and a 3x geometric mean speedup for a variety of memory-bound applications. To facilitate its use, I then develop compiler techniques to help automate the process of targeting the programmable prefetcher. These provide a variety of tradeoffs from easiest to use to best performance.
6	ADAPTIVE PROFILE DRIVEN DATA CACHING AND PREFETCHING IN MOBILE ENVIRONMENT Mahmood, Omer January 2005 (has links) This thesis describes a new method of calculating data priority by using adaptive mobile user and device profiles which change with user location, time of the day, available networks and data access history. The profiles are used for data prefetching, selection of most suitable wireless network and cache management on the mobile device in order to optimally utilize the device's storage capacity and available bandwidth. Some of the inherent characteristics of mobile devices due to user movements are non-persistent connection, limited bandwidth and storage capacity, changes in mobile device's geographical location and connection (e.g., connection can be from GPRS to WLAN to Bluetooth). New research is being carried out in making mobile devices work more efficiently by reducing and/or eliminating their limitations. The focus of this research is to propose, evaluate and test a new user profiling technique which specifically caters to the needs of the mobile device users who are required to access large amounts of data, possibly more than the device storage capability during the course of the day or week. This work involves the development of an intelligent user profiling system along with mobile device caching system which will first allocate weight (priority) to the different sets and subsets of the total given data based on user's location, user's appointment information, user's preferences, device capabilities and available networks. Then the profile will automatically change the data weights with user movements, history of cached data access and characteristics of available networks.
The Adaptive User and Device Profiles were designed to handle a broad range of the issues associated with: changing network types and conditions; limited storage capacity and document type support of mobile devices; and changes in user data needs due to their movements at different times of the day. Many research areas have been addressed through this research but the primary focus has remained on the following four core areas. The four core areas are: selecting the most suitable wireless network; allocating weights to different datasets & subsets by integrating user's movements; previously accessed data; time of the day with user appointment information and device capabilities. Caching;Prefetching;Mobile User profiles
8	Evaluation of Memory Prefetching Techniques for Modem Applications Nyholm, Gustav January 2022 (has links) Processor performance has increased far faster than memories have been able to keep up with, forcing processor designers to use caches in order to bridge the speed difference. This can increase performance significantly for programs that utilize the caches efficiently but results in significant performance penalties when data is not in cache. One way to mitigate this problem is to make sure that data is cached before it is needed using memory prefetching. This thesis focuses on different ways to perform prefetching in systems with strict area and energy requirements by evaluating a number of prefetch techniques based on performance in two programs as well as metrics such as coverage and accuracy. Both data and instruction prefetching are investigated. The studied techniques include a number of versions of next line prefetching, prefetching based on stride identification and history as well as post-increment based prefetching. While the best increase in program performance is achieved using next 2 lines prefetching, it comes at a significant energy cost as well as drastically increased memory traffic, making it unsuitable for use in energy-constrained applications. RPT-based prefetching, on the other hand, gives a good balance between performance and cost, managing to improve performance by 4% and 7% for two programs while keeping the impact on both area and energy minimal. cache prefetching Computer Engineering Datorteknik
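Coverage and accuracy can be made concrete with a toy stride prefetcher. The sketch below is not the thesis's simulator: it assumes an unbounded cache, prefetches that arrive instantly, a 64-byte line, and a per-PC stride table only loosely in the spirit of a reference prediction table. Coverage is taken here as the fraction of baseline misses removed, and accuracy as the fraction of issued prefetches that were later used.

```python
LINE = 64  # cache line size in bytes (assumed)


def simulate(trace, prefetch=True):
    """Return (demand_misses, prefetches_issued, useful_prefetches)."""
    cache = set()    # lines present; capacity is unbounded in this sketch
    pending = set()  # prefetched lines not yet touched by a demand access
    table = {}       # pc -> (last address, last stride)
    misses = issued = useful = 0
    for pc, addr in trace:
        line = addr // LINE
        if line in pending:
            pending.discard(line)   # the prefetch turned a miss into a hit
            useful += 1
        elif line not in cache:
            misses += 1
        cache.add(line)
        if prefetch:
            last_addr, last_stride = table.get(pc, (None, 0))
            stride = 0 if last_addr is None else addr - last_addr
            # Prefetch only after the same non-zero stride is seen twice.
            if stride != 0 and stride == last_stride:
                target = (addr + stride) // LINE
                if target not in cache and target not in pending:
                    pending.add(target)
                    issued += 1
            table[pc] = (addr, stride)
    return misses, issued, useful


# Hypothetical trace: one strided load and one irregular load.
strided = [(0x400, i * 64) for i in range(10)]
irregular = [(0x500, a) for a in (10000, 50000, 20000, 90000)]
trace = strided + irregular

baseline_misses, _, _ = simulate(trace, prefetch=False)
misses, issued, useful = simulate(trace, prefetch=True)

coverage = (baseline_misses - misses) / baseline_misses
accuracy = useful / issued
print(f"coverage={coverage:.3f} accuracy={accuracy:.3f}")
```

The strided load is covered once its stride has been confirmed, while the irregular load is never predicted, so coverage stays well below 1 even though almost every issued prefetch is used.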
9	Optimizing Performance in Highly Utilized Multicores with Intelligent Prefetching Khan, Muneeb January 2016 (has links) Modern processors apply sophisticated techniques, such as deep cache hierarchies and hardware prefetching, to increase performance. Such complex hardware structures have helped improve performance in general; however, their full potential is not realized as software often utilizes the memory hierarchy inefficiently. Performance can be improved further by ensuring careful interaction between software and hardware. Performance can typically improve by increasing the cache utilization and by conserving the DRAM bandwidth, i.e., retaining more useful data in the caches and lowering data requests to the DRAM. One way to achieve this is to conserve space across the cache hierarchy and increase opportunity for temporal reuse of cached data. Similarly, conserving the DRAM bandwidth is essential for performance in highly utilized multicores, as it can easily become a critical resource. When multiple cores are active and the per-core share of DRAM bandwidth shrinks, its efficient utilization plays an important role in improving the overall performance. Together the cache hierarchy and the DRAM bandwidth play a significant role in defining the overall performance in multicores. Based on deep insight from memory behavior modeling of software, this thesis explores five software-only methods to analyze and increase performance in multicores. The underlying philosophy that drives these techniques is to increase cache utilization and conserve DRAM bandwidth by 1) focusing on making data prefetching more accurate, and 2) lowering the miss rate in the cache hierarchy either by preserving useful data longer by cache-bypassing the less useful data or via code size compaction using compiler options. First, we show how microarchitecture-independent memory access profiles can be used to analyze the instruction cache performance of software.
We use this information in a compiler pass to recompile application phases (with a large instruction cache miss rate) for smaller code size in an effort to improve the application's instruction cache behavior. Second, we demonstrate how a resource-efficient software prefetching method can be combined with hardware prefetching to improve performance in multicores when running software that exhibits irregular memory access patterns. Third, we show that hardware prefetching on high performance commodity multicores is sub-optimal and demonstrate how a resource-efficient software-only prefetching method can perform better in fully utilized multicores. Fourth, we present an adaptive prefetching approach that dynamically combines software and hardware prefetching in a runtime system to improve performance in highly utilized multicores. Finally, in the fifth work we develop a method to predict per-core prefetching configurations that deliver near-optimal overall multicore performance. These software techniques enable us to tap greater performance in multicores (up to 50%), without requiring more processing resources. Performance Optimization Prefetching multicore memory hierarchy
10	Proxy Support for HTTP Adaptive Streaming 2013 December 1900 (has links) Not long ago streaming video over the Internet included only short clips of low quality video. Now the possibilities seem endless as professional productions are made available in high definition. This explosion of growth is the result of several factors, such as increasing network performance, advancements in video encoding technology, improvements to video streaming techniques, and a growing number of devices capable of handling video. However, despite the improvements to Internet video streaming this paradigm is still evolving. HTTP adaptive streaming involves encoding a video at multiple quality levels then dividing those quality levels into small chunks. The player can then determine which quality level to retrieve the next chunk from in order to optimize video playback when considering the underlying network conditions. This thesis first presents an experimental framework that allows for adaptive streaming players to be analyzed and evaluated. Evaluation is beneficial because there are several concerns with the adaptive video streaming ecosystem such as achieving a high video playback quality while also ensuring stable playback quality. The primary contribution of this thesis is the evaluation of prefetching by a proxy server as a means to improve streaming performance. This work considers an implementation of a proxy server that is functional with the extremely popular Netflix streaming service, and it is evaluated using two Netflix players. The results show its potential to improve video streaming performance in several scenarios. It effectively increases the buffer capacity of the player as chunks can be prefetched in advance of the player's request then stored on the proxy to be quickly delivered once requested. 
This allows for degradation in network conditions to be hidden from the player while the proxy serves prefetched data, preventing a reduction to the video quality as a result of an overreaction by the player. Further, the proxy can reduce the impact of the bottleneck in the network, achieving higher throughput by utilizing parallel connections to the server. HTTP Adaptive Streaming video streaming Prefetching
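The interaction between a player's quality selection and a prefetching proxy can be sketched as follows. This is not the thesis's proxy and does not reflect any real streaming service's protocol: the bitrate ladder, the safety factor, the `PrefetchingProxy` class and the `origin` function are all hypothetical, and the proxy simply assumes the player will request the next chunk at the same bitrate.

```python
BITRATES_KBPS = [235, 750, 1750, 3000]  # hypothetical quality ladder


def pick_bitrate(throughput_kbps, safety=0.8):
    """Highest bitrate that fits within a safety fraction of the throughput."""
    budget = throughput_kbps * safety
    fitting = [b for b in BITRATES_KBPS if b <= budget]
    return fitting[-1] if fitting else BITRATES_KBPS[0]


class PrefetchingProxy:
    """Serves chunks and speculatively fetches the next one at the same bitrate."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable (index, bitrate) -> chunk bytes
        self.cache = {}
        self.hits = 0

    def get(self, index, bitrate):
        key = (index, bitrate)
        if key in self.cache:
            self.hits += 1
            chunk = self.cache.pop(key)
        else:
            chunk = self.fetch(index, bitrate)
        # Speculative fetch of the following chunk.
        nxt = (index + 1, bitrate)
        self.cache[nxt] = self.fetch(*nxt)
        return chunk


def origin(index, bitrate):
    """Stand-in for the video server."""
    return f"chunk{index}@{bitrate}".encode()


proxy = PrefetchingProxy(origin)
throughputs = [4000, 4000, 4000, 1000, 1000]  # measured kbps per chunk (made up)
chosen = [pick_bitrate(t) for t in throughputs]
chunks = [proxy.get(i, b) for i, b in enumerate(chosen)]
print("bitrates:", chosen, "proxy hits:", proxy.hits)
```

The proxy misses on the first request and again when the player switches quality, since the chunk it fetched in advance was at the old bitrate; every other request is served from its cache.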
