1 |
DESIGN OF PRIORITIZED LRU CIRCUITS FOR CACHE OF MULTI-CORE REAL-TIME SYSTEMSGopalakrishnan, Lavanya 01 August 2011 (has links)
With the advancement of technology, multi-cores with shared cache have been used in real-time applications. In such systems, some cores run real-time applications and some cores run other non-critical applications that do not have strict deadline. Due to the sharing of cache by multi-core processors, problems predicting the actual execution time and the execution time of real-time applications have emerged. To address these problems, cache memory with prioritized replacement policy is proposed. Most of the work is carried out in high-level hardware designs and software based application level designs. No low-level hardware implementations of cache memory with prioritized replacement circuits have been designed to the best of my knowledge. My thesis focuses on designing a LRU replacement circuit that is prioritized based on the application the processor is running. Real-time applications acquire priority in using the cache memory over other applications which enhance the seamless execution of the real-time application and hence supports execution time predictability which in turn helps improve the potential of multi-core computing of real-time systems. The speed, size and power overhead are analyzed by placing the N-way set associative LRU as a part of cache of size 128KB designed using 65nm CMOS technology.
|
2 |
Improving cache Behavior in CMP architectures throug cache partitioning techniquesMoretó Planas, Miquel 19 March 2010 (has links)
The evolution of microprocessor design in the last few decades has changed significantly, moving from simple inorder single core architectures to superscalar and vector architectures in order to extract the maximum available instruction level parallelism. Executing several instructions from the same thread in parallel allows significantly improving the performance of an application. However, there is only a limited amount of parallelism available in each thread, because of data and control dependences. Furthermore, designing a high performance, single, monolithic processor has become very complex due to power and chip latencies constraints. These limitations have motivated the use of thread level parallelism (TLP) as a common strategy for improving processor performance. Multithreaded processors allow executing different threads at the same time, sharing some hardware resources. There are several flavors of multithreaded processors that exploit the TLP, such as chip multiprocessors (CMP), coarse grain multithreading, fine grain multithreading, simultaneous multithreading (SMT), and combinations of them.To improve cost and power efficiency, the computer industry has adopted multicore chips. In particular, CMP architectures have become the most common design decision (combined sometimes with multithreaded cores). Firstly, CMPs reduce design costs and average power consumption by promoting design re-use and simpler processor cores. For example, it is less complex to design a chip with many small, simple cores than a chip with fewer, larger, monolithic cores.Furthermore, simpler cores have less power hungry centralized hardware structures. Secondly, CMPs reduce costs by improving hardware resource utilization. On a multicore chip, co-scheduled threads can share costly microarchitecture resources that would otherwise be underutilized. Higher resource utilization improves aggregate performance and enables lower cost design alternatives.One of the resources that impacts most on the final performance of an application is the cache hierarchy. Caches store data recently used by the applications in order to take advantage of temporal and spatial locality of applications. Caches provide fast access to data, improving the performance of applications. Caches with low latencies have to be small, which prompts the design of a cache hierarchy organized into several levels of cache.In CMPs, the cache hierarchy is normally organized in a first level (L1) of instruction and data caches private to each core. A last level of cache (LLC) is normally shared among different cores in the processor (L2, L3 or both). Shared caches increase resource utilization and system performance. Large caches improve performance and efficiency by increasing the probability that each application can access data from a closer level of the cache hierarchy. It also allows an application to make use of the entire cache if needed.A second advantage of having a shared cache in a CMP design has to do with the cache coherency. In parallel applications, different threads share the same data and keep a local copy of this data in their cache. With multiple processors, it is possible for one processor to change the data, leaving another processor's cache with outdated data. Cache coherency protocol monitors changes to data and ensures that all processor caches have the most recent data. When the parallel application executes on the same physical chip, the cache coherency circuitry can operate at the speed of on-chip communications, rather than having to use the much slower between-chip communication, as is required with discrete processors on separate chips. These coherence protocols are simpler to design with a unified and shared level of cache onchip.Due to the advantages that multicore architectures offer, chip vendors use CMP architectures in current high performance, network, real-time and embedded systems. Several of these commercial processors have a level of the cache hierarchy shared by different cores. For example, the Sun UltraSPARC T2 has a 16-way 4MB L2 cache shared by 8 cores each one up to 8-way SMT. Other processors like the Intel Core 2 family also share up to a 12MB 24-way L2 cache. In contrast, the AMD K10 family has a private L2 cache per core and a shared L3 cache, with up to a 6MB 64-way L3 cache.As the long-term trend of increasing integration continues, the number of cores per chip is also projected to increase with each successive technology generation. Some significant studies have shown that processors with hundreds of cores per chip will appear in the market in the following years. The manycore era has already begun. Although this era provides many opportunities, it also presents many challenges. In particular, higher hardware resource sharing among concurrently executing threads can cause individual thread's performance to become unpredictable and might lead to violations of the individual applications' performance requirements. Current resource management mechanisms and policies are no longer adequate for future multicore systems.Some applications present low re-use of their data and pollute caches with data streams, such as multimedia, communications or streaming applications, or have many compulsory misses that cannot be solved by assigning more cache space to the application. Traditional eviction policies such as Least Recently Used (LRU), pseudo LRU or random are demand-driven, that is, they tend to give more space to the application that has more accesses to the cache hierarchy.When no direct control over shared resources is exercised (the last level cache in this case), it is possible that a particular thread allocates most of the shared resources, degrading other threads performance. As a consequence, high resource sharing and resource utilization can cause systems to become unstable and violate individual applications' requirements. If we want to provide a Quality of Service (QoS) to applications, we need to enhance the control over shared resources and enrich the collaboration between the OS and the architecture.In this thesis, we propose software and hardware mechanisms to improve cache sharing in CMP architectures. We make use of a holistic approach, coordinating targets of software and hardware to improve system aggregate performance and provide QoS to applications. We make use of explicit resource allocation techniques to control the shared cache in a CMP architecture, with resource allocation targets driven by hardware and software mechanisms.The main contributions of this thesis are the following:- We have characterized different single- and multithreaded applications and classified workloads with a systematic method to better understand and explain the cache sharing effects on a CMP architecture. We have made a special effort in studying previous cache partitioning techniques for CMP architectures, in order to acquire the insight to propose improved mechanisms.- In CMP architectures with out-of-order processors, cache misses can be served in parallel and share the miss penalty to access main memory. We take this fact into account to propose new cache partitioning algorithms guided by the memory-level parallelism (MLP) of each application. With these algorithms, the system performance is improved (in terms of throughput and fairness) without significantly increasing the hardware required by previous proposals.- Driving cache partition decisions with indirect indicators of performance such as misses, MLP or data re-use may lead to suboptimal cache partitions. Ideally, the appropriate metric to drive cache partitions should be the target metric to optimize, which is normally related to IPC. Thus, we have developed a hardware mechanism, OPACU, which is able to obtain at run-time accurate predictions of the performance of an application when running with different cache assignments.- Using performance predictions, we have introduced a new framework to manage shared caches in CMP architectures, FlexDCP, which allows the OS to optimize different IPC-related target metrics like throughput or fairness and provide QoS to applications. FlexDCP allows an enhanced coordination between the hardware and the software layers, which leads to improved system performance and flexibility.- Next, we have made use of performance estimations to reduce the load imbalance problem in parallel applications. We have built a run-time mechanism that detects parallel applications sensitive to cache allocation and, in these situations, the load imbalance is reduced by assigning more cache space to the slowest threads. This mechanism, helps reducing the long optimization time in terms of man-years of effort devoted to large-scale parallel applications.- Finally, we have stated the main characteristics that future multicore processors with thousands of cores should have. An enhanced coordination between the software and hardware layers has been proposed to better manage the shared resources in these architectures.
|
3 |
Efficient shared cache management in multicore processorsXie, Yuejian 20 May 2011 (has links)
In modern multicore processors, various resources (such as memory bandwidth and caches) are designed to be shared by concurrently running threads. Though it is good to be able to run multiple programs on a single chip at the same time, sometimes the contention of these shared resources can create problems for system performance. Naive hard-partitioning between threads can result in low resource utilization. This research shows that simple and effective approaches to dynamically manage the shared cache can be achieved. The contributions of this work are the following: (1) a technique for dynamic on-line classification of application memory access behaviors to predict the usefulness of cache partitioning, and a simple shared-cache management approach based on the classification; (2) a cache pseudo-partitioning technique that manipulates insertion and promotion policies; (3) a scalable algorithm to quickly decide per-core cache allocations; (4) pseudo-LRU cache partition approximation; (5) a dynamic shared cache compression technique that considers different thread behaviors.
|
4 |
Shared resource management for efficient heterogeneous computingLee, Jaekyu 13 January 2014 (has links)
The demand for heterogeneous computing, because of its performance and energy efficiency, has made on-chip heterogeneous chip multi-processors (HCMP) become the mainstream computing platform, as the recent trend shows in a wide spectrum of platforms from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have integrated more powerful GPUs and this trend will continue in upcoming processors.
In this architecture, several system resources are shared between CPUs and GPUs. The sharing of system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention problems between cores. The resource sharing problem has existed since the homogeneous (CPU-only) chip-multi processor (CMP) was introduced. However, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. In order to solve the resource sharing problem in HCMPs, we consider efficient shared resource management schemes, in particular tackling the problem in shared last-level cache and interconnection network.
In the thesis, we propose four resource sharing mechanisms:
First, we propose an efficient cache sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for on-chip interconnection network is proposed to isolate inter-application interference. By partitioning virtual channels to CPUs and GPUs, we can prevent the interference problem while guaranteeing quality-of-service (QoS) for both cores. Third, we propose a dynamic frequency controlling mechanism to efficiently share system resources. When both cores are active, the degree of resource contention as well as the system throughput will be affected by the operating frequency of CPUs and GPUs. The proposed mechanism tries to find optimal operating frequencies for both cores, which reduces the resource contention while improving system throughput. Finally, we propose a second cache sharing mechanism that exploits GPU-semantic information. The programming and execution models of GPUs are more strict and easier than those of CPUs. Also, programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can energy-efficiently exercise the cache and simpler, but more efficient cache partitioning can be enabled for HCMPs.
|
5 |
Improving Last-Level Cache Performance in Single and Multi-Core ProcesssorsManikanth, R January 2013 (has links) (PDF)
With off-chip memory access taking 100's of processor cycles, getting data to the processor in a timely fashion remains one of the key performance bottlenecks in current systems. With increasing core counts, this problem aggravates and the memory access latency becomes even more critical in multi-core systems. Thus the Last Level Cache (LLC) is of particular importance as any miss experienced at the LLC translates into a costly off-chip memory access. A combination of on-chip caches and prefacers is used to hide the off-chip memory access latency. While a hierarchy of caches focus on exploiting locality by retaining useful data, prefacers complement them by initating data accesses early for blocks that are likely to be accessed in future. In the first half of this thesis, we focus on improving the performance of LLC in single-core processors by focusing on prefetchers. In the case of multi-cores, the LLC is shared across many cores and therefore by many programs running on them. Thus, in the second half of this thesis, we focus on novel and efficient management mechanisms for shared LLC to improve the performance of programs running on the various cores.
Prefetchers observe a training stream of primary misses in the cache and rely on the regularity present in them to predict and avoid future misses. We quantify the regularity present in the training stream using the information theoretic measure of entropy and study the impact on regularity by extending the training stream to include secondary misses and accesses. We also consider triggering prefetches on secondary misses. We _nd that the extended histories are more regular in general and it is beneficial to trigger prefetches on secondary misses also. However, the best design choice varies on a per-benchmark and prefetcher basis, necessitating a dynamic approach to identify the best prefetcher configuration. We propose an inexpensive bloom filter based dynamic mechanism to identify the best performing prefetch design point at run time. The adaptive scheme improves the performance in terms of Instructions Per Cycle (IPC) by 4.6% on average over a baseline prefetcher. This performance improvement is achieved along with a reduction in memory traffic requirements.
It is well known that aggressive prefetching can harm performance due to increased contention for memory bandwidth and cache pollution. Prefetchers treat all loads as equal and try to eliminate as many misses as possible while certain (static) load instructions are known to be more performance critical. As our second contribution, we propose Focused Prefetching, a generic mechanism to introduce performance awareness in prefetching. We identify that a small number of static loads, referred to as Loads Incurring Majority of Commit Stalls (LIMCOS), account for a majority of the commit stalls in processors. We propose simple history-based classifier to identify LIMCOS with high accuracy. We use the classifier to focus the prefetching efforts on LIMCOS. This is achieved in a generic prefetcher-agnostic fashion by filtering the history used by the prefetchers. Focused Prefetching improves performance in terms of IPC by 9.8% for a set of memory intensive SPEC2000 workloads. This performance gain is achieved along with a reduction in memory traffic and an improvement in prefetch accuracy.
In the second part of the thesis, we focus on improving the performance of shared caches in multi-core systems. Last level caches are affected by a lack of temporal locality in the access stream as the locality gets filtered out by caches above it. In the case of multi-cores, the interleaving of accesses from the various cores further adds to the problem. To overcome this, we propose a PC-Centric Next-Use Aware Cache Organization (NUcache) for shared caches in multi-cores, with an ability to retain a subset of cache blocks longer. This is achieved by a logical partitioning of the associative ways of a cache set into Main Ways and Deli Ways. While all the blocks have access to the Main Ways, blocks that are likely to be accessed in the near future (with shorter Next-Use distance) are candidates to be retained longer in the Deli Ways to eliminate future misses. We make use of the fact that a small number of PCs, referred to as delinquent PCs, bring in a majority of the cache blocks and learn the Next-Use characteristic of blocks brought in by them. We propose an intelligent cost-benefit based PC-selection mechanism to identify the best set of delinquent PCs that should have access to the Deli Ways to maximize the cache hits. Performance evaluation reveals that NUcache improves the performance (in terms of Average Normalized Turnaround Time, ANTT) of multi-programmed workloads by 6.2%, 13.9%, 15.8% and 19.6% in dual, quad, eight and sixteen core machines respectively. NUcache also performs better than some of the state-of-the-art cache partitioning mechanisms.
The last part of the thesis deals with effective shared cache management in multi-core systems to achieve various performance objectives. Explicitly controlling the shared cache occupancy of competing applications is a flexible and practical way to achieve a variety of high level performance goals. Existing solutions control cache occupancy at a coarser granularity, do not scale well to large core counts and, in some cases, lack the flexibility to support a variety of performance goals. To overcome this, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities. The proposed framework requires only simple hardware changes to implement, can scale to larger core count and is flexible enough to support a variety of performance goals like hit-maximization, fairness and QoS. PriSM with Hit-Maximization improves the performance (of multi-programmed workloads) in terms of ANTT by 16.5%, 18.7% and 12.7% over baseline LRU in eight, sixteen and thirty two core machines respectively.
|
6 |
Competitive cache replacement strategies for a shared cacheKatti, Anil Kumar 08 July 2011 (has links)
We consider cache replacement algorithms at a shared cache in a multicore system which receives an arbitrary interleaving of requests from processes that have full knowledge about their individual request sequences. We establish tight bounds on the competitive ratio of deterministic and randomized cache replacement strategies when processes share memory blocks. Our main result for this case is a deterministic algorithm called GLOBAL-MAXIMA which is optimum up to a constant factor when processes share memory blocks. Our framework is a generalization of the application controlled caching framework in which processes access disjoint sets of memory blocks. We also present a deterministic algorithm called RR-PROC-MARK which exactly matches the lower bound on the competitive ratio of deterministic cache replacement algorithms when processes access disjoint sets of memory blocks. We extend our results to multiple levels of caches and prove that an exclusive cache is better than both inclusive and non-inclusive caches; this validates the experimental findings in the literature. Our results could be applied to shared caches in multicore systems in which processes work together on multithreaded computations like Gaussian elimination paradigm, fast Fourier transform, matrix multiplication, etc. In these computations, processes have full knowledge about their individual request sequences and can share memory blocks. / text
|
7 |
Precise Analysis of Private And Shared Caches for Tight WCET EstimatesNagar, Kartik January 2016 (has links) (PDF)
Worst Case Execution Time (WCET) is an important metric for programs running on real-time systems, and finding precise estimates of a program’s WCET is crucial to avoid over-allocation and wastage of hardware resources and to improve the schedulability of task sets. Hardware Caches have a major impact on a program’s execution time, and accurate estimation of a program’s cache behavior generally leads to significant reduction of its estimated WCET. However, the cache behavior of an access cannot be determined in isolation, since it depends on the access history, and in multi-path programs, the sequence of accesses made to the cache is not fixed. Hence, the same access can exhibit different cache behavior in different execution instances. This issue is further exacerbated in shared caches in a multi-core architecture, where interfering accesses from co-running programs on other cores can arrive at any time and modify the cache state. Further, cache analysis aimed towards WCET estimation should be provably safe, in that the estimated WCET should always exceed the actual execution time across all execution instances.
Faced with such contradicting requirements, previous approaches to cache analysis try to find memory accesses in a program which are guaranteed to hit the cache, irrespective of the program input, or the interferences from other co-running programs in case of a shared cache. To do so, they find the worst-case cache behavior for every individual memory access, analyzing the program (and interferences to a shared cache) to find whether there are execution instances where an access can super a cache miss. However, this approach loses out in making more precise predictions of private cache behavior which can be safely used for WCET estimation, and is significantly imprecise for shared cache analysis, where it is often impossible to guarantee that an access always hits the cache. In this work, we take a fundamentally different approach to cache analysis, by (1) trying to find worst-case behavior of groups of cache accesses, and (2) trying to find the exact cache behavior in the worst-case program execution instance, which is the execution instance with the maximum execution time.
For shared caches, we propose the Worst Case Interference Placement (WCIP) technique, which finds the worst-case timing of interfering accesses that would cause the maximum number of cache misses on the worst case execution path of the program. We first use Integer Linear Programming (ILP) to find an exact solution to the WCIP problem. However, this approach does not scale well for large programs, and so we investigate the WCIP problem in detail and prove that it is NP-Hard.
In the process, we discover that the source of hardness of the WCIP problem lies in finding the worst case execution path which would exhibit the maximum execution time in the presence of interferences. We use this observation to propose an approximate algorithm for performing WCIP, which bypasses the hard problem of finding the worst case execution path by simply assuming that all cache accesses made by the program occur on a single path. This allows us to use a simple greedy algorithm to distribute the interfering accesses by choosing those cache accesses which could be most affected by interferences. The greedy algorithm also guarantees that the increase in WCET due to interferences is linear in the number of interferences. Experimentally, we show that WCIP provides substantial precision improvement in the final WCET over previous approaches to shared cache analysis, and the approximate algorithm almost matches the precision of the ILP-based approach, while being considerably faster.
For private caches, we discover multiple scenarios where hit-miss predictions made by traditional Abstract Interpretation-based approaches are not sufficient to fully capture cache behavior for WCET estimation. We introduce the concept of cache miss paths, which are abstractions of program path along which an access can super a cache miss. We propose an ILP-based approach which uses cache miss paths to find the exact cache behavior in the worst-case execution instance of the program. However, the ILP-based approach needs information about the worst-case execution path to predict the cache behavior, and hence it is difficult to integrate it with other micro-architectural analysis. We then show that most of the precision improvement of the ILP-based approach can be recovered without any knowledge of the worst-case execution path, by a careful analysis of the cache miss paths themselves. In particular, we can use cache miss paths to find the worst-case behavior of groups of cache accesses. Further, we can find upper bounds on the maximum number of times that cache accesses inside loops can exhibit worst-case behavior. This results in a scalable, precise method for performing private cache analysis which can be easily integrated with other micro-architectural analysis.
|
Page generated in 0.0673 seconds