Global ETD Search

1	Energy efficient cache architectures for single, multi and many core processors Thucanakkenpalayam Sundararajan, Karthik January 2013 (has links) With each technology generation we get more transistors per chip. Whilst processor frequencies have increased over the past few decades, memory speeds have not kept pace. Therefore, more and more transistors are devoted to on-chip caches to reduce latency to data and help achieve high performance. On-chip caches consume a significant fraction of the processor energy budget but need to deliver high performance. Therefore cache resources should be optimized to meet the requirements of the running applications. Fixed configuration caches are designed to deliver low average memory access times across a wide range of potential applications. However, this can lead to excessive energy consumption for applications that do not require the full capacity or associativity of the cache at all times. Furthermore, in systems where the clock period is constrained by the access times of level-1 caches, the clock frequency for all applications is effectively limited by the cache requirements of the most demanding phase within the most demanding application. This motivates the need for dynamic adaptation of cache configurations in order to optimize performance while minimizing energy consumption, on a per-application basis. First, this thesis proposes an energy-efficient cache architecture for a single core system, along with a run-time support framework for dynamic adaptation of cache size and associativity through the use of machine learning. The machine learning model, which is trained offline, profiles the application’s cache usage and then reconfigures the cache according to the program’s requirement. The proposed cache architecture has, on average, 18% better energy-delay product than the prior state-of-the-art cache architectures proposed in the literature. 
Next, this thesis proposes cooperative partitioning, an energy-efficient cache partitioning scheme for multi-core systems that share the Last Level Cache (LLC), with a core to LLC cache way ratio of 1:4. The proposed cache partitioning scheme uses small auxiliary tags to capture each core’s cache requirements, and partitions the LLC according to the individual cores’ cache requirements. The proposed partitioning uses a way-aligned scheme that helps reduce both dynamic and static energy. This scheme offers, on average, 70% and 30% reductions in dynamic and static energy respectively, while maintaining high performance on par with state-of-the-art cache partitioning schemes. Finally, when the number of Last Level Cache (LLC) ways is equal to or less than the number of cores, as in many-core systems, cooperative partitioning cannot be used for partitioning the LLC. This thesis proposes a region aware cache partitioning scheme as an energy-efficient approach for many core systems that share the LLC, with core to LLC way ratios of 1:2 and 1:1. The proposed partitioning offers, on average, 68% and 33% reductions in dynamic and static energy respectively, while again maintaining high performance on par with state-of-the-art LLC cache management techniques. 004.5
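For readers unfamiliar with the energy-delay product (EDP) metric quoted in this abstract, the following is a minimal sketch of how such a relative improvement is computed. EDP is the product of energy and execution time, so lower is better. The function names and the energy and delay figures are invented for illustration and are not taken from the thesis.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy x delay; lower values are better."""
    return energy_joules * delay_seconds


def edp_improvement(baseline_edp, proposed_edp):
    """Fractional EDP reduction of a proposed design relative to a baseline."""
    return (baseline_edp - proposed_edp) / baseline_edp


# Hypothetical numbers: the adaptive cache saves 20% energy but runs 2.5% slower.
baseline = energy_delay_product(energy_joules=2.0, delay_seconds=1.0)
adaptive = energy_delay_product(energy_joules=1.6, delay_seconds=1.025)
print(f"EDP improvement: {edp_improvement(baseline, adaptive):.1%}")
```

The example shows why EDP is used for this kind of trade-off: a configuration may give up a little performance and still come out ahead if the energy saving is larger.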
2	Toward Next-generation Data Centers : Principles of Software-Defined “Hardware” Infrastructures and Resource Disaggregation Roozbeh, Amir January 2019 (has links) The cloud is evolving due to additional demands introduced by new technological advancements and the wide movement toward digitalization. Therefore, next-generation data centers (DCs) and clouds are expected (and need) to become cheaper, more efficient, and capable of offering more predictable services. Aligned with this, we examine the concept of software-defined “hardware” infrastructures (SDHI) based on hardware resource disaggregation as one possible way of realizing next-generation DCs. We start with an overview of the functional architecture of a cloud based on SDHI. Following this, we discuss a series of use-cases and deployment scenarios enabled by SDHI and explore the role of each functional block of SDHI’s architecture, i.e., cloud infrastructure, cloud platforms, cloud execution environments, and applications. Next, we propose a framework to evaluate the impact of SDHI on techno-economic efficiency of DCs, specifically focusing on application profiling, hardware dimensioning, and total cost of ownership (TCO). Our study shows that combining resource disaggregation and software-defined capabilities makes DCs less expensive and easier to expand; hence they can rapidly follow the exponential demand growth. Additionally, we elaborate on technologies behind SDHI, its challenges, and its potential future directions. Finally, to identify a suitable memory management scheme for SDHI and show its advantages, we focus on the management of Last Level Cache (LLC) in currently available Intel processors. Aligned with this, we investigate how better management of LLC can provide higher performance, more predictable response time, and improved isolation between threads. 
More specifically, we take advantage of the LLC’s non-uniform cache architecture (NUCA), in which the LLC is divided into “slices,” and a core’s access to the slice closest to it is faster than its access to other slices. Based upon this, we introduce a new memory management scheme, called slice-aware memory management, which carefully maps the allocated memory to LLC slices based on their access latency rather than the de facto scheme that maps them uniformly. Many applications can benefit from our memory management scheme with relatively small changes. As an example, we show the potential benefits that Key-Value Store (KVS) applications gain by utilizing our memory management scheme. Moreover, we discuss how this scheme could be used to provide explicit CPU slicing, which is one of the expectations of SDHI and hardware resource disaggregation. / QC 20190415 cloud computing next-generation data centers resource disaggregation TCO last level cache Communication Systems
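The slice-aware idea described in this abstract can be sketched as follows, under loud assumptions: `toy_slice_of` is a made-up stand-in for the processor's address-to-slice mapping (the abstract does not describe the real mapping), the latency table is invented, and the slice count is assumed to be a power of two. The sketch only shows the selection step: find the slice a core reaches fastest, then keep the cache lines of a buffer that map to it.

```python
LINE_SIZE = 64  # bytes per cache line


def toy_slice_of(addr, n_slices=8):
    """Hypothetical address-to-slice hash (XOR-fold of the line number).

    Assumes n_slices is a power of two so the result stays in range.
    """
    line = addr // LINE_SIZE
    h = 0
    while line:
        h ^= line % n_slices
        line //= n_slices
    return h


def closest_slice(latencies_by_slice):
    """Index of the slice with the lowest measured access latency."""
    return min(range(len(latencies_by_slice)), key=latencies_by_slice.__getitem__)


def slice_local_lines(base, size, target_slice, n_slices=8):
    """Line-aligned addresses in [base, base + size) that map to target_slice."""
    return [
        addr
        for addr in range(base, base + size, LINE_SIZE)
        if toy_slice_of(addr, n_slices) == target_slice
    ]


# Invented per-slice latencies (in cycles) as seen from one core.
latencies = [44, 38, 52, 47, 60, 41, 55, 49]
target = closest_slice(latencies)
lines = slice_local_lines(base=0, size=64 * 1024, target_slice=target)
print(target, len(lines))
```

An allocator built on this would hand out only the selected lines for latency-critical data, trading usable memory per buffer for faster LLC hits.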
3	Improving Last-Level Cache Performance in Single and Multi-Core Processors Manikanth, R January 2013 (has links) (PDF) With off-chip memory access taking hundreds of processor cycles, getting data to the processor in a timely fashion remains one of the key performance bottlenecks in current systems. With increasing core counts, this problem is aggravated and the memory access latency becomes even more critical in multi-core systems. Thus the Last Level Cache (LLC) is of particular importance, as any miss experienced at the LLC translates into a costly off-chip memory access. A combination of on-chip caches and prefetchers is used to hide the off-chip memory access latency. While a hierarchy of caches focuses on exploiting locality by retaining useful data, prefetchers complement them by initiating data accesses early for blocks that are likely to be accessed in the future. In the first half of this thesis, we focus on improving the performance of the LLC in single-core processors by focusing on prefetchers. In the case of multi-cores, the LLC is shared across many cores and therefore by many programs running on them. Thus, in the second half of this thesis, we focus on novel and efficient management mechanisms for the shared LLC to improve the performance of programs running on the various cores. Prefetchers observe a training stream of primary misses in the cache and rely on the regularity present in it to predict and avoid future misses. We quantify the regularity present in the training stream using the information theoretic measure of entropy and study the impact on regularity of extending the training stream to include secondary misses and accesses. We also consider triggering prefetches on secondary misses. We find that the extended histories are more regular in general and that it is also beneficial to trigger prefetches on secondary misses. 
However, the best design choice varies on a per-benchmark and prefetcher basis, necessitating a dynamic approach to identify the best prefetcher configuration. We propose an inexpensive Bloom filter based dynamic mechanism to identify the best performing prefetch design point at run time. The adaptive scheme improves the performance in terms of Instructions Per Cycle (IPC) by 4.6% on average over a baseline prefetcher. This performance improvement is achieved along with a reduction in memory traffic requirements. It is well known that aggressive prefetching can harm performance due to increased contention for memory bandwidth and cache pollution. Prefetchers treat all loads as equal and try to eliminate as many misses as possible, while certain (static) load instructions are known to be more performance critical. As our second contribution, we propose Focused Prefetching, a generic mechanism to introduce performance awareness in prefetching. We identify that a small number of static loads, referred to as Loads Incurring Majority of Commit Stalls (LIMCOS), account for a majority of the commit stalls in processors. We propose a simple history-based classifier to identify LIMCOS with high accuracy. We use the classifier to focus the prefetching efforts on LIMCOS. This is achieved in a generic prefetcher-agnostic fashion by filtering the history used by the prefetchers. Focused Prefetching improves performance in terms of IPC by 9.8% for a set of memory intensive SPEC2000 workloads. This performance gain is achieved along with a reduction in memory traffic and an improvement in prefetch accuracy. In the second part of the thesis, we focus on improving the performance of shared caches in multi-core systems. Last level caches are affected by a lack of temporal locality in the access stream, as the locality gets filtered out by the caches above them. In the case of multi-cores, the interleaving of accesses from the various cores further adds to the problem. 
To overcome this, we propose a PC-Centric Next-Use Aware Cache Organization (NUcache) for shared caches in multi-cores, with an ability to retain a subset of cache blocks longer. This is achieved by a logical partitioning of the associative ways of a cache set into Main Ways and Deli Ways. While all the blocks have access to the Main Ways, blocks that are likely to be accessed in the near future (with shorter Next-Use distance) are candidates to be retained longer in the Deli Ways to eliminate future misses. We make use of the fact that a small number of PCs, referred to as delinquent PCs, bring in a majority of the cache blocks and learn the Next-Use characteristic of blocks brought in by them. We propose an intelligent cost-benefit based PC-selection mechanism to identify the best set of delinquent PCs that should have access to the Deli Ways to maximize the cache hits. Performance evaluation reveals that NUcache improves the performance (in terms of Average Normalized Turnaround Time, ANTT) of multi-programmed workloads by 6.2%, 13.9%, 15.8% and 19.6% in dual, quad, eight and sixteen core machines respectively. NUcache also performs better than some of the state-of-the-art cache partitioning mechanisms. The last part of the thesis deals with effective shared cache management in multi-core systems to achieve various performance objectives. Explicitly controlling the shared cache occupancy of competing applications is a flexible and practical way to achieve a variety of high level performance goals. Existing solutions control cache occupancy at a coarser granularity, do not scale well to large core counts and, in some cases, lack the flexibility to support a variety of performance goals. To overcome this, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities. 
The proposed framework requires only simple hardware changes to implement, can scale to larger core counts and is flexible enough to support a variety of performance goals like hit-maximization, fairness and QoS. PriSM with Hit-Maximization improves the performance (of multi-programmed workloads) in terms of ANTT by 16.5%, 18.7% and 12.7% over baseline LRU in eight, sixteen and thirty two core machines respectively. Multi-Core Processors Single-Core Processors Prefetching Last Level Cache Management Shared Cache Management Next-Use Aware Cache Organization NUcache Last Level Cache (LLC) Computer Science
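The abstract above quantifies the regularity of a prefetcher's training stream with entropy. A minimal sketch of that kind of measurement follows, assuming the stream is summarized by the differences (deltas) between consecutive miss addresses. That choice of symbol, the function name and the sample addresses are assumptions for illustration; the abstract does not say how the thesis defines the symbols.

```python
from collections import Counter
from math import log2


def delta_entropy(addresses):
    """Shannon entropy (in bits) of the deltas between consecutive addresses.

    Lower entropy means a more regular, more predictable stream.
    """
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return 0.0
    total = len(deltas)
    counts = Counter(deltas)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# A constant-stride stream is perfectly regular.
strided = [0, 64, 128, 192, 256]
# A stream alternating between two strides needs one bit per delta.
alternating = [0, 64, 192, 256, 384]

print(delta_entropy(strided), delta_entropy(alternating))
```

Comparing the entropy of the primary-miss stream with that of a stream extended by secondary misses and accesses would then indicate which history is more regular.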
4	Realizing Low-Latency Internet Services via Low-Level Optimization of NFV Service Chains : Every nanosecond counts! Farshin, Alireza January 2019 (has links) By virtue of the recent technological developments in cloud computing, more applications are deployed in a cloud. Among these modern cloud-based applications, some require bounded and predictable low-latency responses. However, the current cloud infrastructure is unsuitable as it cannot satisfy these requirements, due to many limitations in both hardware and software. This licentiate thesis describes attempts to reduce the latency of Internet services by carefully studying the currently available infrastructure, optimizing it, and improving its performance. The focus is to optimize the performance of network functions deployed on commodity hardware, known as network function virtualization (NFV). The performance of NFV is one of the major sources of latency for Internet services. The first contribution is related to optimizing the software. This project began by investigating the possibility of superoptimizing virtualized network functions (VNFs). The investigation started with a literature review of available superoptimization techniques; then one of the state-of-the-art superoptimization tools was selected to analyze the crucial metrics affecting application performance. The result of our analysis demonstrated that having better cache metrics could potentially improve the performance of all applications. The second contribution of this thesis employs the results of the first part by taking a step toward optimizing cache performance of time-critical NFV service chains. By doing so, we reduced the tail latencies of such systems running at 100 Gbps. This is an important achievement as it increases the probability of realizing bounded and predictable latency for Internet services. / QC 20190415 / Time-Critical Clouds / ULTRA Low-latency Internet services Network Function Virtualization Low-level Optimization Superoptimization Last Level Cache Communication Systems
5	An Evaluation of Intel Cache Allocation Technology for Data-Intensive Applications / En utvärdering av Intel Cache Allocation Technology för dataintensiva applikationer Ihre Sherif, Alan January 2021 (has links) On certain CPUs in the Intel Xeon Scalable CPU family, the level three (L3) cache is shared among the CPU cores residing on the same CPU socket. This has the benefit that a larger and more scalable cache space is available to the CPU cores. However, when the L3 cache is shared between CPU cores, and thereby between the applications running on them, the applications can affect each other’s performance if some of them have high L3 cache usage. This can be particularly problematic if an application is over-utilizing the L3 cache and effectively evicting the data of other, higher-priority applications from the L3 cache. Such applications are called L3 cache noisy neighbors. The experiments in this thesis study the effect L3 cache noisy neighbors have on other, higher-priority applications and whether Intel Cache Allocation Technology (CAT) can be used to limit the performance impact of the noisy neighbors. Intel CAT provides functionality to control the amount of L3 cache allocated to a CPU core; by allocating less L3 cache to a noisy neighbor, the neighbor no longer shares as much L3 cache with the prioritized applications, which can then again utilize more of the L3 cache and regain their performance. The research question of this thesis is in which cases Intel CAT provides advantages and where it is a disadvantage to use it, studied for three commonly used applications: bzip2, Redis, and Graph500. All three applications were significantly impacted when running simultaneously with a noisy neighbor: for the Redis application there was a decrease of 49.2% in the number of ’GET’ requests per second that the Redis server could handle and an 18.2% decrease for ’SET’ requests. 
For the bzip2 and Graph500 applications, there was a 14.7% and 28.1% increase in execution time respectively. Intel CAT was successfully used to limit the impact of the noisy neighbor on the three applications. For the Redis application, the number of requests per second increased by 8.6% for the ’GET’ operation and by 4.2% for the ’SET’ operation. For the bzip2 and Graph500 applications, there was a 5.8% and 12.0% decrease in execution time respectively. Moreover, the thesis studies the scenario in which only prioritized applications are running and whether their performance can be increased by isolating the L3 cache for each one of them so that they cannot cause L3 cache evictions for each other. The use case of Intel CAT in such a scenario is not as clear as when mitigating the impact of a noisy neighbor, but some performance benefits can be observed when running multiple Redis instances on the same machine and isolating some of the L3 cache available to them. L3 cache Last Level Cache Intel Cache Allocation Technology L3 Cache allocation L3 cache performance improvement Redis performance Graph500 performance bzip2 performance Computer Sciences
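As a rough illustration of how an L3 allocation like the one evaluated in this thesis is expressed, Intel CAT describes each class of service with a capacity bitmask whose set bits must be contiguous. The sketch below builds two non-overlapping masks, one for a noisy neighbor and one for the prioritized applications. The function names, the total of 11 ways and the 2-way share for the noisy neighbor are assumptions for illustration and are not the configuration used in the thesis.

```python
def contiguous_mask(start_bit, width):
    """Bitmask with `width` contiguous set bits starting at `start_bit`."""
    return ((1 << width) - 1) << start_bit


def split_ways(total_ways, noisy_ways):
    """Give the low `noisy_ways` bits to the noisy neighbor, the rest to prioritized apps."""
    if not 0 < noisy_ways < total_ways:
        raise ValueError("both classes need at least one way")
    noisy = contiguous_mask(0, noisy_ways)
    prioritized = contiguous_mask(noisy_ways, total_ways - noisy_ways)
    return noisy, prioritized


# Hypothetical 11-bit capacity mask; confine the noisy neighbor to 2 bits.
noisy_mask, prioritized_mask = split_ways(total_ways=11, noisy_ways=2)
print(hex(noisy_mask), hex(prioritized_mask))
```

Because the two masks do not overlap, the noisy neighbor cannot evict lines held in the portion reserved for the prioritized applications; overlapping masks are also permitted by CAT when partial sharing is wanted.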
