1 |
PERFORMANCE EVALUATION AND OPTIMIZATION OF THE UNSTRUCTURED CFD CODE UNCLE
Gupta, Saurabh 01 January 2006
Numerous advancements in the computational sciences have made CFD a viable approach to modern fluid dynamics problems. Progress in computer performance allows a complex flow field to be solved in practical CPU time. Commodity clusters are also gaining popularity as a computational research platform for various CFD communities. This research focuses on evaluating and enhancing the performance of an in-house, unstructured, 3D CFD code on modern commodity clusters. The fundamental idea is to tune the code to optimize cache behavior on the nodes of commodity clusters and thereby improve code performance. Accordingly, this work discusses the available techniques for data access optimization and describes in detail those that yielded improved code performance. These techniques were tested on various steady, unsteady, laminar, and turbulent test cases, and the results are presented. The critical hardware parameters that influenced code performance were identified, and a detailed study of their effect on code performance was conducted. The successful single-node improvements were also tested on a parallel platform. The modified version of the code was also ported to different hardware architectures with successful results. Loop blocking is established as a predictor of code performance.
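The abstract names loop blocking but does not show it. The following is a minimal, generic sketch of the idea on a matrix transpose; it is not taken from UNCLE, and the function names and the `block` parameter are assumptions made for illustration. Python only shows the change in traversal order here; the cache benefit itself is what a compiled code would see.

```python
def transpose_naive(a, n):
    """Transpose an n x n matrix stored row-major in a flat list."""
    out = [0] * (n * n)
    for i in range(n):
        for j in range(n):
            # Reads walk along a row; writes jump n elements at a time.
            out[j * n + i] = a[i * n + j]
    return out


def transpose_blocked(a, n, block=4):
    """Same result, but the matrix is visited one block x block tile at a time.

    Within a tile, both the reads and the writes stay inside a small
    region of memory, which is the point of loop blocking.
    """
    out = [0] * (n * n)
    for ii in range(0, n, block):          # tile row start
        for jj in range(0, n, block):      # tile column start
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out
```

In a compiled implementation the tile size would be chosen so that one tile of the source and one tile of the destination fit in cache together; the value 4 above is arbitrary.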
|
2 |
Cache Characterization and Performance Studies Using Locality Surfaces
Sorenson, Elizabeth Schreiner 14 July 2005
Today's processors commonly use caches to help overcome the disparity between processor and main memory speeds. Due to the principle of locality, most of the processor's requests for data are satisfied by the fast cache memory, resulting in a significant performance improvement. Methods for evaluating workloads and caches in terms of locality are valuable for cache design. In this dissertation, we present a locality surface which displays both temporal and spatial locality on one three-dimensional graph. We provide a solid mathematical description of locality data and equations for visualization. We then use the locality surface to examine the locality of a variety of workloads from the SPEC CPU 2000 benchmark suite. These surfaces contain a number of features that represent sequential runs, loops, temporal locality, striding, and other patterns from the input trace. The locality surface can also be used to evaluate methodologies that involve locality. For example, we evaluate six synthetic trace generation methods and find that none of them accurately reproduces an original trace's locality. We then combine a mathematical description of caches with our locality definition to create cache characterization surfaces. These new surfaces visually relate how references with varying degrees of locality function in a given cache. We examine how varying the cache size, line size, and associativity affects a cache's response to different types of locality. We formally prove that the locality surface can predict the miss rate in some types of caches. Our locality surface matches well with cache simulation results, particularly for caches with large associativities. We can qualitatively choose prudent values for cache and line size. Further, the locality surface can predict the miss rate with 100% accuracy for some fully associative caches and with some error for set associative caches. One drawback to the locality surface is the time intensity of the stack-based algorithm.
We provide a new parallel algorithm that reduces the computation time significantly. With this improvement, the locality surface becomes a viable and valuable tool for characterizing workloads and caches, predicting cache simulation results, and evaluating any procedure involving locality.
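To illustrate why a stack-based locality measure can predict the miss rate of a fully associative cache exactly, here is a minimal sketch of the classic LRU stack-distance computation. It assumes LRU replacement and is not the dissertation's own definition or its parallel algorithm; the function names and parameters are hypothetical.

```python
def stack_distances(trace, line_size=1):
    """Return the LRU stack distance of each reference.

    The distance is the number of distinct cache lines touched since the
    previous reference to the same line; a first touch yields None.
    """
    stack = []       # most recently used line is at the end
    distances = []
    for addr in trace:
        line = addr // line_size
        if line in stack:
            pos = stack.index(line)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(line)
    return distances


def lru_miss_rate(trace, num_lines, line_size=1):
    """Miss rate of a fully associative LRU cache holding num_lines lines.

    A reference hits exactly when its stack distance is below num_lines.
    """
    d = stack_distances(trace, line_size)
    misses = sum(1 for x in d if x is None or x >= num_lines)
    return misses / len(d)
```

This naive version searches the whole stack on every reference, which hints at the time cost the abstract mentions. One pass over the trace is enough to read off the miss rate for every cache size at once.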
|
3 |
Experiments with the Pentium Performance Monitoring Counters
Agarwal, Gunjan 06 1900
Performance monitoring counters are implemented in most recent microprocessors. In this thesis, we describe various performance measurement experiments for a program and a system that we conducted on a Linux operating system using the Pentium performance counters. We carried out our performance measurements on a Pentium II microprocessor. The Pentium II performance counters can be configured to count events such as cache misses, TLB misses, instructions executed etc. We used a low intrusive overhead technique to access these performance counters.
We used these performance counters to measure the cache miss overheads due to context switches in a Linux system. Our methodology involves sampling the hardware counters every 50ps. The sampling was set up using signals related to interval timers. We describe an analytical cache performance model under multiprogrammed conditions from the literature and validate it using the performance monitoring counters.
We next explore the long-term performance of a system under different workload conditions. Various performance monitoring events - data cache h, data TLB misses, data cache reads or writes, branches etc. - are monitored over a 24-hour period. This is useful in identifying activities which cause loss of system performance. We used timer interrupts for sampling the performance counters.
We develop a profiling methodology that gives a view of the performance of the different functions of a program, based not only on execution time but also on data cache misses. Available tools like prof on Unix can be used to pinpoint the regions of performance loss in programs, but they rely mainly on execution-time profiles, which give no insight into the program's cache performance problems. Our methodology therefore reports the performance of each function of the program in terms of both its execution time and its cache behavior.
|
4 |
An Evaluation of Intel Cache Allocation Technology for Data-Intensive Applications / En utvärdering av Intel Cache Allocation Technology för dataintensiva applikationer
Ihre Sherif, Alan January 2021
On certain CPUs in the Intel Xeon Scalable family, the level three (L3) cache is shared among the CPU cores residing on the same CPU socket. This has the benefit that a larger and more scalable cache space is available to the CPU cores. However, because the L3 cache is shared between CPU cores, and therefore between the applications running on them, applications can affect each other's performance if some of them have high L3 cache usage. This is particularly problematic if an application over-utilizes the L3 cache and effectively evicts the data of other, more prioritized applications from it. Such applications are called L3 cache noisy neighbors. The experiments in this thesis study the effect L3 cache noisy neighbors have on other, more prioritized applications and whether Intel Cache Allocation Technology (CAT) can be used to limit the performance impact of the noisy neighbors. Intel CAT provides functionality to control the amount of L3 cache allocated to a CPU core. By allocating less L3 cache to a noisy neighbor, the neighbor no longer shares as much L3 cache with the prioritized applications, which can then use more of the L3 cache and regain their performance. The research question of this thesis is in which cases Intel CAT provides advantages and in which cases it is a disadvantage, studied through its use with three commonly used applications: bzip2, Redis, and Graph500. All three applications were significantly impacted when running simultaneously with a noisy neighbor. For the Redis application, the number of ’GET’ requests per second that the Redis server could handle decreased by 49.2%, and ’SET’ requests decreased by 18.2%. For the bzip2 and Graph500 applications, there was a 14.7% and 28.1% increase in execution time respectively. Intel CAT was successfully used to limit the impact of the noisy neighbor on the three applications.
For the Redis application, the number of requests per second increased by 8.6% for the ’GET’ operation and by 4.2% for the ’SET’ operation. For the bzip2 and Graph500 applications, there was a 5.8% and 12.0% decrease in execution time respectively. Moreover, the thesis studies the scenario in which only prioritized applications are running, and whether their performance can be increased by isolating the L3 cache for each one of them so that they cannot cause L3 cache evictions for each other. The use case for Intel CAT in such a scenario is not as clear as when mitigating the impact of a noisy neighbor, but some performance benefits can be observed when running multiple Redis instances on the same machine and isolating some of the L3 cache available to them. / For certain processors in the Intel Xeon Scalable family, the third-level cache (L3 cache) is shared between the CPU cores located on the same CPU socket. This has the advantage that a larger and more scalable cache space becomes available to the CPU cores. However, sharing the L3 cache between cores means that the applications running there can affect each other's performance if one of them over-utilizes the L3 cache. When an application over-utilizes the L3 cache, data from other applications, which may be more prioritized, no longer fits in the cache. Such applications are called "L3 cache noisy neighbors". The experiments in this study examine the effects of L3 cache noisy neighbors on more prioritized applications and whether Intel Cache Allocation Technology (CAT) can be used to limit the impact that L3 cache noisy neighbors have. Intel CAT has functionality to control the amount of L3 cache allocated to a CPU core, and by allocating less L3 cache to a noisy neighbor, it no longer shares as much L3 cache with the prioritized applications, which can thereby regain their performance.
The research question of this study is to investigate in which use cases Intel CAT has advantages and when it is a disadvantage to use it, by studying its use with three widely used applications: bzip2, Redis, and Graph500. The performance of all three applications was clearly affected when they ran at the same time as a noisy neighbor, and Intel CAT could be used to reduce that impact. For Redis, the number of requests handled increased by 8.6% for GET operations and 4.2% for SET operations. For bzip2 and Graph500, a decrease in execution time of 5.8% and 12.0% respectively was observed. This thesis also examines the scenario in which only prioritized applications are running, and whether their performance can be increased by isolating L3 cache for each of them so that they do not take space from each other in the L3 cache. When Intel CAT is used in such a scenario, the benefits are not as clear as when limiting the impact of a noisy neighbor, but some improvement in performance can be observed when several Redis servers run on the same machine and part of the L3 cache is isolated for each of them.
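Intel CAT expresses each allocation as a capacity bitmask over the L3 cache ways, and on Intel hardware the set bits of a mask must be contiguous. The sketch below shows one way to build two non-overlapping masks, a small one for a noisy neighbor and the remainder for prioritized applications. The helper names and the way counts are assumptions for illustration, not the configuration used in the thesis; the output format follows the `L3:<cache id>=<mask>` entries used by Linux resctrl schemata files.

```python
def contiguous_mask(start_way, num_ways):
    """Bitmask with num_ways consecutive bits set, starting at bit start_way."""
    if num_ways < 1:
        raise ValueError("a capacity bitmask needs at least one way")
    return ((1 << num_ways) - 1) << start_way


def partition_ways(total_ways, noisy_ways):
    """Split the L3 ways into a noisy-neighbor mask and a prioritized mask.

    The two masks do not overlap, so the noisy neighbor cannot evict
    lines that the prioritized applications placed in their own ways.
    """
    if not 0 < noisy_ways < total_ways:
        raise ValueError("noisy_ways must leave at least one way for each side")
    noisy = contiguous_mask(0, noisy_ways)
    prioritized = contiguous_mask(noisy_ways, total_ways - noisy_ways)
    return noisy, prioritized


def schemata_line(cache_id, mask):
    """Format a mask as an L3 entry in the style of a resctrl schemata file."""
    return f"L3:{cache_id}={mask:x}"
```

For a hypothetical 11-way L3 cache, `partition_ways(11, 2)` gives the noisy neighbor the two lowest ways and the prioritized applications the other nine. The number of ways and the minimum mask width vary by CPU model, so real values should be read from the system rather than hard-coded.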
|