Global ETD Search

1	Improving cache behavior in CMP architectures through cache partitioning techniques Moretó Planas, Miquel 19 March 2010 Microprocessor design has changed significantly over the last few decades, moving from simple in-order single-core architectures to superscalar and vector architectures in order to extract the maximum available instruction-level parallelism. Executing several instructions from the same thread in parallel can significantly improve the performance of an application. However, only a limited amount of parallelism is available in each thread because of data and control dependences. Furthermore, designing a high-performance, single, monolithic processor has become very complex due to power and on-chip latency constraints. These limitations have motivated the use of thread-level parallelism (TLP) as a common strategy for improving processor performance. Multithreaded processors execute different threads at the same time, sharing some hardware resources. There are several flavors of multithreaded processors that exploit TLP, such as chip multiprocessors (CMP), coarse-grain multithreading, fine-grain multithreading, simultaneous multithreading (SMT), and combinations of them. To improve cost and power efficiency, the computer industry has adopted multicore chips. In particular, CMP architectures have become the most common design decision (sometimes combined with multithreaded cores). Firstly, CMPs reduce design costs and average power consumption by promoting design re-use and simpler processor cores. For example, it is less complex to design a chip with many small, simple cores than a chip with fewer, larger, monolithic cores. Furthermore, simpler cores have less power-hungry centralized hardware structures. Secondly, CMPs reduce costs by improving hardware resource utilization. On a multicore chip, co-scheduled threads can share costly microarchitecture resources that would otherwise be underutilized. 
Higher resource utilization improves aggregate performance and enables lower-cost design alternatives. One of the resources with the greatest impact on the final performance of an application is the cache hierarchy. Caches store data recently used by applications in order to take advantage of their temporal and spatial locality. Caches provide fast access to data, improving the performance of applications. Caches with low latencies have to be small, which prompts the design of a cache hierarchy organized into several levels. In CMPs, the cache hierarchy is normally organized with a first level (L1) of instruction and data caches private to each core. A last level of cache (LLC) is normally shared among different cores in the processor (L2, L3 or both). Shared caches increase resource utilization and system performance. Large caches improve performance and efficiency by increasing the probability that each application can access data from a closer level of the cache hierarchy. Sharing also allows an application to make use of the entire cache if needed. A second advantage of having a shared cache in a CMP design has to do with cache coherency. In parallel applications, different threads share the same data and keep a local copy of this data in their cache. With multiple processors, it is possible for one processor to change the data, leaving another processor's cache with outdated data. A cache coherency protocol monitors changes to data and ensures that all processor caches have the most recent data. When the parallel application executes on the same physical chip, the cache coherency circuitry can operate at the speed of on-chip communication, rather than having to use the much slower between-chip communication required with discrete processors on separate chips. 
These coherence protocols are simpler to design with a unified, shared level of cache on chip. Due to the advantages that multicore architectures offer, chip vendors use CMP architectures in current high-performance, network, real-time and embedded systems. Several of these commercial processors have a level of the cache hierarchy shared by different cores. For example, the Sun UltraSPARC T2 has a 16-way 4MB L2 cache shared by 8 cores, each of them up to 8-way SMT. Other processors, like the Intel Core 2 family, also share up to a 12MB 24-way L2 cache. In contrast, the AMD K10 family has a private L2 cache per core and a shared L3 cache, with up to a 6MB 64-way L3 cache. As the long-term trend of increasing integration continues, the number of cores per chip is also projected to increase with each successive technology generation. Several studies project that processors with hundreds of cores per chip will appear in the market in the coming years. The manycore era has already begun. Although this era provides many opportunities, it also presents many challenges. In particular, higher hardware resource sharing among concurrently executing threads can cause individual threads' performance to become unpredictable and might lead to violations of the individual applications' performance requirements. Current resource management mechanisms and policies are no longer adequate for future multicore systems. Some applications, such as multimedia, communications or streaming applications, present low re-use of their data and pollute caches with data streams, or have many compulsory misses that cannot be removed by assigning more cache space to the application. 
Traditional eviction policies such as Least Recently Used (LRU), pseudo-LRU or random are demand-driven; that is, they tend to give more space to the application that has more accesses to the cache hierarchy. When no direct control over shared resources is exercised (the last-level cache in this case), a particular thread may be allocated most of the shared resources, degrading other threads' performance. As a consequence, high resource sharing and resource utilization can cause systems to become unstable and violate individual applications' requirements. If we want to provide Quality of Service (QoS) to applications, we need to enhance the control over shared resources and enrich the collaboration between the OS and the architecture. In this thesis, we propose software and hardware mechanisms to improve cache sharing in CMP architectures. We make use of a holistic approach, coordinating targets of software and hardware to improve system aggregate performance and provide QoS to applications. We make use of explicit resource allocation techniques to control the shared cache in a CMP architecture, with resource allocation targets driven by hardware and software mechanisms. The main contributions of this thesis are the following:
- We have characterized different single- and multithreaded applications and classified workloads with a systematic method to better understand and explain the cache sharing effects on a CMP architecture. We have made a special effort in studying previous cache partitioning techniques for CMP architectures, in order to acquire the insight to propose improved mechanisms.
- In CMP architectures with out-of-order processors, cache misses can be served in parallel and share the miss penalty to access main memory. We take this fact into account to propose new cache partitioning algorithms guided by the memory-level parallelism (MLP) of each application. 
With these algorithms, system performance is improved (in terms of throughput and fairness) without significantly increasing the hardware required by previous proposals.
- Driving cache partition decisions with indirect indicators of performance such as misses, MLP or data re-use may lead to suboptimal cache partitions. Ideally, the metric that drives cache partitions should be the target metric to optimize, which is normally related to IPC. Thus, we have developed a hardware mechanism, OPACU, which is able to obtain at run-time accurate predictions of the performance of an application when running with different cache assignments.
- Using performance predictions, we have introduced a new framework to manage shared caches in CMP architectures, FlexDCP, which allows the OS to optimize different IPC-related target metrics like throughput or fairness and provide QoS to applications. FlexDCP allows enhanced coordination between the hardware and the software layers, which leads to improved system performance and flexibility.
- Next, we have made use of performance estimations to reduce the load imbalance problem in parallel applications. We have built a run-time mechanism that detects parallel applications sensitive to cache allocation and, in these situations, reduces the load imbalance by assigning more cache space to the slowest threads. This mechanism helps reduce the long optimization time, measured in man-years of effort, devoted to large-scale parallel applications.
- Finally, we have stated the main characteristics that future multicore processors with thousands of cores should have. An enhanced coordination between the software and hardware layers has been proposed to better manage the shared resources in these architectures.
load balancing quality of service performance predictability cache partitioning shared cache CMP architectures 004
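The idea of explicitly allocating a shared cache among applications can be illustrated with a small sketch. The code below is a generic greedy marginal-utility way allocator, not the MLP-aware algorithms, OPACU, or FlexDCP described in the abstract. The function name `greedy_way_partition`, the per-application miss curves, and the numbers in them are all hypothetical and chosen only for illustration.

```python
def greedy_way_partition(miss_curves, total_ways):
    """Split the ways of a shared cache among applications.

    miss_curves[app][w] is the (assumed, made-up) number of misses that
    `app` suffers when it owns w ways, for w = 0 .. total_ways.
    Each application starts with one way; every remaining way goes to
    the application whose misses drop the most from one extra way.
    """
    if total_ways < len(miss_curves):
        raise ValueError("need at least one way per application")

    allocation = {app: 1 for app in miss_curves}
    for _ in range(total_ways - len(miss_curves)):
        # Marginal utility: misses saved by giving one more way to `app`.
        best = max(
            miss_curves,
            key=lambda app: miss_curves[app][allocation[app]]
            - miss_curves[app][allocation[app] + 1],
        )
        allocation[best] += 1
    return allocation


# Hypothetical workload: a streaming application that gains nothing from
# extra cache space, and an application with high data re-use.
curves = {
    "streaming": [100, 100, 100, 100, 100],
    "reuse": [100, 60, 30, 10, 5],
}
allocation = greedy_way_partition(curves, total_ways=4)
print(allocation)
```

In this toy workload the streaming application keeps only its minimum share, whereas a demand-driven policy such as LRU would tend to give it space in proportion to its accesses. Optimizing an IPC-related metric instead of misses, as the abstract argues, would require performance predictions in place of the miss curves.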
2	Squelettes algorithmiques méta-programmés : implantations, performances et sémantique / Metaprogrammed algorithmic skeletons : implementations, performances and semantics Javed, Noman 21 October 2011 Structured parallelism approaches are a trade-off between automatic parallelisation and concurrent and distributed programming as offered by MPI or Pthreads. Skeletal parallelism is one of these structured approaches. An algorithmic skeleton can be seen as a higher-order function that captures the pattern of a classic parallel algorithm, such as a pipeline or a parallel reduction. Often the sequential semantics of a skeleton is quite simple and corresponds to the usual semantics of similar higher-order functions in functional programming languages. The user constructs a parallel program by combining calls to the available skeletons. When designing a parallel program, parallel performance is of course important, so it is very useful for the programmer to be able to rely on a simple yet realistic performance model. Bulk Synchronous Parallelism (BSP) offers such a model. As parallelism can now be found everywhere, from smartphones to supercomputers, it becomes critical for parallel programming models to rest on formal semantics that support proving the correctness of the programs developed with them. The outcome of this work is the Orléans Skeleton Library, or OSL. OSL provides a set of data-parallel skeletons that follow the BSP model of parallel computation. OSL is a library for C++, currently implemented on top of MPI, that uses advanced C++ programming techniques to offer good efficiency. Because OSL is based on the BSP performance model, it is possible not only to predict the performance of OSL programs but also to provide performance portability. The programming model of OSL is formalized using a big-step semantics in the Coq proof assistant. Based on this formal model, the correctness of an example OSL program is proved. 
Algorithmic skeletons Bulk synchronous parallelism
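The view of a skeleton as a higher-order function, together with the BSP cost model, can be sketched as follows. This is a sequential Python illustration and not the OSL API, which is a C++ library on top of MPI. The names `skel_map`, `skel_reduce` and `bsp_superstep_cost` are hypothetical, the "distributed" data is simulated as a list of per-process blocks, and the machine parameters in the usage lines are made up. The cost function uses the standard BSP formula for one superstep: maximum local work, plus the maximum number of words sent or received times the per-word cost g, plus the synchronization cost l.

```python
from functools import reduce
from operator import add


def skel_map(f, blocks):
    """Map skeleton: apply f to every element of every local block.

    Needs no communication; each simulated process touches only its block.
    """
    return [[f(x) for x in block] for block in blocks]


def skel_reduce(op, blocks):
    """Reduce skeleton: reduce each block locally, then combine partials.

    `op` is assumed to be associative; at least one block must be non-empty.
    """
    partials = [reduce(op, block) for block in blocks if block]
    return reduce(op, partials)


def bsp_superstep_cost(work, h_relation, g, l):
    """Cost of one BSP superstep: max(w_i) + max(h_i) * g + l."""
    return max(work) + max(h_relation) * g + l


# Usage: sum of squares over data spread across three simulated processes.
data = [[1, 2], [3, 4], [5]]
squares = skel_map(lambda x: x * x, data)
total = skel_reduce(add, squares)
print(squares, total)

# Hypothetical machine parameters g = 4 and l = 10; each process does as
# many operations as it has elements and sends one partial result.
cost = bsp_superstep_cost(work=[2, 2, 1], h_relation=[1, 1, 1], g=4, l=10)
print(cost)
```

Because the cost depends only on the program's work and communication volume and on the machine parameters g and l, the same program can be costed for different machines, which is the sense in which a BSP-based library can offer performance prediction.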
3	Queue Streaming Model Theory, Algorithms, and Implementation Zope, Anup D 03 May 2019 In this work, a model of computation for shared memory parallelism is presented. To address fundamental constraints of modern memory systems, the presented model constrains how parallelism interacts with memory access patterns and in doing so provides a method for the design and analysis of algorithms that reliably estimates execution time based on a few architectural parameters. This model is presented as an alternative to modern thread-based models that focus on computational concurrency but rely on reactive hardware policies to hide and amortize memory latency. Since modern processors use reactive mechanisms and heuristics to deduce the data access requirements of computations, the memory access costs of these threaded programs may be difficult to predict reliably. This research presents the Queue Streaming Model (QSM), which aims to address these shortcomings by providing a prescriptive mechanism to achieve latency-amortized and predictable-cost data access. Further, the work presents the application of the QSM to algorithms commonly used in a number of applications. These algorithms include structured regular computations represented by merge sort, unstructured irregular computations represented by sparse matrix dense vector multiplication, and dynamic computations represented by MapReduce. The analysis of these algorithms reveals architectural tradeoffs between memory system bottlenecks and algorithm design. The techniques described in this dissertation reveal a general software approach that could be used to construct more general irregular applications, provided they can be transformed into a relational query form. The dissertation demonstrates that the QSM can be used to design algorithms that enhance utilization of memory system resources by structuring concurrency and memory accesses such that system bandwidths are balanced and latency is amortized. 
Finally, the benefit of applying the QSM algorithm to the Euler inviscid flow solver is demonstrated through experiments on the Intel(R) Xeon(R) E5-2680 v2 processor using ten cores. The transformation produced a speed-up of 25% over an optimized OpenMP implementation having identical computational structure.
queue streaming model bridging model of computation streaming access
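The streaming access pattern that the abstract associates with merge sort can be illustrated with a small sketch. This is only an illustration of strictly sequential, queue-like reads and writes, and it makes no claim to reproduce the formal definition of the QSM, its cost analysis, or the dissertation's implementation. The names `merge_streams` and `streaming_merge_sort` are hypothetical, and Python iterators stand in for the input queues.

```python
def merge_streams(left, right):
    """Merge two sorted inputs, reading each strictly front to back.

    Each input is consumed like a queue: only its current head is
    inspected, and no element is ever revisited.
    """
    left_it, right_it = iter(left), iter(right)
    done = object()  # marks an exhausted input
    a, b = next(left_it, done), next(right_it, done)
    while a is not done and b is not done:
        if a <= b:
            yield a
            a = next(left_it, done)
        else:
            yield b
            b = next(right_it, done)
    while a is not done:  # drain whatever remains on the left
        yield a
        a = next(left_it, done)
    while b is not done:  # drain whatever remains on the right
        yield b
        b = next(right_it, done)


def streaming_merge_sort(values):
    """Bottom-up merge sort built only from sequential merge passes."""
    runs = [[v] for v in values]
    while len(runs) > 1:
        merged = [
            list(merge_streams(runs[i], runs[i + 1]))
            for i in range(0, len(runs) - 1, 2)
        ]
        if len(runs) % 2:  # an odd run out is carried to the next pass
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []


result = streaming_merge_sort([5, 2, 9, 1, 5, 6])
print(result)
```

Every pass reads its inputs and writes its output in order, so the memory traffic is a set of sequential streams whose volume per pass is known in advance. That regularity is the kind of property that makes access cost easier to predict than with data-dependent random accesses.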
