Spelling suggestions: "subject:"cnumad"" "subject:"ebinuma""
1 |
vNUMA: Virtual shared-memory multiprocessorsChapman, Matthew, Computer Science & Engineering, Faculty of Engineering, UNSW January 2009 (has links)
Shared memory systems, such as SMP and ccNUMA topologies, simplify programming and administration. On the other hand, systems without hardware support for shared memory, such as clusters of commodity workstations, are commonly used due to cost and flexibility considerations. In this thesis, virtualisation is proposed as a technique that can bridge the gap between these architectures. The resulting system, vNUMA, is a hypervisor with a unique feature: it provides the illusion of shared memory across separate nodes on a fast network. This allows a cluster of workstations to be transformed into a single shared memory multiprocessor, supporting existing operating systems and applications. Such an approach could also have applications for emerging highly-parallel architectures, allowing a shared memory programming model to be retained while reducing hardware complexity. To build such a system, it is necessary to meld both a high-performance hypervisor and a high-performance distributed shared memory (DSM) system. This thesis addresses the challenges inherent in both of these tasks. First, designing an efficient hypervisor layer is considered; since vNUMA is implemented on the Itanium processor architecture, this is with particular reference to Itanium processor virtualisation. Then, novel DSM protocols are developed that allow SMP consistency models to be reproduced while providing better performance than a simple atomically-consistent DSM system. Finally, the system is evaluated, proving that it can provide good performance and compelling advantages for a variety of applications.
|
2 |
vNUMA: Virtual shared-memory multiprocessorsChapman, Matthew, Computer Science & Engineering, Faculty of Engineering, UNSW January 2009 (has links)
Shared memory systems, such as SMP and ccNUMA topologies, simplify programming and administration. On the other hand, systems without hardware support for shared memory, such as clusters of commodity workstations, are commonly used due to cost and flexibility considerations. In this thesis, virtualisation is proposed as a technique that can bridge the gap between these architectures. The resulting system, vNUMA, is a hypervisor with a unique feature: it provides the illusion of shared memory across separate nodes on a fast network. This allows a cluster of workstations to be transformed into a single shared memory multiprocessor, supporting existing operating systems and applications. Such an approach could also have applications for emerging highly-parallel architectures, allowing a shared memory programming model to be retained while reducing hardware complexity. To build such a system, it is necessary to meld both a high-performance hypervisor and a high-performance distributed shared memory (DSM) system. This thesis addresses the challenges inherent in both of these tasks. First, designing an efficient hypervisor layer is considered; since vNUMA is implemented on the Itanium processor architecture, this is with particular reference to Itanium processor virtualisation. Then, novel DSM protocols are developed that allow SMP consistency models to be reproduced while providing better performance than a simple atomically-consistent DSM system. Finally, the system is evaluated, proving that it can provide good performance and compelling advantages for a variety of applications.
|
3 |
Solving dense linear systems on accelerated multicore architectures / Résoudre des systèmes linéaires denses sur des architectures composées de processeurs multicœurs et d’accélerateursRémy, Adrien 08 July 2015 (has links)
Dans cette thèse de doctorat, nous étudions des algorithmes et des implémentations pour accélérer la résolution de systèmes linéaires denses en utilisant des architectures composées de processeurs multicœurs et d'accélérateurs. Nous nous concentrons sur des méthodes basées sur la factorisation LU. Le développement de notre code s'est fait dans le contexte de la bibliothèque MAGMA. Tout d'abord nous étudions différents solveurs CPU/GPU hybrides basés sur la factorisation LU. Ceux-ci visent à réduire le surcoût de communication dû au pivotage. Le premier est basé sur une stratégie de pivotage dite "communication avoiding" (CALU) alors que le deuxième utilise un préconditionnement aléatoire du système original pour éviter de pivoter (RBT). Nous montrons que ces deux méthodes surpassent le solveur utilisant la factorisation LU avec pivotage partiel quand elles sont utilisées sur des architectures hybrides multicœurs/GPUs. Ensuite nous développons des solveurs utilisant des techniques de randomisation appliquées sur des architectures hybrides utilisant des GPU Nvidia ou des coprocesseurs Intel Xeon Phi. Avec cette méthode, nous pouvons éviter l'important surcoût du pivotage tout en restant stable numériquement dans la plupart des cas. L'architecture hautement parallèle de ces accélérateurs nous permet d'effectuer la randomisation de notre système linéaire à un coût de calcul très faible par rapport à la durée de la factorisation. Finalement, nous étudions l'impact d'accès mémoire non uniformes (NUMA) sur la résolution de systèmes linéaires denses en utilisant un algorithme de factorisation LU. En particulier, nous illustrons comment un placement approprié des processus légers et des données sur une architecture NUMA peut améliorer les performances pour la factorisation du panel et accélérer de manière conséquente la factorisation LU globale. Nous montrons comment ces placements peuvent améliorer les performances quand ils sont appliqués à des solveurs hybrides multicœurs/GPU. / In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems by using hybrid architectures with multicore processors and accelerators. We focus on methods based on the LU factorization and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first one is based on a communication avoiding strategy of pivoting (CALU) while the second uses a random preconditioning of the original system to avoid pivoting (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPUs architectures. We also present new solvers based on randomization for hybrid architectures for Nvidia GPU or Intel Xeon Phi coprocessor. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allow us to perform the randomization of our linear system at a very low computational cost compared to the time of the factorization. Finally we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular we illustrate how an appropriate placement of the threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can improve the performance when applied to hybrid multicore/GPU solvers.
|
4 |
Efficient techniques to provide scalability for token-based cache coherence protocolsCuesta Sáez, Blas Antonio 17 July 2009 (has links)
Cache coherence protocols based on tokens can provide low latency without relying on non-scalable interconnects thanks to the use of efficient requests that are unordered. However, when these unordered requests contend for the same memory block, they may cause protocols races. To resolve the races and ensure
the completion of all the cache misses, token protocols use a starvation prevention mechanism that is inefficient and non-scalable in terms of required storage structures and generated traffic. Besides, token protocols use
non-silent invalidations which increase the latency of write misses proportionally to the system size. All these problems make token protocols non-scalable.
To overcome the main problems of token protocols and increase their scalability, we propose a new starvation prevention mechanism named Priority Requests. This mechanism resolves contention by an efficient, elegant, and flexible method based on ordered requests. Furthermore, thanks to Priority Requests, efficient
techniques can be applied to limit the storage requirements of the starvation prevention mechanism, to reduce the total traffic generated for managing protocol races, and to reduce the latency of write misses. Thus, the main problems of token protocols can be solved, which, in turn, contributes to wide their efficiency and scalability. / Cuesta Sáez, BA. (2009). Efficient techniques to provide scalability for token-based cache coherence protocols [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/6024
|
Page generated in 0.0337 seconds