Global ETD Search

1	Unstructured Computations on Emerging Architectures Al Farhan, Mohammed 05 May 2019 (has links) This dissertation describes detailed performance engineering and optimization of an unstructured computational aerodynamics software system with irregular memory accesses on various multi- and many-core emerging high performance computing scalable architectures, which are expected to be the building blocks of energy-austere exascale systems, and on which algorithmic- and architecture-oriented optimizations are essential for achieving worthy performance. We investigate several state-of-the-practice shared-memory optimization techniques applied to key kernels for the important problem class of unstructured meshes. We illustrate for a broad spectrum of emerging microprocessor architectures as representatives of the compute units in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating-point numerics of the original code. While the linear algebraic kernels are bottlenecked by memory bandwidth for even modest numbers of hardware cores sharing a common address space, the edge-based loop kernels, which arise in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, are compute-intensive and effectively exploit contemporary multi- and many-core processing hardware. We therefore employ low- and high-level algorithmic- and architecture-specific code optimizations and tuning in light of thread- and data-level parallelism, with a focus on strong thread scaling at the node-level. Our approaches are based upon novel multi-level hierarchical workload distribution mechanisms of data across different compute units (from the address space down to the registers) within every hardware core. We analyze the demonstrated aerodynamics application on specific computing architectures to develop certain performance metrics and models to bespeak the upper and lower bounds of the performance. We present significant full application speedup relative to the baseline code, on a succession of many-core processor architectures, i.e., Intel Xeon Phi Knights Corner (5.0x) and Knights Landing (2.9x). In addition, the performance of Knights Landing outperforms, at significantly lower power consumption, Intel Xeon Skylake with nearly twofold speedup. These optimizations are expected to be of value for many other unstructured mesh partial differential equation-based scientific applications as multi- and many- core architecture evolves. Performance Optimizations Thread-level parallelism Data-level parallelism Unstructured Grids Computational Aerodynamics Intel Xeon Phi
2	Paralelizace ultrazvukových simulací pomocí akcelerátoru Intel Xeon Phi / Parallelisation of Ultrasound Simulations on Intel Xeon Phi Accelerator Vrbenský, Andrej January 2015 (has links) Nowadays, the simulation of ultrasound acoustic waves has a wide range of practical usage. As one of them we can name the simulation in realistic tissue media, which is successfully used in medicine. There are several software applications dedicated to perform such simulations. k-Wave is one of them. The computational difficulty of the simulation itself is very high, and this leaves a space to explore new speed-up methods. In this master's thesis, we proposed a way to speed-up the simulation based on parallelization using Intel Xeon Phi accelerator. The accelerator contains large amount of cores and an extra-wide vector unit, and therefore, is ideal for purpose of parallelization and vectorization. The implementation is using OpenMP version 4.0, which brings some new options such as explicit vectorization. Results were measured during extensive experiments.
3	Vysoce náročné aplikace na svazku karet Intel Xeon Phi / High Performance Applications on Intel Xeon Phi Cluster Kačurik, Tomáš January 2016 (has links) The main topic of this thesis is the implementation and subsequent optimization of high performance applications on a cluster of Intel Xeon Phi coprocessors. Using two approaches to solve the N-Body problem, the possibilities of the program execution on a cluster of processors, coprocessors or both device types have been demonstrated. Two particular versions of the N-Body problem have been chosen - the naive and Barnes-hut. Both problems have been implemented and optimized. For better comparison of the achieved results, we only considered achieved acceleration against single node runs using processors only. In the case of the naive version a 15-fold increase has been achieved when using combination of processors and coprocessors on 8 computational nodes. The performance in this case was 9 TFLOP/s. Based on the obtained results we concluded the advantages and disadvantages of the program execution in the distributed environments using processors, coprocessors or both.
4	Solving dense linear systems on accelerated multicore architectures / Résoudre des systèmes linéaires denses sur des architectures composées de processeurs multicœurs et d’accélerateurs Rémy, Adrien 08 July 2015 (has links) Dans cette thèse de doctorat, nous étudions des algorithmes et des implémentations pour accélérer la résolution de systèmes linéaires denses en utilisant des architectures composées de processeurs multicœurs et d'accélérateurs. Nous nous concentrons sur des méthodes basées sur la factorisation LU. Le développement de notre code s'est fait dans le contexte de la bibliothèque MAGMA. Tout d'abord nous étudions différents solveurs CPU/GPU hybrides basés sur la factorisation LU. Ceux-ci visent à réduire le surcoût de communication dû au pivotage. Le premier est basé sur une stratégie de pivotage dite "communication avoiding" (CALU) alors que le deuxième utilise un préconditionnement aléatoire du système original pour éviter de pivoter (RBT). Nous montrons que ces deux méthodes surpassent le solveur utilisant la factorisation LU avec pivotage partiel quand elles sont utilisées sur des architectures hybrides multicœurs/GPUs. Ensuite nous développons des solveurs utilisant des techniques de randomisation appliquées sur des architectures hybrides utilisant des GPU Nvidia ou des coprocesseurs Intel Xeon Phi. Avec cette méthode, nous pouvons éviter l'important surcoût du pivotage tout en restant stable numériquement dans la plupart des cas. L'architecture hautement parallèle de ces accélérateurs nous permet d'effectuer la randomisation de notre système linéaire à un coût de calcul très faible par rapport à la durée de la factorisation. Finalement, nous étudions l'impact d'accès mémoire non uniformes (NUMA) sur la résolution de systèmes linéaires denses en utilisant un algorithme de factorisation LU. En particulier, nous illustrons comment un placement approprié des processus légers et des données sur une architecture NUMA peut améliorer les performances pour la factorisation du panel et accélérer de manière conséquente la factorisation LU globale. Nous montrons comment ces placements peuvent améliorer les performances quand ils sont appliqués à des solveurs hybrides multicœurs/GPU. / In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems by using hybrid architectures with multicore processors and accelerators. We focus on methods based on the LU factorization and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first one is based on a communication avoiding strategy of pivoting (CALU) while the second uses a random preconditioning of the original system to avoid pivoting (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPUs architectures. We also present new solvers based on randomization for hybrid architectures for Nvidia GPU or Intel Xeon Phi coprocessor. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allow us to perform the randomization of our linear system at a very low computational cost compared to the time of the factorization. Finally we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular we illustrate how an appropriate placement of the threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization. We show how these placements can improve the performance when applied to hybrid multicore/GPU solvers. Systèmes linéaires denses Factorisation LU Bibliothèque MAGMA Calcul hybride multicœur/GPU Processeurs graphiques Intel Xeon Phi . ccNUMA Communication-avoiding Randomisation Placement des processus légers Dense linear systems LU factorization Dense linear algebra libraries MAGMA library Hybrid multicore/GPU computing Graphics process units Intel Xeon Phi . ccNUMA Communication-avoiding algorithms Randomization Thread placement
5	Accelerated Deep Learning using Intel Xeon Phi Viebke, André January 2015 (has links) Deep learning, a sub-topic of machine learning inspired by biology, have achieved wide attention in the industry and research community recently. State-of-the-art applications in the area of computer vision and speech recognition (among others) are built using deep learning algorithms. In contrast to traditional algorithms, where the developer fully instructs the application what to do, deep learning algorithms instead learn from experience when performing a task. However, for the algorithm to learn require training, which is a high computational challenge. High Performance Computing can help ease the burden through parallelization, thereby reducing the training time; this is essential to fully utilize the algorithms in practice. Numerous work targeting GPUs have investigated ways to speed up the training, less attention have been paid to the Intel Xeon Phi coprocessor. In this thesis we present a parallelized implementation of a Convolutional Neural Network (CNN), a deep learning architecture, and our proposed parallelization scheme, CHAOS. Additionally a theoretical analysis and a performance model discuss the algorithm in detail and allow for predictions if even more threads are available in the future. The algorithm is evaluated on an Intel Xeon Phi 7120p, Xeon E5-2695v2 2.4 GHz and Core i5 661 3.33 GHz using various architectures and thread counts on the MNIST dataset. Findings show a 103.5x, 99.9x, 100.4x speed up for the large, medium, and small architecture respectively for 244 threads compared to 1 thread on the coprocessor. Moreover, a 10.9x - 14.1x (large to small) speed up compared to the sequential version running on Xeon E5. We managed to decrease training time from 7 days on the Core i5 and 31 hours on the Xeon E5, to 3 hours on the Intel Xeon Phi when training our large network for 15 epochs Machine Learning Deep Learning Supervised Deep Learning Intel Xeon Phi Convolutional Neural Network CNN High Performance Computing CHAOS parallel computing coprocessor MIC speed up performance model evaluation Computer Sciences Datavetenskap (datalogi)

1

Page generated in 0.0809 seconds