Global ETD Search

81	Efektivní komunikace v multi-GPU systémech / Efficient Communication in Multi-GPU Systems Špeťko, Matej January 2018 (has links) After Nvidia introduced CUDA, GPUs became devices capable of accelerating general-purpose computation. GPUs are designed as parallel processors that possess huge computational power. Modern supercomputers are often equipped with GPU accelerators. Sometimes the performance or the memory capacity of a single GPU is not enough for a scientific application, and the application needs to be scaled across multiple GPUs. During the computation the GPUs need to exchange partial results. This communication adds overhead to the computation. For this reason it is important to research methods of efficient communication between GPUs: less CPU involvement, lower latency, and shared system buffers. Both inter-node and intra-node communication are examined. The main focus is on GPUDirect technologies from Nvidia and CUDA-Aware MPI. Subsequently, the k-Wave toolbox for simulating the propagation of acoustic waves is introduced. This application is accelerated using CUDA-Aware MPI.
82	VAMPIR: Visualization and Analysis of MPI Resources Nagel, Wolfgang E., Arnold, Alfred, Weber, Michael, Hoppe, Hans-Christian, Solchenbach, Karl 04 February 2010 (has links) Performance analysis is most often based on detailed knowledge of program behavior. One way to obtain this information is tracing. Based on the research tool PARvis, the visualization environment VAMPIR was developed at KFA and now supports the new message passing standard MPI. VAMPIR translates a given trace file into a variety of graphical views, e.g., state diagrams, activity charts, time-line displays, and statistics. Moreover, it supports an animation mode that can help to locate performance bottlenecks, and it provides flexible filter operations to reduce the amount of information displayed. The most interesting part of VAMPIR is the powerful zooming feature, which allows problems to be identified at any level of detail. info:eu-repo/classification/ddc/004 ddc:004 MPI, visualization MPI, Visualisation
83	Designing optimized MPI+NCCL hybrid collective communication routines for dense many-GPU clusters Senthil Kumar, Nithin 04 October 2021 (has links) No description available. Computer Science MPI NCCL NVIDIA Collective Communications Library CUDA-aware MPI MVAPICH2-GDR MVAPICH2
84	Improving Performance And Programmer Productivity For I/o-intensive High Performance Computing Applications Sehrish, Saba 01 January 2010 (has links) Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. HPC applications are generating and processing large amounts of data, ranging from terabytes (TB) to petabytes (PB). This growth in data for HPC applications raises the question of which parallel programming framework is appropriate for efficiently processing large data sets. In this work, we study the applicability of two programming models (MPI/MPI-IO and MapReduce) to a variety of I/O-intensive HPC applications ranging from simulations to analytics. We identify several limitations of these existing programming models, in both performance and programmer productivity, when they are used for I/O-intensive applications. We propose new frameworks which will improve both performance and programmer productivity for the emerging I/O-intensive applications. Message Passing Interface (MPI) is widely used for writing HPC applications. MPI/MPI-IO allows fine-grained control over data assignment and task distribution. At the programming framework level, various optimizations have been proposed to improve the performance of MPI/MPI-IO function calls. These performance optimizations are exposed to programmers as various function options. To write efficient code, programmers must know the exact usage of the optimization functions, which limits programmer productivity. We propose an abstraction called Reduced Function Set Abstraction (RFSA) for MPI-IO to reduce the number of I/O functions and provide methods to automate the selection of the appropriate I/O function for writing HPC simulation applications. 
The purpose of RFSA is to hide the performance optimization functions from the application developer, and relieve the application developer from deciding on a specific function. The proposed set of functions relies on a selection algorithm to decide among the most common optimizations provided by MPI-IO. Additionally, many application scientists are looking to integrate data-intensive computing into computational-intensive High Performance Computing facilities, particularly for data analytics. We have observed several scientific applications which must migrate their data from an HPC storage system to a data-intensive one. There is a gap between the data semantics of HPC storage and data-intensive system, hence, once migrated, the data must be further refined and reorganized. This reorganization must be performed before existing data-intensive tools such as MapReduce can be effectively used to analyze data. This reorganization requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application, in the form of excessive I/O operations. For every MapReduce application that must be run in order to complete the desired data analysis, a distributed read and write operation on the file system must be performed. Our contribution is to extend Map-Reduce to eliminate the multiple scans and also reduce the number of pre-processing MapReduce programs. We have added additional expressiveness to the MapReduce language in our novel framework called MapReduce with Access Patterns (MRAP), which allows users to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data pre-processing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system. 
We also provide a scheduling mechanism to further improve the performance of these applications. The main contributions of this thesis are: 1) We implement a selection algorithm for I/O functions like read/write, merge a set of functions for data types and file views, and optimize the atomicity function by automating the locking mechanism in RFSA. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show improved programmer productivity (35.7% on average). This approach incurs an overhead of 2-5% for one particular optimization, and shows a performance improvement of 17% when an application requires a combination of different optimizations. 2) We provide an augmented MapReduce system (MRAP), which consists of an API and corresponding optimizations, i.e., data restructuring and scheduling. We have demonstrated up to 33% throughput improvement in one real application (read-mapping in bioinformatics), and up to 70% in an I/O kernel of another application (halo catalogs analytics). Our scheduling scheme shows a performance improvement of 18% for an I/O kernel of another application (QCD analytics). Data-intensive computing High Performance computing MapReduce MPI/MPI-IO Computer Engineering Engineering
85	PFFT - An Extension of FFTW to Massively Parallel Architectures Pippig, Michael 12 July 2012 (has links) (PDF) We present an MPI-based software library for computing fast Fourier transforms on massively parallel, distributed memory architectures. Similar to established transpose FFT algorithms, we propose a parallel FFT framework that is based on a combination of local FFTs, local data permutations and global data transpositions. This framework can be generalized to arbitrary multi-dimensional data and process meshes. All performance-relevant building blocks can be implemented with the help of the FFTW software library. Therefore, our library offers great flexibility and portable performance. Like FFTW, we are able to compute FFTs of complex data, real data and even- or odd-symmetric real data. All the transforms can be performed completely in place. Furthermore, we propose an algorithm to calculate pruned FFTs more efficiently on distributed memory architectures. For example, we provide performance measurements of FFTs of size 512^3 and 1024^3 on up to 262144 cores of a BlueGene/P architecture. parallel Fourier-Transformation MPI FFT parallel Fourier transform MPI FFT 65T50 65Y05 ddc:004 ddc:518 Schnelle Fourier-Transformation MPI <Schnittstelle> Parallelverarbeitung
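The idea behind transpose FFT algorithms mentioned in the abstract above can be illustrated with a minimal single-process sketch. It is not the PFFT or FFTW API: the function names (`dft`, `transpose`, `fft2_by_transpose`, `dft2_direct`) are hypothetical, a naive O(n^2) DFT stands in for the local FFT, and an in-memory transpose stands in for the global data transposition that a distributed library would perform with MPI communication.

```python
import cmath


def dft(x):
    """Naive 1-D discrete Fourier transform (stand-in for a local FFT)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def transpose(m):
    """Swap rows and columns (stand-in for the global data transposition)."""
    return [list(row) for row in zip(*m)]


def fft2_by_transpose(a):
    """2-D transform built from 1-D transforms plus transpositions."""
    rows = [dft(r) for r in a]        # transform along the second dimension
    cols = transpose(rows)            # bring the first dimension into rows
    cols = [dft(c) for c in cols]     # transform along the first dimension
    return transpose(cols)            # restore the original layout


def dft2_direct(a):
    """Direct 2-D DFT from the definition, used only as a reference."""
    n1, n2 = len(a), len(a[0])
    return [
        [
            sum(
                a[k1][k2]
                * cmath.exp(-2j * cmath.pi * (j1 * k1 / n1 + j2 * k2 / n2))
                for k1 in range(n1)
                for k2 in range(n2)
            )
            for j2 in range(n2)
        ]
        for j1 in range(n1)
    ]
```

In a distributed setting each process would hold only some of the rows, so the row-wise transforms need no communication and the transpose becomes the only step that moves data between processes.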
86	Power-Constrained Supercomputing Bailey, Peter E. January 2015 (has links) As we approach exascale systems, power is turning from an optimization goal to a critical operating constraint. With power bounds imposed by both stakeholders and the limitations of existing infrastructure, achieving practical exascale computing will therefore rely on optimizing performance subject to a power constraint. However, this requirement should not add to the burden of application developers; optimizing the runtime environment given restricted power will primarily be the job of high-performance system software. In this dissertation, we explore this area and develop new techniques that extract maximum performance subject to a particular power constraint. These techniques include a method to find theoretical optimal performance, a runtime system that shifts power in real time to improve performance, and a node-level prediction model for selecting power-efficient operating points. We use a linear programming (LP) formulation to optimize application schedules under various power constraints, where a schedule consists of a DVFS state and number of OpenMP threads for each section of computation between consecutive message passing events. We also provide a more flexible mixed integer-linear (ILP) formulation and show that the resulting schedules closely match schedules from the LP formulation. Across four applications, we use our LP-derived upper bounds to show that current approaches trail optimal, power-constrained performance by up to 41%. This demonstrates limitations of current systems, and our LP formulation provides future optimization approaches with a quantitative optimization target. We also introduce Conductor, a run-time system that intelligently distributes available power to nodes and cores to improve performance. The key techniques used are configuration space exploration and adaptive power balancing. 
Configuration exploration dynamically selects the optimal thread concurrency level and DVFS state subject to a hardware-enforced power bound. Adaptive power balancing efficiently predicts where critical paths are likely to occur and distributes power to those paths. Greater power, in turn, allows increased thread concurrency levels, CPU frequency/voltage, or both. We describe these techniques in detail and show that, compared to the state-of-the-art technique of using statically predetermined, per-node power caps, Conductor leads to a best-case performance improvement of up to 30%, and an average improvement of 19.1%. At the node level, an accurate power/performance model will aid in selecting the right configuration from a large set of available configurations. We present a novel approach to generate such a model offline using kernel clustering and multivariate linear regression. Our model requires only two iterations to select a configuration, which provides a significant advantage over exhaustive search-based strategies. We apply our model to predict power and performance for different applications using arbitrary configurations, and show that our model, when used with hardware frequency-limiting in a runtime system, selects configurations with significantly higher performance at a given power limit than those chosen by frequency-limiting alone. When applied to a set of 36 computational kernels from a range of applications, our model accurately predicts power and performance; our runtime system based on the model maintains 91% of optimal performance while meeting power constraints 88% of the time. When the runtime system violates a power constraint, it exceeds the constraint by only 6% in the average case, while simultaneously achieving 54% more performance than an oracle. Through the combination of the above contributions, we hope to provide guidance and inspiration to research practitioners working on runtime systems for power-constrained environments. 
We also hope this dissertation will draw attention to the need for software and runtime-controlled power management under power constraints at various levels, from the processor level to the cluster level. exascale MPI parallel power constraint supercomputer Computer Science energy
87	A Case Study of Semi-Automatic Parallelization of Divide and Conquer Algorithms Using Invasive Interactive Parallelization Hansson, Erik January 2009 (has links) Since computers supporting parallel execution have become more and more common in recent years, especially on the consumer market, the need for methods and tools for parallelizing existing sequential programs has increased greatly. Today there exist different methods of achieving this, in a more or less user-friendly way. We have looked at one method, Invasive Interactive Parallelization (IIP), in a specific problem area, divide and conquer algorithms, and performed a case study. This case study shows that by using IIP, sequential programs can be parallelized for both shared and distributed memory machines. We have focused on parallelizing Quick Sort for OpenMP and MPI environments using a tool, Reuseware, which is based on the concepts of Invasive Software Composition. parallelization IIP ISC HPC OpenMP MPI Computer science Datavetenskap
88	Réseau longue distance et application distribuée dans les grilles de calcul : étude et propositions pour une interaction efficace / Long-Distance Network and Distributed Application in Computing Grids: Study and Proposals for Efficient Interaction Hablot, Ludovic 17 December 2009 (has links) (PDF) Parallel computing, which emerged in 1970, makes it possible to execute tasks of a single application on several processors at the same time, unlike conventional applications, which execute an algorithm sequentially. The first architectures, supercomputers that grouped thousands of processors within a single machine, gave way at the end of the 1970s to clusters: standard computers interconnected by a fast network. As these architectures spread widely, grids appeared in the early 1990s to federate the resources of different entities by interconnecting them, thereby providing greater overall computing power. The grid, as considered in this manuscript, is therefore defined as an interconnection of clusters by a long-distance network. Parallel applications mostly rely on the MPI standard, which works by message passing. Initially intended for clusters, it is still used to program the communications of applications running on grids. This allows legacy applications to be reused. While various problems have been solved for communications within clusters, the long-distance network of the grid poses several problems. First, MPI messages are transmitted reliably over the long-distance network via the TCP protocol. However, TCP, which remains the transport protocol used in most grids, is based on stream-oriented data transfer; it is therefore poorly suited to MPI communications. Second, the high latency of the long-distance network makes communications and retransmissions of lost packets costly. 
Finally, the bandwidth available on the access link to this network is generally lower than the sum of the bandwidths required if all processes communicate over this link at the same time. This creates congestion both within a single application and with the other applications that use the link, and it becomes necessary to manage this bottleneck. The main objective of this thesis is to study in detail the interactions between parallel applications and the transport layer in the long-distance networks of computing grids, and then to propose solutions to these problems. Grid MPI network TCP
89	Efficient Methods for Arbitrary Data Redistribution Bai, Sheng-Wen 21 July 2005 (has links) In many parallel programs, run-time data redistribution is required to enhance data locality and reduce remote memory access on distributed memory multicomputers. In heterogeneous computing environments, irregular data redistribution can be used to adjust data assignment. Since data redistribution is performed at run-time, there is a performance trade-off between the efficiency of the new data distribution for a subsequent phase of an algorithm and the cost of redistributing the array among processors. Thus, efficient methods for performing data redistribution are of great importance for the development of distributed memory compilers for data-parallel programming languages. For regular data redistribution, two approaches are presented in this dissertation: an indexing approach and a packing/unpacking approach. In the indexing approach, we propose a generalized basic-cycle calculation (GBCC) technique to efficiently generate the communication sets for a redistribution from BLOCK-CYCLIC(s) over P processors to BLOCK-CYCLIC(t) over Q processors. In the packing/unpacking approach, we present a User-Defined Types (UDT) method to perform BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution, using MPI user-defined datatypes. This method reduces the required memory buffers and avoids unnecessary movement of data. For irregular data redistribution, an Essential Cycle Calculation (ECC) method is presented. The above methods were originally developed for one-dimensional arrays. However, multi-dimensional arrays can also be redistributed by applying these methods dimension by dimension, starting from the first (last) dimension if the array is stored in column-major (row-major) order. GBCC ECC Data Redistribution MPI User-Defined Datatypes Data Distribution
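To make the notion of communication sets in the abstract above concrete, here is a minimal brute-force sketch. It is not the GBCC technique from the dissertation, which generates these sets efficiently; it simply enumerates every element. The function names `owner` and `communication_sets` are hypothetical, and the sketch assumes the usual convention that under BLOCK-CYCLIC(b) over n processors, element i (0-based) lives on processor (i // b) mod n.

```python
from collections import defaultdict


def owner(i, block, nprocs):
    """Processor that owns global index i under BLOCK-CYCLIC(block)."""
    return (i // block) % nprocs


def communication_sets(n, s, p, t, q):
    """Map (source, destination) pairs to the global indices they exchange.

    Source layout: BLOCK-CYCLIC(s) over p processors.
    Target layout: BLOCK-CYCLIC(t) over q processors.
    Brute force over all n elements, for illustration only.
    """
    sets = defaultdict(list)
    for i in range(n):
        sets[(owner(i, s, p), owner(i, t, q))].append(i)
    return dict(sets)


if __name__ == "__main__":
    # 12 elements, BLOCK-CYCLIC(2) over 2 processors to BLOCK-CYCLIC(3) over 2.
    for pair, indices in sorted(communication_sets(12, 2, 2, 3, 2).items()):
        print(pair, indices)
```

The ownership pattern repeats with period lcm(s*P, t*Q), so a practical method only needs to analyze one such cycle rather than every element.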
90	Erweiterung eines existierenden Infiniband Benchmarks / Extension of an Existing InfiniBand Benchmark Viertel, Carsten 01 June 2006 (has links) (PDF) InfiniBand is increasingly used as the interconnect network for clusters. This makes it necessary to adapt existing libraries for parallel programming languages to the new network as well as possible. Collective operations are an important component of parallel programming languages; they require sending a message from one node to many others, or from many nodes to a single one. To find out which connection types and operations are best suited to these collective operations, a benchmark was developed. The aim of this student research project is to extend this program, test it on a cluster, and evaluate the results. InfiniBand Kollektive Operationen Multicast ddc:004 Benchmark Cluster MPI
