1 |
Exploring Performance Portability for Accelerators via High-level Parallel PatternsHou, Kaixi 27 August 2018 (has links)
Nowadays, parallel accelerators have become prominent and ubiquitous, e.g., multi-core CPUs, many-core GPUs (Graphics Processing Units) and Intel Xeon Phi. The performance gains from them can be as high as many orders of magnitude, attracting extensive interest from many scientific domains. However, the gains are closely followed by two main problems: (1) A complete redesign of existing codes might be required if a new parallel platform is used, leading to a nightmare for developers. (2) Parallel codes that execute efficiently on one platform might be either inefficient or even non-executable for another platform, causing portability issues.
To handle these problems, in this dissertation, we propose a general approach using parallel patterns, an effective and abstracted layer to ease the generating efficient parallel codes for given algorithms and across architectures. From algorithms to parallel patterns, we exploit the domain expertise to analyze the computational and communication patterns in the core computations and represent them in DSL (Domain Specific Language) or algorithmic skeletons. This preserves the essential information, such as data dependencies, types, etc., for subsequent parallelization and optimization. From parallel patterns to actual codes, we use a series of automation frameworks and transformations to determine which levels of parallelism can be used, what optimal instruction sequences are, how the implementation change to match different architectures, etc. Experiments show that our approaches by investigating a couple of important computational kernels, including sort (and segmented sort), sequence alignment, stencils, etc., across various parallel platforms (CPUs, GPUs, Intel Xeon Phi). / Ph. D. / Nowadays, parallel accelerators have become prominent and ubiquitous, e.g., multi-core CPUs, many-core GPUs (Graphics Processing Units) and Intel Xeon Phi. The performance gains from them can be as high as many orders of magnitude, attracting extensive interest from many scientific domains. However, the gains are closely followed by two main problems: (1) A complete redesign of existing codes might be required if a new parallel platform is used, leading to a nightmare for developers. (2) Parallel codes that execute efficiently on one platform might be either inefficient or even non-executable for another platform, causing portability issues.
To handle these problems, in this dissertation, we propose a general approach using parallel patterns, an effective and abstracted layer to ease the generating efficient parallel codes for given algorithms and across architectures. From algorithms to parallel patterns, we exploit the domain expertise to analyze the computational and communication patterns in the core computations and represent them in DSL (Domain Specific Language) or algorithmic skeletons. This preserves the essential information, such as data dependencies, types, etc., for subsequent parallelization and optimization. From parallel patterns to actual codes, we use a series of automation frameworks and transformations to determine which levels of parallelism can be used, what optimal instruction sequences are, how the implementation change to match different architectures, etc. Experiments show that our approaches by investigating a couple of important computational kernels, including sort (and segmented sort), sequence alignment, stencils, etc., across various parallel platforms (CPUs, GPUs, Intel Xeon Phi).
|
2 |
Conflict Detection-Based Run-Length Encoding: AVX-512 CD Instruction Set in ActionLehner, Wolfgang, Ungethum, Annett, Pietrzyk, Johannes, Damme, Patrick, Habich, Dirk 18 January 2023 (has links)
Data as well as hardware characteristics are two key aspects for efficient data management. This holds in particular for the field of in-memory data processing. Aside from increasing main memory capacities, efficient in-memory processing benefits from novel processing concepts based on lightweight compressed data. Thus, an active research field deals with the adaptation of new hardware features such as vectorization using SIMD instructions to speedup lightweight data compression algorithms. Following this trend, we propose a novel approach for run-length encoding, a well-known and often applied lightweight compression technique. Our novel approach is based on newly introduced conflict detection (CD) instructions in Intel's AVX-512 instruction set extension. As we are going to show, our CD-based approach has unique properties and outperforms the state-of-the-art RLE approach for data sets with small run lengths.
|
3 |
A study on SSE optimisation regarding initialisation and evaluation of the Fast Multipole MethodHjerpe, Daniel January 2016 (has links)
The following study examines whether the initialisation (multipole expansions at the finest level) and evaluation of the numerical method Fast Multipole Method (FMM) can benefit from implementing SSE instructions. The implementation of SSE-instructions have been studied and compared to the serial case. Moreover, studied parts of the algorithm include arithmetics on complex numbers, and the usage of applying SSE instructions to complex numbers of double precision. In conclusion, the initialisation has not experienced any improvement in terms of throughput by appliying SSE instructions. However, the evaluation reached almost the double speed-up when SSE instructions were applied. The difference in results are most likely due to the structure of the both algorithms. The initialisation is simple, but the evaluation which involves more operations can benefit from SSE instructions. Furthermore, a scheme is proposed for how SSE instructions can be applied to data sets which are not divisable by the unroll factor and to data sets of varying size.
|
4 |
NEW TECHNIQUES ON VLSI CIRCUIT TESTING & EFFICIENT IMPLEMENTATIONS OF ARITHMETIC OPERATIONSPoulos, Konstantinos 01 December 2020 (has links)
Testing is necessary factor to guarantee that ICs operate according to specifications before being delivered to customers. Testing is a process used to identify ICs containing imperfections or manufacturing defects that may cause failures. Inaccuracy and imperfections can be introduced during the fabrication of the chips due to the complex mechanical and chemical steps required during the manufacturing processes. The testing process step applies test patterns to circuits and analyzes their responses. This work focuses on VLSI circuit testing with two implementations for DFT (Design for testability); the first is an ATPG tool for sequential circuits and the second is a BIT (Built in Test) circuit for high frequency signal classification.There has been a massive increase in the number of transistors integrated in a chip, and the complexity of the circuit is increasing along with it. This growth has become a bottleneck for the test developers. The proposed ATPG tool was designed for testing sequential circuits. Scan Chains in Design For Testability (DFT) gained more prominence due to the increase in the complexity of the modern circuits. As the test time increases along with the number of memory elements in the circuit, new and improved methods are needed. Even though scan chains implementation effectively increases observability and controllability, a big portion of the time is wasted while shifting in and shifting out the test patterns through the scan chain. Additionally, the modern applications require operating speed at higher frequencies and there is a growing demand in testing equipment capable to test CMOS circuits utilized in high frequency applications.With the modern applications requiring operating speed at higher frequencies, there is a growing demand in testing equipment capable to test CMOS circuits utilized in high frequency applications. Two main problems have been associated when using external test equipment to test high frequency circuits; the effect of the resistance and capacitance of the probe on the performance of the circuit under test which leads to a faulty evaluation; and the cost of a dedicated high frequency tester. To solve these problems innovative test techniques are needed such as Built In Test (BIT) where self-evaluation takes place with a small area overhead and reduced requirements for external equipment. In the proposed methodology a Built In Test (BIT) detection circuit provides an efficient way to transform the high frequency response of the circuit under test into a DC signal.This work is focused in two major fields. The first topic is on VLSI circuit testing with two implementations for DFT (Design for testability); the first is an ATPG tool for sequential circuits and the second is a BIT (Built in Test) circuit for high frequency signal classification as explained. The second topic is focused on efficient implementations of arithmetic operations in arbitrary long numbers with emphasis to addition. Arbitrary-Precision arithmetic refers to a set of data structures and algorithms which allows to process much greater numbers that exceed the standard data types. . An application example where arbitrary long numbers are widely used is cryptography, because longer numbers offer higher encryption security. Modern systems typically employ up to 64-bit registers, way less than what an arbitrary number requires, while conventional algorithms do not exploit hardware characteristics as well. Mathematical models such as weather prediction and experimental mathematics require high precision calculations that exceed the precision found in most Arithmetic Logic Units (ALU). In this work, we propose a new scalable algorithm to add arbitrary long numbers. The algorithm performs bitwise logic operations rather than arithmetic on 64-bit registers. We propose two approaches of the same algorithm that utilize the same basic function created according to the rules of binary addition
|
5 |
SIMD Optimizations of Software Rendering in 2D Video Games / SIMD optimeringar i mjukvarurendering av 2D spelMendel, Oskar, Bergström, Jesper January 2019 (has links)
Optimizing rendering is one of the greatest challenges faced by game developers. Most game engines make use of hardware rendering which uses technology specifically built for rendering. Before such hardware existed, game developers had to rely on the CPU to render their games. This is known as software rendering. Software rendering is not commonly used nowadays but has been seen in cases such as a backup for when the end users machine does not support the hardware based renderer of the application. Since the CPU is not purposely built for rendering, unlike the GPU, the developer has to perform optimizations to make the renderer more efficient in terms of speed. In this thesis, we present an approach which is a subset of parallel programming called Single Instruction, Multiple Data. This technique operates on vector based registers which means operations can be performed on multiple pieces of data at once. This is applied to an already built game engine in order to optimize its rendering. The results show a speed-up of 90.5% and a framerate increase from 30 frames per second to 133 frames per second within the rendering routine.
|
6 |
Optimierung der Energie-Effizienz für Algorithmen der Linearen Algebra durch SIMD-Programmierung und AVX-VektorisierungJakobs, Thomas 10 January 2022 (has links)
Neben einer kurzen Ausführungszeit rückt bei der Optimierung von Anwendungen und Algorithmen ein geringer Energieverbrauch der genutzten Rechenressourcen in den Fokus der aktuellen Forschung. Eine hohe Energie-Effizienz von Programmen wird dabei erreicht, indem der Energieverbrauch von Programmen und Technologien reduziert wird, ohne dafür die Ausführungszeit übermäßig zu erhöhen. Im parallelen wissenschaftlichen Rechnen ist der Bedarf an energie-effizienten Programmausführungen vor allem für Algorithmen der linearen Algebra gegeben, die als Unterfunktionen in einer Vielzahl von Anwendungen eingesetzt werden. Die Vektorisierung von Programmen durch die Prozessor- und Instruktionssatzerweiterung AVX zeigt Potenzial zur energie-effizienten Ausführung von Algorithmen der linearen Algebra, wobei die erzielte Energie-Effizienz von der Umsetzung der Implementierung abhängt.
Für die gezeigten Untersuchungen werden drei repräsentativ ausgewählte Algorithmen der linearen Algebra für die Ausführung auf AVX-Vektoreinheiten genutzt. Bei der AVX-Vektorisierung der Algorithmen werden verschiedene Programmvarianten erstellt, mit denen Ausführungszeit und Energieverbrauch bei der Ausführung ermittelt werden. Die Programmvarianten unterscheiden sich dabei unter anderem in der Anwendung von Programmtransformationen, wie Loop Tiling oder einer veränderten Speicherzugriffsstruktur. Zusätzlich wird gezeigt, wie die Umsetzung verschiedener Programmieransätze, wie Autovektorisierung oder unterschiedlicher Instruktionssätze, sowie Implementierungsvarianten durch die Auswahl der verwendeten Instruktionen, die Ausführungszeit und den Energieverbrauch der Programmausführung beeinflussen. Die so erstellten Programmvarianten werden auf modernen Prozessoren verschiedener Architekturfamilien mit unterschiedlichen Ausführungsparametern, wie der eingestellten Prozessorfrequenz, ausgeführt. Die Untersuchungen zeigen, dass sich Ausführungszeit und Energieverbrauch von Programmen durch die Vektorisierung reduzieren lassen. Die Auswahl der Programmtransformationen, des Programmieransatzes und der Ausführungsparameter für die energie-effiziente Ausführung von vektorisierten Programmen kann dabei anwendungsspezifisch aufgrund der Eigenschaften des ausgewählten Algorithmus getroffen werden.
|
7 |
Paralelizace ultrazvukových simulací pomocí 2D dekompozice / Parallelization of Ultrasound Simulations Using 2D DecompositionNikl, Vojtěch January 2014 (has links)
This thesis is a part of the k-Wave project, which is a toolbox for the simulation and reconstruction of acoustic wave felds and one of its main contributions is the planning of focused ultrasound surgeries (HIFU). One simulation can take tens of hours and about 60% of the simulation time is taken by the calculation of the 3D Fast Fourier transforms. Up until now the 3D FFT has been calculated purely by the FFTW library and its 1D decomposition, whose major limitation is the maximum number of employable cores. Therefore we introduce a new approach, called the 2D hybrid decomposition of the 3D FFT (HybridFFT), where we combine both MPI processes and OpenMP threads to reach as best performance as possible. On a low number of cores, on the order of a few hundreds, we are about as fast or slightly faster than FFTW and pure MPI 2D decomposition libraries (PFFT and P3DFFT). One of the best results was achieved on a 512^3FFT using 512 cores, where our hybrid version run 31ms, FFTW run 39ms and PFFT run 44ms. The most significant performance advantage should be seen when employing around 8-16 thousand cores, however we haven't had an access to a machine with such resources. Almost a linear scalability has been proven for up to 2048 employed cores.
|
8 |
Akcelerace vektorových a krytografických operací na platformě x86-64 / Acceleration of Vector and Cryptographic Operations on x86-64 PlatformŠlenker, Samuel January 2017 (has links)
The aim of this thesis was to study and subsequently process a comparison of older and newer SIMD processing units of modern microprocessors on the x86-64 platform. The thesis provides an overview of the fastest computations of vector operations with matrices and vectors, including corresponding source codes. Furthermore, the thesis is focused on authenticated encryption, specifically on block cipher AES operating in Galois Counter Mode, and on a discussion of possibilities of instruction sets for cryptographic support.
|
Page generated in 0.035 seconds