About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations, provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Unstructured Computations on Emerging Architectures

Al Farhan, Mohammed 05 May 2019
This dissertation describes detailed performance engineering and optimization of an unstructured computational aerodynamics software system with irregular memory accesses on a variety of emerging multi- and many-core high performance computing architectures. These architectures are expected to be the building blocks of energy-austere exascale systems, on which algorithmic and architecture-oriented optimizations are essential for achieving worthwhile performance. We investigate several state-of-the-practice shared-memory optimization techniques applied to key kernels for the important problem class of unstructured meshes. We illustrate these techniques on a broad spectrum of emerging microprocessor architectures, representative of the compute units in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating-point numerics of the original code. While the linear algebraic kernels are bottlenecked by memory bandwidth for even modest numbers of hardware cores sharing a common address space, the edge-based loop kernels, which arise in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, are compute-intensive and effectively exploit contemporary multi- and many-core processing hardware. We therefore employ low- and high-level algorithmic and architecture-specific code optimizations and tuning in light of thread- and data-level parallelism, with a focus on strong thread scaling at the node level. Our approaches are based upon novel multi-level hierarchical workload distribution mechanisms that map data across different compute units (from the address space down to the registers) within every hardware core. We analyze the aerodynamics application on specific computing architectures to develop performance metrics and models that characterize the upper and lower bounds of its performance.
We present significant full-application speedups relative to the baseline code on a succession of many-core processor architectures: Intel Xeon Phi Knights Corner (5.0x) and Knights Landing (2.9x). In addition, Knights Landing outperforms Intel Xeon Skylake by nearly twofold at significantly lower power consumption. These optimizations are expected to be of value for many other unstructured-mesh partial differential equation-based scientific applications as multi- and many-core architectures evolve.
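The edge-based loop kernels described above update both endpoints of every mesh edge, so naive threading races on shared vertices. A minimal sketch of one standard remedy (illustrative, not necessarily the dissertation's exact mechanism): greedily color the edges so that edges in one color class share no vertex, making each class safe to process in parallel. The function names and the toy flux are assumptions for illustration.

```python
def color_edges(edges):
    """Greedily partition edges into classes whose edges share no vertex."""
    colors = []  # list of (edge bucket, vertex set) per color class
    for u, v in edges:
        for bucket, verts in colors:
            if u not in verts and v not in verts:
                bucket.append((u, v))
                verts.update((u, v))
                break
        else:  # no conflict-free class found: open a new one
            colors.append(([(u, v)], {u, v}))
    return [bucket for bucket, _ in colors]

def edge_residual(edges, state):
    """Accumulate a toy flux difference into per-vertex residuals,
    one conflict-free color class at a time."""
    residual = [0.0] * len(state)
    for bucket in color_edges(edges):
        # Edges in `bucket` touch disjoint vertices, so in a threaded
        # implementation these updates could run concurrently.
        for u, v in bucket:
            flux = state[v] - state[u]  # placeholder for a real flux function
            residual[u] += flux
            residual[v] -= flux
    return residual

edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
state = [1.0, 2.0, 4.0, 7.0]
print(edge_residual(edges, state))  # [4.0, 1.0, -2.0, -3.0]
```

Because the accumulation is a plain sum, the result is independent of the coloring, which is what makes the reordering safe.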
2

Idiom-driven innermost loop vectorization in the presence of cross-iteration data dependencies in the HotSpot C2 compiler

Sjöblom, William January 2020
This thesis presents a technique for automatic vectorization of innermost single-statement loops with a cross-iteration data dependence, based on analyzing data-flow to recognize frequently recurring program idioms. Recognition is carried out by matching the circular SSA data-flow found around the loop body's φ-function against several primitive patterns, forming a tree representation of the relevant data-flow. This tree is then pruned down to a single parameterized node, providing a high-level specification of the data-flow idiom at hand, which guides the algorithmic replacement applied to the intermediate representation. The versatility of the technique is shown by an implementation supporting vectorization both of a limited class of linear recurrences and of prefix sums, where the latter shows how the technique generalizes to intermediate representations with memory state in SSA form. Finally, a thorough performance evaluation shows the effectiveness of the vectorization technique.
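The kind of cross-iteration dependence the thesis targets is easiest to see in a prefix sum, where each iteration consumes the previous iteration's accumulator. A minimal Python sketch (illustrative only, not the C2 implementation): the blocked variant mirrors what idiom-driven vectorized code does, scanning one register-wide block of lanes at a time and carrying a single running offset across blocks instead of a per-element dependence.

```python
def prefix_sum_scalar(xs):
    """The scalar idiom: acc depends on the previous iteration."""
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def prefix_sum_blocked(xs, width=4):
    """Blocked evaluation mirroring SIMD codegen: each `width`-element
    block is scanned independently (standing in for an in-register scan),
    then shifted by the running total of all preceding blocks."""
    out, offset = [], 0
    for i in range(0, len(xs), width):
        block = xs[i:i + width]
        local, acc = [], 0
        for x in block:          # in real codegen: a log2(width)-step scan
            acc += x
            local.append(acc)
        out.extend(v + offset for v in local)
        offset += acc            # carry the block total forward
    return out

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(prefix_sum_blocked(data))  # [3, 4, 8, 9, 14, 23, 25, 31]
```

The point of recognizing the idiom is exactly this restructuring: the loop-carried dependence is reduced from every element to one scalar per block, which a SIMD unit can amortize.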
3

Exploiting Data-Level-Parallelism for Modern In-Memory OLAP Query Engines

Pietrzyk, Johannes 14 January 2025
The development of in-memory query engines constantly adapts to evolving hardware features to maintain high performance for modern data analytics. To achieve this, parallelizing query processing is crucial for leveraging the full potential of modern hardware. While traditional parallelization techniques like multi-threading and distributed computing are well established, the increasing availability and continuous advancement of data-level parallelism, known as Single Instruction Multiple Data (SIMD), have made SIMD a cutting-edge approach for improving single-query performance. Three general trends can be identified in the evolution of SIMD hardware: (i) a continual increase in the available degree of data-level parallelism through wider SIMD registers, (ii) an ongoing expansion of functional capabilities, and (iii) notable heterogeneity of the SIMD hardware landscape. The increasing width of SIMD registers, ranging from 128 bit to 512 bit on Intel and ARM platforms and up to 16 Kbit on the NEC SX-Aurora TSUBASA vector engine, provides significant potential for data-level parallelism. This trend benefits query operators with linear access patterns, such as simple scans or arithmetic operations. However, it can hurt operators that rely on random access patterns or introduce loop-carried dependencies, such as sorting or hash joins. With the introduction of novel instruction set extensions, like Intel's AVX-512, the functional capabilities of SIMD hardware have been continuously extended, allowing more complex operations to be executed efficiently in parallel. Consequently, this thesis outlines the opportunities and challenges of wider SIMD registers for in-memory query processing and shows how novel instruction set extensions can mitigate some of those challenges. Additionally, the SIMD hardware landscape has become increasingly heterogeneous, with different architectures available.
This diversity poses a significant challenge for academia and industry, as it requires constant adaptation and re-evaluation of SIMD-based algorithms and data structures. To address this challenge, we implemented a SIMD hardware abstraction library called the Template SIMD Library (TSL), which provides a unified interface to different architectures. In this thesis, we present the design and implementation of the TSL, demonstrating its capabilities by implementing a set of SIMD-based algorithms for in-memory query processing. We also show that even modern FPGAs benefit from the SIMD paradigm and can be programmed using high-level synthesis tools such as Intel's oneAPI. Moreover, we demonstrate the potential of SIMD-based FPGA programming by implementing SIMD-based algorithms for in-memory query processing on different FPGA cards and integrating the necessary functionality into the TSL. Finally, we propose a novel workload-aware SIMD query processing approach called SIMQ. This approach leverages SIMD registers to share data access and SIMD instructions to share computation across queries.

Table of contents:
1 Introduction 1.1 Single Instruction Multiple Data (SIMD) 1.1.1 Utilizing SIMD 1.1.2 SIMD in OLAP Query Processing 1.2 Thesis Goals and Contributions 1.2.1 Implications of SIMD Advancements on Analytical Query Processing 1.2.2 Leveraging SIMD to Optimize OLAP Query Throughput 1.2.3 Embracing the Heterogeneity of SIMD-aware Hardware 1.3 Impact of Thesis Contributions 1.3.1 Publications of Thesis Contribution 1.3.2 Open Source Contributions 1.3.3 Further Publications 1.4 Structure of the Thesis
2 Evaluating the Vector Supercomputer SX-Aurora TSUBASA 2.1 Problem Statement 2.2 Hardware System SX-Aurora TSUBASA 2.2.1 Overall Architecture 2.2.2 Vector Processing and Specific Systems 2.2.3 Execution Model and Programming Approach 2.3 MorphStore — In-Memory Database System 2.4 Comparative Evaluation 2.4.1 Selected Investigated Operations 2.4.2 Experimental Setup and Methodology 2.4.3 Single-Thread Evaluation Results 2.4.4 Multi-Thread Evaluation Results 2.4.5 Summary 2.5 Future Work 2.6 Conclusion
3 Fighting Duplicates in Hashing 3.1 Problem Statement 3.2 Background 3.2.1 Linear Probing 3.2.2 State-of-the-Art Vectorized Implementation of Linear Probing 3.2.3 Novel SIMD Instructions 3.3 CD-aware Vectorized Implementation of Linear Probing 3.3.1 CD-aware Hash Table Data Structures 3.3.2 Handling of Bucket Duplicates 3.3.3 Handling of Key Duplicates 3.4 Experimental Evaluation 3.4.1 Evaluation Results for Hashing without Value Handling 3.4.2 Evaluation Results for Hashing with Value Handling 3.5 Related Work 3.6 Conclusion and Future Work
4 Leveraging SIMD to Optimize OLAP Query Throughput 4.1 Problem Statement 4.2 Background and Related Work 4.2.1 Vectorization in General 4.2.2 Vectorization in Database Systems 4.2.3 Work Sharing in Database Systems 4.3 Sharing Vector Registers 4.3.1 SISQ: Single Instruction Single Query 4.3.2 SIMQ: Single Instruction Multiple Queries 4.4 Evaluation 4.4.1 Single-Threaded Evaluation 4.4.2 Multi-Threaded Evaluation 4.5 Discussion 4.6 Conclusion
5 Program your (custom) SIMD instruction set on FPGA in C++ 5.1 Problem Statement 5.2 FPGA Programming in C++ 5.2.1 Naïve C++ programming 5.2.2 Data Parallel C++ programming 5.2.3 Using specialized data types 5.2.4 Analyzing FPGA resources 5.2.5 Summary 5.3 Programming SIMD instruction set 5.3.1 Register Definition 5.3.2 Instruction Definition 5.3.3 Comparing with RTL kernels 5.4 Use Case Studies 5.4.1 FilterCount 5.4.2 Binary Packing 5.5 Custom SIMD instructions 5.6 DPC++ Best practices 5.7 Related Work 5.8 Conclusion
6 Mastering the SIMD Heterogeneity 6.1 Problem Statement 6.2 Background 6.2.1 Applicability 6.2.2 Evolving API 6.3 TSLGen - Generative Framework 6.3.1 General Concepts 6.3.2 Framework Description 6.4 Advanced Features 6.4.1 Test Generation 6.4.2 Build Environment Files Generation 6.4.3 Further Extensions 6.4.4 Tool Support 6.5 Use-Case Studies 6.5.1 Case Studies Description 6.5.2 Applicability 6.5.3 Extensibility 6.6 Discussion 6.7 Related Work 6.8 Conclusion
7 Tutorial on SIMD - Foundations, Abstraction, and Advanced Techniques 7.1 Problem Statement 7.2 Foundations 7.3 Abstraction 7.4 Advanced Techniques 7.4.1 Wider SIMD Registers 7.4.2 Flexibly-Sized SIMD Registers 7.4.3 Partition-based SIMD Processing 7.4.4 Increasing Heterogeneity 7.5 Summary
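The SIMQ idea summarized in the abstract, sharing one pass over the data among several concurrent queries, can be sketched in a few lines. This is a toy illustration under assumptions of ours (the function names and the bitmask encoding are not from the thesis): one scan evaluates every registered query's filter predicate per row, packing one match bit per query the way lanes of a shared SIMD register would.

```python
def simq_scan(column, predicates):
    """Single pass over `column`; for each row, one packed bitmask holds
    the match bit of every registered query (bit q set = query q matched).
    The per-query bits play the role of lanes in a shared SIMD register."""
    masks = []
    for value in column:
        packed = 0
        for q, pred in enumerate(predicates):
            packed |= pred(value) << q  # one 'lane' per query
        masks.append(packed)
    return masks

def selected_rows(masks, q):
    """Decode the row indices matching query `q` from the packed masks."""
    return [i for i, m in enumerate(masks) if (m >> q) & 1]

column = [10, 25, 3, 42, 17]
queries = [lambda v: v > 15,      # query 0
           lambda v: v % 2 == 0,  # query 1
           lambda v: v < 10]      # query 2

masks = simq_scan(column, queries)
print(selected_rows(masks, 0))  # rows where v > 15 -> [1, 3, 4]
print(selected_rows(masks, 1))  # rows where v is even -> [0, 3]
```

The payoff mirrors the abstract's claim: the data is read once and the comparison work is shared, so adding a query costs a lane rather than another full scan.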
