  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Model-based Integer Compression Schemes for In-Memory Query Processing

Hildebrandt, Juliana 15 April 2024 (has links)
In-memory column-store database systems have to keep both base data and the intermediate results generated during query processing in main memory. Moreover, in this kind of database system, base data as well as intermediate results are usually represented as integer data, so the effort to access intermediate results is equivalent to the effort to access the base data. Based on this characteristic, the optimization of intermediate results is a general optimization knob with a high impact on query processing performance. In this direction, we propose the continuous use of integer compression methods for intermediate results and design a novel compression-aware query processing concept. To minimize the overall query execution time, the best-fitting integer compression scheme has to be used. However, as shown in the literature, there is no single best integer compression algorithm: data and hardware characteristics, as well as optimization objectives such as decompression speed or compression rate, determine which algorithms fit. To tackle this challenge systematically from a system perspective, we present our vision of a model-based approach that makes a large corpus of integer compression schemes available in our compression-aware query processing concept. We then describe our model-based approach in detail by (i) introducing COLLATE, our metamodel for integer compression schemes, and (ii) showing that a large corpus of integer compression schemes can be described as models conforming to COLLATE. Afterwards, we introduce a concept and an implementation to automatically derive scalar compression and decompression routines from the compression-scheme models. Finally, we propose BOUNCE, a novel generalized SIMD concept for integer compression algorithms, which adapts a scalar execution to a data-parallel execution using the SIMD instruction set extensions of general-purpose CPUs.

Contents:
1. Introduction
1.1. Thesis Goals and Contributions
1.1.1. Model-Based Concept for Compression Schemes
1.1.2. Model-Based Generator Framework
1.1.3. Generalized SIMD Concept for Integer Compression
1.2. Impact of Thesis Contributions
1.2.1. Publications of Thesis Contribution
1.2.2. Open Source Contributions
1.2.3. Additional Contributions
1.3. Structure of the Thesis
I. Model-Based Concept for Compression Schemes
2. Compression-Aware In-Memory Query Processing
2.1. Motivation
2.1.1. Vision of Compression-Aware In-Memory Query Processing
2.1.2. System Design Challenge for Compression-Aware Processing
2.1.3. Our Contribution and Outline
2.2. System Design Overview
2.3. Survey of Lightweight Data Compression Algorithms
2.3.1. Analysis of Basic Lightweight Compression Techniques
2.3.2. Analysis of Lightweight Compression Algorithms
2.3.3. Derived System Description and Properties
2.4. COLLATE Model
2.5. Transformation of Model Instances
2.5.1. Model Instances for Byte-oriented Encoding Algorithms
2.5.2. Transformation to Executable Code
2.6. Future Work
2.6.1. Structural Aspect
2.6.2. Operational Aspect
2.6.3. Optimization Aspect
2.7. Conclusion
3. Metamodeling Lightweight Data Compression Algorithms
3.1. Problem Statement
3.2. Metamodeling Lightweight Compression Algorithms
3.2.1. Properties of Data Compression Algorithms
3.2.2. COLLATE Metamodel
3.2.3. Interaction of COLLATE Concepts
3.3. Application Scenarios
3.3.1. Algorithms as Models
3.3.2. Compression Algorithm Language CoALa
3.3.3. Transformation of CoALa Code to Executable Code
3.4. Case Studies
3.5. Related Work
3.6. Conclusion
II. Model-Based Generator Framework
4. LCTL: Lightweight Compression Template Library
4.1. Problem Statement
4.2. Preliminaries
4.2.1. Characteristics of Lightweight Compression
4.2.2. Metamodel COLLATE
4.3. Lightweight Compression Template Library
4.3.1. Template Metaprogramming for Tree Transformations
4.3.2. Lightweight Compression Template Library
4.4. Evaluation
4.4.1. Format Space Diversity
4.4.2. Compile-Times
4.4.3. Run-Times
4.5. Future Work
4.5.1. Further Code Optimization
4.5.2. Format Selection
4.5.3. Integration
4.6. Related Work
4.7. Conclusion
5. Integrating Compression into Apache Arrow
5.1. Problem Statement
5.2. Preliminaries
5.2.1. Apache Arrow
5.2.2. Lightweight Integer Compression
5.2.3. Lessons Learned
5.3. ArrowComp Framework
5.3.1. Metamodel for (De)Compression
5.3.2. Self-Describing Columns
5.3.3. Implementation
5.4. Evaluation
5.4.1. Data Sizes
5.4.2. Run Times
5.5. Related Work
5.5.1. Storage-oriented Formats
5.5.2. Processing-oriented Formats
5.5.3. Apache Arrow-centric Work
5.6. Conclusion and Future Work
III. Generalized SIMD Concept for Integer Compression
6. Partition-Based SIMD Processing
6.1. Problem Statement
6.2. Gather Evaluation
6.2.1. Evaluation Setup
6.2.2. Evaluation Results
6.3. Partition-based SIMD
6.4. Application Use Cases
6.4.1. Vectorized Query Processing
6.4.2. Integer Compression
6.5. Related Work
6.6. Conclusion
7. BOUNCE: Memory-Efficient SIMD Compression
7.1. Problem Statement
7.2. Analyzing State-of-the-Art
7.2.1. Bitpacking
7.2.2. Simple Algorithms
7.2.3. Varint Algorithms
7.2.4. Further Compression Algorithms
7.2.5. Summary
7.3. BOUNCE: Block-concurrent SIMD Concept
7.4. BOUNCE Application
7.4.1. Application to BP
7.4.2. Application to Simple Algorithms
7.4.3. Application to Varint Algorithms
7.4.4. Application to Further Algorithms
7.5. Evaluation
7.5.1. Evaluation Setup
7.5.2. Performance of the Data Access Pattern
7.5.3. Evaluating Bitpacking
7.5.4. Synthetic Data Sets with Fixed Bit Widths
7.5.5. Synthetic Data Sets with Different Bit Widths
7.5.6. Impact of Different Hardware Architectures
7.5.7. Evaluation on Real Data Sets
7.6. Related Work
7.7. Conclusion
8. Conclusion and Research Outlook
8.1. Summary
8.2. Research Outlook
42

Applying Source Level Auto-Vectorization to Aparapi Java

Albert, Frank Curtis 19 June 2014 (has links)
Ever since chip manufacturers hit the power wall that stopped them from increasing processor clock speed, there has been an increased push toward parallelism for performance improvements. This parallelism comes in the form of data-parallel single-instruction multiple-data (SIMD) instructions as well as parallel compute cores in both central processing units (CPUs) and graphics processing units (GPUs). While these hardware enhancements offer potential performance gains, programs must be rewritten to take advantage of them before any improvement is seen. Some lower-level languages that compile directly to machine code already exploit the data-parallel SIMD instructions, but higher-level interpreted languages often do not. Java, one of the most popular programming languages in the world, still does not include support for these SIMD instructions. In this thesis, we present a vector library that implements all of the major SIMD instructions in functions accessible to Java through JNI calls, bringing the benefits of general-purpose SIMD functionality to Java. This thesis also works with the data-parallel Aparapi Java extension to bring these SIMD performance improvements to programmers who use the extension, without any additional effort on their part. Aparapi provides an API that lets programmers declare certain sections of their code parallel. These parallel sections are then run on OpenCL-capable hardware, with a fallback path in the Java thread pool to ensure code reliability. This work exploits the known independence of the parallel code sections to automatically modify the Java thread-pool fallback path to include the vectorization library, using an auto-vectorization tool created for this work.
When the code is not vectorizable, the auto-vectorizer tool still offers performance improvements over the default fallback path through an improved loop implementation that executes the same code with less overhead. Experiments conducted for this work show that, for all 10 benchmarks tested, the auto-vectorization tool produced an implementation that beat the default Aparapi fallback path. In addition, this improved fallback path even outperformed the GPU implementation for several of the benchmarks. / Master of Science
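The loop restructuring such an auto-vectorizer performs can be sketched language-neutrally: a per-element loop becomes a main loop over fixed-width chunks plus a scalar remainder loop. The Python below is only a conceptual sketch (the actual tool rewrites Java thread-pool code to call a JNI vector library); `VECTOR_WIDTH` and the function names are illustrative.

```python
VECTOR_WIDTH = 4  # lanes per "SIMD" step (illustrative)

def saxpy_scalar(a, x, y):
    # Original per-element loop, as the default fallback path would run it.
    return [a * xi + yi for xi, yi in zip(x, y)]

def saxpy_vectorized(a, x, y):
    # Restructured form: a main loop over full vector-width chunks, then a
    # scalar remainder loop -- the shape a vectorizer gives independent loops.
    n = len(x)
    out = [0.0] * n
    i = 0
    while i + VECTOR_WIDTH <= n:
        # One "SIMD" step: all lanes of the chunk processed together.
        out[i:i + VECTOR_WIDTH] = [
            a * xi + yi
            for xi, yi in zip(x[i:i + VECTOR_WIDTH], y[i:i + VECTOR_WIDTH])
        ]
        i += VECTOR_WIDTH
    while i < n:  # scalar remainder for the leftover elements
        out[i] = a * x[i] + y[i]
        i += 1
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [10.0] * 7
assert saxpy_vectorized(2.0, x, y) == saxpy_scalar(2.0, x, y)
```

The transformation is legal precisely because the loop iterations are independent, which is the property the Aparapi programming model already guarantees for its parallel sections.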
43

Akcelerace adversariálních algoritmů s využití grafického procesoru / GPU Accelerated Adversarial Search

Brehovský, Martin January 2011 (has links)
General-purpose graphics processing units have proven useful for accelerating computationally intensive algorithms. Their capability to perform massively parallel computing significantly improves the performance of many algorithms. This thesis focuses on using graphics processors (GPUs) to accelerate algorithms based on adversarial search. We investigate whether adversarial search algorithms are suitable for the single-instruction multiple-data (SIMD) type of parallelism that GPUs provide. To that end, parallel versions of selected algorithms accelerated by the GPU were implemented and compared with the same algorithms running on the CPU. The obtained results show a significant speed improvement and prove the applicability of GPU technology in the domain of adversarial search algorithms.
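As a point of reference for the class of algorithms being accelerated, a minimal sequential negamax (the symmetric form of minimax) over an explicit game tree is sketched below; the tree encoding is hypothetical and unrelated to the thesis's implementation. The independent child evaluations in the inner loop are what a SIMD/GPU version would process in parallel.

```python
def negamax(node, color=1):
    """Minimal negamax over an explicit game tree.
    A node is either a leaf score (int) or a list of child nodes;
    `color` is +1 for the maximizing player, -1 for the opponent."""
    if isinstance(node, int):  # leaf: static evaluation from the mover's view
        return color * node
    # Each child is independent -- this is the loop a data-parallel version
    # evaluates concurrently before reducing with max().
    return max(-negamax(child, -color) for child in node)

# Tiny two-ply tree: the maximizer picks the branch whose worst reply is best.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
assert negamax(tree) == 3
```

In practice the adversarial-search literature adds alpha-beta pruning, which introduces sequential dependencies between siblings; that tension between pruning and SIMD-style parallelism is exactly what makes the suitability question studied here non-trivial.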
44

Transforming TLP into DLP with the dynamic inter-thread vectorization architecture / Transformer le TLP en DLP avec l'architecture de vectorisation dynamique inter-thread

Kalathingal, Sajith 13 December 2016 (has links)
Many modern microprocessors implement Simultaneous Multi-Threading (SMT) to improve the overall efficiency of superscalar CPUs. SMT hides long-latency operations by executing instructions from multiple threads simultaneously. SMT may execute threads of different processes, threads of the same process, or any combination of them. Threads from the same process often execute the same instructions on different data, especially in Single-Program Multiple-Data (SPMD) applications. Traditional SMT architectures exploit thread-level parallelism and, through SIMD execution units, also support explicit data-level parallelism. SIMD execution is power-efficient because it significantly reduces the total number of instructions required to execute a program; this reduction is a function of the width of the SIMD execution units and of the vectorization efficiency. Static vectorization efficiency depends on the programmer's skill and on the compiler, and since programs are often not written with vectorization in mind, the compiler's static vectorization is frequently inefficient. In this thesis, we propose the Dynamic Inter-Thread Vectorization Architecture (DITVA) to leverage the implicit data-level parallelism in SPMD applications by assembling dynamic vector instructions at runtime. DITVA extends a SIMD-enabled in-order SMT processor with an inter-thread vectorization execution mode. When the threads run in lockstep, similar instructions across threads are dynamically vectorized to form SIMD instructions. Threads on convergent paths share an instruction stream; when all threads are on the convergent path, there is only a single instruction stream. To preserve thread-level parallelism, DITVA statically groups threads into fixed-size, independently scheduled warps. DITVA leverages existing SIMD units and maintains binary compatibility with existing CPU architectures.
45

SIMD-MIMD cocktail in a hybrid memory glass: shaken, not stirred

Zarubin, Mikhail, Damme, Patrick, Krause, Alexander, Habich, Dirk, Lehner, Wolfgang 23 November 2021 (has links)
Hybrid memory systems consisting of DRAM and NVRAM offer column-oriented data systems a great opportunity to persistently store and efficiently process columnar data entirely in main memory. While vectorization (SIMD) of query operators is state of the art for increasing single-thread performance, it has to be combined with thread-level parallelism (MIMD) to satisfy growing demands for performance and scalability. However, how such a SIMD-MIMD interplay can be leveraged efficiently in hybrid memory systems is not well investigated. On the one hand, this paper delivers an extensive experimental evaluation of typical workloads on columnar data. We reveal that the choice of the most performant SIMD version differs greatly between the two memory types. Moreover, we show that the throughput of concurrent queries can be boosted (up to 2x) by combining various SIMD flavors in a multi-threaded execution. On the other hand, to enable this optimization, we propose an adaptive SIMD-MIMD cocktail approach that incurs only negligible runtime overhead.
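The two-level interplay evaluated here can be sketched schematically: an inner data-parallel (SIMD-like) step over fixed-width chunks, nested inside an outer thread-level (MIMD) step over column partitions. The Python below mirrors only the structure, not the performance behavior (CPython threads do not execute these chunks in true parallel); the chunk width and function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # "SIMD width" of the inner, data-parallel step (illustrative)

def chunked_sum(column):
    # Inner SIMD-like flavor: process CHUNK elements per step.
    total = 0
    for i in range(0, len(column), CHUNK):
        total += sum(column[i:i + CHUNK])  # stands in for one SIMD add + reduce
    return total

def parallel_query(columns, threads=4):
    # Outer MIMD level: one worker per column partition.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(chunked_sum, columns))

columns = [list(range(100)), list(range(100, 200)), list(range(200, 300))]
assert parallel_query(columns) == sum(range(300))
```

In the paper's setting, the adaptive part would choose a different inner "flavor" (e.g. SIMD width or instruction set) per thread depending on whether the partition resides in DRAM or NVRAM; here that choice is collapsed into a single fixed chunk width for brevity.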
46

Fast Integer Compression using SIMD Instructions

Schlegel, Benjamin, Gemulla, Rainer, Lehner, Wolfgang 25 May 2022 (has links)
We study algorithms for efficient compression and decompression of a sequence of integers on modern hardware. Our focus is on universal codes in which the codeword length is a monotonically non-decreasing function of the uncompressed integer value; such codes are widely used for compressing 'small integers'. In contrast to traditional integer compression, our algorithms make use of the SIMD capabilities of modern processors by encoding multiple integer values at once. More specifically, we provide SIMD versions of both null suppression and Elias gamma encoding. Our experiments show that these versions provide a speedup from 1.5x up to 6.7x for decompression, while maintaining similar compression performance.
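Null suppression in its byte-oriented form (variable-byte coding) can be sketched as follows: each byte carries 7 payload bits, and a flag bit marks whether more bytes follow for the current value. This scalar sketch shows only the underlying idea; the paper's SIMD versions encode multiple integers at once and use a different data layout.

```python
def varint_encode(values):
    """Byte-oriented null suppression: 7 payload bits per byte, high bit set
    on every byte except the last byte of each value."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # continuation byte
            v >>= 7
        out.append(v)                      # final byte, high bit clear
    return bytes(out)

def varint_decode(data):
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7                     # more bytes follow
        else:
            values.append(cur)             # value complete
            cur, shift = 0, 0
    return values

nums = [1, 127, 128, 300, 1 << 20]
enc = varint_encode(nums)
assert varint_decode(enc) == nums
assert len(enc) == 1 + 1 + 2 + 2 + 3       # small integers take fewer bytes
```

The codeword length grows with the value, as required of the universal codes studied here; the SIMD contribution of the paper lies in decoding many such codewords per instruction rather than one byte at a time.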
47

Design and implementation of LTE-A and 5G kernel algorithms on SIMD vector processor

Guo, Jiabing January 2015 (has links)
With the widespread adoption of wireless technology, the era of 4G has arrived, and 5G will follow in the near future. Whether for 4G or 5G, however, low latency is a mandatory requirement for baseband processing at base stations under modern cellular standards. In particular, a future 5G wireless system with massive MIMO and ultra-dense cells demands low round-trip latency between the mobile device and the base station, requiring a baseband processing delay of 1 ms. This is about 10 percent of today's LTE-A round-trip latency, while massive MIMO simultaneously requires large-scale matrix computations, especially for channel estimation and MIMO detection at the base station. It is therefore essential to ensure low latency for user data traffic. This master's thesis examines LTE/LTE-A uplink physical-layer processing, especially channel estimation and MIMO detection. To analyze this processing, we compare the performance and complexity of two conventional algorithms for channel estimation and MIMO detection. The key factor limiting the algorithms' speed is identified as the need for large-scale complex matrix inversion. A parallel coding scheme is proposed to implement a matrix-inversion kernel algorithm on a single instruction, multiple data (SIMD) vector processor. The major contribution of this thesis is the implementation and evaluation of a parallel large-scale complex matrix inversion algorithm. Two aspects are addressed: the selection of an algorithm to perform this matrix computation and the implementation of a highly parallel version of it.
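A matrix-inversion kernel of this kind can be based on, for example, Gauss-Jordan elimination. The plain-Python sketch below (no pivoting, so it assumes nonzero pivots; the thesis's actual algorithm choice and SIMD mapping may differ) shows why such a kernel vectorizes well: within each column step, every row update is independent.

```python
def invert(a):
    """Invert an n x n (possibly complex) matrix via Gauss-Jordan
    elimination on the augmented matrix [A | I]. Within one column
    step, every row update is independent of the others, which is
    exactly the data parallelism a SIMD vector processor exploits."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = aug[col][col]          # assumes a nonzero pivot
        aug[col] = [x / pivot for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * p for x, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]
```

On a vector processor, the inner row updates map naturally onto SIMD lanes, while numerical robustness in practice requires pivoting, which this sketch omits.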
48

To share or not to share vector registers?

Pietrzyk, Johannes, Krause, Alexander, Habich, Dirk, Lehner, Wolfgang 04 June 2024 (has links)
Query execution techniques in database systems constantly adapt to novel hardware features to achieve high query performance, in particular for analytical queries. In recent years, vectorization based on the Single Instruction Multiple Data (SIMD) parallel paradigm has been established as a state-of-the-art approach to increasing single-query performance. However, since concurrent analytical queries often access the same columns and perform the same set of vectorized operations, data accesses and computations may be executed redundantly across queries. Various techniques have been proposed to avoid such redundancy, ranging from concurrent scans through materialized views to multiple-query optimization. Continuing this line of research, we investigate in this paper the opportunity of sharing vector registers among concurrently running analytical queries. In particular, our novel sharing approach processes data elements of different queries together within a single vector register. As we show, sharing vector registers to optimize the execution of concurrent analytical queries can be very beneficial in single-threaded as well as multi-threaded environments. We demonstrate the feasibility and applicability of this novel work-sharing strategy and thereby open up a wide spectrum of future research opportunities.
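The core idea can be illustrated with a small simulation (hypothetical lane layout and names; the paper's actual assignment and instruction sequence may differ): lanes of one vector register are divided between two concurrent filter queries, each lane is compared against its owner query's constant, and a single vector compare therefore serves both queries at once.

```python
def shared_filter(reg, lane_owner, thresholds):
    """reg: data lanes of one (simulated) vector register;
    lane_owner[i]: which query owns lane i;
    thresholds[q]: query q's filter constant.
    A single element-wise compare (the shared 'SIMD instruction')
    produces one mask, which is then demultiplexed per query."""
    limits = [thresholds[q] for q in lane_owner]   # per-lane constants
    mask = [v < t for v, t in zip(reg, limits)]    # one vector compare
    results = {q: [] for q in thresholds}
    for v, q, keep in zip(reg, lane_owner, mask):
        if keep:
            results[q].append(v)
    return results
```

The benefit comes from the shared compare and the shared data access: when both queries scan the same column, one register load and one instruction do the work that two separate executions would duplicate.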
49

Algorithm Adaptation and Optimization of a Novel DSP Vector Co-processor

Karlsson, Andréas January 2010 (has links)
The Division of Computer Engineering at Linköping University is researching the possibility of creating a highly parallel DSP platform that can keep up with the computational needs of upcoming standards for various applications, at low cost and low power consumption. The architecture, called ePUMA, combines a general RISC DSP master processor with eight SIMD co-processors on a single chip. The master processor acts as the main processor for general tasks and execution control, while the co-processors accelerate compute-intensive, parallel DSP kernels. This thesis investigates the performance potential of the co-processors by implementing matrix algebra kernels for QR decomposition, LU decomposition, matrix determinant, and matrix inverse that run on a single co-processor. The kernels are then evaluated to find possible problems in the co-processors' microarchitecture, and solutions to the identified problems are suggested. The evaluation shows that the performance potential is very good, but a few problems that cause significant overhead in the kernels have been identified. Pipeline mismatches, which occur because different instructions have different pipeline lengths, cause pipeline hazards, and the current workaround does not allow effective use of the pipeline. In some cases, the single-port memories become bottlenecks, but the thesis suggests that the situation could be greatly improved by using buffered memory write-back. Furthermore, the lack of register forwarding makes kernels with many data dependencies run unnecessarily slowly.
50

