Make Larger Vector Register Sizes New Challenges?: Lessons Learned from the Area of Vectorized Lightweight Compression Algorithms
Habich, Dirk; Damme, Patrick; Ungethüm, Annett; Lehner, Wolfgang (15 September 2022)
The exploitation of data as well as hardware properties is a core aspect of efficient data management. This holds in particular for the field of in-memory data processing. Aside from increasing main memory capacities, in-memory data processing also benefits from novel processing concepts based on lightweight compressed data. To speed up compression as well as decompression, an active research field deals with the specialization of these algorithms to hardware features such as vectorization using SIMD instructions. Most of the vectorized implementations have been proposed for 128-bit vector registers. However, hardware vendors continue to increase vector register sizes, and a straightforward transformation to these wider vector sizes is possible in most cases. Thus, we systematically investigated the impact of different SIMD instruction set extensions with wider vector sizes on the behavior of straightforwardly transformed implementations. In this paper, we describe our evaluation methodology and present selected results of our exhaustive evaluation. In particular, we highlight some challenges and present first approaches to tackle them.
Low-Rank Tensor Approximation in post Hartree-Fock Methods
Benedikt, Udo (24 February 2014)
In this thesis, the application of novel tensor decomposition and tensor representation techniques in highly accurate post Hartree-Fock methods is evaluated. These representation techniques can help to overcome the steep scaling behaviour of high-level ab-initio calculations with increasing system size and therefore break the "curse of dimensionality". After a comparison of various tensor formats, the application of the "canonical polyadic" (CP) format is described in detail. In particular, the casting of a normal, index-based tensor into the CP format (tensor decomposition) and a method for a low-rank approximation (rank reduction) of the two-electron integrals in the AO basis are investigated. The decisive quantity for the applicability of the CP format is the scaling of the rank with increasing system and basis set size. The memory requirements and the computational effort for tensor manipulations in the CP format are only linear in the number of dimensions but still depend on the expansion length (rank) of the approximation. Furthermore, the AO-MO transformation and an MP2 algorithm with decomposed tensors in the CP format are evaluated, and the scaling with increasing system and basis set size is investigated. Finally, a Coupled-Cluster algorithm based only on low-rank CP representations of the MO integrals is developed. Here, the successive tensor contraction during the iterative solution of the amplitude equations and the error propagation upon repeated application of the reduction procedure are discussed in particular. In conclusion, the overall complexity of a Coupled-Cluster procedure with tensors in CP format is evaluated, and some possible improvements of the rank reduction procedure tailored to the needs of electronic structure calculations are shown.
/ The present thesis deals with the application of novel tensor decomposition and tensor representation techniques in highly accurate post Hartree-Fock methods, in order to reduce the steep scaling of these methods with increasing system size and thus break the "curse of dimensionality". After a comparative examination of various representation formats, the application of the "canonical polyadic" (CP) format is discussed in detail. The focus is first on the conversion of a normal, index-based tensor into the CP format (tensor decomposition) and on a method of low-rank approximation (rank reduction) for two-electron integrals in the AO basis. The decisive quantity for the applicability is the scaling of the rank with increasing system and basis set size, since the memory requirements and the computational cost of tensor manipulations in the CP format depend only linearly on the number of dimensions of the tensor but also scale with the expansion length (rank). Subsequently, the AO-MO transformation and the MP2 algorithm with decomposed tensors in the CP format are discussed, and the scaling with increasing system and basis set size is examined again. Finally, a Coupled-Cluster algorithm is presented that works exclusively with tensors in a low-rank CP representation. In particular, the successive tensor contraction during the iterative determination of the amplitudes is addressed, and the error propagation caused by applying the rank reduction algorithm is analyzed. In conclusion, the complexity of the overall procedure is assessed and possible improvements of the reduction procedure are pointed out.
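For orientation, the CP format mentioned in this abstract can be written down in its generic textbook form. The symbols below ($T$, $R$, $d$, $n$, the factor vectors $v^{(\mu)}_r$, and the factor matrices $a, b, c, e$) are notation assumed here for illustration and are not taken from the thesis.

```latex
% Rank-R canonical polyadic (CP) representation of an order-d tensor
% T with mode sizes n_1, ..., n_d (generic form, notation assumed):
\[
  T_{i_1 i_2 \dots i_d} \;\approx\; \sum_{r=1}^{R} \prod_{\mu=1}^{d} v^{(\mu)}_{r,\, i_\mu}
\]
% Special case d = 4, e.g. two-electron integrals in the AO basis:
\[
  (\mu\nu|\lambda\sigma) \;\approx\; \sum_{r=1}^{R}
    a_{\mu r}\, b_{\nu r}\, c_{\lambda r}\, e_{\sigma r}
\]
% Storage with all mode sizes equal to n:
% full tensor n^d entries, CP representation d * n * R entries.
\[
  \underbrace{n^{d}}_{\text{index-based}}
  \quad\text{versus}\quad
  \underbrace{d \, n \, R}_{\text{CP format}}
\]
```

The last line makes the statement in the abstract concrete: the cost is linear in the number of dimensions $d$, so the benefit depends entirely on how the rank $R$ grows with system and basis set size.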
From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms
Damme, Patrick; Ungethüm, Annett; Hildebrandt, Juliana; Habich, Dirk; Lehner, Wolfgang (11 January 2023)
Lightweight integer compression algorithms are frequently applied in in-memory database systems to tackle the growing gap between processor speed and main memory bandwidth. In recent years, the vectorization of basic techniques such as delta coding and null suppression has considerably enlarged the corpus of available algorithms. As a result, today there is a large number of algorithms to choose from, while different algorithms are tailored to different data characteristics. However, a comparative evaluation of these algorithms with different data and hardware characteristics has never been sufficiently conducted in the literature. To close this gap, we conducted an exhaustive experimental survey by evaluating several state-of-the-art lightweight integer compression algorithms as well as cascades of basic techniques. We systematically investigated the influence of data as well as hardware properties on the performance and the compression rates. The evaluated algorithms are based on publicly available implementations as well as our own vectorized reimplementations. We summarize our experimental findings leading to several new insights and to the conclusion that there is no single-best algorithm. Moreover, in this article, we also introduce and evaluate a novel cost model for the selection of a suitable lightweight integer compression algorithm for a given dataset.
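As a rough illustration of the two basic techniques named in this abstract, the following sketch cascades delta coding with a simple form of null suppression (fixed-width bit packing). It is a scalar, non-vectorized sketch and not one of the implementations evaluated in the article. It assumes a non-decreasing sequence of non-negative integers, and all function names are hypothetical.

```python
def delta_encode(values):
    """Delta coding: replace each value by the difference to its predecessor."""
    prev = 0
    deltas = []
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    total = 0
    values = []
    for d in deltas:
        total += d
        values.append(total)
    return values


def pack(values):
    """Null suppression (assumed variant): drop leading zero bits by storing
    every value with the bit width of the largest value in the block."""
    width = max(1, max(values, default=0).bit_length())
    word = 0
    for i, v in enumerate(values):
        if v < 0:
            raise ValueError("only non-negative integers can be packed")
        word |= v << (i * width)
    n_bytes = (len(values) * width + 7) // 8
    return width, len(values), word.to_bytes(n_bytes, "little")


def unpack(width, count, data):
    """Inverse of pack: extract `count` values of `width` bits each."""
    word = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]
```

A round trip would look like `delta_decode(unpack(*pack(delta_encode(values))))`. The single block-wide bit width is the simplest possible choice; it also hints at why no algorithm is best everywhere, since one large delta inflates the width for the whole block.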
Compressed Decision Problems in Groups / Komprimierte Entscheidungsprobleme in Gruppen
Haubold, Niko (19 March 2012)
We are concerned with problems of algorithmic group theory and investigate the complexity of compressed versions of the word problem and the conjugacy problem for finitely generated groups.
For a fixed, finitely generated group, the word problem asks whether a given word over the generating set represents the identity element of the group. We, however, consider the given word in a compressed form, as a straight-line program (SLP), and investigate the complexity of this problem, which we call the 'compressed word problem'. SLPs are context-free grammars that generate exactly one string. The input size is always the size of the given SLP. A main motivation is that, for a fixed finitely generated group, the word problem of its automorphism group can be reduced in polynomial time by a Turing machine to the compressed word problem of the group itself.
We investigate the compressed word problem for widely used group extensions, namely HNN extensions (as well as amalgamated products and graph products), and can show that instances of the compressed word problem can be reduced in polynomial time by a Turing machine to instances of the compressed word problem of the base group (respectively, of the base groups and vertex groups). Furthermore, we show that the compressed word problem for finitely generated nilpotent groups is decidable by a Turing machine in polynomial time.
We also consider a compressed variant of the conjugacy problem. The uncompressed conjugacy problem asks, for two given words over the generators of a fixed finitely generated group, whether they are conjugate in this group. In the compressed conjugacy problem, the input consists of two SLPs, and the question is whether the two words generated by the SLPs represent conjugate elements of the group. We were able to show that the compressed conjugacy problem for graph groups can be decided in polynomial time.
Furthermore, we investigated the word problem of the outer automorphism groups of graph products of finitely generated groups. Owing to the close connection between the compressed conjugacy problem of a group and the word problem of its outer automorphism group, we were able to show that the word problem of the outer automorphism group of a graph product of finitely generated groups can be reduced in polynomial time by a Turing machine to instances of simultaneous compressed conjugacy problems of the vertex groups and instances of compressed word problems of the vertex groups.
As an application, the above results also hold for right-angled Coxeter groups and graph groups, since both are special graph products. It follows, for example, that the compressed word problem of a right-angled Coxeter group is decidable in polynomial time.
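To make the notion of a straight-line program from this abstract more tangible, here is a minimal sketch for the simplest possible case, the infinite cyclic group Z with generator `a` and inverse `A`. It is not one of the algorithms of the thesis. In Z, a word represents the identity exactly when its exponent sum is zero, and that sum can be computed rule by rule without ever expanding the (possibly exponentially long) word. The grammar encoding and all function names are assumptions made for illustration.

```python
from functools import lru_cache


def evaluate(rules, start, weight):
    """Sum `weight(terminal)` over the word generated from `start`.

    `rules` maps each nonterminal to a tuple of symbols; any symbol that is
    not a key of `rules` is treated as a terminal. Memoization ensures that
    every nonterminal is evaluated once, so the running time is linear in
    the size of the SLP, not in the length of the generated word.
    """
    @lru_cache(maxsize=None)
    def value(symbol):
        if symbol not in rules:
            return weight(symbol)
        return sum(value(s) for s in rules[symbol])

    return value(start)


def word_length(rules, start):
    """Length of the generated word, computed on the compressed form."""
    return evaluate(rules, start, lambda terminal: 1)


def is_identity_in_Z(rules, start):
    """Compressed word problem for Z: is the exponent sum zero?"""
    exponent = {"a": 1, "A": -1}
    return evaluate(rules, start, lambda terminal: exponent[terminal]) == 0


def doubling_rules(prefix, letter, n):
    """SLP with n + 1 rules whose last nonterminal generates letter^(2^n)."""
    rules = {f"{prefix}0": (letter,)}
    for i in range(1, n + 1):
        rules[f"{prefix}{i}"] = (f"{prefix}{i - 1}", f"{prefix}{i - 1}")
    return rules


# Example: S generates a^(2^50) followed by A^(2^50), using only 103 rules.
rules = {
    **doubling_rules("X", "a", 50),
    **doubling_rules("Y", "A", 50),
    "S": ("X50", "Y50"),
}
```

The example shows why the input size is taken to be the size of the SLP: 103 rules describe a word of length 2^51, so any algorithm that first decompresses the word cannot run in polynomial time.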
Frequent itemset mining on multiprocessor systems
Schlegel, Benjamin (08 May 2014)
Frequent itemset mining is an important building block in many data mining applications such as market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads as well as other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism.
In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their costs by providing parallel encodings that achieve high throughput for both tasks. For a memory-efficient representation of the mining algorithms’ intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data’s size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined.
To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even when run single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they further scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining other types of itemsets.
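Building block (4) of this abstract, intersecting lists of sorted integers, can be sketched as follows. This is a plain sequential merge intersection, not the parallel or vectorized solution of the thesis. It assumes a vertical data layout in which each item maps to the sorted list of transaction ids containing it; the names `tidlists`, `intersect_sorted`, and `support` are hypothetical.

```python
def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of integers."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            common.append(a[i])
            i += 1
            j += 1
    return common


def support(tidlists, itemset):
    """Number of transactions that contain every item of a non-empty itemset.

    Items are processed from the shortest to the longest id list so that
    intermediate results stay as small as possible.
    """
    items = sorted(itemset, key=lambda item: len(tidlists[item]))
    common = tidlists[items[0]]
    for item in items[1:]:
        common = intersect_sorted(common, tidlists[item])
    return len(common)


# Toy dataset: five transactions with ids 0..4.
tidlists = {
    "bread": [0, 1, 3, 4],
    "milk": [1, 2, 3],
    "eggs": [3, 4],
}
```

An itemset counts as frequent when its support reaches a user-chosen threshold; with the toy data, `support(tidlists, {"bread", "milk"})` counts the transactions 1 and 3.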
