Global ETD Search

Return to search

Compilation Techniques, Algorithms, and Data Structures for Efficient and Expressive Data Processing Systems

<pre>The proliferation of digital data, driven by factors like social media, e-commerce, etc., has created an increasing demand for highly processed data at higher levels of fidelity, which puts increasing demands on modern data processing systems. In the past, data processing systems faced bottlenecks due to limited main memory availability. However, as main memory becomes more abundant, their optimization focus has shifted from disk I/O to optimized computation through techniques like compilation. This dissertation addresses several critical limitations within such compilation-based data processing systems. In modern data analytics pipelines, combination of workloads from various paradigms, such as traditional DBMS and Machine Learning, is common. These pipelines are typically managed by specialized systems designed for specific workload types. While these specialized systems optimize their individual performance, substantial performance loss occurs when they are combined to handle mixed workloads. This loss is mainly due to overheads at system boundaries, including data copying and format conversions, as well as the general inability to perform cross-system optimizations. This dissertation tackles this problem in two angles. First, it proposes an efficient post-hoc integration of individual systems using generative programming via the construction of common intermediate layers. This approach preserves the best-of-breed performance of individual workloads while achieving state-of-the-art performance for combined workloads. Second, we introduce a high-level query language capable of expressing various workload types, acting as a general substrate to implement combined workloads. This allows the generation of optimized code for end-to-end workloads through the construction of an intermediate representation (IR). The dissertation then shifts focus to data processing systems used for incremental view maintenance (IVM). While existing IVM systems achieve high performance through compilation and novel algorithms, they have limitations in handling specific query classes. Notably, they are incapable of handling queries involving correlated nested aggregate subqueries. To address this, our work proposes a novel indexing scheme based on a new data structure and a corresponding set of algorithms that fully incrementalize such queries. This approach result in substantial asymptotic speedups and order-of-magnitude performance improvements for workloads of practical importance. Finally, the dissertation explores efficient and expressive fixed-point computations, with a focus on Datalog--a language widely used for declarative program analysis. Although existing Datalog engines rely on compilation and specialized code generation to achieve performance, they lack the flexibility to support extensions required for complex program analysis. Our work introduces a new Datalog engine built using generative programming techniques that offers both flexibility and state-of-the-art performance through specialized code generation.</pre>

10.25394/pgs.24674109.v1

Database systems

Programming languages

Compilation.

Data structures.

Datalog

Incremental Computing

Machine Learning

Query Compilation

Program Analysis

Declarative Program Analysis

Declarative Query Languages

Identifer	oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/24674109
Date	30 November 2023
Creators	Supun Madusha Bandara Abeysinghe Tennakoon Mudiyanselage (17454786)
Source Sets	Purdue University
Detected Language	English
Type	Text, Thesis
Rights	CC BY 4.0
Relation	https://figshare.com/articles/thesis/Compilation_Techniques_Algorithms_and_Data_Structures_for_Efficient_and_Expressive_Data_Processing_Systems/24674109

Page generated in 0.0055 seconds

Compilation Techniques, Algorithms, and Data Structures for Efficient and Expressive Data Processing Systems

Description

Links & Downloads

Tags

Additional Fields