1. Online Deduplication for Distributed Databases
Xu, Lianghong, 01 September 2016
The rate of data growth outpaces the decline of hardware costs, and there is an ever-increasing demand for reducing the storage and network overhead of online database management systems (DBMSs). The most widely used approach to data reduction in DBMSs is block-level compression. Although this method is simple and effective, it fails to address redundancy across blocks and therefore leaves significant room for improvement in many applications. This dissertation proposes a systematic approach, termed similarity-based deduplication, which reduces the amount of data stored on disk and transmitted over the network beyond the benefits provided by traditional compression schemes. To demonstrate the approach, we designed and implemented dbDedup, a lightweight record-level similarity-based deduplication engine for online DBMSs. The design of dbDedup exploits key observations about database workloads, including small item sizes, temporal locality, and the incremental nature of record updates. The proposed approach differs from traditional chunk-based deduplication in that, instead of finding identical chunks anywhere in the data corpus, similarity-based deduplication identifies a single similar data item and performs differential compression to remove the redundant parts for greater savings. To achieve high efficiency, dbDedup introduces novel encoding, caching, and similarity-selection techniques that significantly mitigate the deduplication overhead with minimal loss of compression ratio. For evaluation, we integrated dbDedup into the storage and replication components of a distributed NoSQL DBMS and analyzed its properties using four real datasets. Our results show that dbDedup achieves up to 37× reduction in the storage size and replication traffic of the database on its own, and up to 61× reduction when paired with the DBMS's block-level compression. dbDedup provides both benefits with negligible effect on DBMS throughput or client latency (average and tail).
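To make the idea concrete, the following is a minimal, hypothetical Python sketch of similarity-based deduplication: an incoming record is summarized by a small feature sketch, a single similar record is located through a feature index, and the new record is stored as a delta against that source. The names (feature_sketch, DedupIndex, delta_encode) and the use of MinHash-style sketching with difflib-based delta encoding are illustrative assumptions, not dbDedup's actual encoding, caching, or similarity-selection mechanisms.

```python
import difflib
import hashlib

def feature_sketch(record: bytes, num_features: int = 8, shingle_len: int = 16):
    """Summarize a record by the smallest hashes of its byte shingles
    (a MinHash-style sketch); similar records tend to share features."""
    shingles = (record[i:i + shingle_len]
                for i in range(max(1, len(record) - shingle_len + 1)))
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(s, digest_size=8).digest(), "big")
        for s in shingles)
    return tuple(hashes[:num_features])

class DedupIndex:
    """Maps sketch features to previously stored records (delta sources)."""
    def __init__(self):
        self.by_feature = {}   # feature -> record id
        self.store = {}        # record id -> record bytes

    def find_similar(self, sketch):
        for f in sketch:                      # any shared feature is a candidate
            rid = self.by_feature.get(f)
            if rid is not None:
                return rid, self.store[rid]   # pick a single similar source
        return None, None

    def insert(self, rid, record, sketch):
        self.store[rid] = record
        for f in sketch:
            self.by_feature.setdefault(f, rid)

def delta_encode(source: bytes, target: bytes):
    """Stand-in differential compression: keep only the ops describing how the
    target differs from the source (real engines use xdelta-style codecs)."""
    ops = difflib.SequenceMatcher(a=source, b=target).get_opcodes()
    return [(op, i1, i2, target[j1:j2]) for op, i1, i2, j1, j2 in ops if op != "equal"]

def write_record(index, rid, record: bytes):
    """Store a record either in full or as a delta against a similar record."""
    sketch = feature_sketch(record)
    src_id, src = index.find_similar(sketch)
    index.insert(rid, record, sketch)
    if src is None:
        return ("full", record)               # no similar record found
    return ("delta", src_id, delta_encode(src, record))
```

In this toy version only the non-matching opcodes of the target are retained, mirroring how differential compression stores just the parts that differ from the chosen source record.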
2. An Efficient, Extensible, Hardware-aware Indexing Kernel
Sadoghi Hamedani, Mohammad, 20 June 2014
Modern hardware has the potential to play a central role in scalable data management systems. This potential is realized in the context of indexing queries, a recurring theme in real-time data analytics, targeted advertising, algorithmic trading, and data-centric workflows, and of indexing data, a challenge in multi-version analytical query processing. To enhance both query and data indexing, this thesis presents an efficient, extensible, and hardware-aware indexing kernel. The kernel rests upon novel data structures and (parallel) algorithms that exploit the capabilities of modern hardware, especially the abundance of main memory, multi-core architectures, hardware accelerators, and solid-state drives.
The thesis first presents our query indexing techniques for processing queries in data-intensive applications subject to ever-increasing data volume and velocity. At the core of our query indexing kernel lies the BE-Tree family of memory-resident indexing structures, which scales by overcoming the curse of dimensionality through a novel two-phase space-cutting technique, effective top-k processing, and adaptive parallel algorithms that operate directly on compressed data and exploit multi-core architectures. Furthermore, we achieve line-rate processing by harnessing the degrees of parallelism and pipelining available only through low-level logic design on FPGAs. Finally, we present a comprehensive evaluation that establishes the superiority of BE-Tree over state-of-the-art algorithms.
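As a rough illustration of the two-phase space-cutting idea, the sketch below first partitions stored boolean-expression queries by a chosen attribute and then clusters them into value-range buckets within that attribute, so an incoming event only probes the buckets its values fall into. This is a deliberate simplification under assumed names (QueryIndex, BUCKET_WIDTH): the actual BE-Tree partitions and clusters adaptively and recursively, and also provides the top-k and parallel processing not shown here.

```python
from collections import defaultdict

BUCKET_WIDTH = 10   # assumed clustering granularity over the value space

class QueryIndex:
    def __init__(self):
        # Phase 1 (partitioning): queries are filed under one attribute.
        # Phase 2 (clustering): within that attribute, under value buckets.
        self.parts = defaultdict(lambda: defaultdict(list))

    def add_query(self, qid, predicates):
        """predicates: dict of attribute -> (low, high) inclusive interval."""
        # Partition on the most selective attribute (narrowest interval).
        attr, (low, high) = min(predicates.items(),
                                key=lambda kv: kv[1][1] - kv[1][0])
        # Cluster into every value bucket the interval spans.
        for bucket in range(low // BUCKET_WIDTH, high // BUCKET_WIDTH + 1):
            self.parts[attr][bucket].append((qid, predicates))

    def match(self, event):
        """event: dict of attribute -> value; returns ids of satisfied queries."""
        hits = set()
        for attr, value in event.items():
            candidates = self.parts[attr].get(value // BUCKET_WIDTH, [])
            for qid, preds in candidates:
                if all(lo <= event.get(a, lo - 1) <= hi
                       for a, (lo, hi) in preds.items()):
                    hits.add(qid)
        return hits

# Example: one targeted-advertising style subscription and a matching event.
index = QueryIndex()
index.add_query("q1", {"age": (20, 29), "income": (50, 200)})
assert index.match({"age": 25, "income": 120}) == {"q1"}
```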
In this thesis, we further expand the scope of our indexing kernel and describe how to accelerate analytical queries on (multi-version) databases by enabling indexes on the most recent data. Our goal is to reduce the overhead of index maintenance so that indexes can be used effectively for analytical queries without placing a heavy burden on transaction throughput. To this end, we re-design the data structures in the storage hierarchy to employ an extra level of indirection over solid-state drives. This indirection layer dramatically reduces the number of magnetic disk I/Os needed to update indexes and localizes index maintenance. As a result, by rethinking how data is indexed, we eliminate the tension between update and query performance and substantially reduce index-maintenance and query-processing costs.
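The following in-memory sketch, under assumed names (IndirectionTable, SecondaryIndex), illustrates the indirection idea described above: secondary indexes store stable logical record identifiers (LIDs), and an SSD-resident table maps each LID to the record's current physical location, so moving or re-versioning a record rewrites one indirection entry instead of every index that references it.

```python
class IndirectionTable:
    """Stands in for the SSD-resident LID -> physical-location mapping."""
    def __init__(self):
        self.lid_to_loc = {}

    def lookup(self, lid):
        return self.lid_to_loc[lid]

    def redirect(self, lid, new_loc):
        # One small indirection write instead of updating every index entry.
        self.lid_to_loc[lid] = new_loc

class SecondaryIndex:
    """Key -> set of LIDs; untouched when a record's physical location changes."""
    def __init__(self):
        self.entries = {}

    def insert(self, key, lid):
        self.entries.setdefault(key, set()).add(lid)

    def lookup(self, key, indirection):
        return [indirection.lookup(lid) for lid in self.entries.get(key, ())]

# A record moves (e.g., a new version is written): only the indirection
# entry changes, which localizes index maintenance.
indirection = IndirectionTable()
index = SecondaryIndex()
indirection.redirect(42, ("page", 7))
index.insert("alice", 42)
indirection.redirect(42, ("page", 91))        # record relocated / new version
assert index.lookup("alice", indirection) == [("page", 91)]
```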