Return to search

Optimization for big joins and recursive query evaluation using intersection and difference filters in MapReduce

The information technology community has created unprecedented amount of data through large-scale applications. As a result, the Big Data is considered as gold mines of information that just wait for the processing power to be available, reliable, and apt at evaluating complex analytic algorithms. MapReduce is one of the most popular programming models designed to support such processing. It has become a standard for processing, analyzing and generating large data in a massively parallel manner. However, the MapReduce programming model suffers from severe limitations of operations beyond simple scan/grouping, particularly operations with multiple inputs. In the present dissertation we efficiently investigate and optimize the evaluation, in a MapReduce environment, of one of the most salient and representative such operations: Join. It focuses not only on two-way joins, but also complex joins such as multi-way joins and recursive joins. To achieve these objectives, we first devise a new type of filter called intersection filter using a probabilistic model to represent an approximation of the set intersection. The intersection filter is then applied to two-way join operations to eliminate most non-joining elements in input datasets before sending data to actual join processing. In addition, we make an extension of the intersection filter to improve the performance of three-way joins and chain joins including both cyclic chain joins with many shared join keys. We use the Lagrangian multiplier method to indicate a good choice between our optimized solutions for the multi-way joins. Another important proposal is a difference filter, which is a probabilistic data structure designed to represent a set and examine disjoint elements of the set. It can be applied to a wide range of popular problems such as reconciliation, deduplication, error-correction, especially a recursive join operation. A recursive join using the difference filter is implemented as an iteration of one join job instead of two jobs including a join job and a difference job. This improvement will significantly reduce the number of executed jobs by half, and the related overheads such as data rescanning, intermediate data, and communication for the deduplication and difference operations. Besides, this research also improves the general semi-naive algorithm, as well as the evaluation of recursive queries in MapReduce. We then provide general cost models for two-way joins, multi-way joins, and recursive joins. Thanks to these cost models, we can make comparisons of the join algorithms more persuasive. As a result, with using the proposed filters, the join operations can minimize disk I/O and communication costs. Moreover, the intersection filter-based join operations are demonstrated to be more efficient than existing solutions through experimental evaluations. Experimental comparisons of different algorithms for joins are examined with respect to intermediate data amount, the total output amount, the total execution time, and especially task timelines. Finally, our improvements on the join operations contribute to the global scene of optimizing data management for MapReduce applications on large-scale distributed infrastructures.

Identiferoai:union.ndltd.org:CCSD/oai:tel.archives-ouvertes.fr:tel-01066612
Date07 July 2014
CreatorsPhan, Thuong-Cang
PublisherUniversité Blaise Pascal - Clermont-Ferrand II
Source SetsCCSD theses-EN-ligne, France
LanguageEnglish
Detected LanguageEnglish
TypePhD thesis

Page generated in 0.0017 seconds