An Extensible and Scalable Pilot-MapReduce Framework for Data Intensive Applications on Distributed Cyberinfrastructure

The volume and complexity of data that must be analyzed in scientific applications are increasing exponentially. Often, this data is distributed; thus, approaches that analyze data by first localizing it will yield limited returns. Efficient processing of large distributed datasets is therefore required, ideally without introducing fundamentally new programming models or methods. For example, it is desirable to extend MapReduce, a proven and effective programming model for processing large datasets, to work more effectively on distributed data and on different infrastructure (such as non-Hadoop, general-purpose clusters). We posit that this can be achieved with an effective and efficient runtime environment and without refactoring MapReduce itself. MapReduce on distributed data requires effective distributed coordination of computation (map and reduce) and data, as well as distributed data management (in particular the transfer of intermediate data units). To address these requirements, we design and implement Pilot-MapReduce (PMR), a flexible, infrastructure-independent runtime environment for MapReduce. PMR is based on Pilot abstractions for both compute (Pilot-Jobs) and data (Pilot-Data): it uses Pilot-Jobs to couple the map phase computation to the nearby source data, and Pilot-Data to move intermediate data to the reduce computation phase using parallel data transfers. We analyze the effectiveness of PMR over applications with different characteristics (e.g., different volumes of intermediate and output data). Our experimental evaluations show that the Pilot abstraction for data movement across multiple clusters is promising and can lower the total execution time of a MapReduce run. We also investigate the performance of PMR with distributed data using a Word Count and a genome sequencing application over different MapReduce configurations.
By comparing and contrasting the PMR approach with similar capabilities of Seqal and Crossbow, two Next Generation Sequencing (NGS) applications based on Hadoop MapReduce, we find that PMR is a viable tool to support distributed NGS analytics. Our experiments show that PMR provides the desired flexibility in the deployment and configuration of MapReduce runs to address specific application characteristics and achieve optimal performance, both locally and across multiple wide-area clusters.
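
The data flow described in the abstract (map tasks running next to their source data, intermediate data partitioned and moved to the reduce phase) can be illustrated with a minimal, single-process Word Count sketch. This is not the PMR implementation and does not use the Pilot-Job or Pilot-Data APIs; the names `map_chunk`, `partition`, `reduce_bucket`, and `word_count`, the site labels, and the thread pool standing in for distributed execution are all assumptions made for illustration.

```python
import zlib
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(text):
    """Map task: count the words in one chunk of input text."""
    return Counter(text.lower().split())


def partition(counts, num_reducers):
    """Split one map output into per-reducer buckets with a stable key hash."""
    buckets = [dict() for _ in range(num_reducers)]
    for word, n in counts.items():
        idx = zlib.crc32(word.encode("utf-8")) % num_reducers
        buckets[idx][word] = n
    return buckets


def reduce_bucket(partials):
    """Reduce task: sum the partial counts routed to one reducer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(chunks_by_site, num_reducers=2):
    """Run map, shuffle, and reduce over chunks grouped by (hypothetical) site."""
    # Map phase: one task per chunk. In a distributed run, each task would be
    # scheduled on the site that already holds its chunk; here a thread pool
    # merely stands in for that placement.
    chunks = [c for site_chunks in chunks_by_site.values() for c in site_chunks]
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_chunk, chunks))

    # Shuffle phase: route each intermediate bucket to its reducer. This is
    # the step where intermediate data would be transferred between clusters.
    shuffled = defaultdict(list)
    for output in map_outputs:
        for idx, bucket in enumerate(partition(output, num_reducers)):
            shuffled[idx].append(bucket)

    # Reduce phase: each reducer owns a disjoint set of words, so the
    # per-reducer results can simply be merged.
    result = Counter()
    for idx in range(num_reducers):
        result.update(reduce_bucket(shuffled[idx]))
    return dict(result)


if __name__ == "__main__":
    data = {"site_a": ["the cat sat", "the dog"], "site_b": ["the cat ran"]}
    print(word_count(data, num_reducers=3))
```

In this sketch the shuffle step is the only point at which data crosses site boundaries, which is why the volume of intermediate data matters for the total run time of a distributed MapReduce job.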

Identifier: oai:union.ndltd.org:LSU/oai:etd.lsu.edu:etd-06282012-165206
Date: 12 July 2012
Creators: Mantha, Pradeep Kumar
Contributors: Jha, Shantenu; Allen, Gabrielle; Hall, Randall
Publisher: LSU
Source Sets: Louisiana State University
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.lsu.edu/docs/available/etd-06282012-165206/
Rights: unrestricted. I hereby certify that, if appropriate, I have obtained and attached herein a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to LSU or its agents the non-exclusive license to archive and make accessible, under the conditions specified below and in appropriate University policies, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.
