Data-Intensive Biocomputing in the Cloud

Next-generation sequencing (NGS) technologies have made it possible to rapidly sequence the human genome, heralding a new era of health-care innovations based on personalized genetic information. However, these NGS technologies generate data at a rate that far outstrips Moore's Law. As a consequence, analyzing this exponentially increasing data deluge requires enormous computational and storage resources, resources that many life science institutions do not have access to. As such, cloud computing has emerged as an obvious, but still nascent, solution.

This thesis investigates and designs an efficient framework for running and managing large-scale, data-intensive scientific applications in the cloud. Based on lessons learned from our parallel implementation of a genome analysis pipeline in the cloud, we aim to provide a framework for users to run such data-intensive scientific workflows using a hybrid setup of client and cloud resources. We first present SeqInCloud, our highly scalable parallel implementation of a popular genetic variant pipeline called the Genome Analysis Toolkit (GATK), on the Windows Azure HDInsight cloud platform. Together with a parallel implementation of GATK on Hadoop, we evaluate the potential of using cloud computing for large-scale DNA analysis and present a detailed study on efficiently utilizing cloud resources for running data-intensive life-science applications. Based on our experience running SeqInCloud on Azure, we present CloudFlow, a feature-rich workflow manager for running MapReduce-based bioinformatics pipelines utilizing both client and cloud resources. CloudFlow, built on top of an existing MapReduce-based workflow manager called Cloudgene, provides features that are not offered by existing MapReduce-based workflow managers, such as simultaneous use of client and cloud resources, automatic data-dependency handling between client and cloud resources, and the flexibility to implement user-defined plugins for data transformations. In general, our work aims to increase the adoption of cloud resources for running data-intensive scientific workloads. / Master of Science

Cloud Computing

Next Generation Sequencing

MapReduce

GATK

Workflow

Identifier	oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/23847
Date	25 September 2013
Creators	Meeramohideen Mohamed, Nabeel
Contributors	Computer Science, Feng, Wu-chun, Butt, Ali R., Lin, Heshan
Publisher	Virginia Tech
Source Sets	Virginia Tech Theses and Dissertation
Detected Language	English
Type	Thesis
Format	ETD, application/pdf
Rights	In Copyright, http://rightsstatements.org/vocab/InC/1.0/
