An Apache Hadoop Framework for Large-Scale Peptide Identification

Peptide identification is an essential step in protein identification. Peptide Spectrum Match (PSM) data sets are huge, and processing them on a single machine is time-consuming. In a typical run of a peptide identification method, PSMs are ranked by a cross-correlation, a statistical score, or a likelihood that the match between the experimental and theoretical spectra is correct and unique. This process takes a long time to execute, and higher performance is needed to handle large peptide data sets. Distributed frameworks can reduce the processing time, but at the price of added complexity in developing and executing them. In distributed computing, a program may be divided into multiple parts that are executed separately. This thesis describes the implementation of an Apache Hadoop framework for large-scale peptide identification using C-Ranker. Apache Hadoop data processing software operates in a complex environment composed of massive machine clusters, large data sets, and many processing jobs. The framework uses the Apache Hadoop Distributed File System (HDFS) to store the peptide data and Apache MapReduce to process it. It uses a peptide processing algorithm named C-Ranker, which takes peptide data as input and identifies the correct PSMs. The framework has two steps: execute the C-Ranker algorithm on a Hadoop cluster, and compare the correct PSMs generated by the Hadoop approach with those generated by the normal execution of C-Ranker. The goal of this framework is to process large peptide data sets using the Apache Hadoop distributed approach.
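The two steps named in the abstract (partitioned execution, then comparison against a normal run) can be illustrated with a minimal sketch in plain Python. This is not the thesis implementation and does not use Hadoop or C-Ranker: the `PSM` record, the `select_psms` threshold rule, and all function names are hypothetical placeholders chosen only to show the map, reduce, and compare pattern.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class PSM(NamedTuple):
    psm_id: str   # hypothetical identifier for a peptide-spectrum match
    score: float  # hypothetical match score, e.g. a cross-correlation


def select_psms(psms: Iterable[PSM], threshold: float = 2.0) -> Set[str]:
    """Placeholder selection rule (assumption); NOT the C-Ranker algorithm."""
    return {p.psm_id for p in psms if p.score >= threshold}


def split(psms: List[PSM], n_parts: int) -> List[List[PSM]]:
    """Divide the input into n_parts chunks, standing in for input splits."""
    return [psms[i::n_parts] for i in range(n_parts)]


def run_partitioned(psms: List[PSM], n_parts: int) -> Set[str]:
    """Map: select PSMs in each chunk independently. Reduce: union the results."""
    partial = [select_psms(chunk) for chunk in split(psms, n_parts)]  # map
    return set().union(*partial)                                      # reduce


def compare(distributed: Set[str], single: Set[str]) -> Dict[str, int]:
    """Count PSMs selected by both runs and by only one of them."""
    return {
        "both": len(distributed & single),
        "only_distributed": len(distributed - single),
        "only_single": len(single - distributed),
    }


# Made-up example data, for illustration only.
psms = [PSM("a", 3.1), PSM("b", 1.2), PSM("c", 2.0), PSM("d", 0.4), PSM("e", 5.0)]

single_result = select_psms(psms)                  # normal, single-machine run
distributed_result = run_partitioned(psms, 3)      # partitioned run
report = compare(distributed_result, single_result)
```

Because the placeholder rule judges each PSM independently, the two runs agree by construction. A method that learns from the data in each partition may select different PSMs than a run over the whole data set, which is why a comparison step is useful.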

MapReduce

CRanker

Peptide Spectrum Match

PSM

Computer Sciences

OS and Networks

Identifier	oai:union.ndltd.org:WKU/oai:digitalcommons.wku.edu:theses-2531
Date	01 July 2015
Creators	Donepudi, Harinivesh
Publisher	TopSCHOLAR®
Source Sets	Western Kentucky University Theses
Detected Language	English
Type	text
Format	application/pdf
Source	Masters Theses & Specialist Projects
