
An Apache Hadoop Framework for Large-Scale Peptide Identification

Peptide identification is an essential step in protein identification, and Peptide Spectrum Match (PSM) data sets are huge, which makes them time-consuming to process on a single machine. In a typical run of a peptide identification method, PSMs are ranked by a cross-correlation, a statistical score, or a likelihood that the match between the experimental and theoretical spectra is correct and unique. This process takes a long time to execute, and better performance is needed to handle large peptide data sets. Distributed frameworks are needed to reduce the processing time, but this comes at the price of complexity in developing and executing them. In distributed computing, a program can be divided into multiple parts for execution. The work in this thesis describes the implementation of an Apache Hadoop framework for large-scale peptide identification using C-Ranker. The Apache Hadoop data processing software operates in a complex environment composed of massive machine clusters, large data sets, and several processing jobs. The framework uses the Apache Hadoop Distributed File System (HDFS) to store the peptide data and Apache MapReduce to process it. The proposed framework uses a peptide processing algorithm named C-Ranker, which takes peptide data as input and identifies the correct PSMs. The framework has two steps: execute the C-Ranker algorithm on a Hadoop cluster, then compare the correct PSMs generated by the Hadoop approach with those from the normal execution of C-Ranker. The goal of this framework is to process large peptide data sets using the Apache Hadoop distributed approach.
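
The second step, comparing the PSMs identified on the Hadoop cluster with those from the normal execution of C-Ranker, can be illustrated with a minimal sketch. It is not the thesis's code: the function name compare_psm_sets, the use of plain string identifiers for PSMs, and the Jaccard overlap measure are all assumptions made for illustration.

```python
def compare_psm_sets(hadoop_ids, baseline_ids):
    """Summarize agreement between two collections of PSM identifiers.

    hadoop_ids and baseline_ids are hypothetical inputs: any iterables of
    hashable identifiers for the PSMs that each run reported as correct.
    """
    hadoop = set(hadoop_ids)
    baseline = set(baseline_ids)
    shared = hadoop & baseline
    union = hadoop | baseline
    return {
        "shared": len(shared),                    # PSMs reported by both runs
        "only_hadoop": len(hadoop - baseline),    # reported only by the Hadoop run
        "only_baseline": len(baseline - hadoop),  # reported only by the single-machine run
        # Jaccard overlap; two empty result sets are treated as identical.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    # Made-up identifiers, used only to show the call.
    print(compare_psm_sets(["psm1", "psm2", "psm3"], ["psm2", "psm3", "psm4"]))
```

A Jaccard value of 1.0 would indicate that both runs identified exactly the same PSMs; lower values point to differences between the two executions.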

Identifier: oai:union.ndltd.org:WKU/oai:digitalcommons.wku.edu:theses-2531
Date: 01 July 2015
Creators: Donepudi, Harinivesh
Publisher: TopSCHOLAR®
Source Sets: Western Kentucky University Theses
Detected Language: English
Type: text
Format: application/pdf
Source: Masters Theses & Specialist Projects
