Global ETD Search

Applications and extensions of pClust to big microbial proteomic data

The goal of the biological sciences is to understand the biomolecular mechanics of living organisms. Proteins serve as the foundation for organisms' functional analysis, and sequence analysis has proven invaluable in answering questions about individual organisms. The first step in any sequence analysis is alignment, and even modestly sized studies commonly involve hundreds of thousands of protein sequences. In multigenome studies, the time required for sequence alignment becomes paramount, and heuristic algorithms are frequently used, sacrificing accuracy for speed. At the same time, new algorithms have appeared that not only perform highly efficiently but also guarantee optimal solutions. However, the adoption of these algorithms is hindered by the absence of a generalized analysis pipeline and the lack of user-friendly computational tools. In this dissertation we present applications of existing, computationally efficient algorithms to multigenome studies, applying our pClust pipeline to various sets of microbial organisms. The computational time is significantly improved and the results are more accurate than those obtained by traditional methods. The first study is a baseline comparison on a small set of 11 microorganisms. It compares pClust results to existing scientific knowledge and finds them consistent, while also providing new insights. The second study addresses the identification of common tick-transmissibility mechanisms across different species. It involves a larger set of 108 microbial genomes with approximately 127K protein sequences. Traditionally, a study of this scope would have required hours or even days of CPU time on high-performance computers to produce an all-versus-all sequence alignment. Using pClust, sequence alignment and clustering took less than 10 minutes on a desktop computer. For this study we also developed a graphical user interface for pClust to make the new algorithms more accessible to microbiologists. The third study analyzes the set of all proteobacterial genomes. It comprises 2326 complete genomes containing 8.7M protein sequences. The alignment was performed using the pGraph-Tascel algorithm on high-performance computers. This is the first study of its kind.

http://pqdtopen.proquest.com/#viewpdf?dispub=10139743

Identifier	oai:union.ndltd.org:PROQUEST/oai:pqdtoai.proquest.com:10139743
Date	19 July 2016
Creators	Lockwood, Svetlana
Publisher	Washington State University
Source Sets	ProQuest.com
Language	English
Detected Language	English
Type	thesis
