
Computational Representation Of Protein Sequences For Homology Detection And Classification

Machine learning techniques have been widely used for classification problems in computational biology. They require the input to be a collection of fixed-length feature vectors. Since proteins vary in length, there is a need for a
means of representing protein sequences by a fixed number of features. This thesis
introduces three novel methods for this purpose: n-peptide compositions with
reduced alphabets, pairwise similarity scores by maximal unique matches, and
pairwise similarity scores by probabilistic suffix trees.
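
As a rough illustration of the first idea, the sketch below maps a protein sequence of any length to a fixed-length vector of n-peptide frequencies over a reduced alphabet. It is a minimal sketch, not the thesis's implementation: the four-group reduction in GROUPS, the function name npeptide_composition, and the choice to skip unknown residues are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical 4-group reduction of the 20 standard amino acids.
# This grouping is illustrative only; it is not taken from the thesis.
GROUPS = {
    "H": "AVLIMFWC",  # hydrophobic
    "P": "STNQYGP",   # polar or small
    "B": "KRH",       # basic
    "A": "DE",        # acidic
}
REDUCE = {aa: group for group, members in GROUPS.items() for aa in members}


def npeptide_composition(seq, n=2):
    """Return n-peptide frequencies over the reduced alphabet.

    The output always has len(GROUPS) ** n entries, regardless of
    the length of the input sequence.
    """
    letters = sorted(GROUPS)
    index = {"".join(p): i for i, p in enumerate(product(letters, repeat=n))}
    # Residues outside the 20 standard letters (e.g. 'X') are skipped,
    # so n-peptides may span the gap they leave (an assumption of this sketch).
    reduced = [REDUCE[aa] for aa in seq.upper() if aa in REDUCE]
    counts = [0] * len(index)
    for i in range(len(reduced) - n + 1):
        counts[index["".join(reduced[i:i + n])]] += 1
    total = sum(counts)
    # Sequences shorter than n yield an all-zero vector.
    return [c / total for c in counts] if total else [0.0] * len(counts)


short_vec = npeptide_composition("MKV")
long_vec = npeptide_composition("MKVLAAGIVGLDESTRKHNQYGPWCF")
print(len(short_vec), len(long_vec))
```

With four groups and n = 2, every sequence is described by 16 features, so sequences of different lengths can be passed to the same classifier. A smaller alphabet keeps the vector short as n grows, at the cost of merging residues that the grouping treats as equivalent.
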
The new sequence representations described in the thesis are applied to three
challenging problems of computational biology: remote homology detection,
subcellular localization prediction, and solvent accessibility prediction, with some
problem-specific modifications. Rigorous experiments are conducted on common
benchmarking datasets, and a comparative analysis is performed between the new
methods and the existing ones for each problem.
In remote homology detection tests, all three methods achieve accuracies
competitive with state-of-the-art methods while being much more efficient. A
combination of the new representations is used to devise a hybrid system, called
PredLOC, for predicting the subcellular localization of proteins, and it is tested on two
distinct eukaryotic datasets. To the best of the author's knowledge, the accuracy
achieved by PredLOC is the highest ever reported on those datasets. The
maximal unique match method yields only a slight improvement in
solvent accessibility predictions.

Identifier: oai:union.ndltd.org:METU/oai:etd.lib.metu.edu.tr:http://etd.lib.metu.edu.tr/upload/12606997/index.pdf
Date: 01 January 2006
Creators: Ogul, Hasan
Contributors: Mumcuoglu, Unal Erkan
Publisher: METU
Source Sets: Middle East Technical Univ.
Language: English
Detected Language: English
Type: Ph.D. Thesis
Format: text/pdf
Rights: To liberate the content for public access
