Coevolution Based Prediction Of Protein-protein Interactions With Reduced Training Data

Protein-protein interactions are important for predicting protein function, since two interacting proteins usually have similar functions in a cell. Available protein interaction networks are incomplete, but they can be used to predict new interactions in a supervised learning framework. However, when the known protein network includes a large number of protein pairs, the training time of the machine learning algorithm becomes quite long. In this thesis, our aim is to predict protein-protein interactions from a known portion of the interaction network. We used Support Vector Machines (SVM) as the machine learning algorithm, trained on the already known protein pairs in the network. We chose phylogenetic profiles of proteins to form the feature vectors required by the learner, since the evolutionary similarity of two proteins gives a reasonable indication of whether they interact. Because SVM training time becomes quite long for large data sets, we reduced the data size in a sensible way while keeping approximately the same prediction accuracy.
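
As an illustration of the phylogenetic-profile idea, here is a minimal sketch in Python. The abstract does not say how profiles are encoded or how a pair of proteins is turned into one feature vector, so everything below is an assumption: the toy proteins, the binary presence/absence encoding, the element-wise AND used for `pair_features`, and the Jaccard score used as a similarity measure are all hypothetical choices, not the thesis's method.

```python
# Hypothetical toy data: one binary profile per protein, one position per organism.
# A 1 means a homolog of the protein is assumed present in that organism.
PROFILES = {
    "protA": [1, 1, 0, 1, 0, 1],
    "protB": [1, 1, 0, 1, 0, 0],
    "protC": [0, 0, 1, 0, 1, 0],
}


def pair_features(p, q):
    """Element-wise AND of two binary profiles (one possible pair encoding)."""
    return [a & b for a, b in zip(p, q)]


def jaccard(p, q):
    """Organisms containing both proteins, divided by organisms containing either."""
    both = sum(a & b for a, b in zip(p, q))
    either = sum(a | b for a, b in zip(p, q))
    return both / either if either else 0.0


features_ab = pair_features(PROFILES["protA"], PROFILES["protB"])
similarity_ab = jaccard(PROFILES["protA"], PROFILES["protB"])
similarity_ac = jaccard(PROFILES["protA"], PROFILES["protC"])
```

In this toy data, protA and protB occur in nearly the same organisms and score higher than protA and protC, which never co-occur. A vector such as `features_ab`, labeled as interacting or non-interacting, is the kind of input a supervised learner like an SVM would be trained on.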

We applied a number of clustering techniques to extract the most representative data and features in a two-category framework. Since the training data set is a two-dimensional matrix, we applied data reduction methods in both dimensions, i.e., in data size and in feature vector size. We observed that the data clustered by the k-means clustering technique gave superior prediction accuracy compared to another data clustering algorithm that was also developed to reduce data size for SVM training. Still, the true positive and false positive rates (TPR-FPR) of the training data sets constructed by the two clustering methods did not show conclusively which method outperforms the other. In addition, we applied feature selection methods to the feature vectors of the training data, selecting the features that are most representative in a biological and in a statistical sense. We used the phylogenetic tree of organisms to identify the organisms that are evolutionarily significant. Additionally, we applied Fisher's test to select the features that are most representative statistically. The accuracy and TPR-FPR values obtained with the feature selection methods did not allow a definite conclusion in the performance comparison. However, the phylogenetic tree method resulted in acceptable prediction values when compared to Fisher's test.
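
The data-size reduction step can be sketched as follows. This is only one plausible reading of the abstract, under stated assumptions: each class is clustered separately with plain k-means (Lloyd's algorithm), and each cluster is replaced by its centroid, which inherits the class label. The function names, the toy points, and the deterministic seeding with the first k points are hypothetical; the thesis may select representatives differently.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points for repeatability."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster received no points.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def reduce_training_set(samples, labels, k_per_class):
    """Cluster each class on its own and keep one labeled centroid per cluster."""
    reduced = []
    for label in sorted(set(labels)):
        members = [s for s, y in zip(samples, labels) if y == label]
        k = min(k_per_class, len(members))
        reduced.extend((centroid, label) for centroid in kmeans(members, k))
    return reduced


# Hypothetical toy data: four non-interacting (-1) and four interacting (+1) pairs.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]
reduced = reduce_training_set(X, y, k_per_class=2)
```

Here eight training samples shrink to four labeled centroids, two per class. Clustering the classes separately keeps both categories represented in the reduced set, which then stands in for the full data when training the SVM.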

http://etd.lib.metu.edu.tr/upload/3/12610389/index.pdf

QA Computer Software 76.75-76.765

Identifier	oai:union.ndltd.org:METU/oai:etd.lib.metu.edu.tr:http://etd.lib.metu.edu.tr/upload/3/12610389/index.pdf
Date	01 February 2009
Creators	Pamuk, Bahar
Contributors	Can, Tolga
Publisher	METU
Source Sets	Middle East Technical Univ.
Language	English
Detected Language	English
Type	M.S. Thesis
Format	text/pdf
Rights	To liberate the content for public access
