High-throughput genome sequencing and sequence analysis technologies have
created the need for automated annotation and analysis of large sets of genes. The
Gene Ontology (GO) provides a common controlled vocabulary for describing gene
function. However, proteins are usually annotated with GO terms through a tedious
manual curation process carried out by trained professional annotators. With the
wealth of genomic data now available, there is a need for accurate automated
annotation methods.
The overall objective of my research is to improve our ability to automatically
annotate proteins with GO terms. The first method, Automatic Annotation of Protein
Functional Class (AAPFC), learns an independent Support Vector Machine classifier
for each GO term. This approach relies only on protein functional domains as
features, and demonstrates that statistical pattern recognition can outperform
expert-curated mappings of protein functional domain features to protein functions.
The second method, Prediction of Gene Ontology (PoGO), is a meta-classification
method that integrates multiple heterogeneous data sources. It achieves better
performance than the protein domain method alone.
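The per-term setup described above can be sketched as follows. This is a minimal illustration, not the AAPFC implementation: the protein names, domain IDs, and GO IDs are placeholders, and the small linear SVM (hinge loss with L2 regularization, trained by subgradient descent, no intercept term) is an assumed stand-in for whatever SVM software and kernel the thesis actually used.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: protein -> (domain IDs, GO term IDs). All IDs are placeholders.
TOY_PROTEINS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "p1": ({"DOM_A"}, {"GO:1"}),
    "p2": ({"DOM_A", "DOM_D"}, {"GO:1"}),
    "p3": ({"DOM_B"}, {"GO:2"}),
    "p4": ({"DOM_C"}, set()),
}


def encode(domains: Set[str], vocab: List[str]) -> List[float]:
    """Binary presence/absence vector over the domain vocabulary."""
    return [1.0 if d in domains else 0.0 for d in vocab]


def train_linear_svm(X: List[List[float]], y: List[int],
                     epochs: int = 50, eta: float = 0.1,
                     lam: float = 0.01) -> List[float]:
    """Linear SVM without intercept: subgradient descent on
    lam/2 * ||w||^2 + max(0, 1 - y * <w, x>). Labels are in {-1, +1}."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Regularization step (applied on every example).
            w = [wj * (1.0 - eta * lam) for wj in w]
            # Hinge-loss step (only when the margin is violated).
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def train_per_term(proteins: Dict[str, Tuple[Set[str], Set[str]]]
                   ) -> Tuple[List[str], Dict[str, List[float]]]:
    """Train one independent binary classifier per GO term."""
    vocab = sorted({d for doms, _ in proteins.values() for d in doms})
    terms = sorted({t for _, gos in proteins.values() for t in gos})
    names = sorted(proteins)
    X = [encode(proteins[n][0], vocab) for n in names]
    models: Dict[str, List[float]] = {}
    for term in terms:
        # Proteins annotated with the term are positives; all others negatives.
        y = [1 if term in proteins[n][1] else -1 for n in names]
        models[term] = train_linear_svm(X, y)
    return vocab, models


def predict_terms(domains: Set[str], vocab: List[str],
                  models: Dict[str, List[float]]) -> Set[str]:
    """Assign every GO term whose classifier gives a positive score."""
    x = encode(domains, vocab)
    return {t for t, w in models.items()
            if sum(wj * xj for wj, xj in zip(w, x)) > 0.0}
```

Calling `train_per_term(TOY_PROTEINS)` and then `predict_terms({"DOM_A"}, vocab, models)` assigns terms to a protein from its domain content alone. Because each term has its own classifier, nothing in this sketch ties the prediction for one term to the prediction for another.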
Apart from these two methods, several systems have been developed that employ
pattern recognition to assign gene function using a variety of features, such as
sequence similarity, the presence of protein functional domains, and gene expression
patterns. Most of these approaches have not considered the hierarchical relationships
among the GO terms, which take the form of a directed acyclic graph (DAG). The DAG
represents the functional relationships between the GO terms, so it should be an
important component of an automated annotation system. I describe a Bayesian network used as
a multi-layered classifier that incorporates the relationships among GO terms found in
the GO DAG. I also describe an inference algorithm for quickly assigning GO terms
to unlabeled proteins. A comparative analysis of the method against previously
described annotation systems shows that it provides improved annotation accuracy
when performance on individual GO terms is compared. More importantly, this method
enables the assignment of significantly more GO terms to more proteins than was
previously possible.
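To illustrate how a GO DAG can constrain per-term predictions, here is a small sketch of a Bayesian network over GO terms. It is not the network or the fast inference algorithm of the thesis. It assumes that a term can be on only if all of its parents are on, that `prior[t]` is the probability that a term is on given that its parents are on, and that each per-term classifier score acts as a likelihood (score if the term is on, one minus the score otherwise). Inference is brute-force enumeration, which is exponential in the number of terms and only practical for a handful of them.

```python
from itertools import product
from typing import Dict, List


def posterior_marginals(parents: Dict[str, List[str]],
                        prior: Dict[str, float],
                        score: Dict[str, float]) -> Dict[str, float]:
    """Posterior P(term is on | classifier scores) for every term.

    parents: term -> list of parent terms (empty list for a root).
    prior:   term -> P(term on | all parents on); assumed values.
    score:   term -> classifier output in (0, 1), used as a likelihood.
    """
    terms = sorted(parents)
    totals = {t: 0.0 for t in terms}
    norm = 0.0
    # Enumerate every on/off assignment of the terms.
    for bits in product((0, 1), repeat=len(terms)):
        state = dict(zip(terms, bits))
        p = 1.0
        for t in terms:
            parents_on = all(state[q] for q in parents[t])
            # A term cannot be on unless all of its parents are on.
            p_on = prior[t] if parents_on else 0.0
            p *= p_on if state[t] else 1.0 - p_on
            # Evidence from the term's own classifier.
            p *= score[t] if state[t] else 1.0 - score[t]
        norm += p
        for t in terms:
            if state[t]:
                totals[t] += p
    return {t: totals[t] / norm for t in terms}


# Hypothetical three-term DAG: "child" has two parents under one root.
EXAMPLE_PARENTS = {"root": [], "a": ["root"], "b": ["root"], "child": ["a", "b"]}
```

In this sketch a child term can never receive a higher posterior than any of its parents, so a confident score for a specific term is tempered by weak evidence for its more general ancestors, which is one way the DAG structure can inform annotation.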
Identifier | oai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/ETD-TAMU-2008-08-41 |
Date | 16 January 2010 |
Creators | Jung, Jae |
Contributors | Thon, Michael R. |
Source Sets | Texas A and M University |
Language | en_US |
Detected Language | English |
Type | Book, Thesis, Electronic Dissertation |
Format | application/pdf |