Automatic Protein Function Annotation Through Text Mining

Knowledge of a protein's function is essential to many studies in molecular biology, genetic experiments and protein-protein interactions. The Gene Ontology (GO) captures the functions of gene products in classes and establishes relationships between them. Manually annotating proteins with GO functions from the biomedical literature is a tedious process which calls for automation. We develop a novel, dictionary-based method to annotate proteins with functions from text. We extract text-based features from words matched against a dictionary of GO. Since a class is included whenever any word matches its class description, negative samples outnumber positive ones. To mitigate this imbalance, we apply strict rules before weakly labeling the dataset according to the curated annotations. Furthermore, we discard samples of low statistical evidence and train a logistic regression classifier. In a 5-fold cross-validation, the best-performing fold reached a precision of 91% and an accuracy of 96%, while the worst fold reached a precision of 80% and an accuracy of 95%. We conclude by explaining how this method can be used for similar annotation problems.
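
The dictionary matching and weak labeling steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the three-entry dictionary, the protein identifier, the example sentence, the coverage feature and the function names (`tokenize`, `candidate_classes`, `weak_label`) are all assumptions made for illustration, and the filtering rules and the logistic regression classifier are left out.

```python
import re

# Hypothetical toy dictionary: GO class ID -> class description.
GO_DICTIONARY = {
    "GO:0003677": "DNA binding",
    "GO:0016301": "kinase activity",
    "GO:0005215": "transporter activity",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_classes(text, dictionary):
    """Return every class whose description shares at least one word with
    the text, mapped to the fraction of description words that matched
    (an assumed, illustrative text-based feature)."""
    words = tokenize(text)
    candidates = {}
    for go_id, description in dictionary.items():
        desc_words = tokenize(description)
        matched = words & desc_words
        if matched:
            candidates[go_id] = len(matched) / len(desc_words)
    return candidates


def weak_label(protein, candidates, curated):
    """Label a candidate 1 if the curated annotations contain the
    (protein, class) pair, and 0 otherwise."""
    known = curated.get(protein, set())
    return [
        (protein, go_id, coverage, int(go_id in known))
        for go_id, coverage in sorted(candidates.items())
    ]


# Hypothetical curated annotation and sentence.
curated = {"P12345": {"GO:0016301"}}
text = "The protein shows kinase activity and weak binding to the membrane."

samples = weak_label("P12345", candidate_classes(text, GO_DICTIONARY), curated)
for sample in samples:
    print(sample)
```

In this toy run a single sentence yields three candidate classes, of which only one agrees with the curated annotation. This is the kind of imbalance between negative and positive samples that the abstract refers to.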

Identifier: oai:union.ndltd.org:kaust.edu.sa/oai:repository.kaust.edu.sa:10754/656601
Date: 25 August 2019
Creators: Toonsi, Sumyyah
Contributors: Hoehndorf, Robert; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Moshkov, Mikhail; Bajic, Vladimir B.
Source Sets: King Abdullah University of Science and Technology
Language: English
Detected Language: English
Type: Thesis
