Global ETD Search


Evaluation of Annotation Performances between Automated and Curated Databases of E.COLI Using the Correlation Coefficient

This project evaluated how well the correlation coefficient shows similarities in annotations between a predictive automated bacterial annotation database and the curated EcoCyc database. EcoCyc is a conservative multidimensional annotation system that is based exclusively on experimentally validated findings from over 15,000 publications. The automated annotation system used in the comparison was BASys. It is often used as a first-pass annotation tool that tries to add as many annotations as possible by drawing upon over 30 information sources. Gene ontology served as only one basis of comparison between these databases because the ontology annotations have few terms in common. Translation libraries were used to extend the number of BASys terms that could be compared to the gene ontology terms in EcoCyc. Additional non-ontology terms and metadata in BASys were compared to EcoCyc terms after parsing them into root words. The different term sources were compared quantitatively, using the correlation coefficient as the evaluation metric. The direct gene ontology comparison gave the lowest correlation coefficient. Adding gene ontology terms to BASys through translation tables of metadata greatly increased the correlation coefficient, to a level comparable to that of the parsed word comparison. The combination of the enhanced gene ontology and parsed word methods gave the highest correlation coefficient, 0.16.
The controlled vocabulary system of gene ontology alone was not sufficient to compare two annotated databases. The addition of gene ontology terms from translation libraries greatly increased the performance of these comparisons. In general, as the number of comparison terms increased, the correlation coefficient increased. Future comparisons should include the enhanced gene ontology dataset in order to monitor the organization pertaining to formal nomenclature, and the datasets generated from word parsing can be used to monitor the degree to which additional terms might be incorporated with translation libraries.
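
As a rough illustration of the kind of comparison described above, the following Python sketch scores two annotation sources with a correlation coefficient. The abstract does not say which form of the coefficient the thesis used, so this sketch assumes the phi coefficient (Pearson correlation of two 0/1 vectors) computed over every gene and term pair. The function names, gene names, and terms are hypothetical and are not taken from EcoCyc or BASys.

```python
from math import sqrt


def phi_coefficient(a, b):
    """Pearson correlation of two equal-length 0/1 vectors (phi coefficient)."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)          # term present in both
    n10 = sum(1 for x, y in zip(a, b) if x and not y)      # present only in a
    n01 = sum(1 for x, y in zip(a, b) if not x and y)      # present only in b
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)  # absent from both
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Assumption: report 0.0 when the coefficient is undefined (a constant vector).
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def annotation_correlation(db_a, db_b):
    """Correlate term presence across two {gene: set_of_terms} mappings."""
    genes = sorted(set(db_a) | set(db_b))
    vocab = sorted(set().union(*db_a.values(), *db_b.values()))
    # One 0/1 entry per (gene, term) pair, in the same order for both sources.
    vec_a = [int(t in db_a.get(g, set())) for g in genes for t in vocab]
    vec_b = [int(t in db_b.get(g, set())) for g in genes for t in vocab]
    return phi_coefficient(vec_a, vec_b)


# Hypothetical toy data standing in for curated and automated annotations.
curated = {"geneA": {"transport", "membrane"}, "geneB": {"kinase"}}
automated = {"geneA": {"transport", "binding"}, "geneB": {"kinase", "membrane"}}

score = annotation_correlation(curated, automated)
print(f"correlation coefficient: {score:.3f}")
```

In this framing, adding translated or parsed terms changes the vocabulary and the presence vectors, which is one way the coefficient could shift as more comparison terms become available.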

computational biology

genome biology

Genomics

Molecular genetics

Identifier	oai:union.ndltd.org:WKU/oai:digitalcommons.wku.edu:theses-1094
Date	01 August 2009
Creators	Marpuri, ReddySalilaja
Publisher	TopSCHOLAR®
Source Sets	Western Kentucky University Theses
Detected Language	English
Type	text
Format	application/pdf
Source	Masters Theses & Specialist Projects
