Using sequence similarity to predict the function of biological sequences.

In this thesis we examine issues surrounding the development of software that predicts the function of biological sequences using sequence similarity. Because sequence data is growing exponentially, there is a pressing need for high-throughput software that can annotate protein or DNA sequences with functional information. In Chapter 1 we briefly introduce the molecular biology and bioinformatics that are assumed knowledge, and the objectives of the research presented here. In Chapter 2 we discuss the development of a method for comparing competing designs for software annotators, using precision and recall metrics and a benchmark method referred to as Best BLAST. From this we conclude that data-mining approaches may be useful in the development of annotation algorithms, and that any new annotator should demonstrate its effectiveness against other approaches before being adopted. As any new annotator that utilises sequence similarity to predict the function of a sequence will rely on the quality of existing annotations, we examine the error rate of existing sequence annotations in Chapter 3. We develop a new method that allows annotation error rates to be estimated. This involves adding annotation errors at known rates to a sample of reference sequence annotations that were found to be similar to query sequences. The precision at each error rate treatment is determined, and linear regression is then used to find the error rate at estimated values for the maximum precision possible, given assumptions concerning the impact of semantic variation on precision. We found that the error rate of curated annotations based on sequence similarity (ISS) is far higher than that of annotations that use other forms of evidence (49% versus 13-18%, respectively). As such, we conclude that software annotators should avoid basing predictions on ISS annotations where possible.
In Chapter 4 we detail the development of GOSLING, Gene Ontology Similarity Listing using Information Graphs, a software annotator with a design based on the principles discovered in previous chapters. Chapter 5 concludes the thesis by discussing the major findings from the research presented. / http://library.adelaide.edu.au/cgi-bin/Pwebrecon.cgi?BBID=1280882 / Thesis (M.Sc.(M&CS)) -- School of Computer Science, 2007
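
The error-rate estimation outlined in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes that precision falls linearly as errors are added, and that the existing error rate is the distance by which the fitted line must be extrapolated back (to a negative added-error rate) to reach the assumed maximum precision. The function names (precision_recall, fit_line, estimate_error_rate) and the numbers in the usage lines are hypothetical and are not taken from the thesis.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall for one sequence's predicted terms."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_error_rate(added_rates, precisions, max_precision):
    """Estimate the pre-existing annotation error rate.

    Assumption (ours, for illustration): precision is linear in the added
    error rate, so the added rate at which the fitted line reaches
    max_precision is negative, and its magnitude is the existing error rate.
    """
    slope, intercept = fit_line(added_rates, precisions)
    added_rate_at_max = (max_precision - intercept) / slope
    return -added_rate_at_max


if __name__ == "__main__":
    # Synthetic, made-up treatment data purely to exercise the functions.
    added = [0.0, 0.1, 0.2, 0.3]
    measured = [0.60, 0.55, 0.50, 0.45]
    print(estimate_error_rate(added, measured, max_precision=0.70))
    print(precision_recall({"GO:1", "GO:2"}, {"GO:2", "GO:3"}))
```

The choice of max_precision carries the abstract's "assumptions concerning the impact of semantic variation on precision": different assumed maxima give different error-rate estimates from the same fitted line.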

http://hdl.handle.net/2440/40403

bioinformatics, computer science

Identifier	oai:union.ndltd.org:ADTP/264389
Date	January 2007
Creators	Jones, Craig E.
Source Sets	Australasian Digital Theses Program
Language	en_US
Detected Language	English
