Of the many existing eukaryotic gene finding software programs, none are able to guarantee accurate identification of genomic protein coding regions and other biological signals central to pathway from DNA to the protein. Eukaryotic gene finding is difficult mainly due to noncontiguous and non-continuous nature of genes. Existing approaches are heavily dependent on the compositional statistics of the sequences they learn from and are not equally suitable for all types of sequences. This thesis firstly develops efficient digital signal processing-based methods for the identification of genomic protein coding regions, and then combines the optimum signal processing-based non-data-driven technique with an existing data-driven statistical method in a novel system demonstrating improved identification of acceptor splice sites. Most existing well-known DNA symbolic-to-numeric representations map the DNA information into three or four numerical sequences, potentially increasing the computational requirement of the sequence analyzer. Proposed mapping schemes, to be used for signal processing-based gene and exon prediction, incorporate DNA structural properties in the representation, in addition to reducing complexity in subsequent processing. A detailed comparison of all DNA representations, in terms of computational complexity and relative accuracy for the gene and exon prediction problem, reveals the newly proposed ?paired numeric? to be the best DNA representation. Existing signal processing-based techniques rely mostly on the period-3 behaviour of exons to obtain one dimensional gene and exon prediction features, and are not well equipped to capture the complementary properties of exonic / intronic regions and deal with the background noise in detection of exons at their nucleotide levels. These issues have been addressed in this thesis, by proposing six one-dimensional and three multi-dimensional signal processing-based gene and exon prediction features. All one-dimensional and multi-dimensional features have been evaluated using standard datasets such as Burset/Guigo1996, HMR195, and the GENSCAN test set. This is the first time that different gene and exon prediction features have been compared using substantial databases and using nucleotide-level metrics. Furthermore, the first investigation of the suitability of different window sizes for period-3 exon detection is performed. Finally, the optimum signal processing-based gene and exon prediction scheme from our evaluations is combined with a data-driven statistical technique for the recognition of acceptor splice sites. The proposed DSP-statistical hybrid is shown to achieve 43% reduction in false positives over WWAM, as used in GENSCAN.
Identifer | oai:union.ndltd.org:ADTP/257969 |
Date | January 2008 |
Creators | Akhtar, Mahmood, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW |
Publisher | Publisher:University of New South Wales. Electrical Engineering & Telecommunications |
Source Sets | Australiasian Digital Theses Program |
Language | English |
Detected Language | English |
Rights | http://unsworks.unsw.edu.au/copyright, http://unsworks.unsw.edu.au/copyright |
Page generated in 0.0017 seconds