AN ABSTRACT OF THE DISSERTATION OF PRAGYAN P. MOHANTY, for the Doctor of Philosophy degree in ELECTRICAL AND COMPUTER ENGINEERING, presented on June 11, 2015, at Southern Illinois University Carbondale.

TITLE: FUNCTION-BASED ALGORITHMS FOR BIOLOGICAL SEQUENCES

MAJOR PROFESSOR: Dr. Spyros Tragoudas

Two problems at two different abstraction levels of computational biology are studied. At the molecular level, efficient pattern matching algorithms for DNA sequences are presented. For gene order data, an efficient data structure is presented that is capable of storing all gene re-orderings in a systematic manner. A common characteristic of the presented methods is the use of binary decision diagrams, which store and manipulate binary functions.

Searching for a particular pattern in a very large DNA database is a fundamental and essential component of computational biology. In biology, pattern matching is required for tasks such as finding repeats in a particular DNA sequence, finding motifs, and aligning sequences. Because of the immense and continuously increasing amount of biological data, the searching process requires very fast algorithms. It also requires encoding schemes that provide efficient storage for these search processes to operate on.

With the continuous progress in genome sequencing, the study of genome rearrangements and the construction of evolutionary genome graphs, which represent the relationships between genomes, have become challenging tasks. Previous approaches are largely based on distance measures, so that phylogenetic relationships among more species can be established with specifically required rearrangement operations and hence within a certain computational time. However, because of the large volume of available data, the storage space and construction time of this evolutionary graph remain a problem. In addition, it is important to keep track of all possible rearrangement operations for a particular genome, as biological processes are uncertain.
This study presents a binary function-based tool set for efficient DNA sequence storage. A novel scalable method is also developed for fast offline pattern searches in large DNA sequences. The study further presents a method that efficiently stores all the gene sequences associated with all possible genome rearrangements, such as transpositions, and constructs the evolutionary genome structure much faster for multiple species. The developed methods benefit from the use of Boolean functions, their compact storage in canonical data structures, and the existence of built-in operators for these data structures. The time complexities depend on the size of the data structures used for storing the functions that represent the DNA sequences and/or gene sequences. It is shown that the presented approaches exhibit time complexity that is sublinear in the sequence size. The number of nodes in the DNA data structure, the string search time on these data structures, the depth of the genome graph structure, and the time of the rearrangement operations are reported. Experiments on DNA sequences from the NCBI database are conducted for the DNA sequence storage and search processes. Experiments on large gene order data sets, such as human mitochondrial data and plant chloroplast data, are also conducted, and the depth of the resulting structure is studied for evolutionary processes on gene sequences. The results show that the developed approaches are scalable.
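As a rough illustration of how a Boolean function can store sequence content, the sketch below encodes each base with two bits, builds a small reduced ordered binary decision diagram for the set of all length-k substrings of a sequence, and tests whether a length-k pattern occurs by following one path through the diagram. The two-bit encoding and the names `ENCODING`, `BDD`, `index_kmers`, and `contains` are assumptions made for this example; the abstract does not state the encoding or the tool set used in the dissertation.

```python
# Hypothetical 2-bit encoding of bases (an assumption, not taken from the dissertation).
ENCODING = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def encode(kmer):
    """Flatten a string of bases into a tuple of bits, two bits per base."""
    return tuple(bit for base in kmer for bit in ENCODING[base])


class BDD:
    """Minimal reduced ordered BDD; node ids 0 and 1 are the terminals."""

    def __init__(self):
        self.unique = {}  # (var, low, high) -> node id, so equal nodes are shared
        self.nodes = {}   # node id -> (var, low, high)

    def make(self, var, low, high):
        if low == high:  # both branches agree, so the test on var is redundant
            return low
        key = (var, low, high)
        if key not in self.unique:
            node_id = len(self.nodes) + 2
            self.unique[key] = node_id
            self.nodes[node_id] = key
        return self.unique[key]

    def from_minterms(self, minterms, num_vars, var=0):
        """Build the characteristic function of a set of equal-length bit tuples."""
        if not minterms:
            return 0
        if var == num_vars:
            return 1
        low = {m for m in minterms if m[var] == 0}
        high = {m for m in minterms if m[var] == 1}
        return self.make(
            var,
            self.from_minterms(low, num_vars, var + 1),
            self.from_minterms(high, num_vars, var + 1),
        )

    def evaluate(self, root, bits):
        """Follow one root-to-terminal path selected by the given bits."""
        node = root
        while node > 1:
            var, low, high = self.nodes[node]
            node = high if bits[var] else low
        return node == 1


def index_kmers(sequence, k):
    """Store every length-k substring of the sequence as one Boolean function."""
    bdd = BDD()
    kmers = {encode(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}
    return bdd, bdd.from_minterms(kmers, 2 * k)


def contains(bdd, root, pattern):
    """Check a pattern whose length equals the k used when building the index."""
    return bdd.evaluate(root, encode(pattern))


bdd, root = index_kmers("ACGTACGTGA", 3)
print(contains(bdd, root, "GTA"))  # True: GTA starts at index 2
print(contains(bdd, root, "GAC"))  # False: GAC never occurs
```

In this sketch a lookup follows at most 2k decision nodes, so its cost depends on the pattern length rather than on the length of the stored sequence. It is meant only to suggest why a function-based search can be sublinear in the sequence size, not to reproduce the methods developed in the dissertation.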
Identifier | oai:union.ndltd.org:siu.edu/oai:opensiuc.lib.siu.edu:dissertations-2124 |
Date | 01 December 2015 |
Creators | Mohanty, Pragyan Paramita |
Publisher | OpenSIUC |
Source Sets | Southern Illinois University Carbondale |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Dissertations |