Return to search

Repeats in Strings and Application in Bioinformatics

A string is a sequence of symbols, usually called letters, drawn from some alphabet.
It is one of the most fundamental and important structures in computing, bioinformatics and mathematics. Computer files, contents of a computer memory, network
and satellite signals are all instances of strings. The genome of every living thing
can be represented by a string drawn from the alphabet {a, c, g, t}. The algorithms
processing strings have a wide range of applications such as information retrieval,
search engines, data compression, cryptography and bioinformatics. In a DNA sequence the indeterminate symbol {a, c} is used when it is unclear whether a given nucleotide is a or c, We could then say that {a, c} matches
another symbol {c, g} which in turn matches {g, t}, but {a, c} certainly does not
match {g, t}. The processing of indeterminate strings is much more difficult because
of this nontransitivity of matching. Thus a combinatorial understanding of indeterminate strings becomes essential to the development of efficient methods for their
processing. With indeterminate strings, as with ordinary ones, the main task is the
recognition/computation of patterns called regularities . We are particularly interested in regularities called repeats, whether tandem such as acgacg or nontandem
(acgtacg). In this thesis we focus on newly-discovered regularities in strings, especially the enhanced cover array and the Lyndon array, with attention paid to extending the
computations to indeterminate strings. Much of this work is necessarily abstract in
nature, because the intention is to produce results that are applicable over a wide
range of application areas. We will focus on finding algorithms to construct different
data structures to represent strings such as cover arrays and Lyndon arrays. The
idea of cover comes from strings which are not truly periodic but "almost" periodic
in nature. For example abaababa is covered by aba but is not periodic. Similarly the
Lyndon array describes the string in another unique way and is used in many fields of
string algorithms. These data structures will help us in the field of string processing.
As one application of these data structures we will work on "Reverse Engineering";
that is, given data structures derived from of a string, how can we get the string back. Since DNA, RNA and peptide sequences are effectively "strings" with unique
properties, we will adapt our algorithms for regular or indeterminate strings to these
sequences. Sequence analysis can be used to assign function to genes and proteins
by observing the similarities between the compared sequences. Identifying unusual
repetitive patterns will aid in the identification of intrinsic features of the sequence
such as active sites, gene-structures and regulatory elements. As an application of
periodic strings we investigate microsatellites which are short repetitive DNA patterns where repeated substrings are of length 2 to 5. Microsatellites are used in a
wide range of studies due to their small size and repetitive nature, and they have
played an important role in the identification of numerous important genetic loci. A
deeper understanding of the evolutionary and mutational properties of microsatellites
is needed, not only to understand how the genome is organized, but also to correctly
interpret and use microsatellite data in population genetics studies. / Thesis / Doctor of Philosophy (PhD)

Identiferoai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/22018
Date11 1900
CreatorsIslam, A S M Sohidull
ContributorsSmyth, William F, Golding, Brian, Computational Engineering and Science
Source SetsMcMaster University
LanguageEnglish
Detected LanguageEnglish
TypeThesis

Page generated in 0.002 seconds