Return to search

Efficient Computation of Regularities in Strings and Applications

<p> Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies. Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns, that is, repeats, multirepeats, repetitions, runs and others. Regularities in the form of repeating substrings were the basis of one of the earliest and still widely used compression algorithms and remain central in more recent approaches. Repeats and repetitions of lengthy substrings in DNA and protein sequences are important markers in biological research. </p> <p> A large proportion of the available algorithms for computing regularities in strings depends on the prior computation of a suffix tree, or, more recently, of a suffix array. The design of new algorithms for computing regularities should emphasize conceptual simplicity, as well as both time and space efficiency.</p> <p> In this thesis, we investigate mathematical and algorithmical aspects of the computation of regularities in strings.</p> <p> The first part of the thesis is the development of space and time efficient nonextendible (NE) and supernonextendible (SNE) repeats algorithms RPT, shown to be more efficient than previous methods based on tests using different real data sets. In particular, we describe four variants of a new fast algorithm RPT1 that, based on suffix array construction, computes all the complete NE repeats in a given string x whose length (period) p ≥ pmin, where pmin ≥ 1 is a user-specified minimum. RPT1 uses 5n bytes of space directly, but requires the LCP array, whose construction needs 6n bytes. The variants RPT1-3 and RPT1-4 execute in O(n) time independent of alphabet size and are faster than the two other algorithms previously proposed for this problem. To provide a basis of comparison for RPT1, we also describe a straightforward algorithm RPT2 that computes complete NE repeats without any recourse to suffix arrays and whose total space requirement is only 5n bytes; however, this algorithm is slower than RPT1. Furthermore, we describe new fast algorithms RPT3 for computing all complete SNE repeats in x. Of these, RPT3-2 executes in O(n) time independent of alphabet size, thus asymptotically faster than the methods previously proposed. We conclude with a brief discussion of applications to bioinformatics and data compression.</p> <p> The second part of the thesis deals with the issue of finding the NE multirepeats in a set of N strings of average length n under various constraints. A multirepeat is a repeat that occurs at least m times (m ≥ 2) in each of at least q ≥ 1 strings in a given set of strings. We show that RPT1 can be extended to locate the multirepeats based on the investigation of the properties of the multirepeats and various strategies. We describe algorithms to find complete NE multirepeats, first with no restriction on "gap length" (that is, the gap between occurrences of the multirepeat), then with bounded gaps. For the first problem, we propose two algorithms with worst-case time complexities O(Nn+αlog2N) and O(Nn+α) that use 9Nn and 10Nn bytes of space, respectively, where α is the alphabet size. For the second problem, we describe an algorithm with worst-case time complexity O(RNn) that requires approximately 10Nn bytes, where R is the number of multirepeats output. We remark that if we set the min and max constraints on gaps equal to zero in this algorithm, we can find all repetitions (tandem repeats) in arbitrary subsets of a given set. We demonstrate that our algorithms are faster, more flexible and much more space efficient than algorithms recently proposed for this problem.</p> <p> Finally, the third part of the thesis provides a convenient framework for comparing the LZ factorization algorithms which are used in the computation of regularities in strings rather than in the traditional application to text compression. LZ factorization is the computational bottleneck in numerous string processing algorithms, especially in regularity studies, such as computing repetitions, runs, repeats with fixed gap, branching repeats, sequence alignment, local periods, and data compression. Since 1977, when Ziv and Lempel described a kind of string factorization useful for data compression, there has been a succession of algorithms proposed for computing "LZ factorization". In particular, there have been several recent algorithms proposed that extend the usefulness of LZ factorization, especially to the computation of runs in a string x. We choose these algorithms and analyze each algorithm separately, and remark on them by comparing some of their important aspects, for example, additional space required and handling mechanism. We also address their output format differences and some special features. We then provide a complete theoretical comparison of their time and space efficiency. We conduct intensive testing on both time and space performance and analyze the results carefully to draw conclusions in which situations these algorithms perform best. We believe that our investigation and analysis will be very useful for researchers in their choice of the proper LZ factorization algorithms to deal with the problems related to computation of the regularities in strings.</p> / Thesis / Doctor of Philosophy (PhD)

Identiferoai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/17358
Date08 1900
CreatorsYusufu, Munina
ContributorsSmyth, William F., Computing and Software
Source SetsMcMaster University
Languageen_US
Detected LanguageEnglish
TypeThesis

Page generated in 0.0102 seconds