Return to search

Suffix trees for very large inputs

A suffix tree is a fundamental data structure for string searching algorithms.
Unfortunately, when it comes to the use of suffix trees in real-life applications,
the current methods for constructing suffix trees do not scale for large inputs.
As suffix trees are larger than their input sequences and quickly outgrow the main memory, the first half of this work is focused on designing a practical algorithm that avoids massive random access to the
trees being built. This effort resulted in a new algorithm DiGeST which improves
significantly over previous work in reducing random access to the suffix tree and performing only two passes over disk data. As a result, this algorithm scales to larger genomic data than managed before.
All the existing practical algorithms perform random access to the input string,
thus requiring in essence that the input be small enough to be kept in main memory.
The ever increasing amount of genomic data requires however the ability to build suffix trees for much larger strings.
In the second half of this work we present another suffix tree construction algorithm, BBST that is able to construct suffix trees for input sequences significantly larger
than the size of the available main memory.
Both the input string and the suffix tree are kept on disk and the algorithm is designed to avoid multiple random I/Os to both of them.
As a proof of concept, we show that BBST allows to build a suffix tree for 12 GB of real DNA sequences in 26 hours on a single machine with 2 GB of RAM. This input is four times the size of the Human Genome.
The construction of suffix trees for inputs of such magnitude was never reported before.
Finally, we show that, after the off-line suffix tree construction is complete,
search queries on entire sequenced genomes can be performed very efficiently.
This high query performance is achieved due to a special disk layout of the suffix trees produced by our algorithms.

Identiferoai:union.ndltd.org:uvic.ca/oai:dspace.library.uvic.ca:1828/2901
Date16 July 2010
CreatorsBarsky, Marina
ContributorsThomo, Alex, Stege, Ulrike, Upton, Christopher
Source SetsUniversity of Victoria
LanguageEnglish, English
Detected LanguageEnglish
TypeThesis
RightsAvailable to the World Wide Web

Page generated in 0.0021 seconds