1 |
Search-Optimized Disk Layouts For Suffix-Tree Genomic IndexesBhavsar, Rajul D 08 1900 (has links) (PDF)
Over the last decade, biological sequence repositories have been growing at an exponential rate. Sophisticated indexing techniques are required to facilitate efficient searching through these humongous genetic repositories. A particularly attractive index structure for such sequence processing is the classical suffix-tree, a vertically compressed trie structure built over the set of all suffixes of a sequence. Its attractiveness stems from its linearity properties -- suffix-tree construction times are linear in the size of the indexed sequences, while search times are linear in the size of the query strings.
In practice, however, the promise of suffix-trees is not realized for extremely long sequences, such as the human genome, that run into the billions of characters. This is because suffix-trees, which are typically an order of magnitude larger than the indexed sequence, necessarily have to be disk-resident for such elongated sequences, and their traditional construction and traversal algorithms result in random disk accesses. We investigate, in this thesis, post-construction techniques for disk-based suffix-tree storage optimization, with the objective of maximizing disk-reference locality during query processing. We begin by focusing on the layout reorganization in which the node-to-block assignments and sequence of blocks are reworked. Our proposed algorithm is based on combining the breadth-first layout approach advocated in the recent literature with probabilistic techniques for minimizing the physical distance between successive block accesses, based on an analysis of node traversal patterns. In our next step, we consider techniques for reducing the space overheads incurred by suffix-trees. In particular, we propose an embedding strategy whereby leaf nodes can be completely represented within their parent internal nodes, without requiring any space extension of the parent node's structure.
To quantitatively evaluate the benefits of our reorganized and restructured layouts, we have conducted extensive experiments on complete human genome sequences, with complex and computationally expensive user queries that involve finding the maximal common substring matches of the query strings.
We show, for the first time, that the layout reorganization approach can be scaled to entire genomes, including the human genome. In the layout reorganization, with careful choice of node-to-block assignment condition and optimized sequence of blocks, search-time improvements ranging from 25% to 75% can be achieved with respect to the construction layouts on such genomes. While the layout reorganization does take considerable time, it is a one-time process whereas searches will be repeatedly invoked on this index. The internalization of leaf nodes results in a 25% reduction in the suffix-tree space occupancy. More importantly, when applied to the construction layout, it provides search-time improvements ranging from 25% to 85%, and in conjunction with the reorganized layout, searches are speeded up by 50% to 90%. Overall, our study and experimental results indicate that through careful choice of node implementations and layouts, the disk access locality of suffix-trees can be improved to the extent that upto an order-of-magnitude improvements in search-times may result relative to the classical implementations.
|
2 |
Temporal pattern recognition in noisy non-stationary time series based on quantization into symbolic streams. Lessons learned from financial volatility trading.Tino, Peter, Schittenkopf, Christian, Dorffner, Georg January 2000 (has links) (PDF)
In this paper we investigate the potential of the analysis of noisy non-stationary time series by quantizing it into streams of discrete symbols and applying finite-memory symbolic predictors. The main argument is that careful quantization can reduce the noise in the time series to make model estimation more amenable given limited numbers of samples that can be drawn due to the non-stationarity in the time series. As a main application area we study the use of such an analysis in a realistic setting involving financial forecasting and trading. In particular, using historical data, we simulate the trading of straddles on the financial indexes DAX and FTSE 100 on a daily basis, based on predictions of the daily volatility differences in the underlying indexes. We propose a parametric, data-driven quantization scheme which transforms temporal patterns in the series of daily volatility changes into grammatical and statistical patterns in the corresponding symbolic streams. As symbolic predictors operating on the quantized streams we use the classical fixed-order Markov models, variable memory length Markov models and a novel variation of fractal-based predictors introduced in its original form in (Tino, 2000b). The fractal-based predictors are designed to efficiently use deep memory. We compare the symbolic models with continuous techniques such as time-delay neural networks with continuous and categorical outputs, and GARCH models. Our experiments strongly suggest that the robust information reduction achieved by quantizing the real-valued time series is highly beneficial. To deal with non-stationarity in financial daily time series, we propose two techniques that combine ``sophisticated" models fitted on the training data with a fixed set of simple-minded symbolic predictors not using older (and potentially misleading) data in the training set. Experimental results show that by quantizing the volatility differences and then using symbolic predictive models, market makers can generate a statistically significant excess profit. However, with respect to our prediction and trading techniques, the option market on the DAX does seem to be efficient for traders and non-members of the stock exchange. There is a potential for traders to make an excess profit on the FTSE 100. We also mention some interesting observations regarding the memory structure in the studied series of daily volatility differences. (author's abstract) / Series: Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
|
3 |
Suffix Trees for Document RetrievalReck, Ryan 01 June 2012 (has links)
This thesis presents a look at the suitability of Suffix Trees for full text indexing and retrieval. Typically suffix trees are built on a character level, where the tree records which characters follow each other character. By building suffix trees for documents based on words instead of characters, the resulting tree effectively indexes every word or sequence of words that occur in any of the documents. Ukkonnen's algorithm is adapted to build word-level suffix trees. But the primary focus is on developing Algorithms for searching the suffix tree for exact and approximate, or fuzzy, matches to arbitrary query strings. A proof-of-concept implementation is built and compared to a Lucene index for retrieval over a subset of the Reuters RCV1 data set.
|
4 |
FINDING TEMPORAL ASSOCIATION RULES BETWEEN FREQUENT PATTERNS IN MULTIVARIATE TIME SERIESTATAVARTY, GIRIDHAR 03 April 2006 (has links)
No description available.
|
5 |
Towards a Human Genomic Coevolution NetworkSavel, Daniel M. 04 June 2018 (has links)
No description available.
|
6 |
Přibližné vyhledávání řetězců v předzpracovaných dokumentech / Approximate String Matching in Preprocessed DocumentsToth, Róbert January 2014 (has links)
This thesis deals with the problem of approximate string matching, also called string matching allowing errors. The thesis targets the area of offline algorithms, which allows very fast pattern matching thanks to index created during initial text preprocessing phase. Initially, we will define the problem itself and demonstrate variety of its applications, followed by short survey of different approaches to cope with this problem. Several existing algorithms based on suffix trees will be explained in detail and new hybrid algorithm will be proposed. Algorithms wil be implemented in C programming language and thoroughly compared in series of experiments with focus on newly presented algorithm.
|
Page generated in 0.0315 seconds