Global ETD Search

1	Indexing Compressed Text He, Meng January 2003 (has links) As a result of the rapid growth of the volume of electronic data, text compression and indexing techniques are receiving more and more attention. These two issues are usually treated as independent problems, but approaches of combining them have recently attracted the attention of researchers. In this thesis, we review and test some of the more effective and some of the more theoretically interesting techniques. Various compression and indexing techniques are presented, and we also present two compressed text indices. Based on these techniques, we implement an compressed full-text index, so that compressed texts can be indexed to support fast queries without decompressing the whole texts. The experiments show that our index is compact and supports fast search. Computer Science text compression text indexing
2	Indexing Compressed Text He, Meng January 2003 (has links) As a result of the rapid growth of the volume of electronic data, text compression and indexing techniques are receiving more and more attention. These two issues are usually treated as independent problems, but approaches of combining them have recently attracted the attention of researchers. In this thesis, we review and test some of the more effective and some of the more theoretically interesting techniques. Various compression and indexing techniques are presented, and we also present two compressed text indices. Based on these techniques, we implement an compressed full-text index, so that compressed texts can be indexed to support fast queries without decompressing the whole texts. The experiments show that our index is compact and supports fast search. Computer Science text compression text indexing
3	Succinct Data Structures Gupta, Ankur 14 December 2007 (has links) The world is drowning in data. The recent explosion of web publishing, XML data, bioinformation, scientific data, image data, geographical map data, and even email communications has put a strain on our ability to manage the information contained there. In general, the influx of massive data sets for all kinds of data present a number of difficulties with storage, organization of information, and data accessibility. A primary computing challenge in these cases is how to compress the data but still allow it to be queried quickly.In real-life situations, many instances of data are highly compressible, presenting a major opportunity for space savings. In mobile applications, such savings are critical, since space and the power to access information are at a premium. In a streaming environment, where new data are being generated constantly, compression can aid in prediction as well. In the case of bioinformatics, understanding succinct representations of DNA sequences could lead to a more fundamental understanding of the nature of our own "data stream," perhaps even giving hints on secondary and tertiary structure, gene evolution, and other important topics.In this thesis, we focus our attention on the important problem of <i>compressed text indexing<\i>, where the goal is to compress a text document and allow arbitrary searching for patterns in the best possible time <i>without first decompressing the text<\i>. We develop a number of compressed data structures that either solve this problem directly, or are used as smaller components of an overall text indexing solution. Each component has a number of applications beyond text indexing as well. For each structure, we provide a theoretical study of its space usage and query performance on a suite of operations crucial to access the stored data. In each case, we relate its space usage to the <i>compressed size of the original data</i> and show that the supported operations function in near-optimal or optimal time. We also present a number of experimental results that validate our theoretical findings, showing that our methodology is competitive with the state-of-the-art. / Dissertation Computer Science compression text indexing dictionaries dynamic optimal
4	Efficient Index Maintenance for Text Databases Lester, Nicholas, nml@cs.rmit.edu.au January 2006 (has links) All practical text search systems use inverted indexes to quickly resolve user queries. Offline index construction algorithms, where queries are not accepted during construction, have been the subject of much prior research. As a result, current techniques can invert virtually unlimited amounts of text in limited main memory, making efficient use of both time and disk space. However, these algorithms assume that the collection does not change during the use of the index. This thesis examines the task of index maintenance, the problem of adapting an inverted index to reflect changes in the collection it describes. Existing approaches to index maintenance are discussed, including proposed optimisations. We present analysis and empirical evidence suggesting that existing maintenance algorithms either scale poorly to large collections, or significantly degrade query resolution speed. In addition, we propose a new strategy for index maintenance that trades a strictly controlled amount of querying efficiency for greatly increased maintenance speed and scalability. Analysis and empirical results are presented that show that this new algorithm is a useful trade-off between indexing and querying efficiency. In scenarios described in Chapter 7, the use of the new maintenance algorithm reduces the time required to construct an index to under one sixth of the time taken by algorithms that maintain contiguous inverted lists. In addition to work on index maintenance, we present a new technique for accumulator pruning during ranked query evaluation, as well as providing evidence that existing approaches are unsatisfactory for collections of large size. Accumulator pruning is a key problem in both querying efficiency and overall text search system efficiency. Existing approaches either fail to bound the memory footprint required for query evaluation, or suffer loss of retrieval accuracy. In contrast, the new pruning algorithm can be used to limit the memory footprint of ranked query evaluation, and in our experiments gives retrieval accuracy not worse than previous alternatives. The results presented in this thesis are validated with robust experiments, which utilise collections of significant size, containing real data, and tested using appropriate numbers of real queries. The techniques presented in this thesis allow information retrieval applications to efficiently index and search changing collections, a task that has been historically problematic. text indexing search engines index construction index update accumulator pruning
5	Upper and Lower Bounds for Text Upper and Lower Bounds for Text Indexing Data Structures Golynski, Alexander 10 December 2007 (has links) The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices which occupies minute space (FM-indexes) and wavelet trees. These data structures also can be used to represent labeled trees and posting lists. Labeled trees are applied in XML documents, and posting lists in search engines. The main emphasis of this thesis is on lower bounds for time-space tradeoffs for the following problems: the rank/select problem, the problem of representing a string of balanced parentheses, the text retrieval problem, the problem of computing a permutation and its inverse, and the problem of representing a binary relation. These results are divided in two groups: lower bounds in the cell probe model and lower bounds in the indexing model. The cell probe model is the most natural and widely accepted framework for studying data structures. In this model, we are concerned with the total space used by a data structure and the total number of accesses (probes) it performs to memory, while computation is free of charge. The indexing model imposes an additional restriction on the storage: the object in question must be stored in its raw form together with a small index that facilitates an efficient implementation of a given set of queries, e.g. finding rank, select, matching parenthesis, or an occurrence of a given pattern in a given text (for the text retrieval problem). We propose a new technique for proving lower bounds in the indexing model and use it to obtain lower bounds for the rank/select problem and the balanced parentheses problem. We also improve the existing techniques of Demaine and Lopez-Ortiz using compression and present stronger lower bounds for the text retrieval problem in the indexing model. The most important result of this thesis is a new technique for cell probe lower bounds. We demonstrate its strength by proving new lower bounds for the problem of representing permutations, the text retrieval problem, and the problem of representing binary relations. (Previously, there were no non-trivial results known for these problems.) In addition, we note that the lower bounds for the permutations problem and the binary relations problem are tight for a wide range of parameters, e.g. the running time of queries, the size and density of the relation. theoretical computer science data structures lower bounds text indexing rank/select problem representation of permutations Computer Science
6	Upper and Lower Bounds for Text Upper and Lower Bounds for Text Indexing Data Structures Golynski, Alexander 10 December 2007 (has links) The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices which occupies minute space (FM-indexes) and wavelet trees. These data structures also can be used to represent labeled trees and posting lists. Labeled trees are applied in XML documents, and posting lists in search engines. The main emphasis of this thesis is on lower bounds for time-space tradeoffs for the following problems: the rank/select problem, the problem of representing a string of balanced parentheses, the text retrieval problem, the problem of computing a permutation and its inverse, and the problem of representing a binary relation. These results are divided in two groups: lower bounds in the cell probe model and lower bounds in the indexing model. The cell probe model is the most natural and widely accepted framework for studying data structures. In this model, we are concerned with the total space used by a data structure and the total number of accesses (probes) it performs to memory, while computation is free of charge. The indexing model imposes an additional restriction on the storage: the object in question must be stored in its raw form together with a small index that facilitates an efficient implementation of a given set of queries, e.g. finding rank, select, matching parenthesis, or an occurrence of a given pattern in a given text (for the text retrieval problem). We propose a new technique for proving lower bounds in the indexing model and use it to obtain lower bounds for the rank/select problem and the balanced parentheses problem. We also improve the existing techniques of Demaine and Lopez-Ortiz using compression and present stronger lower bounds for the text retrieval problem in the indexing model. The most important result of this thesis is a new technique for cell probe lower bounds. We demonstrate its strength by proving new lower bounds for the problem of representing permutations, the text retrieval problem, and the problem of representing binary relations. (Previously, there were no non-trivial results known for these problems.) In addition, we note that the lower bounds for the permutations problem and the binary relations problem are tight for a wide range of parameters, e.g. the running time of queries, the size and density of the relation. theoretical computer science data structures lower bounds text indexing rank/select problem representation of permutations Computer Science
7	Přibližné vyhledávání řetězců v předzpracovaných dokumentech / Approximate String Matching in Preprocessed Documents Toth, Róbert January 2014 (has links) This thesis deals with the problem of approximate string matching, also called string matching allowing errors. The thesis targets the area of offline algorithms, which allows very fast pattern matching thanks to index created during initial text preprocessing phase. Initially, we will define the problem itself and demonstrate variety of its applications, followed by short survey of different approaches to cope with this problem. Several existing algorithms based on suffix trees will be explained in detail and new hybrid algorithm will be proposed. Algorithms wil be implemented in C programming language and thoroughly compared in series of experiments with focus on newly presented algorithm.
8	Representações cache eficientes para índices baseados em Wavelet trees SILVA, Israel Batista Freitas da 12 December 2016 (has links) Submitted by Rafael Santana (rafael.silvasantana@ufpe.br) on 2017-08-30T19:22:34Z No. of bitstreams: 2 license_rdf: 811 bytes, checksum: e39d27027a6cc9cb039ad269a5db8e34 (MD5) Israel Batista Freitas da Silva.pdf: 1433243 bytes, checksum: 5b1ac5501cae385e4811343e1426e6c9 (MD5) / Made available in DSpace on 2017-08-30T19:22:34Z (GMT). No. of bitstreams: 2 license_rdf: 811 bytes, checksum: e39d27027a6cc9cb039ad269a5db8e34 (MD5) Israel Batista Freitas da Silva.pdf: 1433243 bytes, checksum: 5b1ac5501cae385e4811343e1426e6c9 (MD5) Previous issue date: 2016-12-12 / CNPQ, FACEPE. / Hoje em dia, há um exponencial crescimento do volume de informação no mundo. Esta explosão cria uma demanda por técnicas mais eficientes de indexação e consulta de dados, uma vez que, para serem úteis, eles precisarão ser manipuláveis. Casamento de padrões se refere à busca de um texto menor (padrão) em um texto muito maior (texto), reportando a quantidade de ocorrências e/ou as localizações das ocorrências. Para tal, pode-se construir uma estrutura chamada índice que pré-processará o texto e permitirá que consultas sejam feitas eficientemente. A eficiência prática de um índice, além da sua eficiência teórica, pode definir o quão utilizado ele será, e isto está diretamente ligado a como ele se comporta nas arquiteturas dos computadores atuais. O principal objetivo deste estudo é analisar o uso da estrutura Wavelet Tree como índice avaliando o impacto da reorganização interna dos seus dados quanto à localidade espacial e, assim propor formas de organização que reduzam efetivamente a quantidade de cache misses ocorridos na execução de operações neste índice. Através de análises empíricas com dados simulados e dados textuais obtidos de dois repositórios públicos, avaliou-se alguns aspectos de cinco tipos de organizações para os dados da estrutura com o objetivo de compará-las quanto ao tempo de execução e quantidade de cache misses ocorridos. Adicionalmente, uma análise teórica da complexidade da quantidade de cache misses ocorridos para operação de consulta de um padrão é descrita para uma das organizações propostas. Dois experimentos realizados sugerem comportamentos assintóticos para duas das organizações analisadas. Um terceiro experimento executado mostra que, para quatro das cinco organizações apresentadas, houve uma sistemática redução na quantidade de cache misses ocorridos para a cache de menor nível. Entretanto a redução de cache misses para cache de menor nível não se refletiu integralmente numa diferença no tempo de execução das operações, tendo sido esta menos significativa, nem na quantidade de cache misses ocorridos na cache de maior nível, onde houveram variações positivas e negativas.Os resultados obtidos permitem concluir que a escolha de uma representação adequada pode acarretar numa melhora significativa de utilização da cache. Diferentemente do modelo teórico, o custo de acesso à memória responde apenas por uma fração do tempo de computação das operações sobre as Wavelet Trees, pelo que a diminuição no número de cache misses não se traduziu integralmente no tempo de execução. No entanto, este fator pode ser crítico em situações mais extremas de utilização de memória. / Today, there is an exponential growth in the volume of information in the world. This increase creates the demand for more efficient indexing and querying techniques, since, to be useful, that data needs to be manageable. Pattern matching means searching for a string (pattern) in a much bigger string (text), reporting the number of occurrences and/or its locations. To do that, we need to build a data structure known as index. This structure will preprocess the text to allow for efficient queries. The adoption of an index depends heavily on its efficiency, and this is directly related to how well it performs on current machine architectures. The main objective of this work is to analyze the Wavelet Tree data structure as an index, assessing the impact of its internal organization with respect to spatial locality, and propose ways to organize its data as to reduce the amount of cache misses incurred by its operations. We performed an empirical analysis using both real and simulated textual data to compare the running time and cache behavior of Wavelet Trees using five different proposals of internal data layout. A theoretical analysis about the cache complexity of a query operation is also presented for the most efficient layout. Two experiments suggest good asymptotic behavior for two of the analyzed layouts. A third experiment shows that for four of the five layouts, there was a systematic reduction in the number of cache misses for the lowest level cache. Despite this, this reduction was not reflected in the runtime, neither in the performance for the highest level cache. The results obtained allow us to conclude that the choice of a suitable layout can lead to a significant improvement in cache usage. Unlike the theoretical model, however, the cost of memory access only accounts for a fraction of the operations’ computation time on the Wavelet Trees, so the decrease in the number of cache misses did not translate fully into gains in the execution time. However, this factor can still be critical in more extreme memory utilization situations.
9	A Probabilistic Formulation of Keyword Spotting Puigcerver I Pérez, Joan 18 February 2019 (has links) [ES] La detección de palabras clave (Keyword Spotting, en inglés), aplicada a documentos de texto manuscrito, tiene como objetivo recuperar los documentos, o partes de ellos, que sean relevantes para una cierta consulta (query, en inglés), indicada por el usuario, entre una gran colección de documentos. La temática ha recogido un gran interés en los últimos 20 años entre investigadores en Reconocimiento de Formas (Pattern Recognition), así como bibliotecas y archivos digitales. Esta tesis, en primer lugar, define el objetivo de la detección de palabras clave a partir de una perspectiva basada en la Teoría de la Decisión y una formulación probabilística adecuada. Más concretamente, la detección de palabras clave se presenta como un caso particular de Recuperación de la Información (Information Retrieval), donde el contenido de los documentos es desconocido, pero puede ser modelado mediante una distribución de probabilidad. Además, la tesis también demuestra que, bajo las distribuciones de probabilidad correctas, el marco de trabajo desarrollada conduce a la solución óptima del problema, según múltiples medidas de evaluación utilizadas tradicionalmente en el campo. Más tarde, se utilizan distintos modelos estadísticos para representar las distribuciones necesarias: Redes Neuronales Recurrentes o Modelos Ocultos de Markov. Los parámetros de estos son estimados a partir de datos de entrenamiento, y las respectivas distribuciones son representadas mediante Transductores de Estados Finitos con Pesos (Weighted Finite State Transducers). Con el objetivo de hacer que el marco de trabajo sea práctico en grandes colecciones de documentos, se presentan distintos algoritmos para construir índices de palabras a partir de modelos probabilísticos, basados tanto en un léxico cerrado como abierto. Estos índices son muy similares a los utilizados por los motores de búsqueda tradicionales. Además, se estudia la relación que hay entre la formulación probabilística presentada y otros métodos de gran influencia en el campo de la detección de palabras clave, destacando cuáles son las limitaciones de los segundos. Finalmente, todas la aportaciones se evalúan de forma experimental, no sólo utilizando pruebas académicas estándar, sino también en colecciones con decenas de miles de páginas provenientes de manuscritos históricos. Los resultados muestran que el marco de trabajo presentado permite construir sistemas de detección de palabras clave muy rápidos y precisos, con una sólida base teórica. / [CA] La detecció de paraules clau (Keyword Spotting, en anglès), aplicada a documents de text manuscrit, té com a objectiu recuperar els documents, o parts d'ells, que siguen rellevants per a una certa consulta (query, en anglès), indicada per l'usuari, dintre d'una gran col·lecció de documents. La temàtica ha recollit un gran interés en els últims 20 anys entre investigadors en Reconeixement de Formes (Pattern Recognition), així com biblioteques i arxius digitals. Aquesta tesi defineix l'objectiu de la detecció de paraules claus a partir d'una perspectiva basada en la Teoria de la Decisió i una formulació probabilística adequada. Més concretament, la detecció de paraules clau es presenta com un cas concret de Recuperació de la Informació (Information Retrieval), on el contingut dels documents és desconegut, però pot ser modelat mitjançant una distribució de probabilitat. A més, la tesi també demostra que, sota les distribucions de probabilitat correctes, el marc de treball desenvolupat condueix a la solució òptima del problema, segons diverses mesures d'avaluació utilitzades tradicionalment en el camp. Després, diferents models estadístics s'utilitzen per representar les distribucions necessàries: Xarxes Neuronal Recurrents i Models Ocults de Markov. Els paràmetres d'aquests són estimats a partir de dades d'entrenament, i les corresponents distribucions són representades mitjançant Transductors d'Estats Finits amb Pesos (Weighted Finite State Transducers). Amb l'objectiu de fer el marc de treball útil per a grans col·leccions de documents, es presenten distints algorismes per construir índexs de paraules a partir dels models probabilístics, tan basats en un lèxic tancat com en un obert. Aquests índexs són molt semblants als utilitzats per motors de cerca tradicionals. A més a més, s'estudia la relació que hi ha entre la formulació probabilística presentada i altres mètodes de gran influència en el camp de la detecció de paraules clau, destacant algunes limitacions dels segons. Finalment, totes les aportacions s'avaluen de forma experimental, no sols utilitzant proves acadèmics estàndard, sinó també en col·leccions amb desenes de milers de pàgines provinents de manuscrits històrics. Els resultats mostren que el marc de treball presentat permet construir sistemes de detecció de paraules clau molt acurats i ràpids, amb una sòlida base teòrica. / [EN] Keyword Spotting, applied to handwritten text documents, aims to retrieve the documents, or parts of them, that are relevant for a query, given by the user, within a large collection of documents. The topic has gained a large interest in the last 20 years among Pattern Recognition researchers, as well as digital libraries and archives. This thesis, first defines the goal of Keyword Spotting from a Decision Theory perspective. Then, the problem is tackled following a probabilistic formulation. More precisely, Keyword Spotting is presented as a particular instance of Information Retrieval, where the content of the documents is unknown, but can be modeled by a probability distribution. In addition, the thesis also proves that, under the correct probability distributions, the framework provides the optimal solution, under many of the evaluation measures traditionally used in the field. Later, different statistical models are used to represent the probability distribution over the content of the documents. These models, Hidden Markov Models or Recurrent Neural Networks, are estimated from training data, and the corresponding distributions over the transcripts of the images can be efficiently represented using Weighted Finite State Transducers. In order to make the framework practical for large collections of documents, this thesis presents several algorithms to build probabilistic word indexes, using both lexicon-based and lexicon-free models. These indexes are very similar to the ones used by traditional search engines. Furthermore, we study the relationship between the presented formulation and other seminal approaches in the field of Keyword Spotting, highlighting some limitations of the latter. Finally, all the contributions are evaluated experimentally, not only on standard academic benchmarks, but also on collections including tens of thousands of pages of historical manuscripts. The results show that the proposed framework and algorithms allow to build very accurate and very fast Keyword Spotting systems, with a solid underlying theory. / Puigcerver I Pérez, J. (2018). A Probabilistic Formulation of Keyword Spotting [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/116834 Keyword Spotting Handwritten Text Recognition Information Retrieval Pattern Recognition Image Retrieval Probabilistic Text Indexing Historical Manuscripts Weighted Finite State Transducer Hidden Markov Model Artificial Neural Network LENGUAJES Y SISTEMAS INFORMATICOS

Search results