61 |
Transform Based And Search Aware Text Compression Schemes And Compressed Domain Text Retrieval
Zhang, Nan 01 January 2005 (has links)
In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to another over data communication links with limited bandwidth. We must also have means to speedily find the information we need in this huge mass of data. Sometimes a single site may contain large collections of data, such as a library database, thereby requiring an efficient search mechanism even to search within the local data. To facilitate information retrieval, an emerging ad hoc standard for uncompressed text is XML, which preprocesses the text by adding user-defined metadata such as DTDs or hyperlinks to enable searching with better efficiency and effectiveness. This increases the file size considerably, underscoring the importance of applying text compression. On account of efficiency (in terms of both space and time), there is a need to keep the data in compressed form for as long as possible. Text compression is concerned with techniques for representing digital text data in alternative representations that take less space. Not only does it help conserve storage space for archival and online data, it also improves system performance by requiring fewer secondary storage (disk or CD-ROM) accesses and improves network transmission bandwidth utilization by reducing transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats such as .zip, .gz and .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are useful for special classes of images such as medical imaging, fingerprint data and astronomical images, and for databases containing mostly vital numerical data, tables and text information. Many lossy algorithms use lossless methods at the final stage of encoding, underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to effectively utilize the full potential of compression techniques in future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression, or with only partial decompression, independent of whether the search is done on the text itself or on an inversion table corresponding to a set of key words for the text.

In this dissertation, we make the following contributions. (1) Star family compression algorithms: We have proposed an approach to develop a reversible transformation that can be applied to a source text and that improves existing algorithms' ability to compress. We use a static dictionary to convert English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text, so we already achieve some compression at the preprocessing stage. We have a series of transforms that improve the performance. The star transform requires a static dictionary of a certain size.
To avoid the considerable complexity of conversion, we employ a ternary tree data structure that efficiently converts the words in the text to the words in the star dictionary in linear time.

(2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We propose a method to extract useful context information in linear time from BWT-transformed text. The auxiliary arrays obtained from the BWT inverse transform yield logarithmic search time. Approximate pattern matching can then be performed based on the results of exact pattern matching, by extracting candidate regions; a fast verification algorithm is applied to these candidates, which constitute only small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT-compressed text. A typical compression system based on BWT has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach to replace the Move-to-Front stage in order to extend compressed-domain search capability all the way to the entropy coding stage. A modification to Move-to-Front makes it possible to randomly access any part of the compressed text without referring to the part before the access point.

(3) Modified LZW algorithm that allows random access and partial decoding for compressed text retrieval: Although many compression algorithms provide good compression ratios and/or time complexity, LZW was the first one studied for compressed pattern matching because of its simplicity and efficiency. Modifications to the LZW algorithm provide the extra advantage of fast random access and partial decoding, which is especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that the search can be performed at the expected level of granularity: for example, the user can choose to retrieve a single line, a paragraph, or a file that contains the keywords. More importantly, we show that parallel encoding and decoding algorithms are trivial with the modified LZW. Both encoding and decoding can be performed easily with multiple processors, and the encoding and decoding processes are independent with respect to the number of processors.
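To make the compressed-domain search idea in contribution (2) concrete, the sketch below shows how sorted suffix context enables logarithmic-time exact matching by binary search. It uses a plain, naively built suffix array rather than the auxiliary arrays recovered from the BWT, so it only illustrates the underlying principle and is not the dissertation's algorithm; the function names and the "banana" example are illustrative.

```python
from bisect import bisect_left, bisect_right

def suffix_array(text):
    """Indices of the suffixes of `text` in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """All start positions of `pattern`, found by binary search over the
    sorted suffixes. Assumes no character in `text` is >= '\xff'."""
    suffixes = [text[i:] for i in sa]          # materialised only for clarity
    lo = bisect_left(suffixes, pattern)
    hi = bisect_right(suffixes, pattern + "\xff")
    return sorted(sa[lo:hi])

text = "banana"
sa = suffix_array(text)
print(occurrences(text, sa, "ana"))            # [1, 3]
print(occurrences(text, sa, "nab"))            # []
```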
|
62 |
Efficient Implementation & Application of Maximal String Covering Algorithm / MAXIMAL COVER ALGORITHM IMPLEMENTATION
Koponen, Holly January 2022 (has links)
This thesis describes the development and application of the new software MAXCOVER that computes maximal covers and non-extendible repeats (a.k.a. “maximal repeats”).
A string is a finite array x[1..n] of elements chosen from a set of totally ordered symbols called an alphabet. A repeat is a substring that occurs at least twice in x. A repeat is left/right extendible if every occurrence is preceded/followed by the same symbol; otherwise, it is non-left/non-right extendible (NLE/NRE). A non-extendible (NE) repeat is both NLE and NRE. A repeat covers a position i if x[i] lies within the repeat. A maximal cover (a.k.a. “optimal cover”) is a repeat that covers the most positions in x.
For simplicity, we first describe a quadratic O(n²) implementation of MAXCOVER that computes all maximal covers of a given string, based on the pseudocode given in [1]. Then, we consider the O(n log n) pseudocode in [1], in which we identify several errors. We leave a complete correction and implementation for future work. Instead, we propose two improved quadratic algorithms that, as shown through experiments, execute in linear time in the average case.
We perform a benchmark evaluation of MAXCOVER’s performance and demonstrate its value to biologists in the protein context [2]. To do so, we develop an extension of MAXCOVER for the closely related task of computing NE repeats. Then, we compare MAXCOVER to the repeat-match feature of the well-known MUMmer software [3] (600+ citations). We determine that MAXCOVER is an order of magnitude faster than MUMmer, with much lower space requirements. We also show that MAXCOVER produces a more compact, exact, and user-friendly output that specifies the repeats.
Availability: Open source code, binaries, and test data are available on GitHub at https://github.com/hollykoponen/MAXCOVER. It currently runs on Linux and is untested on other operating systems. / Thesis / Master of Science (MSc) / This thesis deals with a simple yet essential data structure called a string, a sequence of symbols drawn from an alphabet. For example, a DNA sequence is a string composed of four letters.
We describe a new software tool called MAXCOVER that identifies the maximal covers of a given string x (a repeating substring that ‘covers’ the most positions in x). The software is based on the algorithms in [1]. We propose two new algorithms that perform faster in practice.
We also extend MAXCOVER to the closely related task of computing non-extendible repeats. We compare this extension to the well-known MUMmer software (600+ citations). We find that MAXCOVER is many times faster than MUMmer, with much lower space requirements, and produces a more compact, exact and user-friendly output.
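The maximal-cover definition above can be made concrete with a brute-force sketch: enumerate every substring that occurs at least twice, mark the positions its occurrences cover, and keep the repeat(s) covering the most positions. This is only meant to illustrate the definition; it is far slower than MAXCOVER and is not the implementation described in the thesis.

```python
def occurrences(x: str, w: str) -> list[int]:
    """All starting positions of substring w in x (overlaps allowed)."""
    return [i for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w]

def maximal_covers(x: str) -> tuple[int, set[str]]:
    """Number of covered positions and the set of repeats achieving it."""
    best_cover, best_repeats = 0, set()
    seen = set()
    for length in range(1, len(x)):
        for start in range(len(x) - length + 1):
            w = x[start:start + length]
            if w in seen:
                continue
            seen.add(w)
            occ = occurrences(x, w)
            if len(occ) < 2:
                continue                        # not a repeat
            covered = {p for i in occ for p in range(i, i + length)}
            if len(covered) > best_cover:
                best_cover, best_repeats = len(covered), {w}
            elif len(covered) == best_cover:
                best_repeats.add(w)
    return best_cover, best_repeats

print(maximal_covers("abaababa"))   # (8, {'aba'}): "aba" covers every position
```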
|
63 |
Modeling and Matching of Landmarks for Automation of Mars Rover Localization
Wang, Jue 05 September 2008 (has links)
No description available.
|
64 |
Feature-based Vehicle Classification in Wide-angle Synthetic Aperture Radar
Dungan, Kerry Edward 03 August 2010 (has links)
No description available.
|
65 |
Search over Encrypted Data in Cloud Computing
Wang, Bing 25 June 2016 (has links)
Cloud computing, which provides computation and storage resources in a pay-per-use manner, has emerged as the most popular computation model today. Under the new paradigm, users are able to request computation resources dynamically in real time to accommodate their workload requirements. The flexible resource allocation feature endows cloud computing services with the capability to offer affordable and efficient computation services. However, moving data and applications into the cloud exposes user data to a risk of privacy leakage. With growing awareness of data privacy, more and more users choose to proactively protect their data in the cloud through encryption. One major problem with data encryption is that it hinders many necessary data utilization functions, since most of these functions cannot be applied directly to encrypted data. This problem could potentially jeopardize the popularity of cloud computing; therefore, achieving efficient data utilization over encrypted data while preserving user data privacy is an important research problem in cloud computing.
The focus of this dissertation is to design secure and efficient schemes that address essential data utilization functions over encrypted data in cloud computing. To this end, we study three problems in this research area. The first problem studied in this dissertation is fuzzy multi-keyword search over encrypted data. As fuzzy search is one of the most useful and essential data utilization functions in our daily life, we propose a novel design that incorporates Bloom filters and Locality-Sensitive Hashing to fulfill the security and functionality requirements of the problem. Secondly, we propose a secure index based on the most popular index structure, i.e., the inverted index. Our design provides privacy protection for the secure index, the user query, the search pattern and the search result. Users can also verify the correctness of the search results to ensure that the proper computation is performed by the cloud. Finally, we focus on a privacy-sensitive data application in cloud computing, i.e., genetic testing over DNA sequences. To provide secure and efficient genetic testing in the cloud, we utilize Predicate Encryption and design a bilinear-pairing-based secure sequence matching scheme that achieves a strong privacy guarantee while fulfilling the functionality requirement efficiently. In all three research thrusts, we present thorough theoretical security analysis and extensive simulation studies to evaluate the performance of the proposed schemes. The results demonstrate that the proposed schemes can effectively and efficiently address the challenging problems in practice. / Ph. D.
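The fuzzy multi-keyword idea combining Bloom filters and Locality-Sensitive Hashing can be sketched, in plain text and without any encryption, roughly as follows: each keyword is reduced to a set of character bigrams, a small MinHash signature of that set is folded into a per-document Bloom filter, and a query keyword is scored by how many of its signature values hit the filter, so a typo still matches most positions. All names, parameters and the MinHash choice here are illustrative assumptions; the dissertation's actual construction and its security mechanisms are not reproduced.

```python
import hashlib

NUM_HASHES = 16          # MinHash functions (illustrative)
FILTER_BITS = 1 << 12    # Bloom filter size (illustrative)

def bigrams(word: str) -> set[str]:
    w = f"#{word.lower()}#"
    return {w[i:i + 2] for i in range(len(w) - 1)}

def minhash(features: set[str]) -> list[int]:
    """One MinHash value per seeded hash function over the feature set."""
    return [min(int(hashlib.sha1(f"{seed}:{f}".encode()).hexdigest(), 16)
                for f in features)
            for seed in range(NUM_HASHES)]

def insert(bloom: set[int], word: str) -> None:
    for i, sig in enumerate(minhash(bigrams(word))):
        bloom.add(hash((i, sig)) % FILTER_BITS)

def score(bloom: set[int], word: str) -> float:
    sigs = minhash(bigrams(word))
    hits = sum(hash((i, s)) % FILTER_BITS in bloom for i, s in enumerate(sigs))
    return hits / NUM_HASHES

doc_index: set[int] = set()
for kw in ["network", "security", "cloud"]:
    insert(doc_index, kw)

# the misspelled "netwrok" typically scores markedly higher than the
# unrelated "privacy", because it shares most bigrams with "network"
print(score(doc_index, "netwrok"), score(doc_index, "privacy"))
```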
|
66 |
Incremental Matching on Word Chains
Nilsson, Wilmer January 2024 (has links)
Pattern matching, the process of finding a given pattern in a given text, is widely used in areas such as search-and-replace functions in text processing programs and DNA sequence analysis, where the pattern can be a search term or a specific sequence of characters. Finding and analysing nucleic acid sequences in DNA data can in some cases require locating sequences that are in turn made up of several specific subsequences, where the nucleotides between them, as well as their number, are irrelevant. Such a pattern, also called a word chain, can be found more efficiently by pre-processing the pattern and the text. This thesis explores, investigates and presents a data structure for matching a word chain pattern, with the ability to incrementally update this pre-computed information so that text alterations such as split and concatenation operations can be handled more efficiently in terms of time.
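A word chain, as described above, is a sequence of sub-patterns that must occur in order with arbitrary (and arbitrarily many) symbols between them. The sketch below shows plain, non-incremental matching of such a chain by a greedy left-to-right scan, assuming sub-patterns may not overlap; the thesis's contribution, keeping such matches up to date under split and concatenation of the text, is not attempted here.

```python
def match_word_chain(text: str, chain: list[str]) -> list[int] | None:
    """Return the start position of each sub-pattern, or None if the chain
    does not occur in order (sub-patterns are not allowed to overlap)."""
    positions, start = [], 0
    for sub in chain:
        i = text.find(sub, start)
        if i == -1:
            return None
        positions.append(i)
        start = i + len(sub)        # next sub-pattern must begin after this one
    return positions

dna = "GGACGTTTACGGTACGTA"
print(match_word_chain(dna, ["ACG", "TAC", "CGT"]))   # [2, 7, 14]
print(match_word_chain(dna, ["TTT", "GGA"]))          # None: "GGA" never follows "TTT"
```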
|
67 |
Why-Query Support in Graph Databases
Vasilyeva, Elena 28 March 2017 (has links) (PDF)
In the last few decades, database management systems have become powerful tools for storing large amounts of data and executing complex queries over them. In addition to extended functionality, novel types of databases have appeared, such as triple stores and distributed databases. Graph databases implementing the property-graph model belong to this development branch and provide a new way of storing and processing data in the form of a graph, with nodes representing entities and edges describing the connections between them. This makes them suitable for keeping data without a rigid schema for use cases like social-network processing or data integration. In addition to flexible storage, graph databases provide new querying possibilities in the form of path queries, detection of connected components, pattern matching, etc.
However, the schema flexibility and graph queries come with additional costs. With limited knowledge about the data and little experience in constructing complex queries, users can create queries that deliver unexpected results. Forced to debug queries manually and overwhelmed by the number of query constraints, users can become frustrated with graph databases. What is really needed is to improve the usability of graph databases by providing debugging and explanation functionality for such situations. We have to assist users in discovering the reasons for unexpected results and what can be done to fix them.
The unexpectedness of result sets can be expressed in terms of their size or content. In the first case, users have to solve the empty-answer, too-many-answers, or too-few-answers problems. In the second case, users care about the result content and miss some expected answers or wonder about the presence of unexpected ones. Considering the typical problems of receiving no results or too many results when querying graph databases, in this thesis we focus on the problems of the first group, whose solutions are usually represented by why-empty, why-so-few, and why-so-many queries. Our objective is to extend graph databases with debugging functionality in the form of why-queries for unexpected query results, using pattern matching queries, one of the general graph-query types, as our example. We present a comprehensive analysis of existing debugging tools in state-of-the-art research and identify their common properties.
From them, we formulate the following features of why-queries, which we discuss in this thesis: holistic support of different cardinality-based problems, explanation of unexpected results and query reformulation, comprehensive analysis of explanations, and non-intrusive user integration. To support different cardinality-based problems, we develop methods for explaining no, too few, and too many results. To cover different kinds of explanations, we present two types: subgraph-based and modification-based explanations. The first type identifies the reasons for unexpectedness in terms of query subgraphs and delivers differential graphs as answers. The second reformulates queries in such a way that they produce better results. Considering graph queries to be complex structures with multiple constraints, we investigate different ways of generating explanations, starting from the most general one that considers only the query topology, through coarse-grained rewriting, up to fine-grained modification that allows fine changes of predicates and topology. To provide a comprehensive analysis of explanations, we propose to compare them on three levels: the syntactic description, the content, and the size of the result set. In order to deliver user-aware explanations, we discuss two models for non-intrusive user integration in the generation process.
With the techniques proposed in this thesis, we provide the fundamentals for debugging pattern-matching queries that deliver no, too few, or too many results in graph databases implementing the property-graph model.
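As a toy illustration of the modification-based, why-empty idea, the sketch below takes a conjunctive query over node properties that returns no answers and reports which single-predicate relaxations would make the answer non-empty. Real property-graph queries also constrain topology, and this is not the thesis's algorithm; the example data and predicate labels are invented for illustration.

```python
from typing import Callable

Node = dict                                       # a node is just a property map here
Predicate = tuple[str, Callable[[Node], bool]]    # (label, test)

def evaluate(nodes: list[Node], predicates: list[Predicate]) -> list[Node]:
    return [n for n in nodes if all(test(n) for _, test in predicates)]

def why_empty(nodes: list[Node], predicates: list[Predicate]) -> list[str]:
    """Labels of predicates whose removal alone yields a non-empty answer."""
    culprits = []
    for i, (label, _) in enumerate(predicates):
        relaxed = predicates[:i] + predicates[i + 1:]
        if evaluate(nodes, relaxed):
            culprits.append(label)
    return culprits

people = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Max", "age": 29, "city": "Dresden"},
]
query = [
    ("age < 30",         lambda n: n["age"] < 30),
    ("city = 'Dresden'", lambda n: n["city"] == "Dresden"),
    ("name = 'Ada'",     lambda n: n["name"] == "Ada"),
]
print(evaluate(people, query))    # []  (the query returns no answers)
print(why_empty(people, query))   # ["name = 'Ada'"]  (relaxing the name predicate yields results)
```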
|
68 |
Uma medida de similaridade híbrida para correspondência aproximada de múltiplos padrões / A hybrid similarity measure for multiple approximate pattern matching
Dezembro, Denise Gazotto 07 March 2019 (has links)
Multiple approximate pattern matching is a problem found in many research areas, such as computational biology, signal processing and information retrieval. Most of the time a pattern does not have an exact match in the text, so approximate matches are sought according to an error model. In general, the error model uses a distance function to determine how different two patterns are. Distance functions are based on similarity measures, which can be classified into edit-distance-based, token-based and hybrid measures. Some of these measures extract a feature vector from all the terms that make up the pattern; the similarity between vectors can then be computed by cosine distance or Euclidean distance, for instance. These measures present some problems: they become infeasible as the size of the pattern increases, they do not perform spelling correction, or they suffer from normalization issues. In this research we propose a new hybrid similarity measure, named DGD, that combines TF-IDF weighting with an edit-distance-based similarity measure to estimate the importance of a term within a pattern for the textual search task. The DGD measure does not completely discard terms that are not part of the pattern, but assigns them a weight based on the high similarity of such a term to one that is in the pattern and on the term's average TF-IDF weight in the collection. Experiments were conducted showing the behaviour of the proposed measure compared with existing ones in the literature. As a general recommendation, thresholds of 0.60 for {tf-idf+cosine, Jaccard, Soft tf-idf} and 0.90 for {Jaro, Jaro-Winkler, Monge-Elkan} are suggested for the detection of similar patterns. The similarity measure proposed in this work (DGD+cosine) performed better than tf-idf+cosine and Soft tf-idf in identifying similar patterns, and better than the edit-distance-based measures (Jaro and Jaro-Winkler) in identifying non-similar patterns. Acting as a classifier, the hybrid similarity measure proposed in this work (DGD+cosine) in general performed better (although not significantly so) than all the other similarity measures analysed, which is a promising result. In addition, it is possible to conclude that the best threshold for the secondary edit-distance-based similarity between the terms of the pattern is 0.875.
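A sketch in the spirit of the hybrid measures discussed above combines TF-IDF weights with a secondary character-level similarity: terms of two patterns are aligned when their character-level similarity reaches a threshold (0.875, echoing the recommendation above), and aligned terms contribute their TF-IDF weights to a cosine-style score. This is an illustrative Soft tf-idf-like sketch, not the DGD measure itself; difflib's ratio stands in for the Jaro/Jaro-Winkler measures analysed in the thesis, and the example collection is invented.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def char_sim(a, b):
    """Character-level similarity in [0, 1]; stands in for Jaro/Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()

def tfidf(pattern, collection):
    """TF-IDF weight of each term of `pattern` with respect to the collection."""
    n, tf = len(collection), Counter(pattern)
    return {t: f * math.log((n + 1) / (1 + sum(t in doc for doc in collection)))
            for t, f in tf.items()}

def hybrid_sim(p, q, collection, threshold=0.875):
    wp, wq = tfidf(p, collection), tfidf(q, collection)
    score = 0.0
    for t, w in wp.items():
        s, u = max((char_sim(t, u), u) for u in wq)   # closest term of q
        if s >= threshold:
            score += w * wq[u] * s
    norm = math.sqrt(sum(w * w for w in wp.values())) * \
           math.sqrt(sum(w * w for w in wq.values()))
    return score / norm if norm else 0.0

docs = [["gene", "expression", "profile"],
        ["gene", "expresion", "data"],        # note the misspelled term
        ["protein", "folding", "model"]]
print(hybrid_sim(docs[0], docs[1], docs))     # roughly 0.5: "expression" aligns with "expresion"
print(hybrid_sim(docs[0], docs[2], docs))     # 0.0: no terms align
```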
|
69 |
Un système de types pour la programmation par réécriture embarquée / A type system for embedded rewriting programming
Oliveira Kiermes Tavares, Claudia Fernanda 02 March 2012 (links)
In software engineering, type systems are often used to prevent the occurrence of terms that are meaningless with respect to a type specification. When extending a programming language with new dedicated features, the typing of these features must be compatible with that of the host language. This thesis is situated in the context of term rewriting embedded in object-oriented programming and aims to develop a safe type system featuring subtyping for the support of associative pattern matching on algebraic terms built from variadic operators. In this work we consider the Tom rewriting language, which provides associative pattern matching constructs and rewrite strategies for general-purpose languages such as Java. We describe the evaluation of Tom code through the definition of the operational semantics of the Tom language, as an essential element of the proof that the type system is safe. The type system includes type checking as well as constraint-based type inference. The constraint language is composed, on the one hand, of equality constraints solved by unification and, on the other hand, of subtyping constraints solved by a combination of simplification, generation of a solution and garbage collection. The type system was integrated into Tom, providing both stronger expressiveness and more safety, in order to ensure that the transformations described by rewrite rules preserve the type of terms.
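The equality part of the constraint language, solved by unification, can be illustrated with a minimal sketch in which type terms are either variables or constructors applied to arguments, and a substitution is built by structural unification. This is a generic textbook-style unifier, not Tom's implementation, and the subtyping constraints (simplification, solution generation, garbage collection) are not modelled.

```python
class Var:
    def __init__(self, name): self.name = name
    def __repr__(self): return self.name

class Ctor:
    def __init__(self, name, *args): self.name, self.args = name, args
    def __repr__(self):
        return self.name if not self.args else f"{self.name}({', '.join(map(repr, self.args))})"

def resolve(t, subst):
    """Follow variable bindings until a constructor or an unbound variable."""
    while isinstance(t, Var) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    """Extend `subst` so that a and b become equal (occurs check omitted for brevity)."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a is b:
        return subst
    if isinstance(a, Var):
        return {**subst, a: b}
    if isinstance(b, Var):
        return {**subst, b: a}
    if a.name != b.name or len(a.args) != len(b.args):
        raise TypeError(f"cannot unify {a} with {b}")
    for x, y in zip(a.args, b.args):
        subst = unify(x, y, subst)
    return subst

# list(A) = list(int)  together with  A = B  force  A = B = int
A, B = Var("A"), Var("B")
s = unify(Ctor("list", A), Ctor("list", Ctor("int")), {})
s = unify(A, B, s)
print(resolve(A, s), resolve(B, s))   # int int
```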
|
70 |
SNIF TOOL - Sniffing for Patterns in Continuous Streams
MUKHERJI, ABHISHEK 11 February 2008 (has links)
Recent technological advances in sensor networks and mobile devices give rise to new challenges in the processing of live streams. In particular, time-series sequence matching, namely the similarity matching of live streams against a set of predefined pattern sequence queries, is an important technology for a broad range of domains that include monitoring the spread of hazardous waste and administering network traffic. In this thesis, I use the time-critical application of monitoring fire growth in an intelligent building as my motivating example. Various measures and algorithms have been established in the current literature for the similarity of static time-series data. Matching continuous data poses the following new challenges: 1) fluctuations in stream characteristics, 2) real-time requirements of the application, 3) limited system resources, and 4) noisy data. Thus the matching techniques proposed for static time series are mostly not applicable to live stream matching.

In this thesis, I propose a new generic framework, henceforth referred to as the n-Snippet Indices Framework (in short, SNIF), for discovering the similarity between a live stream and pattern sequences. The framework is composed of two key phases: (1) an off-line preprocessing phase, in which the pattern sequences are processed offline and stored in an approximate two-level index structure; and (2) an on-line live stream matching phase, in which the streaming time series (the live stream) is matched on the fly against the indexed pattern sequences. I introduce the concept of n-snippets of numeric data as the unit of matching. The insight is to match small snippets of the live stream against prefixes of the patterns and to maintain them in succession: the longer the pattern prefixes identified as similar to the live stream, the better the confirmation of the match. Thus, live stream matching is performed at two levels: bag matching for matching snippets and order checking for maintaining the lengths of the matches. I propose four variations of the matching algorithm that allow the user to choose between the two conflicting goals of result accuracy and response time.

The effectiveness of SNIF in detecting patterns has been thoroughly tested through extensive experimental evaluations using the continuous query engine CAPE as the platform. The evaluations made use of real datasets from multiple domains, including fire monitoring, chlorine monitoring and sensor networks. Moreover, SNIF is demonstrated to be tolerant of noisy datasets.
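A much-simplified sketch of the snippet idea follows: the live stream is consumed in fixed-length snippets, each incoming snippet is compared (within a tolerance) against the next expected snippet of every pattern, and a per-pattern counter tracks how long a prefix has been matched in succession, so longer matched prefixes mean stronger confirmation. The snippet length, tolerance and fire-growth numbers are invented for illustration, and SNIF's two-level bag-matching and order-checking index is not reproduced.

```python
SNIPPET = 4          # snippet length n (illustrative)
TOLERANCE = 0.5      # mean absolute difference allowed per snippet (illustrative)

def snippets(seq, n=SNIPPET):
    """Split a numeric sequence into consecutive snippets of length n."""
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, n)]

def close(a, b, tol=TOLERANCE):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a) <= tol

class StreamMatcher:
    def __init__(self, patterns):
        # off-line phase: pre-process each pattern into its snippet sequence
        self.snips = {name: snippets(p) for name, p in patterns.items()}
        self.progress = {name: 0 for name in patterns}   # matched prefix length

    def feed(self, snippet):
        """On-line phase: consume one live-stream snippet, report completed patterns."""
        completed = []
        for name, snips in self.snips.items():
            k = self.progress[name]
            if k < len(snips) and close(snippet, snips[k]):
                k += 1
            else:
                # on mismatch, the snippet may still start a fresh prefix
                k = 1 if close(snippet, snips[0]) else 0
            if k == len(snips):
                completed.append(name)      # full pattern confirmed
                k = 0
            self.progress[name] = k
        return completed

fire = [20, 25, 35, 50, 70, 95, 120, 150]          # hypothetical fire-growth pattern
m = StreamMatcher({"fire-growth": fire})
live = [20.2, 24.9, 35.3, 49.8, 70.1, 95.4, 119.8, 150.2]
for s in snippets(live):
    print(m.feed(s))      # [] for the first snippet, ['fire-growth'] for the second
```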
|