11 |
Time Series Data AnalyticsAhsan, Ramoza 29 April 2019 (has links)
Given the ubiquity of time series data, and the exponential growth of databases, there has recently been an explosion of interest in time series data mining. Finding similar trends and patterns among time series data is critical for many applications ranging from financial planning, weather forecasting, stock analysis to policy making. With time series being high-dimensional objects, detection of similar trends especially at the granularity of subsequences or among time series of different lengths and temporal misalignments incurs prohibitively high computation costs. Finding trends using non-metric correlation measures further compounds the complexity, as traditional pruning techniques cannot be directly applied. My dissertation addresses these challenges while meeting the need to achieve near real-time responsiveness. First, for retrieving exact similarity results using Lp-norm distances, we design a two-layered time series index for subsequence matching. Time series relationships are compactly organized in a directed acyclic graph embedded with similarity vectors capturing subsequence similarities. Powerful pruning strategies leveraging the graph structure greatly reduce the number of time series as well as subsequence comparisons, resulting in a several order of magnitude speed-up. Second, to support a rich diversity of correlation analytics operations, we compress time series into Euclidean-based clusters augmented by a compact overlay graph encoding correlation relationships. Such a framework supports a rich variety of operations including retrieving positive or negative correlations, self correlations and finding groups of correlated sequences. Third, to support flexible similarity specification using computationally expensive warped distance like Dynamic Time Warping we design data reduction strategies leveraging the inexpensive Euclidean distance with subsequent time warped matching on the reduced data. This facilitates the comparison of sequences of different lengths and with flexible alignment still within a few seconds of response time. Comprehensive experimental studies using real-world and synthetic datasets demonstrate the efficiency, effectiveness and quality of the results achieved by our proposed techniques as compared to the state-of-the-art methods.
|
12 |
Index pro podobnostní vyhledávání ve vysokodimenzionálních prostorech / Index Suitable for Similar Search in High-dimensional SpacesKrejčová, Martina January 2012 (has links)
In this paper, we focus on indexing and searching in high-dimensional data. To achieve the target we implemented the Metric Index, a model of the similarity search based on the metric spaces, that employs many of known principles of partitioning and filtering. The metric space is a general model of similarity, which enables the usage of implemented index for various data. With this index, stored data could be searched effectively. The internal structure of data is hidden, we just require an implementation of the function for feature extraction, which produces a vector representing data, and the metric function applicable to the given data. The Metric Index was implemented as a data cartridge, the mechanism for extending the capabilities of the Oracle server. This data cartridge enables indexing of large unstructured data in the Oracle server known as LOBs.
|
13 |
Podobnostní vyhledávání v databázích proteinových struktur / Similarity Search in Protein Structure DatabasesGalgonek, Jakub January 2012 (has links)
Proteins are one of the most important biopolymers having a wide range of functions in living organisms. Their huge functional diversity is achieved by their ability to fold into various 3D structures. Moreover, it has been shown that proteins sharing similar structure often share also other properties (e.g, a biological function, an evolutionary origin, etc.). Therefore, protein structures and methods to identify their similarities are so widely studied. In this thesis, we introduce a system allowing similarity search in pro- tein structure databases. The system retrieves, given a query structure, all database structures being similar to the query structure. It employs several key components. We have introduced a novel similarity measure assigning similarity scores to pairs of protein structures. We have designed specific access method based on LAESA metric indexing and using the proposed measure. The access method allows to search similar structures more effi- ciently than when a sequential scan of a database is employed. To achieve further speedup, the measure and the access method have been parallelized, resulting in almost linear speedup with the respect to the number of available cores. The last component is a web user interface that allows to accept a query structure and to present a list of...
|
14 |
Nasazení paralelních architektur v podrobnostním vyhledávání / Employing Parallel Architectures in Similarity SearchKruliš, Martin January 2013 (has links)
This work examines the possibilities of employing highly parallel architectures in database systems, which are based on the similarity search paradigm. The main objective of our research is utilizing the computational power of current GPU devices for similarity search in the databases of images. Despite leaping progress made in the past few years, the similarity search problems remain very expensive from a compu- tational point of view, which limits the scope of their applicability. GPU devices have a tremendous computational power at their disposal; however, the usability of this power for particular problems is often complicated due to the specific properties of this architecture. Therefore, the existing algorithms and data structures require extensive modifications if they are to be adapted for the GPUs. We have addressed all the aspects of this domain, such as efficient utilization of the GPU hardware for generic computations, parallelization of similarity search process, and acceleration of image indexing techniques. In most cases, employing the GPU devices brought a speedup of two orders of magnitude with respect to single-core CPUs and approximately one order of magnitude with respect to multiprocessor NUMA servers. This thesis summarizes our experience and discoveries from several years of research,...
|
15 |
Online hashing for fast similarity searchCakir, Fatih 02 February 2018 (has links)
In this thesis, the problem of online adaptive hashing for fast similarity search is studied. Similarity search is a central problem in many computer vision applications. The ever-growing size of available data collections and the increasing usage of high-dimensional representations in describing data have increased the computational cost of performing similarity search, requiring search strategies that can explore such collections in an efficient and effective manner. One promising family of approaches is based on hashing, in which the goal is to map the data into the Hamming space where fast search mechanisms exist, while preserving the original neighborhood structure of the data. We first present a novel online hashing algorithm in which the hash mapping is updated in an iterative manner with streaming data. Being online, our method is amenable to variations of the data. Moreover, our formulation is orders of magnitude faster to train than state-of-the-art hashing solutions. Secondly, we propose an online supervised hashing framework in which the goal is to map data associated with similar labels to nearby binary representations. For this purpose, we utilize Error Correcting Output Codes (ECOCs) and consider an online boosting formulation in learning the hash mapping. Our formulation does not require any prior assumptions on the label space and is well-suited for expanding datasets that have new label inclusions. We also introduce a flexible framework that allows us to reduce hash table entry updates. This is critical, especially when frequent updates may occur as the hash table grows larger and larger. Thirdly, we propose a novel mutual information measure to efficiently infer the quality of a hash mapping and retrieval performance. This measure has lower complexity than standard retrieval metrics. With this measure, we first address a key challenge in online hashing that has often been ignored: the binary representations of the data must be recomputed to keep pace with updates to the hash mapping. Based on our novel mutual information measure, we propose an efficient quality measure for hash functions, and use it to determine when to update the hash table. Next, we show that this mutual information criterion can be used as an objective in learning hash functions, using gradient-based optimization. Experiments on image retrieval benchmarks confirm the effectiveness of our formulation, both in reducing hash table recomputations and in learning high-quality hash functions.
|
16 |
Locality Sensitive Indexing for Efficient High-Dimensional Query Answering in the Presence of Excluded RegionsJanuary 2016 (has links)
abstract: Similarity search in high-dimensional spaces is popular for applications like image
processing, time series, and genome data. In higher dimensions, the phenomenon of
curse of dimensionality kills the effectiveness of most of the index structures, giving
way to approximate methods like Locality Sensitive Hashing (LSH), to answer similarity
searches. In addition to range searches and k-nearest neighbor searches, there
is a need to answer negative queries formed by excluded regions, in high-dimensional
data. Though there have been a slew of variants of LSH to improve efficiency, reduce
storage, and provide better accuracies, none of the techniques are capable of
answering queries in the presence of excluded regions.
This thesis provides a novel approach to handle such negative queries. This is
achieved by creating a prefix based hierarchical index structure. First, the higher
dimensional space is projected to a lower dimension space. Then, a one-dimensional
ordering is developed, while retaining the hierarchical traits. The algorithm intelligently
prunes the irrelevant candidates while answering queries in the presence of
excluded regions. While naive LSH would need to filter out the negative query results
from the main results, the new algorithm minimizes the need to fetch the redundant
results in the first place. Experiment results show that this reduces post-processing
cost thereby reducing the query processing time. / Dissertation/Thesis / Masters Thesis Computer Science 2016
|
17 |
Explorando variedade em consultas por similaridade / Investigationg variety in similarity queriesLúcio Fernandes Dutra Santos 26 October 2012 (has links)
A complexidade dos dados armazenados em grandes bases de dados aumenta sempre, criando a necessidade de novas formas de consulta. As consultas por similaridade vêm apresentando crescente interesse para tratar de dados complexos, sendo as mais representativas a consulta por abrangência (\'R IND. q\' Range query) e a consulta aos k-vizinhos mais próximos (k-\'NN IND. q\' k-Nearest Neighboor query). Até recentemente, essas consultas não estavam disponíveis nos Sistemas de Gerenciamento de Bases de Dados (SGBD). Agora, com o início de sua disponibilidade, tem se tornado claro que os operadores de busca fundamentais usados para executá-las não são suficientes para atender às necessidades das aplicações que as demandam. Assim, estão sendo estudadas variações e extensões aos operadores fundamentais, em geral voltados às necessidades de domínios de aplicações específicas. Além disso, os seguintes problemas vêm impactando diretamente sua aceitação por parte dos usuários e, portanto, sua usabilidade: (i) os operadores fundamentais são pouco expressivos em situações reais; (ii) a cardinalidade dos resultados tende a ser grande, obrigando o usuário analisar muitos elementos; e (iii) os resultados nem sempre atendem ao interesse do usuário, implicando na reformulação e ajuste frequente das consultas. O objetivo desta dissertação é o desenvolvimento de uma técnica inédita para exibir um grau de variedade nas respostas às consultas aos k-vizinhos mais próximos em domínios de dados métricos, explorando aspectos de diversidade em extensões dos operadores fundamentais usando apenas as propriedades básicas do espaço métrico sem a solicitação de outra informação por parte do usuário. Neste sentido, são apresentados: a formalização de um modelo de variedade que possibilita inserir diversidade nas consultas por similaridade sem a definição de parâmetros por parte do usuário; um algoritmo incremental para responder às consultas aos k-vizinhos mais próximos com variedade; um método de avaliação de sobreposição de variedade para as consultas por similaridade. As propriedades desses resultados permitem usar as técnicas desenvolvidas para apoiar a propriedade de variedade nas consultas aos k-vizinhos mais próximos em Sistemas de Gerenciamento de Bases de Dados / The data being collected and generated nowadays increases not only in volume, but also in complexity, leading to the need of new query operators. Similarity queries are one of the most pursued resources to retrieve complex data. The most studied operators to perform similarity are the Range Query (\'R IND.q\') and the k-Nearest Neighbor Query (k-\'NN IND. q\'). Until recently, those queries were not available in the Database Management Systems. Now they are starting to become available, but since its earliest applications to develop real systems, it became clear that the basic similarity query operators are not enough to meet the requirements of the target applications. Therefore, new variations and extensions to the basic operators are being studied, although every work up to now is only pursuing the requirements of specific application domains. Furthermore, the following issues are directly impacting their acceptance by users and therefore its usability: (i) the basic operators are not expressive in real situations, (ii) the result-set cardinality tends to be large, imposing to the user the need to analyze to many elements, and (iii) the results do not always meet the users interest, resulting in the reformulation and adjustment of the queries. The goal of this dissertation is the development of a novel technique to enable a degree of variety the answers of k-nearest neighbor queries in metric spaces, investigating aspects of diversity in extensions of the basic operators using only the properties of metric spaces, never requesting extra information from the user. In this monograph, we present: the formalization of the variety model that allows to support diversity in similarity queries without requiring diversification parameters from the user; a greedy algorithm to obtain answers for similarity queries to the k-nearest neighbors with variety; an evaluation method to assess the diversification ratio existing on a subset of elements in metric space. The properties of those results allow using our proposed techniques to support variety in k-nearest neighbor queries in Database Management Systems
|
18 |
Operações de consulta por similaridade em grandes bases de dados complexos / Similarity search operations in large complex databasesMaria Camila Nardini Barioni 04 September 2006 (has links)
Os Sistemas de Gerenciamento de Bases de Dados (SGBD) foram desenvolvidos para armazenar e recuperar de maneira eficiente dados formados apenas por números ou cadeias de caracteres. Entretanto, nas últimas décadas houve um aumento expressivo, não só da quantidade, mas da complexidade dos dados manipulados em bases de dados, dentre eles os de natureza multimídia (como imagens, áudio e vídeo), informações geo-referenciadas, séries temporais, entre outros. Assim, surgiu a necessidade do desenvolvimento de novas técnicas que permitam a manipulação eficiente de tipos de dados complexos. Para atender às buscas necessárias às aplicações de base de dados modernas é preciso que os SGBD ofereçam suporte para buscas por similaridade ? consultas que realizam busca por objetos da base similares a um objeto de consulta, de acordo com uma certa medida de similaridade. Outro fator importante que veio contribuir para a necessidade de suportar a realização de consultas por similaridade em SGBD está relacionado à integração de técnicas de mineração de dados. É fundamental para essa integração o fornecimento de recursos pelos SGBD que permitam a realização de operações básicas para as diversas técnicas de mineração de dados existentes. Uma operação básica para várias dessas técnicas, tais como a técnica de detecção de agrupamentos de dados, é justamente o cálculo de medidas de similaridade entre pares de objetos de um conjunto de dados. Embora haja necessidade de fornecer suporte para a realização desse tipo de consultas em SGBD, o atual padrão da linguagem SQL não prevê a realização de consultas por similaridade. Esta tese pretende contribuir para o fornecimento desse suporte, incorporando ao SQL recursos capazes de permitir a realização de operações de consulta por similaridade sobre grandes bases de dados complexos de maneira totalmente integrada com os demais recursos da linguagem / Database Management Systems (DBMS) were developed to store and efficiently retrieve only data composed by numbers and small strings. However, over the last decades, there was an expressive increase in the volume and complexity of the data being managed, such as multimedia data (images, audio tracks and video), geo-referenced information and time series. Thus, the need to develop new techniques that allow the efficient handling of complex data types also increased. In order to support these data and the corresponding applications, the DBMS needs to support similarity queries, i.e., queries that search for objects similar to a query object according to a similarity measure. The need to support similarity queries in DBMS is also related to the integration of data mining techniques, which requires the DBMS acting as the provider for resources that allow the execution of basic operations for several existing data mining techniques. A basic operation for several of these techniques, such as clustering detection, is again the computation of similarity measures among pairs of objects of a data set. Although there is a need to execute these kind of queries in DBMS, the SQL standard does not allow the specification of similarity queries. Hence, this thesis aims at contributing to support such queries, integrating to the SQL the resources capable to execute similarity query operations over large sets of complex data
|
19 |
Comparison and Tracking Methods for Interactive Visualization of Topological Structures in Scalar FieldsSaikia, Himangshu January 2017 (has links)
Scalar fields occur quite commonly in several application areas in both static and time-dependent forms. Hence a proper visualization of scalar fieldsneeds to be equipped with tools to extract and focus on important features of the data. Similarity detection and pattern search techniques in scalar fields present a useful way of visualizing important features in the data. This is done by isolating these features and visualizing them independently or show all similar patterns that arise from a given search pattern. Topological features are ideal for this purpose of isolating meaningful patterns in the data set and creating intuitive feature descriptors. The Merge Tree is one such topological feature which has characteristics ideally suited for this purpose. Subtrees of merge trees segment the data into hierarchical regions which are topologically defined. This kind of feature-based segmentation is more intelligent than pure data based segmentations involving windows or bounding volumes. In this thesis, we explore several different techniques using subtrees of merge trees as features in scalar field data. Firstly, we begin with a discussion on static scalar fields and devise techniques to compare features - topologically segmented regions given by the subtrees of the merge tree - against each other. Second, we delve into time-dependent scalar fields and extend the idea of feature comparison to spatio-temporal features. In this process, we also come up with a novel approach to track features in time-dependent data considering the entire global network of likely feature associations between consecutive time steps.The highlight of this thesis is the interactivity that is enabled using these feature-based techniques by the real-time computation speed of our algorithms. Our techniques are implemented in an open-source visualization framework Inviwo and are published in several peer-reviewed conferences and journals. / <p>QC 20171020</p>
|
20 |
FREDDYGünther, Michael 25 February 2020 (has links)
Word embeddings are useful in many tasks in Natural Language Processing and Information Retrieval, such as text mining and classification, sentiment analysis, sentence completion, or dictionary construction. Word2vec and its predecessor fastText, both well-known models to produce word embeddings, are powerful techniques to study the syntactic and semantic relations between words by representing them in a low-dimensional vector. By applying algebraic operations on these vectors semantic relationships such as word analogies, gender-inflections, or geographical relationships can be easily recovered. The aim of this work is to investigate how word embeddings could be utilized to augment and enrich queries in DBMSs, e.g. to compare text values according to their semantic relation or to group rows according to the similarity of their text values. For this purpose, we use pre-trained word embedding models of large text corpora such as Wikipedia. By exploiting this external knowledge during query processing we are able to apply inductive reasoning on text values. Thereby, we reduce the demand for explicit knowledge in database systems. In the context of the IMDB database schema, this allows for example to query movies that are semantically close to genres such as historical fiction or road movie without maintaining this information. Another example query is sketched in Listing 1, that returns the top-3 nearest neighbors (NN) of each movie in IMDB. Given the movie “Godfather” as input this results in “Scarface”, “Goodfellas” and “Untouchables”.
|
Page generated in 0.045 seconds