1.
Distributed high-dimensional similarity search with music information retrieval applications. Faghfouri, Aidin, 29 August 2011.
Today, the advent of networking technologies and computer hardware has enabled ever more inexpensive PCs, mobile devices, smart phones, PDAs, sensors and cameras to be linked to the Internet with better connectivity. In recent years, we have witnessed the emergence of several distributed applications that provide infrastructures for social interaction over large-scale wide-area networks and facilitate the ways users share and publish data. User-generated data today range from simple text files to (semi-)structured documents and multimedia content. With the emergence of the Semantic Web, the number of features (associated with a piece of content) that are used to index these large amounts of heterogeneous data is growing dramatically. The feature set associated with each content type can grow continuously as we discover new ways of describing content in formulated terms.
As the number of dimensions in the feature data grows (to as high as 100 to 1000), it becomes harder and harder to search for information in a dataset due to the curse of dimensionality, and naive search methods become inappropriate, as their performance degrades to that of a linear scan. As an alternative, we can distribute the content and the query-processing load across a set of peers in a distributed Peer-to-Peer (P2P) network and incorporate high-dimensional distributed search techniques to attack the problem.
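A small numeric illustration of the distance-concentration effect the abstract appeals to (a sketch, not part of the thesis): as dimensionality grows, the gap between the nearest and the farthest random neighbor shrinks, which is what defeats naive pruning.

```python
# Distance concentration: the nearest/farthest distance ratio approaches 1
# as dimensionality grows, for points drawn uniformly at random.
import math
import random

random.seed(0)
for dim in (2, 10, 100, 1000):
    query = [random.random() for _ in range(dim)]
    dists = [math.dist(query, [random.random() for _ in range(dim)])
             for _ in range(500)]
    print(dim, round(min(dists) / max(dists), 3))
```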
Currently, a large percentage of Internet traffic consists of video and music files shared and exchanged over P2P networks. In most current services, searching for music is performed through keyword search and naive string-matching algorithms, combined with collaborative filtering techniques that mostly rely on tag-based approaches. In music information retrieval (MIR) systems, the main goal is to make recommendations similar to the music that the user listens to. In these systems, techniques based on acoustic feature extraction can be employed to achieve content-based music similarity search (i.e., searching through music based on what can be heard in the track). Using these techniques, we can devise an automated measure of similarity that replaces the human experts (or users) who assign descriptive genre tags and metadata to each recording, and thus solve the famous cold-start problem associated with collaborative filtering.
In this work we explore the advantages of distributed structures by efficiently distributing the content features and the query-processing load over the peers in a P2P network. Using a family of Locality-Sensitive Hashing (LSH) functions based on p-stable distributions, we propose an efficient, scalable and load-balanced system capable of answering K-Nearest-Neighbor (KNN) and Range queries. We also propose a new load-balanced indexing algorithm and evaluate it using our Java-based simulator.
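As a concrete reference point, here is a minimal sketch of one hash function from the p-stable LSH family the abstract builds on (Datar et al.'s construction); the bucket width w is an arbitrary tuning choice here, not a value from the thesis.

```python
# One p-stable LSH function: h(v) = floor((a.v + b) / w), with a drawn from a
# Gaussian (2-stable) distribution, so nearby points in Euclidean space are
# more likely to share a bucket.
import math
import random

random.seed(1)

def make_hash(dim, w=4.0):
    a = [random.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable projection
    b = random.uniform(0.0, w)                        # random offset in [0, w)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

h = make_hash(dim=3)
print(h([0.1, 0.2, 0.3]), h([0.11, 0.19, 0.31]))  # close points: likely equal
```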
Our results show that this P2P design ensures load balancing and guarantees a logarithmic number of hops for query processing. Our system extends to all types of multi-dimensional feature data, and it can also be employed as the main indexing scheme of a multipurpose recommendation system. / Graduate
2.
Protein Function Prediction Based on Sequence and Structure Information. Smaili, Fatima Z., 25 May 2016.
The number of available protein sequences in public databases is increasing exponentially. However, a significant fraction of these sequences lack the functional annotation that is essential to our understanding of how biological systems and processes operate. In this master's thesis project, we worked on inferring protein functions based on the primary protein sequence. In the approach we follow, 3D models are first constructed using I-TASSER. Functions are then deduced by structurally matching these predicted models, using global and local similarities, against three independent Enzyme Commission (EC) and Gene Ontology (GO) function libraries. The method was tested on 250 “hard” proteins, which lack homologous templates in both structure and function libraries. The results show that this method outperforms conventional prediction methods based on sequence similarity or threading. Additionally, our method could be improved even further by incorporating protein-protein interaction information. Overall, the method we use provides an efficient approach for the automated functional annotation of non-homologous proteins, starting from their sequence.
3.
Efficient Semantic-based Content Search in P2P Network. Shen, Heng Tao; Shu, Yan Feng; Yu, Bei, 01 1900.
Most existing Peer-to-Peer (P2P) systems support only title-based searches and are limited in functionality when compared to today's search engines. In this paper, we present the design of a distributed P2P information-sharing system that supports semantic-based content searches of relevant documents. First, we propose a general and extensible framework for searching for similar documents in a P2P network. The framework is based on the novel concept of the Hierarchical Summary Structure. Second, based on the framework, we develop our efficient document-searching system by effectively summarizing and maintaining all documents within the network at different granularities. Finally, an experimental study is conducted on a real P2P prototype, and a large-scale network is further simulated. The results show the effectiveness, efficiency and scalability of the proposed system. / Singapore-MIT Alliance (SMA)
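A hedged sketch of summary-based routing in this spirit (the paper's Hierarchical Summary Structure is richer than this toy version): each peer summarizes its documents with a centroid, a query is first matched against peer summaries, and only the closest peers are searched document by document.

```python
# Two-level summary search: peer-level centroids prune the network before
# any per-document comparison. Peers and vectors are toy data.
import math

def centroid(vectors):
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

peers = {
    "peer1": [[1.0, 0.0], [0.9, 0.1]],   # toy document vectors
    "peer2": [[0.0, 1.0], [0.1, 0.9]],
}
summaries = {p: centroid(docs) for p, docs in peers.items()}

def search(query, top_peers=1):
    ranked = sorted(summaries, key=lambda p: math.dist(query, summaries[p]))
    results = []
    for p in ranked[:top_peers]:          # route only to the closest peers
        results += sorted(peers[p], key=lambda d: math.dist(query, d))
    return results

print(search([1.0, 0.1]))  # documents from peer1 first
```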
4.
An ID-Tree Index Strategy for Information Filtering in Web-Based Systems. Wang, Yi-Siang, 10 July 2006.
With the booming development of the WWW, many search engines have been developed to help users find useful information in vast quantities of data. However, users may have different needs in different situations. In contrast to Information Retrieval, where users actively retrieve data, Information Filtering (IF) pushes information from servers to passive users through broadcast media rather than having users search for it. Each user therefore has a profile stored in the database, where a profile records a set of interest items that represent the user's interests or habits. To efficiently store many user profiles on servers and filter out irrelevant users, many signature-based index techniques are applied in IF systems. By using signatures, IF does not need to compare every item of every profile to filter out irrelevant ones. However, because signatures carry incomplete information about the profiles, it is very hard to answer complex queries using the signatures alone. A critical issue for a signature-based IF service is therefore how to index the signatures of user profiles for an efficient filtering process. There are two common types of queries in signature-based IF systems: inexact filtering and similarity search. In inexact filtering, a query is an incoming document, and the task is to find the profiles whose interest items are all included in the query. In similarity search, a query is a user profile, and the task is to find the users whose interest items are similar to those of the query user. In this thesis, we propose an ID-tree index strategy, which indexes signatures of user profiles by partitioning them into subgroups using a binary tree structure according to all of the items that differ among them. Basically, our ID-tree index strategy is a kind of signature tree. In an ID-tree, each path from the root to a leaf node is the signature of the profile pointed to by the leaf node. Because each profile is pointed to by exactly one leaf node of the ID-tree, there are no collisions in the structure; in other words, no two profiles are assigned the same signature. Moreover, only the items that differ among subgroups of profiles are checked at a time to filter out irrelevant profiles for queries. Therefore, our strategy can answer inexact filtering and similarity search queries while accessing fewer profiles than previous strategies, and building the index takes less time to batch-process a large number of database profiles. Our simulation results show that our strategy accesses fewer profiles to answer queries than Chen's signature tree strategy for inexact filtering and Aggarwal et al.'s SG-table strategy for similarity search.
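To make the filtering model concrete, here is a hedged sketch (not the ID-tree itself) of the signature subset test that inexact filtering relies on: a profile can only match an incoming document if every bit its signature sets is also set by the document's signature, after which the surviving candidates are verified against the real item sets.

```python
# Signature-based inexact filtering: superimposed-coding signatures admit
# false positives but never false negatives, so a cheap bitwise test prunes
# profiles before an exact subset check. Profiles and items are toy data.
def signature(items, bits=64):
    sig = 0
    for item in items:
        sig |= 1 << (hash(item) % bits)
    return sig

profiles = {"alice": {"jazz", "piano"}, "bob": {"rock", "guitar"}}
doc_items = {"jazz", "piano", "concert"}
doc_sig = signature(doc_items)

candidates = [name for name, items in profiles.items()
              if signature(items) & ~doc_sig == 0]  # all profile bits covered
matches = [n for n in candidates if profiles[n] <= doc_items]  # verify
print(matches)  # ['alice']
```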
5.
Generative models meet similarity search: efficient, heuristic-free and robust retrieval. Doan, Khoa Dang, 23 September 2021.
The rapid growth of digital data, especially visual and textual content, brings many challenges to the problem of finding similar data. Exact similarity search, which aims to exhaustively find all relevant items through a linear scan of a dataset, is impractical due to its high computational complexity. Approximate-nearest-neighbor (ANN) search methods, especially learning-to-hash (or hashing) methods, provide principled approaches that balance the trade-off between the quality of the answers and the computational cost for web-scale databases. In this era of data explosion, it is crucial for hashing methods to be both computationally efficient and robust to various scenarios, such as when the application faces noisy data or data that changes slightly over time (i.e., out-of-distribution data).
This thesis focuses on the development of practical generative learning-to-hash methods and explainable retrieval models. We first identify and discuss the various ways the framework of generative modeling can be used to improve the model design and generalization of hashing methods. We then show that these generative hashing methods enjoy several appealing empirical and theoretical properties of generative modeling. Specifically, the proposed generative hashing models generalize better, with important properties such as low sample requirements and robustness to out-of-distribution and corrupted data. Finally, in domains with structured data such as graphs, we show that the computational methods of generative modeling have an interesting utility beyond estimating the data distribution, and we describe a retrieval framework that can explain its decisions by borrowing the algorithmic ideas developed in these methods.
Two subsets of generative hashing methods and a subset of explainable retrieval methods are proposed. For the first hashing subset, we propose a novel adversarial framework that can be easily adapted to a new problem domain, together with three training algorithms that learn the hash functions without several hyperparameters commonly found in previous hashing methods. The contributions of our work include: (1) novel algorithms, based on adversarial learning, to learn the hash functions; (2) computationally efficient Wasserstein-related adversarial approaches with low computational and sample complexity; (3) extensive experiments on several benchmark datasets in various domains, including computational advertising and text and image retrieval, for performance evaluation. For the second hashing subset, we propose energy-based hashing solutions that improve the generalization and robustness of existing hashing approaches. The contributions of our work for this task include: (1) data-synthesis solutions that improve the generalization of existing hashing methods; (2) energy-based hashing solutions that exhibit better robustness against out-of-distribution and corrupted data; (3) extensive experiments for performance evaluation on several benchmark datasets in the image retrieval domain.
Finally, for the last subset of explainable retrieval methods, we propose an optimal alignment algorithm that achieves a better similarity approximation for a pair of structured objects, such as graphs, while capturing the alignment between the nodes of the graphs to explain the similarity calculation. The contributions of our work for this task include: (1) a novel optimal alignment algorithm for comparing two sets of bag-of-vectors embeddings; (2) a differentiable computation to learn the parameters of the proposed optimal alignment model; (3) extensive experiments, evaluating both the similarity-approximation task and the retrieval task, on several benchmark graph datasets. / Doctor of Philosophy / Searching for similar items, or similarity search, is one of the fundamental tasks of this information age, especially given the rapid growth of visual and textual content. For example, in a search engine such as Google, a user searches for images with content similar to a reference image; in online advertising, an advertiser finds new users, and eventually targets them with advertisements, where the new users have profiles similar to some reference users who previously responded positively to the same or similar advertisements; in the chemical domain, scientists search for proteins with a structure similar to a reference protein. Practical search applications in these domains face several challenges, especially when the datasets or databases contain a large number (e.g., millions or even billions) of complex structured items (e.g., texts, images, and graphs). These challenges can be organized into three central themes: search efficiency (the economical use of resources such as computation and time), model-design effort (the ease of building the search model), and explainability (the ability of a search model to explain its results, which is increasingly a requirement, especially in scientific domains where the items are structured objects such as graphs).
This dissertation tackles the aforementioned challenges in practical search applications by using computational techniques that learn to generate data. First, we overcome the need to scan an entire large dataset for similar items by considering an approximate similarity search technique called hashing. We then propose an unsupervised hashing framework that learns the hash functions, with simpler objective functions, directly from raw data. The proposed retrieval framework can be adapted to new domains with significantly lower model-design effort. When labeled data is available but limited (a common scenario in practical search applications), we propose a hashing network that can synthesize additional data to improve the hash-function learning process. The learned model also exhibits significant robustness against data corruption and slight changes in the underlying data. Finally, in domains with structured data such as graphs, we propose a computational approach that can simultaneously estimate the similarity of structured objects and capture the alignment between their substructures, e.g., nodes. The alignment mechanism can help explain why two objects are similar or dissimilar. This is a useful tool for domain experts who not only want to search for similar items but also want to understand how the search model makes its predictions.
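For orientation, the sketch below shows the generic learning-to-hash retrieval pipeline that work like this improves on; it uses random sign projections in place of the dissertation's learned (adversarial or energy-based) hash functions, so it is a baseline illustration only.

```python
# Baseline binary hashing: map vectors to short sign codes, then retrieve by
# a cheap Hamming-distance scan. Learned hashing replaces the random planes.
import random

random.seed(2)
DIM, BITS = 8, 16
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def hash_code(v):
    return tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0)
                 for plane in planes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

db = [[random.random() for _ in range(DIM)] for _ in range(100)]
codes = [hash_code(v) for v in db]
query_code = hash_code(db[0])
best = min(range(len(db)), key=lambda i: hamming(codes[i], query_code))
print(best)  # 0: the item identical to the query has Hamming distance 0
```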
6.
Framställning av en GIS-metod samt analys av ingående parametrar för att lokalisera representativa delområden av ett avrinningsområde för snödjupsmätningar / Development of a GIS method and analysis of input parameters to locate representative sub-areas of a catchment area for snow depth measurements. Kaplin, Jennifer; Leierdahl, Lisa, January 2022.
Hydropower is a major source of energy in Sweden, mainly in the northern parts of the country. To get the maximum potential from the hydropower plants, information is required on how much water or snow there is upstream of the power plants. By obtaining reliable values for the amount of snow, it is possible to reduce the uncertainty in spring-flood forecasts. Because it is difficult, both practically and economically, to map larger catchment areas via ground-level observations, drone observations have been developed. In order to use drones, knowledge is required of where they should be flown so that the entire catchment area is represented. In this project, a model was developed in ArcGIS to find smaller areas within catchments that are representative with respect to selected parameters. The project considers the parameters vegetation, elevation, slope and aspect. The work to develop a model that will facilitate future work within and outside the DRONES research project is divided into two parts. The first part is to analyze which parameters affect the snow depth in the catchment area. The second part consists of creating a model in ArcGIS that finds a smaller area inside a catchment that represents the snow depth for the whole catchment. The results from the developed model can be applied to facilitate mapping and snow depth measurements in catchment areas, which can be used to streamline water regulation.
7.
Large scale optimization methods for metric and kernel learning. Jain, Prateek, 06 November 2014.
A large number of machine learning algorithms critically depend on the underlying distance/metric/similarity function. Learning an appropriate distance function is therefore crucial to the success of many methods. The class of distance functions that can be learned accurately is characterized by the amount and type of supervision available to the particular application. In this thesis, we explore a variety of such distance learning problems using different amounts and types of supervision, and we provide efficient and scalable algorithms to learn appropriate distance functions for each of these problems. First, we propose a generic regularized framework for Mahalanobis metric learning and prove that, for a wide variety of regularization functions, metric learning can be used to efficiently learn a kernel function incorporating the available side-information. Furthermore, we provide a method for fast nearest neighbor search using the learned distance/kernel function. We show that a variety of existing metric learning methods are special cases of our general framework; hence, our framework also provides a kernelization scheme and a fast similarity search scheme for such methods. Second, we consider a variation of our standard metric learning framework where the side-information is incremental, streaming and cannot be stored. For this problem, we provide an efficient online metric learning algorithm that compares favorably to existing methods both theoretically and empirically. Next, we consider a contrasting scenario where the amount of supervision provided is extremely small compared to the number of training points. For this problem, we consider two different modeling assumptions: 1) the data lies on a low-dimensional linear subspace; 2) the data lies on a low-dimensional non-linear manifold. The first assumption leads, in particular, to the problem of matrix rank minimization over polyhedral sets, a problem of immense interest in numerous fields including optimization, machine learning, computer vision, and control theory. We propose a novel online-learning-based optimization method for the rank minimization problem and provide provable approximation guarantees for it. The second assumption leads to our geometry-aware metric/kernel learning formulation, where we jointly model the metric/kernel over the data along with the underlying manifold. We provide an efficient alternating minimization algorithm for this problem and demonstrate its wide applicability and effectiveness by applying it to various machine learning tasks such as semi-supervised classification, colored dimensionality reduction, and manifold alignment. Finally, we consider the task of learning distance functions with no supervision, which we cast as the problem of learning disparate clusterings of the data. To this end, we propose a discriminative approach and a generative-model-based approach, and we provide efficient algorithms with convergence guarantees for both.
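To ground the terminology, a worked sketch of the Mahalanobis distance at the core of such frameworks follows: d_A(x, y) = sqrt((x - y)^T A (x - y)) for a positive semidefinite matrix A, where A = I recovers the Euclidean distance; the A below is illustrative, not a learned one.

```python
# Mahalanobis distance under a (here hand-picked) positive definite matrix A.
# Metric learning fits A from side-information; this only shows the formula.
import math

def mahalanobis(x, y, A):
    d = [xi - yi for xi, yi in zip(x, y)]
    Ad = [sum(A[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * adi for di, adi in zip(d, Ad)))

A = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical learned PSD matrix
print(mahalanobis([1.0, 2.0], [3.0, 0.0], A))  # sqrt(8) ~ 2.83
```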
8.
Classificação de úlceras venosas dermatológicas para apoio a consultas por similaridade utilizando superpixels e aprendizado profundo / Classification of venous dermatological ulcers to support similarity queries using superpixels and deep learning. Blanco, Gustavo, 01 April 2019.
Content-based Image Retrieval (CBIR) systems have been increasingly used in many image processing and analysis applications for two reasons: CBIR is a procedure that can be performed automatically, allowing the large volumes of images acquired in hospitals to be handled, and it is also the basis for processing similarity queries. In the medical context, such systems assist in various tasks, from the training of professionals to Computer-Aided Diagnosis (CAD) systems. A computer system capable of comparing and classifying images obtained from patient exams using a prior knowledge base could expedite the care of the population and provide specialists with relevant information quickly and simply. In this study, the focus was on the analysis of images of venous ulcers, and two techniques were developed to classify these images. The first, called Counting-Labels Similarity Measure (CL-Measure), has the advantage of dealing with images automatically segmented into superpixels, and it is versatile enough to allow adaptation to other domains. The main idea of CL-Measure is to create sub-images based on a previous classification, compute the distance between them, and aggregate the partial distances through an appropriate function. The second technique, called Quality of Tissues from Dermatological Ulcers (QTDU), makes use of convolutional neural networks (CNNs) for superpixel labeling, with the advantage of encompassing the whole process of feature identification and classification, removing the need to identify which feature extractor is best suited to the context at hand. Experiments carried out on the analyzed image database, using 179,572 superpixels divided into 4 classes, indicate that QTDU is the most effective approach to date for the classification of dermatological ulcer images, with averages of AUC = 0.986, sensitivity = 0.97, and specificity = 0.974, surpassing previous machine-learning approaches by 11.7% and 8.2% for the Kappa and F-Measure coefficients, respectively.
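As an illustration of the superpixel-plus-classifier pipeline, here is a hedged sketch (not the QTDU network): SLIC superpixels feed a placeholder labeler, and the four tissue class names are only assumed for the example.

```python
# Superpixel labeling skeleton: SLIC segments the image, one feature vector
# is built per superpixel, and a stand-in classifier assigns tissue labels
# (QTDU would instead run each superpixel through a trained CNN).
import numpy as np
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()  # stand-in for a wound photograph
segments = slic(image, n_segments=200, compactness=10, start_label=0)

# Mean RGB color per superpixel as a toy feature vector.
features = np.array([image[segments == s].mean(axis=0)
                     for s in range(segments.max() + 1)])

def classify_tissue(feature_vec):
    """Hypothetical 4-class labeler; a placeholder for a trained model."""
    classes = ["granulation", "fibrin", "necrosis", "non-wound"]
    return classes[int(feature_vec.sum()) % 4]

labels = [classify_tissue(f) for f in features]
print(labels[:5])
```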
9.
Explorando variedade em consultas por similaridade / Investigating variety in similarity queries. Santos, Lúcio Fernandes Dutra, 26 October 2012.
The data being collected and generated nowadays increases not only in volume but also in complexity, leading to the need for new query operators. Similarity queries are one of the most pursued resources to retrieve complex data. The most studied operators to perform similarity search are the range query (Rq) and the k-nearest neighbor query (k-NNq). Until recently, those queries were not available in Database Management Systems. Now that they are starting to become available, it has become clear from their earliest use in real systems that the basic similarity query operators are not enough to meet the requirements of the target applications. Therefore, new variations of and extensions to the basic operators are being studied, although every work up to now pursues only the requirements of specific application domains. Furthermore, the following issues directly impact their acceptance by users and therefore their usability: (i) the basic operators are not expressive in real situations; (ii) the result-set cardinality tends to be large, imposing on the user the need to analyze too many elements; and (iii) the results do not always meet the user's interest, resulting in frequent reformulation and adjustment of the queries. The goal of this dissertation is the development of a novel technique to introduce a degree of variety into the answers of k-nearest neighbor queries in metric spaces, investigating aspects of diversity in extensions of the basic operators using only the properties of metric spaces, never requesting extra information from the user. In this monograph, we present: the formalization of a variety model that supports diversity in similarity queries without requiring diversification parameters from the user; a greedy algorithm to answer k-nearest neighbor queries with variety; and an evaluation method to assess the diversification ratio of a subset of elements in a metric space. These results allow the proposed techniques to be used to support variety in k-nearest neighbor queries in Database Management Systems.
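As a simplified illustration of answering k-nearest-neighbor queries with variety (a sketch under assumptions: the dissertation derives its separation criterion from the metric space itself without user parameters, whereas this toy version exposes an explicit min_sep threshold):

```python
# Greedy diversified k-NN: visit candidates by distance to the query, keep a
# candidate only if it is farther than min_sep from every kept element.
import math

def diverse_knn(query, dataset, k, min_sep, dist=math.dist):
    answer = []
    for c in sorted(dataset, key=lambda x: dist(query, x)):
        if all(dist(c, r) > min_sep for r in answer):
            answer.append(c)
        if len(answer) == k:
            break
    return answer

points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (9, 1)]
print(diverse_knn((0, 0), points, k=3, min_sep=1.0))
# [(0, 0), (5, 5), (9, 1)]: near-duplicates of kept results are skipped
```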
10.
Operações de consulta por similaridade em grandes bases de dados complexos / Similarity search operations in large complex databases. Barioni, Maria Camila Nardini, 04 September 2006.
Database Management Systems (DBMS) were developed to store and efficiently retrieve only data composed of numbers and small strings. However, over the last decades there has been an expressive increase in both the volume and the complexity of the data being managed, such as multimedia data (images, audio tracks and video), geo-referenced information and time series. Thus, the need to develop new techniques that allow the efficient handling of complex data types also increased. In order to support these data and the corresponding applications, the DBMS needs to support similarity queries, i.e., queries that search for objects similar to a query object according to a similarity measure. The need to support similarity queries in DBMS is also related to the integration of data mining techniques, which requires the DBMS to provide resources that allow the execution of basic operations for several existing data mining techniques. A basic operation for several of these techniques, such as cluster detection, is again the computation of similarity measures among pairs of objects of a data set. Although there is a need to execute this kind of query in DBMS, the SQL standard does not allow the specification of similarity queries. Hence, this thesis aims at contributing to the support of such queries, integrating into SQL the resources needed to execute similarity query operations over large sets of complex data, fully integrated with the other resources of the language.
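For reference, a minimal sketch of the two basic similarity operators that such SQL extensions expose (a plain in-memory illustration; the metric and data are arbitrary choices here):

```python
# Range query: every object within `radius` of the query object.
# k-NN query: the k objects closest to the query object.
import heapq
import math

def range_query(query, dataset, radius, dist=math.dist):
    return [x for x in dataset if dist(query, x) <= radius]

def knn_query(query, dataset, k, dist=math.dist):
    return heapq.nsmallest(k, dataset, key=lambda x: dist(query, x))

points = [(0, 0), (1, 1), (2, 2), (5, 5)]
print(range_query((0, 0), points, radius=2.0))  # [(0, 0), (1, 1)]
print(knn_query((0, 0), points, k=2))           # [(0, 0), (1, 1)]
```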