Global ETD Search

1	Learning to hash for large scale image retrieval Moran, Sean James January 2016 (has links) This thesis is concerned with improving the effectiveness of nearest neighbour search. Nearest neighbour search is the problem of finding the most similar data-points to a query in a database, and is a fundamental operation that has found wide applicability in many fields. In this thesis the focus is placed on hashing-based approximate nearest neighbour search methods that generate similar binary hashcodes for similar data-points. These hashcodes can be used as the indices into the buckets of hashtables for fast search. This work explores how the quality of search can be improved by learning task specific binary hashcodes. The generation of a binary hashcode comprises two main steps carried out sequentially: projection of the image feature vector onto the normal vectors of a set of hyperplanes partitioning the input feature space followed by a quantisation operation that uses a single threshold to binarise the resulting projections to obtain the hashcodes. The degree to which these operations preserve the relative distances between the datapoints in the input feature space has a direct influence on the effectiveness of using the resulting hashcodes for nearest neighbour search. In this thesis I argue that the retrieval effectiveness of existing hashing-based nearest neighbour search methods can be increased by learning the thresholds and hyperplanes based on the distribution of the input data. The first contribution is a model for learning multiple quantisation thresholds. I demonstrate that the best threshold positioning is projection specific and introduce a novel clustering algorithm for threshold optimisation. The second contribution extends this algorithm by learning the optimal allocation of quantisation thresholds per hyperplane. In doing so I argue that some hyperplanes are naturally more effective than others at capturing the distribution of the data and should therefore attract a greater allocation of quantisation thresholds. The third contribution focuses on the complementary problem of learning the hashing hyperplanes. I introduce a multi-step iterative model that, in the first step, regularises the hashcodes over a data-point adjacency graph, which encourages similar data-points to be assigned similar hashcodes. In the second step, binary classifiers are learnt to separate opposing bits with maximum margin. This algorithm is extended to learn hyperplanes that can generate similar hashcodes for similar data-points in two different feature spaces (e.g. text and images). Individually the performance of these algorithms is often superior to competitive baselines. I unify my contributions by demonstrating that learning hyperplanes and thresholds as part of the same model can yield an additive increase in retrieval effectiveness. 006.3
2	New LSH-based Algorithm for Approximate Nearest Neighbor Andoni, Alexandr, Indyk, Piotr 04 November 2005 (has links) We present an algorithm for c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time ofO(dn^{1/c^2+o(1)}) and space O(dn + n^{1+1/c^2+o(1)}). AI locality sensitive hashing nearest neighbor high dimensions
3	Salient Index for Similarity Search Over High Dimensional Vectors Lu, Yangdi January 2018 (has links) The approximate nearest neighbor(ANN) search over high dimensional data has become an unavoidable service for online applications. Fast and high-quality results of unknown queries are the largest challenge that most algorithms faced with. Locality Sensitive Hashing(LSH) is a well-known ANN search algorithm while suffers from inefficient index structure, poor accuracy in distributed scheme. The traditional index structures have most significant bits(MSB) problem, which is their indexing strategies have an implicit assumption that the bits from one direction in the hash value have higher priority. In this thesis, we propose a new content-based index called Random Draw Forest(RDF), which not only uses an adaptive tree structure by applying the dynamic length of compound hash functions to meet the different cardinality of data, but also applies the shuffling permutations to solve the MSB problem in the traditional LSH-based index. To raise the accuracy in the distributed scheme, we design a variable steps lookup strategy to search the multiple step sub-indexes which are most likely to hold the mistakenly partitioned similar objects. By analyzing the index, we show that RDF has a higher probability to retrieve the similar objects compare to the original index structure. In the experiment, we first learn the performance of different hash functions, then we show the effect of parameters in RDF and the performance of RDF compared with other LSH-based methods to meet the ANN search. / Thesis / Master of Science (MSc) approximate nearest neighbor locality sensitive hashing index structure searching strategy
4	Large scale optimization methods for metric and kernel learning Jain, Prateek 06 November 2014 (has links) A large number of machine learning algorithms are critically dependent on the underlying distance/metric/similarity function. Learning an appropriate distance function is therefore crucial to the success of many methods. The class of distance functions that can be learned accurately is characterized by the amount and type of supervision available to the particular application. In this thesis, we explore a variety of such distance learning problems using different amounts/types of supervision and provide efficient and scalable algorithms to learn appropriate distance functions for each of these problems. First, we propose a generic regularized framework for Mahalanobis metric learning and prove that for a wide variety of regularization functions, metric learning can be used for efficiently learning a kernel function incorporating the available side-information. Furthermore, we provide a method for fast nearest neighbor search using the learned distance/kernel function. We show that a variety of existing metric learning methods are special cases of our general framework. Hence, our framework also provides a kernelization scheme and fast similarity search scheme for such methods. Second, we consider a variation of our standard metric learning framework where the side-information is incremental, streaming and cannot be stored. For this problem, we provide an efficient online metric learning algorithm that compares favorably to existing methods both theoretically and empirically. Next, we consider a contrasting scenario where the amount of supervision being provided is extremely small compared to the number of training points. For this problem, we consider two different modeling assumptions: 1) data lies on a low-dimensional linear subspace, 2) data lies on a low-dimensional non-linear manifold. The first assumption, in particular, leads to the problem of matrix rank minimization over polyhedral sets, which is a problem of immense interest in numerous fields including optimization, machine learning, computer vision, and control theory. We propose a novel online learning based optimization method for the rank minimization problem and provide provable approximation guarantees for it. The second assumption leads to our geometry-aware metric/kernel learning formulation, where we jointly model the metric/kernel over the data along with the underlying manifold. We provide an efficient alternating minimization algorithm for this problem and demonstrate its wide applicability and effectiveness by applying it to various machine learning tasks such as semi-supervised classification, colored dimensionality reduction, manifold alignment etc. Finally, we consider the task of learning distance functions under no supervision, which we cast as a problem of learning disparate clusterings of the data. To this end, we propose a discriminative approach and a generative model based approach and we provide efficient algorithms with convergence guarantees for both the approaches. / text Rank minimization Metric learning Kernel learning Fast similarity search Locality sensitive hashing
5	Distributed Local Outlier Factor with Locality-Sensitive Hashing Zheng, Lining 08 November 2019 (has links) Outlier detection remains a heated area due to its essential role in a wide range of applications, including intrusion detection, fraud detection in finance, medical diagnosis, etc. Local Outlier Factor (LOF) has been one of the most influential outlier detection techniques over the past decades. LOF has distinctive advantages on skewed datasets with regions of various densities. However, the traditional centralized LOF faces new challenges in the era of big data and no longer satisfies the rigid time constraints required by many modern applications, due to its expensive computation overhead. A few researchers have explored the distributed solution of LOF, but existant methods are limited by their grid-based data partitioning strategy, which falls short when applied to high-dimensional data. In this thesis, we study efficient distributed solutions for LOF. A baseline MapReduce solution for LOF implemented with Apache Spark, named MR-LOF, is introduced. We demonstrate its disadvantages in communication cost and execution time through complexity analysis and experimental evaluation. Then an approximate LOF method is proposed, which relies on locality-sensitive hashing (LSH) for partitioning data and enables fully distributed local computation. We name it MR-LOF-LSH. To further improve the approximate LOF, we introduce a process called cross-partition updating. With cross-partition updating, the actual global k-nearest neighbors (k-NN) of the outlier candidates are found, and the related information of the neighbors is used to update the outlier scores of the candidates. The experimental results show that MR-LOF achieves a speedup of up to 29 times over the centralized LOF. MR-LOF-LSH further reduces the execution time by a factor of up to 9.9 compared to MR-LOF. The results also highlight that MR-LOF-LSH scales well as the cluster size increases. Moreover, with a sufficient candidate size, MR-LOF-LSH is able to detect in most scenarios over 90% of the top outliers with the highest LOF scores computed by the centralized LOF algorithm. Outlier detection Distributed computing Apache Spark Locality-sensitive hashing Local Outlier Factor
6	A scalable metric learning based voting method for expression recognition Wan, Shaohua 09 October 2013 (has links) In this research work, we propose a facial expression classification method using metric learning-based k-nearest neighbor voting. To achieve accurate classification of a facial expression from frontal face images, we first learn a distance metric structure from training data that characterizes the feature space pattern, then use this metric to retrieve the nearest neighbors from the training dataset, and finally output the classification decision accordingly. An expression is represented as a fusion of face shape and texture. This representation is based on registering a face image with a landmarking shape model and extracting Gabor features from local patches around landmarks. This type of representation achieves robustness and effectiveness by using an ensemble of local patch feature detectors at a global shape level. A naive implementation of the metric learning-based k-nearest neighbor would incur a time complexity proportional to the size of the training dataset, which precludes this method being used with enormous datasets. To scale to potential larger databases, a similar approach to that in [24] is used to achieve an approximate yet efficient ML-based kNN voting based on Locality Sensitive Hashing (LSH). A query example is directly hashed to the bucket of a pre-computed hash table where candidate nearest neighbors can be found, and there is no need to search the entire database for nearest neighbors. Experimental results on the Cohn-Kanade database and the Moving Faces and People database show that both ML-based kNN voting and its LSH approximation outperform the state-of-the-art, demonstrating the superiority and scalability of our method. / text Facial expression recognition k Nearest Neighbor Metric learning Locality sensitive hashing
7	Application of locality sensitive hashing to feature matching and loop closure detection Shahbazi, Hossein Unknown Date No description available. visual slam locality sensitive hashing loop closure detection parameter selection nearest neighbor search
8	Locality Sensitive Indexing for Efficient High-Dimensional Query Answering in the Presence of Excluded Regions January 2016 (has links) abstract: Similarity search in high-dimensional spaces is popular for applications like image processing, time series, and genome data. In higher dimensions, the phenomenon of curse of dimensionality kills the effectiveness of most of the index structures, giving way to approximate methods like Locality Sensitive Hashing (LSH), to answer similarity searches. In addition to range searches and k-nearest neighbor searches, there is a need to answer negative queries formed by excluded regions, in high-dimensional data. Though there have been a slew of variants of LSH to improve efficiency, reduce storage, and provide better accuracies, none of the techniques are capable of answering queries in the presence of excluded regions. This thesis provides a novel approach to handle such negative queries. This is achieved by creating a prefix based hierarchical index structure. First, the higher dimensional space is projected to a lower dimension space. Then, a one-dimensional ordering is developed, while retaining the hierarchical traits. The algorithm intelligently prunes the irrelevant candidates while answering queries in the presence of excluded regions. While naive LSH would need to filter out the negative query results from the main results, the new algorithm minimizes the need to fetch the redundant results in the first place. Experiment results show that this reduces post-processing cost thereby reducing the query processing time. / Dissertation/Thesis / Masters Thesis Computer Science 2016 Computer science High-Dimensional Locality Sensitive Hashing Negative Queries Range Search Similarity Search
9	Optimization of the Mainzelliste software for fast privacy-preserving record linkage Rohde, Florens, Franke, Martin, Sehili, Ziad, Lablans, Martin, Rahm, Erhard 11 February 2022 (has links) Background: Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods: We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results: The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion: We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed a very high linkage quality for both plaintext as well as encoded data even in the presence of errors. The provided blocking methods provide order of magnitude improvements regarding runtime performance thus facilitating the use in research projects with large datasets and many participants. info:eu-repo/classification/ddc/000 ddc:000
10	Similarity Estimation with Non-Transitive LSH Lewis, Robert R. 29 September 2021 (has links) No description available. Computer Science Locality Sensitive Hashing Dice Coefficient Non-Transitive Hash Collision Relation Similarity Estimation LSHable

Search results