Global ETD Search

Return to search

Time Efficient and Quality Effective K Nearest Neighbor Search in High Dimension Space

abstract: K-Nearest-Neighbors (KNN) search is a fundamental problem in many application domains such as database and data mining, information retrieval, machine learning, pattern recognition and plagiarism detection. Locality sensitive hash (LSH) is so far the most practical approximate KNN search algorithm for high dimensional data. Algorithms such as Multi-Probe LSH and LSH-Forest improve upon the basic LSH algorithm by varying hash bucket size dynamically at query time, so these two algorithms can answer different KNN queries adaptively. However, these two algorithms need a data access post-processing step after candidates' collection in order to get the final answer to the KNN query. In this thesis, Multi-Probe LSH with data access post-processing (Multi-Probe LSH with DAPP) algorithm and LSH-Forest with data access post-processing (LSH-Forest with DAPP) algorithm are improved by replacing the costly data access post-processing (DAPP) step with a much faster histogram-based post-processing (HBPP). Two HBPP algorithms: LSH-Forest with HBPP and Multi- Probe LSH with HBPP are presented in this thesis, both of them achieve the three goals for KNN search in large scale high dimensional data set: high search quality, high time efficiency, high space efficiency. None of the previous KNN algorithms can achieve all three goals. More specifically, it is shown that HBPP algorithms can always achieve high search quality (as good as LSH-Forest with DAPP and Multi-Probe LSH with DAPP) with much less time cost (one to several orders of magnitude speedup) and same memory usage. It is also shown that with almost same time cost and memory usage, HBPP algorithms can always achieve better search quality than LSH-Forest with random pick (LSH-Forest with RP) and Multi-Probe LSH with random pick (Multi-Probe LSH with RP). Moreover, to achieve a very high search quality, Multi-Probe with HBPP is always a better choice than LSH-Forest with HBPP, regardless of the distribution, size and dimension number of the data set. / Dissertation/Thesis / M.S. Computer Science 2011

http://hdl.handle.net/2286/R.I.9178

locality sensitive hash

Identifer	oai:union.ndltd.org:asu.edu/item:9178
Date	January 2011
Contributors	Yu, Renwei (Author), Candan, Kasim S (Advisor), Sapino, Maria L (Committee member), Chen, Yi (Committee member), Sundaram, Hari (Committee member), Arizona State University (Publisher)
Source Sets	Arizona State University
Language	English
Detected Language	English
Type	Masters Thesis
Format	81 pages
Rights	http://rightsstatements.org/vocab/InC/1.0/, All Rights Reserved

Page generated in 0.002 seconds

Time Efficient and Quality Effective K Nearest Neighbor Search in High Dimension Space

Description

Links & Downloads

Tags

Additional Fields