71

Time Efficient and Quality Effective K Nearest Neighbor Search in High Dimension Space

January 2011 (has links)
abstract: K-Nearest-Neighbors (KNN) search is a fundamental problem in many application domains, such as databases and data mining, information retrieval, machine learning, pattern recognition, and plagiarism detection. Locality-sensitive hashing (LSH) is so far the most practical approximate KNN search approach for high-dimensional data. Algorithms such as Multi-Probe LSH and LSH-Forest improve upon the basic LSH algorithm by varying hash bucket size dynamically at query time, so they can answer different KNN queries adaptively. However, both require a data access post-processing step after candidate collection in order to produce the final answer to the KNN query. In this thesis, the Multi-Probe LSH with data access post-processing (Multi-Probe LSH with DAPP) algorithm and the LSH-Forest with data access post-processing (LSH-Forest with DAPP) algorithm are improved by replacing the costly data access post-processing (DAPP) step with a much faster histogram-based post-processing (HBPP). Two HBPP algorithms, LSH-Forest with HBPP and Multi-Probe LSH with HBPP, are presented in this thesis; both achieve the three goals for KNN search in large-scale, high-dimensional data sets: high search quality, high time efficiency, and high space efficiency. None of the previous KNN algorithms achieves all three. More specifically, it is shown that the HBPP algorithms always achieve high search quality (as good as LSH-Forest with DAPP and Multi-Probe LSH with DAPP) at much lower time cost (one to several orders of magnitude speedup) and the same memory usage. It is also shown that, at almost the same time cost and memory usage, the HBPP algorithms always achieve better search quality than LSH-Forest with random pick (LSH-Forest with RP) and Multi-Probe LSH with random pick (Multi-Probe LSH with RP).
Moreover, to achieve very high search quality, Multi-Probe LSH with HBPP is always a better choice than LSH-Forest with HBPP, regardless of the distribution, size, and dimensionality of the data set. / Dissertation/Thesis / M.S. Computer Science 2011
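The basic LSH pipeline that both improved algorithms build on can be sketched as follows: candidates are collected from hash buckets, then re-ranked by exact distance, which is the data access post-processing (DAPP) step this thesis replaces. This is an illustrative pure-Python sketch, not the thesis implementation; the signature scheme (random hyperplanes) and all parameters are assumptions.

```python
import math
import random

random.seed(0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lsh_signature(point, hyperplanes):
    # One bit per random hyperplane: which side of it the point falls on.
    return tuple(1 if dot(point, h) >= 0 else 0 for h in hyperplanes)

def build_index(points, num_bits, dim):
    # Hash every point into a bucket keyed by its bit signature.
    hyperplanes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
    buckets = {}
    for i, p in enumerate(points):
        buckets.setdefault(lsh_signature(p, hyperplanes), []).append(i)
    return hyperplanes, buckets

def knn_query(q, k, points, hyperplanes, buckets):
    # Collect candidates from the query's bucket, then re-rank them by
    # exact distance -- the costly DAPP step.
    candidates = buckets.get(lsh_signature(q, hyperplanes), [])
    by_dist = sorted(candidates, key=lambda i: math.dist(q, points[i]))
    return by_dist[:k]
```

Multi-Probe LSH additionally probes buckets whose signatures are close to the query's, which raises recall without enlarging the hash tables.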
72

Online hashing for fast similarity search

Cakir, Fatih 02 February 2018 (has links)
In this thesis, the problem of online adaptive hashing for fast similarity search is studied. Similarity search is a central problem in many computer vision applications. The ever-growing size of available data collections and the increasing usage of high-dimensional representations in describing data have increased the computational cost of performing similarity search, requiring search strategies that can explore such collections in an efficient and effective manner. One promising family of approaches is based on hashing, in which the goal is to map the data into the Hamming space where fast search mechanisms exist, while preserving the original neighborhood structure of the data. We first present a novel online hashing algorithm in which the hash mapping is updated in an iterative manner with streaming data. Being online, our method is amenable to variations of the data. Moreover, our formulation is orders of magnitude faster to train than state-of-the-art hashing solutions. Secondly, we propose an online supervised hashing framework in which the goal is to map data associated with similar labels to nearby binary representations. For this purpose, we utilize Error Correcting Output Codes (ECOCs) and consider an online boosting formulation in learning the hash mapping. Our formulation does not require any prior assumptions on the label space and is well-suited for expanding datasets that have new label inclusions. We also introduce a flexible framework that allows us to reduce hash table entry updates. This is critical, especially when frequent updates may occur as the hash table grows larger and larger. Thirdly, we propose a novel mutual information measure to efficiently infer the quality of a hash mapping and retrieval performance. This measure has lower complexity than standard retrieval metrics. 
With this measure, we first address a key challenge in online hashing that has often been ignored: the binary representations of the data must be recomputed to keep pace with updates to the hash mapping. Based on our novel mutual information measure, we propose an efficient quality measure for hash functions, and use it to determine when to update the hash table. Next, we show that this mutual information criterion can be used as an objective in learning hash functions, using gradient-based optimization. Experiments on image retrieval benchmarks confirm the effectiveness of our formulation, both in reducing hash table recomputations and in learning high-quality hash functions.
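The mutual information idea can be sketched outside the thesis's exact formulation: given binary codes, measure how much knowing an item's Hamming distance to a query tells us about whether the item is a true neighbor. The histogram-based estimator below is an illustrative pure-Python sketch, not the author's algorithm.

```python
import math
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def mi_quality(codes, query_code, neighbor_set):
    # Histogram Hamming distances to the query, split by whether each
    # item is a true neighbor, then compute I(neighbor; distance) in bits.
    joint = Counter()
    for i, c in enumerate(codes):
        joint[(i in neighbor_set, hamming(query_code, c))] += 1
    n = sum(joint.values())
    p_nb, p_d = Counter(), Counter()
    for (is_nb, d), cnt in joint.items():
        p_nb[is_nb] += cnt / n
        p_d[d] += cnt / n
    mi = 0.0
    for (is_nb, d), cnt in joint.items():
        p = cnt / n
        mi += p * math.log(p / (p_nb[is_nb] * p_d[d]), 2)
    return mi  # higher = the codes separate neighbors from non-neighbors better
```

A hash mapping that collapses neighbors and non-neighbors into the same distances scores zero; one that cleanly separates them scores high, which is what makes such a measure usable as a cheap proxy for retrieval quality.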
73

Detecção de desvios vocais utilizando modelos auto regressivos e o algoritmo KNN

Torres, Winnie de Lima 30 January 2018 (has links)
Some fields of science study vocal tract disorders by analyzing voice vibration patterns. The value of such research lies in identifying, at a more specific stage, diseases of greater or lesser severity: conditions that can be treated with voice therapy, or ones that require closer attention and may even call for surgical procedures. Although the literature already indicates that digital signal processing enables non-invasive diagnosis of laryngeal pathologies, such as vocal disorders that cause edema, nodules, and paralysis, there is no consensus on the most suitable method, or on the features and parameters best suited to detecting vocal deviations. This work therefore proposes an algorithm for detecting vocal deviations through voice signal analysis. The study used the Disordered Voice Database, developed by the Massachusetts Eye and Ear Infirmary (MEEI), which is widely used in acoustic voice research. A total of 166 signals from this database were used, comprising healthy voices and pathological voices affected by edema, nodules, and vocal fold paralysis.
From the voice signals, autoregressive models (AR and ARMA) were generated to represent the signals, and the parameters of the fitted models were fed to the K-Nearest Neighbors (KNN) algorithm to classify the analyzed signals. To assess the efficiency of the proposed algorithm, its results were compared with a detection method that considers only the Euclidean distance between the signals. The proposed method performs well, achieving a classification accuracy above 71% (versus 31% using the Euclidean distance alone). Moreover, the method is easy to implement and can run on simpler hardware. This research therefore has the potential to yield a cheap, accessible classifier for large-scale use by health professionals, as a non-invasive pre-screening alternative for detecting otorhinolaryngological pathologies that affect the voice.
74

Massively parallel nearest neighbors searches in dynamic point clouds on GPU

José Silva Leite, Pedro 31 January 2010 (has links)
This dissertation introduces a grid-based data structure implemented on the GPU. It was designed for massively parallel nearest neighbor searches in dynamic point clouds. The implementation performs in real time and runs entirely on the GPU, both the grid construction and the (exact and approximate) nearest neighbor searches, so memory transfers between host and device are minimized, improving overall performance. The proposed algorithm can be used in different applications with static or dynamic scenes. Moreover, the data structure supports three-dimensional point clouds and, given its dynamic nature, the user can change its parameters at run time, as well as the number of neighbors searched. A CPU reference implementation was also built, and performance comparisons justify the use of GPUs as massively parallel processors. In addition, the performance of the proposed data structure is compared with CPU and GPU implementations from previous work. Finally, a point-based rendering application was developed to demonstrate the potential of the data structure.
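A CPU sketch of the underlying idea, a uniform grid that limits neighbor searches to adjacent cells, might look like this. The GPU version parallelizes the same structure; the cell size and the one-ring search radius here are simplifying assumptions.

```python
import math
from collections import defaultdict

class UniformGrid:
    # CPU reference of a uniform grid for nearest neighbor queries:
    # points are hashed into cubic cells of side `cell`, and a query
    # only inspects its own cell plus the 26 adjacent ones.
    def __init__(self, points, cell):
        self.points = points
        self.cell = cell
        self.cells = defaultdict(list)
        for i, (x, y, z) in enumerate(points):
            self.cells[self.key(x, y, z)].append(i)

    def key(self, x, y, z):
        c = self.cell
        return (int(x // c), int(y // c), int(z // c))

    def nearest(self, q):
        kx, ky, kz = self.key(*q)
        best, best_d = None, float("inf")
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for i in self.cells.get((kx + dx, ky + dy, kz + dz), ()):
                        d = math.dist(q, self.points[i])
                        if d < best_d:
                            best, best_d = i, d
        # NOTE: if the true nearest neighbor lies more than one cell
        # away, a full implementation widens the search radius.
        return best
```

On a GPU, one thread handles each query while the grid cells give coalesced, bounded candidate lists, which is what avoids the tree-traversal divergence of kd-tree approaches.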
75

Data Collection, Analysis, and Classification for the Development of a Sailing Performance Evaluation System

Sammon, Ryan January 2013 (has links)
The work described in this thesis contributes to the development of a system for evaluating sailing performance, motivated by the lack of tools available for this purpose. The goal of the work presented is to detect and classify the turns of a sailing yacht. Data was collected using a BlackBerry PlayBook affixed to a J/24 sailing yacht and manually annotated with three types of turn: tack, gybe, and mark rounding. The annotated data was used to train classification methods. The classification methods tested were multi-layer perceptrons (MLPs) of two sizes in various committees, and nearest-neighbour search. The pre-processing algorithms tested were Kalman filtering, categorization using quantiles, and residual normalization. The best solution was found to be an averaged-answer committee of small MLPs, with Kalman filtering and residual normalization performed on the input as pre-processing.
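Of the pre-processing steps named above, Kalman filtering is the most involved; a minimal one-dimensional sketch (with assumed noise variances, not those used in the thesis) looks like this:

```python
def kalman_1d(measurements, q=1e-3, r=0.1):
    # Minimal 1-D Kalman filter with a constant-state model, of the kind
    # used to smooth noisy sensor readings before classification.
    # q: process noise variance (assumed), r: measurement noise variance (assumed).
    x, p = measurements[0], 1.0
    out = []
    for z in measurements:
        p += q                   # predict: uncertainty grows each step
        k = p / (p + r)          # Kalman gain: trust in the new reading
        x += k * (z - x)         # update the estimate toward the measurement
        p *= (1 - k)             # uncertainty shrinks after the update
        out.append(x)
    return out
```

For yacht heading data the same recursion would run per channel (heading, speed), trading a little lag for much lower noise in the features fed to the MLP committee.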
76

FREDDY

Günther, Michael 25 February 2020 (has links)
Word embeddings are useful in many tasks in Natural Language Processing and Information Retrieval, such as text mining and classification, sentiment analysis, sentence completion, or dictionary construction. Word2vec and its successor fastText, both well-known models for producing word embeddings, are powerful techniques for studying the syntactic and semantic relations between words by representing them as low-dimensional vectors. By applying algebraic operations on these vectors, semantic relationships such as word analogies, gender inflections, or geographical relationships can be easily recovered. The aim of this work is to investigate how word embeddings could be utilized to augment and enrich queries in DBMSs, e.g. to compare text values according to their semantic relation or to group rows according to the similarity of their text values. For this purpose, we use word embedding models pre-trained on large text corpora such as Wikipedia. By exploiting this external knowledge during query processing, we are able to apply inductive reasoning to text values, reducing the demand for explicit knowledge in database systems. In the context of the IMDB database schema, this allows, for example, querying movies that are semantically close to genres such as historical fiction or road movie without maintaining this information. Another example query is sketched in Listing 1; it returns the top-3 nearest neighbors (NN) of each movie in IMDB. Given the movie “Godfather” as input, this results in “Scarface”, “Goodfellas” and “Untouchables”.
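The analogy arithmetic described above can be illustrated with a toy example. The vectors below are hypothetical 3-dimensional stand-ins; real models such as word2vec use hundreds of dimensions learned from large corpora.

```python
import math

# Toy hand-made "embeddings" (hypothetical values, for illustration only).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.1],
    "woman": [0.5, 0.1, 0.9],
}

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    # vec(a) - vec(b) + vec(c), then nearest neighbor by cosine similarity,
    # excluding the three input words themselves.
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    rest = (w for w in emb if w not in (a, b, c))
    return max(rest, key=lambda w: cosine(emb[w], target))
```

The same cosine-similarity nearest-neighbor operation is what a DBMS integration like the one described would run per row during query processing.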
77

DISTRIBUTED NEAREST NEIGHBOR CLASSIFICATION WITH APPLICATIONS TO CROWDSOURCING

Jiexin Duan (11181162) 26 July 2021 (has links)
The aim of this dissertation is to study two problems of distributed nearest neighbor classification (DiNN) systematically. The first compares two DiNN classifiers based on different schemes: majority voting and weighted voting. The second extends the DiNN method to crowdsourcing, where each worker's data may have a different size and noisy labels due to low worker quality. Both statistical guarantees and numerical comparisons are studied in depth.

The first part of the dissertation focuses on distributed nearest neighbor classification in big data. The sheer volume and spatial/temporal disparity of big data may prohibit processing and storing the data centrally. This imposes a considerable hurdle for nearest neighbor prediction, since the entire training set must be memorized. One effective way to overcome this issue is the distributed learning framework. Through majority voting, the distributed nearest neighbor classifier achieves the same rate of convergence as its oracle version in terms of regret, up to a multiplicative constant that depends solely on the data dimension. This multiplicative difference can be eliminated by replacing majority voting with a weighted voting scheme. In addition, we provide sharp theoretical upper bounds on the number of subsamples needed for the distributed nearest neighbor classifier to reach the optimal convergence rate. Interestingly, the weighted voting scheme allows a larger number of subsamples than majority voting.

The second part of the dissertation extends the DiNN methods to crowdsourcing. Noisy labels in crowdsourcing data and differing worker data sizes deteriorate the performance of DiNN methods. We propose an enhanced nearest neighbor classifier (ENN) to overcome this issue.
Our proposed method achieves the same regret as its oracle version on expert data of the same size. We also propose two algorithms to estimate worker quality when it is unknown in practice. One constructs worker-quality estimators from denoised worker labels obtained by applying a kNN classifier to the expert data; unlike previous worker-quality estimation methods, which have no statistical guarantee, it achieves the same regret as ENN with observed worker quality. The other estimates worker quality iteratively based on ENN, and works well without the expert data required by most previous methods.
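The contrast between the two aggregation schemes can be sketched as follows. This is an illustrative toy, not the dissertation's estimators; in particular, the "weighted" variant below simply averages each machine's local class frequencies, one simple form of soft voting.

```python
import math
import random
from collections import Counter

random.seed(0)

def local_knn(sample, query, k):
    # One machine's k nearest neighbors from its own subsample.
    order = sorted(sample, key=lambda item: math.dist(item[0], query))
    return [label for _, label in order[:k]]

def dinn_majority(subsamples, query, k):
    # Majority voting: each machine casts one hard label; the final
    # prediction is the most common local prediction.
    votes = Counter(Counter(local_knn(s, query, k)).most_common(1)[0][0]
                    for s in subsamples)
    return votes.most_common(1)[0][0]

def dinn_weighted(subsamples, query, k):
    # Soft voting: average local class frequencies across machines
    # instead of collapsing each machine to a single hard label.
    totals = Counter()
    for s in subsamples:
        for label in local_knn(s, query, k):
            totals[label] += 1 / (k * len(subsamples))
    return totals.most_common(1)[0][0]
```

The soft scheme keeps per-machine confidence information that hard majority voting discards, which is intuitively why it tolerates more subsamples before the rate degrades.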
78

Distance to the Border in Spatial Point Patterns

Joyner, Michele, Ross, Chelsea, Seier, Edith 01 November 2013 (has links)
The analysis of spatial point patterns is commonly focused on the distances to the nearest neighbor. The distance of organisms to the edge of the enclosure is also of interest in some biological studies performed in the laboratory. We define the B (border) function and derive its shape assuming complete spatial randomness (CSR) for square, rectangular, circular, and some three-dimensional arenas. The idea is then extended outside the laboratory setting to work with maps and points located in geographical regions. Commands in R (R Core Team, 2012) to calculate and plot the empirical B̂ function are included. The B function, based on distances to the nearest edge, in addition to the G function, based on distances to the nearest neighbor, contributes to the understanding of the spatial distribution of the points.
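In a unit-square arena under CSR, a point lies farther than r from every edge exactly when it falls inside the inner square of side 1 - 2r, so B(r) = 1 - (1 - 2r)^2 for r ≤ 1/2. A sketch of the empirical version (in Python here rather than the paper's R) is:

```python
import random

random.seed(0)

def border_distance(x, y, width=1.0, height=1.0):
    # Distance from a point to the nearest edge of a rectangular arena.
    return min(x, width - x, y, height - y)

def empirical_b(points, r, width=1.0, height=1.0):
    # Empirical B-hat(r): fraction of points within distance r of the border.
    hits = sum(1 for x, y in points
               if border_distance(x, y, width, height) <= r)
    return hits / len(points)
```

Comparing the empirical curve against the CSR formula reveals attraction to or avoidance of the border, just as the G function does for clustering among the points themselves.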
80

Methods for Efficient Synthesis of Large Reversible Binary and Ternary Quantum Circuits and Applications of Linear Nearest Neighbor Model

Hawash, Maher Mofeid 30 May 2013 (has links)
This dissertation describes the development of automated synthesis algorithms that construct reversible quantum circuits for reversible functions with a large number of variables. Specifically, the research is focused on reversible, permutative, fully specified binary and ternary specifications, and on the applicability of the resulting circuits to the physical limitations of existing quantum technologies. Automated synthesis of arbitrary reversible specifications is an NP-hard, multiobjective optimization problem, where 1) the amount of time and computational resources required to synthesize the specification, 2) the number of primitive quantum gates in the resulting circuit (quantum cost), and 3) the number of ancillary qubits (variables added to hold intermediate calculations) are all minimized, while 4) the number of variables is maximized. Some of the existing algorithms in the literature ignore objective 2 by synthesizing a single solution without adding any ancillary qubits, while others attempt to explore every possible solution in the search space in an effort to discover the optimal solution (sacrificing objectives 1 and 4). Other algorithms resort to adding a huge number of ancillary qubits (counter to objective 3) in an effort to minimize the number of primitive gates (objective 2). In this dissertation, I first introduce the MMDSN algorithm, which is capable of synthesizing binary specifications of up to 30 variables, adds no ancillary variables, produces better quantum cost (8-50% improvement) than algorithms that limit their search to a single solution, and does so in a minimal amount of time compared to algorithms that perform exhaustive search (seconds vs. hours). The MMDSN algorithm introduces an innovative method of using the Hasse diagram to construct candidate solutions that are guaranteed to be valid, and then selects the solution with the minimal quantum cost from this subset.
I then introduce the Covered Set Partitions (CSP) algorithm, which expands the search space of valid candidate solutions and allows for exploring solutions outside the range of MMDSN. I show a method of subdividing the expansive search landscape into smaller partitions and demonstrate the benefit of focusing on partition sizes around half the number of variables (15% to 25% improvement over MMDSN for functions with fewer than 12 variables, and more than 1000% improvement for functions with 12 and 13 variables). For a function of n variables, the CSP algorithm theoretically requires n times more time to synthesize; however, by focusing on the middle partition sizes, it discovers solutions missed by MMDSN, typically with lower quantum cost. I also show that using a Tabu search to select the next candidate from the CSP subset results in discovering solutions with even lower quantum costs (up to 10% improvement over CSP with random selection). In Chapters 9 and 10 I question the predominant methods of measuring quantum cost and their applicability to physical implementations of quantum gates and circuits. I counter the prevailing literature by introducing a new standard for measuring the performance of quantum synthesis algorithms: enforcing the Linear Nearest Neighbor Model (LNNM) constraint, which is imposed by today's leading implementations of quantum technology. In addition to enforcing physical constraints, the new LNNM quantum cost (LNNQC) allows for a level comparison among all methods of synthesis, specifically between methods that add a large number of ancillary variables and ones that add none. I show that, when LNNM is enforced, the quantum cost for methods that add a large number of ancillary qubits increases significantly (up to 1200%). I also extend the Hasse-based method to ternary logic and demonstrate synthesis of specifications of up to 9 ternary variables (compared to 3 ternary variables in the existing literature).
I introduce the concept of ternary precedence order and its implications for the construction of the Hasse diagram and of valid candidate solutions. I also provide a case study comparing the performance of ternary logic synthesis of large functions using both a CUDA graphics processor with 1024 cores and an Intel i7 processor with 8 cores. In the process of exploring large ternary functions, I introduce to the literature eight families of ternary benchmark functions, along with a multiple-valued file specification (the Extended Quantum Specification, XQS). I also introduce a new composite quantum gate, the multiple-valued Swivel gate, which swaps the information of qubits around a centrally located pivot point. In summary, my research objectives are as follows:
* Explore and create automated synthesis algorithms for reversible circuits, in both binary and ternary logic, for large numbers of variables.
* Study the impact of enforcing the Linear Nearest Neighbor Model (LNNM) constraint on every interaction between qubits for reversible binary specifications.
* Advocate for a revised metric for measuring the cost of a quantum circuit in accordance with LNNM, which on one hand provides a balanced comparison between the various flavors of algorithms, and on the other represents a realistic cost of a quantum circuit with respect to an ion-trap implementation.
* Establish an open source repository for sharing results, software code, and publications with the scientific community.
With the dwindling expectations for a new lifeline for silicon-based technologies, quantum computation has the potential to become the future workhorse of computation. Similar to the automated CAD tools of classical logic, my work lays the foundation for automated tools that construct quantum circuits from reversible specifications.
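The LNNM constraint can be made concrete with a small cost model: on a linear architecture, a two-qubit gate may only act on adjacent wires, so non-adjacent operands must first be brought together by chains of adjacent SWAP gates. The greedy counter below is illustrative only, not the LNNQC metric defined in the dissertation.

```python
def lnn_swap_count(num_qubits, gates):
    # Count SWAP gates needed to run a list of 2-qubit gates on a linear
    # nearest neighbor (LNN) line: before each gate, the two operand
    # qubits are brought adjacent by adjacent SWAPs (a simple greedy
    # model; real LNN mappers optimize swap placement globally).
    pos = list(range(num_qubits))   # pos[logical qubit] = wire index
    wire = list(range(num_qubits))  # wire[index] = logical qubit
    total = 0
    for a, b in gates:
        lo, hi = sorted((pos[a], pos[b]))
        total += hi - lo - 1        # SWAPs needed to make the pair adjacent
        # Slide the qubit on wire `hi` left until it sits at `lo + 1`.
        for p in range(hi, lo + 1, -1):
            u, v = wire[p], wire[p - 1]
            wire[p], wire[p - 1] = v, u
            pos[u], pos[v] = p - 1, p
        # The gate itself now acts on adjacent wires lo and lo + 1.
    return total
```

Because the permutation of qubits persists across gates, circuits with many long-range interactions pay repeatedly, which is why enforcing LNNM inflates the cost of ancilla-heavy synthesis methods so sharply.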
