101
Time Efficient and Quality Effective K Nearest Neighbor Search in High Dimension Space (January 2011)
K-Nearest-Neighbors (KNN) search is a fundamental problem in many application domains, such as databases and data mining, information retrieval, machine learning, pattern recognition, and plagiarism detection. Locality-sensitive hashing (LSH) is so far the most practical approximate KNN search algorithm for high-dimensional data. Algorithms such as Multi-Probe LSH and LSH-Forest improve upon the basic LSH algorithm by varying hash bucket size dynamically at query time, so they can answer different KNN queries adaptively. However, both algorithms need a data access post-processing step after candidate collection in order to produce the final answer to the KNN query. In this thesis, the Multi-Probe LSH with data access post-processing (Multi-Probe LSH with DAPP) algorithm and the LSH-Forest with data access post-processing (LSH-Forest with DAPP) algorithm are improved by replacing the costly data access post-processing (DAPP) step with a much faster histogram-based post-processing (HBPP). Two HBPP algorithms, LSH-Forest with HBPP and Multi-Probe LSH with HBPP, are presented in this thesis; both achieve the three goals for KNN search in large-scale, high-dimensional data sets: high search quality, high time efficiency, and high space efficiency. No previous KNN algorithm achieves all three goals. More specifically, it is shown that the HBPP algorithms always achieve high search quality (as good as LSH-Forest with DAPP and Multi-Probe LSH with DAPP) at much lower time cost (one to several orders of magnitude speedup) and the same memory usage. It is also shown that, at almost the same time cost and memory usage, the HBPP algorithms always achieve better search quality than LSH-Forest with random pick (LSH-Forest with RP) and Multi-Probe LSH with random pick (Multi-Probe LSH with RP). Moreover, to achieve very high search quality, Multi-Probe LSH with HBPP is always a better choice than LSH-Forest with HBPP, regardless of the distribution, size, and dimensionality of the data set. / Dissertation/Thesis / M.S. Computer Science 2011
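To make the post-processing step concrete, the sketch below shows a minimal random-projection LSH index in Python: candidates are collected from hash buckets, and a final data-access step ranks them by exact distance. This illustrates the generic DAPP pattern the thesis improves on, not the thesis's HBPP algorithm; all names and parameters are our own.

```python
import numpy as np
from collections import defaultdict

class SimpleLSH:
    """Minimal random-projection LSH index (illustration only)."""
    def __init__(self, dim, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.buckets = defaultdict(list)

    def _key(self, x):
        # One bit per hyperplane: which side of the plane x falls on.
        return tuple((self.planes @ x > 0).astype(int))

    def index(self, data):
        self.data = data
        for i, x in enumerate(data):
            self.buckets[self._key(x)].append(i)

    def query(self, q, k=5):
        candidates = self.buckets[self._key(q)]
        # Data access post-processing (DAPP): fetch every candidate
        # vector and rank it by exact distance -- the costly step that
        # histogram-based post-processing (HBPP) replaces.
        dists = [(np.linalg.norm(self.data[i] - q), i) for i in candidates]
        return [i for _, i in sorted(dists)[:k]]

data = np.random.default_rng(1).normal(size=(10_000, 64))
lsh = SimpleLSH(dim=64)
lsh.index(data)
print(lsh.query(data[0], k=5))
```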
102
Online hashing for fast similarity search / Cakir, Fatih (02 February 2018)
In this thesis, the problem of online adaptive hashing for fast similarity search is studied. Similarity search is a central problem in many computer vision applications. The ever-growing size of available data collections and the increasing usage of high-dimensional representations in describing data have increased the computational cost of performing similarity search, requiring search strategies that can explore such collections in an efficient and effective manner. One promising family of approaches is based on hashing, in which the goal is to map the data into the Hamming space, where fast search mechanisms exist, while preserving the original neighborhood structure of the data. We first present a novel online hashing algorithm in which the hash mapping is updated in an iterative manner with streaming data. Being online, our method adapts to variations in the data. Moreover, our formulation is orders of magnitude faster to train than state-of-the-art hashing solutions. Secondly, we propose an online supervised hashing framework in which the goal is to map data associated with similar labels to nearby binary representations. For this purpose, we utilize Error Correcting Output Codes (ECOCs) and consider an online boosting formulation in learning the hash mapping. Our formulation does not require any prior assumptions on the label space and is well-suited for expanding datasets that receive new label inclusions. We also introduce a flexible framework that allows us to reduce hash table entry updates; this is critical, especially since frequent updates can occur as the hash table grows larger. Thirdly, we propose a novel mutual information measure to efficiently infer the quality of a hash mapping and its retrieval performance. This measure has lower complexity than standard retrieval metrics. With this measure, we first address a key challenge in online hashing that has often been ignored: the binary representations of the data must be recomputed to keep pace with updates to the hash mapping. Based on our novel mutual information measure, we propose an efficient quality measure for hash functions and use it to determine when to update the hash table. Next, we show that this mutual information criterion can be used as an objective in learning hash functions, using gradient-based optimization. Experiments on image retrieval benchmarks confirm the effectiveness of our formulation, both in reducing hash table recomputations and in learning high-quality hash functions.
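As a quick illustration of why the Hamming space enables fast search (this is generic binary hashing, not the thesis's online method), the sketch below maps vectors to binary codes with random hyperplanes and ranks by Hamming distance using bitwise operations; all names and sizes are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 128))   # database vectors
planes = rng.normal(size=(64, 128))   # 64 random hyperplanes -> 64-bit codes

def to_codes(V):
    # Sign of each projection gives one bit; pack 8 bits per byte.
    bits = (V @ planes.T > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)  # shape (n, 8), dtype uint8

codes = to_codes(X)

def hamming_search(q, k=10):
    qc = to_codes(q[None, :])
    # XOR then popcount: the number of differing bits is the Hamming distance.
    diff = np.unpackbits(np.bitwise_xor(codes, qc), axis=1)
    dists = diff.sum(axis=1)
    return np.argsort(dists)[:k]

print(hamming_search(X[42], k=5))     # should rank index 42 first
```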
103
Classification of Twitter disaster data using a hybrid feature-instance adaptation approach / Mazloom, Reza (January 1900)
Master of Science / Department of Computer Science / Doina Caragea / Huge amounts of data that are generated on social media during emergency situations are regarded as troves of critical information. The use of supervised machine learning techniques in the early stages of a disaster is challenged by the lack of labeled data for that particular disaster. Furthermore, supervised models trained on labeled data from a prior disaster may not produce accurate results.
To address these challenges, domain adaptation approaches can be used; these learn models for predicting the target by using unlabeled data from the target disaster in addition to labeled data from prior source disasters. However, the resulting models can still be affected by the variance between the target domain and the source domain.
In this context, we propose a hybrid feature-instance adaptation approach based on matrix factorization and the k-nearest neighbors algorithm, respectively. The proposed hybrid adaptation approach is used to select a subset of the source disaster data that is representative of the target disaster. The selected subset is subsequently used to learn accurate supervised or domain adaptation Naïve Bayes classifiers for the target disaster. In other words, this study focuses on transforming the existing source data to bring it closer to the target data, thus overcoming the domain variance which may prevent effective transfer of information from source to target. A combination of selective and transformative methods is used on instances and features, respectively. We show experimentally that the proposed approaches are effective in transferring information from source to target. Furthermore, we provide insights with respect to what types and combinations of selections/transformations result in more accurate models for the target.
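A minimal sketch of the general idea (matrix factorization for a shared feature space, then k-nearest-neighbor-based instance selection), under our own assumptions about the pipeline rather than the thesis's exact procedure:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import MultinomialNB

# Hypothetical inputs: bag-of-words matrices for source (labeled)
# and target (unlabeled) disaster tweets.
rng = np.random.default_rng(0)
X_src = rng.integers(0, 3, size=(500, 1000)).astype(float)
y_src = rng.integers(0, 2, size=500)
X_tgt = rng.integers(0, 3, size=(300, 1000)).astype(float)

# Feature adaptation: factorize source and target jointly so both are
# represented in the same low-dimensional latent space.
nmf = NMF(n_components=50, init="nndsvda", max_iter=400, random_state=0)
Z = nmf.fit_transform(np.vstack([X_src, X_tgt]))
Z_src, Z_tgt = Z[:500], Z[500:]

# Instance adaptation: keep only source instances that appear among the
# nearest neighbors of some target instance in the latent space.
nn = NearestNeighbors(n_neighbors=5).fit(Z_src)
_, idx = nn.kneighbors(Z_tgt)
selected = np.unique(idx.ravel())

# Train a Naive Bayes classifier on the selected source subset.
clf = MultinomialNB().fit(X_src[selected], y_src[selected])
print(f"kept {len(selected)} of {len(X_src)} source instances")
```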
104
Detection of vocal deviations using autoregressive models and the KNN algorithm (Detecção de desvios vocais utilizando modelos auto regressivos e o algoritmo KNN) / Torres, Winnie de Lima (30 January 2018)
Some fields of science study vocal tract disorders by analyzing voice vibration patterns. The value of this research lies in identifying, at a more specific stage, diseases of greater or lesser severity: conditions that can be treated with voice therapy, or that require closer attention and may even demand surgical procedures for their control. Although the literature already indicates that digital signal processing allows a non-invasive diagnosis of laryngeal pathologies, such as vocal disorders that cause edema, nodules, and paralysis, there is no consensus on the most suitable method, or on the most appropriate features and parameters, for detecting vocal deviations. This work therefore proposes an algorithm for detecting vocal deviations through voice signal analysis. The study uses the Disordered Voice Database, developed by the Massachusetts Eye and Ear Infirmary (MEEI), which is widely used in voice acoustics research. A total of 166 signals from this database were used, including healthy voices and pathological voices affected by edema, nodules, and vocal fold paralysis. Autoregressive models (AR and ARMA) were fitted to the voice signals, and the obtained model parameters were fed to the K-Nearest Neighbors (KNN) algorithm to classify the signals. To assess the efficiency of the proposed algorithm, its results were compared with a detection method that considers only the Euclidean distance between signals. The results show that the proposed method performs well, with a classification accuracy above 71% (compared with 31% using the Euclidean distance alone). Moreover, the method is easy to implement and can run on simple hardware. This research therefore has the potential to yield a cheap, accessible classifier for large-scale use by health care professionals, as a non-invasive pre-screening alternative for detecting otorhinolaryngological pathologies that affect the voice.
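A minimal sketch of the classification pipeline described above (AR coefficients as features, KNN as the classifier), with synthetic signals standing in for the MEEI recordings and all parameter choices our own:

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def ar_features(signal, order=10):
    """AR model coefficients used as the feature vector for a signal."""
    res = AutoReg(signal, lags=order).fit()
    return res.params[1:]  # drop the intercept, keep the lag coefficients

# Synthetic stand-ins for the 166 MEEI recordings: two classes with
# different spectral content ("healthy" vs "pathological").
rng = np.random.default_rng(0)
def make_signal(freq, n=4000, fs=8000):
    t = np.arange(n) / fs
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=n)

freqs = np.concatenate([rng.uniform(100, 150, 40), rng.uniform(220, 280, 40)])
X = np.array([ar_features(make_signal(f)) for f in freqs])
y = np.array([0] * 40 + [1] * 40)

knn = KNeighborsClassifier(n_neighbors=3)
print("5-fold CV accuracy:", cross_val_score(knn, X, y, cv=5).mean())
```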
105
Massively parallel nearest neighbors searches in dynamic point clouds on GPU / José Silva Leite, Pedro (31 January 2010)
Conselho Nacional de Desenvolvimento Científico e Tecnológico / This dissertation introduces a grid-based data structure implemented on the GPU. It was developed for massively parallel nearest neighbor searches in dynamic point clouds. The implementation runs in real time entirely on the GPU, covering both grid construction and (exact and approximate) nearest neighbor searches. Memory transfers between host and device are thus minimized, improving overall performance. The proposed algorithm can be used in different applications with static or dynamic scenes. Moreover, the data structure supports three-dimensional point clouds and, given its dynamic nature, the user can change its parameters at runtime; the same applies to the number of neighbors searched. A CPU reference implementation was written, and performance comparisons justify the use of GPUs as massively parallel processors. In addition, the performance of the proposed data structure is compared with CPU and GPU implementations from previous work. Finally, a point-based rendering application was developed to verify the potential of the data structure.
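A CPU-side Python sketch of the uniform-grid idea behind such a structure (the dissertation implements both construction and queries on the GPU; the cell size, API, and the 27-cell candidate search here are our own assumptions):

```python
import numpy as np
from collections import defaultdict

class UniformGrid:
    """Uniform grid over a 3D point cloud for neighbor queries."""
    def __init__(self, points, cell_size):
        self.points = points
        self.cell = cell_size
        self.grid = defaultdict(list)
        for i, p in enumerate(points):
            self.grid[tuple((p // cell_size).astype(int))].append(i)

    def nearest(self, q, k=8):
        cx, cy, cz = (q // self.cell).astype(int)
        # Gather candidates from the query cell and its 26 neighbors.
        cand = [i for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  for dz in (-1, 0, 1)
                  for i in self.grid.get((cx + dx, cy + dy, cz + dz), [])]
        d = np.linalg.norm(self.points[cand] - q, axis=1)
        return [cand[j] for j in np.argsort(d)[:k]]

pts = np.random.default_rng(0).uniform(0, 10, size=(50_000, 3))
grid = UniformGrid(pts, cell_size=0.5)
print(grid.nearest(pts[0], k=5))
```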
106
Pattern Recognition applied to Continuous Integration system / Vangala, Shivakanthreddy (January 2018)
Context: This thesis focuses on regression testing in the continuous integration environment, i.e., integration testing that ensures that changes made in new development code do not introduce new faults into the software product. Continuous integration is a software development practice that integrates all development, testing, and deployment activities. In continuous integration, regression testing is done by manually selecting and prioritizing test cases from a larger set of test cases. The main challenge with manual test case selection and prioritization is that needed test cases are sometimes left out of the selected subset, because testers did not include them when designing the hourly regression test suite for a particular feature development in the product. Ericsson, the company in whose environment this thesis was conducted, therefore aims at improving its test case selection and prioritization in regression testing using pattern recognition. Objectives: This thesis proposes prediction models, built with pattern recognition algorithms, for predicting future test case failures from historical data. This helps improve the quality of the continuous integration environment by selecting an appropriate subset of test cases from the larger set for regression testing. Several candidate pattern recognition algorithms are promising for predicting test case failures. Based on the characteristics of the data collected at Ericsson, suitable pattern recognition algorithms were selected and predictive models were built. Finally, two predictive models were evaluated and the best-performing model was integrated into the continuous integration system. Methods: The experiment research method was chosen for this research because the discovery of cause-and-effect relationships between dependent and independent variables can be used to evaluate the predictive models. The experiments were conducted in RStudio, which facilitates training the predictive models on continuous integration historical data. The predictive ability of the algorithms was evaluated using prediction accuracy metrics. Results: After implementing two predictive models (neural networks and k-nearest neighbors) on the continuous integration data, neural networks achieved a prediction accuracy of 75.3%, while k-nearest neighbors achieved 67.75%. Conclusions: This research investigated the feasibility of an adaptive, self-learning test machinery based on pattern recognition in a continuous integration environment, to improve test case selection and prioritization in regression testing. Neural networks proved more effective at predicting test case failures, at 75.3%, than k-nearest neighbors. A predictive model can make continuous integration fully efficient only if it has 100% prediction capability; at 75.3%, the model does not yet make the continuous integration system more efficient than the present static test case selection and prioritization, since it misses the remaining 25%. This research can therefore only conclude that neural networks at present have 75.3% prediction capability; in the future, as more data becomes available, this may approach 100%. The present Ericsson continuous integration system also needs to improve its storage of historical data: at present it can only store 30 days, while the predictive models require large amounts of data to give good predictions.
To support continuous integration, Ericsson currently uses the Jenkins automation server; other automation servers such as TeamCity, Travis CI, GoCD, and CircleCI can store data for more than 30 days, and using them would mitigate the data storage problem.
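A minimal sketch of such a failure-prediction setup (scikit-learn rather than the thesis's RStudio environment; the per-test-case features are hypothetical stand-ins for Ericsson's CI history):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical per-test-case features from CI history: recent failure
# rate, days since last failure, number of linked code changes, runtime.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("neural network",
                     MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                   random_state=0)),
                    ("k-nearest neighbors",
                     KNeighborsClassifier(n_neighbors=5))]:
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {acc:.3f}")

# Test cases predicted to fail would be prioritized in the next regression run.
```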
107
Automatic Pain Assessment from Infants’ Crying Sounds / Pai, Chih-Yun (01 November 2016)
Crying is the means infants utilize to express their emotional state. It provides parents and nurses a criterion for understanding an infant's physiological state. Many researchers have analyzed infants' crying sounds to diagnose specific diseases or determine the reasons for crying. This thesis presents an automatic crying level assessment system that classifies infants' crying sounds, recorded under realistic conditions in the Neonatal Intensive Care Unit (NICU), as whimpering or vigorous crying. To analyze the crying signal, Welch's method and Linear Predictive Coding (LPC) are used to extract spectral features; the average and standard deviation of the frequency signal and the maximum power spectral density are the other spectral features used in classification. For classification, three state-of-the-art classifiers, namely K-Nearest Neighbors, Random Forests, and Least Squares Support Vector Machine, are tested in this work. The highest accuracy in classifying whimpering and vigorous crying on the clean dataset is 90%, obtained with segments sampled 10 seconds before and 5 seconds after scoring, and with K-Nearest Neighbors as the classifier.
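A sketch of that feature extraction step (Welch PSD statistics plus LPC coefficients), with a synthetic signal standing in for a cry recording and all parameter choices our own:

```python
import numpy as np
from scipy.signal import welch
from scipy.linalg import solve_toeplitz

def extract_features(audio, fs=8000, lpc_order=8):
    """Spectral statistics via Welch's method plus LPC coefficients."""
    freqs, psd = welch(audio, fs=fs, nperseg=512)
    spectral = [psd.mean(), psd.std(), psd.max()]

    # LPC via the autocorrelation (Yule-Walker) method.
    s = audio - audio.mean()
    r = np.array([np.dot(s[:s.size - k], s[k:]) for k in range(lpc_order + 1)])
    lpc = solve_toeplitz(r[:lpc_order], r[1:lpc_order + 1])

    return np.concatenate([spectral, lpc])

# Synthetic stand-in for a 2-second cry segment with amplitude modulation.
rng = np.random.default_rng(0)
t = np.arange(16_000) / 8000
cry = np.sin(2 * np.pi * 450 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
cry += 0.1 * rng.normal(size=t.size)

features = extract_features(cry)
print(features.shape)  # one (3 + lpc_order)-dimensional vector per segment
```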
108
Data Collection, Analysis, and Classification for the Development of a Sailing Performance Evaluation System / Sammon, Ryan (January 2013)
The work described in this thesis contributes to the development of a system to evaluate sailing performance. This work was motivated by the lack of tools available to evaluate sailing performance. The goal of the work presented is to detect and classify the turns of a sailing yacht. Data was collected using a BlackBerry PlayBook affixed to a J/24 sailing yacht. This data was manually annotated with three types of turn: tack, gybe, and mark rounding. This manually annotated data was used to train classification methods. Classification methods tested were multi-layer perceptrons (MLPs) of two sizes in various committees and nearest-neighbour search. Pre-processing algorithms tested were Kalman filtering, categorization using quantiles, and residual normalization. The best solution was found to be an averaged answer committee of small MLPs, with Kalman filtering and residual normalization performed on the input as pre-processing.
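A sketch of the shape of that winning configuration (an averaged-answer committee of small MLPs; the sensor features, sizes, and labels here are our own stand-ins, and the Kalman/residual pre-processing is omitted):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical windowed sensor features (e.g., heading, heel, and speed
# statistics) and turn labels: 0 = tack, 1 = gybe, 2 = mark rounding.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))
y = rng.integers(0, 3, size=600)

# Averaged-answer committee: train several small MLPs with different
# seeds and average their predicted class probabilities.
committee = [MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000,
                           random_state=s).fit(X, y) for s in range(5)]

def committee_predict(X_new):
    probs = np.mean([m.predict_proba(X_new) for m in committee], axis=0)
    return probs.argmax(axis=1)

print(committee_predict(X[:5]))
```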
109
Automating the Characterization and Detection of Software Performance Antipatterns Using a Data-Driven Approach / Chalawadi, Ram Kishan (January 2021)
Background: With the increase in automating performance testing strategies, many efforts have been made to detect Software Performance Antipatterns (SPAs). These performance antipatterns have become a major threat to software platforms at the enterprise level, and detecting these anomalies is essential in any company dealing with performance-sensitive software, as the detection process must be performed quite often. Due to the complexity of the process, manual identification of performance issues has become challenging and time-consuming. Objectives: The thesis aims to address and solve the issues mentioned above by developing a tool that automatically characterizes and detects Software Performance Antipatterns. The goal is to automate the parameterization process of the existing approach that helps characterize SPAs, and to improve the interpretation of SPA detection results. These two processes are integrated into a tool designed to be deployed in the CI/CD pipeline. The developed tool is named Chanterelle. Methods: A case study and a survey were used in this research. The case study was conducted at Ericsson. A process similar to the existing approach was automated using Python. A literature review was conducted to identify an appropriate approach to improve the interpretation of SPA detection. A static user validation was conducted with the help of a survey consisting of questions on Chanterelle's feasibility and usability; the responses were provided by Ericsson staff (developers and testers in the field of software performance) after the tool was presented. Results: The results indicate that the automated parameterization and detection process proposed in this thesis has a competitive execution time compared to the existing approaches and helps developers interpret the detection results easily. Moreover, it does not require domain experts to run the tests. The results of the static user validation show that Chanterelle is feasible and usable as a tool for developers. Conclusions: The validation of the tool suggests that Chanterelle helps developers interpret performance-related bugs easily. It performs the automated parameterization and detection process in a competitive time compared with the existing approaches.
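As a generic illustration of data-driven SPA detection (not Chanterelle's actual algorithm), the sketch below flags a "ramp"-style antipattern by testing whether response times trend upward over a run; the thresholds, names, and synthetic data are all our own assumptions:

```python
import numpy as np
from scipy.stats import linregress

def detect_ramp(timestamps, response_times, slope_threshold=0.05):
    """Flag a ramp antipattern: response time steadily growing over time."""
    result = linregress(timestamps, response_times)
    growing = result.slope > slope_threshold
    significant = result.pvalue < 0.01
    return bool(growing and significant), result.slope

# Synthetic run: response time creeps up as the test progresses.
rng = np.random.default_rng(0)
t = np.arange(0.0, 600.0, 1.0)                           # seconds into the run
rt = 120 + 0.08 * t + rng.normal(scale=5, size=t.size)   # response time in ms

flagged, slope = detect_ramp(t, rt)
print(f"ramp detected: {flagged}, slope = {slope:.3f} ms/s")
```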
110
FREDDY / Günther, Michael (25 February 2020)
Word embeddings are useful in many tasks in Natural Language Processing and Information Retrieval, such as text mining and classification, sentiment analysis, sentence completion, or dictionary construction. Word2vec and its successor fastText, both well-known models for producing word embeddings, are powerful techniques for studying the syntactic and semantic relations between words by representing them in low-dimensional vectors. By applying algebraic operations on these vectors, semantic relationships such as word analogies, gender inflections, or geographical relationships can easily be recovered. The aim of this work is to investigate how word embeddings could be utilized to augment and enrich queries in DBMSs, e.g. to compare text values according to their semantic relation or to group rows according to the similarity of their text values. For this purpose, we use word embedding models pre-trained on large text corpora such as Wikipedia. By exploiting this external knowledge during query processing, we are able to apply inductive reasoning to text values, thereby reducing the demand for explicit knowledge in database systems. In the context of the IMDB database schema, this allows, for example, querying movies that are semantically close to genres such as historical fiction or road movie without maintaining this information. Another example query, sketched in Listing 1 (not reproduced here), returns the top-3 nearest neighbors (NN) of each movie in IMDB; given the movie “Godfather” as input, this results in “Scarface”, “Goodfellas”, and “Untouchables”.
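A hedged Python sketch of the nearest-neighbor lookup that such a query would delegate to, using gensim over pre-trained vectors (the actual Listing 1 uses the system's SQL extensions; the file path and token here are hypothetical):

```python
from gensim.models import KeyedVectors

# Assumes a local word2vec-format file of pre-trained embeddings,
# e.g. vectors trained on Wikipedia (the path is hypothetical).
vectors = KeyedVectors.load_word2vec_format("wiki-vectors.bin", binary=True)

# Top-3 nearest neighbors of a movie title token in embedding space --
# the per-row operation a query like Listing 1 performs.
for word, score in vectors.most_similar("godfather", topn=3):
    print(f"{word}: {score:.3f}")
```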