Global ETD Search

171	Soluções aproximadas para algoritmos escaláveis de mineração de dados em domínios de dados complexos usando GPGPU / On approximate solutions to scalable data mining algorithms for complex data problems using GPGPU Alexander Victor Ocsa Mamani 22 September 2011 (has links) A crescente disponibilidade de dados em diferentes domínios tem motivado o desenvolvimento de técnicas para descoberta de conhecimento em grandes volumes de dados complexos. Trabalhos recentes mostram que a busca em dados complexos é um campo de pesquisa importante, já que muitas tarefas de mineração de dados, como classificação, detecção de agrupamentos e descoberta de motifs, dependem de algoritmos de busca ao vizinho mais próximo. Para resolver o problema da busca dos vizinhos mais próximos em domínios complexos muitas abordagens determinísticas têm sido propostas com o objetivo de reduzir os efeitos da maldição da alta dimensionalidade. Por outro lado, algoritmos probabilísticos têm sido pouco explorados. Técnicas recentes relaxam a precisão dos resultados a fim de reduzir o custo computacional da busca. Além disso, em problemas de grande escala, uma solução aproximada com uma análise teórica sólida mostra-se mais adequada que uma solução exata com um modelo teórico fraco. Por outro lado, apesar de muitas soluções exatas e aproximadas de busca e mineração terem sido propostas, o modelo de programação em CPU impõe restrições de desempenho para esses tipos de solução. Uma abordagem para melhorar o tempo de execução de técnicas de recuperação e mineração de dados em várias ordens de magnitude é empregar arquiteturas emergentes de programação paralela, como a arquitetura CUDA. Neste contexto, este trabalho apresenta uma proposta para buscas kNN de alto desempenho baseada numa técnica de hashing e implementações paralelas em CUDA. A técnica proposta é baseada no esquema LSH, ou seja, usa-se projeções em subespac¸os. O LSH é uma solução aproximada e tem a vantagem de permitir consultas de custo sublinear para dados em altas dimensões. Usando implementações massivamente paralelas melhora-se tarefas de mineração de dados. Especificamente, foram desenvolvidos soluções de alto desempenho para algoritmos de descoberta de motifs baseados em implementações paralelas de consultas kNN. As implementações massivamente paralelas em CUDA permitem executar estudos experimentais sobre grandes conjuntos de dados reais e sintéticos. A avaliação de desempenho realizada neste trabalho usando GeForce GTX470 GPU resultou em um aumento de desempenho de até 7 vezes, em média sobre o estado da arte em buscas por similaridade e descoberta de motifs / The increasing availability of data in diverse domains has created a necessity to develop techniques and methods to discover knowledge from huge volumes of complex data, motivating many research works in databases, data mining and information retrieval communities. Recent studies have suggested that searching in complex data is an interesting research field because many data mining tasks such as classification, clustering and motif discovery depend on nearest neighbor search algorithms. Thus, many deterministic approaches have been proposed to solve the nearest neighbor search problem in complex domains, aiming to reduce the effects of the well-known curse of dimensionality. On the other hand, probabilistic algorithms have been slightly explored. Recently, new techniques aim to reduce the computational cost relaxing the quality of the query results. Moreover, in large-scale problems, an approximate solution with a solid theoretical analysis seems to be more appropriate than an exact solution with a weak theoretical model. On the other hand, even though several exact and approximate solutions have been proposed, single CPU architectures impose limits on performance to deliver these kinds of solution. An approach to improve the runtime of data mining and information retrieval techniques by an order-of-magnitude is to employ emerging many-core architectures such as CUDA-enabled GPUs. In this work we present a massively parallel kNN query algorithm based on hashing and CUDA implementation. Our method, based on the LSH scheme, is an approximate method which queries high-dimensional datasets with sub-linear computational time. By using the massively parallel implementation we improve data mining tasks, specifically we create solutions for (soft) realtime time series motif discovery. Experimental studies on large real and synthetic datasets were carried out thanks to the highly CUDA parallel implementation. Our performance evaluation on GeForce GTX 470 GPU resulted in average runtime speedups of up to 7x on the state-of-art of similarity search and motif discovery solutions Busca ao vizinho mais próximo Busca por similaridade aproximada CUDA Dados complexos Descoberta der motifs GPGPU Mineração de dados Projeção aleatória Complex data CUDA Data mining GPGPU Motif-discovery Nearest neighbor search Random projection Similarity serach
172	Bank Customer Churn Prediction : A comparison between classification and evaluation methods Tandan, Isabelle, Goteman, Erika January 2020 (has links) This study aims to assess which supervised statistical learning method; random forest, logistic regression or K-nearest neighbor, that is the best at predicting banks customer churn. Additionally, the study evaluates which cross-validation set approach; k-Fold cross-validation or leave-one-out cross-validation that yields the most reliable results. Predicting customer churn has increased in popularity since new technology, regulation and changed demand has led to an increase in competition for banks. Thus, with greater reason, banks acknowledge the importance of maintaining their customer base. The findings of this study are that unrestricted random forest model estimated using k-Fold is to prefer out of performance measurements, computational efficiency and a theoretical point of view. Albeit, k-Fold cross-validation and leave-one-out cross-validation yield similar results, k-Fold cross-validation is to prefer due to computational advantages. For future research, methods that generate models with both good interpretability and high predictability would be beneficial. In order to combine the knowledge of which customers end their engagement as well as understanding why. Moreover, interesting future research would be to analyze at which dataset size leave-one-out cross-validation and k-Fold cross-validation yield the same results. machine learning cross-validation k-fold leave-one-out random forest decision trees k-nearest neighbor logistic regression supervised learning supervised statistical learning binary classification customer churn bank customer churn. Probability Theory and Statistics Sannolikhetsteori och statistik
173	Machine learning techniques for content-based information retrieval / Méthodes d’apprentissage automatique pour la recherche par le contenu de l’information Chafik, Sanaa 22 December 2017 (has links) Avec l’évolution des technologies numériques et la prolifération d'internet, la quantité d’information numérique a considérablement évolué. La recherche par similarité (ou recherche des plus proches voisins) est une problématique que plusieurs communautés de recherche ont tenté de résoudre. Les systèmes de recherche par le contenu de l’information constituent l’une des solutions prometteuses à ce problème. Ces systèmes sont composés essentiellement de trois unités fondamentales, une unité de représentation des données pour l’extraction des primitives, une unité d’indexation multidimensionnelle pour la structuration de l’espace des primitives, et une unité de recherche des plus proches voisins pour la recherche des informations similaires. L’information (image, texte, audio, vidéo) peut être représentée par un vecteur multidimensionnel décrivant le contenu global des données d’entrée. La deuxième unité consiste à structurer l’espace des primitives dans une structure d’index, où la troisième unité -la recherche par similarité- est effective.Dans nos travaux de recherche, nous proposons trois systèmes de recherche par le contenu de plus proches voisins. Les trois approches sont non supervisées, et donc adaptées aux données étiquetées et non étiquetées. Elles sont basées sur le concept du hachage pour une recherche efficace multidimensionnelle des plus proches voisins. Contrairement aux approches de hachage existantes, qui sont binaires, les approches proposées fournissent des structures d’index avec un hachage réel. Bien que les approches de hachage binaires fournissent un bon compromis qualité-temps de calcul, leurs performances en termes de qualité (précision) se dégradent en raison de la perte d’information lors du processus de binarisation. À l'opposé, les approches de hachage réel fournissent une bonne qualité de recherche avec une meilleure approximation de l’espace d’origine, mais induisent en général un surcoût en temps de calcul.Ce dernier problème est abordé dans la troisième contribution. Les approches proposées sont classifiées en deux catégories, superficielle et profonde. Dans la première catégorie, on propose deux techniques de hachage superficiel, intitulées Symmetries of the Cube Locality sensitive hashing (SC-LSH) et Cluster-Based Data Oriented Hashing (CDOH), fondées respectivement sur le hachage aléatoire et l’apprentissage statistique superficiel. SCLSH propose une solution au problème de l’espace mémoire rencontré par la plupart des approches de hachage aléatoire, en considérant un hachage semi-aléatoire réduisant partiellement l’effet aléatoire, et donc l’espace mémoire, de ces dernières, tout en préservant leur efficacité pour la structuration des espaces hétérogènes. La seconde technique, CDOH, propose d’éliminer l’effet aléatoire en combinant des techniques d’apprentissage non-supervisé avec le concept de hachage. CDOH fournit de meilleures performances en temps de calcul, en espace mémoire et en qualité de recherche.La troisième contribution est une approche de hachage basée sur les réseaux de neurones profonds appelée "Unsupervised Deep Neuron-per-Neuron Hashing" (UDN2H). UDN2H propose une indexation individuelle de la sortie de chaque neurone de la couche centrale d’un modèle non supervisé. Ce dernier est un auto-encodeur profond capturant une structure individuelle de haut niveau de chaque neurone de sortie.Nos trois approches, SC-LSH, CDOH et UDN2H, ont été proposées séquentiellement durant cette thèse, avec un niveau croissant, en termes de la complexité des modèles développés, et en termes de la qualité de recherche obtenue sur de grandes bases de données d'information / The amount of media data is growing at high speed with the fast growth of Internet and media resources. Performing an efficient similarity (nearest neighbor) search in such a large collection of data is a very challenging problem that the scientific community has been attempting to tackle. One of the most promising solutions to this fundamental problem is Content-Based Media Retrieval (CBMR) systems. The latter are search systems that perform the retrieval task in large media databases based on the content of the data. CBMR systems consist essentially of three major units, a Data Representation unit for feature representation learning, a Multidimensional Indexing unit for structuring the resulting feature space, and a Nearest Neighbor Search unit to perform efficient search. Media data (i.e. image, text, audio, video, etc.) can be represented by meaningful numeric information (i.e. multidimensional vector), called Feature Description, describing the overall content of the input data. The task of the second unit is to structure the resulting feature descriptor space into an index structure, where the third unit, effective nearest neighbor search, is performed.In this work, we address the problem of nearest neighbor search by proposing three Content-Based Media Retrieval approaches. Our three approaches are unsupervised, and thus can adapt to both labeled and unlabeled real-world datasets. They are based on a hashing indexing scheme to perform effective high dimensional nearest neighbor search. Unlike most recent existing hashing approaches, which favor indexing in Hamming space, our proposed methods provide index structures adapted to a real-space mapping. Although Hamming-based hashing methods achieve good accuracy-speed tradeoff, their accuracy drops owing to information loss during the binarization process. By contrast, real-space hashing approaches provide a more accurate approximation in the mapped real-space as they avoid the hard binary approximations.Our proposed approaches can be classified into shallow and deep approaches. In the former category, we propose two shallow hashing-based approaches namely, "Symmetries of the Cube Locality Sensitive Hashing" (SC-LSH) and "Cluster-based Data Oriented Hashing" (CDOH), based respectively on randomized-hashing and shallow learning-to-hash schemes. The SC-LSH method provides a solution to the space storage problem faced by most randomized-based hashing approaches. It consists of a semi-random scheme reducing partially the randomness effect of randomized hashing approaches, and thus the memory storage problem, while maintaining their efficiency in structuring heterogeneous spaces. The CDOH approach proposes to eliminate the randomness effect by combining machine learning techniques with the hashing concept. The CDOH outperforms the randomized hashing approaches in terms of computation time, memory space and search accuracy.The third approach is a deep learning-based hashing scheme, named "Unsupervised Deep Neuron-per-Neuron Hashing" (UDN2H). The UDN2H approach proposes to index individually the output of each neuron of the top layer of a deep unsupervised model, namely a Deep Autoencoder, with the aim of capturing the high level individual structure of each neuron output.Our three approaches, SC-LSH, CDOH and UDN2H, were proposed sequentially as the thesis was progressing, with an increasing level of complexity in terms of the developed models, and in terms of the effectiveness and the performances obtained on large real-world datasets Indexation multidimensionnelle Apprentissage non supervisé Hachage Recherche des plus proches voisins Apprentissage profond Multidimensionnal indexing Unsupervised learning Hashing Approximate nearest neighbor search Deep learning
174	Detekce fibrilace síní v krátkodobých EKG záznamech / Detection of atrial fibrillation in short-term ECG Ambrožová, Monika January 2019 (has links) Atrial fibrillation is diagnosed in 1-2% of the population, in next decades, it expects a significant increase in the number of patients with this arrhythmia in connection with the aging of the population and the higher incidence of some diseases that are considered as risk factors of atrial fibrillation. The aim of this work is to describe the problem of atrial fibrillation and the methods that allow its detection in the ECG record. In the first part of work there is a theory dealing with cardiac physiology and atrial fibrillation. There is also basic descreption of the detection of atrial fibrillation. In the practical part of work, there is described software for detection of atrial fibrillation, which is provided by BTL company. Furthermore, an atrial fibrillation detector is designed. Several parameters were selected to detect the variation of RR intervals. These are the parameters of the standard deviation, coefficient of skewness and kurtosis, coefficient of variation, root mean square of the successive differences, normalized absolute deviation, normalized absolute difference, median absolute deviation and entropy. Three different classification models were used: support vector machine (SVM), k-nearest neighbor (KNN) and discriminant analysis classification. The SVM classification model achieves the best results. Results of success indicators (sensitivity: 67.1%; specificity: 97.0%; F-measure: 66.8%; accuracy: 92.9%).
175	Adaptivní klient pro sociální síť Twitter / Adaptive Client for Twitter Social Network Guňka, Jiří January 2011 (has links) The goal of this term project is create user friendly client of Twitter. They may use methods of machine learning as naive bayes classifier to mentions new interests tweets. For visualissation this tweets will be use hyperbolic trees and some others methods.
176	A influência da topografia na identificação de centralidades urbanas : estudo de caso no município de Barra do Piraí, Rio de Janeiro / Fontoura Júnior, Caio Flávio Martinez January 2020 (has links) Orientador: Edmur Azevedo Pugliesi / Resumo: A expansão urbana vem formando aglomerados populacionais desordenados, o que causa problemas para a administração municipal. A fim de reduzir este tipo de problema, uma das maneiras de reorganizar o território é o policentrismo, conceito que vem sendo aplicado em grande parte da área urbana de diversos países como Estados Unidos, China e países da Europa. O policentrismo pode ser entendido como uma área urbana com pluralidade de centros urbanos. Há duas abordagens para identificar possíveis centralidades: a morfológica e a funcional. Além disso, não foi encontrado quaisquer resultados de trabalhos científicos que tenham utilizado a variável inclinação do relevo nas análises de identificação de centralidades urbanas. Dessa maneira, a variável declividade pode ser um fator impactante na determinação de uma centralidade ou núcleo urbano para localidades brasileiras e que tenha característica similares da área de estudo. O objetivo desse trabalho propõe um estudo para a identificação de centralidades ou a possibilidade de identificar novos núcleos urbanos, por meio da avaliação da morfologia do terreno na formação de subcentros no município de Barra do Piraí no Estado do Rio de Janeiro, Brasil. Foram adquiridos arquivos vetoriais da base cartográfica do IBGE 2018, que posteriormente foram tratados e analisados. A fim de compreender a identificação de centralidades foram realizadas análises com a utilização da abordagem morfológica: Componentes Principais (CP), o Índice Global de ... (Resumo completo, clicar acesso eletrônico abaixo) / Abstract: The urban expansion has been creating disorderly population agglomerations, which causes problems for the municipal administration. In order to reduce this type of problem, one of the ways to reorganize the territory is polycentrism, a concept that has been applied in a large part of the urban area of several countries such as the United States, China, and countries in Europe. Polycentrism can be understood as an urban area with a plurality of urban centers. There are two approaches to identify possible centralities: the morphological and the functional. In addition, no results were found from scientific studies that have used the variable slope of relief in the analysis of identification of urban centralities. Thus, the slope variable can be an impacting factor in determining a centrality or urban nucleus for Brazilian locations and which has similar characteristics of the study area. The aim of this work proposes a study for the identification of centralities or the possibility of identifying new urban centers, through the evaluation of the morphology of the land in the formation of sub-centers in the municipality of Barra do Piraí in the State of Rio de Janeiro, Brazil. Vector files were acquired from the IBGE 2018 cartographic base, which were later treated and analyzed. To understand the identification of centralities, analyzes were performed using the morphological approach: Principal Components (CP), the Moran Global Index, the Local Indicator of Spatial Association - ... (Complete abstract click electronic access below) / Mestre Policentrismo Planejamento urbano. Abordagem morfológica Componente principal Estimador de densidade Kernel Índice global de Moran Average Nearest Neighbor Polycentrism Planejamento urbano. Morphological approach Principal component kernel density estimator Moran global index
177	[en] APPROXIMATE NEAREST NEIGHBOR SEARCH FOR THE KULLBACK-LEIBLER DIVERGENCE / [pt] BUSCA APROXIMADA DE VIZINHOS MAIS PRÓXIMOS PARA DIVERGÊNCIA DE KULLBACK-LEIBLER 19 March 2018 (has links) [pt] Em uma série de aplicações, os pontos de dados podem ser representados como distribuições de probabilidade. Por exemplo, os documentos podem ser representados como modelos de tópicos, as imagens podem ser representadas como histogramas e também a música pode ser representada como uma distribuição de probabilidade. Neste trabalho, abordamos o problema do Vizinho Próximo Aproximado onde os pontos são distribuições de probabilidade e a função de distância é a divergência de Kullback-Leibler (KL). Mostramos como acelerar as estruturas de dados existentes, como a Bregman Ball Tree, em teoria, colocando a divergência KL como um produto interno. No lado prático, investigamos o uso de duas técnicas de indexação muito populares: Índice Invertido e Locality Sensitive Hashing. Os experimentos realizados em 6 conjuntos de dados do mundo real mostraram que o Índice Invertido é melhor do que LSH e Bregman Ball Tree, em termos de consultas por segundo e precisão. / [en] In a number of applications, data points can be represented as probability distributions. For instance, documents can be represented as topic models, images can be represented as histograms and also music can be represented as a probability distribution. In this work, we address the problem of the Approximate Nearest Neighbor where the points are probability distributions and the distance function is the Kullback-Leibler (KL) divergence. We show how to accelerate existing data structures such as the Bregman Ball Tree, by posing the KL divergence as an inner product embedding. On the practical side we investigated the use of two, very popular, indexing techniques: Inverted Index and Locality Sensitive Hashing. Experiments performed on 6 real world data-sets showed the Inverted Index performs better than LSH and Bregman Ball Tree, in terms of queries per second and precision. [pt] DIVERGENCIA KULLBACK-LEIBER [pt] ARVORES DE BREGMAN [pt] HASH SENSIVEL A LOCALIDADE [pt] INDICES INVERTIDOS [pt] BUSCA DE VIZINHOS MAIS PROXIMOS [en] KULLBACK-LEIBLER DIVERGENCE [en] BREGMAN BALL TREE [en] LOCALITY SENSITIVE HASHING [en] INVERTED INDEX [en] NEAREST NEIGHBOR SEARCH
178	Spatially Resolved Hydration Statistical Mechanics at Biomolecular Surfaces from Atomistic Simulations Heinz, Leonard 13 December 2021 (has links) No description available. 571.4 entropy hydration protein folding protein stability membrane fusion membrane thickness nearest neighbor search hydrophobic effect molecular dynamics simulation MD simulation mutual information expansion hydration shell mutual information Biologie (PPN619462639)
179	[en] AUTOMATED SYNTHESIS OF OPTIMAL DECISION TREES FOR SMALL COMBINATORIAL OPTIMIZATION PROBLEMS / [pt] SÍNTESE AUTOMATIZADA DE ÁRVORES DE DECISÃO ÓTIMAS PARA PEQUENOS PROBLEMAS DE OTIMIZAÇÃO COMBINATÓRIA CLEBER OLIVEIRA DAMASCENO 24 August 2021 (has links) [pt] A análise de complexidade clássica para problemas NP-difíceis é geralmente orientada para cenários de pior caso, considerando apenas o comportamento assintótico. No entanto, existem algoritmos práticos com execução em um tempo razoável para muitos problemas clássicos. Além disso, há evidências que apontam para algoritmos polinomiais no modelo de árvore de decisão linear para resolver esses problemas, embora não muito explorados. Neste trabalho, exploramos esses resultados teóricos anteriores. Mostramos que a solução ótima para problemas combinatórios 0-1 pode ser encontrada reduzindo esses problemas para uma Busca por Vizinho Mais Próximo sobre o conjunto de vértices de Voronoi correspondentes. Utilizamos os hiperplanos que delimitam essas regiões para gerar sistematicamente uma árvore de decisão que repetidamente divide o espaço até que possa separar todas as soluções, garantindo uma resposta ótima. Fazemos experimentos para testar os limites de tamanho para os quais podemos construir essas árvores para os casos do 0-1 knapsack, weighted minimum cut e symmetric traveling salesman. Conseguimos encontrar as árvores desses problemas com tamanhos até 10, 5 e 6, respectivamente. Obtemos também as relações de adjacência completas para os esqueletos dos politopos do knapsack e do traveling salesman até os tamanhos 10 e 7. Nossa abordagem supera consistentemente o método de enumeração e os métodos baseline para o weighted minimum cut e symmetric traveling salesman, fornecendo soluções ótimas em microssegundos. / [en] Classical complexity analysis for NP-hard problems is usually oriented to worst-case scenarios, considering only the asymptotic behavior. However, there are practical algorithms running in a reasonable time for many classic problems. Furthermore, there is evidence pointing towards polynomial algorithms in the linear decision tree model to solve these problems, although not explored much. In this work, we explore previous theoretical results. We show that the optimal solution for 0-1 combinatorial problems can be found by reducing these problems into a Nearest Neighbor Search over the set of corresponding Voronoi vertices. We use the hyperplanes delimiting these regions to systematically generate a decision tree that repeatedly splits the space until it can separate all solutions, guaranteeing an optimal answer. We run experiments to test the size limits for which we can build these trees for the cases of the 0-1 knapsack, weighted minimum cut, and symmetric traveling salesman. We manage to find the trees of these problems with sizes up to 10, 5, and 6, respectively. We also obtain the complete adjacency relations for the skeletons of the knapsack and traveling salesman polytopes up to size 10 and 7. Our approach consistently outperforms the enumeration method and the baseline methods for the weighted minimum cut and symmetric traveling salesman, providing optimal solutions within microseconds. [pt] OTIMIZACAO COMBINATORIA [pt] BUSCA POR VIZINHO MAIS PROXIMO [pt] DIAGRAMAS DE VORONOI [pt] POLITOPOS [en] COMBINATORIAL OPTIMIZATION [en] LINEAR DECISION TREE MODEL [en] NEAREST NEIGHBOR SEARCH [en] VORONOI DIAGRAMS [en] POLYTOPES
180	Effect of Step Change in Growth Speed During Directional Solidification on Array Morphology of Al-7 wt% Si Alloy Pakiru, Swapna January 2011 (has links) No description available. Chemical Engineering Materials Science Metallurgy Directional Solidification Step change Growth Speed Morphology Nearest Neighbor Spacing Trunk diameter Side branch length Side branch orientation Al- 7 wt% Si alloy thermal gradient step increase step decrease steady state transition.

Search results