About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

On Dimensionality Reduction of Data

Vamulapalli, Harika Rao 05 August 2010 (has links)
The random projection method is an important tool for the dimensionality reduction of data, and it can be made efficient with strong error guarantees. In this thesis, we focus on linear transforms of high-dimensional data to a low-dimensional space satisfying the Johnson-Lindenstrauss lemma. In addition, we prove some theoretical results about these projections that are of interest in practical applications. We show how the technique can be applied to synthetic data with a probabilistic guarantee on pairwise distances. The connection between dimensionality reduction and compressed sensing is also discussed.
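As a hedged illustration of the kind of linear transform this abstract describes, the sketch below applies a Gaussian random projection to synthetic data and checks one pairwise distance against the Johnson-Lindenstrauss distortion bound. All sizes and the distortion parameter eps are illustrative assumptions, not values from the thesis.

```python
# Minimal Johnson-Lindenstrauss sketch; n, d and eps are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 2000, 0.2                    # points, ambient dim, distortion
k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))  # JL target dim

X = rng.standard_normal((n, d))               # synthetic high-dimensional data
R = rng.standard_normal((d, k)) / np.sqrt(k)  # Gaussian projection matrix
Y = X @ R                                     # low-dimensional embedding

# With high probability every pairwise distance is preserved to 1 +/- eps.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(k, proj / orig)                         # ratio should lie in [0.8, 1.2]
```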
2

Contribution to dimension reduction techniques: application to object tracking

Lu, Weizhi 16 July 2014 (has links)
This thesis studies three popular dimension reduction techniques, namely compressed sensing, random projection and sparse representation, and brings significant improvements to each of them. In compressed sensing, constructing a sensing matrix with both good performance and a hardware-friendly structure has been a significant challenge. In this thesis, we explicitly propose the optimal zero-one binary matrix by searching for the best restricted isometry property (RIP). In practice, an efficient greedy algorithm is subsequently developed to construct the optimal binary matrix of arbitrary size. Moreover, we also study another interesting problem for compressed sensing, namely the performance of sensing matrices at high compression rates. For the first time, the performance floor of random Bernoulli matrices over increasing compression rates is observed and effectively estimated. Random projection is mainly used for classification, where the construction of the random projection matrix is likewise critical in terms of both performance and complexity. This thesis presents the sparsest random projection matrix proposed so far, which is proven to offer better feature-selection performance than denser random matrices. The theoretical result is confirmed by extensive experiments. As a novel technique for feature or sample selection, sparse representation has recently been widely applied in image processing. In this thesis, we mainly focus on its applications to visual object tracking. To reduce the computational load of sparse representation, a simple but efficient scheme is proposed for single-object tracking. Subsequently, the potential of sparse representation for multi-object tracking is investigated.
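To make the sparse-projection idea concrete, here is a hedged sketch of a very sparse random projection with entries in {-1, 0, +1}. This is a standard construction in the spirit of the matrices discussed above, not the thesis' optimal zero-one design, and the sparsity parameter s is an illustrative choice.

```python
# Very sparse random projection sketch; s, d and k are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k = 2000, 100                       # original and reduced dimensions
s = int(np.sqrt(d))                    # roughly 1/s of entries are nonzero

signs = rng.choice([-1.0, 0.0, 1.0], size=(d, k),
                   p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
R = np.sqrt(s / k) * signs             # scaling keeps E[|R.T @ x|^2] = |x|^2

x = rng.standard_normal(d)
print(np.linalg.norm(R.T @ x) / np.linalg.norm(x))  # close to 1 on average
```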
3

Spectral classification of high-dimensional time series

Zhang, Fuli 01 August 2018 (has links)
In this era of big data, multivariate time-series (MTS) data are prevalent in diverse domains and often high dimensional. However, there have been limited studies of building a capable MTS classifier via classical machine learning methods that can deal with the double curse of dimensionality arising from high variable dimension and long time series (large sample size). In this thesis, we propose two approaches to this problem for multiclass classification with high-dimensional MTS. Both approaches leverage the dynamics of an MTS captured by non-parametric modeling of its spectral density function. In the first approach, we introduce the reduced-rank spectral classifier (RRSC), which utilizes low-rank estimation and some new discrimination functions. We illustrate the efficacy of the RRSC with both simulations and real applications. For binary classification, we establish the consistency of the RRSC and provide an asymptotic formula for the misclassification error rate, under some regularity conditions. The second approach concerns the development of the random projection ensemble classifier for time series (RPECTS). This method first applies dimension reduction in the time domain by projecting the time-series variables into some low-dimensional space, and then measures, via some novel base classifier, the disparity between the data and the candidate generating processes in the projected space. We assess the classification performance of our new approaches by simulations and compare them with some existing methods on real applications. Finally, we describe two R packages that implement the aforementioned methods.
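The following sketch illustrates the RPECTS idea under stated assumptions: each ensemble member randomly projects the p series variables into q dimensions and a simple spectral base classifier votes. The nearest-centroid rule on log-periodogram features is an illustrative stand-in for the thesis' disparity measure, and all parameters are assumptions.

```python
# Random-projection ensemble sketch for multivariate time series.
import numpy as np

rng = np.random.default_rng(0)

def spectral_features(X):
    # X: (T, q) projected series -> mean log-periodogram per channel
    P = np.abs(np.fft.rfft(X, axis=0)) ** 2
    return np.log(P + 1e-12).mean(axis=0)

def rpects_like_predict(train, labels, test, q=3, B=25):
    p = train[0].shape[1]
    classes = sorted(set(labels))
    votes = []
    for _ in range(B):
        R = rng.standard_normal((p, q)) / np.sqrt(q)   # random projection
        feats = np.array([spectral_features(X @ R) for X in train])
        cents = {c: feats[np.array(labels) == c].mean(axis=0) for c in classes}
        z = spectral_features(test @ R)
        votes.append(min(classes, key=lambda c: np.linalg.norm(z - cents[c])))
    return max(classes, key=votes.count)               # majority vote

# Toy usage: two classes of 20-variable series distinguished by their spectra.
train = [rng.standard_normal((128, 20)) * (1 + (i % 2)) for i in range(10)]
labels = [i % 2 for i in range(10)]
print(rpects_like_predict(train, labels, train[0]))    # expected: 0
```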
4

Dimensionality Reduction of Hyperspectral Imagery Using Random Projections

Menon, Vineetha 09 December 2016 (has links)
Hyperspectral imagery is often associated with high storage and transmission costs. Dimensionality reduction aims to reduce the time and space complexity of hyperspectral imagery by projecting data into a low-dimensional space such that all the important information in the data is preserved. Dimensionality-reduction methods based on transforms are widely used and give a data-dependent representation that is unfortunately costly to compute. Recently, there has been a growing interest in data-independent representations for dimensionality reduction; of particular prominence are random projections which are attractive due to their computational efficiency and simplicity of implementation. This dissertation concentrates on exploring the realm of computationally fast and efficient random projections by considering projections based on a random Hadamard matrix. These Hadamard-based projections are offered as an alternative to more widely used random projections based on dense Gaussian matrices. Such Hadamard matrices are then coupled with a fast singular value decomposition in order to implement a two-stage dimensionality reduction that marries the computational benefits of the data-independent random projection to the structure-capturing capability of the data-dependent singular value transform. Finally, random projections are applied in conjunction with nonnegative least squares to provide a computationally lightweight methodology for the well-known spectral-unmixing problem. Overall, it is seen that random projections offer a computationally efficient framework for dimensionality reduction that permits hyperspectral-analysis tasks such as unmixing and classification to be conducted in a lower-dimensional space without sacrificing analysis performance while reducing computational costs significantly.
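A hedged sketch of the two-stage idea follows: a subsampled randomized Hadamard projection (a standard construction that may differ in detail from the dissertation's method) compresses the spectral dimension, and an SVD of the sketch then extracts low-rank structure. The data and all sizes are synthetic stand-ins for hyperspectral pixels.

```python
# Hadamard projection + SVD two-stage sketch; sizes are assumptions.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, d, k, r = 500, 1024, 64, 10        # pixels, bands (power of 2), sketch, rank

X = rng.standard_normal((n, d))       # stand-in for hyperspectral pixels
D = rng.choice([-1.0, 1.0], size=d)   # random sign flips per band
H = hadamard(d) / np.sqrt(d)          # orthonormal Hadamard matrix
cols = rng.choice(d, size=k, replace=False)

# Stage 1: fast data-independent random projection (in practice the dense
# H would be replaced by a fast Walsh-Hadamard transform).
Y = (X * D) @ H[:, cols] * np.sqrt(d / k)
# Stage 2: data-dependent SVD of the small sketch captures structure.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
Z = U[:, :r] * s[:r]                  # rank-r representation of the pixels
print(Z.shape)                        # (500, 10)
```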
5

Similarity Search in Document Collections

Jordanov, Dimitar Dimitrov January 2009 (has links)
The main goal of this thesis is to evaluate the performance of the freely distributed Semantic Vectors package and of the MoreLikeThis class from the Apache Lucene package. The thesis offers a comparison of these two approaches and introduces methods that can lead to improved search quality.
6

Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data

Carraher, Lee A. 22 May 2018 (has links)
No description available.
7

Fast hierarchical algorithms for the low-rank approximation of matrices, with applications to materials physics, geostatistics and data analysis

Blanchard, Pierre 16 February 2017 (has links)
Advanced techniques for the low-rank approximation of matrices are crucial dimension reduction tools in many domains of modern scientific computing. Hierarchical approaches such as H2-matrices, in particular the Fast Multipole Method (FMM), exploit the block low-rank structure of certain matrices to reduce the cost of computing n-body problems from O(n²) to O(n) operations. In order to better handle kernels of various kinds, kernel-independent FMM formulations have recently arisen, such as polynomial-interpolation-based FMMs. However, these become very costly for high-dimensional tensorial kernels, so we designed a new, highly efficient interpolation-based FMM, called the Uniform FMM, and implemented it in the parallel library ScalFMM. The method relies on an equispaced interpolation grid and the Fast Fourier Transform (FFT). Its performance and accuracy were compared with those of the Chebyshev-interpolation-based FMM. Numerical experiments on artificial benchmarks showed that the loss of accuracy induced by the interpolation scheme is largely compensated by the FFT optimization. First, we extended both interpolation-based FMMs to the computation of the isotropic elastic fields involved in Dislocation Dynamics (DD) simulations. Second, we used our new FMM algorithm to accelerate a rank-r randomized SVD and thus efficiently generate multivariate Gaussian random variables on large heterogeneous grids in O(n) operations. Finally, we designed a new efficient dimensionality reduction algorithm based on dense random projection in order to investigate new ways of characterizing biodiversity, namely from a geometric point of view.
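As a minimal sketch of the rank-r randomized SVD that the thesis accelerates, the snippet below uses plain dense matrix products where the FMM would apply the kernel matrix; the test matrix, rank and oversampling are illustrative assumptions.

```python
# Rank-r randomized SVD sketch (Halko-Martinsson-Tropp style).
import numpy as np

rng = np.random.default_rng(0)

def randomized_svd(A, r, oversample=10):
    # Range finding: sample the column space with a random matrix.
    Omega = rng.standard_normal((A.shape[1], r + oversample))
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis of the range
    # Solve the small problem and lift the factors back.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :r], s[:r], Vt[:r]

# Test on a nearly low-rank matrix: the reconstruction error is tiny.
A = rng.standard_normal((2000, 30)) @ rng.standard_normal((30, 400))
U, s, Vt = randomized_svd(A, r=30)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))  # ~ 0
```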
8

On approximate solutions to scalable data mining algorithms for complex data problems using GPGPU

Mamani, Alexander Victor Ocsa 22 September 2011 (has links)
The increasing availability of data in diverse domains has created a need to develop techniques and methods for discovering knowledge from huge volumes of complex data, motivating many research works in the database, data mining and information retrieval communities. Recent studies suggest that searching in complex data is an important research field, because many data mining tasks such as classification, clustering and motif discovery depend on nearest-neighbor search algorithms. Many deterministic approaches have thus been proposed to solve the nearest-neighbor search problem in complex domains, aiming to reduce the effects of the well-known curse of dimensionality, whereas probabilistic algorithms have been only slightly explored. Recent techniques relax the quality of the query results in order to reduce the computational cost of the search. Moreover, in large-scale problems, an approximate solution with a solid theoretical analysis appears more appropriate than an exact solution with a weak theoretical model. 
On the other hand, even though several exact and approximate solutions have been proposed, single-CPU architectures impose performance limits on these kinds of solutions. One approach to improving the runtime of data mining and information retrieval techniques by an order of magnitude is to employ emerging many-core architectures such as CUDA-enabled GPUs. In this work we present a massively parallel kNN query algorithm based on hashing and implemented in CUDA. Our method, based on the LSH scheme, is an approximate method that queries high-dimensional datasets in sub-linear time. Using this massively parallel implementation, we improve data mining tasks; specifically, we create solutions for (soft) real-time time-series motif discovery. Experimental studies on large real and synthetic datasets were carried out thanks to the massively parallel CUDA implementation. Our performance evaluation on a GeForce GTX 470 GPU yielded average runtime speedups of up to 7x over state-of-the-art similarity search and motif discovery solutions.
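Below is a hedged CPU sketch of the LSH bucketing idea behind these kNN queries; the thesis' contribution is a massively parallel CUDA implementation, which this single-threaded NumPy version does not attempt to reproduce, and all parameters are illustrative.

```python
# Random-hyperplane LSH sketch for approximate kNN search.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def build_index(X, n_bits=12):
    H = rng.standard_normal((X.shape[1], n_bits))   # random hyperplanes
    bits = (X @ H > 0).astype(int)                  # sign pattern per point
    codes = bits @ (1 << np.arange(n_bits))         # pack bits into one int
    buckets = defaultdict(list)
    for i, c in enumerate(codes):
        buckets[int(c)].append(i)
    return H, buckets

def query(X, H, buckets, q, k=5):
    c = int((q @ H > 0).astype(int) @ (1 << np.arange(H.shape[1])))
    cand = buckets.get(c, range(len(X)))            # empty bucket: full scan
    cand = np.asarray(list(cand))
    d = np.linalg.norm(X[cand] - q, axis=1)         # exact ranking in bucket
    return cand[np.argsort(d)[:k]]

X = rng.standard_normal((10000, 64))
H, buckets = build_index(X)
print(query(X, H, buckets, X[0]))                   # index 0 ranks first
```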
9

Advances on Dimension Reduction for Multivariate Linear Regression

Guo, Wenxing January 2020 (has links)
Multivariate linear regression methods are widely used statistical tools in data analysis; they were developed for settings where several response variables are studied simultaneously, and our aim is to study the relationship between the predictor variables and the response variables through the regression coefficient matrix. Rapid improvements in information technology have brought us large quantities of large-scale data, but also great challenges in data processing. When dealing with high-dimensional data, the classical least squares estimator is not applicable in multivariate linear regression analysis. In recent years, several approaches have been developed to deal with high-dimensional data problems, among which dimension reduction is one of the main ones. In the literature, random projection methods have been used to reduce dimension in large datasets. In Chapter 2, a new random projection method with low-rank matrix approximation is proposed to reduce the dimension of the parameter space in the high-dimensional multivariate linear regression model. Some statistical properties of the proposed method are studied, and explicit expressions are derived for the accuracy loss of the method with Gaussian random projection and orthogonal random projection. These expressions are precise rather than bounds up to constants.

In multivariate regression analysis, reduced-rank regression is another dimension reduction method, which has become an important tool thanks to its simplicity, computational efficiency and good predictive performance. In practical situations, however, the performance of the reduced-rank estimator is not satisfactory when the predictor variables are highly correlated or the signal-to-noise ratio is small. To overcome this problem, in Chapter 3 we incorporate matrix projections into the reduced-rank regression method, and develop reduced-rank regression estimators based on random projection and orthogonal projection in high-dimensional multivariate linear regression models. We also propose a consistent estimator of the rank of the coefficient matrix and derive prediction performance bounds for the proposed estimators based on mean squared errors.

Envelope methodology is another approach that has become popular in recent years for reducing estimative and predictive variation in multivariate regression, comprising a class of methods that improve efficiency without changing the traditional objectives. Variable selection is the process of selecting a subset of relevant feature variables for use in model construction; its purpose is to avoid the curse of dimensionality, simplify models to make them easier to interpret, shorten training time and reduce overfitting. In Chapter 4, we combine envelope models with a group variable selection method to propose an envelope-based sparse reduced-rank regression estimator in high-dimensional multivariate linear regression models, and we establish its consistency, asymptotic normality and oracle property.

Tensor data are in frequent use today in a variety of fields in science and engineering, and processing them is a practical but challenging problem; the prevalence of tensor data has recently led to several envelope tensor versions. In Chapter 5, we incorporate the envelope technique into tensor regression analysis and propose a partial tensor envelope model, which yields a parsimonious version of tensor response regression when some predictors are of special interest, and we prove consistency and asymptotic normality of the coefficient estimators. The proposed method achieves significant gains in efficiency over the standard tensor response regression model in estimating the coefficients of the selected predictors. Finally, in Chapter 6, we summarize the work carried out in the thesis and suggest some problems of further research interest. / Dissertation / Doctor of Philosophy (PhD)
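As a hedged illustration of the Chapter 2 idea, the sketch below randomly projects a high-dimensional predictor space before applying least squares; the Gaussian projection and all sizes are illustrative assumptions rather than the thesis' exact estimator.

```python
# Random-projection-assisted multivariate linear regression sketch.
import numpy as np

rng = np.random.default_rng(0)
n, p, m, q = 200, 1000, 5, 50          # samples, predictors, responses, sketch

# Sparse ground truth: only the first 10 predictors matter (an assumption).
B_true = np.zeros((p, m))
B_true[:10] = rng.standard_normal((10, m))
X = rng.standard_normal((n, p))
Y = X @ B_true + 0.1 * rng.standard_normal((n, m))

# Classical OLS fails here (p > n); project the parameter space instead.
R = rng.standard_normal((p, q)) / np.sqrt(q)    # random projection
G, *_ = np.linalg.lstsq(X @ R, Y, rcond=None)   # OLS in the projected space
B_hat = R @ G                                   # lifted coefficient matrix

print(np.linalg.norm(X @ B_hat - Y) / np.linalg.norm(Y))  # in-sample fit
```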
