1 |
PROBABILISTIC PREDICTION USING EMBEDDED RANDOM PROJECTIONS OF HIGH DIMENSIONAL DATA. Kurwitz, Richard C. 2009 May 1900 (has links)
The explosive growth of digital data collection and processing demands a new
approach to the historical engineering methods of data correlation and model creation. A
new prediction methodology based on high dimensional data has been developed. Since
most high dimensional data resides on a low dimensional manifold, the new prediction
methodology is one of dimensional reduction with embedding into a diffusion space that
allows optimal distribution along the manifold. The resulting data manifold space is then
used to produce a probability density function that uses spatial weighting to influence
predictions; i.e., data nearer the query have greater importance than data farther away.
The methodology also allows data of differing phenomenology (e.g., color, shape,
temperature) to be handled by regression or by clustering classification.
The new methodology is first developed and validated, and then applied to common
engineering situations such as critical heat flux prediction and shuttle pitch angle
determination. A number of illustrative examples are given, with a significant focus
on the objective identification of two-phase flow regimes. The new methodology is shown to be robust, producing accurate predictions even from a small number of data points in the diffusion space, and flexible, handling a wide range
of engineering problems.
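The spatially weighted prediction described in this abstract can be illustrated with a short sketch. This is a minimal, assumed implementation of the general idea (a Gaussian-kernel weighted average over points in a low-dimensional embedding), not the thesis's exact estimator; the function and parameter names (`weighted_predict`, `eps`) are illustrative.

```python
import numpy as np

def weighted_predict(embedded, values, query, eps=1.0):
    """Predict at `query` as a kernel-weighted average over embedded data:
    points nearer the query receive exponentially larger weights."""
    embedded = np.asarray(embedded, dtype=float)
    d2 = np.sum((embedded - np.asarray(query, dtype=float)) ** 2, axis=1)
    w = np.exp(-d2 / eps)          # spatial weighting: nearer data dominate
    return float(w @ np.asarray(values, dtype=float) / w.sum())
```

For example, querying at the location of the first data point returns a value pulled strongly toward that point's response, with distant points contributing little.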
|
2 |
Asymptotic Performance Analysis of the Randomly-Projected RLDA Ensemble Classifier. Niyazi, Lama 07 1900 (has links)
Reliability and computational efficiency of classification error estimators are critical factors in classifier design. In a high-dimensional data setting where data is scarce, the conventional method of error estimation, cross-validation, can be very computationally expensive. In this thesis, we consider a particular discriminant analysis type classifier, the Randomly-Projected RLDA ensemble classifier, which operates under the assumption of such a ‘small sample’ regime. We conduct an asymptotic study of the generalization error of this classifier under this regime, which necessitates the use of tools from the field of random matrix theory. The main outcome of this study is a deterministic function of the true statistics of the data and the problem dimension that approximates the generalization error well for large enough dimensions. This is demonstrated by simulation on synthetic data. The main advantage of this approach is that it is computationally efficient. It also constitutes a major step towards the construction of a consistent estimator of the error that depends on the training data and not the true statistics, and so can be applied to real data. An analogous quantity for the Randomly-Projected LDA ensemble classifier, which appears in the literature and is a special case of the former, is also derived. We motivate its use for tuning the parameter of this classifier by simulation on synthetic data.
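The mechanics of the ensemble classifier analyzed in this thesis can be sketched in a few lines of numpy. This is an assumed, simplified rendering (random Gaussian projections, a regularized LDA discriminant in each projected space, averaged scores), not the thesis's exact estimator or its asymptotic analysis; names such as `gamma` (the regularization parameter) and `M` (ensemble size) are illustrative.

```python
import numpy as np

def rp_rlda_ensemble_predict(X0, X1, X_test, d=5, M=10, gamma=0.1, seed=0):
    """Average M regularized-LDA discriminant scores, each computed after a
    random projection of the data to d dimensions; classify by sign."""
    rng = np.random.default_rng(seed)
    p = X0.shape[1]
    scores = np.zeros(len(X_test))
    for _ in range(M):
        R = rng.standard_normal((p, d)) / np.sqrt(d)     # random projection
        Z0, Z1, Zt = X0 @ R, X1 @ R, X_test @ R
        m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
        Zc = np.vstack([Z0 - m0, Z1 - m1])               # pooled centering
        S = Zc.T @ Zc / len(Zc) + gamma * np.eye(d)      # regularized covariance
        w = np.linalg.solve(S, m1 - m0)                  # RLDA direction
        scores += (Zt - (m0 + m1) / 2) @ w
    return (scores > 0).astype(int)                      # 1 = class of X1
```

Averaging over projections reduces the variance introduced by any single random matrix, which is the motivation for the ensemble studied in the thesis.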
|
3 |
Scalable Nonparametric Bayes Learning. Banerjee, Anjishnu January 2013 (has links)
Capturing high dimensional complex ensembles of data is becoming commonplace in a variety of application areas. Some examples include biological studies exploring relationships between genetic mutations and diseases, atmospheric and spatial data, and internet usage and online behavioral data. These large complex data present many challenges for modeling and statistical analysis. Motivated by high dimensional data applications, in this thesis we focus on building scalable Bayesian nonparametric regression algorithms and on developing models for joint distributions of complex object ensembles.
We begin with a scalable method for Gaussian process regression, a commonly used tool for nonparametric regression, prediction and spatial modeling. A very common bottleneck for large data sets is the need for repeated inversions of a big covariance matrix, which is required for likelihood evaluation and inference. Such inversion can be practically infeasible and, even if implemented, highly numerically unstable. We propose an algorithm utilizing random projection ideas to construct flexible, computationally efficient and easy to implement approaches for generic scenarios. We then further improve the algorithm, incorporating structure and blocking ideas into our random projections, and demonstrate their applicability in other contexts requiring inversion of large covariance matrices. We show theoretical guarantees for performance as well as substantial improvements over existing methods on simulated and real data. A by-product of this work is the discovery of hitherto unknown equivalences between approaches in machine learning, randomized linear algebra and Bayesian statistics. We finally connect random projection methods for large dimensional predictors and large sample sizes under a unifying theoretical framework.
The other focus of this thesis is joint modeling of complex ensembles of data from different domains. This goes beyond traditional relational modeling of ensembles of one type of data and relies on probability mixing measures over tensors. These models have added flexibility over some existing product mixture model approaches in letting each component of the ensemble have its own dependent cluster structure. We further investigate the question of measuring dependence between variables of different types and propose a very general novel scaled measure based on divergences between the joint and marginal distributions of the objects. Once again, we show excellent performance in both simulated and real data scenarios. / Dissertation
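One standard way to sidestep the covariance-inversion bottleneck mentioned in this abstract is to work in a randomized low-dimensional feature space. The sketch below uses random Fourier features to approximate an RBF-kernel Gaussian process regression, reducing the O(n^3) inversion to an m x m solve; this is an assumed stand-in for the thesis's own random-projection algorithm, and the parameter names (`m`, `lengthscale`, `noise`) are illustrative.

```python
import numpy as np

def rff_gp_regression(X, y, X_new, m=100, lengthscale=1.0, noise=0.1, seed=0):
    """Approximate GP (kernel ridge) regression with m random Fourier features:
    cos(XW + b) features make the RBF kernel's Gram matrix low-rank, so only an
    m x m system is solved instead of inverting the n x n covariance."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    W = rng.standard_normal((p, m)) / lengthscale
    b = rng.uniform(0.0, 2.0 * np.pi, m)
    phi = lambda A: np.sqrt(2.0 / m) * np.cos(A @ W + b)
    P = phi(X)                                   # n x m feature matrix
    A = P.T @ P + noise * np.eye(m)              # m x m regularized system
    alpha = np.linalg.solve(A, P.T @ y)
    return phi(X_new) @ alpha
```

For smooth targets the approximation is already accurate with a modest number of features, which is what makes this family of methods attractive at scale.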
|
4 |
Two-Sample Testing of High-Dimensional Covariance Matrices. Sun, Nan (0000-0003-0278-5254) January 2021 (has links)
Testing the equality of two high-dimensional covariance matrices is challenging. As the most efficient way to measure evidential discrepancies in observed data, the likelihood ratio test is expected to be powerful when the null hypothesis is violated. However, when the data dimensionality becomes large and potentially exceeds the sample size by a substantial margin, likelihood ratio based approaches face practical and theoretical challenges. To solve this problem, this study proposes a method in which we first randomly project the original high-dimensional data into a lower-dimensional space and then apply corrected likelihood ratio tests developed with random matrix theory. We show that testing with a single random projection is consistent under the null hypothesis. Through evaluating the power function, which is challenging in this context, we provide evidence that the test with a single random projection, based on a projection matrix with a reasonable number of columns, is more powerful when the two covariance matrices are unequal but the component-wise discrepancy is small -- a weak and dense signal setting. To utilize the information in the data more efficiently, we propose combined tests that pool multiple random projections in the manner of a meta-analysis. We establish the foundation of the combined tests with a theoretical analysis showing that the p-values from multiple random projections are asymptotically independent in this high-dimensional covariance testing problem. We then show that the combined tests are also consistent under the null hypothesis. In addition, our theory presents the merit of certain meta-analysis approaches over testing with a single random projection. A numerical evaluation of the power function of the combined tests is also provided, building on the numerical evaluation for a single random projection.
Extensive simulations and two real genetic data analyses confirm the merits and potential applications of our test. / Statistics
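The meta-analytic combination of p-values from multiple random projections can be illustrated with Fisher's method, one classical member of the class of combining rules this abstract refers to (the thesis's asymptotic-independence result is what licenses such combining; whether Fisher's rule is the specific one studied is an assumption here). The chi-square survival function is computed in closed form for even degrees of freedom to keep the sketch dependency-free.

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-square with even df:
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!  where k = df/2."""
    k = df // 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= (x / 2.0) / i
        total += term
    return math.exp(-x / 2.0) * total

def fisher_combine(pvals):
    """Fisher's method: under H0 with independent p-values,
    -2 * sum(log p_i) ~ chi-square with 2k degrees of freedom."""
    stat = -2.0 * sum(math.log(p) for p in pvals)
    return chi2_sf_even_df(stat, 2 * len(pvals))
```

Uniformly moderate p-values combine to a large (non-significant) p-value, while several small p-values reinforce each other into a much smaller combined one.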
|
5 |
An Empirical Study of Novel Approaches to Dimensionality Reduction and Applications. Nsang, Augustine S. 23 September 2011 (has links)
No description available.
|
6 |
Dimension reduction of streaming data via random projections. Cosma, Ioana Ada January 2009 (links)
A data stream is a transiently observed sequence of data elements that arrive unordered, with repetitions, and at a very high rate of transmission. Examples include Internet traffic data, networks of banking and credit transactions, and radar derived meteorological data. Computer science and engineering communities have developed randomised, probabilistic algorithms to estimate statistics of interest over streaming data on the fly, with small computational complexity and storage requirements, by constructing low dimensional representations of the stream known as data sketches. This thesis combines techniques of statistical inference with algorithmic approaches, such as hashing and random projections, to derive efficient estimators for cardinality, l_{alpha} distance and quasi-distance, and entropy over streaming data. I demonstrate an unexpected connection between two approaches to cardinality estimation that involve indirect record keeping: the first using pseudo-random variates and storing selected order statistics, and the second using random projections. I show that l_{alpha} distances and quasi-distances between data streams, and entropy, can be recovered from random projections that exploit properties of alpha-stable distributions with full statistical efficiency. This is achieved by the method of L-estimation in a single-pass algorithm with modest computational requirements. The proposed estimators have good small sample performance, improved by the methods of trimming and winsorising; in other words, the value of these summary statistics can be approximated with high accuracy from data sketches of low dimension. Finally, I consider the problem of convergence assessment of Markov Chain Monte Carlo methods for simulating from complex, high dimensional, discrete distributions. 
I argue that online, fast, and efficient computation of summary statistics such as cardinality, entropy, and l_{alpha} distances may be a useful qualitative tool for detecting lack of convergence, and illustrate this with simulations of the posterior distribution of a decomposable Gaussian graphical model via the Metropolis-Hastings algorithm.
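The alpha-stable projection idea in this abstract can be shown concretely for alpha = 1: projecting a vector with i.i.d. standard Cauchy entries makes each sketch coordinate distributed as the l1 norm times a standard Cauchy variate, so a robust location estimate of the absolute sketch values recovers the l1 distance. The median used below is the simplest such estimator; the thesis's L-estimators (with trimming and winsorising) are refinements of this idea, so treat the sketch as an assumed illustration rather than the thesis's exact method.

```python
import numpy as np

def l1_distance_from_sketch(u, v, k=500, seed=0):
    """Estimate ||u - v||_1 from k-dimensional Cauchy (1-stable) sketches:
    each coordinate of (u - v) @ R is ||u - v||_1 * Cauchy, and the median
    of |standard Cauchy| equals 1, so median(|proj|) estimates the distance."""
    rng = np.random.default_rng(seed)
    R = rng.standard_cauchy((len(u), k))
    proj = (np.asarray(u, dtype=float) - np.asarray(v, dtype=float)) @ R
    return float(np.median(np.abs(proj)))
```

Crucially, the sketch is linear, so it can be maintained incrementally over a stream: each arriving element updates the k sketch coordinates in O(k) time without storing the stream itself.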
|
7 |
Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease. Duan, Haoyang 15 May 2014 (has links)
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. Then, this thesis compares the performance of Random Projections with k-NN against MTD Feature Selection and Random Forest for predicting artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections and k-NN. Random Forest is able to obtain an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset, when 3335 SNPs are selected by MTD Feature Selection for classification. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
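The first pipeline compared in this thesis, Random Projections followed by k-NN, can be sketched directly in numpy. This is an assumed minimal rendering (a Gaussian projection matrix and majority-vote k-NN), not the thesis's exact experimental setup; the SNP data and tuning choices are of course not reproduced here.

```python
import numpy as np

def rp_knn_predict(X_train, y_train, X_test, d=10, k=5, seed=0):
    """Project train and test data with one shared Gaussian random matrix,
    then classify each test point by majority vote among its k nearest
    training neighbours in the projected space."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X_train.shape[1], d)) / np.sqrt(d)
    Zt, Zq = X_train @ R, X_test @ R
    preds = []
    for q in Zq:
        idx = np.argsort(np.sum((Zt - q) ** 2, axis=1))[:k]
        vals, counts = np.unique(np.asarray(y_train)[idx], return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```

By the Johnson-Lindenstrauss property, pairwise distances are approximately preserved under the projection, which is why k-NN remains meaningful in the reduced space.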
|
8 |
Traitement du signal dans le domaine compressé et quantification sur un bit : deux outils pour les contextes sous contraintes de communication / Compressed-domain signal processing and one-bit quantization: two tools for contexts under communication constraints. Zebadúa, Augusto 11 December 2017 (links)
Monitoring physical phenomena with a network of sensors (autonomous but interconnected) is highly constrained in energy consumption, mainly for data transmission. In this context, this thesis proposes signal processing tools to reduce communications without compromising accuracy in subsequent computations. The complexity of these methods is kept low, so that they consume only little additional energy. Our two building blocks are compression during signal acquisition (Compressive Sensing) and coarse quantization (1 bit).
We first study the Compressed Correlator, an estimator which allows for evaluating correlation functions, time delays, and spectral densities directly from compressed signals. Its performance is compared with the usual correlator. As we show, if the signal of interest has limited frequency content, the proposed estimator significantly outperforms the conventional correlator. Then, inspired by the coarse-quantization correlators of the 1950s and 60s, two new correlators are studied: the 1-bit Compressed and the Hybrid Compressed, which can also outperform their uncompressed counterparts. Finally, we show the applicability of these methods in the context of interest through the exploitation of real data.
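The core trick behind correlating in the compressed domain can be shown in a few lines: with a random sensing matrix scaled so that its columns are orthonormal in expectation, the inner product of two sketches is an unbiased estimate of the inner product of the original signals. This is an assumed illustration of the principle, not the thesis's Compressed Correlator itself (which handles correlation functions and time delays, and its quantized variants).

```python
import numpy as np

def compressed_inner_product(x, y, m=2000, seed=0):
    """Estimate <x, y> from m-dimensional random sketches: Phi is scaled so
    that E[Phi.T @ Phi] = I, making the sketch inner product unbiased."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((m, len(x))) / np.sqrt(m)
    return float((Phi @ np.asarray(x)) @ (Phi @ np.asarray(y)))
```

Both sensors transmit only their m-dimensional sketches; the correlation estimate is formed at the fusion center without ever reconstructing the full signals, which is where the communication saving comes from when m is smaller than the signal length times the number of lags needed.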
|
10 |
Random projections in a distributed environment for privacy-preserved deep learning / Slumpmässiga projektioner i en distribuerad miljö för privatiserad djupinlärning. Bagger Toräng, Malcolm January 2021 (links)
Over the last decade alone, the field of Deep Learning (DL) has proven useful for increasingly complex Machine Learning tasks and data, a notable milestone being generative models achieving facial synthesis indistinguishable from real faces. With the increased complexity of DL architectures and training data follows a steep increase in the time and hardware resources required for the training task. These resources are easily accessible via cloud-based platforms if the data owner is willing to share its training data. To allow for cloud-sharing of its training data, the Swedish Transport Administration (TRV) is interested in evaluating resource-effective, infrastructure-independent, privacy-preserving obfuscation methods to be used on data collected in real time on distributed Internet-of-Things (IoT) devices. A fundamental problem in this setting is to balance the trade-off between privacy and the DL utility of the obfuscated training data. We identify statistically measurable, relevant metrics of privacy achievable via obfuscation and compare two prominent alternatives from the literature: optimization-based methods (OBM) and random projections (RP). OBM achieve privacy via direct optimization towards a metric, preserving utility-crucial patterns in the data, and are typically also evaluated in terms of a DL-based adversary's estimation error on sensitive features. RP project the data via a random matrix to a lower dimension, preserving pairwise distances between samples while offering privacy in terms of the difficulty of data recovery. The goals of the project centered on evaluating RP against privacy-metric results previously attained for OBM, comparing adversarial feature estimation error under OBM and RP, and addressing the possibly infeasible learning task of using composite multi-device datasets generated with independent projection matrices. The last goal is relevant to TRV in that multiple devices are likely to contribute to the same composite dataset. 
Our results complement previous research in that they indicate that both privacy and utility guarantees in a distributed setting vary depending on data type and learning task. These results favor OBM, which theoretically should offer more robust guarantees. Our results and conclusions encourage further experimentation with RP in a distributed setting to better understand the influence of data type and learning task on the privacy-utility trade-off; target-distributed data sources are a promising starting point.
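The multi-device problem raised above, that sketches produced with independent projection matrices do not share a common projected space, has one simple engineering workaround: derive the matrix deterministically from a seed shared by all devices. This is an assumption offered for illustration, not the thesis's setup (which deliberately studies independent per-device matrices), and sharing the seed changes the privacy analysis, since an adversary who learns it also knows the projection.

```python
import numpy as np

def device_projection(p, d, shared_seed=42):
    """Derive a p x d Gaussian projection matrix deterministically from a
    seed, so every device holding the seed sketches into the same
    d-dimensional space and their outputs form a coherent composite dataset."""
    rng = np.random.default_rng(shared_seed)
    return rng.standard_normal((p, d)) / np.sqrt(d)
```

Two devices calling `device_projection` with the same seed obtain identical matrices, so a model trained on their pooled sketches sees one consistent feature space; with different seeds the sketches are mutually incompatible, which is the failure mode the thesis investigates.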
|