121

Intelligent Autonomous Data Categorization

Finegan, Edward Graham 01 January 2005 (has links)
The goal of this research was to determine whether the results of a simple comparison algorithm (SCA) could be improved by adding a Hyperspace Analogue to Language (HAL) model-of-memory layer to form NCA. The HAL layer provides contextual data that would otherwise be unavailable for consideration. It was found that NCA did improve the results when compared to SCA alone. However, NCA added complexity problems that limit its practicality. The complexity of this algorithm is O(n³), where n is the number of unique symbols in the data. While there is a relatively reasonable soft upper bound on the number of unique symbols used in a language, the complexity still limits the uses of the combined NCA algorithm. The conclusion from this research is that NCA can improve results. This research also suggested that the quality of results might increase as more data is processed by NCA.
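The HAL layer described above can be pictured as a distance-weighted symbol co-occurrence matrix; comparing every pair of its rows is where a cubic cost in the number of unique symbols comes from. Below is a minimal sketch of that idea, assuming a simple sliding window, linear distance weighting, and an intersection-style row similarity; none of these choices are taken from the thesis, and the NCA/SCA combination itself is not shown.

```python
# Minimal HAL-style co-occurrence sketch (window size, weighting, and the
# row-similarity measure are illustrative assumptions, not the thesis's NCA).
from collections import defaultdict

def hal_matrix(tokens, window=5):
    """Distance-weighted co-occurrence counts: nearer neighbours weigh more."""
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                counts[(target, tokens[i + d])] += window - d + 1
    return counts

def row_similarity(counts, a, b, vocab):
    """Compare two symbols via their co-occurrence rows: O(n) per pair, hence
    O(n^3) over all pairs of the n unique symbols (the bound cited above)."""
    return sum(min(counts[(a, v)], counts[(b, v)]) for v in vocab)

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
counts = hal_matrix(tokens)
print(row_similarity(counts, "cat", "dog", vocab))
```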
122

Etude des projections de données comme support interactif de l’analyse visuelle de la structure de données de grande dimension / Study of multidimensional scaling as an interactive visualization to help the visual analysis of high dimensional data

Heulot, Nicolas 04 July 2014 (has links)
The cost of acquiring and processing data has decreased radically, in both hardware and time, but the large amounts of complex data being stored still need to be analyzed and interpreted. Dimensionality is one aspect of this intrinsic complexity. Visualization is essential during the analysis process to help interpret and understand these data. Projection represents the data as a 2D scatterplot, regardless of the number of dimensions. However, this visualization technique suffers from distortions caused by the dimensionality reduction, which raises issues of interpretation and trust. Few studies have considered the impact of these artifacts, or how users unfamiliar with these techniques can visually analyze a projection. The approach advocated in this thesis relies on taking these artifacts into account interactively, so that data analysts and non-experts can carry out the visual-analysis tasks on projections reliably. The interactive visualization of proximities colors the projection according to the original, data-space proximities relative to a reference datum. This technique interactively reveals projection artifacts and helps grasp the details of the structure underlying the data. In this thesis we revisit the design of this technique and demonstrate its benefits through two controlled experiments studying the impact of artifacts on the visual analysis of projections. We also present a study of the design space of a lens-based technique aimed at locally freeing the visualization from projection-artifact issues.
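As a rough illustration of the proximity-based coloring described above, the sketch below projects high-dimensional data to 2D with scikit-learn's MDS and colors each point by its original-space distance to a reference datum, so false neighborhoods and tears introduced by the projection become visible. The data, the choice of MDS, and the coloring parameters are assumptions for illustration, not the thesis's actual implementation.

```python
# Sketch: colour a 2D projection by original-space proximity to a reference
# point, making projection artifacts visible. Data and parameters are assumed.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))              # stand-in for high-dimensional data

emb = MDS(n_components=2, random_state=0).fit_transform(X)

ref = 0                                      # index of the reference datum
d_orig = cdist(X[ref:ref + 1], X).ravel()    # proximities in the data space

plt.scatter(emb[:, 0], emb[:, 1], c=d_orig, cmap="viridis", s=15)
plt.colorbar(label="distance to reference in the original space")
plt.scatter(*emb[ref], color="red", marker="*", s=150)
plt.title("2D projection coloured by original-space proximity")
plt.show()
```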
123

Genetic association of high-dimensional traits

Meyer, Hannah Verena January 2018 (has links)
Over the past ten years, more than 4,000 genome-wide association studies (GWAS) have helped to shed light on the genetic architecture of complex traits and diseases. In recent years, phenotyping of the samples has often gone beyond single traits and it has become common to record multi- to high-dimensional phenotypes for individuals. Whilst these rich datasets offer the potential to analyse complex trait structures and pleiotropic effects at a genome-wide level, novel analytic challenges arise. This thesis summarises my research into genetic associations for high-dimensional phenotype data. First, I developed a novel and computationally efficient approach for multivariate analysis of high-dimensional phenotypes based on linear mixed models combined with bootstrapping (LiMMBo). Both in simulation studies and on real data, I demonstrate the statistical validity of LiMMBo and that it can scale to hundreds of phenotypes. I show the gain in power of multivariate analyses for high-dimensional phenotypes compared to univariate approaches, and illustrate that LiMMBo allows for detecting pleiotropy in a large number of phenotypic traits. Aside from their computational challenges in GWAS, the true dimensionality of very high-dimensional phenotypes is often unknown and lies hidden in high-dimensional space. Retaining maximum power for association studies of such phenotype data relies on using an appropriate phenotype representation. I systematically analysed twelve unsupervised dimensionality reduction methods based on their performance in finding a robust phenotype representation in simulated data of different structure and size. I propose a stability criterion for choosing low-dimensional phenotype representations and demonstrate that stable phenotypes can recover genetic associations. Finally, I analysed genetic variants for associations to high-dimensional cardiac phenotypes based on MRI data from 1,500 healthy individuals. I used an unsupervised approach to extract a low-dimensional representation of cardiac wall thickness and conducted a GWAS on this representation. In addition, I investigated genetic associations to a trabeculation phenotype generated from a supervised feature extraction approach on the cardiac MRI data. In summary, this thesis highlights and overcomes some of the challenges in performing genetic association studies on high-dimensional phenotypes. It describes new approaches for phenotype processing and genotype-to-phenotype mapping for high-dimensional datasets, as well as providing new insights into the genetic structure of cardiac morphology in humans.
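The bootstrapping idea behind LiMMBo can be illustrated roughly as follows: estimate a large trait-by-trait covariance by repeatedly fitting on small random subsets of phenotypes and averaging the overlapping blocks. In the sketch below, a plain sample covariance stands in for the variance-component fit actually used; the subset size, number of draws, and combination rule are assumptions, not LiMMBo's implementation.

```python
# Sketch of covariance estimation by bootstrapping small trait subsets; the
# per-subset sample covariance is a stand-in for LiMMBo's variance-component fit.
import numpy as np

def subset_bootstrap_cov(Y, subset_size=5, n_draws=200, seed=0):
    rng = np.random.default_rng(seed)
    _, p = Y.shape
    acc = np.zeros((p, p))
    hits = np.zeros((p, p))
    for _ in range(n_draws):
        idx = rng.choice(p, size=subset_size, replace=False)
        block = np.cov(Y[:, idx], rowvar=False)   # stand-in for an LMM fit
        acc[np.ix_(idx, idx)] += block
        hits[np.ix_(idx, idx)] += 1
    hits[hits == 0] = 1                           # entries never sampled stay zero
    return acc / hits

Y = np.random.default_rng(1).normal(size=(100, 40))  # 100 samples, 40 traits
C = subset_bootstrap_cov(Y)
print(C.shape)                                        # (40, 40)
```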
124

THE FAMILY OF CONDITIONAL PENALIZED METHODS WITH THEIR APPLICATION IN SUFFICIENT VARIABLE SELECTION

Xie, Jin 01 January 2018 (has links)
When scientists know in advance that certain features (variables) are important for modeling the data, those features should be kept in the model. How can we use this prior information to effectively find other important features? This dissertation provides a solution that exploits such prior information. We propose the Conditional Adaptive Lasso (CAL) estimator to exploit this knowledge. By choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. Asymptotic and oracle properties are proved. Simulations, especially for large-p, small-n problems, are performed with comparisons to other existing methods. We further extend the linear model setup to generalized linear models (GLMs): instead of least squares, we consider the likelihood with an L1 penalty, that is, penalized likelihood methods. We propose the Generalized Conditional Adaptive Lasso (GCAL) for generalized linear models, and then extend the method to any penalty term satisfying certain regularity conditions, namely the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples evaluate our methods in comparison with existing ones. GCAL is also evaluated on a real leukemia data set.
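A rough sketch of the conditioning idea: compute adaptive-lasso-style weights from an initial fit and give the features known a priori to be important a near-zero penalty weight, so they are retained. This is an illustrative stand-in using scikit-learn's Ridge and Lasso, not the CAL estimator or its theory; the weighting scheme and parameters are assumptions.

```python
# Sketch of a "conditional" adaptive lasso: adaptive weights from a ridge fit,
# with near-zero penalty on features known a priori to matter. Illustrative
# only; not the CAL estimator from the dissertation.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def conditional_adaptive_lasso(X, y, prior_idx, alpha=0.1):
    init = Ridge(alpha=1.0).fit(X, y).coef_
    w = 1.0 / (np.abs(init) + 1e-6)         # adaptive penalty weights
    w[prior_idx] = 1e-3                     # conditioning set: almost unpenalized
    X_scaled = X / w                        # lasso on X/w == weighted-penalty lasso
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
    return fit.coef_ / w                    # coefficients on the original scale

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))              # large p, small n
beta = np.zeros(200)
beta[[0, 1, 5]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=80)

coef = conditional_adaptive_lasso(X, y, prior_idx=[0, 1])  # 0 and 1 known important
print(np.flatnonzero(np.abs(coef) > 1e-3))
```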
125

False Discovery Rates, Higher Criticism and Related Methods in High-Dimensional Multiple Testing

Klaus, Bernd 16 January 2013 (has links) (PDF)
The technical advancements in genomics, functional magnetic resonance imaging, and other areas of scientific research seen in the last two decades have led to a burst of interest in multiple testing procedures. A driving factor for innovations in the field of multiple testing has been the problem of large-scale simultaneous testing. There, the goal is to uncover lower-dimensional signals from high-dimensional data. Mathematically speaking, this means that the dimension d is usually in the thousands while the sample size n is relatively small (at most about 100 in general, often due to cost constraints), a characteristic commonly abbreviated as d >> n. In my thesis I look at several multiple testing problems and corresponding procedures from a false discovery rate (FDR) perspective, a methodology originally introduced in a seminal paper by Benjamini and Hochberg (1995). FDR analysis starts by fitting a two-component mixture model to the observed test statistics. This mixture consists of a null model density and an alternative component density from which the interesting cases are assumed to be drawn. In the thesis I propose a new approach, called log-FDR, to the estimation of false discovery rates. Specifically, my new approach to truncated maximum likelihood estimation yields accurate null model estimates. This is complemented by constrained maximum likelihood estimation for the alternative density using log-concave density estimation. A recent competitor to the FDR is the method of "Higher Criticism". It has been strongly advocated in the context of variable selection in classification, which is deeply linked to multiple comparisons. Hence, I also looked at variable selection in class prediction, which can be viewed as a special signal identification problem. Both FDR methods and Higher Criticism can be highly useful for signal identification. This is discussed in the context of variable selection in linear discriminant analysis (LDA), a popular classification method. FDR methods are not only useful for multiple testing situations in the strict sense; they are also applicable to related problems. I looked at several kinds of applications of FDR in linear classification. I present and extend statistical techniques related to effect size estimation using false discovery rates and show how to use these for variable selection. The resulting fdr-effect method proposed for effect size estimation is shown to work as well as competing approaches while being conceptually simple and computationally inexpensive. Additionally, I applied the fdr-effect method to variable selection by minimizing the misclassification rate and showed that it works very well and leads to compact and interpretable feature sets.
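For concreteness, the sketch below implements the classical Benjamini-Hochberg step-up rule that FDR methodology builds on; the thesis's own contributions (two-component mixture fitting, truncated and log-concave maximum likelihood, log-FDR) are not shown.

```python
# Sketch of the Benjamini-Hochberg step-up procedure (FDR control at level q).
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of rejected hypotheses at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                # reject the k smallest p-values
    return reject

rng = np.random.default_rng(0)
# 900 null p-values (uniform) mixed with 100 signals concentrated near zero.
pvals = np.concatenate([rng.uniform(size=900), rng.beta(0.1, 10.0, size=100)])
print(benjamini_hochberg(pvals).sum(), "rejections at q = 0.05")
```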
126

Nonlinear Dimensionality Reduction with Side Information

Ghodsi Boushehri, Ali January 2006 (has links)
In this thesis, I look at three problems with important applications in data processing. Incorporating side information, provided by the user or derived from data, is a main theme of each of these problems. This thesis makes a number of contributions. The first is a technique for combining different embedding objectives, which is then exploited to incorporate side information expressed in terms of transformation invariants known to hold in the data. It also introduces two different ways of incorporating transformation invariants in order to make new similarity measures. Two algorithms are proposed which learn metrics based on different types of side information. These learned metrics can then be used in subsequent embedding methods. Finally, it introduces a manifold learning algorithm that is useful when applied to sequential decision problems. In this case we are given action labels in addition to data points. Actions in the manifold learned by this algorithm have meaningful representations in that they are represented as simple transformations.
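A minimal sketch of how side information in the form of known-similar pairs can yield a learned metric: estimate the within-pair covariance and invert it to obtain a Mahalanobis matrix, shrinking directions along which similar points vary. This is in the spirit of the metric-learning contributions mentioned above, with all specifics (regularization, pair source) being illustrative assumptions rather than the thesis's algorithms.

```python
# Sketch: learn a Mahalanobis metric from pairs known to be similar by
# whitening with the within-pair covariance. Illustrative assumptions only.
import numpy as np

def metric_from_similar_pairs(X, similar_pairs, reg=1e-3):
    diffs = np.array([X[i] - X[j] for i, j in similar_pairs])
    C = diffs.T @ diffs / len(similar_pairs)   # within-pair covariance
    C += reg * np.eye(X.shape[1])              # regularize before inverting
    return np.linalg.inv(C)                    # Mahalanobis matrix M

def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
pairs = [(0, 1), (2, 3), (4, 5)]               # user-provided side information
M = metric_from_similar_pairs(X, pairs)
print(mahalanobis(X[0], X[10], M))
```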
128

Bayesian and Information-Theoretic Learning of High Dimensional Data

Chen, Minhua January 2012 (has links)
The concept of sparseness is harnessed to learn a low dimensional representation of high dimensional data. This sparseness assumption is exploited in multiple ways. In the Bayesian Elastic Net, a small number of correlated features are identified for the response variable. In the sparse Factor Analysis for biomarker trajectories, the high dimensional gene expression data is reduced to a small number of latent factors, each with a prototypical dynamic trajectory. In the Bayesian Graphical LASSO, the inverse covariance matrix of the data distribution is assumed to be sparse, inducing a sparsely connected Gaussian graph. In the nonparametric Mixture of Factor Analyzers, the covariance matrices in the Gaussian Mixture Model are forced to be low-rank, which is closely related to the concept of block sparsity. Finally, in the information-theoretic projection design, a linear projection matrix is explicitly sought for information-preserving dimensionality reduction. All the methods mentioned above prove to be effective in learning both simulated and real high dimensional datasets.
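As a point of reference for the sparse inverse-covariance idea, the sketch below uses scikit-learn's (non-Bayesian) graphical lasso as a stand-in; the Bayesian Graphical LASSO and the other models in the dissertation are not reproduced here, and the data and penalty value are assumptions.

```python
# Sketch: sparse inverse covariance with scikit-learn's graphical lasso, as a
# (non-Bayesian) stand-in for the Bayesian Graphical LASSO mentioned above.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n = 500
x0, x1 = rng.normal(size=(2, n))
# Sparse dependence structure: feature 2 depends on 0, feature 3 on 1.
X = np.column_stack([x0, x1,
                     x0 + 0.3 * rng.normal(size=n),
                     x1 + 0.3 * rng.normal(size=n),
                     rng.normal(size=n)])

model = GraphicalLasso(alpha=0.1).fit(X)
precision = model.precision_                   # sparse inverse covariance
edges = np.abs(precision[np.triu_indices(5, k=1)]) > 1e-3
print(np.round(precision, 2))
print("nonzero off-diagonal edges:", int(edges.sum()))
```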
129

Hessian-based response surface approximations for uncertainty quantification in large-scale statistical inverse problems, with applications to groundwater flow

Flath, Hannah Pearl 11 September 2013 (has links)
Subsurface flow phenomena characterize many important societal issues in energy and the environment. A key feature of these problems is that subsurface properties are uncertain, due to the sparsity of direct observations of the subsurface. The Bayesian formulation of this inverse problem provides a systematic framework for inferring uncertainty in the properties given uncertainties in the data, the forward model, and prior knowledge of the properties. We address the problem: given noisy measurements of the head, the pdf describing the noise, prior information in the form of a pdf of the hydraulic conductivity, and a groundwater flow model relating the head to the hydraulic conductivity, find the posterior probability density function (pdf) of the parameters describing the hydraulic conductivity field. Unfortunately, conventional sampling of this pdf to compute statistical moments is intractable for problems governed by large-scale forward models and high-dimensional parameter spaces. We construct a Gaussian process surrogate of the posterior pdf based on Bayesian interpolation between a set of "training" points. We employ a greedy algorithm to find the training points by solving a sequence of optimization problems where each new training point is placed at the maximizer of the error in the approximation. Scalable Newton optimization methods solve this "optimal" training point problem. We tailor the Gaussian process surrogate to the curvature of the underlying posterior pdf according to the Hessian of the log posterior at a subset of training points, made computationally tractable by a low-rank approximation of the data misfit Hessian. A Gaussian mixture approximation of the posterior is extracted from the Gaussian process surrogate, and used as a proposal in a Markov chain Monte Carlo method for sampling both the surrogate as well as the true posterior. The Gaussian process surrogate is used as a first stage approximation in a two-stage delayed acceptance MCMC method. We provide evidence for the viability of the low-rank approximation of the Hessian through numerical experiments on a large scale atmospheric contaminant transport problem and analysis of an infinite dimensional model problem. We provide similar results for our groundwater problem. We then present results from the proposed MCMC algorithms.
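The surrogate-building loop can be caricatured as follows: fit a Gaussian process to evaluations of a (here, toy one-dimensional) log-posterior and greedily add training points where the surrogate is most uncertain. The thesis instead places points at the maximizer of the approximation error via scalable Newton methods and tailors the GP with Hessian information; both are replaced by simple stand-ins below, so this is a sketch of the idea only.

```python
# Sketch: a Gaussian-process surrogate of a toy 1-D log-posterior, greedily
# enriched where the surrogate is most uncertain (a stand-in for the thesis's
# Hessian-informed, Newton-optimized training-point placement).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def log_posterior(theta):                      # toy target, not a flow model
    return -0.5 * (theta - 1.0) ** 2 / 0.3 + np.sin(3.0 * theta)

candidates = np.linspace(-3, 3, 400).reshape(-1, 1)
train = np.array([[-2.0], [0.0], [2.0]])       # initial training points

for _ in range(10):                            # greedy enrichment loop
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    gp.fit(train, log_posterior(train).ravel())
    _, std = gp.predict(candidates, return_std=True)
    train = np.vstack([train, candidates[np.argmax(std)]])  # most uncertain point

print("surrogate built from", len(train), "training points")
```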
130

Clustering, Classification, and Factor Analysis in High Dimensional Data Analysis

Wang, Yanhong 17 December 2013 (has links)
Clustering, classification, and factor analysis are three popular data mining techniques. In this dissertation, we investigate these methods in high dimensional data analysis. Since high dimensional data have many more features than samples, and most of the features are non-informative, dimension reduction is necessary before clustering or classification can be performed. In the first part of this dissertation, we reinvestigate an existing clustering procedure, optimal discriminant clustering (ODC; Zhang and Dai, 2009), and propose to use cross-validation to select the tuning parameter. We then develop a variation of ODC, sparse optimal discriminant clustering (SODC), for high dimensional data by adding a group-lasso type of penalty to ODC. We also demonstrate that both ODC and SODC can be used as dimension reduction tools for data visualization in cluster analysis. In the second part, three existing sparse principal component analysis (SPCA) methods, Lasso-PCA (L-PCA), Alternative Lasso PCA (AL-PCA), and sparse principal component analysis by choice of norm (SPCABP), are applied to a real data set, the International HapMap Project genome-wide SNP data, for AIM selection; their classification accuracy is compared, and it is demonstrated that SPCABP outperforms the other two SPCA methods. Third, we propose a novel method called sparse factor analysis by projection (SFABP), based on SPCABP, and propose to use cross-validation for the selection of the tuning parameter and the number of factors. Our simulation studies show that SFABP performs better than unpenalized factor analysis when applied to classification problems.
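To illustrate how sparse loadings translate into feature (e.g., SNP) selection, the sketch below fits scikit-learn's generic SparsePCA to synthetic data and keeps the features with nonzero loadings. It is not L-PCA, AL-PCA, SPCABP, or SFABP; the data, penalty, and selection threshold are assumptions.

```python
# Sketch: sparse PCA for feature selection on synthetic data, using
# scikit-learn's SparsePCA as a generic stand-in for the SPCA variants above.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
n, p = 200, 100
latent = rng.normal(size=(n, 2))
loadings = np.zeros((2, p))
loadings[0, 0:5] = 1.0                        # only the first ten features
loadings[1, 5:10] = 1.0                       # carry the latent signal
X = latent @ loadings + 0.3 * rng.normal(size=(n, p))

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit(X)
selected = np.flatnonzero(np.abs(spca.components_).sum(axis=0) > 1e-8)
print("features with nonzero sparse loadings:", selected)
```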
