1. Subspace Clustering with the Multivariate-t Distribution (Pesevski, Angelina, January 2017)
Clustering procedures suitable for the analysis of very high-dimensional data are needed for many modern data sets. One model-based clustering approach called high-dimensional data clustering (HDDC) uses a family of Gaussian mixture models to model the sub-populations of the observed data, i.e., to perform cluster analysis. The HDDC approach is based on the idea that high-dimensional data usually lie in lower-dimensional subspaces; as such, the dimension of each subspace, called the intrinsic dimension, can be estimated for each sub-population of the observed data. As a result, each of these Gaussian mixture models can be fitted using only a fraction of the total number of model parameters. This family of models has gained attention due to its superior classification performance compared to other families of mixture models; however, it still suffers from the usual limitations of Gaussian mixture model-based approaches. Herein, a robust analogue of the HDDC approach is proposed. This approach, which extends the HDDC procedure to include the multivariate-t distribution, encompasses 28 models that rectify one of the major shortcomings of the HDDC procedure. Our tHDDC procedure is fitted to both simulated and real data sets and is compared to the HDDC procedure using an image reconstruction problem that arose from satellite imagery of Mars' surface. / Thesis / Master of Science (MSc)
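As background, the standard multivariate-t mixture density (a generic formulation, not necessarily the exact tHDDC parameterisation) replaces the Gaussian components with heavier-tailed t components:

```latex
f(\mathbf{x}) = \sum_{g=1}^{G} \pi_g \, f_t(\mathbf{x} \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g, \nu_g),
\qquad
f_t(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu)
  = \frac{\Gamma\!\big(\tfrac{\nu+p}{2}\big)}
         {\Gamma\!\big(\tfrac{\nu}{2}\big)\,(\nu\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}}
    \left[1 + \frac{(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{\nu}\right]^{-\frac{\nu+p}{2}}.
```

In HDDC-style subspace models, each \Sigma_g is further constrained so that only its d_g leading eigenvalues (the intrinsic dimension of cluster g) are estimated freely, with the remaining p - d_g eigenvalues tied to a single noise parameter; the degrees of freedom \nu_g are what give the t analogue its robustness to outliers.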
2. Feed forward neural network entities (Hadjiprocopis, Andreas, January 2000)
No description available.
3. Model selection and estimation in high dimensional settings (Ngueyep Tzoumpe, Rodrigue, 08 June 2015)
Several statistical problems can be described as estimation problems, where the goal is to learn a set of parameters from some data by maximizing a criterion. These types of problems are typically encountered in a supervised learning setting, where we want to relate an output (or many outputs) to multiple inputs. The relationship between these outputs and these inputs can be complex, and this complexity can be attributed to the high dimensionality of the space containing the inputs and the outputs; the existence of structural prior knowledge within the inputs or the outputs that, if ignored, may lead to inefficient estimates of the parameters; and the presence of a non-trivial noise structure in the data. In this thesis we propose new statistical methods to achieve model selection and estimation when there are more predictors than observations. We also design a new set of algorithms to efficiently solve the proposed statistical models. We apply the implemented methods to genetic data sets of cancer patients and to some economics data.
4. Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: simulation and application (Vasquez, Monica M.; Hu, Chengcheng; Roe, Denise J.; Chen, Zhao; Halonen, Marilyn; Guerra, Stefano; 14 November 2016)
Background: The study of circulating biomarkers and their association with disease outcomes has become progressively complex due to advances in the measurement of these biomarkers through multiplex technologies. The Least Absolute Shrinkage and Selection Operator (LASSO) is a data analysis method that may be utilized for biomarker selection in these high dimensional data. However, it is unclear which LASSO-type method is preferable when considering data scenarios that may be present in serum biomarker research, such as high correlation between biomarkers, weak associations with the outcome, and sparse number of true signals. The goal of this study was to compare the LASSO to five LASSO-type methods given these scenarios.

Methods: A simulation study was performed to compare the LASSO, Adaptive LASSO, Elastic Net, Iterated LASSO, Bootstrap-Enhanced LASSO, and Weighted Fusion for the binary logistic regression model. The simulation study was designed to reflect the data structure of the population-based Tucson Epidemiological Study of Airway Obstructive Disease (TESAOD), specifically the sample size (N = 1000 for total population, 500 for sub-analyses), correlation of biomarkers (0.20, 0.50, 0.80), prevalence of overweight (40%) and obese (12%) outcomes, and the association of outcomes with standardized serum biomarker concentrations (log-odds ratio = 0.05-1.75). Each LASSO-type method was then applied to the TESAOD data of 306 overweight, 66 obese, and 463 normal-weight subjects with a panel of 86 serum biomarkers.

Results: Based on the simulation study, no method had an overall superior performance. The Weighted Fusion correctly identified more true signals, but incorrectly included more noise variables. The LASSO and Elastic Net correctly identified many true signals and excluded more noise variables. In the application study, biomarkers of overweight and obesity selected by all methods were Adiponectin, Apolipoprotein H, Calcitonin, CD14, Complement 3, C-reactive protein, Ferritin, Growth Hormone, Immunoglobulin M, Interleukin-18, Leptin, Monocyte Chemotactic Protein-1, Myoglobin, Sex Hormone Binding Globulin, Surfactant Protein D, and YKL-40.

Conclusions: For the data scenarios examined, choice of optimal LASSO-type method was data structure dependent and should be guided by the research objective. The LASSO-type methods identified biomarkers that have known associations with obesity and obesity related conditions.
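A minimal sketch of the kind of comparison described above, written with scikit-learn rather than the authors' code; the sample size, correlation, effect sizes, and penalty strengths are illustrative placeholders patterned on the values quoted in the abstract:

```python
# Simulate correlated "biomarkers" with a sparse true signal and compare which
# features the LASSO and Elastic Net penalties retain in a logistic model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p, rho = 1000, 86, 0.5                       # subjects, biomarkers, pairwise correlation
cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

beta = np.zeros(p)
beta[:5] = [1.75, 1.0, 0.5, 0.25, 0.05]         # a few true signals, the rest are noise
logit = X @ beta - 1.5                          # intercept chosen for a rare-ish outcome
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

Xs = StandardScaler().fit_transform(X)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xs, y)
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=0.1, max_iter=5000).fit(Xs, y)

for name, model in [("LASSO", lasso), ("Elastic Net", enet)]:
    selected = np.flatnonzero(model.coef_[0])
    print(f"{name}: selected {selected.size} features -> {selected[:10]}")
```

With highly correlated predictors the Elastic Net tends to retain groups of correlated biomarkers together while the LASSO picks one representative per group, which is one reason the preferred method depends on the data structure.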
5. Methods for Predicting an Ordinal Response with High-Throughput Genomic Data (Ferber, Kyle L, 01 January 2016)
Multigenic diagnostic and prognostic tools can be derived for ordinal clinical outcomes using data from high-throughput genomic experiments. A challenge in this setting is that the number of predictors is much greater than the sample size, so traditional ordinal response modeling techniques must be exchanged for more specialized approaches. Existing methods perform well on some datasets, but there is room for improvement in terms of variable selection and predictive accuracy. Therefore, we extended an impressive binary response modeling technique, Feature Augmentation via Nonparametrics and Selection, to the ordinal response setting. Through simulation studies and analyses of high-throughput genomic datasets, we showed that our Ordinal FANS method is sensitive and specific when discriminating between important and unimportant features from the high-dimensional feature space and is highly competitive in terms of predictive accuracy.
Discrete survival time is another example of an ordinal response. For many illnesses and chronic conditions, it is impossible to record the precise date and time of disease onset or relapse. Further, the HIPAA Privacy Rule prevents recording of protected health information, which includes all elements of dates (except year), so in the absence of a “limited dataset,” date of diagnosis or date of death are not available for calculating overall survival. Thus, we developed a method that is suitable for modeling high-dimensional discrete survival time data and assessed its performance by conducting a simulation study and by predicting the discrete survival times of acute myeloid leukemia patients using a high-dimensional dataset.
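For context, ordinal outcomes of the kind described here are commonly modelled with the cumulative-logit (proportional odds) form below; this is standard background rather than the exact Ordinal FANS formulation:

```latex
\log\frac{P(Y \le k \mid \mathbf{x})}{P(Y > k \mid \mathbf{x})}
  = \theta_k - \mathbf{x}^{\top}\boldsymbol{\beta},
\qquad k = 1, \dots, K-1,
\qquad \theta_1 \le \theta_2 \le \dots \le \theta_{K-1}.
```

Discrete survival time can be cast in the same framework by treating each observed time interval as an ordered category.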
6. Bootstrapping in a high dimensional but very low sample size problem (Song, Juhee, 16 August 2006)
High Dimension, Low Sample Size (HDLSS) problems have received much attention recently in many areas of science. Analysis of microarray experiments is one such area. Numerous studies are on-going to investigate the behavior of genes by measuring the abundance of mRNA (messenger RiboNucleic Acid), gene expression. HDLSS data investigated in this dissertation consist of a large number of data sets, each of which has only a few observations.

We assume a statistical model in which measurements from the same subject have the same expected value and variance. All subjects have the same distribution up to location and scale. Information from all subjects is shared in estimating this common distribution.

Our interest is in testing the hypothesis that the mean of measurements from a given subject is 0. Commonly used tests of this hypothesis, the t-test, sign test and traditional bootstrapping, do not necessarily provide reliable results since there are only a few observations for each data set.

We motivate a mixture model having C clusters and 3C parameters to overcome the small sample size problem. Standardized data are pooled after assigning each data set to one of the mixture components. To get reasonable initial parameter estimates when density estimation methods are applied, we apply clustering methods including agglomerative and K-means.

The Bayes Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean of within Cluster Variance estimates), are used to choose an optimal number of clusters. Density estimation methods, including a maximum likelihood unimodal density estimator and kernel density estimation, are used to estimate the unknown density.

Once the density is estimated, a bootstrapping algorithm that selects samples from the estimated density is used to approximate the distribution of test statistics. The t-statistic and an empirical likelihood ratio statistic are used, since their distributions are completely determined by the distribution common to all subjects. A method to control the false discovery rate is used to perform simultaneous tests on all small data sets.

Simulated data sets and a set of cDNA (complementary DeoxyriboNucleic Acid) microarray experiment data are analyzed by the proposed methods.
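A compact sketch of the central bootstrap idea, assuming Gaussian placeholder data and omitting the mixture clustering, BIC/WMCV selection, unimodal density estimation, and FDR steps described above:

```python
# Pool standardized observations across many small data sets, estimate their
# common density with a kernel density estimator, and bootstrap from that
# estimate to approximate the null distribution of the t-statistic for one
# small data set.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
n_subjects, n_obs = 200, 4                        # many data sets, few observations each
data = rng.standard_normal((n_subjects, n_obs))   # placeholder for the real measurements

# Standardize each subject (location and scale removed), then pool everything.
pooled = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, ddof=1, keepdims=True)
kde = gaussian_kde(pooled.ravel())                # shared density estimated from all subjects

def t_stat(x):
    return np.sqrt(len(x)) * x.mean() / x.std(ddof=1)

# Bootstrap the null distribution of the t-statistic for samples of size n_obs.
boot = np.array([t_stat(kde.resample(n_obs)[0]) for _ in range(5000)])

observed = t_stat(data[0])                        # test H0: mean of subject 0 is zero
p_value = np.mean(np.abs(boot) >= abs(observed))
print(f"bootstrap p-value for subject 0: {p_value:.3f}")
```

The gain over the plain t-test is that the reference distribution borrows shape information from all subjects instead of relying on only four observations.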
7. Learning with high-dimensional noisy data (Chen, Yudong, 25 September 2013)
Learning an unknown parameter from data is a problem of fundamental importance across many fields of engineering and science. Rapid development in information technology allows a large amount of data to be collected. The data is often highly non-uniform and noisy, sometimes subject to gross errors and even direct manipulations. Data explosion also highlights the importance of the so-called high-dimensional regime, where the number of variables might exceed the number of samples. Extracting useful information from the data requires high-dimensional learning algorithms that are robust to noise. However, standard algorithms for the high-dimensional regime are often brittle to noise, and the suite of techniques developed in Robust Statistics is often inapplicable to large and high-dimensional data. In this thesis, we study the problem of robust statistical learning in high dimensions from noisy data. Our goal is to better understand the behaviors and effect of noise in high-dimensional problems, and to develop algorithms that are statistically efficient, computationally tractable, and robust to various types of noise. We forge into this territory by considering three important sub-problems.

We first look at the problem of recovering a sparse vector from a few linear measurements, where both the response vector and the covariate matrix are subject to noise. Both stochastic and arbitrary noise are considered. We show that standard approaches are inadequate in these settings. We then develop robust efficient algorithms that provably recover the support and values of the sparse vector under different noise models and require minimum knowledge of the nature of the noise.

Next, we study the problem of recovering a low-rank matrix from partially observed entries, with some of the observations arbitrarily corrupted. We consider the entry-wise corruption setting where no row or column has too many entries corrupted, and provide performance guarantees for a natural convex relaxation approach. Our unified guarantees cover both randomly and deterministically located corruptions, and improve upon existing results. We then turn to the column-wise corruption case where all observations from some columns are arbitrarily contaminated. We propose a new convex optimization approach and show that it simultaneously identifies the corrupted columns and recovers unobserved entries in the uncorrupted columns.

Lastly, we consider the graph clustering problem, i.e., arranging the nodes of a graph into clusters such that there are relatively dense connections inside the clusters and sparse connections across different clusters. We propose a semi-random Generalized Stochastic Blockmodel for clustered graphs and develop a new algorithm based on convexified maximum likelihood estimators. We provide theoretical performance guarantees which recover, and sometimes improve on, all existing results for the classical stochastic blockmodel, the planted k-clique model and the planted coloring models. We extend our algorithm to the case where the clusters are allowed to overlap with each other, and provide theoretical characterization of the performance of the algorithm. A further extension is studied when the graph may change over time. We develop new approaches to incorporate the time dynamics and show that they can identify stable overlapping communities in real-world time-evolving graphs.
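To make the second sub-problem concrete, a generic convex relaxation for separating a partially observed matrix into a low-rank part plus sparse corruptions can be written down with cvxpy. This is a standard formulation in the same spirit as the abstract, not the thesis's exact estimator or guarantees, and the regularization weight is an unoptimised placeholder:

```python
# Split the observed entries of a matrix into a low-rank part (nuclear norm)
# and a sparse corruption part (l1 norm), constrained to agree with the data
# on the observed entries only.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
m, n, r = 30, 30, 2
M_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))    # low-rank ground truth
mask = (rng.random((m, n)) < 0.6).astype(float)                        # observed entries
corrupt = (rng.random((m, n)) < 0.05) * rng.standard_normal((m, n)) * 5
M_obs = (M_true + corrupt) * mask

L = cp.Variable((m, n))
S = cp.Variable((m, n))
lam = 1.0 / np.sqrt(max(m, n))
objective = cp.Minimize(cp.normNuc(L) + lam * cp.norm1(S))
constraints = [cp.multiply(mask, L + S - M_obs) == 0]                  # match observed entries
cp.Problem(objective, constraints).solve(solver=cp.SCS)

err = np.linalg.norm(L.value - M_true, "fro") / np.linalg.norm(M_true, "fro")
print(f"relative recovery error of the low-rank part: {err:.3f}")
```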
8. High-dimensional problems in stochastic modelling of biological processes (Liao, Shuohao, January 2017)
Stochastic modelling of gene regulatory networks provides an indispensable tool for understanding how random events at the molecular level influence cellular functions. A common challenge of stochastic models is to calibrate a large number of model parameters against the experimental data. Another difficulty is to study how the behaviour of a stochastic model depends on its parameters, i.e. whether a change in model parameters can lead to a significant qualitative change in model behaviour (bifurcation). This thesis addresses these computational challenges using a tensor-structured computational framework. After a background introduction in Chapter 1, Chapter 2 derives the order of convergence in volume size between the stationary distributions of the exact chemical master equation (CME) and its continuous Fokker-Planck approximation (CFPE). It also proposes multi-scale approaches to address the failure of the CFPE in capturing the noise-induced multi-stability of the CME distribution. Chapter 3 studies the numerical solution of the high-dimensional CFPE using the tensor train and the quantized-TT data formats. In Chapter 4, the tensor solutions are applied to study the parameter estimation, robustness, sensitivity and bifurcation structures of stochastic reaction networks. A Matlab implementation of the proposed methods/algorithms is available at http://www.stobifan.org.
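For context, the two equations compared in Chapter 2 take the following standard forms for a network of M reactions with propensities a_k and stoichiometric vectors \nu_k (generic notation, not necessarily the thesis's):

```latex
\frac{\partial p(\mathbf{x},t)}{\partial t}
  = \sum_{k=1}^{M}\Big[a_k(\mathbf{x}-\boldsymbol{\nu}_k)\,p(\mathbf{x}-\boldsymbol{\nu}_k,t)
                        - a_k(\mathbf{x})\,p(\mathbf{x},t)\Big]
  \quad \text{(CME)}

\frac{\partial p(\mathbf{x},t)}{\partial t}
  = -\sum_{i}\frac{\partial}{\partial x_i}\Big[\Big(\sum_{k}\nu_{ki}\,a_k(\mathbf{x})\Big)p\Big]
    + \frac{1}{2}\sum_{i,j}\frac{\partial^2}{\partial x_i\,\partial x_j}
      \Big[\Big(\sum_{k}\nu_{ki}\,\nu_{kj}\,a_k(\mathbf{x})\Big)p\Big]
  \quad \text{(CFPE)}
```

The CME lives on the discrete state space of molecule counts, while the CFPE is its continuous diffusion approximation; the dimension of either equation grows with the number of chemical species, which is what motivates the tensor-train representation.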
9. A Bidirectional Pipeline for Semantic Interaction in Visual Analytics (Binford, Adam Quarles, 21 September 2016)
Semantic interaction in visual data analytics allows users to indirectly adjust model parameters by directly manipulating the output of the models. This is accomplished using an underlying bidirectional pipeline that first uses statistical models to visualize the raw data. When a user interacts with the visualization, the interaction is interpreted into updates in the model parameters automatically, giving the users immediate feedback on each interaction. These interpreted interactions eliminate the need for a deep understanding of the underlying statistical models. However, the development of such tools is necessarily complex due to their interactive nature. Furthermore, each tool defines its own unique pipeline to suit its needs, which leads to difficulty experimenting with different types of data, models, interaction techniques, and visual encodings. To address this issue, we present a flexible multi-model bidirectional pipeline for prototyping visual analytics tools that rely on semantic interaction. The pipeline has plug-and-play functionality, enabling quick alterations to the type of data being visualized, how models transform the data, and interaction methods. In so doing, the pipeline enforces a separation between the data pipeline and the visualization, preventing the two from becoming codependent. To show the flexibility of the pipeline, we demonstrate a new visual analytics tool and several distinct variations, each of which was quickly and easily implemented with slight changes to the pipeline or client. / Master of Science
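A hypothetical sketch of what a plug-and-play bidirectional interface might look like; the class and method names below are illustrative inventions, not the API of the tool described in the abstract:

```python
# Forward pass: project weighted features to a 2-D layout for display.
# Inverse pass: interpret a direct manipulation as an update to the weights.
import numpy as np

class WeightedProjectionModel:
    def __init__(self, n_features):
        self.weights = np.ones(n_features) / n_features

    def forward(self, X):
        Xw = (X - X.mean(axis=0)) * self.weights      # emphasise up-weighted features
        # A plain SVD projection stands in for whatever layout model is plugged in.
        _, _, vt = np.linalg.svd(Xw, full_matrices=False)
        return Xw @ vt[:2].T                          # (n_samples, 2) layout

    def inverse(self, X, moved, toward):
        # Interpret "user dragged point `moved` toward point `toward`" as
        # up-weighting the features on which those two observations agree.
        agreement = 1.0 / (1.0 + np.abs(X[moved] - X[toward]))
        self.weights *= agreement
        self.weights /= self.weights.sum()

# One round trip: render a layout, interpret an interaction, re-render.
X = np.random.default_rng(3).random((50, 8))
model = WeightedProjectionModel(n_features=8)
layout_before = model.forward(X)
model.inverse(X, moved=0, toward=1)
layout_after = model.forward(X)       # reflects the updated model parameters
```

Keeping the forward and inverse passes behind a common interface is what lets the data, model, and interaction method be swapped without touching the visualization client.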
10. Biclustering and Visualization of High Dimensional Data using VIsual Statistical Data Analyzer (Blake, Patrick Michael, 31 January 2019)
Many data sets have too many features for conventional pattern recognition techniques to work properly. This thesis investigates techniques that alleviate these difficulties. One such technique, biclustering, clusters data in both dimensions and is inherently resistant to the challenges posed by having too many features. However, the algorithms that implement biclustering have limitations in that the user must know at least the structure of the data and how many biclusters to expect. This is where the VIsual Statistical Data Analyzer, or VISDA, can help. It is a visualization tool that successively and progressively explores the structure of the data, identifying clusters along the way. This thesis proposes coupling VISDA with biclustering to overcome some of the challenges of data sets with too many features. Further, to increase the performance, usability, and maintainability as well as reduce costs, VISDA was translated from Matlab to a Python version called VISDApy. Both VISDApy and the overall process were demonstrated with real and synthetic data sets. The results of this work have the potential to improve analysts' understanding of the relationships within complex data sets and their ability to make informed decisions from such data. / Master of Science
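A minimal biclustering example using scikit-learn's spectral co-clustering, included only to illustrate clustering both dimensions at once; it is not the specific algorithm paired with VISDA/VISDApy in the thesis:

```python
# Plant biclusters in a synthetic samples-by-features matrix, then recover
# joint row and column groupings with spectral co-clustering.
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

data, rows, cols = make_biclusters(shape=(200, 60), n_clusters=4,
                                   noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=4, random_state=0).fit(data)

# Every observation and every feature receives a bicluster label, so the data
# are grouped along both dimensions simultaneously.
print("rows per bicluster:   ", np.bincount(model.row_labels_))
print("columns per bicluster:", np.bincount(model.column_labels_))
```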