1 |
Feed forward neural network entities
Hadjiprocopis, Andreas, January 2000
No description available.
|
2 |
Bootstrapping in a high dimensional but very low sample size problem
Song, Juhee, 16 August 2006
High Dimension, Low Sample Size (HDLSS) problems have received much attention
recently in many areas of science. Analysis of microarray experiments is one
such area. Numerous studies are ongoing to investigate the behavior of genes by
measuring gene expression, that is, the abundance of mRNA (messenger RiboNucleic Acid).
HDLSS data investigated in this dissertation consist of a large number of data sets
each of which has only a few observations.
We assume a statistical model in which measurements from the same subject
have the same expected value and variance. All subjects have the same distribution
up to location and scale. Information from all subjects is shared in estimating this
common distribution.
Our interest is in testing the hypothesis that the mean of measurements from a
given subject is 0. Commonly used tests of this hypothesis, the t-test, sign test and
traditional bootstrapping, do not necessarily provide reliable results since there are
only a few observations for each data set.
We motivate a mixture model having C clusters and 3C parameters to overcome
the small sample size problem. Standardized data are pooled after assigning each
data set to one of the mixture components. To get reasonable initial parameter estimates
when density estimation methods are applied, we apply clustering methods
including agglomerative and K-means.
Bayes Information Criterion (BIC) and a new criterion, WMCV (Weighted Mean
of within Cluster Variance estimates), are used to choose an optimal number of clusters.
Density estimation methods including a maximum likelihood unimodal density
estimator and kernel density estimation are used to estimate the unknown density.
Once the density is estimated, a bootstrapping algorithm that selects samples from
the estimated density is used to approximate the distribution of test statistics. The
t-statistic and an empirical likelihood ratio statistic are used, since their distributions
are completely determined by the distribution common to all subjects. A method to
control the false discovery rate is used to perform simultaneous tests on all small data
sets.
Simulated data sets and a set of cDNA (complementary DeoxyriboNucleic Acid)
microarray experiment data are analyzed by the proposed methods.
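The pooling idea in this abstract can be illustrated with a minimal Python sketch. It is a simplification, not the dissertation's procedure: it resamples directly from the pooled standardized observations instead of from a density estimated with the mixture model, and the function names `t_statistic` and `pooled_bootstrap_pvalues` are hypothetical.

```python
import math
import random
import statistics


def t_statistic(sample):
    """One-sample t-statistic for H0: mean = 0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)
    if sd == 0.0:
        # Degenerate resample (all values equal): count it as extreme.
        return math.inf
    return mean / (sd / math.sqrt(n))


def pooled_bootstrap_pvalues(data_sets, n_boot=2000, seed=0):
    """Two-sided bootstrap p-values for H0: mean = 0, one per data set.

    Assumption: every data set has at least two observations that are
    not all identical, so its standard deviation is positive.
    """
    rng = random.Random(seed)

    # Standardize each small data set and pool the results, so that all
    # subjects contribute to one estimate of the common distribution.
    pool = []
    for xs in data_sets:
        m, s = statistics.fmean(xs), statistics.stdev(xs)
        pool.extend((x - m) / s for x in xs)

    # The simulated null distribution of |t| depends only on the sample
    # size, so it is generated once per distinct size and reused.
    null_by_size = {}
    pvalues = []
    for xs in data_sets:
        n = len(xs)
        if n not in null_by_size:
            null_by_size[n] = [
                abs(t_statistic(rng.choices(pool, k=n))) for _ in range(n_boot)
            ]
        observed = abs(t_statistic(xs))
        exceed = sum(1 for t in null_by_size[n] if t >= observed)
        pvalues.append((exceed + 1) / (n_boot + 1))
    return pvalues
```

The resulting p-values, one per data set, could then be passed to a false discovery rate procedure for the simultaneous tests.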
|
3 |
A Bidirectional Pipeline for Semantic Interaction in Visual Analytics
Binford, Adam Quarles, 21 September 2016
Semantic interaction in visual data analytics allows users to indirectly adjust model parameters by directly manipulating the output of the models. This is accomplished using an underlying bidirectional pipeline that first uses statistical models to visualize the raw data. When a user interacts with the visualization, the interaction is interpreted into updates in the model parameters automatically, giving the users immediate feedback on each interaction. These interpreted interactions eliminate the need for a deep understanding of the underlying statistical models. However, the development of such tools is necessarily complex due to their interactive nature. Furthermore, each tool defines its own unique pipeline to suit its needs, which makes it difficult to experiment with different types of data, models, interaction techniques, and visual encodings. To address this issue, we present a flexible multi-model bidirectional pipeline for prototyping visual analytics tools that rely on semantic interaction. The pipeline has plug-and-play functionality, enabling quick alterations to the type of data being visualized, how models transform the data, and interaction methods. In so doing, the pipeline enforces a separation between the data pipeline and the visualization, preventing the two from becoming codependent. To show the flexibility of the pipeline, we demonstrate a new visual analytics tool and several distinct variations, each of which was quickly and easily implemented with slight changes to the pipeline or client. / Master of Science
|
4 |
Biclustering and Visualization of High Dimensional Data using VIsual Statistical Data Analyzer
Blake, Patrick Michael, 31 January 2019
Many data sets have too many features for conventional pattern recognition techniques to work properly. This thesis investigates techniques that alleviate these difficulties. One such technique, biclustering, clusters data in both dimensions and is inherently resistant to the challenges posed by having too many features. However, the algorithms that implement biclustering have limitations in that the user must know at least the structure of the data and how many biclusters to expect. This is where the VIsual Statistical Data Analyzer, or VISDA, can help. It is a visualization tool that successively and progressively explores the structure of the data, identifying clusters along the way. This thesis proposes coupling VISDA with biclustering to overcome some of the challenges of data sets with too many features. Further, to increase the performance, usability, and maintainability as well as reduce costs, VISDA was translated from Matlab to a Python version called VISDApy. Both VISDApy and the overall process were demonstrated with real and synthetic data sets. The results of this work have the potential to improve analysts' understanding of the relationships within complex data sets and their ability to make informed decisions from such data. / Master of Science
|
5 |
Independence Screening in High-Dimensional Data
Wauters, John, January 2016
High-dimensional data, in which the number of dimensions exceeds the number of observations, are increasingly common in statistics. The term "ultra-high dimensional" is defined by Fan and Lv (2008) as describing the situation where log(p) is of order O(n^a) for some a in the interval (0, ½). Such data arise in many contexts, such as gene expression data, proteomic data, imaging data, tomography, and finance. High-dimensional data present a challenge to traditional statistical techniques. In traditional statistical settings, models have a small number of features, chosen based on an assumption of what features may be relevant to the response of interest. In the high-dimensional setting, many of the techniques of traditional feature selection become computationally intractable or do not yield unique solutions. Current research in modeling high-dimensional data is heavily focused on methods that screen the features before modeling; that is, methods that eliminate noise features as a pre-modeling dimension reduction. Typically, noise features are identified by exploiting properties of independent random variables, thus the term "independence screening." There are methods for modeling high-dimensional data without feature screening first (e.g. LASSO or SCAD), but simulation studies show screen-first methods perform better as dimensionality increases. Many proposals for independence screening exist, but in my literature review certain themes recurred:
A) The assumption of sparsity: that all the useful information in the data is actually contained in a small fraction of the features (the "active" features), the rest being essentially random noise (the "inactive" features).
B) In many newer methods, initial dimension reduction by feature screening reduces the problem from the high-dimensional case to a classical case; feature selection then proceeds by a classical method.
C) In the initial screening, removal of features independent of the response is highly desirable, as such features literally give no information about the response.
D) For the initial screening, some statistic is applied pairwise to each feature in combination with the response; the specific statistic is chosen so that, when the two random variables are independent, a specific known value is expected for the statistic.
E) Features are ranked by the absolute difference between the calculated statistic and the expected value of that statistic in the independent case, i.e. features that are most different from the independent case are most preferred.
F) Proof is typically offered that, asymptotically, the method retains the true active features with probability approaching one.
G) Where possible, an iterative version of the process is explored, as iterative versions do much better at identifying features that are active in their interactions, but not active individually.
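Themes D) and E) can be sketched with one concrete choice of statistic. The Python example below assumes the Pearson correlation, whose expected value is 0 when a feature and the response are independent, so ranking by the absolute difference from that value reduces to ranking by |r|. This is only one of many possible screening statistics, and the names `pearson`, `screen_features` and `make_sparse_example` are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_features(columns, response, keep):
    """Rank features by |correlation with the response|; return top indices.

    `columns[j]` holds the n observed values of feature j. Under
    independence the expected correlation is 0, so |r - 0| = |r| is the
    distance from the independent case.
    """
    scored = [(abs(pearson(col, response)), j) for j, col in enumerate(columns)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:keep]]


def make_sparse_example(n=100, p=500, active=(5, 17), seed=0):
    """Illustrative sparse data: only the `active` features drive the response."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    signs = [3.0, -3.0]
    response = [
        sum(s * columns[j][i] for s, j in zip(signs, active)) + rng.gauss(0.0, 0.5)
        for i in range(n)
    ]
    return columns, response
```

After screening, the retained features (here far fewer than n) could be passed to a classical selection method, as in theme B).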
|
6 |
Randomization test and correlation effects in high dimensional data
Wang, Xiaofei, January 1900
Master of Science / Department of Statistics / Gary Gadbury / High-dimensional data (HDD) have been encountered in many fields and are characterized by a “large p, small n” paradigm that arises in genomic, lipidomic, and proteomic studies. This report used a simulation study that employed basic block diagonal covariance matrices to generate correlated HDD. Quantities of interest in such data include the number of ‘significant’ discoveries. This number can be highly variable when data are correlated. This project compared randomization tests with the usual t-tests for testing significant effects across two treatment conditions. Of interest was whether the variance of the number of discoveries is better controlled in a randomization setting than with a t-test. The results showed that the randomization tests produced results similar to those of t-tests.
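For readers unfamiliar with randomization tests, the following Python sketch shows a two-sample version for a single feature. It assumes the absolute difference in group means as the test statistic and random relabelling instead of full enumeration; the report does not state which statistic or number of permutations was used, and the name `randomization_test` is hypothetical.

```python
import random
import statistics


def randomization_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided randomization p-value for a difference in group means.

    Treatment labels are shuffled `n_perm` times; the p-value is the
    share of relabellings whose absolute mean difference is at least as
    large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

In a high-dimensional setting the function would be applied once per feature, and the number of p-values below a threshold would give the count of 'significant' discoveries.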
|
7 |
Penalised regression for high-dimensional data: an empirical investigation and improvements via ensemble learning
Wang, Fan, January 2019
In a wide range of applications, datasets are generated for which the number of variables p exceeds the sample size n. Penalised likelihood methods are widely used to tackle regression problems in these high-dimensional settings. In this thesis, we carry out an extensive empirical comparison of the performance of popular penalised regression methods in high-dimensional settings and propose new methodology that uses ensemble learning to enhance the performance of these methods. The relative efficacy of different penalised regression methods in finite-sample settings remains incompletely understood. Through a large-scale simulation study, consisting of more than 1,800 data-generating scenarios, we systematically consider the influence of various factors (for example, sample size and sparsity) on method performance. We focus on three related goals (prediction, variable selection and variable ranking) and consider six widely used methods. The results are supported by a semi-synthetic data example. Our empirical results complement existing theory and provide a resource to compare performance across a range of settings and metrics. We then propose a new ensemble learning approach for improving the performance of penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner".
In simulations, we show that STRANDS typically improves upon its base learner, and demonstrate that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored. We propose another ensemble learning method to improve the prediction performance of Ridge Regression in sparse settings. Specifically, we combine Bayesian Ridge Regression with a probabilistic forward selection procedure, where inclusion of a variable at each stage is probabilistically determined by a Bayes factor. We compare the prediction performance of the proposed method to penalised regression methods using simulated data.
|
8 |
Statistical Dependence in Imputed High-Dimensional Data for a Colorectal Cancer Study
Suyundikov, Anvar, 01 May 2015
The main purpose of this dissertation was to examine the statistical dependence of imputed microRNA (miRNA) data in a colorectal cancer study. The dissertation addressed three related statistical issues that were raised by this study. The first statistical issue was motivated by the fact that miRNA expression was measured in paired tumor-normal samples of hundreds of patients, but data for many normal samples were missing due to lack of tissue availability. We compared the precision and power performance of several imputation methods, and drew attention to the statistical dependence induced by K-Nearest Neighbors (KNN) imputation. The second statistical issue was raised by the necessity to address the bimodality of distributions of miRNA data along with the imputation-induced dependency among subjects. We proposed and compared the performance of three nonparametric methods to identify the differentially expressed miRNAs in the paired tumor-normal data while accounting for the imputation-induced dependence. The third statistical issue was related to the development of a normalization method for miRNA data that would reduce not only technical variation but also the variation caused by the characteristics of subjects, while maintaining the true biological differences between arrays.
|
9 |
A clustering scheme for large high-dimensional document datasets
Chen, Jing-wen, 09 August 2007
Document clustering methods are receiving more and more attention. Because of the high dimensionality and the large volume of data, clustering methods usually need a lot of time to compute. We propose a scheme to make the clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, we use one of these parts for clustering. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. We add another part of the data, convert these data to the lower dimension, and cluster them again. We repeat this until all partitions are used. According to the experimental results, this scheme may run twice as fast as the original clustering method.
|
10 |
Kernel Machine Methods for Risk Prediction with High Dimensional Data
Sinnott, Jennifer Anne, 22 October 2012
Understanding the relationship between genomic markers and complex disease could have a profound impact on medicine, but the large number of potential markers can make it hard to differentiate true biological signal from noise and false positive associations. A standard approach for relating genetic markers to complex disease is to test each marker for its association with disease outcome by comparing disease cases to healthy controls. It would be cost-effective to use control groups across studies of many different diseases; however, this can be problematic when the controls are genotyped on a platform different from the one used for cases. Since different platforms genotype different SNPs, imputation is needed to provide full genomic coverage, but introduces differential measurement error. In Chapter 1, we consider the effects of this differential error on association tests. We quantify the inflation in Type I Error by comparing two healthy control groups drawn from the same cohort study but genotyped on different platforms, and assess several methods for mitigating this error. Analyzing genomic data one marker at a time can effectively identify associations, but the resulting lists of significant SNPs or differentially expressed genes can be hard to interpret. Integrating prior biological knowledge into risk prediction with such data by grouping genomic features into pathways reduces the dimensionality of the problem and could improve models by making them more biologically grounded and interpretable. The kernel machine framework has been proposed to model pathway effects because it allows nonlinear associations between the genes in a pathway and disease risk. In Chapter 2, we propose kernel machine regression under the accelerated failure time model. We derive a pseudo-score statistic for testing and a risk score for prediction using genes in a single pathway. 
We propose omnibus procedures that alleviate the need to prespecify the kernel and allow the data to drive the complexity of the resulting model. In Chapter 3, we extend methods for risk prediction using a single pathway to methods for risk prediction using multiple pathways, using a multiple kernel learning approach to select important pathways and efficiently combine information across pathways.
|