11 |
Simultaneous Inference for High Dimensional and Correlated Data
Polin, Afroza, 22 August 2019
No description available.
|
12 |
Hierarchické shlukování s Mahalanobis-average metrikou akcelerované na GPU / GPU-accelerated Mahalanobis-average hierarchical clustering
Šmelko, Adam, January 2020
Hierarchical clustering algorithms are common tools for simplifying, exploring and analyzing datasets in many areas of research. For flow cytometry, a specific variant of agglomerative clustering has been proposed that uses cluster linkage based on Mahalanobis distance to produce results better suited to the domain. The applicability of this clustering algorithm is currently limited by its relatively high computational complexity, which does not allow it to scale to common cytometry datasets. This thesis describes a specialized, GPU-accelerated version of Mahalanobis-average linked hierarchical clustering, which improves the algorithm's performance by several orders of magnitude, thus allowing it to scale to much larger datasets. The thesis provides an overview of current hierarchical clustering algorithms and details the construction of the variant used on the GPU. The result is benchmarked on publicly available high-dimensional data from mass cytometry.
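To illustrate why a Mahalanobis-based linkage can behave differently from a Euclidean one, the following is a minimal sketch of the Mahalanobis distance from a point to a cluster in two dimensions. It is not the thesis's GPU implementation or its exact linkage rule; the function names `mean_and_cov_2d` and `mahalanobis_2d` and the toy cluster are assumptions made for this example.

```python
from statistics import fmean


def mean_and_cov_2d(points):
    """Return the centroid and 2x2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = fmean(p[0] for p in points)
    my = fmean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(point, centroid, cov):
    """Mahalanobis distance sqrt(d^T S^-1 d) from point to centroid under covariance S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = point[0] - centroid[0], point[1] - centroid[1]
    # Closed-form inverse of the 2x2 covariance applied to the difference vector.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return q ** 0.5


# Hypothetical toy cluster, elongated along the x axis.
cluster = [(-2.0, 0.0), (2.0, 0.0), (0.0, -0.5), (0.0, 0.5)]
centroid, cov = mean_and_cov_2d(cluster)
d_along = mahalanobis_2d((2.0, 0.0), centroid, cov)   # along the long axis
d_across = mahalanobis_2d((0.0, 2.0), centroid, cov)  # across the short axis
print(d_along, d_across)
```

Both query points lie at Euclidean distance 2 from the centroid, yet the point along the cluster's long axis is much closer in Mahalanobis terms (about 1.22 versus about 4.90), which is the property that makes the distance attractive for elongated clusters.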
|
13 |
Improving the Accuracy of Variable Selection Using the Whole Solution Path
Liu, Yang, 23 July 2015
No description available.
|
14 |
Consistent bi-level variable selection via composite group bridge penalized regression
Seetharaman, Indu, January 1900
Master of Science / Department of Statistics / Kun Chen / We study the composite group bridge penalized regression methods for conducting bi-level variable selection in high dimensional linear regression models with a diverging number of predictors. The proposed method combines the ideas of bridge regression (Huang et al., 2008a) and group bridge regression (Huang et al., 2009) to achieve variable selection consistency at both the individual and group levels simultaneously, i.e., the important groups and the important individual variables within each group can both be correctly identified with probability approaching one as the sample size increases to infinity. The method takes full advantage of the prior grouping information, and the established bi-level oracle properties ensure that the method is immune to possible group misidentification. A related adaptive group bridge estimator, which uses adaptive penalization for improving bi-level selection, is also investigated. Simulation studies show that the proposed methods outperform many existing methods.
|
15 |
Bayesian classification of DNA barcodes
Anderson, Michael P., January 1900
Doctor of Philosophy / Department of Statistics / Suzanne Dubnicka / DNA barcodes are short strands of nucleotide bases taken from the cytochrome c oxidase subunit 1 (COI) of the mitochondrial DNA (mtDNA). A single barcode may have the form C C G G C A T A G T A G G C A C T G . . . and typically ranges in length from 255 to around 700 nucleotide bases. Unlike nuclear DNA (nDNA), mtDNA remains largely unchanged as it is passed from mother to offspring. It has been proposed that these barcodes may be used as a method of differentiating between biological species (Hebert, Ratnasingham, and deWaard 2003). While this proposal is sharply debated among some taxonomists (Will and Rubinoff 2004), it has gained momentum and attention from biologists. One issue at the heart of the controversy is the use of genetic distance measures as a tool for species differentiation. Current methods of species classification rely on distance measures that depend heavily on both evolutionary model assumptions and a clearly defined "gap" between intra- and interspecies variation (Meyer and Paulay 2005). We point out the limitations of such distance measures and propose a character-based method of species classification that applies Bayes' rule to overcome these deficiencies. The proposed method is shown to provide accurate species-level classification. The proposed methods also provide answers to important questions not addressable with current methods.
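As a rough illustration of what a character-based classifier built on Bayes' rule can look like, here is a generic position-wise naive Bayes sketch. It is not the method developed in the dissertation: the independence assumption across positions, the Laplace smoothing, the function names `train` and `classify`, and the toy sequences are all assumptions introduced for this example.

```python
from collections import Counter, defaultdict
from math import log

BASES = "ACGT"


def train(labelled_barcodes):
    """Count bases per (species, position) from aligned, equal-length barcodes."""
    counts = defaultdict(Counter)  # (species, position) -> base counts
    totals = Counter()             # species -> number of training barcodes
    for species, seq in labelled_barcodes:
        totals[species] += 1
        for pos, base in enumerate(seq):
            counts[(species, pos)][base] += 1
    return counts, totals


def classify(seq, counts, totals):
    """Return the species with the highest posterior probability under Bayes' rule."""
    n = sum(totals.values())
    best, best_score = None, float("-inf")
    for species, n_s in totals.items():
        score = log(n_s / n)  # log prior from training frequencies
        for pos, base in enumerate(seq):
            # Laplace-smoothed likelihood P(base | species, position).
            score += log((counts[(species, pos)][base] + 1) / (n_s + len(BASES)))
        if score > best_score:
            best, best_score = species, score
    return best


# Hypothetical toy barcodes; real barcodes are hundreds of bases long.
training = [
    ("sp_a", "CCGGCA"),
    ("sp_a", "CCGGCT"),
    ("sp_b", "CTGACA"),
    ("sp_b", "CTGACG"),
]
counts, totals = train(training)
print(classify("CCGGCG", counts, totals))
```

The classifier compares characters position by position and needs no genetic distance measure or intra/interspecies "gap"; the posterior scores could also be normalized to report a probability for each candidate species.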
|
16 |
A-OPTIMAL SUBSAMPLING FOR BIG DATA GENERAL ESTIMATING EQUATIONS
Chung Ching Cheung (7027808), 13 August 2019
<p>A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with a large sample size is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for the linear model. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and the estimator with resampling. We also give the asymptotic representation of the bias of both estimators, and we show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method called A-optimal, which is derived by minimizing the trace of some dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling methods. We conduct extensive simulations on large-sample, high-dimensional data to evaluate the performance of our proposed methods using MSE as a criterion. High dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE because the bias is not negligible. We apply our proposed subsampling method to analyze a real data set, gas sensor data with more than four million data points. In both simulations and real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.</p>
|
17 |
A General Framework for Multi-Resolution Visualization
Yang, Jing, 05 May 2005
Multi-resolution visualization (MRV) systems are widely used for handling large amounts of information. These systems look different but they share many common features. The visualization research community lacks a general framework that summarizes the common features among the wide variety of MRV systems in order to help in MRV system design, analysis, and enhancement. This dissertation proposes such a general framework. This framework is based on the definition that an MRV system is a visualization system that visually represents perceptions in different levels of detail and allows users to interactively navigate among the representations. The visual representation of a perception is called a view. The framework is composed of two essential components: view simulation and interactive visualization. View simulation means that an MRV system simulates views of non-existing perceptions through simplification on the data structure or the graphics generation process. This is needed when the perceptions provided to the MRV system are not at the user's desired level of detail. The framework identifies classes of view simulation approaches and describes them in terms of simplification operators and operands (spaces). The simplification operators are further divided into four categories, namely sampling operators, aggregation operators, approximation operators, and generalization operators. Techniques in these categories are listed and illustrated via examples. The simplification operands (spaces) are also further divided into categories, namely data space and visualization space. How different simplification operators are applied to these spaces is also illustrated using examples. Interactive visualization means that an MRV system visually presents the views to users and allows users to interactively navigate among different views or within one view.
Three types of MRV interface, namely the zoomable interface, the overview + context interface, and the focus + detail interface, are presented with examples. Common interaction tools used in MRV systems, such as zooming and panning, selection, distortion, overlap reduction, previewing, and dynamic simplification, are also presented. A large number of existing MRV systems are used as examples in this dissertation, including several MRV systems developed by the author based on the general framework. In addition, a case study that analyzes and suggests possible improvements for an existing MRV system is described. These examples and the case study reveal that the framework covers the common features of a wide variety of existing MRV systems, and helps users analyze and improve existing MRV systems as well as design new MRV systems.
|
18 |
Visual Hierarchical Dimension Reduction
Yang, Jing, 09 January 2002
Traditional visualization techniques for multidimensional data sets, such as parallel coordinates, star glyphs, and scatterplot matrices, do not scale well to high dimensional data sets. A common approach to solving this problem is dimensionality reduction. Existing dimensionality reduction techniques, such as Principal Component Analysis, Multidimensional Scaling, and Self Organizing Maps, have serious drawbacks in that the generated low dimensional subspace has no intuitive meaning to users. In addition, little user interaction is allowed in those highly automatic processes. In this thesis, we propose a new approach to dimensionality reduction that combines automation and user interaction for the generation of meaningful subspaces, called the visual hierarchical dimension reduction (VHDR) framework. First, VHDR groups all dimensions of a data set into a dimension hierarchy. This hierarchy is then visualized using a radial space-filling hierarchy visualization tool called Sunburst. Users can thus interactively explore and modify the dimension hierarchy, and select clusters at different levels of detail for the data display. VHDR then assigns a representative dimension to each dimension cluster selected by the users. Finally, VHDR maps the high-dimensional data set into the subspace composed of these representative dimensions and displays the projected subspace. To accomplish the latter, we have designed several extensions to existing popular multidimensional display techniques, such as parallel coordinates, star glyphs, and scatterplot matrices. These displays have been enhanced to express semantics of the selected subspace, such as the context of the dimensions and dissimilarity among the individual dimensions in a cluster. We have implemented all these features and incorporated them into the XmdvTool software package, which will be released as XmdvTool Version 6.0.
Lastly, we developed two case studies to show how we apply VHDR to visualize and interactively explore a high dimensional data set.
|
19 |
Marginal false discovery rate approaches to inference on penalized regression models
Miller, Ryan, 01 August 2018
Data containing a large number of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets due to their ability to naturally perform variable selection. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency for the most predictive models, for example those chosen using procedures like cross-validation, to include a substantial number of noise variables with no real relationship with the outcome. To address the task of performing inference on penalized regression models, this thesis proposes false discovery rate approaches for a broad class of penalized regression models. This work includes the development of an upper bound for the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods using numerous simulation studies, the practical utility of these methods is demonstrated using real data from several high-dimensional genome-wide association studies.
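The remark that the lasso "naturally performs variable selection" can be made concrete with a small sketch of coordinate descent, in which each coefficient is updated by soft-thresholding and weak coefficients are set exactly to zero. This is a textbook illustration, not the false discovery rate procedure proposed in the thesis; the function names, the objective scaling (1/2n), the absence of an intercept, and the toy data are assumptions made here.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1 (no intercept).

    Assumes every column of X has at least one nonzero entry.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                # Fit from all features except j, using the current coefficients.
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Hypothetical toy data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]
print(lasso_cd(X, y, lam=0.5))
```

With this penalty level the weak second coefficient is driven exactly to zero while the first is shrunk from about 2.0 to about 1.5. A smaller penalty would let the weak variable back into the model, which is why assessing how many selected variables are noise matters.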
|
20 |
Bayesian Sparse Learning for High Dimensional Data
Shi, Minghui, January 2011
<p>In this thesis, we develop some Bayesian sparse learning methods for high dimensional data analysis. There are two important topics that are related to the idea of sparse learning: variable selection and factor analysis. We start with the Bayesian variable selection problem in regression models. One challenge in Bayesian variable selection is to search the huge model space adequately, while identifying high posterior probability regions. In the past decades, the main focus has been on the use of Markov chain Monte Carlo (MCMC) algorithms for these purposes. In the first part of this thesis, instead of using MCMC, we propose a new computational approach based on sequential Monte Carlo (SMC), which we refer to as particle stochastic search (PSS). We illustrate PSS through applications to linear regression and probit models.</p><p>Besides the Bayesian stochastic search algorithms, there is a rich literature on shrinkage and variable selection methods for high dimensional regression and classification with vector-valued parameters, such as the lasso (Tibshirani, 1996) and the relevance vector machine (Tipping, 2001). Compared with the Bayesian stochastic search algorithms, these methods do not account for model uncertainty but are more computationally efficient. In the second part of this thesis, we generalize these ideas to matrix-valued parameters and focus on developing an efficient variable selection method for multivariate regression. We propose a Bayesian shrinkage model (BSM) and an efficient algorithm for learning the associated parameters.</p><p>In the third part of this thesis, we focus on the topic of factor analysis, which has been widely used in unsupervised learning. One central problem in factor analysis is the determination of the number of latent factors. We propose some Bayesian model selection criteria for selecting the number of latent factors based on a graphical factor model.
As illustrated in Chapter 4, our proposed method achieves good performance in correctly selecting the number of factors in several different settings. As for applications, we implement the graphical factor model for several different purposes, such as covariance matrix estimation, latent factor regression and classification.</p> / Dissertation
|