111

STATISTICAL METHODS FOR VARIABLE SELECTION IN THE CONTEXT OF HIGH-DIMENSIONAL DATA: LASSO AND EXTENSIONS

Yang, Xiao Di 10 1900 (has links)
With the advance of technology, the collection and storage of data have become routine, and huge amounts of data are increasingly produced from biological experiments. The advent of DNA microarray technologies has enabled scientists to measure the expression of tens of thousands of genes simultaneously, and single nucleotide polymorphisms (SNPs) are being used in genetic association studies with a wide range of phenotypes, for example complex diseases. Such high-dimensional problems are becoming more and more common. The "large p, small n" problem, in which there are more variables than samples, is currently a challenge that many statisticians face. Penalized variable selection is an effective way to deal with it. In particular, the Lasso (least absolute shrinkage and selection operator) proposed by Tibshirani has become a standard method for this type of problem. The Lasso works well when the covariates can be treated individually, but not when they are grouped. The elastic net, group lasso, group MCP, and group bridge are extensions of the Lasso. Group lasso enforces sparsity at the group level rather than at the level of individual covariates, while group bridge and group MCP produce sparse solutions both at the group level and at the level of individual covariates within a group. Our simulation study shows that group lasso forces complete grouping, group MCP encourages grouping only to a slight extent, and group bridge lies somewhere in between. If one expects the proportion of nonzero group members to be greater than one-half, group lasso may be a good choice; otherwise group MCP would be preferred. If one expects this proportion to be close to one-half, one may wish to use group bridge. A real data analysis is also conducted on genetic variation (SNP) data to identify associations between SNPs and West Nile disease. / Master of Science (MSc)
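As a rough illustration of the penalties discussed in this abstract (not code from the thesis), the Python sketch below contrasts the ordinary lasso, which zeroes individual coefficients, with a proximal-gradient group lasso, whose block soft-thresholding keeps or drops whole groups. The data, group sizes, and penalty levels are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                                   # "large p, small n"
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]           # only the first group of 5 is active
y = X @ beta + 0.5 * rng.standard_normal(n)

# Ordinary lasso: sparsity at the level of individual covariates.
lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))

# Group lasso via proximal gradient: the proximal operator of lam*||b_g||_2
# shrinks each group as a block, so a whole group is either kept or zeroed.
groups = np.array_split(np.arange(p), 40)        # 40 non-overlapping groups of 5
step = n / np.linalg.norm(X, 2) ** 2             # 1 / Lipschitz constant of the gradient
lam = 1.0
b = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ b - y) / n                 # gradient of (1/2n)||y - Xb||^2
    z = b - step * grad
    for g in groups:
        norm_g = np.linalg.norm(z[g])
        z[g] = max(0.0, 1 - step * lam / norm_g) * z[g] if norm_g > 0 else z[g]
    b = z
print("group-lasso active groups:", sum(int(np.any(b[g] != 0)) for g in groups))
```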
112

LEARNING FROM INCOMPLETE HIGH-DIMENSIONAL DATA

Lou, Qiang January 2013 (has links)
Data sets with irrelevant and redundant features and a large fraction of missing values are common in real-life applications. Learning from such data usually requires preprocessing, such as selecting informative features and imputing missing values based on the observed data; these steps can yield more accurate and more efficient prediction as well as a better understanding of the data distribution. In my dissertation I describe my work on both of these aspects, my follow-up work on feature selection in incomplete data sets without imputing missing values, and, in the last part, my current work on the more challenging situation in which the high-dimensional data evolve over time. The first two parts of my dissertation consist of methods that handle such data in a straightforward way: impute the missing values first, and then apply a traditional feature selection method to select informative features. We propose two novel methods, one for imputing missing values and the other for selecting informative features. The imputation method fills in missing attributes by exploiting temporal correlation of attributes, correlations among multiple attributes collected at the same time and location, and spatial correlations among attributes from multiple sources. The proposed feature selection method aims to find a minimum subset of the most informative variables for classification/regression by efficiently approximating the Markov blanket, the set of variables that can shield a given variable from the target. In the third part, I show how to perform feature selection in incomplete high-dimensional data without imputation, since imputation methods only work well when data are missing completely at random, when the fraction of missing values is small, or when there is prior knowledge about the data distribution. We define the objective function of an uncertainty-margin-based feature selection method that maximizes each instance's uncertainty margin in its own relevant subspace, and the optimization accounts for the uncertainty of each instance due to missing values. Experimental results on synthetic data and six benchmark data sets with few missing values (less than 25%) provide evidence that our method selects features as accurate as those chosen by alternatives that apply an imputation method first; when there is a large fraction of missing values (more than 25%), our method outperforms those alternatives. In the fourth part, I introduce a method for the more challenging situation in which the high-dimensional data vary over time. The existing way to handle such data is to flatten the temporal data into a single static data matrix and then apply a traditional feature selection method. To preserve the dynamics in the time series, our method avoids flattening the data in advance. We propose a way to measure the distance between the multivariate temporal data of two instances and, based on this distance, define a new objective function built on the temporal margin of each data instance. A fixed-point gradient descent method is proposed to optimize this objective and learn the optimal feature weights. Experimental results on real temporal microarray data provide evidence that the proposed method identifies more informative features than alternatives that flatten the temporal data in advance. / Computer and Information Science
113

Variable Selection and Supervised Dimension Reduction for Large-Scale Genomic Data with Censored Survival Outcomes

Spirko, Lauren Nicole January 2017 (has links)
One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes, providing insight into the disease process. With the rapid development of high-throughput genomic technologies over the past two decades, the scientific community is able to monitor the expression levels of thousands of genes and proteins, resulting in enormous data sets where the number of genomic variables (covariates) is far greater than the number of subjects. It is also typical for such data sets to have a high proportion of censored observations. Methods based on univariate Cox regression are often used to select genes related to the survival outcome. However, the Cox model assumes proportional hazards (PH), which is unlikely to hold for every gene; when applied to genes exhibiting some form of non-proportional hazards (NPH), these methods can lead to an under- or over-estimation of the effects. In this thesis, we develop methods that will directly address t / Statistics
114

Corporate Default Predictions and Methods for Uncertainty Quantifications

Yuan, Miao 01 August 2016 (has links)
Regarding the quantification of uncertainty in prediction, two projects with different perspectives and application backgrounds are presented in this dissertation. The goal of the first project is to predict corporate default risk from large-scale time-to-event and covariate data in the context of controlling credit risk. Specifically, we propose a competing risks model to incorporate exits of companies due to default and other reasons. Because of the stochastic and dynamic nature of corporate risk, we incorporate both company-level and market-level covariate processes into the event intensities. We propose a parsimonious Markovian time series model and a dynamic factor model (DFM) to efficiently capture the mean and correlation structure of the high-dimensional covariate dynamics. For estimating the parameters of the DFM, we derive an expectation-maximization (EM) algorithm in explicit form under the necessary constraints. For multi-period default risk, we consider both corporate-level and market-level predictions, and we develop prediction interval (PI) procedures that jointly account for uncertainty in the future observations, in the parameter estimates, and in the future covariate processes. In the second project, to quantify the uncertainty in maximum likelihood (ML) estimators and compute exact tolerance interval (TI) factors at the nominal confidence level, we propose algorithms for two-sided control-the-center and control-both-tails TIs for complete or Type II censored data following the (log-)location-scale family of distributions. Our approaches are based on pivotal properties of the ML estimators for the (log-)location-scale family and utilize Monte Carlo simulation. For Type I censored data, only approximate pivotal quantities exist, so an adjusted procedure is developed to compute approximate factors; our simulation study shows the observed coverage probability to be asymptotically accurate. The proposed methods are illustrated using real-data examples. / Ph. D.
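To make the Monte Carlo idea behind exact tolerance factors concrete, here is a minimal sketch for the normal distribution (a location-scale member) with complete data: pivotal quantities are simulated, the factor achieving content p is solved for in each replication, and the confidence-level quantile is returned. The sample size, content, and confidence level are illustrative, and the thesis's ML-based, censored-data procedures are not reproduced.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def normal_ti_factor(n, p=0.90, gamma=0.95, n_mc=5000, seed=1):
    """Monte Carlo factor k so that [xbar - k*s, xbar + k*s] contains at least
    a proportion p of a normal population with confidence gamma.  Because
    (xbar - mu)/sigma and s/sigma are pivotal, simulating from N(0, 1) suffices."""
    rng = np.random.default_rng(seed)
    ks = np.empty(n_mc)
    for i in range(n_mc):
        x = rng.standard_normal(n)
        z, v = x.mean(), x.std(ddof=1)           # sample (not ML) scale; a minor variant
        content = lambda k: norm.cdf(z + k * v) - norm.cdf(z - k * v) - p
        ks[i] = brentq(content, 1e-6, 100.0)     # k giving exactly content p for this draw
    return np.quantile(ks, gamma)                # confidence-level quantile of the factors

print(normal_ti_factor(n=20))                    # roughly 2.3 for p = 0.90, gamma = 0.95
```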
115

Analysis of high-dimensional compositional microbiome data using PERMANOVA and machine learning classifiers

Lindström, Felix, Oleandersson, Robin January 2024 (has links)
Microbiome research has become a ubiquitous component of contemporary clinical research, with the potential to uncover associations between microbiome composition and disease. With microbiome data becoming more prevalent, the need to understand how to analyse such data is increasingly important. One complicating property of microbiome data is that it is inherently compositional and thus constrained to simplex space; because of this, it cannot be analysed directly using conventional statistical methods. In this paper, we transform the compositional data to lift the simplex constraint and then investigate the viability of applying conventional statistical methods to it. Using a high-dimensional data set of gut-microbiome samples from Parkinson's and control patients, we first transform the raw data to the centred log-ratio scale and then use permutational multivariate analysis of variance (PERMANOVA) to test whether the two groups differ with respect to bacterial abundances. We then employ three machine learning classifiers (logistic regression, XGBoost, and random forest) and evaluate their performance on the transformed data. The PERMANOVA results indicate that gut-microbiome composition in patients with Parkinson's disease does indeed differ from that in control individuals. Random forest achieves the highest classification accuracy, followed by XGBoost, while logistic regression performs poorly, calling into question its viability for analysing high-dimensional compositional microbiome data. We find four bacterial species of high importance for the classification: Prevotella copri, Prevotella sp. CAG 520, Akkermansia muciniphila, and Butyricimonas virosa, the first three of which have previously been mentioned in the Parkinson's literature.
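A minimal sketch of the preprocessing and classification pipeline described above, assuming simulated count data: the counts are closed to compositions, moved to the centred log-ratio scale, and fed to a random forest whose feature importances flag influential taxa. The pseudocount, dimensions, and planted signal are illustrative, and the PERMANOVA step is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 120, 300                               # samples x taxa (raw counts)
labels = rng.integers(0, 2, size=n)           # e.g. control (0) vs. Parkinson's (1)
counts = rng.poisson(lam=5.0, size=(n, p))
counts[labels == 1, :4] += rng.poisson(lam=20.0, size=(int(labels.sum()), 4))  # 4 taxa shifted

def clr(counts, pseudocount=0.5):
    """Centred log-ratio transform: close counts to compositions (after a
    zero-replacement pseudocount), then take logs relative to the geometric mean."""
    x = counts.astype(float) + pseudocount
    x = x / x.sum(axis=1, keepdims=True)
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

Z = clr(counts)
rf = RandomForestClassifier(n_estimators=500, random_state=0)
print("5-fold CV accuracy:", cross_val_score(rf, Z, labels, cv=5).mean())

rf.fit(Z, labels)                             # importances point to the taxa driving the split
print("top taxa indices:", np.argsort(rf.feature_importances_)[::-1][:5])
```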
116

Topics in Modern Bayesian Computation

Qamar, Shaan January 2015 (has links)
Collections of large volumes of rich and complex data have become ubiquitous in recent years, posing new challenges in methodological and theoretical statistics alike. Today, statisticians are tasked with developing flexible methods capable of adapting to the degree of complexity and noise in increasingly rich data gathered across a variety of disciplines and settings. This has spurred the need for novel multivariate regression techniques that can efficiently capture a wide range of naturally occurring predictor-response relations, identify important predictors and their interactions, and do so even when the number of predictors is large but the sample size remains limited.
Meanwhile, efficient model-fitting tools must evolve quickly to keep pace with the rapidly growing dimension and complexity of the data they are applied to. Aided by the tremendous success of modern computing, Bayesian methods have gained great popularity in recent years. These methods provide a natural probabilistic characterization of uncertainty in the parameters and in predictions. In addition, they provide a practical way of encoding model structure that can lead to large gains in statistical estimation and more interpretable results. However, this flexibility is often hindered in applications to modern data, which are increasingly high dimensional both in the number of observations n and the number of predictors p. Here, computational complexity and the curse of dimensionality typically render posterior computation inefficient. In particular, Markov chain Monte Carlo (MCMC) methods, which remain the workhorse of Bayesian computation owing to their generality and asymptotic accuracy guarantees, typically suffer data-processing and computational bottlenecks as a consequence of (i) the need to hold the entire dataset (or the available sufficient statistics) in memory at once; and (ii) having to evaluate the (often expensive to compute) data likelihood at each sampling iteration.
This thesis divides into two parts. The first part concerns developing efficient MCMC methods for posterior computation in the high-dimensional large-n, large-p setting. In particular, we develop an efficient and widely applicable approximate inference algorithm that extends MCMC to the online data setting, and separately propose a novel stochastic search sampling scheme for variable selection in high-dimensional predictor settings. The second part develops novel methods for structured sparsity in the high-dimensional large-p, small-n regression setting. Here, statistical methods should scale well with the predictor dimension and be able to efficiently identify low-dimensional structure so as to facilitate optimal statistical estimation in the presence of limited data. Importantly, these methods must be flexible enough to accommodate potentially complex relationships between the response and its associated explanatory variables. The first work proposes a nonparametric additive Gaussian process model to learn predictor-response relations that may be highly nonlinear and include numerous lower-order interaction effects, possibly in different parts of the predictor space. A second work proposes a novel class of Bayesian shrinkage priors for multivariate regression with a tensor-valued predictor. Dimension reduction is achieved using a low-rank additive decomposition for the latter, enabling a highly flexible and rich structure within which excellent cell estimation and region selection may be obtained through state-of-the-art shrinkage methods. In addition, the methods developed in these works come with strong theoretical guarantees. / Dissertation
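To illustrate the per-iteration likelihood bottleneck the abstract refers to, here is a minimal random-walk Metropolis sketch for a Bayesian logistic regression: every accept/reject decision requires a pass over all n observations, which is precisely what online or approximate schemes try to reduce. The prior and proposal scales are illustrative, and none of the thesis's algorithms are reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 10
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def log_post(beta, prior_sd=5.0):
    """Log posterior of Bayesian logistic regression with a normal prior.
    Note the pass over all n observations: the per-iteration cost that plain
    MCMC pays and that online/approximate schemes try to reduce."""
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - 0.5 * np.sum(beta ** 2) / prior_sd ** 2

beta, lp = np.zeros(p), log_post(np.zeros(p))
samples, step = [], 0.02
for _ in range(5000):
    prop = beta + step * rng.standard_normal(p)
    lp_prop = log_post(prop)                     # full-data likelihood at every step
    if np.log(rng.random()) < lp_prop - lp:      # Metropolis accept/reject
        beta, lp = prop, lp_prop
    samples.append(beta.copy())
print("posterior mean (first 3 coords):", np.mean(samples[2000:], axis=0)[:3])
```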
117

Inference of nonparametric hypothesis testing on high dimensional longitudinal data and its application in DNA copy number variation and microarray data analysis

Zhang, Ke January 1900 (has links)
Doctor of Philosophy / Department of Statistics / Haiyan Wang / High-throughput screening technologies have generated a huge amount of biological data in the last ten years. With the easy availability of array technology, researchers have started to investigate biological mechanisms using experiments with more sophisticated designs that pose novel challenges to statistical analysis. We provide theory for robust statistical tests in three flexible models. In the first model, we consider hypothesis testing problems in which a large number of variables are observed repeatedly over time. A potential application is in tumor genomics, where an array comparative genomic hybridization (aCGH) study can be used to detect progressive DNA copy number changes in tumor development. In the second model, we consider hypothesis testing theory for a longitudinal microarray study with multiple treatments or experimental conditions. The tests developed can be used to detect treatment effects for a large group of genes and to discover genes that respond to treatment over time. In the third model, we address a hypothesis testing problem that can arise when array data from different sources are to be integrated; we perform statistical tests assuming a nested design. In all models, robust test statistics were constructed based on moment methods, allowing unbalanced designs and arbitrary heteroscedasticity. The limiting distributions were derived in the nonclassical setting where the number of probes is large. The test statistics are not targeted at a single probe; instead, we are interested in testing a selected set of probes simultaneously. Simulation studies were carried out to compare the proposed methods with traditional tests based on linear mixed-effects models and generalized estimating equations. Interesting results obtained with the proposed theory in two cancer genomic studies suggest that the new methods are promising for a wide range of biological applications with longitudinal arrays.
118

Applications of stochastic control and statistical inference in macroeconomics and high-dimensional data

Han, Zhi 07 January 2016 (has links)
This dissertation is dedicated to studying the modeling of drift control in foreign exchange reserves management and to designing fast algorithms for statistical inference, with applications to high-dimensional data analysis. The thesis has two parts. The first topic involves modeling foreign exchange reserve management as a drift control problem. We show that, under certain conditions, control band policies are optimal for the discounted-cost drift control problem, and we develop an algorithm to calculate the optimal thresholds of the optimal control band policy. The second topic involves a fast algorithm for computing the partial distance covariance statistic, with an application to feature screening in high-dimensional data. We show that an O(n log n) algorithm exists for a version of the partial distance covariance, compared with the O(n^2) algorithm implemented directly according to its definition. We further propose an iterative feature screening procedure for high-dimensional data based on the partial distance covariance. This procedure enjoys two advantages over correlation learning. First, an important predictor that is marginally uncorrelated but jointly correlated with the response can be picked up by our procedure and thus enter the estimation model. Second, our procedure is robust to model misspecification.
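For context, a minimal sketch of the plain (not partial) distance covariance computed directly from its definition, the O(n^2) baseline that the fast algorithm improves on, used here for naive feature screening on simulated data. The quadratic signal and all dimensions are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dcov2(x, y):
    """Squared sample distance covariance, O(n^2) straight from the definition:
    the mean of the elementwise product of the double-centred distance matrices."""
    a = squareform(pdist(x.reshape(-1, 1)))
    b = squareform(pdist(y.reshape(-1, 1)))
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(n)   # quadratic in X[:, 0]: Pearson correlation near zero

# Naive screening: rank predictors by distance covariance with the response.
scores = np.array([dcov2(X[:, j], y) for j in range(p)])
print("top-ranked predictors:", np.argsort(scores)[::-1][:3])   # predictor 0 should come first
```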
119

On fillability of contact manifolds

Niederkrüger, Klaus 11 December 2013 (has links) (PDF)
The aim of this text is to give an accessible overview to some recent results concerning contact manifolds and their symplectic fillings. In particular, we work out the weakest compatibility conditions between a symplectic manifold and a contact structure on its boundary to still be able to obtain a sensible theory (Chapter II), furthermore we prove two results (Theorem A and B in Section I.4) that show how certain submanifolds inside a contact manifold obstruct the existence of a symplectic filling or influence its topology. We conclude by giving several constructions of contact manifolds that for different reasons do not admit a symplectic filling.
120

Regularization Methods for Predicting an Ordinal Response using Longitudinal High-dimensional Genomic Data

Hou, Jiayi 25 November 2013 (has links)
Ordinal scales are commonly used to measure health status and disease-related outcomes in hospital settings as well as in translational medical research. Notable examples include cancer staging, a five-category ordinal scale indicating tumor size, node involvement, and likelihood of metastasizing, and the Glasgow Coma Scale (GCS), an ordinal measure that gives a reliable and objective assessment of a patient's conscious state. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical ordinal modeling methods based on the likelihood approach have contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) is smaller than the sample size (n). With genomic technologies increasingly applied to obtain more accurate diagnoses and prognoses, a novel type of data, known as high-dimensional data, where the number of covariates (p) is much larger than the number of samples (n), is being generated. However, the corresponding statistical methodology and computational software for analyzing high-dimensional data with an ordinal or longitudinal ordinal response are lacking. In this thesis, we develop a regularization algorithm to build a parsimonious model for predicting an ordinal response. In addition, we utilize the classical ordinal model with longitudinal measurements and incorporate cutting-edge data mining tools for a comprehensive understanding of the causes of complex disease on both the molecular and environmental levels. Moreover, we develop a corresponding R package for general use. The algorithm was applied to several real datasets as well as to simulated data to demonstrate its efficiency in variable selection and precision in prediction and classification. The four real datasets are from: 1) the National Institute of Mental Health Schizophrenia Collaborative Study; 2) the San Diego Health Services Research Example; 3) a gene expression experiment to understand 'Decreased Expression of Intelectin 1 in the Human Airway Epithelium of Smokers Compared to Nonsmokers' by Weill Cornell Medical College; and 4) the National Institute of General Medical Sciences Inflammation and the Host Response to Burn Injury Collaborative Study.
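As a generic illustration of penalized ordinal regression (not the thesis's algorithm or its R package), the sketch below fits a cumulative-logit (proportional-odds) model with a smoothed L1 penalty via scipy, assuming a three-category response and simulated predictors.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n, p, J = 150, 40, 3                              # samples, predictors, ordered categories
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.5, -1.0, 0.8]
y = np.digitize(X @ beta_true + rng.logistic(size=n), [-1.0, 1.0])   # categories 0, 1, 2

def neg_penalized_loglik(params, lam=0.05):
    """Cumulative logit: P(Y <= j | x) = expit(theta_j - x.beta), with an L1
    penalty on beta smoothed by a tiny epsilon so L-BFGS-B can handle it."""
    theta_raw, beta = params[:J - 1], params[J - 1:]
    theta = np.cumsum(np.concatenate([[theta_raw[0]], np.exp(theta_raw[1:])]))  # ordered cutpoints
    cum = expit(theta[None, :] - (X @ beta)[:, None])              # n x (J-1) cumulative probs
    cum = np.column_stack([np.zeros(n), cum, np.ones(n)])          # pad with 0 and 1
    prob = np.clip(cum[np.arange(n), y + 1] - cum[np.arange(n), y], 1e-12, None)
    return -np.sum(np.log(prob)) + lam * n * np.sum(np.sqrt(beta ** 2 + 1e-8))

fit = minimize(neg_penalized_loglik, np.zeros(J - 1 + p), method="L-BFGS-B")
beta_hat = fit.x[J - 1:]
print("coefficients shrunk (near) to zero:", int(np.sum(np.abs(beta_hat) < 1e-2)), "of", p)
```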
