51 |
Penalized methods and algorithms for high-dimensional regression in the presence of heterogeneity. Yi, Congrui, 01 December 2016.
In fields such as statistics, economics, and biology, heterogeneity is an important issue concerning the validity of data inference and the discovery of hidden patterns. This thesis focuses on penalized methods for regression analysis in the presence of heterogeneity in a potentially high-dimensional setting. Two possible strategies for dealing with heterogeneity are: robust regression methods that provide heterogeneity-resistant coefficient estimation, and direct detection of heterogeneity while simultaneously estimating coefficients accurately.
We consider the first strategy for two robust regression methods, Huber loss regression and quantile regression with Lasso or Elastic-Net penalties, which have been studied theoretically but lack efficient algorithms. We propose a new algorithm, Semismooth Newton Coordinate Descent (SNCD), to solve them. The algorithm is a novel combination of the semismooth Newton method and coordinate descent that applies to penalized optimization problems with both a nonsmooth loss and a nonsmooth penalty. We prove its convergence properties and show its computational efficiency through numerical studies.
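The SNCD algorithm combines semismooth Newton steps with coordinate descent; as a simple point of reference, the same penalized Huber objective can also be minimized with a plain proximal-gradient (ISTA) iteration. The sketch below is only such a baseline under assumed settings (the Huber threshold, penalty level, and toy data are illustrative choices, not values from the thesis).

```python
import numpy as np

def huber_lasso_ista(X, y, lam, delta=1.345, n_iter=500):
    """Minimize (1/n) * sum huber_delta(y - X b) + lam * ||b||_1 by
    proximal gradient (ISTA). An illustrative baseline, not SNCD."""
    n, p = X.shape
    beta = np.zeros(p)
    # Step size from the Lipschitz constant of the smooth Huber part.
    L = np.linalg.norm(X, 2) ** 2 / n
    t = 1.0 / L
    for _ in range(n_iter):
        r = y - X @ beta
        psi = np.clip(r, -delta, delta)          # derivative of the Huber loss
        grad = -X.T @ psi / n
        z = beta - t * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)  # soft-threshold
    return beta

# Toy usage with heavy-tailed noise
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
b_true = np.zeros(50); b_true[:5] = 2.0
y = X @ b_true + rng.standard_t(df=2, size=200)
print(huber_lasso_ista(X, y, lam=0.1)[:8].round(2))
```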
We also propose a nonconvex penalized regression method, Heterogeneity Discovery Regression (HDR), as a realization of the second strategy. We establish theoretical results that guarantee statistical precision for any local optimum of the objective function with high probability. We also compare the numerical performance of HDR with competitors, including Huber loss regression, quantile regression, and least squares, through simulation studies and a real data example. In these experiments, HDR is able to detect heterogeneity accurately and largely outperforms the competitors in terms of coefficient estimation and variable selection.
|
52 |
Grouped variable selection in high dimensional partially linear additive Cox model. Liu, Li, 01 December 2010.
In the analysis of survival outcomes supplemented with both clinical information and high-dimensional gene expression data, the traditional Cox proportional hazards model fails to meet some emerging needs of biological research. First, the number of covariates is generally much larger than the sample size. Second, predicting an outcome from individual gene expressions is inadequate because a gene's expression is regulated by multiple biological processes and functional units. There is a need to understand the impact of changes at a higher level, such as a molecular function, cellular component, biological process, or pathway. A change at a higher level is usually measured with a set of gene expressions related to the biological process. That is, we need to model the outcome with gene sets as variable groups, and the gene sets may also partially overlap.
In this thesis work, we investigate a penalized Cox regression procedure for regularization, parameter estimation, variable group selection, and nonparametric modeling of nonlinear effects with a time-to-event outcome.
We formulate the problem as a partially linear additive Cox model with high-dimensional data. We group genes into gene sets and approximate the nonparametric components by truncated series expansions with B-spline bases. After grouping and approximation, variable selection becomes the selection of groups of coefficients, each group corresponding to a gene set or to an approximation. We apply the group Lasso to obtain an initial solution path and reduce the dimension of the problem, and then update the whole solution path with the adaptive group Lasso. We also propose a generalized group Lasso method that provides more freedom in specifying the penalty and allows covariates to be excluded from penalization.
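At the core of a group Lasso update is a group soft-thresholding step that shrinks an entire block of coefficients (a gene set or a B-spline approximation) toward zero, possibly setting it exactly to zero. The snippet below is a generic illustration of that proximal operation with hypothetical block values and penalty level; it is not the thesis's Cox-model implementation.

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2 for one coefficient group:
    shrink the whole block toward zero, possibly to exactly zero."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v

# Example: three coefficient blocks, e.g. B-spline coefficients for three gene sets
groups = [np.array([0.4, -0.1, 0.3]),
          np.array([0.05, 0.02, -0.03]),
          np.array([1.2, 0.8, -0.6, 0.9])]
lam = 0.3
for g in groups:
    print(np.round(g, 2), "->", np.round(group_soft_threshold(g, lam), 2))
```

The middle block falls entirely to zero because its norm is below the penalty level, which is how whole gene sets are removed from the model.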
A modified Newton-Raphson method is designed for stable and rapid computation. The core programs are written in the C language, and a user-friendly R interface is implemented to perform all the calculations by calling the core programs.
We demonstrate the asymptotic properties of the proposed methods. Simulation studies are carried out to evaluate the finite sample performance of the proposed procedure, using several tuning parameter selection methods for choosing the point on the solution path as the final estimator. We also apply the proposed approach to two real data examples.
|
53 |
Marginal false discovery rate approaches to inference on penalized regression models. Miller, Ryan, 01 August 2018.
Data containing large numbers of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets due to their ability to naturally perform variable selection. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency of the most predictive models, for example those chosen using procedures like cross-validation, to include substantial numbers of noise variables with no real relationship to the outcome. To address the task of performing inference on penalized regression models, this thesis proposes marginal false discovery rate approaches for a broad class of penalized regression models. This work includes the development of an upper bound on the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; to a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and they are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods through numerous simulation studies, their practical utility is demonstrated using real data from several high-dimensional genome-wide association studies.
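A small simulation makes the concern above concrete: a lasso model tuned by cross-validation tends to carry noise variables along with the true signals. The sketch below uses hypothetical dimensions and scikit-learn's LassoCV purely to illustrate that tendency; it does not implement the marginal false discovery rate methodology.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p, p_signal = 100, 200, 6
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:p_signal] = 1.0
y = X @ beta + rng.standard_normal(n)

# Choose the penalty by cross-validation (the "most predictive" model)
fit = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_ != 0)
false_sel = [j for j in selected if j >= p_signal]
print(f"selected {selected.size} variables, {len(false_sel)} of them noise")
```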
|
54 |
High-dimensional data analysis: optimal metrics and feature selection. François, Damien, 10 January 2007.
High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many of the data analysis tools at our disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data, and many of the explicit or implicit assumptions made while developing these classical tools do not carry over to high-dimensional data. For instance, many tools rely on the Euclidean distance to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements seem identical. The Euclidean distance is furthermore incapable of distinguishing important attributes from irrelevant ones. This thesis therefore focuses on the choice of a relevant distance function for comparing high-dimensional data and on the selection of the relevant attributes. In Part One of the thesis, the phenomenon of distance concentration is considered and its consequences for data analysis tools are studied. It is shown that for nearest neighbour search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; it is thus proposed to use fractional metrics and generalised Gaussian kernels. Part Two of the thesis focuses on the problem of feature selection in the case of a large number of initial features. Two methods are proposed to (1) reduce the computational burden of the feature selection process and (2) cope with the instability induced by the high correlation between features that often appears in high-dimensional data. Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysing the spectrum of the light it reflects.
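The concentration phenomenon is easy to observe numerically: the relative contrast between the farthest and nearest point from a query shrinks as the dimension grows, while fractional (p < 1) metrics tend to retain more contrast. The following sketch, with assumed uniform data and an arbitrary choice of p = 0.5, is only an illustration of that effect.

```python
import numpy as np

def relative_contrast(X, q, p):
    """(d_max - d_min) / d_min for Minkowski-p dissimilarities from query q."""
    d = np.sum(np.abs(X - q) ** p, axis=1) ** (1.0 / p)
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, dim))
    q = rng.uniform(size=dim)
    c_euclid = relative_contrast(X, q, p=2.0)   # Euclidean
    c_frac = relative_contrast(X, q, p=0.5)     # fractional metric
    print(f"dim={dim:5d}  contrast L2={c_euclid:6.3f}  L0.5={c_frac:6.3f}")
```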
|
55 |
Exploring Bit-Difference for Approximate KNN Search in High-dimensional Databases. Cui, Bin; Shen, Heng Tao; Shen, Jialie; Tan, Kian Lee, 01 1900.
In this paper, we develop a novel index structure to support efficient approximate k-nearest neighbor (KNN) queries in high-dimensional databases. In high-dimensional spaces, the computational cost of the distance (e.g., Euclidean distance) between two points contributes a dominant portion of the overall query response time for in-memory processing. To reduce the distance computation, we first propose a structure, BID, that uses BIt-Difference to answer approximate KNN queries. BID employs one bit to represent each feature of a point, and the number of differing bits is used to prune distant points. To accommodate real datasets, which are typically skewed, we enhance the BID mechanism with clustering, a cluster-adapted bit coder, and dimensional weights, yielding BID⁺. Extensive experiments show that the proposed method yields significant performance advantages over existing index structures on both real-life and synthetic high-dimensional datasets.
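A rough sketch of the bit-difference idea (not the paper's exact BID/BID⁺ structures): encode each dimension of a point as one bit relative to a per-dimension threshold, prune candidates by the Hamming distance between bit signatures, and compute exact distances only for the survivors. The median thresholds and candidate fraction below are hypothetical choices.

```python
import numpy as np

def build_signatures(X, thresholds):
    """One bit per dimension: is the coordinate above its threshold?"""
    return X > thresholds  # boolean matrix, n x d

def approx_knn(X, q, k, thresholds, candidate_frac=0.05):
    sig_X = build_signatures(X, thresholds)
    sig_q = q > thresholds
    # Bit-difference (Hamming distance) between the query and every point
    bit_diff = np.count_nonzero(sig_X != sig_q, axis=1)
    # Keep only the candidates with the fewest differing bits
    n_cand = max(k, int(candidate_frac * len(X)))
    cand = np.argsort(bit_diff)[:n_cand]
    # Exact Euclidean distance on the small candidate set only
    d = np.linalg.norm(X[cand] - q, axis=1)
    return cand[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 64))
q = rng.standard_normal(64)
thresholds = np.median(X, axis=0)   # simple per-dimension bit coder
print(approx_knn(X, q, k=5, thresholds=thresholds))
```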
|
57 |
Bayesian Modeling and Computation for Mixed Data. Cui, Kai, January 2012.
Multivariate or high-dimensional data with mixed types are ubiquitous in many fields of study, including science, engineering, social science, finance, health, and medicine, and joint analysis of such data requires both statistical models flexible enough to accommodate them and novel methodologies for computationally efficient inference. Such joint analysis is potentially advantageous in many statistical and practical respects, including shared information, dimension reduction, efficiency gains, increased power, and better control of error rates.

This thesis mainly focuses on two types of mixed data: (i) mixed discrete and continuous outcomes, especially in a dynamic setting; and (ii) multivariate or high-dimensional continuous data with potential non-normality, where each dimension may have a different degree of skewness and different tail behavior. Flexible Bayesian models are developed to jointly model these types of data, with a particular interest in exploring and utilizing the factor-model framework. Much emphasis is also placed on the ability to scale the statistical approaches and computation efficiently to problems with long mixed time series or increasingly high-dimensional heavy-tailed and skewed data.

To this end, Chapter 1 reviews the challenges posed by mixed data. Chapter 2 develops generalized dynamic factor models for mixed-measurement time series; the framework allows mixed-scale measurements in different time series, with the different measurements having distributions in the exponential family conditional on time-specific dynamic latent factors. Efficient computational algorithms for Bayesian inference are developed that can easily be extended to long time series. Chapter 3 focuses on jointly modeling high-dimensional data with potential non-normality, where the mixed skewness and tail behaviors in different dimensions are accurately captured via the proposed heavy-tailed and skewed factor models. Chapter 4 further explores the properties of, and efficient Bayesian inference for, the generalized semiparametric Gaussian variance-mean mixture family, and introduces it as a potentially useful family for modeling multivariate heavy-tailed and skewed data.
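A normal variance-mean mixture of the kind referred to in Chapter 4 represents an observation as y = mu + beta*W + sqrt(W)*Z, with Z standard normal and W a positive mixing variable, so beta controls skewness and the mixing distribution controls tail weight. The snippet below simulates such data under assumed parameter values and an inverse-gamma mixing choice; it is purely illustrative and not the thesis's inference procedure.

```python
import numpy as np

def sample_nvm_mixture(n, mu, beta, mix_sampler, rng):
    """Draw n samples of y = mu + beta * W + sqrt(W) * Z,
    a normal variance-mean mixture with positive mixing variable W."""
    W = mix_sampler(n)
    Z = rng.standard_normal(n)
    return mu + beta * W + np.sqrt(W) * Z

rng = np.random.default_rng(2)
# Inverse-gamma mixing gives heavier-than-normal tails; beta adds skewness.
inv_gamma = lambda n: 1.0 / rng.gamma(shape=3.0, scale=1.0, size=n)
y_heavy = sample_nvm_mixture(10000, mu=0.0, beta=0.0, mix_sampler=inv_gamma, rng=rng)
y_skew = sample_nvm_mixture(10000, mu=0.0, beta=1.5, mix_sampler=inv_gamma, rng=rng)

for name, y in [("beta=0.0 (symmetric, heavy tails)", y_heavy),
                ("beta=1.5 (right-skewed)", y_skew)]:
    print(f"{name}: P(|y| > 4) = {np.mean(np.abs(y) > 4):.4f}, "
          f"mean - median = {y.mean() - np.median(y):+.2f}")
```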
|
58 |
Bayesian Sparse Learning for High Dimensional Data. Shi, Minghui, January 2011.
In this thesis, we develop Bayesian sparse learning methods for high-dimensional data analysis. Two important topics are related to the idea of sparse learning: variable selection and factor analysis. We start with the Bayesian variable selection problem in regression models. One challenge in Bayesian variable selection is to search the huge model space adequately while identifying high posterior probability regions. In past decades, the main focus has been on the use of Markov chain Monte Carlo (MCMC) algorithms for these purposes. In the first part of this thesis, instead of using MCMC, we propose a new computational approach based on sequential Monte Carlo (SMC), which we refer to as particle stochastic search (PSS). We illustrate PSS through applications to linear regression and probit models.

Besides Bayesian stochastic search algorithms, there is a rich literature on shrinkage and variable selection methods for high-dimensional regression and classification with vector-valued parameters, such as the lasso (Tibshirani, 1996) and the relevance vector machine (Tipping, 2001). Compared with Bayesian stochastic search algorithms, these methods do not account for model uncertainty but are more computationally efficient. In the second part of this thesis, we generalize this type of idea to matrix-valued parameters and focus on developing an efficient variable selection method for multivariate regression. We propose a Bayesian shrinkage model (BSM) and an efficient algorithm for learning the associated parameters.

In the third part of this thesis, we focus on factor analysis, which has been widely used in unsupervised learning. One central problem in factor analysis is the determination of the number of latent factors. We propose Bayesian model selection criteria for selecting the number of latent factors based on a graphical factor model. As illustrated in Chapter 4, the proposed method achieves good performance in correctly selecting the number of factors in several different settings. As for applications, we implement the graphical factor model for several purposes, such as covariance matrix estimation, latent factor regression, and classification.
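Particle stochastic search propagates many candidate models with sequential Monte Carlo; as a much simpler point of comparison, the sketch below runs a single-chain Metropolis-style stochastic search over inclusion indicators, scoring models by BIC as a rough stand-in for the marginal likelihood. All settings are hypothetical, and this is not the PSS algorithm itself.

```python
import numpy as np

def bic(X, y, gamma):
    """BIC of the least-squares fit using the columns flagged in gamma."""
    n = len(y)
    k = int(gamma.sum())
    if k == 0:
        rss = np.sum(y ** 2)                     # null model, no predictors
    else:
        Xs = X[:, gamma]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

def stochastic_search(X, y, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = np.zeros(p, dtype=bool)
    score = bic(X, y, gamma)
    best = (score, gamma.copy())
    for _ in range(n_steps):
        prop = gamma.copy()
        j = rng.integers(p)
        prop[j] = not prop[j]                    # flip one inclusion indicator
        new_score = bic(X, y, prop)
        # Metropolis-style acceptance on the BIC scale
        if np.log(rng.uniform()) < (score - new_score) / 2:
            gamma, score = prop, new_score
            if score < best[0]:
                best = (score, gamma.copy())
    return best

rng = np.random.default_rng(1)
X = rng.standard_normal((120, 30))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(120)
score, gamma = stochastic_search(X, y)
print("selected columns:", np.flatnonzero(gamma), "BIC:", round(score, 1))
```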
|
59 |
Bayesian Semi-parametric Factor Models. Bhattacharya, Anirban, January 2012.
Identifying a lower-dimensional latent space for the representation of high-dimensional observations is of significant importance in numerous biomedical and machine learning applications. In many such applications, it is now routine to collect data where the dimensionality of the outcomes is comparable to, or even larger than, the number of available observations. Motivated in particular by the problem of predicting the risk of impending diseases from massive gene expression and single nucleotide polymorphism profiles, this dissertation focuses on building parsimonious models and computational schemes for high-dimensional continuous and unordered categorical data, while also studying theoretical properties of the proposed methods. Sparse factor modeling is fast becoming a standard tool for parsimonious modeling of such massive-dimensional data, and the content of this thesis is specifically directed towards methodological and theoretical developments in Bayesian sparse factor models.

The first three chapters of the thesis study sparse factor models for high-dimensional continuous data. A class of shrinkage priors on factor loadings is introduced with attractive computational properties, and their operating characteristics are explored through a number of simulated and real data examples. In spite of the methodological advances over the past decade, theoretical justifications for high-dimensional factor models are scarce in the Bayesian literature. Part of the dissertation therefore focuses on estimating high-dimensional covariance matrices using a factor model and studying the rate of posterior contraction as both the sample size and the dimensionality increase. To relax the usual assumption of a linear relationship between the latent and observed variables in a standard factor model, extensions to a non-linear latent factor model are also considered.

Although Gaussian latent factor models are routinely used for modeling dependence in continuous, binary, and ordered categorical data, they lead to challenging computation and complex modeling structures for unordered categorical variables. As an alternative, a novel class of simplex factor models for massive-dimensional and enormously sparse contingency table data is proposed in the second part of the thesis. An efficient MCMC scheme is developed for posterior computation, and the methods are applied to modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features. Building on a connection between the proposed model and sparse tensor decompositions, we propose new classes of nonparametric Bayesian models for testing associations between a massive-dimensional vector of genetic markers and a phenotypic outcome.
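In a sparse factor model the p x p covariance is represented as Lambda Lambda' + Sigma, with a p x k loadings matrix Lambda containing many zeros, which is what makes covariance estimation tractable when p exceeds n. The simulation below, with hypothetical dimensions and sparsity level, simply generates data from such a model and shows how unstable the raw sample covariance is in that regime; it does not implement the shrinkage priors proposed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k, n = 200, 5, 100            # more variables than observations

# Sparse loadings: each variable loads on only a few factors
Lambda = rng.standard_normal((p, k)) * (rng.uniform(size=(p, k)) < 0.1)
noise_var = rng.uniform(0.5, 1.5, size=p)
Cov_true = Lambda @ Lambda.T + np.diag(noise_var)

# Generate data: y_i = Lambda eta_i + eps_i
eta = rng.standard_normal((n, k))
eps = rng.standard_normal((n, p)) * np.sqrt(noise_var)
Y = eta @ Lambda.T + eps

Cov_sample = np.cov(Y, rowvar=False)
err = (np.linalg.norm(Cov_sample - Cov_true, ord="fro")
       / np.linalg.norm(Cov_true, ord="fro"))
print(f"p={p}, n={n}: relative Frobenius error of sample covariance = {err:.2f}")
print("sample covariance rank:", np.linalg.matrix_rank(Cov_sample),
      "(at most n - 1 =", n - 1, ")")
```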
|
60 |
Causality and aggregation in economics: the use of high dimensional panel data in micro-econometrics and macro-econometrics. Kwon, Dae-Heum, 15 May 2009.
This study proposes a procedure to address two methodological issues, common in micro- and macro-econometric analyses, that stand in the way of fully realizing the research potential of recently available high-dimensional data. To address the issue of how to infer causal structure from empirical regularities, graphical causal models are proposed to inductively infer causal structure from non-temporal and non-experimental data. However, the (probabilistic) stability condition for graphical causal models can be violated for high-dimensional data, given that close co-movements, and thus near-deterministic relations, are often observed among variables in such data. Aggregation methods are proposed as one possible way to address this matter, allowing one to infer causal relationships among disaggregated variables based on aggregated variables. Aggregation methods are also helpful for the issue of how to incorporate a large information set into an empirical model, given that econometric considerations, such as degrees of freedom and multicollinearity, require an economy of parameters in empirical models. However, actual aggregation requires legitimate classifications for interpretable and consistent aggregation.

Based on a generalized condition for consistent and interpretable aggregation, derived from aggregation theory and statistical dimension-reduction methods, we propose a methodological procedure that addresses the two related issues of causal inference and actual aggregation in a consistent way. Additional issues for empirical studies in micro-economics and macro-economics are also discussed. The proposed procedure provides inductive guidance on the specification choice among direct, inverse, and mixed demand systems, and a statistically supported inverse demand system is identified for consumer behavior in soft drink consumption. The procedure also provides a way to incorporate a large information set into an empirical model while allowing a structural understanding of the U.S. macro-economy, which was difficult to obtain with the previously used factor-augmented vector autoregressive (FAVAR) framework. The empirical results suggest the plausibility of the proposed method for incorporating large information sets into empirical studies by inductively addressing the multicollinearity problem in high-dimensional data.
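One way to read the aggregation step is as replacing a block of closely co-moving disaggregated series with a few common components before any causal search or regression is run, which relieves the multicollinearity and degrees-of-freedom problems noted above. The sketch below, with simulated series and an arbitrary choice of three principal components, illustrates that idea; it is not the thesis's specific aggregation or FAVAR procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
T, N = 200, 120                        # time periods, disaggregated series
common = rng.standard_normal((T, 3))   # three latent driving factors
loadings = rng.standard_normal((3, N))
panel = common @ loadings + 0.3 * rng.standard_normal((T, N))  # near-collinear panel

# Multicollinearity of the disaggregated regressor matrix
Xd = panel - panel.mean(axis=0)
print("condition number, disaggregated:", round(np.linalg.cond(Xd), 1))

# Aggregate via principal components (first few left singular vectors)
U, s, Vt = np.linalg.svd(Xd, full_matrices=False)
n_comp = 3                             # arbitrary choice for illustration
factors = U[:, :n_comp] * s[:n_comp]
print("condition number, aggregated factors:", round(np.linalg.cond(factors), 1))
print("share of variance captured:", round((s[:n_comp] ** 2).sum() / (s ** 2).sum(), 3))
```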
|