  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Principal Components Analysis for Binary Data

Lee, Seokho 2009 May 1900
Principal components analysis (PCA) has been widely used as a statistical tool for dimension reduction of multivariate data in various application areas and has been studied extensively throughout the history of statistics. One limitation of PCA is that it can be applied only to continuous variables. Recent advances in information technology in various applied areas have created numerous large, diverse data sets with high-dimensional feature spaces, including high-dimensional binary data. Despite this demand, only a few methodologies tailored to such binary data sets have been suggested. The methodologies we developed take a model-based approach to generalizing PCA to binary data. We developed a statistical model for binary PCA and proposed two stable estimation procedures, one using the MM algorithm and one using a variational method. With a regularization technique, the selection of important variables is achieved automatically. We also proposed an efficient algorithm for model selection, including the choice of the number of principal components and the regularization parameter.
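The abstract above does not spell out its model, so the following is only a minimal sketch of one common way to set up binary PCA with an MM algorithm, not the thesis's procedure. It assumes a rank-one Bernoulli model whose logit for entry (i, j) is a_i * b_j, with no mean offset and no regularization, and it majorizes the logistic loss with the standard quadratic bound (curvature at most 1/4). The function names (`mm_step`, `fit_rank_one`) are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def neg_log_lik(X, a, b):
    """Bernoulli negative log-likelihood with logit theta_ij = a_i * b_j."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            t = (2 * x - 1) * a[i] * b[j]
            # Stable evaluation of log(1 + exp(-t)).
            total += max(-t, 0.0) + math.log1p(math.exp(-abs(t)))
    return total


def mm_step(X, a, b):
    """One MM update: build the quadratic surrogate, then decrease it."""
    n, d = len(X), len(X[0])
    # Working response from the bound log(1+e^-t) <= f(t0) + f'(t0)(t-t0) + (t-t0)^2 / 8.
    Z = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for j in range(d):
            q = 2 * X[i][j] - 1
            theta = a[i] * b[j]
            Z[i][j] = theta + 4.0 * q * (1.0 - sigmoid(q * theta))
    # Alternating least squares on sum_ij (Z_ij - a_i * b_j)^2.
    bb = sum(v * v for v in b)
    if bb > 0:
        a = [sum(Z[i][j] * b[j] for j in range(d)) / bb for i in range(n)]
    aa = sum(v * v for v in a)
    if aa > 0:
        b = [sum(Z[i][j] * a[i] for i in range(n)) / aa for j in range(d)]
    return a, b


def fit_rank_one(X, n_iter=100, seed=0):
    """Fit scores a and loadings b; also return the loss after every step."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 0.01) for _ in X]
    b = [rng.gauss(0.0, 0.01) for _ in X[0]]
    history = [neg_log_lik(X, a, b)]
    for _ in range(n_iter):
        a, b = mm_step(X, a, b)
        history.append(neg_log_lik(X, a, b))
    return a, b, history
```

Because each step only lowers a surrogate that lies above the true loss and touches it at the current iterate, the negative log-likelihood in `history` should never increase, which is the kind of stability the abstract attributes to MM-based estimation.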
2

Dimension reduction methods for nonlinear association analysis with applications to omics data

Wu, Peitao 06 November 2021
With advances in high-throughput techniques, the availability of large-scale omics data has revolutionized the fields of medicine and biology, and has offered a better understanding of the underlying biological mechanisms. However, the high-dimensionality and the unknown association structure between different data types make statistical integration analyses challenging. In this dissertation, we develop three dimensionality reduction methods to detect nonlinear association structure using omics data. First, we propose a method for variable selection in a nonparametric additive quantile regression framework. We enforce a network regularization to incorporate information encoded by known networks. To account for nonlinear associations, we approximate the additive functional effect of each predictor with the expansion of a B-spline basis. We implement the group Lasso penalty to achieve sparsity. We define the network-constrained penalty by regulating the difference between the effect functions of any two linked genes (predictors) in the network. Simulation studies show that our proposed method performs well in identifying truly associated genes with fewer falsely associated genes than alternative approaches. Second, we develop a canonical correlation analysis (CCA)-based method, canonical distance correlation analysis (CDCA), and leverage the distance correlation to capture the overall association between two sets of variables. The CDCA allows untangling linear and nonlinear dependence structures. Third, we develop the sparse CDCA (sCDCA) method to achieve sparsity and improve result interpretability by adding penalties on the loadings from the CDCA. The sCDCA method can be applied to data with large dimensionality and small sample size. We develop iterative majorization-minimization-based coordinate descent algorithms to compute the loadings in the CDCA and sCDCA methods. 
Simulation studies show that the proposed CDCA and sCDCA approaches have better performance than classical CCA and sparse CCA (sCCA) in nonlinear settings and have similar performance in linear association settings. We apply the proposed methods to the Framingham Heart Study (FHS) to identify body mass index associated genes, the association structure between metabolic disorders and metabolite profiles, and a subset of metabolites and their associated type 2 diabetes (T2D)-related genes. / 2023-11-05T00:00:00Z
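The abstract above relies on distance correlation to capture nonlinear dependence. As a minimal sketch (not the CDCA or sCDCA algorithms themselves), the sample distance correlation of two univariate samples can be computed from double-centered pairwise distance matrices. The function names here are hypothetical, and `pearson` is included only for comparison.

```python
import math


def _centered_distances(v):
    """Pairwise |v_i - v_j| with row, column and grand means removed."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d]  # equals the column means by symmetry
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def _mean_product(a, b):
    """Average of the elementwise product of two square matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length 1-D samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)  # squared distance covariance
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


def pearson(x, y):
    """Ordinary Pearson correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)
```

For `x = [-2, -1, 0, 1, 2]` and `y = [v * v for v in x]`, the Pearson correlation is zero while the distance correlation is clearly positive, which illustrates why a distance-correlation-based criterion can detect associations that a linear criterion such as classical CCA misses.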
3

High-dimensional inference of ordinal data with medical applications

Jiao, Feiran 01 May 2016
Ordinal response variables abound in scientific and quantitative analyses. Their outcomes comprise a few categorical values that admit a natural ordering, so they are often represented by non-negative integers, for instance, pain score (0-10) or disease severity (0-4) in medical research. Ordinal variables differ from rational variables in that their values delineate qualitative rather than quantitative differences. In this thesis, we develop new statistical methods for variable selection in a high-dimensional cumulative link regression model with an ordinal response. Our study is partly motivated by the need to explore the association structure between disease phenotype and high-dimensional medical covariates. The cumulative link regression model specifies that the ordinal response of interest results from an order-preserving quantization of some latent continuous variable that bears a linear regression relationship with a set of covariates. Commonly used error distributions in the latent regression include the normal distribution, the logistic distribution, the Cauchy distribution and the standard Gumbel distribution (minimum). The cumulative link model with normal (logistic, Gumbel) errors is also known as the ordered probit (logit, complementary log-log) model. While the likelihood function has a closed form for the aforementioned error distributions, its strong nonlinearity can cause direct optimization of the likelihood to fail. To mitigate this problem and to facilitate extension to penalized likelihood estimation, we proposed specific minorization-maximization (MM) algorithms for maximum likelihood estimation of a cumulative link model for each of the preceding 4 error distributions. Penalized ordinal regression models play a role when variable selection needs to be performed.
In some applications, covariates may be grouped in some meaningful way, but some groups may be mixed in that they contain both relevant and irrelevant variables, i.e., variables whose coefficients are non-zero and zero, respectively. Thus, it is pertinent to develop a consistent method for simultaneously selecting the relevant groups and the relevant variables within each selected group, which constitutes the so-called bi-level selection problem. We have proposed a penalized maximum likelihood approach with a composite bridge penalty to solve the bi-level selection problem in a cumulative link model. An MM algorithm, specific to each of the 4 error distributions, was developed to implement the proposed method. The proposed approach is shown to enjoy a number of desirable theoretical properties, including bi-level selection consistency and oracle properties, under suitable regularity conditions. Simulations demonstrate that the proposed method enjoys good empirical performance. We illustrated the proposed methods with several real medical applications.
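The cumulative link model described in the abstract above can be illustrated with a short sketch. It assumes the usual parameterization P(Y <= k) = F(theta_k - x'beta), where F is the error distribution function and the thresholds theta_k are strictly increasing. The function and dictionary names are hypothetical, and the code covers only the model's category probabilities, not the thesis's MM estimation or penalization.

```python
import math


def _logistic_cdf(z):
    """Logistic distribution function, evaluated stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def _cauchy_cdf(z):
    return 0.5 + math.atan(z) / math.pi


def _gumbel_min_cdf(z):
    # Standard Gumbel (minimum); the cap on z avoids overflow in exp.
    return 1.0 - math.exp(-math.exp(min(z, 700.0)))


# Link name -> distribution function of the latent regression error.
ERROR_CDFS = {
    "probit": _normal_cdf,
    "logit": _logistic_cdf,
    "cauchit": _cauchy_cdf,
    "cloglog": _gumbel_min_cdf,
}


def category_probabilities(x, beta, thresholds, link="logit"):
    """Return P(Y = k) for k = 0..K, given K increasing thresholds."""
    if any(t2 <= t1 for t1, t2 in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    cdf = ERROR_CDFS[link]
    eta = sum(xi * bi for xi, bi in zip(x, beta))  # linear predictor x'beta
    cumulative = [cdf(t - eta) for t in thresholds] + [1.0]
    probs, prev = [], 0.0
    for c in cumulative:
        probs.append(c - prev)  # P(Y = k) = P(Y <= k) - P(Y <= k - 1)
        prev = c
    return probs
```

For example, `category_probabilities([0.0], [0.0], [0.0], link="logit")` gives two categories with probability 0.5 each, and increasing the linear predictor shifts probability mass toward the higher categories.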
