Global ETD Search

1	Variable selection for generalized linear mixed models and non-Gaussian Genome-wide associated study data Xu, Shuangshuang 11 June 2024 (has links) A genome-wide association study (GWAS) aims to identify single nucleotide polymorphisms (SNPs) associated with phenotypes. The number of SNPs typically ranges from hundreds of thousands to millions. If p is the number of SNPs and n is the sample size, this is a p>>n variable selection problem. The common way to handle this p>>n problem in GWAS is single marker analysis (SMA). However, because SNPs are highly correlated, SMA identifies true causal SNPs only at the cost of a high false discovery rate. In addition, SMA does not consider interactions between SNPs. In this dissertation, we propose novel Bayesian variable selection methods, BG2 and IBG3, for non-Gaussian GWAS data. To address the ultra-high dimensionality and the high correlation among SNPs, BG2 and IBG3 have two steps: a screening step and a fine-mapping step. In the screening step, BG2 and IBG3, like SMA, fit one SNP per model and screen to obtain a subset of the most associated SNPs. In the fine-mapping step, BG2 and IBG3 consider all possible combinations of the screened candidate SNPs to find the best model. The fine-mapping step helps to reduce false positives. In addition, IBG3 iterates these two steps to detect more SNPs with small effect sizes. In simulation studies, we compare our methods with SMA methods and fine-mapping methods. We also compare our methods under different priors for the variables, including the nonlocal prior, unit information prior, Zellner's g-prior, and Zellner-Siow prior. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures). / Doctor of Philosophy / A genome-wide association study (GWAS) aims to identify genomic variants associated with a target phenotype, such as a disease or trait. 
The genomic variants of interest are single nucleotide polymorphisms (SNPs). A SNP is a substitution mutation in the DNA sequence. GWAS addresses the question of which SNPs are associated with the phenotype. However, the number of possible SNPs ranges from hundreds of thousands to millions. The common method for GWAS is called single marker analysis (SMA). SMA considers the association of only one SNP with the phenotype at a time. In this way, SMA avoids the problem that arises from a large number of SNPs and a small sample size. However, SMA does not consider interactions between SNPs. In addition, SNPs that are close to each other in the DNA sequence may be highly correlated, causing SMA to have a high false discovery rate. To solve these problems, this dissertation proposes two variable selection methods (BG2 and IBG3) for non-Gaussian GWAS data. Compared with SMA methods, BG2 and IBG3 detect true causal SNPs with a low false discovery rate. In addition, IBG3 can detect SNPs with small effect sizes. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures). GLMM GWAS Non Gaussian data Bayesian variable selection
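Editorial illustration for entry 1: the abstract describes a screening step (one SNP per model) followed by a fine-mapping step (all combinations of the screened SNPs). The sketch below shows that two-step pattern only. It is not BG2 or IBG3: as a loudly stated assumption it swaps in an ordinary Gaussian linear model scored by BIC in place of the generalized linear mixed models and Bayesian priors the dissertation uses, and the function names, simulated data, and `keep=5` cutoff are all hypothetical.

```python
import itertools
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def bic(X, y, cols):
    """BIC of a least-squares fit of y on an intercept plus the chosen columns.

    Stand-in score (assumption): the dissertation uses Bayesian model
    comparison for non-Gaussian data, not BIC on a Gaussian model.
    """
    n = len(y)
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    k = len(cols) + 1
    gram = [[sum(z[r] * z[c] for z in Z) for c in range(k)] for r in range(k)]
    rhs = [sum(z[r] * yi for z, yi in zip(Z, y)) for r in range(k)]
    beta = solve(gram, rhs)
    rss = sum((yi - sum(bj * zj for bj, zj in zip(beta, z))) ** 2
              for z, yi in zip(Z, y))
    return n * math.log(rss / n) + k * math.log(n)


def screen(X, y, keep):
    """Step 1: fit one variable per model and keep the `keep` best-scoring ones."""
    p = len(X[0])
    return sorted(range(p), key=lambda j: bic(X, y, [j]))[:keep]


def fine_map(X, y, candidates):
    """Step 2: score every subset of the screened candidates, return the best."""
    subsets = (s for r in range(len(candidates) + 1)
               for s in itertools.combinations(candidates, r))
    return sorted(min(subsets, key=lambda s: bic(X, y, list(s))))


# Simulated genotypes coded 0/1/2; columns 3 and 17 are the causal ones.
rng = random.Random(0)
n, p, causal = 200, 30, [3, 17]
X = [[float(rng.choice([0, 1, 2])) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[3] + 2.0 * row[17] + rng.gauss(0.0, 1.0) for row in X]

candidates = screen(X, y, keep=5)
selected = fine_map(X, y, candidates)
print("screened:", candidates, "selected:", selected)
```

The exhaustive search in `fine_map` is only feasible because screening has already cut the candidate set to a handful of variables, which is the reason for the two-step design.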
2	Exact Markov chain Monte Carlo and Bayesian linear regression Bentley, Jason Phillip January 2009 (has links) In this work we investigate the use of perfect sampling methods within the context of Bayesian linear regression. We focus on inference problems related to the marginal posterior model probabilities. Model averaged inference for the response and Bayesian variable selection are considered. Perfect sampling is an alternate form of Markov chain Monte Carlo that generates exact sample points from the posterior of interest. This approach removes the need for burn-in assessment faced by traditional MCMC methods. For model averaged inference, we find the monotone Gibbs coupling from the past (CFTP) algorithm is the preferred choice. This requires the predictor matrix be orthogonal, preventing variable selection, but allowing model averaging for prediction of the response. Exploring choices of priors for the parameters in the Bayesian linear model, we investigate sufficiency for monotonicity assuming Gaussian errors. We discover that a number of other sufficient conditions exist, besides an orthogonal predictor matrix, for the construction of a monotone Gibbs Markov chain. Requiring an orthogonal predictor matrix, we investigate new methods of orthogonalizing the original predictor matrix. We find that a new method using the modified Gram-Schmidt orthogonalization procedure performs comparably with existing transformation methods, such as generalized principal components. Accounting for the effect of using an orthogonal predictor matrix, we discover that inference using model averaging for in-sample prediction of the response is comparable between the original and orthogonal predictor matrix. The Gibbs sampler is then investigated for sampling when using the original predictor matrix and the orthogonal predictor matrix. 
We find that a hybrid method, using a standard Gibbs sampler on the orthogonal space in conjunction with the monotone CFTP Gibbs sampler, provides the fastest computation and convergence to the posterior distribution. We conclude the hybrid approach should be used when the monotone Gibbs CFTP sampler becomes impractical, due to large backwards coupling times. We demonstrate large backwards coupling times occur when the sample size is close to the number of predictors, or when hyper-parameter choices increase model competition. The monotone Gibbs CFTP sampler should be taken advantage of when the backwards coupling time is small. For the problem of variable selection we turn to the exact version of the independent Metropolis-Hastings (IMH) algorithm. We reiterate the notion that the exact IMH sampler is redundant, being a needlessly complicated rejection sampler. We then determine a rejection sampler is feasible for variable selection when the sample size is close to the number of predictors and using Zellner’s prior with a small value for the hyper-parameter c. Finally, we use the example of simulating from the posterior of c conditional on a model to demonstrate how the use of an exact IMH view-point clarifies how the rejection sampler can be adapted to improve efficiency. perfect sampling Bayesian variable selection linear regression exact markov chain monte carlo
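Editorial illustration for entry 2: the abstract relies on monotone coupling from the past (CFTP) to obtain exact draws without burn-in. The sketch below shows the generic monotone CFTP mechanism on a toy chain, a reflecting random walk on {0, ..., k_max}. The toy chain, the function names, and the parameter values are assumptions chosen for brevity; this is not the Gibbs sampler over regression models studied in the thesis.

```python
import random


def update(state, u, k_max, p_up):
    """Monotone update: the same u moves every state in the same direction."""
    if u < p_up:
        return min(state + 1, k_max)
    return max(state - 1, 0)


def monotone_cftp(k_max, p_up, rng):
    """Return one exact draw from the stationary distribution of the walk."""
    us = []       # us[t] drives the step from time -(t + 1) to time -t
    horizon = 1
    while True:
        # Extend further into the past while reusing the randomness
        # already drawn for the more recent steps.
        while len(us) < horizon:
            us.append(rng.random())
        lo, hi = 0, k_max               # minimal and maximal starting states
        for t in reversed(range(horizon)):
            lo = update(lo, us[t], k_max, p_up)
            hi = update(hi, us[t], k_max, p_up)
        if lo == hi:                    # all starting states have coalesced
            return lo
        horizon *= 2                    # not coalesced: start from further back


rng = random.Random(0)
samples = [monotone_cftp(4, 0.3, rng) for _ in range(2000)]
print({k: samples.count(k) / len(samples) for k in range(5)})
```

Because the update is monotone, tracking only the lowest and highest states is enough: once they meet at time 0, every intermediate starting state has met them too. The doubling of `horizon` mirrors the backwards coupling time the abstract mentions; when that time is large, each draw becomes expensive.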
3	Distributed Feature Selection in Large n and Large p Regression Problems Wang, Xiangyu January 2016 (has links) Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arises in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or a Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors. While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. 
The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework. For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework DEME (DECO-message) by leveraging both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable. / Dissertation Statistics Computer science Bayesian variable selection Big data Distributed Embarrassingly Parallel Feature selection Partition
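Editorial illustration for entry 3: the abstract describes the message algorithm as per-subset feature selection, a 'median' feature inclusion index, per-subset coefficient estimation on the selected features, and averaging. The sketch below covers only the two aggregation steps. The function names, the tie rule for an even number of subsets, and the toy numbers are all assumptions made for illustration; the per-subset selection and refitting are taken as given inputs.

```python
from statistics import median


def median_inclusion(inclusions):
    """Feature-wise median of 0/1 inclusion vectors, one vector per subset.

    Assumption: a median of exactly 0.5 (possible with an even number of
    subsets) is treated as "not selected".
    """
    p = len(inclusions[0])
    return [1 if median(v[j] for v in inclusions) > 0.5 else 0 for j in range(p)]


def average_estimates(estimates):
    """Average per-subset coefficient estimates for each selected feature.

    `estimates` holds one dict per subset, mapping feature index to the
    coefficient obtained by refitting on the selected features only.
    """
    m = len(estimates)
    return {j: sum(e[j] for e in estimates) / m for j in estimates[0]}


# Made-up inputs: three row subsets, four features.
inclusions = [
    [1, 0, 1, 0],   # subset 1 selected features 0 and 2
    [1, 1, 0, 0],   # subset 2 selected features 0 and 1
    [1, 0, 1, 0],   # subset 3 selected features 0 and 2
]
index = median_inclusion(inclusions)                 # [1, 0, 1, 0]
selected = [j for j, keep in enumerate(index) if keep]

# Made-up refit results on the selected features, one dict per subset.
refits = [{0: 2.1, 2: -0.9}, {0: 1.9, 2: -1.1}, {0: 2.0, 2: -1.0}]
combined = average_estimates(refits)
print(selected, combined)
```

Only the inclusion vectors and the short coefficient dictionaries move between workers, which is consistent with the abstract's point that the algorithm needs very little communication.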
5	Monitoring and diagnosis of process faults and sensor faults in manufacturing processes Li, Shan 01 January 2008 (has links) The substantial growth in the use of automated in-process sensing technologies creates great opportunities for manufacturers to detect abnormal manufacturing processes and identify the root causes quickly. It is critical to locate and distinguish two types of faults: process faults and sensor faults. Procedures to monitor and diagnose process and sensor mean shift faults are presented under the assumption that the manufacturing processes can be modeled by a linear fault-quality model. A W control chart is developed to monitor the manufacturing process and quickly detect the occurrence of sensor faults. Since the W chart is insensitive to process faults, when it is combined with a U chart, both process faults and sensor faults can be detected and distinguished. A unit-free index referred to as the sensitivity ratio (SR) is defined to measure the sensitivity of the W chart. It is shown that the sensitivity of the W chart is affected by the potential influence of the sensor measurement. A fault diagnosis approach based on Bayesian variable selection is presented to locate the root causes of the abnormal processes. A Minimal Coupled Pattern (MCP) and its degree are defined to denote the coupled structure of a system. When fewer than half of the faults within an MCP occur, which is defined as sparse faults, the proposed fault diagnosis procedure can identify the correct root causes with high probability. Guidelines are provided for hyperparameter selection in the Bayesian hierarchical model. An alternative CML method for hyperparameter selection is also discussed. With the large number of potential process faults and sensor faults, an MCMC method, e.g., the Metropolis-Hastings algorithm, can be applied to approximate the posterior probabilities of candidate models. 
The monitoring and diagnosis procedures are demonstrated and evaluated through an autobody assembly example. Bayesian variable selection control chart fault diagnosis sensor faults Industrial Engineering
6	Bayesian modeling of neuropsychological test scores Du, Mengtian 06 October 2021 (has links) In this dissertation we propose novel Bayesian methods for analyzing patterns of neuropsychological testing. We first focus on situations in which the goal of the analysis is to discover risk factors of cognitive decline using longitudinal assessment of test scores. Variable selection in the Bayesian setting is still challenging, particularly for analysis of longitudinal data. We propose a novel approach to selection of the fixed effects in mixed effect models that combines a backward selection algorithm and a metric based on the posterior credible intervals of the model parameters. The heuristic of this approach is based on searching for those parameters that are most likely to be different from zero based on their posterior credible intervals, without requiring ad hoc approximations of model parameters or informative prior distributions. We show via a simulation study that this approach produces more parsimonious models than other popular criteria such as the Bayesian deviance information criterion. We then apply this approach to test the hypothesis that genotypes of the APOE gene have different effects on the rate of cognitive decline of participants in the Long Life Family Study. In the second part of the dissertation we shift focus to the analysis of neuropsychological tests administered using emerging digital technologies. The challenge of analyzing these data is that for each study participant the test is a data stream that records time and spatial coordinates of the digitally executed test, and the goal is to extract useful and informative univariate summary variables that can be used for analysis. Toward this goal, we propose a novel application of Bayesian Hidden Markov Models to analyze digitally recorded Trail Making Tests. 
Applying the Hidden Markov Model enables us to perform automatic segmentation of the digital data stream and allows us to extract meaningful metrics that relate Trail Making Test performance to other cognitive and physical function test scores. We show that the extracted metrics provide information in addition to the traditionally used scores. / 2023-10-06T00:00:00Z Biostatistics Bayesian hierarchical models Bayesian variable selection Credible intervals Mixed effects models
7	Topics in Sparse Inverse Problems and Electron Paramagnetic Resonance Imaging Som, Subhojit 27 October 2010 (has links) No description available. sparse reconstruction Bayesian variable selection electron paramagnetic resonance imaging EPR oximetry multi-site oximetry parametric EPR
8	Bayesian Modeling of Complex High-Dimensional Data Huo, Shuning 07 December 2020 (has links) With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional complex data in different forms, such as medical images and genomics measurements. However, acquisition of more data does not automatically lead to better knowledge discovery. One needs efficient and reliable analytical tools to extract useful information from complex datasets. The main objective of this dissertation is to develop innovative Bayesian methodologies to enable effective and efficient knowledge discovery from complex high-dimensional data. It contains two parts: the development of computationally efficient functional mixed models and the modeling of data heterogeneity via Dirichlet Diffusion Trees. The first part focuses on tackling the computational bottleneck in Bayesian functional mixed models. We propose a computational framework called the variational functional mixed model (VFMM). This new method facilitates efficient data compression and high-performance computing in basis space. We also propose a new multiple testing procedure in basis space, which can be used to detect significant local regions. The effectiveness of the proposed model is demonstrated through two datasets, a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part is about modeling data heterogeneity by using Dirichlet Diffusion Trees. We propose a Bayesian latent tree model that incorporates covariates of subjects to characterize the heterogeneity and uncover the latent tree structure underlying the data. This innovative model may reveal the hierarchical evolution process through branch structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the model through a simulation study and a real brain tumor dataset. 
/ Doctor of Philosophy / With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional data in different forms, such as engineering signals, medical images, and genomics measurements. However, acquisition of such data does not automatically lead to efficient knowledge discovery. The main objective of this dissertation is to develop novel Bayesian methods to extract useful knowledge from complex high-dimensional data. It has two parts: the development of an ultra-fast functional mixed model and the modeling of data heterogeneity via Dirichlet Diffusion Trees. The first part focuses on developing approximate Bayesian methods in functional mixed models to estimate parameters and detect significant regions. Two datasets demonstrate the effectiveness of the proposed method: a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part focuses on modeling data heterogeneity via Dirichlet Diffusion Trees. The method helps uncover the underlying hierarchical tree structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the method through brain tumor imaging data. Variational Inference Bayesian Variable Selection Functional Mixed Model Parallel Computing Bayesian Hierarchical Clustering Dirichlet Diffusion Tree
9	Wavelet methods and statistical applications: network security and bioinformatics Kwon, Deukwoo 01 November 2005 (has links) Wavelet methods possess versatile properties for statistical applications. We explore the advantages of using wavelets in analyses in two different research areas. First, we develop an integrated tool for online detection of network anomalies. We consider statistical change point detection algorithms, both for local changes in the variance and for jump detection, and propose modified versions of these algorithms based on moving window techniques. We investigate performance on simulated data and on network traffic data with several superimposed attacks. All detection methods are based on wavelet packet transformations. We also propose a Bayesian model for the analysis of high-throughput data where the outcome of interest has a natural ordering. The method provides a unified approach for identifying relevant markers and predicting class memberships. This is accomplished by building a stochastic search variable selection method into an ordinal model. We apply the methodology to the analysis of proteomic studies in prostate cancer. We explore wavelet-based techniques to remove noise from the protein mass spectra. The goal is to identify protein markers associated with prostate-specific antigen (PSA) level, an ordinal diagnostic measure currently used to stratify patients into different risk groups. Bayesian ordinal probit model wavelet methods change point detection network security bioinformatics proteomics SELDI-TOF MS Bayesian variable selection biomarker
10	Monte Carlo methods for sampling high-dimensional binary vectors Schäfer, Christian 14 November 2012 (has links) (PDF) This thesis is concerned with Monte Carlo methods for sampling high-dimensional binary vectors from complex distributions of interest. If the state space is too large for exhaustive enumeration, these methods provide a means of estimating the expected value of some function of interest. Standard approaches are mostly based on random walk type Markov chain Monte Carlo, where the equilibrium distribution of the chain is the distribution of interest and its ergodic mean converges to the expected value. We propose a novel sampling algorithm based on sequential Monte Carlo methodology which copes well with multi-modal problems by virtue of an annealing schedule. The performance of the proposed sequential Monte Carlo sampler depends on the ability to sample proposals from auxiliary distributions which are, in a certain sense, close to the current distribution of interest. The core work of this thesis discusses strategies to construct parametric families for sampling binary vectors with dependencies. The usefulness of this approach is demonstrated in the context of Bayesian variable selection and combinatorial optimization of pseudo-Boolean objective functions. Sequential Monte Carlo Bayesian variable selection Binary parametric families Binary optimization
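Editorial illustration for entry 10: the abstract contrasts its sequential Monte Carlo sampler with the standard random walk type Markov chain Monte Carlo, whose ergodic mean converges to the expected value. The sketch below shows only that standard baseline, a single-bit-flip Metropolis sampler on binary vectors, and not the thesis's sequential Monte Carlo algorithm. The target (independent bits with inclusion probabilities `theta`), the function names, and the run length are assumptions picked so the answer can be checked by eye.

```python
import math
import random


def log_target(gamma, theta):
    """Log-probability of a binary vector under independent Bernoulli(theta_j) bits."""
    return sum(math.log(t if g else 1.0 - t) for g, t in zip(gamma, theta))


def bit_flip_metropolis(theta, n_steps, rng):
    """Random walk Metropolis on {0,1}^d; returns the ergodic mean of each bit."""
    d = len(theta)
    gamma = [0] * d
    current = log_target(gamma, theta)
    totals = [0] * d
    for _ in range(n_steps):
        j = rng.randrange(d)            # propose flipping one uniformly chosen bit
        gamma[j] ^= 1
        proposed = log_target(gamma, theta)
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed          # accept the flip
        else:
            gamma[j] ^= 1               # reject: undo the flip
        for k in range(d):
            totals[k] += gamma[k]
    return [t / n_steps for t in totals]


theta = [0.2, 0.5, 0.8]
means = bit_flip_metropolis(theta, 60000, random.Random(0))
print(means)   # each entry should be close to the matching theta
```

A single-bit proposal like this one moves slowly between well-separated modes, which is the weakness on multi-modal targets that the abstract says the sequential Monte Carlo sampler is designed to address.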
