11

Variable selection for generalized linear mixed models and non-Gaussian Genome-wide associated study data

Xu, Shuangshuang 11 June 2024 (has links)
Genome-wide association study (GWAS) aims to identify single nucleotide polymorphisms (SNPs) associated with phenotypes. The number of SNPs ranges from hundreds of thousands to millions; if p is the number of SNPs and n is the sample size, GWAS is a p >> n variable selection problem. The common approach to this p >> n problem is single marker analysis (SMA). However, because SNPs are highly correlated, SMA identifies true causal SNPs with a high false discovery rate, and it does not consider interactions between SNPs. In this dissertation, we propose novel Bayesian variable selection methods, BG2 and IBG3, for non-Gaussian GWAS data. To address the ultra-high dimensionality and the strong correlation among SNPs, BG2 and IBG3 proceed in two steps: a screening step and a fine-mapping step. In the screening step, BG2 and IBG3, like SMA, fit one SNP per model and screen for a subset of the most associated SNPs. In the fine-mapping step, BG2 and IBG3 consider all possible combinations of the screened candidate SNPs to find the best model; this step helps reduce false positives. In addition, IBG3 iterates these two steps to detect more SNPs with small effect sizes. In simulation studies, we compare our methods with SMA methods and fine-mapping methods, and with different priors for variables, including the nonlocal prior, unit information prior, Zellner-g prior, and Zellner-Siow prior. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures). / Doctor of Philosophy / Genome-wide association study (GWAS) aims to identify genomic variants for a targeted phenotype, such as a disease or trait. The genomic variants of interest are single nucleotide polymorphisms (SNPs); a SNP is a substitution mutation in the DNA sequence. GWAS asks which SNPs are associated with the phenotype. However, the number of possible SNPs ranges from hundreds of thousands to millions. The common method for GWAS, single marker analysis (SMA), considers only one SNP's association with the phenotype at a time; in this way, SMA avoids the problem posed by the large number of SNPs and small sample size. However, SMA does not consider interactions between SNPs, and SNPs that are close to each other in the DNA sequence may be highly correlated, causing SMA to have a high false discovery rate. To solve these problems, this dissertation proposes two variable selection methods (BG2 and IBG3) for non-Gaussian GWAS data. Compared with SMA methods, BG2 and IBG3 detect true causal SNPs with a low false discovery rate. In addition, IBG3 can detect SNPs with small effect sizes. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures).
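As a rough illustration of the two-step idea, the sketch below screens SNPs one at a time with a single-SNP GLM (as in SMA) and then exhaustively searches subsets of the screened candidates by AIC. It is a simplified frequentist stand-in, not the authors' Bayesian BG2/IBG3 implementation (which uses generalized linear mixed models and nonlocal priors); the Poisson phenotype, simulated genotypes, and candidate-set size are all illustrative assumptions.

```python
# Minimal screen-then-fine-map sketch in the spirit of BG2/IBG3 (assumptions:
# Poisson phenotype, simulated genotypes, AIC in place of Bayesian model scores).
from itertools import combinations
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 500                                      # p >> n in real GWAS
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # 0/1/2 genotype codes
eta = 0.8 * X[:, 10] - 0.6 * X[:, 42]                # two causal SNPs
y = rng.poisson(np.exp(0.5 + eta))                   # non-Gaussian phenotype

# Step 1 (screening): one SNP per model, keep the most associated candidates.
pvals = []
for j in range(p):
    fit = sm.GLM(y, sm.add_constant(X[:, j]), family=sm.families.Poisson()).fit()
    pvals.append(fit.pvalues[1])
candidates = np.argsort(pvals)[:8]                   # small candidate set

# Step 2 (fine-mapping): search all candidate subsets, pick the best by AIC.
best_aic, best_set = np.inf, ()
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        fit = sm.GLM(y, sm.add_constant(X[:, list(subset)]),
                     family=sm.families.Poisson()).fit()
        if fit.aic < best_aic:
            best_aic, best_set = fit.aic, subset
print("selected SNPs:", sorted(best_set))
```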
12

Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution

Strobl, Carolin, Boulesteix, Anne-Laure, Zeileis, Achim, Hothorn, Torsten January 2006 (has links) (PDF)
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other. We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. The suggested method can therefore be applied straightforwardly by scientists in bioinformatics research. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
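The bias described here is easy to reproduce with a null simulation in which no predictor carries any information about the response, yet impurity-based importance still favours variables with more potential split points. The scikit-learn sketch below demonstrates the phenomenon; the authors' actual remedy, a conditional-inference forest with subsampling without replacement (cforest in the R package party), has no direct scikit-learn equivalent and is not shown.

```python
# Null-case demonstration of variable-importance bias (all data simulated;
# no predictor is related to the response).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)

def null_data(n):
    """Predictors of varying numbers of categories; response is independent."""
    X = np.column_stack([
        rng.integers(0, 2, n),    # binary
        rng.integers(0, 4, n),    # 4 categories
        rng.integers(0, 20, n),   # 20 categories
        rng.normal(size=n),       # continuous
    ])
    return X, rng.integers(0, 2, n)

X_train, y_train = null_data(500)
X_test, y_test = null_data(500)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
# Impurity-based importance inflates variables with more possible split points,
# even though every predictor is pure noise here.
print("Gini importance:", rf.feature_importances_.round(3))
# Held-out permutation importance is close to zero for all four predictors.
perm = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=0)
print("Permutation importance:", perm.importances_mean.round(3))
```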
13

Ion implant virtual metrology for process monitoring

Fowler, Courtney Marie 07 September 2010 (has links)
This thesis presents the modeling of tool data produced during ion implantation for the prediction of wafer sheet resistance. In this work, we use various statistical techniques to address challenges posed by the nature of equipment data: high dimensionality, collinearity, parameter interactions, and non-linearities. The emphasis is on data integrity, variable selection, and model building methods. Different variable selection and modeling techniques are evaluated using an industrial data set. Ion implant processes are fast, and depending on the monitoring frequency of the equipment, late detection of a process shift could lead to the loss of a significant amount of product. The main objective of the research presented in this thesis is to identify any ion implant parameters that can be used to formulate a virtual metrology model. The virtual metrology model would then be used for process monitoring to ensure stable processing conditions and consequent yield guarantees.
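As a rough illustration of the virtual-metrology workflow, the sketch below regresses a simulated sheet-resistance target on collinear tool parameters with a cross-validated lasso, which handles the high dimensionality and collinearity the thesis mentions. The data, dimensions, and model choice are assumptions for illustration, not the thesis's industrial data set or final model.

```python
# Hedged virtual-metrology sketch: penalised regression of a metrology target
# on collinear equipment signals (all data simulated stand-ins).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 120, 40
latent = rng.normal(size=(n, 5))                    # a few hidden machine states
X = latent @ rng.normal(size=(5, p)) + 0.1 * rng.normal(size=(n, p))
sheet_resistance = 2.0 * X[:, 3] - 1.5 * X[:, 17] + rng.normal(scale=0.5, size=n)

vm_model = LassoCV(cv=5).fit(X, sheet_resistance)   # selection + fitting in one
print("tool parameters kept:", np.flatnonzero(vm_model.coef_))

# In monitoring, each wafer's tool trace yields a predicted sheet resistance;
# sustained drift between predictions and periodic physical measurements can
# flag a process shift sooner than the metrology sampling interval alone.
print("predicted sheet resistance:", round(float(vm_model.predict(X[:1])[0]), 2))
```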
14

Some statistical methods for dimension reduction

Al-Kenani, Ali J. Kadhim January 2013 (has links)
The aim of the work in this thesis is to carry out dimension reduction (DR) for high-dimensional (HD) data by using statistical methods for variable selection, feature extraction, and a combination of the two. In Chapter 2, DR is carried out through robust feature extraction, and robust canonical correlation analysis (RCCA) methods are proposed. In the correlation matrix of canonical correlation analysis (CCA), we suggest that the Pearson correlation be substituted by robust correlation measures in order to obtain robust correlation matrices; these matrices are then employed to produce RCCA. Moreover, the classical covariance matrix is substituted by robust estimators of multivariate location and dispersion, again yielding RCCA. In Chapters 3 and 4, DR is carried out by combining the ideas of variable selection using regularisation methods with feature extraction, through the minimum average variance estimator (MAVE) and single index quantile regression (SIQ) methods, respectively. In particular, we extend the sparse MAVE (SMAVE) of Wang and Yin (2008) by combining the MAVE loss function with different regularisation penalties in Chapter 3. An extension of the SIQ of Wu et al. (2010) that considers different regularisation penalties is proposed in Chapter 4. In Chapter 5, DR is done through variable selection under a Bayesian framework. A flexible Bayesian framework for regularisation in the quantile regression (QR) model is proposed. This work differs from Bayesian Lasso quantile regression (BLQR), which employs the asymmetric Laplace error distribution (ALD); here the error distribution is assumed to be an infinite mixture of Gaussian (IMG) densities.
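One way to picture the robust-CCA construction of Chapter 2 is the sketch below, which plugs a Spearman correlation matrix (one convenient robust choice; the thesis studies several robust correlation and scatter estimators) into the standard CCA decomposition. Everything here, including the simulated outliers, is an illustrative assumption.

```python
# Robust-CCA plumbing sketch: robust correlation matrix -> canonical correlations.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
n, px, py = 300, 4, 3
z = rng.normal(size=(n, 1))                      # shared latent signal
X = z @ rng.normal(size=(1, px)) + rng.normal(size=(n, px))
Y = z @ rng.normal(size=(1, py)) + rng.normal(size=(n, py))
X[::50] += 20                                    # inject a few gross outliers

R, _ = spearmanr(np.hstack([X, Y]))              # robust correlation matrix
Rxx, Ryy, Rxy = R[:px, :px], R[px:, px:], R[:px, px:]

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

# Canonical correlations are the singular values of Rxx^{-1/2} Rxy Ryy^{-1/2}.
K = inv_sqrt(Rxx) @ Rxy @ inv_sqrt(Ryy)
print("robust canonical correlations:",
      np.linalg.svd(K, compute_uv=False).round(3))
```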
15

Some problems in model specification and inference for generalized additive models

Marra, Giampiero January 2010 (has links)
Regression models describing the dependence between a univariate response and a set of covariates play a fundamental role in statistics. In the last two decades, a tremendous effort has been made in developing flexible regression techniques such as generalized additive models (GAMs), with the aim of modelling the expected value of a response variable as a sum of smooth unspecified functions of predictors. Many nonparametric regression methodologies exist, including locally weighted regression and smoothing splines. Here the focus is on penalized regression spline methods, which can be viewed as a generalization of smoothing splines with a more flexible choice of bases and penalties. This thesis addresses three issues. First, the problem of model misspecification is treated by extending the instrumental variable approach to the GAM context. Second, we study the theoretical and empirical properties of the confidence intervals for the smooth component functions of a GAM. Third, we consider the problem of variable selection within this flexible class of models. All results are supported by theoretical arguments and extensive simulation experiments which shed light on the practical performance of the methods discussed in this thesis.
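As a small, self-contained illustration of the penalized regression spline machinery that underlies GAMs, the sketch below fits a single smooth term by penalized least squares, pairing a cubic B-spline basis with a second-order difference penalty (a P-spline). The basis size and smoothing parameter are arbitrary illustrative choices; a full GAM would estimate several such smooths together with a link function.

```python
# Penalised regression spline (P-spline) sketch: B-spline basis + difference
# penalty, fitted by penalised least squares on simulated data.
import numpy as np
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 150))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Cubic B-spline design matrix (18 knots -> 20 basis functions).
B = SplineTransformer(n_knots=18, degree=3, include_bias=True).fit_transform(x[:, None])
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)   # second-order difference penalty
lam = 1.0                                      # smoothing parameter
coef = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
fitted = B @ coef                              # estimated smooth function
print("residual sd:", float(np.std(y - fitted)))
```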
16

Penalised regression for high-dimensional data : an empirical investigation and improvements via ensemble learning

Wang, Fan January 2019 (has links)
In a wide range of applications, datasets are generated for which the number of variables p exceeds the sample size n. Penalised likelihood methods are widely used to tackle regression problems in these high-dimensional settings. In this thesis, we carry out an extensive empirical comparison of the performance of popular penalised regression methods in high-dimensional settings and propose new methodology that uses ensemble learning to enhance the performance of these methods. The relative efficacy of different penalised regression methods in finite-sample settings remains incompletely understood. Through a large-scale simulation study, consisting of more than 1,800 data-generating scenarios, we systematically consider the influence of various factors (for example, sample size and sparsity) on method performance. We focus on three related goals --- prediction, variable selection and variable ranking --- and consider six widely used methods. The results are supported by a semi-synthetic data example. Our empirical results complement existing theory and provide a resource to compare performance across a range of settings and metrics. We then propose a new ensemble learning approach for improving the performance of penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner". In simulations, we show that STRANDS typically improves upon its base learner, and demonstrate that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored. We propose another ensemble learning method to improve the prediction performance of Ridge Regression in sparse settings. Specifically, we combine Bayesian Ridge Regression with a probabilistic forward selection procedure, where inclusion of a variable at each stage is probabilistically determined by a Bayes factor. We compare the prediction performance of the proposed method to penalised regression methods using simulated data.
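A stripped-down sketch of the variable-subsampling ensemble idea is given below, in the spirit of Random Lasso: repeatedly draw random subsets of variables, fit the base learner on each subset, and average the coefficients. Plain uniform sampling stands in for STRANDS's correlation- and importance-informed sampling, and the cross-validated lasso stands in for an arbitrary base learner.

```python
# Variable-subsampling ensemble sketch (uniform sampling as a simplification
# of STRANDS's structured sampling; all data simulated).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p, q = 100, 200, 30                  # q variables drawn per subsample
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

n_draws = 100
coef_sum = np.zeros(p)
for _ in range(n_draws):
    idx = rng.choice(p, size=q, replace=False)   # random variable subset
    fit = LassoCV(cv=5).fit(X[:, idx], y)        # base learner on the subset
    coef_sum[idx] += fit.coef_
ensemble_coef = coef_sum / n_draws               # average over all draws
print("top-ranked variables:", np.argsort(-np.abs(ensemble_coef))[:5])
```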
17

Information-Theoretic Variable Selection and Network Inference from Microarray Data

Meyer, Patrick E 16 December 2008 (has links)
Statisticians routinely model interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets having thousands of variables, substantial noise, non-linear dependencies and only tens of samples. The detection of functional relationships, when such uncertainty is contained in data, constitutes a major challenge. Our work focuses on variable selection and network inference from datasets having many variables and few samples (high variable-to-sample ratio), such as microarray data. Variable selection is the area of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. Applying variable selection methods to gene expression data makes it possible, for example, to improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists in representing the dependencies between the variables of a dataset by a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of a cell with a view to discovering new drug targets to cure diseases. In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new method of feature selection, and MRNET (Minimum Redundancy NETwork), a new algorithm for network inference. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET approximate the mutual information between a subset of variables and a target variable using combinations of mutual information terms between sub-subsets of variables and the target. These approximations allow a series of low-variate densities to be estimated instead of one large multivariate density. Low-variate densities are well suited to high variable-to-sample ratio datasets, since they are relatively cheap in terms of computational cost and do not require a large number of samples to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has produced freely available source code for MASSIVE and an open-source R and Bioconductor package for network inference.
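To give a flavour of the information-theoretic machinery, the sketch below estimates pairwise mutual information from discretised expression values and applies a one-step maximum-relevance/minimum-redundancy (MRMR) criterion of the kind MRNET builds on. This is a loose, simplified re-creation for illustration only; the real algorithm lives in the open-source R/Bioconductor package mentioned above, and the data, binning scheme, and one-step selection are all assumptions.

```python
# Simplified MRMR-style scoring from pairwise mutual information
# (illustrative re-creation, not the MRNET implementation).
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_matrix(data, bins=8):
    """Pairwise mutual information of columns after equal-width binning."""
    edges = [np.histogram_bin_edges(c, bins)[1:-1] for c in data.T]
    disc = np.stack([np.digitize(c, e) for c, e in zip(data.T, edges)], axis=1)
    p = disc.shape[1]
    M = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            M[i, j] = M[j, i] = mutual_info_score(disc[:, i], disc[:, j])
    return M

rng = np.random.default_rng(6)
n, p = 300, 6
expr = rng.normal(size=(n, p))
expr[:, 1] = expr[:, 0] + 0.3 * rng.normal(size=n)   # gene 0 -> gene 1
expr[:, 2] = expr[:, 1] + 0.3 * rng.normal(size=n)   # gene 1 -> gene 2

M = mi_matrix(expr)
for t in range(p):                         # treat each gene as the target
    first = int(np.argmax(M[t]))           # maximum relevance
    score = M[t] - M[first]                # relevance minus redundancy
    score[[t, first]] = -np.inf
    print(f"gene {t}: first pick {first}, next MRMR pick {int(np.argmax(score))}")
```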
18

Comparisons of statistical modeling for constructing gene regulatory networks

Chen, Xiaohui 11 1900 (has links)
Genetic regulatory networks are of great scientific interest and practical medical importance. With a number of high-throughput measurement technologies available, such as microarrays and sequencing, regulatory networks have been intensively studied over the last decade. Based on these high-throughput data sets, statistical interpretation of these billions of bits is crucial for biologists to extract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks that span transcription factors and microRNAs. We also propose an extended algorithm to address the local optimum issue in finding the Maximum A Posteriori estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models, and we show that our regression networks with a properly chosen prior can perform comparably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models to a p53-related data set, the NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell proliferation function. In both data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or functional enrichment analysis. Through comparing statistical models, we conclude that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation. Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis.
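The per-gene regression view described here can be sketched generically: regress each gene's expression on candidate regulators with a sparse penalty and keep the nonzero coefficients as directed edges. In the sketch below, a plain cross-validated lasso stands in for the thesis's compared models and extended MAP algorithm, and all data are simulated.

```python
# Generic sparse-regression network construction (illustrative stand-in).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n, n_genes, n_tfs = 80, 5, 20
TF = rng.normal(size=(n, n_tfs))               # transcription-factor levels
genes = np.column_stack([                       # each gene driven by two TFs
    TF[:, 2 * g] - TF[:, 2 * g + 1] + 0.5 * rng.normal(size=n)
    for g in range(n_genes)
])

edges = []
for g in range(n_genes):
    fit = LassoCV(cv=5).fit(TF, genes[:, g])    # sparse regression per gene
    edges += [(int(tf), g) for tf in np.flatnonzero(fit.coef_)]
print("inferred TF -> gene edges:", edges)
```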
19

Contributions to the Analysis of Experiments Using Empirical Bayes Techniques

Delaney, James Dillon 10 July 2006 (has links)
Specifying a prior distribution for the large number of parameters in the linear statistical model is a difficult step in the Bayesian approach to the design and analysis of experiments. Here we address this difficulty by proposing the use of functional priors and then by working out important details for three-level and higher-level experiments. One of the challenges presented by higher-level experiments is that a factor can be either qualitative or quantitative. We propose appropriate correlation functions and coding schemes so that the prior distribution is simple and the results easily interpretable. The prior incorporates well-known experimental design principles such as effect hierarchy and effect heredity, which helps to automatically resolve the aliasing problems experienced in fractional designs. The second part of the thesis focuses on the analysis of optimization experiments. Designed experiments whose primary purpose is to determine optimal settings for all of the factors in some predetermined set are not uncommon. Here we distinguish between the two concepts of statistical significance and practical significance. We perform estimation via an empirical Bayes data analysis methodology that has been detailed in the recent literature, but we then propose an alternative to the usual next step in determining optimal factor-level settings. Instead of implementing variable or model selection techniques, we propose an objective function that assists in our goal of finding the ideal settings for all factors over which we experimented. The usefulness of the new approach is illustrated through the analysis of some real experiments as well as simulation.
20

Bayesian model-based approaches with MCMC computation to some bioinformatics problems

Bae, Kyounghwa 29 August 2005 (has links)
Bioinformatics applications can address the transfer of information at several stages of the central dogma of molecular biology, including transcription and translation. This dissertation focuses on using Bayesian models to interpret biological data in bioinformatics, with Markov chain Monte Carlo (MCMC) as the inference method. First, we use our approach to interpret data at the transcription level. We propose a two-level hierarchical Bayesian model for variable selection on cDNA microarray data. A cDNA microarray quantifies the mRNA levels of thousands of genes simultaneously in a single sample, and by observing the expression patterns of genes under various treatment conditions, important clues about gene function can be obtained. We consider a multivariate Bayesian regression model and assign priors that favor sparseness in terms of the number of variables (genes) used. We introduce the use of different priors to promote different degrees of sparseness within a unified two-level hierarchical Bayesian model. Second, we apply our method to a problem at the translation level. We develop hidden Markov models for linker/non-linker sequence regions in a protein sequence, using a linker index to exploit differences in amino acid composition between regions from sequence information alone. A goal of protein structure prediction is to take an amino acid sequence (represented as a sequence of letters) and predict its tertiary structure; the identification of linker regions in a protein sequence is valuable in predicting the three-dimensional structure. Because of the complexities of both models encountered in practice, we employ MCMC, particularly Gibbs sampling (Gelfand and Smith, 1990), for parameter estimation and inference.
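The flavour of a hierarchical sparsity prior with Gibbs updates can be conveyed by a miniature spike-and-slab example for a Gaussian linear model (George-and-McCulloch style, with known noise variance). The dissertation's models are far richer (non-Gaussian likelihoods, a two-level hierarchy, microarray scale), so every setting below is an illustrative assumption.

```python
# Miniature spike-and-slab Gibbs sampler (illustrative assumptions throughout:
# Gaussian likelihood, known noise variance, fixed hyperparameters).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[0, 5]] = [2.0, -1.5]                    # two truly active variables
y = X @ beta_true + rng.normal(size=n)

tau0, tau1, pi_incl, sigma2 = 0.01, 2.0, 0.1, 1.0  # spike sd, slab sd, priors
gamma = np.zeros(p)
incl = np.zeros(p)
n_iter, burn = 2000, 500
for it in range(n_iter):
    # beta | gamma, y ~ N(A^{-1} X'y / sigma2, A^{-1}), A = X'X/sigma2 + D^{-1}
    d_inv = 1.0 / np.where(gamma == 1, tau1 ** 2, tau0 ** 2)
    cov = np.linalg.inv(X.T @ X / sigma2 + np.diag(d_inv))
    beta = rng.multivariate_normal(cov @ X.T @ y / sigma2, cov)
    # gamma_j | beta_j: Bernoulli with log-odds from slab vs. spike densities
    log_odds = (np.log(pi_incl) - np.log(1 - pi_incl)
                + norm.logpdf(beta, 0, tau1) - norm.logpdf(beta, 0, tau0))
    prob = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -30, 30)))
    gamma = (rng.uniform(size=p) < prob).astype(float)
    if it >= burn:
        incl += gamma
print("posterior inclusion probabilities:", (incl / (n_iter - burn)).round(2))
```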
