11 |
Variable selection for generalized linear mixed models and non-Gaussian genome-wide association study data
Xu, Shuangshuang 11 June 2024 (has links)
Genome-wide association studies (GWAS) aim to identify single nucleotide polymorphisms (SNPs) associated with phenotypes. The number of SNPs typically ranges from hundreds of thousands to millions; with p SNPs and sample size n, this is a p>>n variable selection problem. The common method for GWAS, single marker analysis (SMA), sidesteps the p>>n problem, but because SNPs are highly correlated, it identifies true causal SNPs with a high false discovery rate; in addition, SMA does not consider interactions between SNPs. In this dissertation, we propose novel Bayesian variable selection methods, BG2 and IBG3, for non-Gaussian GWAS data. To deal with the ultra-high dimensionality and the strong correlation among SNPs, BG2 and IBG3 proceed in two steps: a screening step and a fine-mapping step. In the screening step, BG2 and IBG3, like SMA, include only one SNP per model and screen for a subset of the most associated SNPs. In the fine-mapping step, BG2 and IBG3 consider all possible combinations of the screened candidate SNPs to find the best model; this step helps to reduce false positives. In addition, IBG3 iterates these two steps to detect more SNPs with small effect sizes. In simulation studies, we compare our methods with SMA methods and fine-mapping methods, and we compare different priors for the variables, including the nonlocal prior, unit information prior, Zellner-g prior, and Zellner-Siow prior. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures). / Doctor of Philosophy / Genome-wide association studies (GWAS) aim to identify genomic variants associated with a target phenotype, such as a disease or trait. The genomic variants of interest are single nucleotide polymorphisms (SNPs); a SNP is a substitution mutation in the DNA sequence. GWAS addresses the question of which SNPs are associated with the phenotype, but the number of candidate SNPs ranges from hundreds of thousands to millions. The common method for GWAS, single marker analysis (SMA), considers only one SNP's association with the phenotype at a time, and so avoids the problem posed by the large number of SNPs and the small sample size. However, SMA does not consider interactions between SNPs, and SNPs that are close to each other in the DNA sequence may be highly correlated, causing SMA to have a high false discovery rate. To solve these problems, this dissertation proposes two variable selection methods, BG2 and IBG3, for non-Gaussian GWAS data. Compared with SMA methods, BG2 and IBG3 detect true causal SNPs with a low false discovery rate; in addition, IBG3 can detect SNPs with small effect sizes. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures).
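A minimal sketch of the screen-then-fine-map idea described in this abstract, not the authors' Bayesian implementation (the real BG2/IBG3 use Bayesian models with specific priors and handle non-Gaussian responses; this sketch uses a Gaussian response, marginal correlation tests, and BIC for brevity, and all thresholds are illustrative):

```python
import itertools
import numpy as np
from scipy import stats

def screen_snps(X, y, p_threshold=1e-4):
    """Step 1: single-marker screening -- one SNP per model, as in SMA."""
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(X.shape[1])]
    return [j for j, pv in enumerate(pvals) if pv < p_threshold]

def fine_map(X, y, candidates, max_size=3):
    """Step 2: exhaustive search over subsets of the screened SNPs, scored by BIC."""
    n = len(y)
    best = (np.inf, ())
    for k in range(1, min(max_size, len(candidates)) + 1):
        for subset in itertools.combinations(candidates, k):
            Z = np.column_stack([np.ones(n), X[:, subset]])
            beta = np.linalg.lstsq(Z, y, rcond=None)[0]
            rss = np.sum((y - Z @ beta) ** 2)
            bic = n * np.log(rss / n) + (k + 1) * np.log(n)
            best = min(best, (bic, subset))
    return best[1]

rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(200, 5000)).astype(float)  # genotypes, n << p
y = 0.8 * X[:, 10] + 0.6 * X[:, 200] + rng.normal(size=200)
cands = screen_snps(X, y)
print("screened:", cands, "-> fine-mapped:", fine_map(X, y, cands))
```

Because the fine-mapping step only searches over the handful of screened candidates, the combinatorial search stays cheap even though p is in the thousands.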
|
12 |
Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution
Strobl, Carolin; Boulesteix, Anne-Laure; Zeileis, Achim; Hothorn, Torsten January 2006 (has links) (PDF)
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other. We propose to employ an alternative implementation of random forests, which provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
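The cardinality bias described here is easy to reproduce. The following hedged Python sketch (not the authors' R cforest solution, which uses conditional inference trees and subsampling without replacement) shows that impurity-based random forest importance inflates a many-valued but completely uninformative predictor, while permutation importance on held-out data does not:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.integers(0, 2, n),    # binary predictor
    rng.integers(0, 100, n),  # 100-level predictor, equally uninformative
])
y = rng.integers(0, 2, n)     # response independent of both predictors

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(Xtr, ytr)
print("impurity importances:  ", rf.feature_importances_)  # favours column 1
perm = permutation_importance(rf, Xte, yte, n_repeats=30, random_state=0)
print("permutation importances:", perm.importances_mean)   # both near zero
```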
|
13 |
Ion implant virtual metrology for process monitoring
Fowler, Courtney Marie 07 September 2010 (has links)
This thesis presents the modeling of tool data produced during ion implantation for the prediction of wafer sheet resistance. In this work, we will use various statistical techniques to address challenges due to the nature of equipment data: high dimensionality, collinearity, parameter interactions, and non-linearities. The emphasis will be on data integrity, variable selection, and model building methods. Different variable selection and modeling techniques will be evaluated using an industrial data set. Ion implant processes are fast and, depending on the monitoring frequency of the equipment, late detection of a process shift could lead to the loss of a significant amount of product. The main objective of the research presented in this thesis is to identify any ion implant parameters that can be used to formulate a virtual metrology model. The virtual metrology model would then be used for process monitoring to ensure stable processing conditions and consequent yield guarantees.
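A hedged sketch of a virtual metrology model of the general kind described (all variable names and dimensions are hypothetical, and partial least squares is one standard choice for collinear, high-dimensional tool data, not necessarily the one selected in the thesis):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 120, 80                       # few monitor wafers, many tool signals
latent = rng.normal(size=(n, 3))     # a few underlying process drivers
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))  # collinear
sheet_resistance = latent @ np.array([2.0, -1.0, 0.5]) + 0.2 * rng.normal(size=n)

# PLS compresses the collinear signals into a few latent components
pls = PLSRegression(n_components=3)
scores = cross_val_score(pls, X, sheet_resistance, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
```

In a monitoring setting, the fitted model's predictions would be tracked wafer-to-wafer so that a shift in predicted sheet resistance flags a process excursion before physical metrology catches it.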
|
14 |
Some statistical methods for dimension reduction
Al-Kenani, Ali J. Kadhim January 2013 (has links)
The aim of the work in this thesis is to carry out dimension reduction (DR) for high dimensional (HD) data by using statistical methods for variable selection, feature extraction and a combination of the two. In Chapter 2, the DR is carried out through robust feature extraction. Robust canonical correlation (RCCA) methods are proposed. In the correlation matrix of canonical correlation analysis (CCA), we suggest that the Pearson correlation be substituted by robust correlation measures in order to obtain robust correlation matrices; these matrices are then employed to produce RCCA. Moreover, the classical covariance matrix is substituted by robust estimators of multivariate location and dispersion, again yielding RCCA. In Chapters 3 and 4, the DR is carried out by combining the ideas of variable selection using regularisation methods with feature extraction, through the minimum average variance estimator (MAVE) and single index quantile regression (SIQ) methods, respectively. In particular, we extend the sparse MAVE (SMAVE) of Wang and Yin (2008) by combining the MAVE loss function with different regularisation penalties in Chapter 3. An extension of the SIQ of Wu et al. (2010) that considers different regularisation penalties is proposed in Chapter 4. In Chapter 5, the DR is done through variable selection under a Bayesian framework. A flexible Bayesian framework for regularisation in the quantile regression (QR) model is proposed. This work differs from Bayesian Lasso quantile regression (BLQR), which employs the asymmetric Laplace error distribution (ALD); here the error distribution is assumed to be an infinite mixture of Gaussian (IMG) densities.
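A small illustration of the RCCA idea from Chapter 2 (a sketch under simplified assumptions, not the thesis code; Spearman rank correlation stands in for the robust correlation measures the thesis considers): substitute a robust correlation matrix for the Pearson one before solving the canonical correlation problem.

```python
import numpy as np
from scipy import stats, linalg

def robust_cca(X, Y):
    """Canonical correlations computed from a robust (Spearman) correlation matrix."""
    p = X.shape[1]
    R = stats.spearmanr(np.column_stack([X, Y]))[0]  # robust correlation matrix
    Rxx, Ryy, Rxy = R[:p, :p], R[p:, p:], R[:p, p:]
    # canonical correlations are the singular values of Rxx^{-1/2} Rxy Ryy^{-1/2}
    Sx = np.real(linalg.sqrtm(Rxx))
    Sy = np.real(linalg.sqrtm(Ryy))
    return linalg.svdvals(linalg.solve(Sx, Rxy) @ linalg.inv(Sy))

rng = np.random.default_rng(3)
Z = rng.normal(size=(200, 2))                       # shared latent structure
X = Z @ rng.normal(size=(2, 4)) + rng.normal(size=(200, 4))
Y = Z @ rng.normal(size=(2, 3)) + rng.normal(size=(200, 3))
print("robust canonical correlations:", robust_cca(X, Y))
```

Because ranks are unaffected by monotone transformations and resistant to outliers, a few contaminated observations perturb this correlation matrix far less than the Pearson version.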
|
15 |
Some problems in model specification and inference for generalized additive models
Marra, Giampiero January 2010 (has links)
Regression models describing the dependence between a univariate response and a set of covariates play a fundamental role in statistics. In the last two decades, a tremendous effort has been made in developing flexible regression techniques such as generalized additive models (GAMs) with the aim of modelling the expected value of a response variable as a sum of smooth unspecified functions of predictors. Many nonparametric regression methodologies exist including local-weighted regression and smoothing splines. Here the focus is on penalized regression spline methods which can be viewed as a generalization of smoothing splines with a more flexible choice of bases and penalties. This thesis addresses three issues. First, the problem of model misspecification is treated by extending the instrumental variable approach to the GAM context. Second, we study the theoretical and empirical properties of the confidence intervals for the smooth component functions of a GAM. Third, we consider the problem of variable selection within this flexible class of models. All results are supported by theoretical arguments and extensive simulation experiments which shed light on the practical performance of the methods discussed in this thesis.
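To make the penalized regression spline idea concrete, here is a minimal worked sketch (illustrative, not the thesis' implementation): a B-spline basis combined with a second-order difference penalty, the classic P-spline, fitted in closed form by penalized least squares.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, n_basis=20, degree=3, lo=0.0, hi=1.0):
    """B-spline design matrix with equally spaced knots on [lo, hi]."""
    inner = np.linspace(lo, hi, n_basis - degree + 1)
    knots = np.concatenate([[lo] * degree, inner, [hi] * degree])
    return BSpline.design_matrix(x, knots, degree).toarray()

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=200)

B = bspline_basis(x)
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)  # second-order difference penalty
lam = 1.0                                     # smoothing parameter
beta = np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)  # penalized LS
fitted = B @ beta
# trace of the hat matrix gives the effective degrees of freedom of the smooth
edf = np.trace(np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ B))
print("effective df:", round(edf, 1))
```

Varying lam traces out the range from an unpenalized regression spline (lam near 0) to an almost-linear fit (large lam), which is exactly the flexibility in bases and penalties the abstract refers to.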
|
16 |
Penalised regression for high-dimensional data : an empirical investigation and improvements via ensemble learning
Wang, Fan January 2019 (has links)
In a wide range of applications, datasets are generated for which the number of variables p exceeds the sample size n. Penalised likelihood methods are widely used to tackle regression problems in these high-dimensional settings. In this thesis, we carry out an extensive empirical comparison of the performance of popular penalised regression methods in high-dimensional settings and propose new methodology that uses ensemble learning to enhance the performance of these methods. The relative efficacy of different penalised regression methods in finite-sample settings remains incompletely understood. Through a large-scale simulation study, consisting of more than 1,800 data-generating scenarios, we systematically consider the influence of various factors (for example, sample size and sparsity) on method performance. We focus on three related goals, namely prediction, variable selection and variable ranking, and consider six widely used methods. The results are supported by a semi-synthetic data example. Our empirical results complement existing theory and provide a resource to compare performance across a range of settings and metrics. We then propose a new ensemble learning approach for improving the performance of penalised regression methods, called STructural RANDomised Selection (STRANDS). The approach, which builds on and improves upon the Random Lasso method, consists of two steps. In both steps, we reduce dimensionality by repeated subsampling of variables. We apply a penalised regression method to each subsampled dataset and average the results. In the first step, subsampling is informed by variable correlation structure, and in the second step, by variable importance measures from the first step. STRANDS can be used with any sparse penalised regression approach as the "base learner". In simulations, we show that STRANDS typically improves upon its base learner, and demonstrate that taking account of the correlation structure in the first step can help to improve the efficiency with which the model space may be explored. We propose another ensemble learning method to improve the prediction performance of Ridge Regression in sparse settings. Specifically, we combine Bayesian Ridge Regression with a probabilistic forward selection procedure, where inclusion of a variable at each stage is probabilistically determined by a Bayes factor.
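A hedged sketch of the two-step subsample-and-aggregate idea behind STRANDS (modelled on its Random Lasso ancestor; the actual first step is informed by variable correlation structure, which is simplified to uniform sampling here, and all tuning constants are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def strands_like(X, y, n_boot=50, q=20, seed=0):
    """Two-step variable-subsampling ensemble with the lasso as base learner."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]

    def one_pass(probs):
        coefs = np.zeros(p)
        for _ in range(n_boot):
            idx = rng.choice(p, size=q, replace=False, p=probs)
            fit = LassoCV(cv=5).fit(X[:, idx], y)  # base learner on a variable subset
            coefs[idx] += fit.coef_
        return coefs / n_boot

    # Step 1: subsample variables (uniformly here; STRANDS uses correlation
    # structure) and average the base learner's coefficients.
    imp = one_pass(np.full(p, 1.0 / p))
    # Step 2: resample variables with probability proportional to step-1 importance.
    w = np.abs(imp) + 1e-8
    return one_pass(w / w.sum())

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 200))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)
coef = strands_like(X, y)
print("top-ranked variables:", np.argsort(-np.abs(coef))[:5])
```

Each base-learner fit sees only q of the p variables, so the ensemble explores many small, cheap model spaces and aggregates them into a single coefficient ranking.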
|
17 |
Information-Theoretic Variable Selection and Network Inference from Microarray Data
Meyer, Patrick E 16 December 2008 (links)
Statisticians are accustomed to modelling interactions between variables on the basis of observed data. In many emerging fields, such as bioinformatics, they are confronted with datasets having thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. The detection of functional relationships, when such uncertainty is contained in the data, constitutes a major challenge.

Our work focuses on variable selection and network inference from datasets having many variables and few samples (a high variable-to-sample ratio), such as microarray data. Variable selection is the topic of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. The application of variable selection methods to gene expression data makes it possible, for example, to improve cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists in representing the dependencies between the variables of a dataset by a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of the cell with a view to discovering new drug targets to cure diseases.

In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new method of feature selection, and MRNET (Minimum Redundancy NETwork), a new algorithm of network inference. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET use approximations of the mutual information between a subset of variables and a target variable based on combinations of mutual information between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited to dealing with high variable-to-sample-ratio datasets, since they are rather cheap in terms of computational cost and do not require a large amount of samples in order to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to freely available source code for MASSIVE and an open-source R and Bioconductor package for network inference.
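A minimal sketch of the minimum-redundancy idea behind MRNET (an illustration only; the released implementation is the R/Bioconductor package mentioned above, and the full algorithm uses forward mRMR selection per target rather than the one-step score computed here): for each target gene, the remaining genes are scored by relevance minus redundancy, using mutual information estimated from discretised expression values.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(x, bins=5):
    """Equal-frequency binning so mutual information can be estimated by counts."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)

def mrnet_scores(E):
    """Score each candidate edge by relevance minus redundancy (one-step mRMR)."""
    g = E.shape[1]
    D = np.column_stack([discretize(E[:, j]) for j in range(g)])
    MI = np.array([[mutual_info_score(D[:, i], D[:, j]) for j in range(g)]
                   for i in range(g)])
    W = np.zeros((g, g))
    for t in range(g):                       # each gene in turn as the target
        others = [j for j in range(g) if j != t]
        relevance = MI[others, t]
        redundancy = MI[np.ix_(others, others)].mean(axis=1)
        W[others, t] = relevance - redundancy
    return np.maximum(W, W.T)                # keep the larger of the two directions

rng = np.random.default_rng(6)
E = rng.normal(size=(50, 10))                   # 50 samples x 10 genes
E[:, 1] = E[:, 0] + 0.3 * rng.normal(size=50)   # gene 1 driven by gene 0
print("edge score (0,1):", round(mrnet_scores(E)[0, 1], 3))
```

Note that only pairwise mutual information terms are estimated, which is exactly the low-variate-density strategy the abstract describes for few-sample data.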
|
18 |
Comparisons of statistical modeling for constructing gene regulatory networks
Chen, Xiaohui 11 1900 (has links)
Genetic regulatory networks are of great importance in terms of scientific interest and practical medical value. Since a number of high-throughput measurement devices are available, such as microarrays and sequencing techniques, regulatory networks have been intensively studied over the last decade. Based on these high-throughput data sets, statistical interpretation of these billions of bits is crucial for biologists to extract meaningful results. In this thesis, we compare a variety of existing regression models and apply them to construct regulatory networks which span transcription factors and microRNAs. We also propose an extended algorithm to address the local optimum issue in finding the Maximum A Posteriori estimator. An E. coli mRNA expression microarray data set with known bona fide interactions is used to evaluate our models, and we show that our regression networks with a properly chosen prior can perform comparably to the state-of-the-art regulatory network construction algorithm. Finally, we apply our models to a p53-related data set, the NCI-60 data. By further incorporating available prior structural information from sequencing data, we identify several significantly enriched interactions with cell proliferation function. In both data sets, we select specific examples to show that many regulatory interactions can be confirmed by previous studies or functional enrichment analysis. Through comparing statistical models, we conclude from the project that combining different models with over-representation analysis and prior structural information can improve the quality of prediction and facilitate biological interpretation.

Keywords: regulatory network, variable selection, penalized maximum likelihood estimation, optimization, functional enrichment analysis.
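One concrete instance of the regression approach to network construction compared in this thesis (a generic sketch, not the thesis code): penalised neighbourhood regression, where each gene is regressed on all the others and nonzero coefficients become candidate regulatory edges.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def neighborhood_network(E):
    """Regress each gene on all others; nonzero lasso coefficients = candidate edges."""
    g = E.shape[1]
    A = np.zeros((g, g))
    for t in range(g):
        others = np.delete(np.arange(g), t)
        fit = LassoCV(cv=5).fit(E[:, others], E[:, t])
        A[others, t] = fit.coef_          # edge j -> t if coefficient is nonzero
    return A

rng = np.random.default_rng(7)
E = rng.normal(size=(60, 8))                       # 60 samples x 8 genes
E[:, 2] = 1.5 * E[:, 5] + 0.3 * rng.normal(size=60)  # gene 5 regulates gene 2
A = neighborhood_network(E)
print("candidate edges:", int((np.abs(A) > 1e-6).sum()))
```

Prior structural information of the kind the abstract mentions can be folded in by penalising edges differently, for example shrinking biologically implausible edges more heavily than sequence-supported ones.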
|
19 |
Contributions to the Analysis of Experiments Using Empirical Bayes Techniques
Delaney, James Dillon 10 July 2006 (links)
Specifying a prior distribution for the large number of parameters in the linear statistical model is a difficult step in the Bayesian approach to the design and analysis of experiments. Here we address this difficulty by proposing the use of functional priors and then by working out important details for three- and higher-level experiments. One of the challenges presented by higher-level experiments is that a factor can be either qualitative or quantitative. We propose appropriate correlation functions and coding schemes so that the prior distribution is simple and the results easily interpretable. The prior incorporates well-known experimental design principles such as effect hierarchy and effect heredity, which helps to automatically resolve the aliasing problems experienced in fractional designs.
The second part of the thesis focuses on the analysis of optimization experiments. Designed experiments whose primary purpose is to determine optimal settings for all of the factors in some predetermined set are not uncommon. Here we distinguish between the two concepts of statistical significance and practical significance. We perform estimation via an empirical Bayes data analysis methodology that has been detailed in the recent literature, but then propose an alternative to the usual next step in determining optimal factor level settings. Instead of implementing variable or model selection techniques, we propose an objective function that assists in our goal of finding the ideal settings for all factors over which we experimented. The usefulness of the new approach is illustrated through the analysis of some real experiments as well as simulation.
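A toy sketch of the effect-hierarchy principle mentioned in the first paragraph (illustrative assumptions only; the thesis uses functional priors rather than the simple independent Gaussian priors here): two-factor interactions receive a smaller prior variance than main effects, so their estimates are shrunk harder unless the data insist otherwise.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 27
A, B, C = (rng.choice([-1, 0, 1], n) for _ in range(3))  # three 3-level factors
X = np.column_stack([A, B, C, A * B, A * C, B * C])      # main effects + 2fi's
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# effect hierarchy: interactions get one quarter of the main-effect prior variance
tau = np.array([1.0] * 3 + [0.25] * 3)
# posterior mean under independent N(0, tau_j) priors and unit error variance
beta_hat = np.linalg.solve(X.T @ X + np.diag(1.0 / tau), X.T @ y)
print(np.round(beta_hat, 2))
```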
|
20 |
Bayesian model-based approaches with MCMC computation to some bioinformatics problems
Bae, Kyounghwa 29 August 2005 (links)
Bioinformatics applications can address the transfer of information at several stages of the central dogma of molecular biology, including transcription and translation. This dissertation focuses on using Bayesian models to interpret biological data in bioinformatics, using Markov chain Monte Carlo (MCMC) as the inference method. First, we use our approach to interpret data at the transcription level. We propose a two-level hierarchical Bayesian model for variable selection on cDNA microarray data. A cDNA microarray quantifies the mRNA levels of thousands of genes simultaneously in one sample. By observing the expression patterns of genes under various treatment conditions, important clues about gene function can be obtained. We consider a multivariate Bayesian regression model and assign priors that favor sparseness in terms of the number of variables (genes) used. We introduce the use of different priors to promote different degrees of sparseness, using a unified two-level hierarchical Bayesian model. Second, we apply our method to a problem related to the translation level. We develop hidden Markov models to model linker/non-linker sequence regions in a protein sequence. We use a linker index to exploit differences in amino acid composition between regions from sequence information alone. A goal of protein structure prediction is to take an amino acid sequence (represented as a sequence of letters) and predict its tertiary structure. The identification of linker regions in a protein sequence is valuable in predicting the three-dimensional structure. Because of the complexities of both models encountered in practice, we employ the Markov chain Monte Carlo method (MCMC), particularly Gibbs sampling (Gelfand and Smith, 1990), for parameter inference.
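A hedged sketch of the kind of MCMC computation described, not the dissertation's exact hierarchical model: Gibbs sampling for a simple spike-and-slab variable selection model, where an inclusion indicator and a coefficient are updated for each variable in turn, and the average of the indicators estimates posterior inclusion probabilities.

```python
import numpy as np

def gibbs_spike_slab(X, y, iters=500, pi=0.1, tau2=1.0, sigma2=1.0, seed=0):
    """Collapsed Gibbs sampler: sample gamma_j with beta_j integrated out, then beta_j."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(p, dtype=bool)
    incl = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]       # partial residual
            xtx = X[:, j] @ X[:, j]
            v = 1.0 / (xtx / sigma2 + 1.0 / tau2)      # posterior var of beta_j
            m = v * (X[:, j] @ r) / sigma2             # posterior mean of beta_j
            # log Bayes factor for inclusion (beta_j integrated out)
            log_bf = 0.5 * (np.log(v / tau2) + m * m / v)
            p_incl = pi / (pi + (1 - pi) * np.exp(-log_bf))
            gamma[j] = rng.uniform() < p_incl
            beta[j] = rng.normal(m, np.sqrt(v)) if gamma[j] else 0.0
        incl += gamma
    return incl / iters                                 # inclusion probabilities

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=100)
print(np.round(gibbs_spike_slab(X, y)[:5], 2))  # variables 0 and 3 should dominate
```

The two-level hierarchical structure in the dissertation would additionally place priors on quantities fixed here (such as pi and tau2), with those hyperparameters sampled in further Gibbs steps.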
|