Global ETD Search

51	Exact Markov chain Monte Carlo and Bayesian linear regression Bentley, Jason Phillip January 2009 (has links) In this work we investigate the use of perfect sampling methods within the context of Bayesian linear regression. We focus on inference problems related to the marginal posterior model probabilities. Model averaged inference for the response and Bayesian variable selection are considered. Perfect sampling is an alternate form of Markov chain Monte Carlo that generates exact sample points from the posterior of interest. This approach removes the need for burn-in assessment faced by traditional MCMC methods. For model averaged inference, we find the monotone Gibbs coupling from the past (CFTP) algorithm is the preferred choice. This requires the predictor matrix be orthogonal, preventing variable selection, but allowing model averaging for prediction of the response. Exploring choices of priors for the parameters in the Bayesian linear model, we investigate sufficiency for monotonicity assuming Gaussian errors. We discover that a number of other sufficient conditions exist, besides an orthogonal predictor matrix, for the construction of a monotone Gibbs Markov chain. Requiring an orthogonal predictor matrix, we investigate new methods of orthogonalizing the original predictor matrix. We find that a new method using the modified Gram-Schmidt orthogonalization procedure performs comparably with existing transformation methods, such as generalized principal components. Accounting for the effect of using an orthogonal predictor matrix, we discover that inference using model averaging for in-sample prediction of the response is comparable between the original and orthogonal predictor matrix. The Gibbs sampler is then investigated for sampling when using the original predictor matrix and the orthogonal predictor matrix. We find that a hybrid method, using a standard Gibbs sampler on the orthogonal space in conjunction with the monotone CFTP Gibbs sampler, provides the fastest computation and convergence to the posterior distribution. We conclude the hybrid approach should be used when the monotone Gibbs CFTP sampler becomes impractical, due to large backwards coupling times. We demonstrate large backwards coupling times occur when the sample size is close to the number of predictors, or when hyper-parameter choices increase model competition. The monotone Gibbs CFTP sampler should be taken advantage of when the backwards coupling time is small. For the problem of variable selection we turn to the exact version of the independent Metropolis-Hastings (IMH) algorithm. We reiterate the notion that the exact IMH sampler is redundant, being a needlessly complicated rejection sampler. We then determine a rejection sampler is feasible for variable selection when the sample size is close to the number of predictors and using Zellner’s prior with a small value for the hyper-parameter c. Finally, we use the example of simulating from the posterior of c conditional on a model to demonstrate how the use of an exact IMH view-point clarifies how the rejection sampler can be adapted to improve efficiency. perfect sampling Bayesian variable selection linear regression exact markov chain monte carlo
52	Unbiased Recursive Partitioning: A Conditional Inference Framework Hothorn, Torsten, Hornik, Kurt, Zeileis, Achim January 2004 (has links) (PDF) Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of exhaustive search procedures usually applied to fit such models have been known for a long time: Overfitting and a selection bias towards covariates with many possible splits or missing values. While pruning procedures are able to solve the overfitting problem, the variable selection bias still seriously effects the interpretability of tree-structured regression models. For some special cases unbiased procedures have been suggested, however lacking a common theoretical foundation. We propose a unified framework for recursive partitioning which embeds tree-structured regression models into a well defined theory of conditional inference procedures. Stopping criteria based on multiple test procedures are implemented and it is shown that the predictive performance of the resulting trees is as good as the performance of established exhaustive search procedures. It turns out that the partitions and therefore the models induced by both approaches are structurally different, indicating the need for an unbiased variable selection. The methodology presented here is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates. Data from studies on animal abundance, glaucoma classification, node positive breast cancer and mammography experience are re-analyzed. / Series: Research Report Series / Department of Statistics and Mathematics
53	Tuning Parameter Selection in L1 Regularized Logistic Regression Shi, Shujing 05 December 2012 (has links) Variable selection is an important topic in regression analysis and is intended to select the best subset of predictors. Least absolute shrinkage and selection operator (Lasso) was introduced by Tibshirani in 1996. This method can serve as a tool for variable selection because it shrinks some coefficients to exact zero by a constraint on the sum of absolute values of regression coefficients. For logistic regression, Lasso modifies the traditional parameter estimation method, maximum log likelihood, by adding the L1 norm of the parameters to the negative log likelihood function, so it turns a maximization problem into a minimization one. To solve this problem, we first need to give the value for the parameter of the L1 norm, called tuning parameter. Since the tuning parameter affects the coefficients estimation and variable selection, we want to find the optimal value for the tuning parameter to get the most accurate coefficient estimation and best subset of predictors in the L1 regularized regression model. There are two popular methods to select the optimal value of the tuning parameter that results in a best subset of predictors, Bayesian information criterion (BIC) and cross validation (CV). The objective of this paper is to evaluate and compare these two methods for selecting the optimal value of tuning parameter in terms of coefficients estimation accuracy and variable selection through simulation studies. Variable Selection Logistic Regression Lasso Tuning Parameter BIC Cross Validation Physical Sciences and Mathematics
54	Variable Selection in Competing Risks Using the L1-Penalized Cox Model Kong, XiangRong 22 September 2008 (has links) One situation in survival analysis is that the failure of an individual can happen because of one of multiple distinct causes. Survival data generated in this scenario are commonly referred to as competing risks data. One of the major tasks, when examining survival data, is to assess the dependence of survival time on explanatory variables. In competing risks, as with ordinary univariate survival data, there may be explanatory variables associated with the risks raised from the different causes being studied. The same variable might have different degrees of influence on the risks due to different causes. Given a set of explanatory variables, it is of interest to identify the subset of variables that are significantly associated with the risk corresponding to each failure cause. In this project, we develop a statistical methodology to achieve this purpose, that is, to perform variable selection in the presence of competing risks survival data. Asymptotic properties of the model and empirical simulation results for evaluation of the model performance are provided. One important feature of our method, which is based on the idea of the L1 penalized Cox model, is the ability to perform variable selection in situations where we have high-dimensional explanatory variables, i.e. the number of explanatory variables is larger than the number of observations. The method was applied on a real dataset originated from the National Institutes of Health funded project "Genes related to hepatocellular carcinoma progression in living donor and deceased donor liver transplant'' to identify genes that might be relevant to tumor progression in hepatitis C virus (HCV) infected patients diagnosed with hepatocellular carcinoma (HCC). The gene expression was measured on Affymetrix GeneChip microarrays. Based on the current available 46 samples, 42 genes show very strong association with tumor progression and deserve to be further investigated for their clinical implications in prognosis of progression on patients diagnosed with HCV and HCC. Cox model Variable selection Penalized model Competing risks Biostatistics Physical Sciences and Mathematics Statistics and Probability
55	Ridle for sparse regression with mandatory covariates with application to the genetic assessment of histologic grades of breast cancer Zhai, Jing, Hsu, Chiu-Hsieh, Daye, Z. John 25 January 2017 (has links) Background: Many questions in statistical genomics can be formulated in terms of variable selection of candidate biological factors for modeling a trait or quantity of interest. Often, in these applications, additional covariates describing clinical, demographical or experimental effects must be included a priori as mandatory covariates while allowing the selection of a large number of candidate or optional variables. As genomic studies routinely require mandatory covariates, it is of interest to propose principled methods of variable selection that can incorporate mandatory covariates. Methods: In this article, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that simultaneously estimates coefficients of mandatory covariates while allowing selection for others. The ridle provides a principled approach to mitigate effects of multicollinearity among the mandatory covariates and possible dependency between mandatory and optional variables. We provide detailed empirical and theoretical studies to evaluate our method. In addition, we develop an efficient algorithm for the ridle. Software, based on efficient Fortran code with R-language wrappers, is publicly and freely available at https://sites.google.com/site/zhongyindaye/software. Results: The ridle is useful when mandatory predictors are known to be significant due to prior knowledge or must be kept for additional analysis. Both theoretical and comprehensive simulation studies have shown that the ridle to be advantageous when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves. A microarray gene expression analysis of the histologic grades of breast cancer has identified 24 genes, in which 2 genes are selected only by the ridle among current methods and found to be associated with tumor grade. Conclusions: In this article, we proposed the ridle as a principled sparse regression method for the selection of optional variables while incorporating mandatory ones. Results suggest that the ridle is advantageous when mandatory covariates are correlated with the irrelevant optional predictors or are highly correlated among themselves. Gene expression analysis Lasso Linear models Penalized regression Ridge Variable selection
56	Distributed Feature Selection in Large n and Large p Regression Problems Wang, Xiangyu January 2016 (has links) <p>Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or Bayesian variable selection method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.</p><p>While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.</p><p>For datasets with both large sample sizes and high dimensionality, I propose a new "divided-and-conquer" framework {\em DEME} (DECO-message) by leveraging both the {\em DECO} and the {\em message} algorithm. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partition the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthezied via the {\em DECO} and {\em message} algorithm in a reverse order to produce the final output. The whole framework is extremely scalable.</p> / Dissertation Statistics Computer science Bayesian variable selection Big data Distributed Embarrassingly Parallel Feature selection Partition
57	Seleção de covariáveis para modelos de sobrevivência via verossimilhança penalizada / Variable selection for survival models based on penalized likelihood Pinto Junior, Jony Arrais 18 February 2009 (has links) A seleção de variáveis é uma importante fase para a construção de um modelo parcimonioso. Entretanto, as técnicas mais populares de seleção de variáveis, como, por exemplo, a seleção do melhor subconjunto de variáveis e o método stepwise, ignoram erros estocásticos inerentes à fase de seleção das variáveis. Neste trabalho, foram estudados procedimentos alternativos aos métodos mais populares para o modelo de riscos proporcionais de Cox e o modelo de Cox com fragilidade gama. Os métodos alternativos são baseados em verossimilhançaa penalizada e diferem dos métodos usuais de seleção de variáveis, pois têm como objetivo excluir do modelo variáveis não significantes estimando seus coeficientes como zero. O estimador resultante possui propriedades desejáveis com escolhas apropriadas de funções de penalidade e do parâmetro de suavização. A avaliação desses métodos foi realizada por meio de simulação e uma aplicação a um conjunto de dados reais foi considerada. / Variable selection is an important step when setting a parsimonious model. However, the most popular variable selection techniques, such as the best subset variable selection and the stepwise method, do not take into account inherent stochastic errors in the variable selection step. This work presents a study of alternative procedures to more popular methods for the Cox proportional hazards model and the frailty model. The alternative methods are based on penalized likelihood and differ from the usual variable selection methods, since their objective is to exclude from the model non significant variables, estimating their coefficient as zero. The resulting estimator has nice properties with appropriate choices of penalty functions and the tuning parameter. The assessment of these methods was studied through simulations, and an application to a real data set was considered. funções de penalidade penalized likelihood penalty functions Seleção de variáveis variable selection verossimilhança penalizada
58	Sistemáticas de agrupamento de países com base em indicadores de desempenho / Countries clustering systematics based on performance indexes Mello, Paula Lunardi de January 2017 (has links) A economia mundial passou por grandes transformações no último século, as quais incluiram períodos de crescimento sustentado seguidos por outros de estagnação, governos alternando estratégias de liberalização de mercado com políticas de protecionismo comercial e instabilidade nos mercados, dentre outros. Figurando como auxiliar na compreensão de problemas econômicos e sociais de forma sistêmica, a análise de indicadores de desempenho é capaz de gerar informações relevantes a respeito de padrões de comportamento e tendências, além de orientar políticas e estratégias para incremento de resultados econômicos e sociais. Indicadores que descrevem as principais dimensões econômicas de um país podem ser utilizados como norteadores na elaboração e monitoramento de políticas de desenvolvimento e crescimento desses países. Neste sentido, esta dissertação utiliza dados do Banco Mundial para aplicar e avaliar sistemáticas de agrupamento de países com características similares em termos dos indicadores que os descrevem. Para tanto, integra técnicas de clusterização (hierárquicas e não-hierárquicas), seleção de variáveis (por meio da técnica “leave one variable out at a time”) e redução dimensional (através da Análise de Componentes Principais) com vistas à formação de agrupamentos consistentes de países. A qualidade dos clusters gerados é avaliada pelos índices Silhouette, Calinski-Harabasz e Davies-Bouldin. Os resultados se mostraram satisfatórios quanto à representatividade dos indicadores destacados e qualidade da clusterização gerada. / The world economy faced transformations in the last century. Periods of sustained growth followed by others of stagnation, governments alternating strategies of market liberalization with policies of commercial protectionism, and instability in markets, among others. As an aid to understand economic and social problems in a systemic way, the analysis of performance indicators generates relevant information about patterns, behavior and trends, as well as guiding policies and strategies to increase results in economy and social issues. Indicators describing main economic dimensions of a country can be used guiding principles in the development and monitoring of development and growth policies of these countries. In this way, this dissertation uses data from World Bank to elaborate a system of grouping countries with similar characteristics in terms of the indicators that describe them. To do so, it integrates clustering techniques (hierarchical and non-hierarchical), selection of variables (through the "leave one variable out at a time" technique) and dimensional reduction (appling Principal Component Analysis). The generated clusters quality is evaluated by the Silhouette Index, Calinski-Harabasz and Davies-Bouldin indexes. The results were satisfactory regarding the representativity of the highlighted indicators and the generated a good clustering quality. Análise de clusters Indicadores econômicos Clustering Variable selection Principal component analysis Clustering validation measures
59	Variable selection of fixed effects and frailties for Cox Proportional Hazard frailty models and competing risks frailty models Pelagia, Ioanna January 2016 (has links) This thesis focuses on two fundamental topics, specifically in medical statistics: the modelling of correlated survival datasets and the variable selection of the significant covariates and random effects. In particular, two types of survival data are considered: the classical survival datasets, where subjects are likely to experience only one type of event and the competing risks datasets, where subjects are likely to experience one of several types of event. In Chapter 2, among other topics, we highlight the importance of adding frailty terms on the proposed models in order to account for the association between the survival time and characteristics of subjects/groups. The main novelty of this thesis is to simultaneously select fixed effects and frailty terms through the proposed statistical models for each survival dataset. Chapter 3 covers the analysis of the classical survival dataset through the proposed Cox Proportional Hazard (PH) model. Utilizing a Cox PH frailty model, may increase the dimension of variable components and estimation of the unknown coefficients becomes very challenging. The method proposed for the analysis of classical survival datasets involves simultaneous variable selection on both fixed effects and frailty terms through penalty functions. The benefit of penalty functions is that they identify the non-significant parameters and set them to have a zero effect in the model. Hence, the idea is to 'doubly-penalize' the partial likelihood of the Cox PH frailty model; one penalty for each term. Estimation and selection implemented through Newton-Raphson algorithms, whereas closed iterative forms for the estimation and selection of fixed effects and prediction of frailty terms were obtained. For the selection of frailty terms, penalties imposed on their variances since frailties are random effects. Based on the same idea, we further extend the simultaneous variable selection in the competing risks datasets in Chapter 4, using extended cause-specific frailty models. Two different scenarios are considered for frailty terms; in the first case we consider that frailty terms vary among different types of events (similar to the fixed effects) whereas in the second case we consider shared frailties over all the types of events. Moreover, our 'individual penalization' approach allows for one covariate to be significant for some types of events, in contrast to the frequently used 'group-penalization' where a covariate is entirely removed when it is not significant over all the events. For both proposed methods, simulation studies were conduced and showed that the proposed procedure followed for each analysis works well in simultaneously selecting and estimating significant fixed effects and frailty terms. The proposed methods are also applied to real datasets analysis; Kidney catheter infections, Diabetes Type 2 and Breast Cancer datasets. Association of the survival times and unmeasured characteristics of the subjects was studied as well as a variable selection for fixed effects and frailties implemented successfully. 510
60	Seleção de covariáveis para modelos de sobrevivência via verossimilhança penalizada / Variable selection for survival models based on penalized likelihood Jony Arrais Pinto Junior 18 February 2009 (has links) A seleção de variáveis é uma importante fase para a construção de um modelo parcimonioso. Entretanto, as técnicas mais populares de seleção de variáveis, como, por exemplo, a seleção do melhor subconjunto de variáveis e o método stepwise, ignoram erros estocásticos inerentes à fase de seleção das variáveis. Neste trabalho, foram estudados procedimentos alternativos aos métodos mais populares para o modelo de riscos proporcionais de Cox e o modelo de Cox com fragilidade gama. Os métodos alternativos são baseados em verossimilhançaa penalizada e diferem dos métodos usuais de seleção de variáveis, pois têm como objetivo excluir do modelo variáveis não significantes estimando seus coeficientes como zero. O estimador resultante possui propriedades desejáveis com escolhas apropriadas de funções de penalidade e do parâmetro de suavização. A avaliação desses métodos foi realizada por meio de simulação e uma aplicação a um conjunto de dados reais foi considerada. / Variable selection is an important step when setting a parsimonious model. However, the most popular variable selection techniques, such as the best subset variable selection and the stepwise method, do not take into account inherent stochastic errors in the variable selection step. This work presents a study of alternative procedures to more popular methods for the Cox proportional hazards model and the frailty model. The alternative methods are based on penalized likelihood and differ from the usual variable selection methods, since their objective is to exclude from the model non significant variables, estimating their coefficient as zero. The resulting estimator has nice properties with appropriate choices of penalty functions and the tuning parameter. The assessment of these methods was studied through simulations, and an application to a real data set was considered. funções de penalidade Seleção de variáveis verossimilhança penalizada penalized likelihood penalty functions variable selection

Search results