51

Unbiased Recursive Partitioning: A Conditional Inference Framework

Hothorn, Torsten, Hornik, Kurt, Zeileis, Achim January 2004 (has links) (PDF)
Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of the exhaustive search procedures usually applied to fit such models have been known for a long time: overfitting and a selection bias towards covariates with many possible splits or missing values. While pruning procedures are able to solve the overfitting problem, the variable selection bias still seriously affects the interpretability of tree-structured regression models. Unbiased procedures have been suggested for some special cases, but they lack a common theoretical foundation. We propose a unified framework for recursive partitioning which embeds tree-structured regression models into a well-defined theory of conditional inference procedures. Stopping criteria based on multiple-test procedures are implemented, and it is shown that the predictive performance of the resulting trees is as good as that of established exhaustive search procedures. It turns out that the partitions, and therefore the models, induced by the two approaches are structurally different, indicating the need for unbiased variable selection. The methodology presented here is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored, and multivariate response variables, and to arbitrary measurement scales of the covariates. Data from studies on animal abundance, glaucoma classification, node-positive breast cancer, and mammography experience are re-analyzed. / Series: Research Report Series / Department of Statistics and Mathematics
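The core of the framework — decoupling variable selection (a permutation test per covariate, with multiplicity adjustment doubling as a stopping criterion) from the subsequent split-point search — can be sketched in miniature. The snippet below is an illustrative Python toy, not the authors' ctree/party implementation; the correlation test statistic, the Bonferroni adjustment, and all names are assumptions for a numeric response.

```python
# A minimal sketch (not the authors' implementation) of the two-step split
# search the abstract describes: first test each covariate's association with
# the response via a permutation test, then split only on a significantly
# associated covariate. All names here are illustrative.
import numpy as np

def perm_pvalue(x, y, n_perm=999, rng=None):
    """Permutation p-value for the absolute correlation between x and y."""
    rng = rng or np.random.default_rng(0)
    obs = abs(np.corrcoef(x, y)[0, 1])
    perms = [abs(np.corrcoef(x, rng.permutation(y))[0, 1]) for _ in range(n_perm)]
    return (1 + sum(p >= obs for p in perms)) / (n_perm + 1)

def select_split_variable(X, y, alpha=0.05):
    """Return the index of the covariate to split on, or None to stop.

    Bonferroni-adjusted permutation tests decouple variable selection from
    the exhaustive split-point search, which is what removes the bias
    towards covariates with many possible splits."""
    pvals = np.array([perm_pvalue(X[:, j], y) for j in range(X.shape[1])])
    adjusted = np.minimum(pvals * X.shape[1], 1.0)   # Bonferroni adjustment
    j = int(np.argmin(adjusted))
    return j if adjusted[j] <= alpha else None       # global stopping rule

# Toy usage: only the first covariate is informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * (X[:, 0] > 0) + rng.normal(size=200)
print(select_split_variable(X, y))   # expected: 0
```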
52

Tuning Parameter Selection in L1 Regularized Logistic Regression

Shi, Shujing 05 December 2012 (has links)
Variable selection is an important topic in regression analysis and is intended to select the best subset of predictors. The least absolute shrinkage and selection operator (Lasso) was introduced by Tibshirani in 1996. It can serve as a tool for variable selection because it shrinks some coefficients to exactly zero through a constraint on the sum of the absolute values of the regression coefficients. For logistic regression, the Lasso modifies traditional maximum likelihood estimation by adding the L1 norm of the parameters to the negative log-likelihood, turning a maximization problem into a minimization problem. To solve this problem, we must first specify the multiplier of the L1 norm, called the tuning parameter. Since the tuning parameter affects both coefficient estimation and variable selection, we want to find the value of the tuning parameter that yields the most accurate coefficient estimates and the best subset of predictors in the L1-regularized regression model. Two popular methods for selecting this value are the Bayesian information criterion (BIC) and cross-validation (CV). The objective of this paper is to evaluate and compare these two methods for selecting the tuning parameter, in terms of coefficient estimation accuracy and variable selection, through simulation studies.
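As a concrete illustration of the two selectors being compared, the hedged sketch below grids over the tuning parameter of an L1-penalized logistic regression and picks it once by BIC and once by cross-validation, using scikit-learn (which parameterizes the penalty as C = 1/lambda). The data, grid, and degrees-of-freedom-based BIC formula are illustrative assumptions, not taken from the thesis.

```python
# BIC vs. cross-validation for tuning L1-regularized logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
n = len(y)
Cs = np.logspace(-2, 2, 30)          # C = 1/lambda in scikit-learn

# BIC = -2 * loglik + df * log(n), with df = number of nonzero coefficients.
bics = []
for C in Cs:
    fit = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(X, y)
    loglik = -log_loss(y, fit.predict_proba(X), normalize=False)
    df = np.count_nonzero(fit.coef_)
    bics.append(-2 * loglik + df * np.log(n))
C_bic = Cs[int(np.argmin(bics))]

# Cross-validation picks the C with the best held-out deviance.
cv_fit = LogisticRegressionCV(Cs=Cs, penalty="l1", solver="liblinear",
                              cv=5, scoring="neg_log_loss").fit(X, y)
print("BIC choice of C:", C_bic, " CV choice of C:", cv_fit.C_[0])
```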
53

Variable Selection in Competing Risks Using the L1-Penalized Cox Model

Kong, XiangRong 22 September 2008 (has links)
In survival analysis, the failure of an individual may occur due to one of multiple distinct causes. Survival data generated in this scenario are commonly referred to as competing risks data. One of the major tasks when examining survival data is to assess the dependence of survival time on explanatory variables. In competing risks, as with ordinary univariate survival data, there may be explanatory variables associated with the risks arising from the different causes being studied, and the same variable might have different degrees of influence on the risks due to different causes. Given a set of explanatory variables, it is of interest to identify the subset of variables significantly associated with the risk corresponding to each failure cause. In this project, we develop a statistical methodology for this purpose, that is, to perform variable selection in the presence of competing risks survival data. Asymptotic properties of the model and empirical simulation results evaluating its performance are provided. One important feature of our method, which is based on the L1-penalized Cox model, is its ability to perform variable selection with high-dimensional explanatory variables, i.e., when the number of explanatory variables is larger than the number of observations. The method was applied to a real dataset originating from the National Institutes of Health funded project "Genes related to hepatocellular carcinoma progression in living donor and deceased donor liver transplant'' to identify genes that might be relevant to tumor progression in hepatitis C virus (HCV) infected patients diagnosed with hepatocellular carcinoma (HCC). Gene expression was measured on Affymetrix GeneChip microarrays. Based on the 46 currently available samples, 42 genes show very strong association with tumor progression and deserve further investigation for their clinical implications in the prognosis of progression in patients diagnosed with HCV and HCC.
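The cause-specific formulation can be sketched as follows: for each failure cause, competing events are recoded as censored and an L1-penalized Cox model is fitted. The sketch uses the lifelines library as a stand-in for the thesis' own implementation; the data, column names, and penalty value are illustrative assumptions.

```python
# Cause-specific L1-penalized Cox models for competing risks (a sketch).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, p = 120, 10
df = pd.DataFrame(rng.normal(size=(n, p)),
                  columns=[f"gene_{j}" for j in range(p)])
df["time"] = rng.exponential(scale=np.exp(-df["gene_0"]))
df["cause"] = rng.choice([0, 1, 2], size=n, p=[0.3, 0.4, 0.3])  # 0 = censored

def cause_specific_lasso_cox(data, cause, penalizer=0.1):
    """Fit an L1-penalized Cox model for one failure cause; all other
    causes are recoded as censored (the cause-specific hazard view)."""
    d = data.copy()
    d["event"] = (d["cause"] == cause).astype(int)
    d = d.drop(columns="cause")
    cph = CoxPHFitter(penalizer=penalizer, l1_ratio=1.0)  # lasso-type penalty
    cph.fit(d, duration_col="time", event_col="event")
    # lifelines uses a smooth approximation to |beta|, so coefficients end
    # up near (not exactly) zero; select by a pragmatic threshold instead.
    return cph.params_[cph.params_.abs() > 0.05]

for k in (1, 2):
    print(f"cause {k}:", list(cause_specific_lasso_cox(df, k).index))
```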
54

Ridle for sparse regression with mandatory covariates with application to the genetic assessment of histologic grades of breast cancer

Zhai, Jing, Hsu, Chiu-Hsieh, Daye, Z. John 25 January 2017 (has links)
Background: Many questions in statistical genomics can be formulated in terms of variable selection of candidate biological factors for modeling a trait or quantity of interest. Often, in these applications, additional covariates describing clinical, demographical, or experimental effects must be included a priori as mandatory covariates, while allowing selection from a large number of candidate or optional variables. As genomic studies routinely require mandatory covariates, it is of interest to propose principled methods of variable selection that can incorporate them. Methods: In this article, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that simultaneously estimates the coefficients of mandatory covariates while allowing selection for others. The ridle provides a principled approach to mitigating the effects of multicollinearity among the mandatory covariates and of possible dependency between mandatory and optional variables. We provide detailed empirical and theoretical studies to evaluate our method. In addition, we develop an efficient algorithm for the ridle. Software, based on efficient Fortran code with R-language wrappers, is publicly and freely available at https://sites.google.com/site/zhongyindaye/software. Results: The ridle is useful when mandatory predictors are known to be significant from prior knowledge or must be kept for additional analysis. Both theoretical and comprehensive simulation studies have shown the ridle to be advantageous when mandatory covariates are correlated with irrelevant optional predictors or are highly correlated among themselves. A microarray gene expression analysis of the histologic grades of breast cancer identified 24 genes, of which 2 were selected only by the ridle among current methods and found to be associated with tumor grade. Conclusions: In this article, we proposed the ridle as a principled sparse regression method for the selection of optional variables while incorporating mandatory ones. Results suggest that the ridle is advantageous when mandatory covariates are correlated with irrelevant optional predictors or are highly correlated among themselves.
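The penalty structure described above — a ridge penalty on the mandatory block and a lasso penalty on the optional block — can be sketched with a small proximal-gradient (ISTA-style) solver. This is an illustrative reimplementation under assumed notation and data, not the authors' Fortran/R package.

```python
# Ridge on mandatory covariates, lasso on optional ones (a minimal sketch).
import numpy as np

def ridle(X_m, X_o, y, lam_ridge=1.0, lam_lasso=0.1, n_iter=2000):
    """Minimize (1/2n)||y - X_m b_m - X_o b_o||^2
                + (lam_ridge/2)||b_m||^2 + lam_lasso * ||b_o||_1."""
    X = np.hstack([X_m, X_o])
    n, pm = X.shape[0], X_m.shape[1]
    b = np.zeros(X.shape[1])
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + lam_ridge)  # 1/Lipschitz
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        grad[:pm] += lam_ridge * b[:pm]        # ridge part: smooth gradient
        b = b - step * grad
        # lasso part: soft-threshold only the optional coefficients
        b[pm:] = np.sign(b[pm:]) * np.maximum(np.abs(b[pm:]) - step * lam_lasso, 0)
    return b[:pm], b[pm:]

rng = np.random.default_rng(0)
X_m = rng.normal(size=(100, 2))                # mandatory (always kept)
X_o = rng.normal(size=(100, 20))               # optional (subject to selection)
y = X_m @ [1.0, -1.0] + 2.0 * X_o[:, 0] + rng.normal(size=100)
b_m, b_o = ridle(X_m, X_o, y)
print("mandatory:", b_m.round(2), " optional nonzero:", np.flatnonzero(b_o))
```

Note the design choice the abstract implies: mandatory coefficients are never thresholded, so they stay in the model regardless of the lasso penalty on the optional block.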
55

Distributed Feature Selection in Large n and Large p Regression Problems

Wang, Xiangyu January 2016 (has links)
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach to down-scaling the problem is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge lies in defining an algorithm with low communication, theoretical guarantees, and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator (message) algorithm to address these issues. The algorithm applies feature selection in parallel to each subset using a regularized regression or Bayesian variable selection method, calculates the 'median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments showing excellent performance in feature selection, estimation, prediction, and computation time relative to the usual competitors.

While sample space partitioning is useful for datasets with large sample sizes, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does not depend on the partition number m. Extensive numerical experiments illustrate the performance of the new framework.

For datasets with both large sample sizes and high dimensionality, I propose a new "divide-and-conquer" framework, DEME (DECO-message), that leverages both the DECO and the message algorithms. The new framework first partitions the dataset in the sample space into row cubes using message and then partitions the feature space of the cubes using DECO. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthesized via the DECO and message algorithms in reverse order to produce the final output. The whole framework is extremely scalable. / Dissertation
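A miniature of the message pipeline described above — partition rows, select features per subset, take the median inclusion index, refit, and average — is sketched below. The lasso (scikit-learn's LassoCV) stands in for whichever regularized or Bayesian selector is used per subset; sizes and data are illustrative.

```python
# A simplified sketch of the message idea: row-partitioned selection,
# median inclusion index, per-subset refitting, and averaging.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p, m = 3000, 50, 6                      # m = number of row subsets
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:4] = [3, -2, 1.5, 1]
y = X @ beta + rng.normal(size=n)

subsets = np.array_split(rng.permutation(n), m)

# Step 1: feature selection in parallel on each subset (lasso here).
inclusion = np.zeros((m, p))
for i, idx in enumerate(subsets):
    fit = LassoCV(cv=5).fit(X[idx], y[idx])
    inclusion[i] = fit.coef_ != 0

# Step 2: 'median' inclusion index = majority vote across subsets.
selected = np.flatnonzero(np.median(inclusion, axis=0) > 0.5)

# Step 3: refit OLS on the selected features per subset, then average.
coefs = [LinearRegression().fit(X[idx][:, selected], y[idx]).coef_
         for idx in subsets]
print("selected:", selected, " averaged coef:", np.mean(coefs, axis=0).round(2))
```

The only quantities that would cross machine boundaries here are the binary inclusion vectors and the per-subset coefficient vectors, which is what keeps communication minimal.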
56

Variable selection for survival models based on penalized likelihood

Pinto Junior, Jony Arrais 18 February 2009 (has links)
Variable selection is an important step in building a parsimonious model. However, the most popular variable selection techniques, such as best-subset selection and the stepwise method, ignore the stochastic errors inherent in the variable selection stage. This work studies alternatives to these popular methods for the Cox proportional hazards model and the Cox model with gamma frailty. The alternative methods are based on penalized likelihood and differ from the usual variable selection methods in that they aim to exclude non-significant variables from the model by estimating their coefficients as zero. The resulting estimator has desirable properties under appropriate choices of the penalty function and the tuning parameter. The methods are assessed through simulation, and an application to a real data set is considered.
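The abstract does not name the penalty functions studied; a standard choice in this penalized-likelihood literature with the stated properties (exact zeros for non-significant coefficients, desirable behavior under suitable tuning) is the SCAD penalty of Fan and Li (2001), shown here as an assumed example rather than the thesis' necessarily chosen penalty:

```latex
p_{\lambda}(\beta) \;=\;
\begin{cases}
  \lambda\,|\beta|, & |\beta| \le \lambda,\\[4pt]
  \dfrac{2a\lambda|\beta| - \beta^{2} - \lambda^{2}}{2(a-1)}, & \lambda < |\beta| \le a\lambda,\\[4pt]
  \dfrac{(a+1)\lambda^{2}}{2}, & |\beta| > a\lambda,
\end{cases}
\qquad a > 2 \ (\text{typically } a = 3.7).
```

Near zero it behaves like the lasso, so small coefficients are estimated as exactly zero, but it flattens out for large coefficients, reducing the estimation bias that a pure L1 penalty introduces.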
57

Countries clustering systematics based on performance indexes

Mello, Paula Lunardi de January 2017 (has links)
The world economy underwent major transformations in the last century: periods of sustained growth followed by stagnation, governments alternating market-liberalization strategies with protectionist trade policies, and instability in markets, among others. As an aid to understanding economic and social problems systemically, the analysis of performance indicators can generate relevant information about patterns, behaviors, and trends, as well as guide policies and strategies to improve economic and social outcomes. Indicators describing a country's main economic dimensions can guide the design and monitoring of its development and growth policies. This dissertation uses World Bank data to apply and evaluate systematics for grouping countries with similar characteristics in terms of the indicators that describe them. To do so, it integrates clustering techniques (hierarchical and non-hierarchical), variable selection (through the "leave one variable out at a time" technique), and dimensional reduction (through Principal Component Analysis) to form consistent groupings of countries. The quality of the resulting clusters is evaluated with the Silhouette, Calinski-Harabasz, and Davies-Bouldin indexes. The results were satisfactory regarding the representativeness of the highlighted indicators and the quality of the resulting clustering.
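The pipeline the abstract describes — dimensional reduction by PCA, non-hierarchical clustering, and evaluation by the three named quality indexes — might look like the following sketch, with random data standing in for the World Bank indicators and the "leave one variable out at a time" selection step omitted for brevity.

```python
# PCA + K-Means with the three cluster quality indexes from the abstract.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
X = rng.normal(size=(180, 12))            # stand-in for country indicators

Z = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(X))

for k in range(2, 7):                     # compare candidate cluster counts
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
    print(k,
          round(silhouette_score(Z, labels), 3),          # higher is better
          round(calinski_harabasz_score(Z, labels), 1),   # higher is better
          round(davies_bouldin_score(Z, labels), 3))      # lower is better
```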
58

Variable selection of fixed effects and frailties for Cox Proportional Hazard frailty models and competing risks frailty models

Pelagia, Ioanna January 2016 (has links)
This thesis focuses on two fundamental topics in medical statistics: the modelling of correlated survival datasets and the selection of significant covariates and random effects. In particular, two types of survival data are considered: classical survival datasets, where subjects are likely to experience only one type of event, and competing risks datasets, where subjects are likely to experience one of several types of event. In Chapter 2, among other topics, we highlight the importance of adding frailty terms to the proposed models in order to account for the association between survival time and characteristics of subjects and groups. The main novelty of this thesis is the simultaneous selection of fixed effects and frailty terms through the proposed statistical models for each survival dataset. Chapter 3 covers the analysis of the classical survival dataset through the proposed Cox Proportional Hazards (PH) model. Using a Cox PH frailty model may increase the dimension of the variable components, and estimation of the unknown coefficients becomes very challenging. The method proposed for the analysis of classical survival datasets involves simultaneous variable selection of both fixed effects and frailty terms through penalty functions. The benefit of penalty functions is that they identify the non-significant parameters and set them to have zero effect in the model. Hence, the idea is to 'doubly penalize' the partial likelihood of the Cox PH frailty model, with one penalty for each term. Estimation and selection are implemented through Newton-Raphson algorithms, and closed iterative forms are obtained for the estimation and selection of fixed effects and the prediction of frailty terms. For the selection of frailty terms, penalties are imposed on their variances, since frailties are random effects. Based on the same idea, we further extend the simultaneous variable selection to competing risks datasets in Chapter 4, using extended cause-specific frailty models. Two scenarios are considered for the frailty terms: in the first, frailty terms vary among the different types of events (as do the fixed effects), whereas in the second, frailties are shared over all the types of events. Moreover, our 'individual penalization' approach allows a covariate to be significant for some types of events, in contrast to the frequently used 'group penalization', where a covariate is entirely removed when it is not significant over all the events. For both proposed methods, simulation studies were conducted and showed that the proposed procedures work well in simultaneously selecting and estimating significant fixed effects and frailty terms. The proposed methods are also applied to real datasets: kidney catheter infection, type 2 diabetes, and breast cancer data. The association between survival times and unmeasured characteristics of the subjects was studied, and variable selection for fixed effects and frailties was implemented successfully.
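The 'doubly-penalized' partial likelihood can be written schematically as follows; the notation (p fixed effects beta, q frailty variance parameters theta, penalty functions p with tuning parameters lambda_1 and lambda_2) is assumed here for illustration rather than quoted from the thesis.

```latex
% Schematic doubly-penalized log-likelihood (notation assumed):
% beta_j are fixed effects, theta_k are frailty variance parameters.
\ell_{\mathrm{pen}}(\beta,\theta)
  \;=\; \ell(\beta, b)
  \;-\; \sum_{j=1}^{p} p_{\lambda_1}\!\bigl(|\beta_j|\bigr)
  \;-\; \sum_{k=1}^{q} p_{\lambda_2}\!\bigl(\theta_k\bigr).
```

Shrinking a variance theta_k to zero removes the k-th frailty term, just as shrinking a beta_j to zero removes the j-th covariate, which is what lets a single penalization mechanism select both fixed and random effects.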
60

Clustering workers with similar learning profiles using multivariate techniques

Azevedo, Bárbara Brzezinski January 2013 (has links)
The manufacture of customized products results in a variety of models, smaller batch sizes, and frequent alternation of the tasks performed by workers. In this context, manual tasks are especially affected by the workers' process of adapting to new product models. This learning process can unfold differently within a group of workers. This thesis therefore aims at grouping workers with similar learning profiles, guarding against the bottlenecks that arise in production lines from learning dissimilarities in manual processes. It presents approaches for clustering workers based on parameters derived from Learning Curve (LC) modeling. These parameters, which characterize the workers' adaptation to tasks, are transformed through Principal Component Analysis (PCA), and the PCA scores are used as clustering variables. Next, Kernel transformations of the parameters are tested to improve clustering quality. Workers are clustered with the K-Means and Fuzzy C-Means methods, and the quality of the resulting clusters is measured by the Silhouette Index. Finally, a variable importance index based on parameters obtained from PCA is proposed to select the most relevant variables for clustering. The proposed approaches are applied to a footwear manufacturing process, yielding satisfactory results when compared to clusterings performed without the prior data transformation or without variable selection.
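A sketch of the basic approach described above: fit a learning curve per worker, reduce the fitted parameters with PCA, cluster the PCA scores with K-Means, and assess the grouping with the Silhouette Index. The power-law (Wright-type) curve and all data here are assumptions; the thesis' exact LC model and Kernel/Fuzzy C-Means variants are not reproduced.

```python
# Cluster workers by their fitted learning-curve parameters (a sketch).
import numpy as np
from scipy.optimize import curve_fit
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def lc(x, k, b):                  # Wright-type power-law learning curve
    return k * x ** b

rng = np.random.default_rng(0)
trials = np.arange(1, 31, dtype=float)
params = []
for worker in range(40):          # simulate two latent learner profiles
    k, b = (60, -0.3) if worker < 20 else (80, -0.1)
    times = lc(trials, k, b) + rng.normal(scale=2.0, size=trials.size)
    est, _ = curve_fit(lc, trials, times, p0=(50, -0.2))
    params.append(est)            # fitted (k, b) characterize adaptation

scores = PCA(n_components=2).fit_transform(np.array(params))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print("silhouette:", round(silhouette_score(scores, labels), 3))
```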
