
Sélection de variables pour la classification non supervisée en grande dimension / Variable selection in model-based clustering for high-dimensional data

Meynet, Caroline 09 November 2012
Il existe des situations de modélisation statistique pour lesquelles le problème classique de classification non supervisée (c'est-à-dire sans information a priori sur la nature ou le nombre de classes à constituer) se double d'un problème d'identification des variables réellement pertinentes pour déterminer la classification. Cette problématique est d'autant plus essentielle que les données dites de grande dimension, comportant bien plus de variables que d'observations, se multiplient ces dernières années : données d'expression de gènes, classification de courbes... Nous proposons une procédure de sélection de variables pour la classification non supervisée adaptée aux problèmes de grande dimension. Nous envisageons une approche par modèles de mélange gaussien, ce qui nous permet de reformuler le problème de sélection des variables et du choix du nombre de classes en un problème global de sélection de modèle. Nous exploitons les propriétés de sélection de variables de la régularisation l1 pour construire efficacement, à partir des données, une collection de modèles qui reste de taille raisonnable même en grande dimension. Nous nous démarquons des procédures classiques de sélection de variables par régularisation l1 en ce qui concerne l'estimation des paramètres : dans chaque modèle, au lieu de considérer l'estimateur Lasso, nous calculons l'estimateur du maximum de vraisemblance. Ensuite, nous sélectionnons l'un de ces estimateurs du maximum de vraisemblance par un critère pénalisé non asymptotique basé sur l'heuristique de pente introduite par Birgé et Massart. D'un point de vue théorique, nous établissons un théorème de sélection de modèle pour l'estimation d'une densité par maximum de vraisemblance pour une collection aléatoire de modèles. Nous l'appliquons dans notre contexte pour trouver une forme de pénalité minimale pour notre critère pénalisé. D'un point de vue pratique, des simulations sont effectuées pour valider notre procédure, en particulier dans le cadre de la classification non supervisée de courbes. L'idée clé de notre procédure est de n'utiliser la régularisation l1 que pour constituer une collection restreinte de modèles et non pas aussi pour estimer les paramètres des modèles. Cette étape d'estimation est réalisée par maximum de vraisemblance. Cette procédure hybride nous est inspirée par une étude théorique menée dans une première partie, dans laquelle nous établissons des inégalités oracle l1 pour le Lasso dans les cadres de régression gaussienne et de mélange de régressions gaussiennes, qui se démarquent des inégalités oracle l0 traditionnellement établies par leur absence totale d'hypothèse. / This thesis deals with variable selection for clustering. This problem has become all the more challenging since the recent increase in high-dimensional data, where the number of variables can largely exceed the number of observations (DNA analysis, functional data clustering...). We propose a variable selection procedure for clustering suited to high-dimensional contexts. We consider clustering based on finite Gaussian mixture models in order to recast both the variable selection and the choice of the number of clusters into a global model selection problem. We use the variable selection property of l1-regularization to build a data-driven model collection in an efficient way. Our procedure differs from classical procedures using l1-regularization as regards the estimation of the mixture parameters: in each model of the collection, rather than considering the Lasso estimator, we calculate the maximum likelihood estimator. Then, we select one of these maximum likelihood estimators by a non-asymptotic penalized criterion. From a theoretical viewpoint, we establish a model selection theorem for maximum likelihood estimators in a density estimation framework with a random model collection. We apply it in our context to determine a convenient penalty shape for our criterion. From a practical viewpoint, we carry out simulations to validate our procedure, for instance in the functional data clustering framework. The basic idea of our procedure, which consists in variable selection by l1-regularization but estimation by maximum likelihood, comes from theoretical results we establish in the first part of this thesis: we provide l1-oracle inequalities for the Lasso in the regression framework which, contrary to the usual l0-oracle inequalities in the literature, are valid with no assumptions at all, thus suggesting a gap between l1-regularization and l0-regularization.
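By way of illustration only, the sketch below mimics the hybrid "screen, then estimate by maximum likelihood" idea in Python. Since scikit-learn offers no l1-penalized Gaussian mixture, a crude variance ranking stands in for the l1 screening step, and BIC stands in for the slope-heuristic penalty; the full density is scored by fitting a mixture on the candidate variables and independent Gaussians on the rest, so the criteria remain comparable across subsets. None of this is the author's code.

```python
# Sketch (not the author's code): screen variables, then estimate each candidate
# model by maximum likelihood and pick one with a penalized criterion. Variance
# ranking stands in for l1 screening; BIC stands in for the slope heuristic.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
X[: n // 2, :5] += 3.0                      # two clusters, separated on 5 variables only

def independent_gaussian_bic(x):
    """BIC of an independent Gaussian fit to each column (the 'irrelevant' block)."""
    mu, sd = x.mean(axis=0), x.std(axis=0)
    loglik = norm.logpdf(x, mu, sd).sum()
    return -2.0 * loglik + 2 * x.shape[1] * np.log(x.shape[0])

order = np.argsort(X.var(axis=0))[::-1]     # crude screening: most dispersed first
candidates = []
for k in (2, 5, 10, 20):                    # candidate sets of "relevant" variables
    S, Sc = order[:k], order[k:]
    for K in (1, 2, 3):                     # candidate numbers of clusters
        gm = GaussianMixture(n_components=K, n_init=3, random_state=0).fit(X[:, S])
        # Full-density criterion: mixture on S plus independent Gaussians on Sc,
        # so that models over different variable subsets remain comparable.
        bic = gm.bic(X[:, S]) + independent_gaussian_bic(X[:, Sc])
        candidates.append((bic, K, S))

bic, K, S = min(candidates, key=lambda c: c[0])
print(f"selected K={K} clusters on variables {sorted(S.tolist())}")
```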

Modèles de mélange pour la régression en grande dimension, application aux données fonctionnelles / High-dimensional mixture regression models, application to functional data

Devijver, Emilie 02 July 2015
Les modèles de mélange pour la régression sont utilisés pour modéliser la relation entre la réponse et les prédicteurs, pour des données issues de différentes sous-populations. Dans cette thèse, on étudie des prédicteurs de grande dimension et une réponse de grande dimension. Tout d’abord, on obtient une inégalité oracle ℓ1 satisfaite par l’estimateur du Lasso. On s’intéresse à cet estimateur pour ses propriétés de régularisation ℓ1. On propose aussi deux procédures pour pallier ce problème de classification en grande dimension. La première procédure utilise l’estimateur du maximum de vraisemblance pour estimer la densité conditionnelle inconnue, en se restreignant aux variables actives sélectionnées par un estimateur de type Lasso. La seconde procédure considère la sélection de variables et la réduction de rang pour diminuer la dimension. Pour chaque procédure, on obtient une inégalité oracle, qui explicite la pénalité nécessaire pour sélectionner un modèle proche de l’oracle. On étend ces procédures au cas des données fonctionnelles, où les prédicteurs et la réponse peuvent être des fonctions. Dans ce but, on utilise une approche par ondelettes. Pour chaque procédure, on fournit des algorithmes, et on applique et évalue nos méthodes sur des simulations et des données réelles. En particulier, on illustre la première méthode par des données de consommation électrique. / Finite mixture regression models are useful for modeling the relationship between a response and predictors arising from different subpopulations. In this thesis, we focus on high-dimensional predictors and a high-dimensional response. First of all, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator. We focus on this estimator for its ℓ1-regularization properties rather than for the variable selection procedure. We also propose two procedures to deal with this issue. The first procedure estimates the unknown conditional mixture density by a maximum likelihood estimator, restricted to the relevant variables selected by an ℓ1-penalized maximum likelihood estimator. The second procedure considers predictor selection and rank reduction jointly, to obtain lower-dimensional approximations of the parameter matrices. For each procedure, we get an oracle inequality, which specifies the penalty shape of the criterion, depending on the complexity of the random model collection. We extend these procedures to the functional case, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each procedure, we provide algorithms, and we apply and evaluate our methods on both simulated and real datasets. In particular, we illustrate the first procedure on an electricity load consumption dataset.
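As a rough sketch of the first procedure, under simplifying assumptions of our own (univariate response, two components fixed in advance, no penalized model selection): a Lasso screens the predictors, then a small EM loop fits a mixture of linear regressions by maximum likelihood on the selected variables only.

```python
# Sketch, assuming a univariate response and K=2 components for brevity:
# step 1 screens predictors with the Lasso; step 2 runs EM for a mixture of
# linear regressions, by maximum likelihood, restricted to the active set.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p, K = 300, 30, 2
X = rng.normal(size=(n, p))
z = rng.integers(0, K, size=n)                       # latent subpopulation
beta = np.zeros((K, p))
beta[0, :3], beta[1, :3] = 2.0, -2.0                 # only 3 relevant predictors
y = (X * beta[z]).sum(axis=1) + 0.5 * rng.normal(size=n)

active = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)   # step 1: screening
Xa = X[:, active]

w = rng.dirichlet(np.ones(K), size=n)                # soft assignments, shape (n, K)
for _ in range(100):
    pi, B, s2 = w.mean(axis=0), [], []
    for k in range(K):                               # M-step: weighted least squares
        sw = np.sqrt(w[:, k])
        Bk = np.linalg.lstsq(Xa * sw[:, None], y * sw, rcond=None)[0]
        B.append(Bk)
        s2.append((w[:, k] * (y - Xa @ Bk) ** 2).sum() / w[:, k].sum())
    logf = np.column_stack([np.log(pi[k]) + norm.logpdf(y, Xa @ B[k], np.sqrt(s2[k]))
                            for k in range(K)])
    w = np.exp(logf - logf.max(axis=1, keepdims=True))   # E-step: responsibilities
    w /= w.sum(axis=1, keepdims=True)

print("active variables:", active.tolist())
print("component coefficients:\n", np.round(np.array(B), 2))
```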

Emprego de técnicas de análise exploratória de dados utilizados em Química Medicinal / Use of different techniques for exploratory data analysis in Medicinal Chemistry

Gertrudes, Jadson Castro 10 September 2013
Pesquisas na área de Química Medicinal têm direcionado esforços na busca por métodos que acelerem o processo de descoberta de novos medicamentos. Dentre as diversas etapas relacionadas ao longo do processo de descoberta de substâncias bioativas está a análise das relações entre a estrutura química e a atividade biológica de compostos. Neste processo, os pesquisadores da área de Química Medicinal analisam conjuntos de dados que são caracterizados pela alta dimensionalidade e baixo número de observações. Dentro desse contexto, o presente trabalho apresenta uma abordagem computacional que visa contribuir para a análise de dados químicos e, consequentemente, a descoberta de novos medicamentos para o tratamento de doenças crônicas. As abordagens de análise exploratória de dados, utilizadas neste trabalho, combinam técnicas de redução de dimensionalidade e de agrupamento para detecção de estruturas naturais que reflitam a atividade biológica dos compostos analisados. Dentre as diversas técnicas existentes para a redução de dimensionalidade, são discutidas o escore de Fisher, a análise de componentes principais e a análise de componentes principais esparsas. Quanto aos algoritmos de aprendizado, são avaliados o k-médias, fuzzy c-médias e modelo de misturas ICA aperfeiçoado. No desenvolvimento deste trabalho foram utilizados quatro conjuntos de dados, contendo informações de substâncias bioativas, sendo que dois conjuntos foram relacionados ao tratamento da diabetes mellitus e da síndrome metabólica, o terceiro conjunto relacionado a doenças cardiovasculares e o último conjunto apresenta substâncias que podem ser utilizadas no tratamento do câncer. Nos experimentos realizados, os resultados alcançados sugerem a utilização das técnicas de redução de dimensionalidade juntamente com os algoritmos não supervisionados para a tarefa de agrupamento dos dados químicos, uma vez que nesses experimentos foi possível descrever níveis de atividade biológica dos compostos estudados. Portanto, é possível concluir que as técnicas de redução de dimensionalidade e de agrupamento podem possivelmente ser utilizadas como guias no processo de descoberta e desenvolvimento de novos compostos na área de Química Medicinal. / Research in Medicinal Chemistry has focused on the search for methods that accelerate the process of drug discovery. Among the several steps in the process of discovering bioactive substances is the analysis of the relationships between the chemical structure and the biological activity of compounds. In this process, researchers in medicinal chemistry analyze data sets that are characterized by high dimensionality and a small number of observations. Within this context, this work presents a computational approach that aims to contribute to the analysis of chemical data and, consequently, to the discovery of new drugs for the treatment of chronic diseases. The exploratory data analysis approaches employed in this work combine dimensionality reduction and clustering techniques to detect natural structures that reflect the biological activity of the analyzed compounds. Among the several existing techniques for dimensionality reduction, we discuss the Fisher score, principal component analysis and sparse principal component analysis. For the clustering procedure, this study evaluated k-means, fuzzy c-means and the enhanced ICA mixture model. To perform the experiments, we used four data sets containing information on bioactive substances: two sets are related to the treatment of diabetes mellitus and metabolic syndrome, the third set is related to cardiovascular disease, and the last set contains substances that can be used in cancer treatment. In the experiments, the obtained results suggest the use of dimensionality reduction techniques along with clustering algorithms for the task of clustering chemical data, since in these experiments it was possible to describe different levels of biological activity of the studied compounds. Therefore, we conclude that dimensionality reduction and clustering techniques can be used as guides in the process of discovery and development of new compounds in the field of Medicinal Chemistry.
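A minimal sketch of such a pipeline in Python, on synthetic stand-ins for chemical descriptors; the study's other techniques (Fisher score, sparse PCA, fuzzy c-means, the ICA mixture model) are omitted here for brevity.

```python
# Sketch on synthetic data: standardize, reduce with PCA, cluster with k-means,
# then check whether the clusters recover the known activity levels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n_per, p = 30, 500                                   # few observations, many descriptors
activity = np.repeat([0, 1, 2], n_per)               # three known activity levels
X = rng.normal(size=(3 * n_per, p))
X[:, :10] += 2.0 * activity[:, None]                 # activity drives 10 descriptors

scores = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)

# Agreement between unsupervised clusters and activity levels (1.0 = perfect).
print("adjusted Rand index:", round(adjusted_rand_score(activity, labels), 3))
```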

Análise e comparação de alguns métodos alternativos de seleção de variáveis preditoras no modelo de regressão linear / Analysis and comparison of some alternative methods of selection of predictor variables in linear regression models.

Marques, Matheus Augustus Pumputis 04 June 2018
Neste trabalho estudam-se alguns novos métodos de seleção de variáveis no contexto da regressão linear que surgiram nos últimos 15 anos, especificamente o LARS - Least Angle Regression, o NAMS - Noise Addition Model Selection, a Razão de Falsa Seleção - RFS (FSR em inglês), o LASSO Bayesiano e o Spike-and-Slab LASSO. A metodologia foi a análise e comparação dos métodos estudados e aplicações. Após esse estudo, realizam-se aplicações em bases de dados reais e um estudo de simulação, em que todos os métodos se mostraram promissores, com os métodos Bayesianos apresentando os melhores resultados. / In this work, we study some new variable selection methods for linear regression that have appeared in the last 15 years, specifically LARS (Least Angle Regression), NAMS (Noise Addition Model Selection), the False Selection Rate (FSR), the Bayesian LASSO and the Spike-and-Slab LASSO. The methodology was the analysis and comparison of the studied methods. After this study, applications to real datasets are made, as well as a simulation study, in which all methods proved promising, with the Bayesian methods showing the best results.
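For the two of these methods readily available in scikit-learn, a toy comparison might look as follows; the Bayesian LASSO, Spike-and-Slab LASSO, NAMS and FSR require dedicated implementations and are not shown.

```python
# Toy comparison of two of the studied methods on data with 3 true predictors.
import numpy as np
from sklearn.linear_model import Lars, LassoCV

rng = np.random.default_rng(3)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -2.0, 1.0]                          # three true predictors
y = X @ beta + rng.normal(size=n)

for name, model in [("LARS", Lars(n_nonzero_coefs=3)),
                    ("Lasso (CV)", LassoCV(cv=5))]:
    fit = model.fit(X, y)
    print(name, "selected:", np.flatnonzero(fit.coef_).tolist())
```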

Développement de méthodes spatio-temporelles pour la prévision à court terme de la production photovoltaïque / Development of spatio-temporal methods for short-term forecasting of photovoltaic production

Agoua, Xwégnon 20 December 2017
L’évolution du contexte énergétique mondial et la lutte contre le changement climatique ont conduit à l’accroissement des capacités de production d’énergie renouvelable. Les énergies renouvelables sont caractérisées par une forte variabilité due à leur dépendance aux conditions météorologiques. La maîtrise de cette variabilité constitue un enjeu important pour les opérateurs du système électrique, mais aussi pour l’atteinte des objectifs européens de réduction des émissions de gaz à effet de serre, d’amélioration de l’efficacité énergétique et d’augmentation de la part des énergies renouvelables. Dans le cas du photovoltaïque (PV), la maîtrise de la variabilité de la production passe par la mise en place d’outils qui permettent de prévoir la production future des centrales. Ces prévisions contribuent entre autres à l’augmentation du niveau de pénétration du PV, à l’intégration optimale dans le réseau électrique, à l’amélioration de la gestion des centrales PV et à la participation aux marchés de l’électricité. L’objectif de cette thèse est de contribuer à l’amélioration de la prédictibilité à court terme (moins de 6 heures) de la production PV. Dans un premier temps, nous analysons la variabilité spatio-temporelle de la production PV et proposons une méthode de réduction de la non-stationnarité des séries de production. Nous proposons ensuite un modèle spatio-temporel de prévision déterministe qui exploite les corrélations spatio-temporelles entre les centrales réparties sur une région. Les centrales sont utilisées comme un réseau de capteurs qui permettent d’anticiper les sources de variabilité. Nous proposons aussi une méthode automatique de sélection des variables qui permet de résoudre les problèmes de dimension et de parcimonie du modèle spatio-temporel. Un modèle spatio-temporel probabiliste a aussi été développé afin de produire des prévisions performantes non seulement du niveau moyen de la production future mais de toute sa distribution. Enfin, nous proposons un modèle qui exploite les observations d’images satellites pour améliorer la prévision à court terme de la production, ainsi qu’une comparaison de l’apport de différentes sources de données sur les performances de prévision. / The evolution of the global energy context and the challenges of climate change have led to an increase in the production capacity of renewable energy. Renewable energies are characterized by high variability due to their dependence on meteorological conditions. Controlling this variability is an important challenge for the operators of the electricity systems, but also for achieving the European objectives of reducing greenhouse gas emissions, improving energy efficiency and increasing the share of renewable energies in EU energy consumption. In the case of photovoltaics (PV), controlling the variability of the production requires predicting, with minimum errors, the future production of the power stations. These forecasts contribute to increasing the level of PV penetration and optimal integration in the power grid, improving PV plant management and participating in electricity markets. The objective of this thesis is to contribute to the improvement of the short-term predictability (less than 6 hours) of PV production. First, we analyze the spatio-temporal variability of PV production and propose a method to reduce the nonstationarity of the production series. We then propose a deterministic prediction model that exploits the spatio-temporal correlations between the power plants of a spatial grid. The power stations are used as a network of sensors to anticipate sources of variability. We also propose an automatic method for selecting variables to solve the dimensionality and sparsity problems of the space-time model. A probabilistic spatio-temporal model has also been developed to produce efficient forecasts not only of the average level of future production but of its entire distribution. Finally, we propose a model that exploits observations of satellite images to improve short-term forecasting of PV production, together with a comparison of the contribution of different data sources to forecasting performance.
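A minimal sketch of the deterministic spatio-temporal idea, on synthetic series, with an l1 penalty standing in for the thesis's automatic variable selection: lagged production from every plant in a region feeds the forecast of one target plant, and the penalty prunes the large feature set.

```python
# Sketch: forecast plant 0 one step ahead from lags 1..4 of all 12 plants;
# the l1 penalty prunes the 48 candidate features automatically.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
T, n_plants, n_lags = 1000, 12, 4
weather = np.cumsum(rng.normal(size=(T, 1)), axis=0)        # shared regional driver
prod = weather + rng.normal(scale=2.0, size=(T, n_plants))  # plant-level series

# Design matrix: production of every plant at lags 1..n_lags.
X = np.stack([prod[n_lags - l: T - l] for l in range(1, n_lags + 1)], axis=2)
X = X.reshape(T - n_lags, n_plants * n_lags)
y = prod[n_lags:, 0]                                        # target plant, horizon 1

model = LassoCV(cv=5).fit(X[:800], y[:800])
rmse = np.sqrt(np.mean((model.predict(X[800:]) - y[800:]) ** 2))
print("features kept:", int(np.count_nonzero(model.coef_)), "test RMSE:", round(rmse, 3))
```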

Contrôle des fausses découvertes lors de la sélection de variables en grande dimension / Control of false discoveries in high-dimensional variable selection

Bécu, Jean-Michel 10 March 2016
Dans le cadre de la régression, de nombreuses études s’intéressent au problème dit de la grande dimension, où le nombre de variables explicatives mesurées sur chaque échantillon est beaucoup plus grand que le nombre d’échantillons. Si la sélection de variables est une question classique, les méthodes usuelles ne s’appliquent pas dans le cadre de la grande dimension. Ainsi, dans ce manuscrit, nous présentons la transposition de tests statistiques classiques à la grande dimension. Ces tests sont construits sur des estimateurs des coefficients de régression produits par des approches de régressions linéaires pénalisées, applicables dans le cadre de la grande dimension. L’objectif principal des tests que nous proposons consiste à contrôler le taux de fausses découvertes. La première contribution de ce manuscrit répond à un problème de quantification de l’incertitude sur les coefficients de régression réalisée sur la base de la régression Ridge, qui pénalise les coefficients de régression par leur norme l2, dans le cadre de la grande dimension. Nous y proposons un test statistique basé sur le rééchantillonnage. La seconde contribution porte sur une approche de sélection en deux étapes : une première étape de criblage des variables, basée sur la régression parcimonieuse Lasso, précède l’étape de sélection proprement dite, où la pertinence des variables pré-sélectionnées est testée. Les tests sont construits sur l’estimateur de la régression Ridge adaptive, dont la pénalité est construite à partir des coefficients de régression du Lasso. Une dernière contribution consiste à transposer cette approche à la sélection de groupes de variables. / In the regression framework, many studies focus on the so-called high-dimensional problem, where the number of explanatory variables measured on each sample is much larger than the number of samples. While variable selection is a classical question, the usual methods do not apply in the high-dimensional case. In this manuscript, we therefore develop the transposition of classical statistical tests to high dimension. These tests operate on estimates of regression coefficients obtained by penalized linear regression, which is applicable in high dimension. The main objective of these tests is to control the false discovery rate. The first contribution of this manuscript quantifies the uncertainty of regression coefficients estimated by ridge regression, which penalizes the coefficients by their l2 norm, in high dimension; to do this, we devise a statistical test based on permutations. The second contribution is a two-step selection approach: a first step is dedicated to the screening of variables, based on the sparse Lasso regression, and the second step consists in cleaning the resulting set by testing the relevance of the pre-selected variables. These tests are made on adaptive ridge estimates, whose penalty is constructed from the Lasso estimates learned during the screening step. A final contribution transposes this approach to the selection of groups of variables.
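The mechanics of the first contribution can be illustrated as follows; the actual test statistic and its calibration are more refined than this sketch, which simply permutes one column and re-fits the ridge to build a null distribution.

```python
# Sketch: permutation null for one ridge coefficient in a p >> n design.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n, p = 80, 200                                       # high dimension: p >> n
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)               # only variable 0 matters

def ridge_coef(X, y, j, alpha=1.0):
    """j-th coefficient of an l2-penalized linear fit."""
    return Ridge(alpha=alpha).fit(X, y).coef_[j]

j, n_perm = 0, 200
stat = abs(ridge_coef(X, y, j))
null = []
for _ in range(n_perm):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])             # break x_j's link to y
    null.append(abs(ridge_coef(Xp, y, j)))
pval = (1 + sum(s >= stat for s in null)) / (n_perm + 1)
print(f"permutation p-value for variable {j}: {pval:.3f}")
```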

Redes Neurais Aplicadas à Inferência dos Sinais de Controle de Dosagem de Coagulantes em uma ETA por Filtração Rápida / Artificial Neural Networks applied to the inference of dosage control signals of coagulants in a water treatment plant by direct filtration

Leonaldo da Silva Gomes 28 February 2012
Considerando a importância do controle da coagulação química para o processo de tratamento de água por filtração rápida, esta dissertação propõe a aplicação de redes neurais artificiais para inferência dos sinais de controle de dosagem de coagulantes principal e auxiliar, no processo de coagulação química em uma estação de tratamento de água por filtração rápida. Para tanto, foi feita uma análise comparativa da aplicação de modelos baseados em redes neurais do tipo: alimentada adiante focada atrasada no tempo (FTLFN); alimentada adiante atrasada no tempo distribuída (DTLFN); recorrente de Elman (ERN) e auto-regressiva não-linear com entradas exógenas (NARX). Da análise comparativa, o modelo baseado em redes NARX apresentou melhores resultados, evidenciando o potencial do modelo para uso em casos reais, o que contribuirá para a viabilização de projetos desta natureza em estações de tratamento de água de pequeno porte. / Considering the importance of chemical coagulation control for water treatment by direct filtration, this dissertation proposes the application of artificial neural networks to infer the dosage control signals of the principal and auxiliary coagulants in the chemical coagulation process of a water treatment plant operating by direct filtration. To that end, a comparative analysis was made of models based on the following neural network types: Focused Time Lagged Feedforward Network (FTLFN); Distributed Time Lagged Feedforward Network (DTLFN); Elman Recurrent Network (ERN); and Nonlinear Autoregressive with exogenous inputs (NARX). In the comparative analysis, the model based on NARX networks showed the best results, demonstrating its potential for use in real cases, which will contribute to the viability of projects of this nature in small water treatment plants.
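A rough Python sketch of a NARX-style setup, using a generic feedforward network (scikit-learn's MLPRegressor) fed with lagged outputs and lagged exogenous inputs; the series below are synthetic stand-ins for the plant data used in the dissertation.

```python
# Sketch: NARX-style regression with a generic feedforward net on synthetic data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
T, lags = 2000, 3
u = rng.normal(size=(T, 2))                          # exogenous inputs (e.g. turbidity, pH)
y = np.zeros(T)
for t in range(1, T):                                # synthetic dosage dynamics
    y[t] = 0.8 * y[t - 1] + 0.5 * np.tanh(u[t - 1, 0]) - 0.3 * u[t - 1, 1]

# NARX regressors: y(t-1..t-lags) and u(t-1..t-lags) predict y(t).
X = np.array([np.r_[y[t - lags: t], u[t - lags: t].ravel()] for t in range(lags, T)])
target = y[lags:]

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X[:1500], target[:1500])
print("held-out R^2:", round(net.score(X[1500:], target[1500:]), 3))
```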

Lasso顯著性檢定與向前逐步迴歸變數選取方法之比較 / A Comparison between Lasso Significance Test and Forward Stepwise Selection Method

鄒昀庭, Tsou, Yun Ting Unknown Date
迴歸模式的變數選取是很重要的課題，Tibshirani於1996年提出最小絕對壓縮挑選機制（Least Absolute Shrinkage and Selection Operator；簡稱Lasso），主要特色是能在估計的過程中自動完成變數選取。但因為Lasso本身並沒有牽扯到統計推論的層面，因此2014年時Lockhart et al.所提出的Lasso顯著性檢定是重要的突破。由於Lasso顯著性檢定的建構過程與傳統向前逐步迴歸相近，本研究接續Lockhart et al.(2014)對兩種變數選取方法的比較，提出以Bootstrap來改良傳統向前逐步迴歸；最後並比較Lasso、Lasso顯著性檢定、傳統向前逐步迴歸、以AIC決定變數組合的向前逐步迴歸，以及以Bootstrap改良的向前逐步迴歸等五種方法變數選取之效果。最後發現Lasso顯著性檢定雖然不容易犯型一錯誤，選取變數時卻過於保守；而以Bootstrap改良的向前逐步迴歸跟Lasso顯著性檢定一樣不容易犯型一錯誤，而選取變數上又比起Lasso顯著性檢定更大膽，因此可算是理想的方法改良結果。 / Variable selection for regression models is an essential topic. In 1996, Tibshirani proposed the Lasso (Least Absolute Shrinkage and Selection Operator), which completes variable selection while estimating the parameters. However, the original version of the Lasso does not provide a way of making inference, so the significance test for the Lasso proposed by Lockhart et al. in 2014 is an important breakthrough. Because the test statistic of the Lasso significance test is constructed much like that of traditional forward stepwise selection, we continue the comparison of the two methods begun in Lockhart et al. (2014) and propose an improved version of forward selection based on the bootstrap. In the second half of our research, we compare the variable selection performance of five methods: the Lasso, the Lasso significance test, forward selection, forward selection by AIC, and forward selection by bootstrap. We find that although the Type I error probability of the Lasso significance test is small, the test is too conservative in including new variables. The Type I error probability of forward selection by bootstrap is also small, yet it is more aggressive in including new variables. Therefore, based on our simulation results, the bootstrap-improved forward selection is an appealing variable selection method.
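A rough sketch of the bootstrap-guarded forward selection idea: a candidate variable enters only if it is the preferred addition in a clear majority of bootstrap resamples. The majority threshold and stopping rule below are illustrative choices of ours, not the thesis's, and the Lasso significance test itself requires the covariance statistic of Lockhart et al., which is not reproduced here.

```python
# Sketch with illustrative thresholds: forward selection where a variable is
# added only if it wins the forward step in >= 70% of bootstrap resamples.
import numpy as np

rng = np.random.default_rng(7)
n, p = 150, 10
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

def best_addition(X, y, selected):
    """Index of the variable whose addition minimizes the residual sum of squares."""
    best_j, best_rss = None, np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        A = X[:, selected + [j]]
        rss = np.sum((y - A @ np.linalg.lstsq(A, y, rcond=None)[0]) ** 2)
        if rss < best_rss:
            best_j, best_rss = j, rss
    return best_j

selected, B, majority = [], 100, 0.7
while len(selected) < p:
    votes = np.zeros(p)
    for _ in range(B):                               # bootstrap the choice of next variable
        idx = rng.integers(0, n, size=n)
        votes[best_addition(X[idx], y[idx], selected)] += 1
    j = int(votes.argmax())
    if votes[j] / B < majority:                      # no stable candidate: stop
        break
    selected.append(j)
print("selected variables:", selected)
```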
