Spelling suggestions: "subject:"group passo"" "subject:"croup passo""
1 |
Sélection de groupes de variables corrélées en grande dimension / Selection of groups of correlated variables in a high dimensionnal settingGrimonprez, Quentin 14 December 2016 (has links)
Le contexte de cette thèse est la sélection de variables en grande dimension à l'aide de procédures de régression régularisée en présence de redondance entre variables explicatives. Parmi les variables candidates, on suppose que seul un petit nombre est réellement pertinent pour expliquer la réponse. Dans ce cadre de grande dimension, les approches classiques de type Lasso voient leurs performances se dégrader lorsque la redondance croît, puisqu'elles ne tiennent pas compte de cette dernière. Regrouper au préalable ces variables peut pallier ce défaut, mais nécessite usuellement la calibration de paramètres supplémentaires. L'approche proposée combine regroupement et sélection de variables dans un souci d'interprétabilité et d'amélioration des performances. D'abord une Classification Ascendante Hiérarchique (CAH) fournit à chaque niveau une partition des variables en groupes. Puis le Group-lasso est utilisé à partir de l'ensemble des groupes de variables des différents niveaux de la CAH à paramètre de régularisation fixé. Choisir ce dernier fournit alors une liste de groupe candidats issus potentiellement de différents niveaux. Le choix final des groupes est obtenu via une procédure de tests multiples. La procédure proposée exploite la structure hiérarchique de la CAH et des pondérations dans le Group-lasso. Cela permet de réduire considérablement la complexité algorithmique induite par la flexibilité. / This thesis takes place in the context of variable selection in the high dimensional setting using penalizedregression in presence of redundancy between explanatory variables. Among all variables, we supposethat only a few number is relevant for predicting the response variable. In this high dimensional setting,performance of classical lasso-based approaches decreases when redundancy increases as they do not takeit into account. Firstly aggregating variables can overcome this problem but generally requires calibrationof additional parameters. The proposed approach combines variables aggregation and selection in order to improve interpretabilityand performance. First, a hierarchical clustering procedure provides at each level a partition of the variablesinto groups. Then the Group-lasso is used with the set of groups of variables from the different levels ofthe hierarchical clustering and a fixed regularization parameter. Choosing this parameter provides a list ofcandidates groups potentially coming from different levels. The final choice of groups is done by a multipletesting procedure. The proposed procedure exploits the hierarchical structure from hierarchical clustering and some weightsin Group-lasso. This allows to greatly reduce the algorithm complexity induced by the possibility to choosegroups coming from different levels of the hierarchical clustering.
|
2 |
Grouped variable selection in high dimensional partially linear additive Cox modelLiu, Li 01 December 2010 (has links)
In the analysis of survival outcome supplemented with both clinical information and high-dimensional gene expression data, traditional Cox proportional hazard model fails to meet some emerging needs in biological research. First, the number of covariates is generally much larger the sample size. Secondly, predicting an outcome with individual gene expressions is inadequate because a gene's expression is regulated by multiple biological processes and functional units. There is a need to understand the impact of changes at a higher level such as molecular function, cellular component, biological process, or pathway. The change at a higher level is usually measured with a set of gene expressions related to the biological process. That is, we need to model the outcome with gene sets as variable groups and the gene sets could be partially overlapped also.
In this thesis work, we investigate the impact of a penalized Cox regression procedure on regularization, parameter estimation, variable group selection, and nonparametric modeling of nonlinear eects with a time-to-event outcome.
We formulate the problem as a partially linear additive Cox model with high-dimensional data. We group genes into gene sets and approximate the nonparametric components by truncated series expansions with B-spline bases. After grouping and approximation, the problem of variable selection becomes that of selecting groups of coecients in a gene set or in an approximation. We apply the group Lasso to obtain an initial solution path and reduce the dimension of the problem and then update the whole solution path with the adaptive group Lasso. We also propose a generalized group lasso method to provide more freedom in specifying the penalty and excluding covariates from being penalized.
A modied Newton-Raphson method is designed for stable and rapid computation. The core programs are written in the C language. An user-friendly R interface is implemented to perform all the calculations by calling the core programs.
We demonstrate the asymptotic properties of the proposed methods. Simulation studies are carried out to evaluate the finite sample performance of the proposed procedure using several tuning parameter selection methods for choosing the point on the solution path as the nal estimator. We also apply the proposed approach on two real data examples.
|
3 |
分類蛋白質質譜資料變數選取的探討 / On Variable Selection of Classifying Proteomic Spectra Data林婷婷 Unknown Date (has links)
本研究所利用的資料是來自美國東維吉尼亞醫學院所提供的攝護腺癌蛋白質質譜資料,其資料有原始資料和另一筆經過事前處理過的資料,而本研究是利用事前處理過的資料來作實証分析。由於此種資料通常都是屬於高維度資料,故變數間具有高度相關的現象也很常見,因此從大量的特徵變數中選取到重要的特徵變數來準確的判斷攝護腺的病變程度成為一個非常普遍且重要的課題。那麼本研究的目的是欲探討各(具有懲罰項)迴歸模型對於分類蛋白質質譜資料之變數選取結果,藉由LARS、Stagewise、LASSO、Group LASSO和Elastic Net各(具有懲罰項)迴歸模型將變數選入的先後順序當作其排序所產生的判別結果與利用「統計量排序」(t檢定、ANOVA F檢定以及Kruskal-Wallis檢定)以及SVM「分錯率排序」的判別結果相比較。而分析的結果顯示,Group LASSO對於六種兩兩分類的分錯率,其分錯率趨勢的表現都較其他方法穩定,並不會有大起大落的現象發生,且最小分錯率也幾乎較其他方法理想。此外Group LASSO在四分類的判別結果在與其他方法相較下也顯出此法可得出最低的分錯率,亦表示若須同時判別四種類別時,相較於其他方法之下Group LASSO的判別準確度最優。 / Our research uses the prostate proteomic spectra data which is offered by Eastern Virginia Medical School. The materials have raw data and preprocessed data. Our research uses the preprocessed data to do the analysis of real example. Because this kind of materials usually have high dimension, so it maybe has highly correlation between variables very common, therefore choose from a large number of characteristic variables to accurately determine the pathological change degree of the Prostate is become a very general and important subject. Then the purpose of our research wants to discuss every (penalized) regression model in variable selection results for classifying the proteomic spectra data. With LARS, Stagewise, LASSO, Group LASSO and Elastic Net, each variable is chosen successively by each (penalized) regression model, and it is regarded as each variable’s order then produce discrimination results. After that, we use their results to compare with using statistic order (t-test, ANOVA F-test and Kruskal-Wallis test) and SVM fault rate order. And the result of analyzing reveals Group LASSO to two by two of six kinds of rate by mistake that classify, the mistake rate behavior of trend is more stable than other ways, it doesn’t appear big rise or big fall phenomenon. Furthermore, this way’s mistake rate is almostly more ideal than other ways. Moreover, using Group LASSO to get the discrimination result of four classifications has the lowest mistake rate under comparing with other methods. In other words, when must distinguish four classifications in the same time, Group LASSO’s discrimination accuracy is optimum.
|
4 |
Regularized and robust regression methods for high dimensional dataHashem, Hussein Abdulahman January 2014 (has links)
Recently, variable selection in high-dimensional data has attracted much research interest. Classical stepwise subset selection methods are widely used in practice, but when the number of predictors is large these methods are difficult to implement. In these cases, modern regularization methods have become a popular choice as they perform variable selection and parameter estimation simultaneously. However, the estimation procedure becomes more difficult and challenging when the data suffer from outliers or when the assumption of normality is violated such as in the case of heavy-tailed errors. In these cases, quantile regression is the most appropriate method to use. In this thesis we combine these two classical approaches together to produce regularized quantile regression methods. Chapter 2 shows a comparative simulation study of regularized and robust regression methods when the response variable is continuous. In chapter 3, we develop a quantile regression model with a group lasso penalty for binary response data when the predictors have a grouped structure and when the data suffer from outliers. In chapter 4, we extend this method to the case of censored response variables. Numerical examples on simulated and real data are used to evaluate the performance of the proposed methods in comparisons with other existing methods.
|
5 |
Regularized methods for high-dimensional and bi-level variable selectionBreheny, Patrick John 01 July 2009 (has links)
Many traditional approaches cease to be useful when the number of variables is large in comparison with the sample size. Penalized regression methods have proved to be an attractive approach, both theoretically and empirically, for dealing with these problems. This thesis focuses on the development of penalized regression methods for high-dimensional variable selection. The first part of this thesis deals with problems in which the covariates possess a grouping structure that can be incorporated into the analysis to select important groups as well as important members of those groups. I introduce a framework for grouped penalization that encompasses the previously proposed group lasso and group bridge methods, sheds light on the behavior of grouped penalties, and motivates the proposal of a new method, group MCP.
The second part of this thesis develops fast algorithms for fitting models with complicated penalty functions such as grouped penalization methods. These algorithms combine the idea of local approximation of penalty functions with recent research into coordinate descent algorithms to produce highly efficient numerical methods for fitting models with complicated penalties. Importantly, I show these algorithms to be both stable and linear in the dimension of the feature space, allowing them to be efficiently scaled up to very large problems.
In the third part of this thesis, I extend the idea of false discovery rates to penalized regression. The Karush-Kuhn-Tucker conditions describing penalized regression estimates provide testable hypotheses involving partial residuals. I use these hypotheses to connect the previously disparate elds of multiple comparisons and penalized regression, develop estimators for the false discovery rates of methods such as the lasso and elastic net, and establish theoretical results.
Finally, the methods from all three sections are studied in a number of simulations and applied to real data from gene expression and genetic association studies.
|
6 |
Quelques questions de sélection de variables autour de l'estimateur LASSOHebiri, Mohamed 30 June 2009 (has links) (PDF)
Le problème général étudié dans cette thèse est celui de la régression linéaire en grande dimension. On s'intéresse particulièrement aux méthodes d'estimation qui capturent la sparsité du paramètre cible, même dans le cas où la dimension est supérieure au nombre d'observations. Une méthode populaire pour estimer le paramètre inconnu de la régression dans ce contexte est l'estimateur des moindres carrés pénalisés par la norme ℓ1 des coefficients, connu sous le nom de LASSO. Les contributions de la thèse portent sur l'étude de variantes de l'estimateur LASSO pour prendre en compte soit des informations supplémentaires sur les variables d'entrée, soit des modes semi-supervisés d'acquisition des données. Plus précisément, les questions abordées dans ce travail sont : i) l'estimation du paramètre inconnu lorsque l'espace des variables explicatives a une structure bien déterminée (présence de corrélations, structure d'ordre sur les variables ou regroupements entre variables) ; ii) la construction d'estimateurs adaptés au cadre transductif, pour lequel les nouvelles observations non étiquetées sont prises en considération. Ces adaptations sont en partie déduites par une modification de la pénalité dans la définition de l'estimateur LASSO. Les procédures introduites sont essentiellement analysées d'un point de vue non-asymptotique ; nous prouvons notamment que les estimateurs construits vérifient des Inégalités de Sparsité Oracles. Ces inégalités ont pour particularité de dépendre du nombre de composantes non-nulles du paramètre cible. Un contrôle sur la probabilité d'erreur d'estimation du support du paramètre de régression est également établi. Les performances pratiques des méthodes étudiées sont par ailleurs illustrées à travers des résultats de simulation.
|
7 |
Recovering Data with Group Sparsity by Alternating Direction MethodsDeng, Wei 06 September 2012 (has links)
Group sparsity reveals underlying sparsity patterns and contains rich structural information in data. Hence, exploiting group sparsity will facilitate more efficient techniques for recovering large and complicated data in applications such as compressive sensing, statistics, signal and image processing, machine learning and computer vision. This thesis develops efficient algorithms for solving a class of optimization problems with group sparse solutions, where arbitrary group configurations are allowed and the mixed L21-regularization is used to promote group sparsity. Such optimization problems can be quite challenging to solve due to the mixed-norm structure and possible grouping irregularities. We derive algorithms based on a variable splitting strategy and the alternating direction methodology. Extensive numerical results are presented to demonstrate the efficiency, stability and robustness of these algorithms, in comparison with the previously known state-of-the-art algorithms. We also extend the existing global convergence theory to allow more generality.
|
8 |
Statistical analysis of high dimensional dataRuan, Lingyan 05 November 2010 (has links)
This century is surely the century of data (Donoho, 2000). Data analysis has been an emerging activity over the last few decades. High dimensional data is in particular more and more pervasive with the advance of massive data collection system, such as microarrays, satellite imagery, and financial data. However, analysis of high dimensional data is of challenge with the so called curse of dimensionality (Bellman 1961). This research dissertation presents several methodologies in the application of high dimensional data analysis.
The first part discusses a joint analysis of multiple microarray gene expressions. Microarray analysis dates back to Golub et al. (1999). It draws much attention after that. One common goal of microarray analysis is to determine which genes are differentially expressed. These genes behave significantly differently between groups of individuals. However, in microarray analysis, there are thousands of genes but few arrays (samples, individuals) and thus relatively low reproducibility remains. It is natural to consider joint analyses that could combine microarrays from different experiments effectively in order to achieve improved accuracy. In particular, we present a model-based approach for better identification of differentially expressed genes by incorporating data from different studies. The model can accommodate in a seamless fashion a wide range of studies including those performed at different platforms, and/or under different but overlapping biological conditions. Model-based inferences can be done in an empirical Bayes fashion. Because of the information sharing among studies, the joint analysis dramatically improves inferences based on individual analysis. Simulation studies and real data examples are presented to demonstrate the effectiveness of the proposed approach under a variety of complications that often arise in practice.
The second part is about covariance matrix estimation in high dimensional data. First, we propose a penalised likelihood estimator for high dimensional t-distribution. The student t-distribution is of increasing interest in mathematical finance, education and many other applications. However, the application in t-distribution is limited by the difficulty in the parameter estimation of the covariance matrix for high dimensional data. We show that by imposing LASSO penalty on the Cholesky factors of the covariance matrix, EM algorithm can efficiently compute the estimator and it performs much better than other popular estimators.
Secondly, we propose an estimator for high dimensional Gaussian mixture models. Finite Gaussian mixture models are widely used in statistics thanks to its great flexibility. However, parameter estimation for Gaussian mixture models with high dimensionality can be rather challenging because of the huge number of parameters that need to be estimated. For such purposes, we propose a penalized likelihood estimator to specifically address such difficulties. The LASSO penalty we impose on the inverse covariance matrices encourages sparsity on its entries and therefore helps reducing the dimensionality of the problem. We show that the proposed estimator can be efficiently computed via an Expectation-Maximization algorithm. To illustrate the practical merits of the proposed method, we consider its application in model-based clustering and mixture discriminant analysis. Numerical experiments with both simulated and real data show that the new method is a valuable tool in handling high dimensional data.
Finally, we present structured estimators for high dimensional Gaussian mixture models. The graphical representation of every cluster in Gaussian mixture models may have the same or similar structure, which is an important feature in many applications, such as image processing, speech recognition and gene network analysis. Failure to consider the sharing structure would deteriorate the estimation accuracy. To address such issues, we propose two structured estimators, hierarchical Lasso estimator and group Lasso estimator. An EM algorithm can be applied to conveniently solve the estimation problem. We show that when clusters share similar structures, the proposed estimator perform much better than the separate Lasso estimator.
|
9 |
Temporal signals classification / Classification de signaux temporelsRida, Imad 03 February 2017 (has links)
De nos jours, il existe de nombreuses applications liées à la vision et à l’audition visant à reproduire par des machines les capacités humaines. Notre intérêt pour ce sujet vient du fait que ces problèmes sont principalement modélisés par la classification de signaux temporels. En fait, nous nous sommes intéressés à deux cas distincts, la reconnaissance de la démarche humaine et la reconnaissance de signaux audio, (notamment environnementaux et musicaux). Dans le cadre de la reconnaissance de la démarche, nous avons proposé une nouvelle méthode qui apprend et sélectionne automatiquement les parties dynamiques du corps humain. Ceci permet de résoudre le problème des variations intra-classe de façon dynamique; les méthodes à l’état de l’art se basant au contraire sur des connaissances a priori. Dans le cadre de la reconnaissance audio, aucune représentation de caractéristiques conventionnelle n’a montré sa capacité à s’attaquer indifféremment à des problèmes de reconnaissance d’environnement ou de musique : diverses caractéristiques ont été introduites pour résoudre chaque tâche spécifiquement. Nous proposons ici un cadre général qui effectue la classification des signaux audio grâce à un problème d’apprentissage de dictionnaire supervisé visant à minimiser et maximiser les variations intra-classe et inter-classe respectivement. / Nowadays, there are a lot of applications related to machine vision and hearing which tried to reproduce human capabilities on machines. These problems are mainly amenable to a temporal signals classification problem, due our interest to this subject. In fact, we were interested to two distinct problems, humain gait recognition and audio signal recognition including both environmental and music ones. In the former, we have proposed a novel method to automatically learn and select the dynamic human body-parts to tackle the problem intra-class variations contrary to state-of-art methods which relied on predefined knowledge. To achieve it a group fused lasso algorithm is applied to segment the human body into parts with coherent motion value across the subjects. In the latter, while no conventional feature representation showed its ability to tackle both environmental and music problems, we propose to model audio classification as a supervised dictionary learning problem. This is done by learning a dictionary per class and encouraging the dissimilarity between the dictionaries by penalizing their pair- wise similarities. In addition the coefficients of a signal representation over these dictionaries is sought as sparse as possible. The experimental evaluations provide performing and encouraging results.
|
10 |
Povýběrová Inference: Lasso & Skupinové Lasso / Post-selection Inference: Lasso & Group LassoBouř, Vojtěch January 2017 (has links)
The lasso is a popular tool that can be used for variable selection and esti- mation, however, classical statistical inference cannot be applied for its estimates. In this thesis the classical and the group lasso is described together with effici- ent algorithms for the solution. The key part is dedicated to the post-selection inference for the lasso estimates where we explain why the classical inference is not suitable. Three post-selection tests for the lasso are described and one test is proposed also for the group lasso. The tests are compared in simulations where finite sample properties are examined. The tests are further applied on a practical example. 1
|
Page generated in 0.0942 seconds