  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Marginal false discovery rate approaches to inference on penalized regression models

Miller, Ryan 01 August 2018
Data containing a large number of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets due to their ability to naturally perform variable selection. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency for the most predictive models, for example those chosen using procedures like cross-validation, to include substantial numbers of noise variables with no real relationship with the outcome. To address the task of performing inference on penalized regression models, this thesis proposes false discovery rate approaches for a broad class of penalized regression models. This work includes the development of an upper bound for the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; to a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods in numerous simulation studies, the thesis demonstrates their practical utility using real data from several high-dimensional genome-wide association studies.
12

New Results in ℓ1 Penalized Regression

Roualdes, Edward A. 01 January 2015
Here we consider penalized regression methods and extend the results surrounding the ℓ1-norm penalty. We address a more recent development that generalizes previous methods by penalizing a linear transformation of the coefficients of interest instead of penalizing just the coefficients themselves. We introduce an approximate algorithm to fit this generalization and a fully Bayesian hierarchical model that is a direct analogue of the frequentist version. A number of benefits are derived from the Bayesian perspective, most notably the choice of the tuning parameter and a natural means to estimate the variation of estimates, a notoriously difficult task for the frequentist formulation. We then introduce Bayesian trend filtering, which exemplifies the benefits of our Bayesian version. Bayesian trend filtering is shown to be an empirically strong technique for fitting univariate, nonparametric regression. Through a simulation study, we show that Bayesian trend filtering reduces prediction error and attains more accurate coverage probabilities than the frequentist method. We then apply Bayesian trend filtering to real data sets, where our method is quite competitive against a number of other popular nonparametric methods.
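For orientation, the generalization described in this abstract is commonly written as the generalized lasso. The abstract itself gives no formulas, so the notation below is an assumption: y is the response, X the design matrix, β the coefficient vector, D the penalty matrix, and λ the tuning parameter.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda\,\lVert D\beta \rVert_1
```

Taking D = I recovers the ordinary lasso; in trend filtering, X = I and D is a discrete difference operator, so the penalty acts on differences of neighbouring fitted values rather than on the values themselves.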
13

New approaches to identify gene-by-gene interactions in genome wide association studies

Lu, Chen 22 January 2016
Genetic variants identified to date by genome-wide association studies only explain a small fraction of total heritability. Gene-by-gene interaction is one important potential source of unexplained heritability. In the first part of this dissertation, a novel approach to detect such interactions is proposed. This approach utilizes penalized regression and sparse estimation principles, and incorporates outside biological knowledge through a network-based penalty. The method is tested on simulated data under various scenarios. Simulations show that with reasonable outside biological knowledge, the new method performs noticeably better than current stage-wise strategies in finding true interactions, especially when the marginal strength of main effects is weak. The proposed method is designed for single-cohort analyses. However, it is generally acknowledged that only multi-cohort analyses have sufficient power to uncover genes and gene-by-gene interactions with moderate effects on traits, such as those likely to underlie complex diseases. Multi-cohort, meta-analysis approaches for penalized regressions are developed and investigated in the second part of this dissertation. Specifically, I propose two different ways of utilizing data-splitting principles in multi-cohort settings and develop three procedures to conduct meta-analysis. Using the method developed in the first part of this dissertation as an example of penalized regression, the three proposed meta-analysis procedures are compared to mega-analysis in a simulation study. The results suggest that the best approach is to split the participating cohorts into two groups, to perform variable selection for each cohort in the first group, to fit a regular regression model on the union of selected variables for each cohort in the second group, and lastly to conduct a meta-analysis across cohorts in the second group.
In the last part of this dissertation, the novel method developed in the first part is applied to the Framingham Heart Study measures on total plasma Immunoglobulin E (IgE) concentrations, C-reactive protein levels, and Fasting Glucose. The effect of incorporating various sources of biological information on the ability to detect gene-gene interaction is explored. For IgE, for example, a number of potentially interesting interactions are identified. Some of these interactions involve pairs in human leukocyte antigen genes, which encode proteins that are the key regulators of the immune response. The remaining interactions are among genes previously found to be associated with IgE as main effects. Identification of these interactions may provide new insights into the genetic basis and mechanisms of atopic diseases.
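The data-splitting procedure that this abstract reports as best can be sketched as follows. This is only an illustration under stated assumptions: a plain lasso stands in for the dissertation's network-penalized interaction method, the meta-analysis step is taken to be fixed-effect inverse-variance weighting, and all function names (`lasso_cd`, `ols_with_se`, `split_meta`) and the simulated cohorts are hypothetical.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain lasso by cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1.
    Stand-in selector; the dissertation uses a network-based penalty instead."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()                   # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]                  # remove feature j from the fit
            z = X[:, j] @ r / n
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]  # soft-threshold
            r -= X[:, j] * b[j]                  # put the updated feature back
    return b

def ols_with_se(X, y):
    """OLS coefficients and standard errors (no intercept; data assumed centered)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (n - k)
    return b, np.sqrt(sigma2 * np.diag(xtx_inv))

def split_meta(cohorts, lam=0.2):
    """cohorts: list of (X, y). The first half selects variables, the second half estimates."""
    half = len(cohorts) // 2
    select_group, fit_group = cohorts[:half], cohorts[half:]
    selected = set()
    for X, y in select_group:                    # step 1: selection per cohort
        selected |= set(np.flatnonzero(lasso_cd(X, y, lam)).tolist())
    cols = sorted(selected)                      # step 2: union of selections
    if not cols:
        return cols, np.array([]), np.array([])
    # step 3: ordinary regression on the union, separately in each second-group cohort
    est, se = zip(*(ols_with_se(X[:, cols], y) for X, y in fit_group))
    w = 1.0 / np.array(se) ** 2                  # step 4: inverse-variance weights (assumed)
    meta = (w * np.array(est)).sum(axis=0) / w.sum(axis=0)
    return cols, meta, np.sqrt(1.0 / w.sum(axis=0))

# Hypothetical simulated cohorts: only variable 0 has a real effect (size 2.0).
rng = np.random.default_rng(2)

def make_cohort(n=200, p=10):
    X = rng.standard_normal((n, p))
    y = 2.0 * X[:, 0] + rng.standard_normal(n)
    return X, y

cohorts = [make_cohort() for _ in range(4)]
cols, meta, meta_se = split_meta(cohorts)
print(cols, meta, meta_se)
```

Keeping the selection cohorts separate from the estimation cohorts means the second-stage standard errors are not computed on the same data that chose the variables, which is the usual motivation for data splitting.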
14

Sequential Change-point Detection in Linear Regression and Linear Quantile Regression Models Under High Dimensionality

Ratnasingam, Suthakaran 06 August 2020
No description available.
15

Robust Approaches for Matrix-Valued Parameters

Jing, Naimin January 2021
Modern large data sets inevitably contain outliers that deviate from the model assumptions. However, many widely used estimators, such as maximum likelihood estimators and least squares estimators, perform poorly in the presence of outliers. At the same time, many statistical modeling approaches have matrices as parameters. We consider penalized estimators for matrix-valued parameters with a focus on their robustness properties in the presence of outliers. We propose a general framework for robust modeling with matrix-valued parameters by minimizing robust loss functions with penalization. However, there are challenges to this approach in both computation and theoretical analysis. To tackle the computational challenges arising from the large size of the data, the non-smoothness of robust loss functions, and the slow speed of matrix operations, we propose to apply the Frank-Wolfe algorithm, a first-order algorithm for optimization on a restricted region with a low computational burden per iteration. Theoretically, we establish finite-sample error bounds under high-dimensional settings. We show that the estimation errors are bounded by small terms and converge in probability to zero under mild conditions in a neighborhood of the true model. Our method accommodates a broad class of modeling problems using robust loss functions with penalization. Concretely, we study three cases: matrix completion, multivariate regression, and network estimation. For all cases, we illustrate the robustness of the proposed method both theoretically and numerically. / Statistics
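To make the Frank-Wolfe idea concrete, here is a minimal sketch for one of the three cases named above, matrix completion. The specifics are assumptions rather than the thesis's method: the robust loss is taken to be the Huber loss, the restricted region is a nuclear-norm ball of radius `tau`, the step size is the textbook 2/(k+2), and the function names are hypothetical.

```python
import numpy as np

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss with respect to the residual r (bounded by delta)."""
    return np.clip(r, -delta, delta)

def frank_wolfe_completion(Y, mask, tau, n_iter=200, delta=1.0):
    """Minimize the summed Huber loss on observed entries over {M : ||M||_* <= tau}."""
    M = np.zeros_like(Y, dtype=float)
    for k in range(n_iter):
        G = mask * huber_grad(M - Y, delta)          # gradient of the loss at M
        # Linear minimization oracle on the nuclear-norm ball: -tau * u1 v1^T,
        # where (u1, v1) is the top singular pair of the gradient.
        # A full SVD is used here for brevity; only the top pair is needed.
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        S = -tau * np.outer(U[:, 0], Vt[0])
        gamma = 2.0 / (k + 2.0)                      # standard diminishing step size
        M = (1.0 - gamma) * M + gamma * S            # convex combination stays in the ball
    return M

# Hypothetical example: rank-1 truth, 60% of entries observed, one gross outlier.
rng = np.random.default_rng(1)
u, v = rng.standard_normal(20), rng.standard_normal(15)
truth = np.outer(u, v)
mask = (rng.random(truth.shape) < 0.6).astype(float)
Y = truth + 0.1 * rng.standard_normal(truth.shape)
Y[0, 0] += 50.0                                      # outlier; its gradient is capped at delta
tau = np.linalg.norm(truth, "nuc")                   # radius set from the truth for illustration only
M_hat = frank_wolfe_completion(Y, mask, tau)
print(np.linalg.norm(M_hat, "nuc"), tau)
```

Each iteration needs only a gradient and a leading singular pair, with no projection onto the constraint set, which is what the abstract means by a low computational burden per iteration.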
16

Assessment of Penalized Regression for Genome-wide Association Studies

Yi, Hui 27 August 2014
The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single marker association methods. As an alternative to Single Marker Analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of Penalized Regression (PR) as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by False Discovery Rate (FDR) control, and assess their performance (including penalties incorporating linkage disequilibrium) in comparison with SMA. PR methods were compared with SMA on real data and on realistically simulated GWAS data consisting of genotype data from single and multiple chromosomes and a continuous phenotype. Based on our comparisons, our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini-Hochberg FDR control (SMA-BH). PR controlled the FDR conservatively, while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on variable selection with FDR control. Incorporating LD into PR by adapting penalties developed for covariates measured on graphs can improve power but can also generate more false positives or wider regions for follow-up. We recommend using the Elastic Net with a mixing weight for the Lasso penalty near 0.5 as the best method. / Ph. D.
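As a reminder of what a mixing weight of 0.5 means, the sketch below evaluates the elastic net penalty for a coefficient vector. The glmnet-style parameterization, in which `alpha` weights the lasso term and `1 - alpha` the ridge term, is an assumption; the abstract does not state which parameterization the thesis uses, and `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(beta, lam, alpha=0.5):
    """Elastic net penalty, glmnet-style (assumed):
    lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq)

beta = [1.0, -2.0, 0.0]
print(elastic_net_penalty(beta, lam=1.0, alpha=1.0))  # pure lasso: 3.0
print(elastic_net_penalty(beta, lam=1.0, alpha=0.0))  # pure ridge: 2.5
print(elastic_net_penalty(beta, lam=1.0, alpha=0.5))  # equal mix: 2.75
```

With `alpha` near 0.5 the penalty keeps the lasso's ability to set coefficients exactly to zero while the ridge term stabilizes selection among correlated SNPs.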
17

Développement d'outils statistiques pour l'analyse de données transcriptomiques par les réseaux de co-expression de gènes / A systemic approach to statistical analysis to transcriptomic data through co-expression network analysis

Brunet, Anne-Claire 17 June 2016
New biotechnologies now make it possible to collect a great variety and volume of biological data (genomic, proteomic, metagenomic...), opening up new avenues of research into biological processes. In this thesis, we focus specifically on transcriptomic data, which characterize the activity or expression level of several tens of thousands of genes in a given cell. The aim was to propose statistical tools suited to analysing such high-dimensional data (n<<p), which are collected on samples that are very small relative to the very large number of variables (here, gene expression levels).
The first part of the thesis describes supervised learning methods, such as Breiman's random forests and penalized regression models, used in the high-dimensional setting to select the genes (expression variables) most relevant to the disease under study. We discuss the limits of these methods for selecting genes that are relevant not only statistically but also biologically, particularly when genes are selected within groups of highly correlated variables, that is, groups of co-expressed genes. Common supervised learning methods assume that each gene can act in isolation in the model, which is unrealistic in practice. An observable biological trait results from a set of reactions within a complex system in which genes interact with one another, and genes involved in the same biological function tend to be co-expressed (correlated expression).
In the second part, we therefore turn to gene co-expression networks, in which two genes are linked if they are co-expressed. More precisely, we aim to identify communities of co-expressed genes on these networks, and then to select the communities most relevant to the disease as well as the "key genes" of these communities. This facilitates biological interpretation, because a biological function can often be associated with a community of genes. We propose an original and efficient approach that simultaneously addresses the modelling of the gene co-expression network and the detection of gene communities on the network. We demonstrate the performance of our approach by comparing it with popular existing methods for analysing gene co-expression networks (WGCNA and spectral methods). Finally, through the analysis of a real data set, the last part of the thesis shows that the proposed approach yields biologically convincing results that are easier to interpret and more robust than those obtained with common supervised learning methods.
18

Testing new genetic and genomic approaches for trait mapping and prediction in wheat (Triticum aestivum) and rice (Oryza spp)

Ladejobi, Olufunmilayo Olubukola January 2018
Advances in molecular marker technologies have led to the development of high-throughput genotyping techniques such as Genotyping by Sequencing (GBS), driving the application of genomics in crop research and breeding. They have also supported the use of novel mapping approaches, including Multi-parent Advanced Generation Inter-Cross (MAGIC) populations, which have increased precision in identifying markers to inform plant breeding practices. In the first part of this thesis, a high-density physical map derived from GBS was used to identify QTLs controlling key agronomic traits of wheat in a genome-wide association study (GWAS) and to demonstrate the practicability of genomic selection for predicting the trait values. The results from GBS were compared to a previous study conducted on the same association mapping panel using a less dense physical map derived from diversity arrays technology (DArT) markers. GBS detected more QTLs than DArT markers, although some of the QTLs were detected by DArT markers alone. Prediction accuracies from the two marker platforms were mostly similar and largely dependent on trait genetic architecture. The second part of this thesis focused on MAGIC populations, which incorporate diversity and novel allelic combinations from several generations of recombination. Pedigrees representing a wild rice MAGIC population were used to model MAGIC populations by simulation to assess the level of recombination and the creation of novel haplotypes. The wild rice species are an important reservoir of beneficial genes that have been variously introgressed into rice varieties using bi-parental population approaches. The level of recombination was found to be highly dependent on the number of crosses made and on the resulting population size. Creation of MAGIC populations requires adequate planning in order to make a sufficient number of crosses that capture optimal haplotype diversity.
The third part of the thesis considers models that have been proposed for genomic prediction. The ridge regression best linear unbiased prediction (RR-BLUP) is based on the assumption that all genotyped molecular markers make equal contributions to the variation of a phenotype. Information from underlying candidate molecular markers is, however, of greater significance and can be used to improve the accuracy of prediction. Here, an existing Differentially Penalized Regression (DiPR) model was used, which applies modifications to a standard RR-BLUP package and allows two or more marker sets from different platforms to be independently weighted. The DiPR model performed better than single or combined marker sets for predicting most of the traits, both in a MAGIC population and in an association mapping panel. Overall, the work presented in this thesis shows that while these techniques have great promise, they should be carefully evaluated before introduction into breeding programmes.
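The idea of weighting marker sets independently can be illustrated with a ridge regression that carries a separate penalty per block of markers. This is a minimal sketch, not the DiPR implementation: the closed-form solve, the function name `block_ridge`, the two marker sets, and the penalty values are all assumptions chosen for illustration.

```python
import numpy as np

def block_ridge(X, y, blocks, lambdas):
    """Ridge solution with one penalty per marker block.
    blocks: list of column-index arrays; lambdas: one penalty per block."""
    penalty = np.zeros(X.shape[1])
    for cols, lam in zip(blocks, lambdas):
        penalty[cols] = lam
    # Solve (X'X + diag(penalty)) b = X'y
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] + 0.5 * rng.standard_normal(n)

set_a = np.arange(0, 3)   # hypothetical candidate-marker set
set_b = np.arange(3, 6)   # hypothetical background marker set

equal = block_ridge(X, y, [set_a, set_b], [10.0, 10.0])      # one shared penalty
weighted = block_ridge(X, y, [set_a, set_b], [1.0, 100.0])   # candidates penalized less
print(equal.round(3))
print(weighted.round(3))
```

With equal penalties this reduces to ordinary ridge regression, the equal-contribution assumption the abstract attributes to RR-BLUP; lowering the penalty on one set lets those markers carry more of the prediction.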
19

Prediction with Penalized Logistic Regression : An Application on COVID-19 Patient Gender based on Case Series Data

Schwarz, Patrick January 2021
The aim of the study was to evaluate different types of logistic regression to find the optimal model for predicting the gender of hospitalized COVID-19 patients. The models were based on COVID-19 case series data from Pakistan and used a set of 18 explanatory variables, of which patient age and BMI were numerical and the rest were categorical, expressing symptoms and previous health issues. The models compared were a logistic regression using all variables, a logistic regression using stepwise variable selection with 4 explanatory variables, a logistic ridge regression model, a logistic lasso regression model, and a logistic elastic net regression model. Based on several goodness-of-fit metrics and on predictive power evaluated by the area under the ROC curve, the elastic net that used only the lasso penalty gave the best result and predicted 82.5% of the test cases correctly.
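The two evaluation measures mentioned here, area under the ROC curve and the share of test cases predicted correctly, can be computed from predicted probabilities as sketched below. The scores and labels are made-up toy values, not the study's data, and the 0.5 classification threshold is an assumption.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive case is scored above
    a random negative case (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(scores, labels, threshold=0.5):
    """Share of cases classified correctly at the given probability threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy predicted probabilities from some fitted classifier (hypothetical values).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(scores, labels))    # 8 of 9 positive-negative pairs ranked correctly
print(accuracy(scores, labels))   # 4 of 6 cases classified correctly
```

AUC depends only on how the model ranks cases, whereas accuracy also depends on the chosen threshold, which is why studies like this one typically report both.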
20

Identification de biomarqueurs prédictifs de la survie et de l'effet du traitement dans un contexte de données de grande dimension / Identification of biomarkers predicting the outcome and the treatment effect in presence of high-dimensional data

Ternes, Nils 05 October 2016
With the recent revolution in genomics and in stratified medicine, the development of molecular signatures is becoming increasingly important for predicting the prognosis (prognostic biomarkers) and the treatment effect (predictive biomarkers) of each patient. However, the large quantity of available information has made false-positive discoveries increasingly frequent in biomedical research. High-dimensional data (i.e. number of biomarkers ≫ sample size) raise several statistical challenges, such as the non-identifiability of models, the instability of the selected biomarkers, and multiple testing.
The aim of this thesis was to propose and evaluate statistical methods for identifying these biomarkers and for producing individual predictions of survival probabilities for new patients, in the context of the Cox regression model. For variable selection in a high-dimensional setting, the lasso penalty is commonly used. In the prognostic setting, an empirical extension of the lasso penalty is proposed that is more stringent in the choice of the tuning parameter λ in order to select fewer false positives. In the predictive setting, the focus is on biomarker-by-treatment interactions in a randomized clinical trial. Twelve approaches for identifying these interactions were evaluated, such as the lasso (standard, adaptive, grouped, or ridge+lasso), boosting, dimension reduction of the main effects, and a model incorporating arm-specific prognostic effects.
Finally, starting from a penalized prediction model, several strategies were evaluated to obtain an individual survival prediction for a new patient with a corresponding confidence interval, while limiting potential overfitting. The performance of the approaches was evaluated through simulation studies covering null and alternative scenarios. The methods were also illustrated on several data sets containing gene expression data in breast cancer.
