1 |
Bio-statistical approaches to evaluate the link between specific nutrients and methylation patterns in a breast cancer case-control study nested within the European Prospective Investigation into Cancer and Nutrition (EPIC) study / Approches bio-statistiques pour évaluer le lien entre nutriments et profils de méthylation du cancer du sein dans l’étude prospective Européenne sur le Cancer et la Nutrition (EPIC)
Perrier, Flavie 13 September 2018
Epigenetic data sets are challenging because they are characterized by hundreds of thousands of features. The main objective of this thesis was to evaluate the performance of some existing statistical methods for handling high-dimensional data sets, by exploring the association between dietary factors related to breast cancer (BC) and DNA methylation within the EPIC study.
To investigate the characteristics of epigenetic data, random and systematic sources of variability in methylation measurements were identified with the principal component partial R-square (PC-PR2) method. Using this technique, the performance of three popular normalization techniques for correcting unwanted sources of variability was evaluated by quantifying the epigenetic variability attributed to laboratory factors before and after each correction method was applied.
Once a suitable normalization procedure was identified, the association between alcohol intake, dietary folate and methylation levels was examined with three approaches: an analysis of individual CpG sites, an analysis of differentially methylated regions (DMRs), and fused lasso regression. The last two methods aim to identify specific regions of the epigenome by exploiting the potential correlation between neighboring CpG sites. Global methylation levels were also used to investigate the relationship between methylation and BC risk.
By exhaustively evaluating statistical tools that uncover the complexity of DNA methylation data, this thesis provides informative insights for epigenetic studies, with promising potential for applying similar methodology to the analysis of other -omics data.
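As a rough illustration of the fused lasso regression named in the abstract above, the standard textbook form of the estimator can be written as follows. This is a sketch under assumptions: the symbols $y_i$, $x_i$, $\beta$, $\lambda_1$ and $\lambda_2$ are introduced here for illustration, the predictors are assumed to be ordered so that adjacent indices correspond to neighboring CpG sites, and the abstract does not state the exact formulation used in the thesis.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^{2}
  \;+\; \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
  \;+\; \lambda_2 \sum_{j=2}^{p} \lvert \beta_j - \beta_{j-1} \rvert
```

The first penalty encourages sparse coefficients, while the second shrinks differences between adjacent coefficients, which is what lets the method favor contiguous regions over isolated sites.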
|
2 |
The Linkage Disequilibrium LASSO for SNP Selection in Genetic Association Studies
Younkin, Samuel G. January 2011
No description available.
|
3 |
Semiparametric and Nonparametric Methods for Complex Data
Kim, Byung-Jun 26 June 2020
Over the past few decades, advances in science, technology, and study design have made a wide variety of complex data available in research fields such as epidemiology, genomics, and analytical chemistry. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary disease outcomes and a covariate measured with error within a certain period, by stratifying on subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are needed to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes. Because of this diversity, this dissertation addresses three problems in analyzing such complex data and contributes semiparametric and nonparametric methods for each: the first is to propose a method for testing the significance of a functional association under the matched study design; the second is to develop a method that simultaneously identifies important variables and builds a network in HCHD data; the third is to propose a multi-class dynamic model for recognizing a pattern in time-trend analysis.
For the first topic, we propose a semiparametric omnibus test for the significance of a functional association between clustered binary outcomes and covariates with measurement error, taking into account effect modification by the matching covariates. The omnibus test is flexible in that it does not require a specific form for the alternative hypothesis. Its advantages are demonstrated through simulation studies and 1-4 bidirectional matched data analyses from an epidemiology study.
For the second topic, we propose a joint semiparametric kernel machine network approach that connects variable selection and network estimation. Our approach is a unified and integrated method that can simultaneously identify important variables and build a network among them. We develop it under a semiparametric kernel machine regression framework, which allows each variable to have a nonlinear effect and the variables to interact with one another in complicated ways. We demonstrate our approach using simulation studies and a real application to genetic pathway analysis.
Lastly, for the third topic, we propose a Bayesian focal-area detection method for a multi-class dynamic model under a Bayesian hierarchical framework. Two-step Bayesian sequential procedures are developed to estimate patterns and detect focal intervals, which can be used for gas chromatography. We demonstrate the performance of the proposed method using a simulation study and a real application to gas chromatography on the Fast Odor Chromatographic Sniffer (FOX) system.
/ Doctor of Philosophy /
Over the past few decades, advances in science, technology, and study design have made a wide variety of complex data available in research fields such as epidemiology, genomics, and analytical chemistry. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary disease outcomes and a covariate measured with error within a certain period, by stratifying on subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are needed to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes. Because of this diversity, we encounter three problems in analyzing the following three types of data: (1) matched case-crossover data, (2) HCHD data, and (3) time-series data. We contribute to the development of statistical methods for dealing with such complex data.
First, under the matched study design, we discuss a hypothesis-testing approach for determining the association between observed factors and the risk of the disease of interest. Because the specific form of the association is unknown in practice, it can be challenging to set a specific alternative hypothesis. To reflect reality, we also consider the possibility that some observations are measured with error. Taking these measurement errors into account, we develop a testing procedure under the matched case-crossover framework. This testing procedure is flexible enough to make inferences under various hypothesis settings.
Second, we consider data in which the number of variables is very large compared to the sample size and the variables are correlated with one another. In this case, our goal is to identify the variables that are important for the outcome among a large number of variables and to build their network. For example, identifying a few genes, out of the whole genome, that are associated with diabetes can help develop biomarkers. With the approach proposed in the second project, we can identify differentially expressed and important genes and their network structure while taking the outcome into account.
Lastly, we consider the scenario in which patterns of interest change over time, with application to gas chromatography. We propose an efficient detection method to distinguish the patterns of multi-level subjects in time-trend analysis. The proposed method provides valuable information for efficiently searching for distinguishable patterns, reducing the burden of examining all observations in the data.
|
4 |
Contributions to Structured Variable Selection Towards Enhancing Model Interpretation and Computation Efficiency
Shen, Sumin 07 February 2020
Advances in data-collecting technologies provide great opportunities to access large-sample data sets with high dimensionality. Variable selection is an important procedure for extracting useful knowledge from such complex data. In many real-data applications, appropriate selection of variables should also facilitate model interpretation and computational efficiency. It is thus important to incorporate domain knowledge of the underlying data generation mechanism when selecting key variables to improve model performance. However, general variable selection techniques, such as best subset selection and the Lasso, often do not take the underlying data generation mechanism into consideration. This thesis proposal aims to develop statistical modeling methodologies with a focus on structured variable selection towards better model interpretation and computational efficiency. Specifically, this thesis proposal consists of three parts: an additive heredity model with coefficients incorporating multi-level data, a regularized dynamic generalized linear model with piecewise constant functional coefficients, and a structured variable selection method within the best subset selection framework.
In Chapter 2, an additive heredity model is proposed for analyzing mixture-of-mixtures (MoM) experiments. The MoM experiment differs from the classical mixture experiment in that the mixture component in MoM experiments, known as the major component, is made up of sub-components, known as the minor components. The proposed model uses an additive structure to inherently connect the major components with the minor components. To enable a meaningful interpretation of the estimated model, we apply the hierarchical and heredity principles by using the nonnegative garrote technique for model selection. The performance of the additive heredity model was compared to several conventional methods in both unconstrained and constrained MoM experiments. The additive heredity model was then successfully applied to a real problem, studied previously in the literature, of optimizing the Pringles® potato crisp.
In Chapter 3, we consider the dynamic effects of variables in generalized linear models such as logistic regression. This work is motivated by an engineering problem in which the effects of process variables on product quality vary because of equipment degradation. To address this challenge, we propose a penalized dynamic regression model that can flexibly estimate the dynamic coefficient structure. The proposed method models the functional coefficients as piecewise constant functions. Specifically, under the penalized regression framework, the fused lasso penalty is adopted to detect changes in the dynamic coefficients, and the group lasso penalty is applied to enable sparse selection of variables. Moreover, an efficient parameter estimation algorithm is developed based on the alternating direction method of multipliers (ADMM). The performance of the dynamic coefficient model is evaluated in numerical studies and three real-data examples.
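To make the combination of penalties described in Chapter 3 more concrete, the sketch below evaluates a fused lasso penalty (across time) plus a group lasso penalty (per variable) on a small coefficient table. It is only an illustration under assumptions: the function name `fused_group_penalty`, the layout `beta[t][j]`, and the tuning parameters `lam_fused` and `lam_group` are hypothetical and do not come from the thesis, and the code only evaluates the penalty rather than fitting the model or running ADMM.

```python
import math


def fused_group_penalty(beta, lam_fused, lam_group):
    """Evaluate an illustrative fused + group lasso penalty.

    beta[t][j] is the (hypothetical) coefficient of variable j at time t.
    """
    n_times = len(beta)
    n_vars = len(beta[0])

    # Fused lasso term: absolute change of each coefficient between
    # consecutive time points; zero when a coefficient is constant.
    fused = sum(
        abs(beta[t][j] - beta[t - 1][j])
        for t in range(1, n_times)
        for j in range(n_vars)
    )

    # Group lasso term: Euclidean norm of each variable's whole
    # coefficient path; zero only when the variable is dropped entirely.
    group = sum(
        math.sqrt(sum(beta[t][j] ** 2 for t in range(n_times)))
        for j in range(n_vars)
    )

    return lam_fused * fused + lam_group * group


# Variable 0 is never selected; variable 1 changes level once.
example_beta = [[0.0, 1.0], [0.0, 1.0], [0.0, 2.0]]
penalty = fused_group_penalty(example_beta, lam_fused=1.0, lam_group=1.0)
```

In this toy table the fused term only charges the single jump in variable 1, which is how such a penalty favors piecewise constant coefficient paths, while the group term contributes nothing for the variable whose path is identically zero.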
In Chapter 4, we develop a structured variable selection method within the best subset selection framework. In the literature, many techniques within the LASSO framework have been developed to address structured variable selection issues. However, less attention has been paid to structured best subset selection problems. In this work, we propose a sparse ridge regression method to address structured variable selection issues. The key idea of the proposed method is to reconstruct the regression matrix from the perspective of experimental design. We employ the estimation-maximization algorithm to formulate the best subset selection problem as an iterative linear integer optimization (LIO) problem, with the mixed integer optimization algorithm as the selection step. We demonstrate the power of the proposed method in various structured variable selection problems. Moreover, the proposed method can be extended to ridge-penalized best subset selection problems. The performance of the proposed method is evaluated in numerical studies.
/ Doctor of Philosophy /
Advances in data-collecting technologies provide great opportunities to access large-sample data sets with high dimensionality. Variable selection is an important procedure for extracting useful knowledge from such complex data. In many real-data applications, appropriate selection of variables should also facilitate model interpretation and computational efficiency. It is thus important to incorporate domain knowledge of the underlying data generation mechanism when selecting key variables to improve model performance.
However, general variable selection techniques often do not take the underlying data generation mechanism into consideration. This thesis proposal aims to develop statistical modeling methodologies with a focus on structured variable selection towards better model interpretation and computational efficiency. The proposed approaches have been applied to real-world problems to demonstrate their performance.
|