11 |
MonsterLM: A method to estimate the variance explained by genome-wide interactions with environmental factors / Khan, Mohammad / January 2020
Estimates of heritability and of the variance explained by environmental exposures and interaction effects help in understanding complex diseases. Current methods to detect such interactions rely on variance component methods. These methods have been necessary because of the m ≫ n problem, where the number of predictors (m) vastly outnumbers the number of observations (n). They are all computationally intensive, a cost that is further exacerbated when considering gene-environment interactions, as the number of predictors increases from m to 2m+1 in the case of a single environmental exposure. Novel methods are thus needed to enable fast and unbiased calculations of the variance explained (R²) for gene-environment interactions in very large samples on multiple traits. Taking advantage of the large number of participants in contemporary genetic studies, we herein propose a novel method, monsterlm, that produces R² estimates for continuous traits up to 20 times faster than current methods. It enables multiple linear regression on large regions encompassing tens of thousands of variants in hundreds of thousands of participants. We tested monsterlm in simulations using real genotypes from the UK Biobank and verified its ability to estimate the variance explained by interaction terms. Our preliminary results showcase potential interactions between blood biochemistry biomarkers such as HbA1c, triglycerides and ApoB and an obesity-related lifestyle factor, waist-hip ratio (WHR). We further investigate these results to reveal that more than 50% of the interaction variance calculated can be attributed to ∼5% of the single-nucleotide polymorphisms (SNPs) interacting with the environmental trait. Lastly, we showcase the impact of interactions on improving polygenic risk scores. / Thesis / Master of Science (MSc)
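The growth from m to 2m+1 predictors can be made concrete by writing out the design matrix for a single exposure: m genotype columns, one exposure column, and m genotype-by-exposure products. The sketch below is illustrative only and is not the monsterlm implementation; the function names `gxe_design` and `adjusted_r2` are hypothetical, and the adjusted-R² correction is a standard textbook formula assumed here because the abstract does not state which estimator the thesis uses.

```python
from typing import List


def gxe_design(genotypes: List[List[float]], exposure: List[float]) -> List[List[float]]:
    """Build rows [g_1..g_m, e, g_1*e..g_m*e] for each participant.

    genotypes: n rows of m genotype values (hypothetical toy input).
    exposure:  n values of a single environmental exposure.
    """
    if len(genotypes) != len(exposure):
        raise ValueError("genotypes and exposure must have the same number of rows")
    rows = []
    for g_row, e in zip(genotypes, exposure):
        # m main effects + 1 exposure + m interaction terms = 2m + 1 columns
        rows.append(list(g_row) + [e] + [g * e for g in g_row])
    return rows


def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Standard adjusted R^2 for n observations and p predictors (assumed estimator)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for the adjustment to be defined")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


if __name__ == "__main__":
    X = gxe_design([[0, 1, 2], [2, 1, 0]], [0.5, -1.0])
    print(len(X[0]))  # number of columns for m = 3
```

With m = 3 genotype columns each row has 2·3 + 1 = 7 entries, which is why the cost of any regression-based approach roughly doubles in the number of predictors once a single exposure is added.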
|
12 |
Development of highly recombinant inbred populations for quantitative-trait locus mapping / Boddhireddy, Prashanth / January 1900
Doctor of Philosophy / Genetics Interdepartmental Program-Plant Pathology / James Nelson / The goal of quantitative-trait locus (QTL) mapping is to understand the genetic architecture of an organism by identifying the genes underlying quantitative traits. It targets gene numbers and locations, interaction with other genes and environments, and the sizes of gene effects on the traits. QTL mapping in plants is often done on a population of progeny derived from one or more designed, or controlled, crosses. These crosses are designed to exploit correlation among marker genotypes for the purposes of mapping QTL. Reducing correlations between markers can improve the precision of location and effect estimates by reducing multicollinearity. The purpose of this thesis is to propose an approach for developing experimental populations to reduce correlation by increasing recombination between markers in QTL mapping populations especially in selfing species.
QTL mapping resolution of recombinant inbred lines (RILs) is limited by the amount of recombination RILs experience during development. Intercrossing during line development can be used to counter this disadvantage, but requires additional generations and is difficult in self-pollinated species. In this thesis I propose a way of improving mapping resolution through recombination enrichment. This method is based on genotyping at each generation and advancing lines selected for high recombination and/or low heterozygosity. The lines developed in this way are called SA-RILs (selectively advanced recombinant inbred lines). In simulations, the method yields lines that represent up to twice as many recombination events as RILs developed conventionally by selfing without selection, or the same amount but in three generations, without reduction in homozygosity. Compared to methods that require maintaining a large population for several generations and selecting lines only from the finished population, the method proposed here achieves up to 25% more recombination.
Although SA-RILs accumulate more recombination than conventional RILs and can be used as fine-mapping populations for selfing species, the effectiveness of the SA-RIL approach decreases with genome size and is most valuable only when applied either to small genomes or to defined regions of large genomes. Here I propose the development of QTL-focused SA-RILs (QSA-RILs), which are SA-RILs enriched for recombination in regions of a large genome selected because there is evidence that they contain a QTL. This evidence can be derived from QTL analysis in a subset of the population at the F2 generation and/or from previous studies. In simulations, QSA-RILs afford up to a threefold increase in recombination and a twofold increase in the accuracy of QTL position estimates in comparison with RILs. The regional-selection method also shows potential for resolving QTL linked in repulsion.
One of the recent Bayesian methods for QTL mapping, the shrinkage Bayesian method (BayesA (Xu)), has been successfully used for estimating marker effects in the QTL mapping populations. Although the implementation of the BayesA (Xu) method for estimating main effects was described by the author, the equations for the posterior mean and variance, used in estimation of the effects, were not elaborated. Here I derive the equations used for the estimation of main effects for doubled-haploid and F2 populations. I then extend these equations to estimate interaction effects in doubled-haploid populations. These derivations are helpful for an understanding of the intermediate steps leading to the equations described in the original paper introducing the shrinkage Bayesian method.
|
13 |
Association statistics under the PPL framework / Huang, Yungui / 01 May 2011
In this dissertation, the posterior probability of linkage (PPL) framework is extended to the analysis of case-control (CC) data and three new linkage disequilibrium (LD) statistics are introduced. These statistics measure the evidence for or against LD, rather than testing the null hypothesis of no LD, and they therefore avoid the need for multiple testing corrections. They are suitable not only for CC designs but also for family data, ranging from trios to complex pedigrees, all under the same statistical framework, allowing for the unified analysis of these disparate data structures. They also provide the other core advantages of the PPL framework, including the use of sequential updating to accumulate LD evidence across potentially heterogeneous subsets of data; parameterization in terms of a very general trait likelihood, which simultaneously considers dominant, recessive, and additive models; and a straightforward mechanism for modeling two-locus epistasis. Finally, being implemented within the PPL framework, the new statistics readily allow linkage information obtained from distinct data to be incorporated into LD analyses in the form of a prior probability distribution. Performance of the proposed LD statistics is examined using simulated data. In addition, the effects of key modeling violations on performance are assessed. These statistics are also applied to two previously published datasets: a type 1 diabetes (T1D) family dataset containing a few candidate genes with previously reported weak associations, and a T1D CC dataset from a genome-wide association (GWA) study with some strong associations reported. The new LD statistics under the PPLD framework confirm most of the findings in the published work and also find some new SNPs suspected of being associated with T1D.
Sequential updating between the family dataset and the CC dataset dramatically increased the association signal strength for a CTLA4 SNP genotyped in both studies. Linkage information gleaned from the family dataset is also combined into the LD analysis of the CC dataset to demonstrate the utility of this unique feature of the PPL framework, and specifically for the new LD statistics.
|
14 |
Inference and Prediction for High Dimensional Data via Penalized Regression and Kernel Machine Methods / Minnier, Jessica / 06 August 2012
Analysis of high dimensional data often seeks to identify a subset of important features and assess their effects on the outcome. Furthermore, the ultimate goal is often to build a prediction model with these features that accurately assesses risk for future subjects. Such statistical challenges arise in the study of genetic associations with health outcomes. However, accurate inference and prediction with genetic information remains challenging, in part due to the complexity in the genetic architecture of human health and disease. A valuable approach for improving prediction models with a large number of potential predictors is to build a parsimonious model that includes only important variables. Regularized regression methods are useful, though they often pose challenges for inference due to nonstandard limiting distributions or finite sample distributions that are difficult to approximate. In Chapter 1 we propose and theoretically justify a perturbation-resampling method to derive confidence regions and covariance estimates for marker effects estimated from regularized procedures with a general class of objective functions and concave penalties. Our methods outperform their asymptotic-based counterparts, even when effects are estimated as zero. In Chapters 2 and 3 we focus on genetic risk prediction. The difficulty in accurate risk assessment with genetic studies can in part be attributed to several potential obstacles: sparsity in marker effects, a large number of weak signals, and non-linear effects. Single marker analyses often lack power to select informative markers and typically do not account for non-linearity. One approach to gain predictive power and efficiency is to group markers based on biological knowledge such as genetic pathways or gene structure.
In Chapter 2 we propose and theoretically justify a multi-stage method for risk assessment that imposes a naive Bayes kernel machine (KM) model to estimate gene-set specific risk models, and then aggregates information across all gene-sets by adaptively estimating gene-set weights via a regularization procedure. In Chapter 3 we extend these methods to meta-analyses by introducing sampling-based weights in the KM model. This permits building risk prediction models with multiple studies that have heterogeneous sampling schemes.
|
15 |
Statistical Approaches for Next-Generation Sequencing Data / Qiao, Dandi / 06 February 2015
During the last two decades, genotyping technology has advanced rapidly, which enabled the tremendous success of genome-wide association studies (GWAS) in the search for disease susceptibility loci (DSLs). However, only a small fraction of the overall predicted heritability can be explained by the DSLs discovered. One possible explanation for this “missing heritability” phenomenon is that many causal variants are rare. The recent development of high-throughput next-generation sequencing (NGS) technology provides the instrument to look closely at these rare variants with precision and efficiency. However, new approaches for both the storage and analysis of sequencing data are urgently needed. In this thesis, we introduce three methods that could be utilized in the management and analysis of sequencing data. In Chapter 1, we propose a novel and simple algorithm for compressing sequencing data that leverages the scarcity of rare variant data and enables efficient storage and analysis of sequencing data in current hardware environments. We also provide a C++ implementation that supports direct and parallel loading of the compressed format without requiring extra time for decompression. Chapters 2 and 3 focus on the association analysis of sequencing data in population-based designs. In Chapter 2, we present a statistical methodology that allows the identification of genetic outliers to obtain a genetically homogeneous subpopulation, which reduces the false positives due to population substructure. Our approach is computationally efficient, can be applied to all the genetic loci in the data, and does not require pruning of variants in linkage disequilibrium (LD). In Chapter 3, we propose a general analysis framework in which thousands of genetic loci can be tested simultaneously for association with complex phenotypes.
The approach is built on spatial-clustering methodology, assuming that genetic loci that are associated with the target phenotype cluster in certain genomic regions. In contrast to standard methodology for multi-loci analysis, which has focused on the dimension reduction of data, the proposed approach profits from the availability of large numbers of genetic loci. Thus it will be especially relevant for whole-genome sequencing studies which commonly record several thousand loci per gene.
|
16 |
Incorporating Interactions and Gene Annotation Data in Genomic Prediction / Martini, Johannes Wolfgang Robert / 03 November 2017
No description available.
|
17 |
Beyond high mutation high recombination limit in statistical genetics / Statistisk genetik utan begränsningarna hög mutationstakt och hög rekombineringstakt / Zorrilla, Luc / January 2021
This thesis considers the bi-allelic model in population genetics, which describes a population of genomes evolving under the processes of selection, mutation, recombination and drift. The focus is on the Quasi-Linkage Equilibrium (QLE) phase, with recent derivations from Neher and Shraiman; this phase exists when recombination is fast compared to selection strength, and its dynamics are greatly simplified compared to the general case. Using results in the QLE regime along with Direct Coupling Analysis (DCA), one can infer the fitness landscape of a population, in particular the epistasis coefficients. Following these ideas, we investigate here in detail a relation between population size, recombination rate and epistasis variance describing where the QLE regime breaks down, from a DCA-inference point of view. In particular, we find that there is no clear variation of the critical recombination rate with population size, but that, as expected, there is a linear dependence between the standard deviation of total epistasis and that critical recombination rate.
|
18 |
Multivariate Statistical Methods for Testing a Set of Variables Between Groups with Application to Genomics / Alsulami, Abdulhadi Huda / 10 1900
The use of traditional univariate analyses for comparing groups in high-dimensional genomic studies, such as the ordinary t-test that is typically used to compare two independent groups, might be suboptimal because of methodological challenges including the multiple testing problem and failure to incorporate correlation among genes. Hence, multivariate methods are preferred for the joint analysis of a group or set of variables. These methods aim to test for differences in average values of a set of variables across groups. The variables that make up the set could be determined statistically (using exploratory methods such as cluster analysis) or biologically (based on membership in known pathways). In this thesis, the traditional One-Way Multivariate Analysis of Variance (MANOVA) method and a robustified version of MANOVA (Robustified MANOVA) are compared with respect to Type I error rates and power through a simulation study. We generated data from multivariate normal as well as multivariate gamma distributions with different parameter settings. The methods are illustrated using a real gene expression dataset. In addition, we investigated a popular method known as Gene Set Enrichment Analysis (GSEA), in which sets of genes (variables) that belong to known biological pathways are considered jointly and assessed for whether or not they are "enriched" with respect to their association with a disease or phenotype of interest. We applied this method to a real genotype dataset. / Master of Science (MSc)
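The multiple testing problem mentioned above can be illustrated with a short calculation. This is a minimal sketch under the simplifying assumption that the per-gene tests are independent, which the abstract itself notes is unrealistic for correlated genes; the function names are hypothetical and the thesis does not describe this computation.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate
    at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, familywise_error_rate(0.05, k), bonferroni_threshold(0.05, k))
```

Under these assumptions, running one t-test per gene at the 5% level makes a false positive almost certain once a few hundred genes are tested, which is one motivation for testing a gene set jointly.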
|
19 |
Multilocus approaches to the detection of disease susceptibility regions: methods and applications / Ciampa, Julia Grant / January 2012
This thesis focuses on multilocus methods designed to detect single nucleotide polymorphisms (SNPs) that are associated with disease using case-control data. I study multilocus methods that allow for interaction in the regression model because epistasis is thought to be pervasive in the etiology of common human diseases. In contrast, the single-SNP models widely used in genome-wide association studies (GWAS) are thought to oversimplify the underlying biology. I consider both pairwise interactions between individual SNPs and modular interactions between sets of biologically similar SNPs. Modular epistasis may be more representative of disease processes, and its incorporation into regression analyses yields more parsimonious models. My methodological work focuses on strategies to increase power to detect susceptibility SNPs in the presence of genetic interaction. I emphasize the effect of gene-gene independence constraints and explore methods to relax them. I review several existing methods for interaction analyses and present their first empirical evaluation in a GWAS setting. I introduce the retrospective Tukey score test (RTS), a new test that investigates modular epistasis. Simulation studies suggest it offers a more powerful alternative to existing methods. I present diverse applications of these methods, using data from a multi-stage GWAS on prostate cancer (PRCA). My applied work is designed to generate hypotheses about the functionality of established susceptibility regions for PRCA by identifying SNPs that affect disease risk through interactions with them. Comparison of results across methods illustrates the impact of incorporating different forms of epistasis on inference about disease association. The top findings from these analyses are well supported by molecular studies. The results unite several susceptibility regions through overlapping biological pathways known to be disrupted in PRCA, motivating a replication study.
|
20 |
Competição intergenotípica na análise de testes de progênie em essências florestais. / Intergenotypic competition in the analysis of forest tree progeny trials. / Leonardecz Neto, Eduardo / 26 August 2002
This work sought to introduce the effect of competition between plants into the analysis of progeny and provenance trials of forest tree species, in order to identify its effects and the distortions caused by ignoring it. To that end, trials with different levels of precision and mortality were used, covering five species: Gallesia gorazema Vell. Moq., Eucalyptus grandis Hill ex Maiden, Eucalyptus citriodora Hook, Pinus elliottii Engl. var. elliottii and Araucaria angustifolia (Bert.) O. Ktze. The expected mean squares of the sources of variation in the analysis of variance were obtained for the designs used here. Based on these derivations, the bias in the estimates of quantitative genetic parameters was demonstrated explicitly. This bias is directly related to the magnitude of the regression coefficient b and to the relative magnitude of the sums of squares of the different effects in the analysis of variance of the competition variable. If the competition effect is ignored when it influences the response variable Y, the weights b that make up the selection index will have biased estimates, leading to errors in the selection of superior individuals. In the data analysis, including competition generally reduced the estimates of the variance components and, consequently, of other parameters that are functions of them, compared with estimates from analyses without adjustment for competition. The analysis of the competition variable showed no significant differences for the progeny effect. This shows that competition behaved randomly, which supports including it in the analysis as a covariate; otherwise it would have to be treated as a component of performance and introduced into a multi-trait analysis.
Using the analyses with and without adjustment for competition to estimate breeding values and the gain from selection, the selected individuals were found not to be concordant. This indicates that selection errors may be common, since adjusting the data changes the rank of the individuals regarded as superior. It is advisable to consider the effects of competition when analyzing data in which individuals are subject to competing with one another during their development. / The aim of this work was to introduce competition effects in the model underlying the analysis of forest tree experiments. Results were compared with analyses in which these effects were neglected. Progeny trials with different levels of precision and mortality were used, including the following species: Gallesia gorazema Vell. Moq., Eucalyptus grandis Hill ex Maiden, Eucalyptus citriodora Hook, Pinus elliottii Engl. var. elliottii and Araucaria angustifolia (Bert.) O. Ktze. Mathematical expectations of mean square values were derived and the bias of estimates was explicitly shown. Competition effects were found to be significant in all experiments, but were primarily of a random nature. Bias was shown to be directly proportional to the magnitude of the regression parameter b and to the relative magnitude of sums of squares of the competition variable. Including the variable in general led to a reduction of estimates of variance components and to smaller expected progress from selection. The b coefficients of the multi-effect selection index are also biased if competition is ignored. Results indicated that different sets of genotypes could be selected depending on whether the analyses of data were carried out with or without the competition effects. Including a competition variable in the analysis of trials in which plants are exposed to competing with each other is recommended.
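The way a covariate adjustment can change which individuals rank as superior can be sketched with a toy calculation. This is only an illustration under strong simplifying assumptions: a single overall regression of the response Y on a competition index, with made-up numbers, rather than the within-design analysis of covariance derived in the thesis. The names `covariate_adjust` and `ranking` are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def covariate_adjust(y: List[float], c: List[float]) -> Tuple[float, List[float]]:
    """Regress y on the competition covariate c and return the slope b
    together with y adjusted to the mean level of competition."""
    c_bar, y_bar = mean(c), mean(y)
    sxx = sum((ci - c_bar) ** 2 for ci in c)
    if sxx == 0:
        raise ValueError("competition covariate has no variation")
    sxy = sum((ci - c_bar) * (yi - y_bar) for ci, yi in zip(c, y))
    b = sxy / sxx
    adjusted = [yi - b * (ci - c_bar) for yi, ci in zip(y, c)]
    return b, adjusted


def ranking(values: List[float]) -> List[int]:
    """Indices of individuals ordered from best (largest) to worst."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


if __name__ == "__main__":
    growth = [10.0, 9.0, 8.0, 7.0]       # hypothetical response Y
    competition = [1.0, 3.0, 2.0, 4.0]   # hypothetical competition index
    b, adjusted = covariate_adjust(growth, competition)
    print(b, ranking(growth), ranking(adjusted))
```

In this toy data the individual with the best raw growth also faced the least competition, so after adjustment a different individual ranks first, mirroring the abstract's point that selections made with and without the adjustment need not agree.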
|