Global ETD Search

1	Maternal Gene-Environment Effects: An Evaluation of Statistical Approaches to Detect Effects and an Investigation of the Effect of Violations of Model Assumptions Hudson, Julie 20 September 2019 (has links) Discovering the associations between genetic variables and disease status can help reduce the burden of disease on society. This thesis focuses on the methods required to detect maternal genetic effects (an effect where the genes of the mother affect the disease risk of the child) and interaction effects between these maternal genes and environmental variables in trio data consisting of parents and an affected child. A simulation study was conducted to determine the extent to which testing for these effects is affected by violations of the mating symmetry assumption required for two current methods when control parents are not available. This study showed that methods for maternal effect estimation are not robust to these violations; however, the interaction test is robust to the violation. Finally, a candidate gene study on orofacial clefts was conducted to evaluate maternal gene-environment interactions in international consortium data. Significant effects were found but the large magnitude of the effect estimates raises concerns about the validity of the results. This thesis also discusses the lack of methods and software available to estimate maternal gene-environment interactions. Statistical Genetics Genetic Epidemiology
2	Identification of Candidate Causal Variants and Estimation of Genetic Associations in GWAS and Post-GWAS Studies Faye, Laura 09 January 2014 (has links) Genome-wide association studies (GWAS) and next generation sequencing (NGS) studies are powerful high-throughput methods of scanning the human genome that have dramatically increased our ability to identify disease-causing genetic variants and estimate the magnitude of their effects. Leveraging the power of these technologies requires statistical methods tailored to the real world complexities of the data from these studies. Statistical methods developed during the era of small candidate gene studies fail to account for the extended scope of genome-wide studies, which encompasses: (1) discovery of disease-associated regions; (2) localization of associations to individual risk variants; and (3) estimation of effect size. In addition, high-throughput sequencing used for large samples differs from traditional Sanger sequencing in that genotyping error varies substantially over a region, which can distort evidence used to identify the disease-associated variant. In this thesis, I model these factors in order to increase accuracy of genetic effect estimation and accuracy of identification of disease-causing variants within disease-associated regions. I address these factors in three related settings: (1) GWAS study used alone to both discover and estimate the size of genetic effect at disease-associated variants; (2) GWAS study followed with sequencing to both discover an associated region via GWAS SNPs and estimate the size of genetic effect using the sequencing data; and (3) GWAS study with sequencing or imputation used jointly to identify candidate causal variants and estimate the corresponding effect sizes within an associated region. I develop novel statistical methods to address the specific localization and estimation problems encountered in each setting. 
Extensive simulation studies are used to explore the nature of these problems and to compare the performance of the new methods with the standard methods. Application to the Wellcome Trust Case Control Consortium Type 1 Diabetes dataset and National Cancer Institute BPC3 aggressive prostate cancer study demonstrates the difference the methods make in the interpretation of evidence in these high-throughput studies. statistical genetics 0308
4	Pathogenicity and selective constraint in the non-coding genome Short, Patrick January 2019 (has links) Gene regulation plays a central role in evolution, organismal development, and disease. Despite the critical importance of gene regulation throughout development, there have been few genetic variants in regulatory elements with large effects that have been robustly associated with disease. In this work, my overarching aim was to gain a better understanding of the contribution of genetic variation in regulatory elements to Mendelian disorders, and I approached this problem from three different perspectives. I first sought to assess the contribution of regulatory variation to severe developmental disorders using sequence data from 8,000 affected individuals and their parents and to identify individual elements with a high probability of harbouring pathogenic regulatory elements. Next, I used population genetic models and data from more than 28,000 whole genome sequenced individuals to examine the forces of selection operating on non-coding elements genome-wide. Finally, I conducted a pilot experiment to assay >50,000 different non-coding variants across more than 700 different non-coding elements, including variants observed in patients with developmental disorders, in a massively parallel reporter assay (MPRA), and collaborated on an assessment of the impact of patient mutations in eleven different enhancers using mouse transgenesis assays. A few key results from the work are summarised below:
- I provide evidence that de novo SNVs in non-coding elements contribute to severe developmental disorders, and estimate that they contribute in 1-3% of cases not harbouring a likely diagnostic coding variant.
- These de novo SNVs reside primarily in highly evolutionarily conserved regulatory elements and I estimate that a large fraction of conserved non-coding elements (50-70%) are acting as enhancers and a smaller subset (10-15%) have a function related to alternative splicing.
- Statistical modelling of the distribution of variants in developmental disorder patients suggests that a small fraction of bases (maximum likelihood estimate of 3%) within a disease-associated non-coding element are likely pathogenic with high penetrance when mutated.
- I develop a new genome-wide mutation rate model that accounts for a variety of germline features including recombination rate, replication timing, sequence context, and histone marks which greatly outperforms models based on sequence-context alone.
- I find evidence for widespread purifying selection in the non-coding genome that is correlated with nucleotide-level evolutionary conservation, even when the conserved nucleotides lie within otherwise poorly conserved sequence.
- I show that the selective constraint on small insertions and deletions is likely greater than the selective constraint on SNVs.
- I present data from a pilot experiment assessing more than 50,000 different non-coding variants in a massively parallel reporter assay conducted in both HeLa and Neuroblastoma cells.
5	Genes in space : selection, association and variation in spatially structured populations Mathieson, Iain January 2013 (has links) Spatial structure in a population creates distinctive patterns in genetic data. There are two reasons to model this process. First, since the genetic structure of a population is induced by its historical spatial structure, it can be used to make inference about history and demography. Second, these models provide corrections to other analyses that are confounded by spatial structure. Since it is now common to collect genome-wide data on many thousands of samples, a major challenge is to develop fast, scalable, approximate algorithms that can analyse these datasets. A practical approach is to focus on subsets of the data that are most informative, for example rare variants. First we look at the problem of estimating selection coefficients in spatially structured populations. We demonstrate this approach using classical datasets of moth colour morph frequencies, and then use it in a model incorporating both ancient and modern DNA to estimate the selective advantage of one of the best known examples of local adaptation in humans, lactase persistence in Europeans. Next, we turn to the problem of association studies in spatially structured populations. We demonstrate that rare variants are more confounded by non-genetic risk than common variants. Excess confounding is a consequence of the fact that rare variants are highly informative about recent ancestry and therefore, in a spatially explicit model, about location. Finally, we use this insight into rare variants to develop methods for inference about population history using rare variant and haplotype sharing as simple summary statistics. 
These approaches are extremely fast and can be applied to genome-wide data on thousands of samples, yet they provide an accurate description of the history of a population, both identifying recent ancestry and estimating migration rates between subpopulations. 576.58
6	Leveraging Distribution Quantiles to Detect Gene Interactions in the Pursuit of Personalized Medicine Alyass, Akram January 2018 (has links) Anticipations of personalized medicine are primarily attributed to the recent advances in computational science and high-throughput technologies that enable the ever-more realistic modeling of complex diseases. These diseases result from the interplay between genes and environment that have limited our ability to predict, prevent, or treat them. While many envision the utility of integrated high-dimensional patient-specific information, basic research towards developing accurate and reliable frameworks for personalized medicine is relatively slow in progress. This thesis provides a state-of-the-art review of current challenges towards personalized medicine. There is a need for global investment in basic research that includes 1) cost-effective generation of high-quality high-throughput data, 2) hybrid education and multidisciplinary teams, 3) data storage and processing, 4) data integration and interpretation, and 5) individual and global economic relevance; to be followed by global investments into public health to adopt routine personalized medicine. This review also highlights that unknown or unadjusted interactions result in true heterogeneity in the effect and relevance of patient data. This limits our ability to integrate and reliably utilize high-dimensional patient-specific data. This thesis further investigates the true heterogeneity in marginal effects of known BMI genetic variants. This involved the development of the novel statistical method, meta-quantile regression (MQR), to identify variants with potential gene-gene / gene-environment interactions. Applying MQR on public and local data (75,230 European adults) showed that FTO, PCSK1, TCF7L2, MC4R, FANCL, GIPR, MAP2K5, and NT5C2 have potential interactions on BMI. 
In addition, a gene score of 37 BMI variants shows that the genetic architecture of BMI is shaped by gene-gene and gene-environment interactions. The computational cost of fitting MQR models was greatly reduced using unconditional quantile regression. The utility of MQR was further compared to variance heterogeneity tests in identifying variants with potential interactions. MQR tests were found to have a higher power of detecting synergetic and antagonistic interactions for skewed quantitative traits while maintaining nominal Type I error rates compared to variance heterogeneity tests. Overall, MQR is a valuable tool to detect potential interactions without imposing assumptions on the nature of interactions. / Thesis / Doctor of Philosophy (PhD) / The anticipations of personalized medicine are largely due to the recent advances in computational science and our capabilities to rapidly measure and generate biological data. These developments have enhanced our understanding of complex diseases, and should theoretically enable us to predict, prevent and treat such cases in a proactive personalized context. This thesis provides a state-of-the-art review of the challenges and opportunities that explain the relatively slow progress towards personalized medicine. It identifies data integration and interpretation as the main bottleneck and proposes a novel method, termed Meta-Quantile Regression (MQR), to identify genetic variations with potential interactions. Analyses were conducted on a total of 75,230 individuals with European ancestry, and the genetic architecture of obesity was shown to be shaped by genetic interactions. Lastly, the computational cost of MQR was substantially reduced using linear approximations, and MQR was further shown to have better performance in identifying potential interactions compared to classic variance tests. personalized medicine Statistical Genetics Obesity BMI
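The intuition behind looking at distribution quantiles in the abstract above can be illustrated with a small simulation. The sketch below is not the thesis's MQR method: it assumes a binary carrier/non-carrier genotype, for which the quantile-regression effect at quantile tau reduces to the difference between the two groups' sample quantiles. The function names, the simulated effect sizes, and the 30% carrier frequency are all hypothetical choices made for illustration. When the variant interacts with an unobserved factor, its marginal effect differs across quantiles of the trait; without an interaction it stays roughly constant.

```python
import random


def sample_quantile(values, tau):
    """Sample quantile by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = tau * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def quantile_effects(trait, carrier, taus=(0.1, 0.5, 0.9)):
    """Marginal effect of a binary genotype at each quantile.

    For a binary predictor, the quantile-regression slope equals the
    difference between the group-specific quantiles.
    """
    carriers = [y for y, g in zip(trait, carrier) if g == 1]
    others = [y for y, g in zip(trait, carrier) if g == 0]
    return {
        tau: sample_quantile(carriers, tau) - sample_quantile(others, tau)
        for tau in taus
    }


def simulate(n, interaction, seed=1):
    """Simulate a trait with an optional gene-by-environment interaction.

    All parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    trait, carrier = [], []
    for _ in range(n):
        g = 1 if rng.random() < 0.3 else 0   # carrier status
        e = rng.gauss(0.0, 1.0)              # unobserved environment
        noise = rng.gauss(0.0, 0.5)
        trait.append(0.5 * g + e + interaction * g * e + noise)
        carrier.append(g)
    return trait, carrier


def spread(effects):
    """Range of the quantile-specific effects; large values hint at interaction."""
    return max(effects.values()) - min(effects.values())


if __name__ == "__main__":
    for label, strength in (("no interaction", 0.0), ("interaction", 1.0)):
        effects = quantile_effects(*simulate(20000, strength))
        print(label, {t: round(v, 2) for t, v in effects.items()})
```

A full analysis in the spirit of the abstract would use additive genotype coding, covariate adjustment, and a formal test of heterogeneity across quantiles, none of which this sketch attempts.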
7	Multivariate linear mixed models for statistical genetics Casale, Francesco Paolo January 2016 (has links) In the last decade, genome-wide association studies have helped to advance our understanding of the genetic architecture of many important traits, including diseases. However, the statistical analysis of genotype-phenotype associations remains challenging due to multiple factors. First, many traits have polygenic architectures, which means that they are controlled by a large number of variants with small individual effects. Second, as increasingly deep phenotype data are being generated there is a need for multivariate analysis approaches to leverage multiple related phenotypes while retaining computational efficiency. Additionally, genetic analyses are confronted by strong confounding factors that can create spurious associations when not properly accounted for in the statistical model. We here derive more flexible methods that allow integrating genetic effects across variants and multiple quantitative traits. To do so, we build on the classical linear mixed model (LMM), a widely adopted framework for genetic studies. The first contribution of this thesis is mtSet, an efficient mixed-model approach that enables genome-wide association testing between sets of genetic variants and multiple traits while accounting for confounding factors. In both simulations and real-data applications we demonstrate that mtSet effectively combines the advantages of variant-set and multi-trait analyses. Next, we present a new model for gene-context interactions that builds on mtSet. The proposed interaction set test (iSet) yields increased statistical power for detecting polygenic interactions. Additionally, iSet enables the identification of genetic loci that are associated with different configurations of causal variants across contexts. 
After benchmarking the proposed method using simulated data, we consider two applications to real datasets, where we investigate genetic effects on gene expression across different cellular contexts and sex-specific genetic effects on lipid levels. Finally, we describe LIMIX, a software framework for the flexible implementation of different LMMs. Most of the models considered in this thesis, including mtSet and iSet, are implemented and available in LIMIX. A unique aspect of the software is an inference framework that allows a large class of genetic models to be defined and, in many cases, to be efficiently fitted by exploiting specific algebraic properties. We demonstrate the utility of this software suite in two applied collaboration projects. Taken together, this thesis demonstrates the value of flexible and integrative modelling in genetics and contributes new statistical methods for genetic analysis. These approaches generalise previous models, yet retain the computational efficiency that is needed to tackle large genetic datasets. 519.5
8	Search for Complex Disease Genes: Achievements and Failures AXENOVICH, Tatiana I., BORODIN, Pavel M. 12 1900 (has links) Uses content digitized by the National Institute of Informatics. complex diseases linkage analysis QTL mapping allelic association statistical genetics
9	Exploring nonlinear regression methods, with application to association studies Speed, Douglas Christopher January 2011 (has links) The field of nonlinear regression is a long way from reaching a consensus. Once a method decides to explore nonlinear combinations of predictors, a number of questions are raised, such as what nonlinear combinations to permit and how best to search the resulting model space. Genetic Association Studies comprise an area that stands to gain greatly from the development of more sophisticated regression methods. While these studies' ability to interrogate the genome has advanced rapidly over recent years, it is thought that a lack of suitable regression tools prevents them from achieving their full potential. I have tried to investigate the area of regression in a methodical manner. In Chapter 1, I explain the regression problem and outline existing methods. I observe that both linear and nonlinear methods can be categorised according to the restrictions enforced by their underlying model assumptions and speculate that a method with as few restrictions as possible might prove more powerful. In order to design such a method, I begin by assuming each predictor is tertiary (takes no more than three distinct values). In Chapters 2 and 3, I propose the method Sparse Partitioning. Its name derives from the way it searches for high scoring partitions of the predictor set, where each partition defines groups of predictors that jointly contribute towards the response. A sparsity assumption supposes most predictors belong in the 'null group' indicating they have no effect on the outcome. In Chapter 4, I compare the performance of Sparse Partitioning to existing methods using simulated and real data. The results highlight how greatly a method's power depends on the validity of its model assumptions. 
For this reason, Sparse Partitioning appears to offer a robust alternative to current methods, as its lack of restrictions allows it to maintain power in scenarios where other methods will fail. Sparse Partitioning relies on Markov chain Monte Carlo estimation, which limits the size of problem on which it can be used. Therefore, in Chapter 5, I propose a deterministic version of the method which, although less powerful, is not affected by convergence issues. In Chapter 6, I describe Bayesian Projection Pursuit, which adds spline fitting into the method to cope with non-tertiary predictors. 519.5
10	Multiple testing & optimization-based approaches with applications to genome-wide association studies Posner, Daniel Charles 07 December 2019 (has links) Many phenotypic traits are heritable, but the exact genetic causes are difficult to determine. A common approach for disentangling the different genetic factors is to conduct a "genome-wide association study" (GWAS), where each single nucleotide variant (SNV) is tested for association with a trait of interest. Many SNVs for complex traits have been found by GWAS, but to date they explain only a fraction of heritability of complex traits. In this dissertation, we propose novel optimization-based and multiple testing procedures for variant set tests. In the second chapter, we propose a novel variant set test, convex-optimized SKAT (cSKAT), that leverages multiple SNV annotations. The test generalizes SKAT to convex combinations of SKAT statistics constructed from functional genomic annotations. We differ from previous approaches by optimizing kernel weights with a multiple kernel learning algorithm. In cSKAT, the contribution of each variant to the overall statistic is a product of annotation values and kernel weights for annotation classes. We demonstrate the utility of our biologically-informed SNV weights in a rare-variant analysis of fasting glucose in the FHS. In the third chapter, we propose a sequential testing procedure for GWAS that joins tests of single SNVs and groups of SNVs (SNV-sets) with common biological function. The proposed procedure differs from previous procedures by testing genes and sliding 4kb intergenic windows rather than chromosomes or the whole genome. We also sharpen an existing tree-based multiple testing correction by incorporating correlation between SNVs, which is present in any SNV-set containing contiguous regions (such as genes). In the fourth chapter, we present a sequential testing procedure for SNV-sets that incorporates correlation between test statistics of the SNV-sets. 
At each step of the procedure, the multiplicity correction is the number of remaining independent tests, making no assumption about the null distribution of tests. We provide an estimator for the number of remaining independent tests based on previous work in single-SNV GWAS and demonstrate the estimator is valid for sequential procedures. We implement the proposed method for GWAS by sequentially testing chromosomes, genes, 4kb windows, and SNVs. Biostatistics Multiple testing Rare variant tests SKAT Statistical genetics
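The idea of a step-wise multiplicity correction equal to the number of remaining tests can be illustrated with Holm's classical step-down procedure. This is a swapped-in standard technique, not the dissertation's method: it treats every remaining test as one unit, whereas the abstract describes an estimator for the number of remaining independent tests that accounts for correlation. The function name and the example p-values are hypothetical.

```python
def holm_step_down(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Tests are visited from smallest to largest p-value. At each step the
    p-value is compared with alpha divided by the number of tests not yet
    rejected; the procedure stops at the first test that fails.
    Returns a list of booleans in the original order of p_values.
    """
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    remaining = len(p_values)  # assumption: each remaining test counts as one
    for idx in order:
        if p_values[idx] <= alpha / remaining:
            rejected[idx] = True
            remaining -= 1
        else:
            break  # all larger p-values are retained as well
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for four SNV-sets.
    print(holm_step_down([0.001, 0.04, 0.03, 0.012]))
```

Replacing `remaining` with an estimate of the effective number of independent tests among those remaining would move the sketch closer to what the abstract describes, but the form of that estimator is not given in the source.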
