Predicting the effect of missense variants is critically important in population and medical genetics. It is essential for interpreting genetic variation in population screening and clinical diagnostic sequencing, and for reaching optimal statistical power in risk gene discovery in genetic studies of diseases and traits. A quantitative analysis of the fitness effect of all possible missense variants can provide a foundation for understanding how proteins evolve in humans and other species. In this thesis, I describe new methods to infer the effect of missense variants using various machine learning techniques.
First, I worked on a ResNet-based supervised model, trained on curated databases, to predict pathogenicity. The curated clinical databases have uneven quality and uncertain bias across genes. To address this issue, I developed a new method, MisFit, to separately model the molecular effect and population fitness effect of missense variants, and to estimate them jointly using a probabilistic graphical model. The architecture of MisFit follows the biological causality of the variant effect: for a missense variant, the protein sequence and structure context determine its molecular effect, which in turn determines its fitness effect given how the protein is involved in various conditions and traits.
The latter, the protein's involvement in conditions and traits, is a latent factor encapsulated in a sigmoid-shaped function with gene-specific parameters. The fitness effect determines the expected allele counts in human populations. This model can be trained using large-scale population genome data without known pathogenicity labels. I investigated how informative allele counts are for inferring fitness effect using simulations with realistic demographic parameters.
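As a rough illustration of a sigmoid-shaped link with gene-specific parameters, the sketch below maps a molecular effect score to a selection coefficient. The function name, the three parameters (`gene_midpoint`, `gene_slope`, `gene_max_s`), and the logistic form itself are assumptions made for illustration; the abstract does not specify the parametrization MisFit uses.

```python
import math


def fitness_effect(molecular_effect, gene_midpoint, gene_slope, gene_max_s):
    """Hypothetical sigmoid link from molecular effect to selection coefficient.

    All three gene-specific parameters are illustrative assumptions:
      gene_midpoint: molecular effect at which selection reaches half its maximum
      gene_slope:    steepness of the transition (positive)
      gene_max_s:    largest selection coefficient attainable in this gene
    """
    logistic = 1.0 / (1.0 + math.exp(-gene_slope * (molecular_effect - gene_midpoint)))
    return gene_max_s * logistic
```

Under this assumed form, two variants with the same molecular effect can have different fitness effects in different genes, because the midpoint, slope, and ceiling are set per gene.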
To take advantage of the latest deep learning techniques and large population genome data sets, I use a Poisson-Inverse-Gaussian distribution, which is differentiable, to approximate the probability of allele counts given fitness effect and sample size. We show that the MisFit-estimated heterozygous selection coefficient of missense variants is consistent with the ratio of de novo mutations among observed variants in a population with child-parent trio data.
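To make the Poisson-Inverse-Gaussian likelihood concrete, here is a minimal sketch of its log probability mass function, assuming the standard mean/shape parametrization of the inverse Gaussian mixing distribution (`mu`, `shape`). How MisFit ties these parameters to fitness effect and sample size is not stated in this abstract, so that mapping is left out. Because the Bessel functions involved have half-integer order, the whole expression reduces to elementary functions, which is what makes it smooth in its parameters.

```python
import math


def pig_log_pmf(n, mu, shape):
    """log P(N = n) for N | x ~ Poisson(x), x ~ InverseGaussian(mean=mu, shape=shape).

    The (mu, shape) parametrization is an assumption of this sketch.
    Uses P(n) = sqrt(shape / (2*pi)) * exp(shape / mu) / n!
                * 2 * (b / a)**(nu / 2) * K_nu(2 * sqrt(a * b)),
    with nu = n - 1/2, a = 1 + shape / (2 * mu**2), b = shape / 2.
    """
    a = 1.0 + shape / (2.0 * mu * mu)
    b = shape / 2.0
    z = 2.0 * math.sqrt(a * b)

    # log K_{1/2}(z), available in closed form; K_{-1/2} equals K_{1/2}.
    log_k_lo = 0.5 * math.log(math.pi / (2.0 * z)) - z
    if n <= 1:
        log_k = log_k_lo
    else:
        # log K_{3/2}(z) = log K_{1/2}(z) + log(1 + 1/z)
        log_k_hi = log_k_lo + math.log1p(1.0 / z)
        v = 1.5
        for _ in range(n - 2):
            # Upward recurrence K_{v+1} = K_{v-1} + (2v/z) K_v, kept in log space.
            log_next = log_k_hi + math.log(2.0 * v / z + math.exp(log_k_lo - log_k_hi))
            log_k_lo, log_k_hi = log_k_hi, log_next
            v += 1.0
        log_k = log_k_hi

    nu = n - 0.5
    return (0.5 * math.log(shape / (2.0 * math.pi)) + shape / mu
            - math.lgamma(n + 1) + math.log(2.0)
            + 0.5 * nu * math.log(b / a) + log_k)
```

In a deep learning framework the same arithmetic could be written with tensor operations so that gradients flow to `mu` and `shape`; the pure-Python version here only shows the functional form.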
Furthermore, de novo missense variants with selection coefficient >0.01 are significantly enriched in neurodevelopmental disorder cases, achieving the best performance in prioritization of de novo variants for new risk gene discovery compared to previous methods. We also show that the estimated molecular effect reaches state-of-the-art performance in the classification of damaging variants in deep mutational scanning assays, with improved consistency of the score scale across genes.
Finally, I analyzed the transmission disequilibrium of inherited variants in autism using a new empirical Bayesian method to identify risk genes, which models relative risk as a continuous function of variant effect in each gene.
Identifier | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/rky8-3h79 |
Date | January 2024 |
Creators | Zhao, Yige |
Source Sets | Columbia University |
Language | English |
Detected Language | English |
Type | Theses |