Understanding the genetic basis of complex disease is a critical research goal due to the immense, worldwide burden of these diseases. Observational data, such as electronic health records (EHR), offer numerous advantages in the study of complex disease genetics. These include their large scale, cost-effectiveness, information on many different conditions, and future scalability with the widespread adoption of EHRs. Observational data, however, are challenging for research due to noise and confounding. EHR data reflect factors including the healthcare process and access to care, as well as broader societal effects like systemic biases. Billing codes for complex diseases may be recorded when no diagnosis is intended, and they may be missing when a diagnosis would be correct. Overall, systematic errors distort the genetic signal available for study and motivate taking a closer look at the ways that phenotypes can be defined using observational data.
In Chapter 3, we introduce MaxGCP, a novel phenotyping method designed to purify the genetic signal in observational data. Our approach optimizes a phenotype definition to maximize its coheritability with the complex trait of interest. We first validated this method in simulations of 5000 different phenotypes across a wide range of simulation parameters, demonstrating that the method improves genome-wide association study (GWAS) power compared to conventional methods. Having evaluated it in simulation, we next applied the method in real data analyses of stroke and Alzheimer’s disease. By comparing GWAS associations to high-quality, independent test data, we were able to evaluate both the sensitivity and specificity of our method. This analysis similarly found that MaxGCP boosts GWAS power compared to previous methods.
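The optimization idea can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: that the phenotype definition is a linear combination of observed features, and that coheritability is the genetic covariance between two traits divided by the product of their phenotypic standard deviations. Under those assumptions, the Cauchy-Schwarz inequality gives closed-form optimal weights proportional to P⁻¹g. The function names and the toy inputs `P` and `g` are hypothetical and are not taken from the thesis.

```python
import numpy as np

def max_coheritability_weights(pheno_cov, gen_cov_with_target):
    """Weights w maximizing (w @ g) / sqrt(w @ P @ w); assumed linear phenotype."""
    w = np.linalg.solve(pheno_cov, gen_cov_with_target)
    # Rescale so the resulting phenotype has unit phenotypic variance.
    return w / np.sqrt(w @ pheno_cov @ w)

def coheritability(w, pheno_cov, gen_cov_with_target, target_var=1.0):
    """Assumed definition: genetic covariance over the product of phenotypic SDs."""
    return (w @ gen_cov_with_target) / np.sqrt((w @ pheno_cov @ w) * target_var)

# Toy inputs (illustrative only): P is the phenotypic covariance of four
# features; g holds each feature's genetic covariance with the target trait.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
P = A @ A.T + 4 * np.eye(4)  # symmetric positive definite
g = rng.normal(scale=0.1, size=4)

w_opt = max_coheritability_weights(P, g)
best = coheritability(w_opt, P, g)
```

In practice both `P` and `g` would have to be estimated from data, and estimation error in either would carry over to the weights.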
In Chapter 4, we extend this work to increase the speed and re-usability of pan-biobank GWAS with another new method, Indirect GWAS. Large-scale, pan-biobank studies provide a powerful resource in complex disease genetics, generating shareable summary statistics on thousands of phenotypes. Biobank-scale GWAS have two notable limitations: they are resource-intensive to compute, and they provide no results for hand-crafted phenotype definitions, which are often more relevant to study. Our method uses summary statistics to address these limitations. It computes GWAS summary statistics for any phenotype defined as a linear combination of other phenotypes. We demonstrate a number of useful applications, including an order of magnitude improvement in runtime for large pan-biobank GWAS and ultra-rapid (less than one minute) GWAS on hand-crafted phenotype definitions using only summary statistics.
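To see why summary statistics can suffice for a linear combination of phenotypes, consider a minimal sketch for the simplest case: a single variant, ordinary least squares with an intercept, and no covariates. This is textbook regression algebra under those stated assumptions, not the thesis's implementation; the function names and simulated data are hypothetical. The effect size of the combined phenotype is the weighted sum of the feature effect sizes, and its standard error follows from the feature covariance matrix together with the genotype variance implied by any one feature's summary statistics.

```python
import numpy as np

def direct_gwas(genotype, phenotype):
    """Simple linear regression (with intercept) of a phenotype on one variant."""
    n = len(genotype)
    var_g = np.var(genotype, ddof=1)
    beta = np.cov(genotype, phenotype, ddof=1)[0, 1] / var_g
    se = np.sqrt((np.var(phenotype, ddof=1) / var_g - beta**2) / (n - 2))
    return beta, se

def indirect_gwas(weights, feature_betas, feature_ses, feature_cov, n):
    """Summary statistics for the phenotype features @ weights, without raw data."""
    # Genotype variance implied by the first feature's summary statistics.
    var_g = feature_cov[0, 0] / ((n - 2) * feature_ses[0] ** 2 + feature_betas[0] ** 2)
    beta = weights @ feature_betas            # effect sizes combine linearly
    var_y = weights @ feature_cov @ weights   # variance of the combined phenotype
    se = np.sqrt((var_y / var_g - beta**2) / (n - 2))
    return beta, se

# Simulated data (illustrative only): one variant, three feature phenotypes.
rng = np.random.default_rng(0)
n = 2000
genotype = rng.binomial(2, 0.3, size=n).astype(float)
features = rng.normal(size=(n, 3)) + 0.1 * genotype[:, None]
weights = np.array([0.5, -1.0, 2.0])

summary = [direct_gwas(genotype, features[:, j]) for j in range(3)]
feature_betas = np.array([b for b, _ in summary])
feature_ses = np.array([s for _, s in summary])
feature_cov = np.cov(features, rowvar=False, ddof=1)

indirect = indirect_gwas(weights, feature_betas, feature_ses, feature_cov, n)
direct = direct_gwas(genotype, features @ weights)
```

Here `indirect` and `direct` agree up to floating-point error, even though `indirect_gwas` never sees individual-level genotypes or phenotypes. Handling covariates, missing data, and many variants at once would need additional machinery that this sketch omits.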
Through the development of new computational and statistical methods, this work demonstrates the importance and power of the phenotype side of genetic association studies, and it provides two new approaches that can improve future genetic studies of complex disease.
Identifier | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/f1wc-tm92 |
Date | January 2024 |
Creators | Zietz, Michael Norman |
Source Sets | Columbia University |
Language | English |
Detected Language | English |
Type | Theses |