
Variable selection for generalized linear mixed models and non-Gaussian Genome-wide associated study data

A genome-wide association study (GWAS) aims to identify single nucleotide polymorphisms (SNPs) associated with phenotypes. The number of SNPs ranges from hundreds of thousands to millions. If p is the number of SNPs and n is the sample size, this is a p>>n variable selection problem. The common approach to this p>>n problem in GWAS is single marker analysis (SMA). However, because SNPs are highly correlated, SMA identifies true causal SNPs with a high false discovery rate. In addition, SMA does not consider interactions between SNPs. In this dissertation, we propose novel Bayesian variable selection methods, BG2 and IBG3, for non-Gaussian GWAS data. To handle the ultra-high dimension and the high correlation among SNPs, BG2 and IBG3 have two steps: a screening step and a fine-mapping step. In the screening step, BG2 and IBG3, like SMA, fit models with one SNP each and screen to obtain a subset of the most associated SNPs. In the fine-mapping step, BG2 and IBG3 consider all possible combinations of the screened candidate SNPs to find the best model. The fine-mapping step helps reduce false positives. In addition, IBG3 iterates these two steps to detect more SNPs with small effect sizes. In simulation studies, we compare our methods with SMA methods and fine-mapping methods. We also compare our methods under different priors for variables, including the nonlocal prior, unit information prior, Zellner-g prior, and Zellner-Siow prior. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures).

Doctor of Philosophy

A genome-wide association study (GWAS) aims to identify genomic variants for a targeted phenotype, such as a disease or trait. The genomic variants of interest are single nucleotide polymorphisms (SNPs). A SNP is a substitution mutation in the DNA sequence. GWAS addresses the question of which SNPs are associated with the phenotype. However, the number of possible SNPs ranges from hundreds of thousands to millions. The common method for GWAS is called single marker analysis (SMA). SMA considers the association of only one SNP with the phenotype at a time. In this way, SMA avoids the problem that comes from a large number of SNPs and a small sample size. However, SMA does not consider interactions between SNPs. In addition, SNPs that are close to each other in the DNA sequence may be highly correlated, causing SMA to have a high false discovery rate. To solve these problems, this dissertation proposes two variable selection methods (BG2 and IBG3) for non-Gaussian GWAS data. Compared with SMA methods, BG2 and IBG3 detect true causal SNPs with a low false discovery rate. In addition, IBG3 can detect SNPs with small effect sizes. Our methods are applied to substance use disorder (alcohol consumption and cocaine dependence), human health (breast cancer), and plant science (the number of root-like structures).
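The two-step structure described in the abstract (screen with one SNP per model, then search all combinations of the screened candidates) can be illustrated with a minimal sketch. This is not the BG2 or IBG3 implementation: it swaps in ordinary least squares and BIC as a Gaussian stand-in for the Bayesian generalized linear mixed models and priors of the dissertation, it omits the IBG3 iteration, and every name in it (`bic_linear`, `screen`, `fine_map`, the simulated data, the choice of `keep=5`) is hypothetical.

```python
import itertools
import numpy as np

def bic_linear(X, y):
    """BIC of a least-squares fit with intercept (stand-in for a Bayesian GLMM score)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + design.shape[1] * np.log(n)

def screen(X, y, keep):
    """Screening step: one SNP per model; keep the `keep` SNPs with the lowest BIC."""
    scores = [bic_linear(X[:, [j]], y) for j in range(X.shape[1])]
    return sorted(np.argsort(scores)[:keep].tolist())

def fine_map(X, y, candidates):
    """Fine-mapping step: score every subset of the candidates; return the best one."""
    best, best_bic = (), bic_linear(X[:, :0], y)  # start from the intercept-only model
    for k in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, k):
            score = bic_linear(X[:, list(subset)], y)
            if score < best_bic:
                best, best_bic = subset, score
    return list(best)

# Simulated toy data (assumption): 200 samples, 50 SNPs coded 0/1/2,
# with SNPs 3 and 17 as the only causal variants.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(200, 50)).astype(float)
y = 1.5 * X[:, 3] + 1.5 * X[:, 17] + rng.normal(size=200)

candidates = screen(X, y, keep=5)      # at most 5 SNPs survive screening
selected = fine_map(X, y, candidates)  # 2^5 = 32 models are compared
print("screened:", candidates)
print("selected:", selected)
```

Screening reduces the search from 2^50 possible models to 2^5, which is what makes the exhaustive fine-mapping search feasible in this toy setting.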

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/119398
Date: 11 June 2024
Creators: Xu, Shuangshuang
Contributors: Statistics, Ferreira, Marco Antonio Rosa, Tegge, Allison, Franck, Christopher Thomas, Kim, Inyoung
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Language: English
Detected Language: English
Type: Dissertation
Format: ETD, application/pdf
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
