MCMC sampling methods for binary variables with application to haplotype phasing and allele specific expression

The purpose of this thesis is to explore Markov chain Monte Carlo (MCMC) methodology, a powerful technique in the Bayesian framework, for binary variables. The primary application of interest is using this methodology to phase haplotypes, a type of categorical variable. A haplotype is the combination of variants present in an individual’s genome. Phasing refers to estimating the true haplotype. By considering only biallelic and heterozygous variants, the haplotype can be expressed as a vector of binary variables. Accounting for differences in haplotypes is essential for the study of associations between genotype and disease.
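As a rough illustration of the binary encoding described above, the following sketch codes each heterozygous, biallelic site as 0 (reference allele) or 1 (alternate allele). The function names and the toy alleles are assumptions made for this example and are not taken from the thesis. Because every site is heterozygous and has only two alleles, the second haplotype is the bitwise complement of the first.

```python
def encode_haplotype(alleles, ref_alleles):
    """Code each site as 0 if it carries the reference allele, else 1.

    Assumes every site is biallelic, so "not reference" means "alternate".
    """
    return [0 if a == r else 1 for a, r in zip(alleles, ref_alleles)]


def complement(haplotype):
    """At heterozygous biallelic sites the other haplotype is the complement."""
    return [1 - x for x in haplotype]


# Hypothetical toy data: four heterozygous sites.
ref_alleles = ["A", "C", "G", "T"]
hap1_alleles = ["A", "T", "G", "C"]  # alleles observed on one chromosome

hap1 = encode_haplotype(hap1_alleles, ref_alleles)
hap2 = complement(hap1)
```

Under this coding, phasing n heterozygous sites amounts to choosing one binary vector of length n, with the vector and its complement describing the same pair of haplotypes.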
MCMC is an extremely popular class of statistical methods for simulating autocorrelated draws from target distributions, including posterior distributions in Bayesian analysis. Techniques for sampling categorical variables in MCMC have been developed in a variety of disparate settings. Samplers include Gibbs, Metropolis-Hastings, and exact Hamiltonian-based samplers. A review of these techniques is presented, and their relevance to the genetic model is discussed.
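To make the idea of a Gibbs sampler for binary variables concrete, here is a minimal sketch. It is not the sampler or the genetic model from the thesis. It assumes a toy target p(x) proportional to exp(sum_i b[i]*x[i] + sum_{i<j} W[i][j]*x[i]*x[j]) over x in {0,1}^n, where W is symmetric. Under that assumption, the full conditional of each component is Bernoulli with a logistic probability. The names gibbs_binary, b, and W are hypothetical.

```python
import math
import random


def gibbs_binary(b, W, n_sweeps, seed=0):
    """Gibbs sampler for a toy pairwise binary model (illustrative only).

    b: list of n biases; W: symmetric n-by-n list of pairwise weights.
    Returns one draw (a list of 0/1 values) per full sweep.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [rng.randint(0, 1) for _ in range(n)]  # arbitrary starting state
    draws = []
    for _ in range(n_sweeps):
        for i in range(n):
            # Log-odds of x[i] = 1 given all other components.
            field = b[i] + sum(W[i][j] * x[j] for j in range(n) if j != i)
            p_one = 1.0 / (1.0 + math.exp(-field))
            x[i] = 1 if rng.random() < p_one else 0
        draws.append(list(x))
    return draws


# Hypothetical parameters for three weakly coupled binary variables.
b = [0.5, -0.5, 0.0]
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, -1.0],
     [0.0, -1.0, 0.0]]
draws = gibbs_binary(b, W, n_sweeps=1000)
```

Successive draws are autocorrelated because each sweep starts from the previous state, which is why convergence has to be assessed before the draws are used for inference.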
An important consideration in using simulated MCMC draws for inference is whether they have converged to the distribution of interest. Since the distribution is typically of a non-standard form, convergence cannot generally be proven and is instead assessed with convergence diagnostics. The convergence diagnostics developed so far focus on continuous variables and may be inappropriate for binary variables or categorical variables in general. Two convergence diagnostics are proposed that are tailor-made for categorical variables by modeling the draws using categorical time series models. The performance of these diagnostics is evaluated under various simulations.
The methodology developed in the thesis is applied to estimate haplotypes. There are two main challenges involved in accounting for haplotype differences. One is estimating the true combination of genetic variants on a single chromosome, known as haplotype phasing. The other is the phenomenon of allele-specific expression (ASE), in which haplotypes can be expressed unequally. No existing method addresses these two intrinsically linked challenges together. Rather, current strategies rely on known haplotypes or family trio data, i.e., data on the subject of interest and their parents. A novel method named IDP-ASE is presented, which is capable of phasing haplotypes and quantifying ASE using only RNA-seq data. This model leverages the strengths of both Second Generation Sequencing (SGS) data and Third Generation Sequencing (TGS) data. The long read length of TGS data facilitates phasing, while the accuracy and depth of SGS data facilitate estimation of ASE. Moreover, IDP-ASE is capable of estimating ASE at both the gene and isoform level.

Identifier: oai:union.ndltd.org:uiowa.edu/oai:ir.uiowa.edu:etd-6936
Date: 01 May 2017
Creators: Deonovic, Benjamin Enver
Contributors: Smith, Brian J. (Brian Joseph), 1982-; Au, Kin Fai
Publisher: University of Iowa
Source Sets: University of Iowa
Language: English
Detected Language: English
Type: dissertation
Format: application/pdf
Source: Theses and Dissertations
Rights: Copyright © 2017 Benjamin Enver Deonovic
