Global ETD Search

Statistical Models for Next Generation Sequencing Data

Three statistical models are developed to address problems in next-generation sequencing data. The first two models are designed for RNA-Seq data and the third for ChIP-Seq data. The first RNA-Seq model uses a Bayesian nonparametric model to detect genes that are differentially expressed across treatments. A negative binomial sampling distribution is used for each gene’s read count, so that each gene may have its own parameters. Despite the consequently large number of parameters, parsimony is imposed by the clustering inherent in the Bayesian nonparametric framework. A Bayesian discovery procedure is adopted to calculate the probability that each gene is differentially expressed. A simulation study and a real data analysis show that this method performs at least as well as existing leading methods in some cases. The second RNA-Seq model shares the framework of the first, but replaces the usual random partition prior from the Dirichlet process with a random partition prior indexed by distances from Gene Ontology (GO). The use of this external biological information yields improvements in statistical power over the original Bayesian discovery procedure. The third model addresses the problem of identifying protein binding sites in ChIP-Seq data. An exact test via a stochastic approximation is used to test the hypothesis that the treatment effect is independent of the sequence count intensity effect. The sliding window procedure for ChIP-Seq data is followed, and the p-value and the adjusted false discovery rate are calculated for each window. For the sites identified as peak regions, three candidate models are proposed for characterizing the bimodality of the ChIP-Seq data, and the stochastic approximation Monte Carlo (SAMC) method is used to select the best of the three. Real data analysis shows that this method produces results comparable to those of other existing methods and is advantageous in identifying bimodality in the data.
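The sliding-window step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the false discovery rate is adjusted, so the Benjamini-Hochberg procedure is used here as an assumed stand-in, and the function names `sliding_window_counts` and `benjamini_hochberg` are hypothetical. The per-window p-values are taken as given; the exact test via stochastic approximation that produces them in the thesis is not reproduced.

```python
def sliding_window_counts(positions, region_length, window, step):
    """Count read positions falling in each half-open window [start, start + window)."""
    counts = []
    start = 0
    while start + window <= region_length:
        counts.append(sum(1 for p in positions if start <= p < start + window))
        start += step
    return counts


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: BH is a common FDR adjustment, used here only for
    illustration; the thesis may use a different adjustment.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical read positions and per-window p-values.
    print(sliding_window_counts([1, 2, 5, 9], region_length=10, window=4, step=2))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Windows whose adjusted value falls below a chosen threshold would then be the candidate peak regions passed on to the bimodality models.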

http://hdl.handle.net/1969.1/149412

next generation sequencing

Bayesian nonparametrics

Gene Ontology

MCMC

Identifier	oai:union.ndltd.org:tamu.edu/oai:repository.tamu.edu:1969.1/149412
Date	03 October 2013
Creators	Wang, Yiyi
Contributors	Dahl, David B., Liang, Faming, Spiegelman, Clifford H., Hart, Jeffrey D., Klein, Patricia E.
Source Sets	Texas A and M University
Language	English
Detected Language	English
Type	Thesis, text
Format	application/pdf
