1.
Generalized Adaptive Shrinkage Methods and Applications in Genomics Studies. Lu, Mengyin. 01 January 2019.
Shrinkage procedures have played an important role in improving estimation accuracy in a variety of applications. In genomics studies, gene-specific sample statistics are usually noisy, especially when sample size is limited. Hence shrinkage methods (e.g., limma) have been proposed to increase statistical power in identifying differentially expressed genes. Motivated by the success of shrinkage methods, Stephens (2016) proposed a novel approach, Adaptive Shrinkage (ash), for large-scale hypothesis testing, including false discovery rate and effect size estimation, based on the fundamental “unimodal assumption” (UA) that the distribution of the actual unobserved effects has a single mode.

Even though ash primarily dealt with normal or Student-t distributed observations, the idea of the UA can be widely applied to other types of data. In this dissertation, we propose a general, flexible Bayesian shrinkage framework based on the UA, which is easily applicable to a wide range of settings. This framework allows us to deal with data involving other noise distributions (gamma, F, Poisson, binomial, etc.). We illustrate its flexibility in a variety of genomics applications, including differential gene expression analysis on RNA-seq data; comparison between bulk RNA-seq and single-cell RNA-seq data; and gene expression distribution deconvolution for single-cell RNA-seq data.
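As a rough illustration of the unimodal assumption at work, the sketch below shrinks noisy effect estimates under a prior that is a mixture of zero-centered normals and reports posterior means. It is not the ash implementation itself: the scale grid, the plain EM fit of the mixture weights, and the simulated data are all illustrative assumptions.

```python
# Minimal empirical Bayes shrinkage sketch in the spirit of adaptive shrinkage (ash).
# Assumptions (not from the abstract): a grid of zero-centered normal components
# encodes the unimodal assumption, and mixture weights are fit by a simple EM.
import numpy as np
from scipy.stats import norm

def ash_like_shrink(betahat, se, scales=(0.01, 0.1, 0.5, 1.0, 2.0), n_iter=200):
    """Shrink noisy effect estimates toward zero under a unimodal (at zero) prior."""
    K, J = len(scales), len(betahat)
    # Marginal likelihood of each observation under each mixture component:
    # betahat_j | component k ~ N(0, se_j^2 + scale_k^2)
    lik = np.array([norm.pdf(betahat, 0.0, np.sqrt(se**2 + s**2)) for s in scales])  # K x J
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):                      # EM for the mixture weights
        resp = w[:, None] * lik
        resp /= resp.sum(axis=0, keepdims=True)  # posterior component responsibilities
        w = resp.mean(axis=1)
    # Posterior mean of each true effect: responsibility-weighted normal-normal shrinkage
    post_mean_k = np.array([(s**2 / (s**2 + se**2)) * betahat for s in scales])
    return (resp * post_mean_k).sum(axis=0)

# Toy usage: mostly-null effects with a few real signals
rng = np.random.default_rng(0)
beta = np.where(rng.random(1000) < 0.1, rng.normal(0, 1, 1000), 0.0)
se = np.full(1000, 0.5)
betahat = beta + rng.normal(0, se)
shrunk = ash_like_shrink(betahat, se)
print(np.mean((betahat - beta)**2), np.mean((shrunk - beta)**2))  # shrinkage lowers MSE
```

Because every component is centered at zero, estimates are pulled toward zero by an amount that adapts to the fitted weights, which is the essence of the adaptive shrinkage idea.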
2.
Poisson multiscale methods for high-throughput sequencing data. Xing, Zhengrong. 21 December 2016.
In this dissertation, we focus on the problem of analyzing data from high-throughput sequencing experiments. With the emergence of more capable hardware and more efficient software, these sequencing data provide information at an unprecedented resolution. However, statistical methods developed for such data rarely tackle the data at such high resolutions, and often make approximations that only hold under certain conditions.

We propose a model-based approach to dealing with such data, starting from a single sample. By taking into account the inherent structure present in such data, our model can accurately capture important genomic regions. We also present the model in a way that makes it easily extensible to more complicated and biologically interesting scenarios.

Building upon the single-sample model, we then turn to the statistical question of detecting differences between multiple samples. Such questions often arise in the context of expression data, where much emphasis has been put on the problem of detecting differential expression between two groups. By extending the single-sample framework to incorporate additional group covariates, our model provides a systematic approach to estimating and testing for such differences. We then apply our method to several empirical datasets, and discuss the potential for further applications to other biological tasks.

We also seek to address a different statistical question, where the goal is to perform exploratory analysis to uncover hidden structure within the data. We incorporate the single-sample framework into a commonly used clustering scheme, and show that our enhanced clustering approach is superior to the original in many ways. We then apply our clustering method to a few empirical datasets and discuss our findings.

Finally, we apply the shrinkage procedure used within the single-sample model to tackle a completely different statistical issue: nonparametric regression with heteroskedastic Gaussian noise. We propose an algorithm that accurately recovers both the mean and variance functions given a single set of observations, and demonstrate its advantages over state-of-the-art methods through extensive simulation studies.
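For readers unfamiliar with multiscale models for counts, the sketch below shows the recursive dyadic decomposition such models build on: a Poisson total plus binomial splits at successively finer scales, here smoothed by a simple shrinkage of the split probabilities. It is a toy illustration, not the method developed in the dissertation; the power-of-two length and the shrinkage toward 1/2 are assumptions made for brevity.

```python
# A minimal sketch of the multiscale (recursive dyadic) parametrization that underlies
# Poisson multiscale models: the total count, plus binomial "splits" at each scale.
# Assumptions (not from the abstract): signal length is a power of two and the
# split probabilities are estimated by simple shrinkage toward 1/2.
import numpy as np

def multiscale_split(counts):
    """Return (total, list of (parent_counts, left_counts) per scale, coarse to fine)."""
    x = np.asarray(counts, dtype=float)
    levels = []
    while len(x) > 1:
        parents = x[0::2] + x[1::2]        # sum adjacent pairs
        levels.append((parents, x[0::2]))  # left child given parent is Binomial(parent, p)
        x = parents
    levels.reverse()                       # coarse to fine
    return x[0], levels

def smooth_poisson(counts, prior_strength=1.0):
    """Denoise by shrinking each scale's split probabilities toward 1/2, then rebuild."""
    total, levels = multiscale_split(counts)
    est = np.array([total], dtype=float)
    for parents, left in levels:
        # shrunken estimate of the left-split probability at this scale
        p = (left + prior_strength) / (parents + 2 * prior_strength)
        new = np.empty(2 * len(est))
        new[0::2], new[1::2] = est * p, est * (1 - p)
        est = new
    return est

counts = np.random.default_rng(1).poisson(lam=np.r_[np.full(32, 3.0), np.full(32, 12.0)])
print(np.round(smooth_poisson(counts), 1))  # smoothed intensity reflecting the step change
```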
3.
Bayesian lasso: An extension for genome-wide association study. Joo, LiJin. 24 March 2017.
In genome-wide association studies (GWAS), variable selection has been used for prioritizing candidate single-nucleotide polymorphisms (SNPs). To relate densely located SNPs to a complex trait, we need a method that is robust under various genetic architectures, yet sensitive enough to detect the marginal difference between null and non-null factors. For this problem, ordinary Lasso produced too many false positives, and Bayesian Lasso fit by Gibbs samplers became too conservative when the selection criterion was posterior credible sets.

My proposals to improve Bayesian Lasso address two aspects: using a stochastic approximation, variational Bayes, to increase computational efficiency, and using a Dirichlet-Laplace prior to better separate small effects from nulls. Both the double exponential prior of Bayesian Lasso and the Dirichlet-Laplace prior have a global-local mixture representation, and variational Bayes can effectively handle the hierarchies of a model built on that representation. In analyses of simulated and real sequencing data, the proposed methods showed meaningful improvements in both efficiency and accuracy.
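The global-local mixture representations mentioned above can be made concrete with a short simulation. The sketch below draws effects from the double exponential (Laplace) prior via its normal scale mixture form and from one standard construction of the Dirichlet-Laplace prior; the hyperparameter values are illustrative assumptions, not those used in the thesis.

```python
# A minimal sketch of the global-local mixture representations mentioned above,
# drawing from (i) the double exponential (Laplace) prior of the Bayesian Lasso and
# (ii) a Dirichlet-Laplace prior. Hyperparameter values below are illustrative only.
import numpy as np
rng = np.random.default_rng(2)
p = 1000            # number of SNP effects

# (i) Bayesian Lasso prior: Laplace(scale=b) as a normal scale mixture,
#     theta_j ~ N(0, psi_j), psi_j ~ Exponential(rate = 1 / (2 b^2))
b = 0.5
psi = rng.exponential(scale=2 * b**2, size=p)
theta_lasso = rng.normal(0.0, np.sqrt(psi))

# (ii) Dirichlet-Laplace prior (one standard construction):
#     theta_j ~ Laplace(scale = phi_j * tau), phi ~ Dirichlet(a,...,a), tau ~ Gamma(p*a, 1/2)
a = 0.5
phi = rng.dirichlet(np.full(p, a))
tau = rng.gamma(shape=p * a, scale=2.0)   # rate 1/2  ==  scale 2
theta_dl = rng.laplace(0.0, phi * tau)

# The DL draw concentrates most coordinates very near zero while leaving a few large,
# which is the "separating small effects from nulls" behaviour targeted above.
for name, th in [("lasso", theta_lasso), ("dirichlet-laplace", theta_dl)]:
    print(name, np.mean(np.abs(th) < 1e-3), np.max(np.abs(th)))
```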
4.
The role of read depth in the design and analysis of sequencing experiments. Robinson, David Garrett. 04 September 2015.
The development of quantitative sequencing technologies, such as RNA-Seq, Bar-Seq, ChIP-Seq, and metagenomics, has offered great insight into molecular biology. Proper design and analysis of these experiments require statistical models and techniques that consider the specific nature of sequencing data, which typically consist of a matrix of read counts per feature. An issue of particular importance to the development of these methods is the role of read depth in statistical accuracy and power. The depth of an experiment affects the power to draw biological conclusions, meaning that an experimental design must weigh the tradeoff between cost, power, and the number of samples examined. Similarly, per-gene read depth affects each gene's power and accuracy, and must be taken into account in any downstream analysis.

Here I explore many facets of the role of read depth in the design and analysis of sequencing experiments, and offer computational and statistical methods for addressing them. To assist in the design of sequencing experiments, I present subSeq, which examines the effect of depth in an experiment by subsampling reads to simulate lower depths. I use this method to examine the extent of read saturation across a variety of RNA-Seq experiments, and demonstrate a statistical model for predicting the effect of increasing depth in any experiment. I consider intensity dependence in a technology comparison between microarrays and RNA-Seq, and show that the variance added by RNA-Seq depends more on depth than the variance in microarrays depends on fluorescence intensity. I demonstrate that Bar-Seq data shares these depth-dependent properties with RNA-Seq and can be analyzed by the same tools, and further provide suggestions on the appropriate depth for Bar-Seq experiments. Finally, I show that per-gene read depth can be taken into account in multiple hypothesis testing to improve power, and introduce the method of functional false discovery rate (fFDR) control.
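The subsampling idea behind subSeq can be stated very compactly: keeping each read independently with probability p is equivalent to binomial thinning of the count matrix. The sketch below is an illustrative reimplementation of that step (not the subSeq package itself), applied to a simulated count matrix.

```python
# A minimal sketch of read-depth subsampling as described for subSeq: each read is
# retained independently with probability p, which is equivalent to binomial thinning
# of the per-gene count matrix. Function and variable names here are illustrative.
import numpy as np

def subsample_counts(counts, proportion, rng=None):
    """Simulate sequencing the same libraries at `proportion` of the original depth."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.binomial(counts, proportion)

rng = np.random.default_rng(3)
counts = rng.negative_binomial(n=5, p=0.01, size=(2000, 6))   # toy genes x samples matrix
for p in (0.01, 0.1, 0.5, 1.0):
    sub = subsample_counts(counts, p, rng)
    detected = (sub.sum(axis=1) > 0).sum()
    print(f"depth fraction {p:>4}: {detected} genes with at least one read")
# Rerunning a differential expression analysis on each thinned matrix shows how
# power saturates with depth, which is the read-saturation question examined above.
```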
5.
Shrinkage of dispersion parameters in the double exponential family of distributions, with applications to genomic sequencing. Ruddy, Sean Matthew. 27 March 2015.
The prevalence of sequencing experiments in genomics has led to an increased use of methods for count data in analyzing high-throughput genomic data. Shrinkage methods remain important for improving the performance of these statistical methods. A common example is that of gene expression data, where the counts per gene are often modeled as some form of an overdispersed Poisson; in this case, shrinkage estimates of the per-gene dispersion parameter have led to improved estimation of dispersion when the number of samples is small. We address a different count setting introduced by the use of sequencing data: comparing differential proportional usage via an overdispersed binomial model. Such a model can be useful for testing differential exon inclusion in mRNA-Seq experiments, in addition to the typical differential gene expression analysis. In this setting, there are fewer such shrinkage methods for the dispersion parameter. We introduce a novel method developed by modeling the dispersion based on the double exponential family of distributions proposed by Efron (1986), also known as the exponential dispersion model (Jorgensen, 1987). Our methods (WEB-Seq and DEB-Seq) are empirical Bayes strategies for producing a shrunken estimate of dispersion that can be applied to any double exponential dispersion family, though we focus on the binomial and Poisson. These methods effectively detect differential proportional usage, and have close ties to the weighted likelihood strategy of edgeR developed for gene expression data (Robinson and Smyth, 2007; Robinson et al., 2010). We analyze their behavior on simulated data sets as well as real data for both differential exon usage and differential gene expression. In the exon usage case, we demonstrate our methods' superior ability to control the FDR and detect truly different features compared to existing methods. In the gene expression setting, our methods fail to control the FDR; however, the rankings of genes by p-value are among the top performers, and prove to be robust both to changes in the probability distribution used to generate the counts and to low sample size situations. We provide an implementation of our methods in the R package DoubleExpSeq, available from the Comprehensive R Archive Network (CRAN).
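To make the role of dispersion shrinkage concrete, the sketch below moderates per-gene overdispersion estimates toward a common value with a fixed prior weight. It is a generic moderated estimator in the spirit of the methods discussed above, not WEB-Seq or DEB-Seq; the method-of-moments estimator, the median as the common value, and the prior weight are all illustrative assumptions.

```python
# A minimal sketch of empirical Bayes dispersion shrinkage for overdispersed counts:
# per-gene estimates are pulled toward a common value, with the amount of shrinkage
# governed by a prior weight. This is a generic moderated estimator, not the
# WEB-Seq/DEB-Seq implementation; names and the prior weight are illustrative.
import numpy as np

def moments_dispersion(counts):
    """Per-gene method-of-moments dispersion for an overdispersed Poisson:
    Var = mu + phi * mu^2  =>  phi = (Var - mu) / mu^2 (clipped at 0)."""
    mu = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    return np.clip((var - mu) / np.maximum(mu, 1e-8) ** 2, 0.0, None)

def shrink_dispersion(counts, prior_weight=20.0):
    """Weighted combination of per-gene and common dispersion (more shrinkage when few samples)."""
    phi_gene = moments_dispersion(counts)
    phi_common = np.median(phi_gene)
    n = counts.shape[1]
    return (n * phi_gene + prior_weight * phi_common) / (n + prior_weight)

rng = np.random.default_rng(4)
mu = rng.gamma(2.0, 50.0, size=500)                       # toy per-gene means
counts = rng.negative_binomial(n=10, p=10 / (10 + mu[:, None]), size=(500, 4))
print(shrink_dispersion(counts)[:5])                      # stabilized dispersion estimates
```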
6.
Polygenic analysis of genome-wide SNP data. Simonson, Matthew A. 28 June 2013.
One of the central motivators behind genetic research is to understand how genetic variation relates to human health and disease. Recently, there has been a large-scale effort to find common genetic variants associated with many forms of disease and disorder using single nucleotide polymorphisms (SNPs). Several genome-wide association studies (GWAS) have successfully identified SNPs associated with phenotypes. However, the effect sizes attributed to individual variants are generally small, explaining only a very small amount of the genetic risk and heritability expected based on the estimates of family and twin studies. Several explanations exist for the inability of GWAS to find the "missing heritability."

The results of recent research appear to confirm the prediction made by population genetics theory that most complex phenotypes are highly polygenic, occasionally influenced by a few alleles of relatively large effect, and usually by several of small effect. Studies have also confirmed that common variants are only part of what contributes to the total genetic variance for most traits, indicating that rare variants may play a significant role.

This research addresses some of the most glaring weaknesses of the traditional GWAS approach through the application of methods of polygenic analysis. We apply several methods, including those that investigate the net effects of large sets of SNPs, more sophisticated approaches informed by biology rather than the purely statistical approach of GWAS, as well as methods that infer the effects of recessive rare variants.

Our results indicate that traditional GWAS is well complemented and improved upon by methods of polygenic analysis. We demonstrate that polygenic approaches can be used to significantly predict individual risk for disease, provide an unbiased estimate of a substantial proportion of the heritability for multiple phenotypes, identify sets of genes grouped into biological pathways that are enriched for associations, and finally, detect the significant influence of recessive rare variants.
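One of the polygenic approaches referred to above, predicting individual risk from the net effect of many SNPs, reduces to a weighted sum of allele counts. The sketch below illustrates that calculation on simulated genotypes; the effect sizes, allele frequencies, and liability model are assumptions made for illustration, and in practice the weights would come from an independent discovery GWAS.

```python
# A minimal sketch of the polygenic-score idea: aggregate the small effects of many
# SNPs into a single per-individual risk predictor. Effect sizes would normally come
# from GWAS summary statistics; the data below are simulated for illustration.
import numpy as np

def polygenic_score(genotypes, effect_sizes):
    """Weighted sum of allele counts (0/1/2) across SNPs for each individual."""
    return genotypes @ effect_sizes

rng = np.random.default_rng(5)
n_ind, n_snp = 1000, 5000
maf = rng.uniform(0.05, 0.5, n_snp)
genotypes = rng.binomial(2, maf, size=(n_ind, n_snp))        # additive coding
true_beta = np.where(rng.random(n_snp) < 0.01, rng.normal(0, 0.1, n_snp), 0.0)
liability = genotypes @ true_beta + rng.normal(0, 1.0, n_ind)
# In practice the weights are estimated in a discovery sample; here we just add noise.
est_beta = true_beta + rng.normal(0, 0.02, n_snp)
prs = polygenic_score(genotypes, est_beta)
print(np.corrcoef(prs, liability)[0, 1])   # the score predicts the simulated trait
```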
7.
Module-Based Analysis for "Omics" Data. Wang, Zhi. 24 March 2015.
This thesis focuses on methodologies and applications of module-based analysis (MBA) in omics studies to investigate the relationships between phenotypes and biomarkers, e.g., SNPs, genes, and metabolites. As an alternative to traditional single-biomarker approaches, MBA may increase the detectability and reproducibility of results, because biomarkers tend to have moderate individual effects but a significant aggregate effect; it may also improve the interpretability of findings and facilitate the construction of follow-up biological hypotheses, because MBA assesses biomarker effects in a functional context, e.g., pathways and biological processes. Finally, for exploratory omics studies, which usually begin with a full scan of a long list of candidate biomarkers, MBA provides a natural way to reduce the total number of tests, and hence relax the multiple-testing burden and improve power.

The first MBA project focuses on genetic association analysis that assesses the main and interaction effects for sets of genetic (G) and environmental (E) factors rather than for individual factors. We develop a kernel machine regression approach to evaluate the complete effect profile (i.e., the G, E, and G-by-E interaction effects separately or in combination) and construct a kernel function for the gene-environment (GE) interaction directly from the genetic kernel and the environmental kernel. We use simulation studies and real data applications to show improved performance of the kernel machine (KM) regression method over the commonly adopted principal component (PC) regression methods across a wide range of scenarios. The largest gain in power occurs when the underlying effect structure involves complex GE interactions, suggesting that the proposed method could be a useful and powerful tool for performing exploratory or confirmatory analyses in GxE-GWAS.

In the second MBA project, we extend the kernel machine framework developed in the first project to model biomarkers with network structure. A network summarizes the functional interplay among biological units; incorporating network information can more precisely model the biological effects, enhance the ability to detect true signals, and facilitate our understanding of the underlying biological mechanisms. In this work, we develop two kernel functions to capture different types of network structure information. Through simulations and a metabolomics study, we show that the proposed network-based methods can have markedly improved power over approaches that ignore network information.

Metabolites are the end products of cellular processes and reflect the ultimate responses of a biological system to genetic variations or environmental exposures. Because of these unique properties of metabolites, pharmacometabolomics aims to understand the underlying signatures that contribute to individual variation in drug response and to identify biomarkers that can be helpful for response prediction. To facilitate mining pharmacometabolomic data, we establish an MBA pipeline that has great practical value in the detection and interpretation of signatures, which may potentially indicate a functional basis for the drug response. We illustrate the utility of the pipeline by investigating two scientific questions in an aspirin study: (1) which metabolite changes can be attributed to aspirin intake, and (2) what metabolic signatures can be helpful in predicting aspirin resistance. Results show that the MBA pipeline enables us to identify metabolic signatures that are not found in a preliminary single-metabolite analysis.
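The kernel construction at the heart of the first project can be illustrated compactly: build a genetic kernel and an environmental kernel across subjects, then form the interaction kernel from the two. The elementwise (Hadamard) product used below is one common way to construct a GE kernel and is assumed here for illustration, as are the linear kernels and simulated data; this is a sketch of the idea rather than the thesis implementation.

```python
# A minimal sketch of kernel machine modelling of G, E, and G-by-E effects:
# a genetic kernel, an environmental kernel, and an interaction kernel built from them.
# The Hadamard-product interaction kernel and the linear kernels are assumptions.
import numpy as np

def linear_kernel(X):
    """Similarity between subjects from a block of centred covariates."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T / X.shape[1]

rng = np.random.default_rng(6)
n = 200
G = rng.binomial(2, 0.3, size=(n, 50)).astype(float)   # SNP set (0/1/2 coding)
E = rng.normal(size=(n, 3))                            # environmental exposures

K_G = linear_kernel(G)
K_E = linear_kernel(E)
K_GE = K_G * K_E                                       # interaction kernel (Hadamard product)

# Variance-component view: y = h(G) + h(E) + h(GxE) + noise, with each h governed by
# its kernel. A score-type test then asks whether the variance attached to K_GE is zero.
y = rng.multivariate_normal(np.zeros(n), 0.5 * K_G + 0.5 * K_E + 1.0 * K_GE + np.eye(n))
print(K_G.shape, K_E.shape, K_GE.shape, y.shape)
```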
8.
Increasing statistical power and generalizability in genomics microarray research. Ramasamy, Adaikalavan. January 2009.
The high-throughput technologies developed in the last decade have revolutionized the speed of data accumulation in the life sciences. As a result we have very rich and complex data that hold great promise for solving many complex biological questions. One such technology that is very well established and widespread is DNA microarrays, which allow one to simultaneously measure the expression levels of tens of thousands of genes in a biological tissue. This thesis aims to contribute to the development of statistics that allow end users to obtain robust and meaningful results from DNA microarrays for further investigation. The methodology, implementation and pragmatic issues of two important and related topics – sample size estimation for designing new studies and meta-analysis of existing studies – are presented here to achieve this aim. Real-life case studies and guided steps are also given.

Sample size estimation is important at the design stage to ensure a study has sufficient statistical power to address the stated objective given the financial constraints. The commonly used formula for estimating the number of biological samples, its shortcomings and potential ameliorations are discussed. The optimal number of biological samples and number of measurements per sample that minimizes the cost is also presented.

Meta-analysis, or the synthesis of information from existing studies, is very attractive because it can increase statistical power by making comprehensive and inexpensive use of available information. Furthermore, one can also easily test the generalizability of findings (i.e. the extent to which results from a particular valid study can be applied to other circumstances). The key issues in conducting a meta-analysis of microarray studies, a checklist and R code are presented here.

Finally, the poor availability of raw data in microarray studies is discussed, with recommendations for authors, journal editors and funding bodies. Good availability of data is important for meta-analysis, in order to avoid biased results, and for sample size estimation.
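As an illustration of the sample size calculation discussed above, the sketch below applies the standard two-group formula with the per-gene significance level adjusted for the number of genes tested. The Bonferroni-style adjustment and all numerical inputs are illustrative assumptions rather than the recommendations made in the thesis.

```python
# A minimal sketch of the commonly used per-group sample size formula for detecting a
# log2 fold change Delta in a two-group comparison, with the significance level adjusted
# for the large number of genes tested. The Bonferroni adjustment and the numbers below
# are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def samples_per_group(delta, sigma, alpha=0.05, power=0.8, n_genes=20000):
    """n = 2 * (z_{1-alpha*/2} + z_{power})^2 * sigma^2 / delta^2, with alpha* = alpha / n_genes."""
    alpha_star = alpha / n_genes                      # Bonferroni-style per-gene level
    z = norm.ppf(1 - alpha_star / 2) + norm.ppf(power)
    return int(np.ceil(2 * (z * sigma / delta) ** 2))

# e.g. detecting a 1.0 log2-fold change when the per-gene SD of log2 expression is 0.7
print(samples_per_group(delta=1.0, sigma=0.7))
```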
9.
Methods for modelling human functional brain networks with MEG and fMRI. Colclough, Giles. January 2016.
MEG and fMRI offer complementary insights into connected human brain function. Evidence from the use of both techniques in the study of networked activity indicates that functional connectivity reflects almost every measurable aspect of human reality, being indicative of ability and deteriorating with disease. Functional network analyses may offer improved prediction of dysfunction and characterisation of cognition. Three factors holding back progress are the difficulty in synthesising information from multiple imaging modalities; a need for accurate modelling of connectivity in individual subjects, not just average effects; and a lack of scalable solutions to these problems that are applicable in a big-data setting. I propose two methodological advances that tackle these issues.

A confound to network analysis in MEG, the artificial correlations induced across the brain by the process of source reconstruction, prevents the transfer of connectivity models from fMRI to MEG. The first advance is a fast correction for this confound, allowing comparable analyses to be performed in both modalities. A comparative study demonstrates that this new approach for MEG shows better repeatability for connectivity estimation, both within and between subjects, than a wide range of alternative models in popular use. A case-study analysis uses both fMRI and MEG recordings from a large dataset to determine the genetic basis for functional connectivity in the human brain. Genes account for 20% to 65% of the variation in connectivity, and outweigh the influence of the developmental environment.

The second advance is a Bayesian hierarchical model for sparse functional networks that is applicable to both modalities. By sharing information over a group of subjects, more accurate estimates can be constructed for individuals' connectivity patterns. The approach scales to large datasets, outperforms state-of-the-art methods, and can provide a 50% noise reduction in MEG resting-state networks.
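The leakage confound described above, and the simplest style of correction for it, can be seen in a few lines: source reconstruction mixes a common signal into several reconstructed time courses, inflating their zero-lag correlation, and regressing one signal out of the other removes it. The pairwise orthogonalization below is a simplified illustration only, not the correction developed in the thesis.

```python
# A minimal sketch of the leakage problem: source reconstruction induces artificial
# zero-lag correlations between reconstructed time courses. A standard simple remedy,
# shown here, regresses one signal out of the other (pairwise orthogonalization).
# This is a simplified illustration, not the correction developed in the thesis.
import numpy as np

def orthogonalize(x, y):
    """Remove from y its zero-lag projection onto x."""
    return y - (np.dot(x, y) / np.dot(x, x)) * x

rng = np.random.default_rng(7)
n_t = 5000
source_a = rng.normal(size=n_t)
source_b = rng.normal(size=n_t)
# Simulated "reconstructed" signals with shared leakage of source_a into both channels
rec_a = source_a
rec_b = 0.4 * source_a + source_b

raw_corr = np.corrcoef(rec_a, rec_b)[0, 1]                    # inflated by leakage
corrected = orthogonalize(rec_a, rec_b)
corrected_corr = np.corrcoef(rec_a, corrected)[0, 1]          # essentially zero
print(round(raw_corr, 3), round(corrected_corr, 3))
```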
10.
Computational modeling for identification of low-frequency single nucleotide variants. Hao, Yangyang. 16 November 2015.
Indiana University-Purdue University Indianapolis (IUPUI)

Reliable detection of low-frequency single nucleotide variants (SNVs) carries great significance in many applications. In cancer genetics, the frequencies of somatic variants from tumor biopsies tend to be low due to contamination with normal tissue and tumor heterogeneity. Circulating tumor DNA monitoring also faces the challenge of detecting low-frequency variants due to the small percentage of tumor DNA in blood. Moreover, in population genetics, although pooled sequencing is cost-effective compared with individual sequencing, pooling dilutes the signals of variants from any individual. Detection of low-frequency variants is difficult and can be confounded by multiple sources of error, especially next-generation sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 5%; most fail to consider differential, context-specific sequencing artifacts. To face this challenge, we developed a computational and experimental framework, RareVar, to reliably identify low-frequency SNVs from high-throughput sequencing data. For optimized performance, RareVar utilized a supervised learning framework to model artifacts originating from different components of a specific sequencing pipeline. This is enabled by a customized, comprehensive benchmark dataset enriched with known low-frequency SNVs from the sequencing pipeline of interest. A genomic-context-specific sequencing error model was trained on the benchmark data to characterize systematic sequencing artifacts and to derive the position-specific detection limit for sensitive low-frequency SNV detection. Further, a machine-learning algorithm utilized sequencing quality features to refine SNV candidates for higher specificity. RareVar outperformed existing approaches, especially at 0.5% to 5% frequency. We further explored the influence of statistical modeling on position-specific error modeling and showed the zero-inflated negative binomial to be the best-performing statistical distribution. When replicating analyses on an Illumina MiSeq benchmark dataset, our method seamlessly adapted to technologies with different biochemistries. RareVar enables sensitive detection of low-frequency SNVs across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, cancer early detection, prognostic assessment, metastatic monitoring, and identification of relapse or acquired resistance.
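The position-specific detection limit described above can be illustrated with a deliberately simplified calculation: given a background error rate estimated for a position from control data, find the smallest alternate-allele count at a given depth that would be implausible under errors alone. The binomial tail used below is a stand-in for illustration; the zero-inflated negative binomial error model and the machine-learning refinement step of RareVar are not reproduced here, and all numbers are assumptions.

```python
# A minimal sketch of a position-specific detection limit: estimate a position's
# background error rate from control (known non-variant) data, then find the smallest
# alternate-allele count at a given depth that is unlikely under that error rate.
# A simplified binomial illustration, not RareVar itself.
import numpy as np
from scipy.stats import binom

def detection_limit(error_rate, depth, alpha=1e-6):
    """Smallest alt-allele count whose binomial tail probability falls below alpha."""
    k = np.arange(depth + 1)
    tail = binom.sf(k - 1, depth, error_rate)     # P(X >= k) under background errors
    return int(k[tail < alpha][0])

# Per-position error rates estimated from control samples (illustrative values)
for err in (1e-4, 1e-3, 5e-3):
    k_min = detection_limit(err, depth=10000)
    print(f"error rate {err:.0e}: need >= {k_min} alt reads (VAF {k_min/10000:.2%}) at 10000x")
```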