1 |
Distributed and multiphase inference in theory and practice| Principles, modeling, and computation for high-throughput science / Blocker, Alexander W. / 10 August 2013
<p> The rise of high-throughput scientific experimentation and data collection has introduced new classes of statistical and computational challenges. The technologies driving this data explosion are subject to complex new forms of measurement error, requiring sophisticated statistical approaches. Simultaneously, statistical computing must adapt to larger volumes of data and new computational environments, particularly parallel and distributed settings. This dissertation presents several computational and theoretical contributions to these challenges. </p><p> In chapter 1, we consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates that controls for variability due to enzymatic digestion. We use this to construct a calibrated Bayesian method to detect local concentrations of nucleosome positions. Inference is carried out via a distributed HMC algorithm that scales linearly in complexity with the length of the genome being analyzed. We provide MPI-based implementations of the proposed methods, stand-alone and on Amazon EC2, which can provide inferences on an entire <i>S. cerevisiae</i> genome in less than 1 hour on EC2. </p><p> We then present a method for absolute quantitation from LC-MS/MS proteomics experiments in chapter 2. We present a Bayesian model for the non-ignorable missing data mechanism induced by this technology, which includes an unusual combination of censoring and truncation. We provide a scalable MCMC sampler for inference in this setting, enabling full-proteome analyses using cluster computing environments. A set of simulation studies and actual experiments demonstrate this approach's validity and utility. </p><p> We close in chapter 3 by proposing a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. 
Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. We provide some initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several paths for further research into the statistical principles underlying preprocessing.</p>
|
2 |
Genetic analysis of differentiation of T-helper lymphocytes / Wang, Qixin / 28 November 2013
<p> In the human immune system, T-helper cells can differentiate into two lymphocyte subsets: Th1 and Th2. The intracellular signaling pathways of differentiation form a dynamic regulatory network through the secretion of distinct types of cytokines, while differentiation is regulated by two major gene loci: T-bet and GATA-3. We developed a system dynamics model to simulate the differentiation and re-differentiation process of T-helper cells, based on gene expression levels of T-bet and GATA-3 during differentiation of these cells. We identified three final states of the model and concluded that the cells retain differentiation potential as long as the system is at an unstable equilibrium point; once the model reaches a stable equilibrium point, the T-helper cells no longer have the potential to differentiate. In addition, the time lag caused by expression of transcription factors can lead to oscillations in the secretion of cytokines during differentiation.</p>
|
3 |
Significant distinct branches of hierarchical trees| A framework for statistical analysis and applications to biological data / Sun, Guoli / 10 March 2015
<p> One of the most common goals of hierarchical clustering is finding those branches of a tree that form quantifiably distinct data subtypes. Achieving this goal in a statistically meaningful way requires (a) a measure of distinctness of a branch and (b) a test to determine the significance of the observed measure, applicable to all branches and across multiple scales of dissimilarity. </p><p> We formulate a method termed Tree Branches Evaluated Statistically for Tightness (TBEST) for identifying significantly distinct tree branches in hierarchical clusters. For each branch of the tree a measure of distinctness, or tightness, is defined as a rational function of heights, both of the branch and of its parent. A statistical procedure is then developed to determine the significance of the observed values of tightness. We test TBEST as a tool for tree-based data partitioning by applying it to five benchmark datasets, one of them synthetic and the other four each from a different area of biology. With each of the five datasets, there is a well-defined partition of the data into classes. In all test cases TBEST performs on par with or better than the existing techniques. </p><p> One dataset uses Cores Of Recurrent Events (CORE) to select features. CORE was developed with my participation in the course of this work. An R language implementation of the method is available from the Comprehensive R Archive Network: cran.r-project.org/web/packages/CORE/index.html. </p><p> Based on our benchmark analysis, TBEST is a tool of choice for detection of significantly distinct branches in hierarchical trees grown from biological data. An R language implementation of the method is available from the Comprehensive R Archive Network: cran.r-project.org/web/packages/TBEST/index.html.</p>
|
4 |
Polygenic analysis of genome-wide SNP data / Simonson, Matthew A. / 28 June 2013
<p> One of the central motivators behind genetic research is to understand how genetic variation relates to human health and disease. Recently, there has been a large-scale effort to find common genetic variants associated with many forms of disease and disorder using single nucleotide polymorphisms (SNPs). Several genome-wide association studies (GWAS) have successfully identified SNPs associated with phenotypes. However, the effect sizes attributed to individual variants are generally small, explaining only a very small amount of the genetic risk and heritability expected based on the estimates of family and twin studies. Several explanations exist for the inability of GWAS to find the "missing heritability." </p><p> The results of recent research appear to confirm the prediction made by population genetics theory that most complex phenotypes are highly polygenic, occasionally influenced by a few alleles of relatively large effect, and usually by several of small effect. Studies have also confirmed that common variants are only part of what contributes to the total genetic variance for most traits, indicating that rare variants may play a significant role. </p><p> This research addresses some of the most glaring weaknesses of the traditional GWAS approach through the application of methods of polygenic analysis. We apply several methods, including those that investigate the net effects of large sets of SNPs, more sophisticated approaches informed by biology rather than the purely statistical approach of GWAS, as well as methods that infer the effects of recessive rare variants. </p><p> Our results indicate that traditional GWAS is well complemented and improved upon by methods of polygenic analysis. 
We demonstrate that polygenic approaches can be used to significantly predict individual risk for disease, provide an unbiased estimate of a substantial proportion of the heritability for multiple phenotypes, identify sets of genes grouped into biological pathways that are enriched for associations, and finally, detect the significant influence of recessive rare variants.</p>
|
5 |
Module-Based Analysis for "Omics" Data / Wang, Zhi / 24 March 2015
<p> This thesis focuses on methodologies and applications of module-based analysis (MBA) in omics studies to investigate the relationships of phenotypes and biomarkers, e.g., SNPs, genes, and metabolites. As an alternative to traditional single-biomarker approaches, MBA may increase the detectability and reproducibility of results because biomarkers tend to have moderate individual effects but a significant aggregate effect; it may also improve the interpretability of findings and facilitate the construction of follow-up biological hypotheses because MBA assesses biomarker effects in a functional context, e.g., pathways and biological processes. Finally, for exploratory “omics” studies, which usually begin with a full scan of a long list of candidate biomarkers, MBA provides a natural way to reduce the total number of tests, and hence relax the multiple-testing burden and improve power.</p><p> The first MBA project focuses on genetic association analysis that assesses the main and interaction effects for sets of genetic (G) and environmental (E) factors rather than for individual factors. We develop a kernel machine regression approach to evaluate the complete effect profile (i.e., the G, E, and G-by-E interaction effects separately or in combination) and construct a kernel function for the Gene-Environmental (GE) interaction directly from the genetic kernel and the environmental kernel. We use simulation studies and real data applications to show improved performance of the Kernel Machine (KM) regression method over the commonly adopted PC regression methods across a wide range of scenarios. 
The largest gain in power occurs when the underlying effect structure involves complex GE interactions, suggesting that the proposed method could be a useful and powerful tool for performing exploratory or confirmatory analyses in GxE-GWAS.</p><p> In the second MBA project, we extend the kernel machine framework developed in the first project to model biomarkers with network structure. A network summarizes the functional interplay among biological units; incorporating network information can more precisely model the biological effects, enhance the ability to detect true signals, and facilitate our understanding of the underlying biological mechanisms. In this work, we develop two kernel functions to capture different network structure information. Through simulations and a metabolomics study, we show that the proposed network-based methods can have markedly improved power over approaches that ignore network information.</p><p> Metabolites are the end products of cellular processes and reflect the ultimate responses of a biological system to genetic variations or environmental exposures. Because of the unique properties of metabolites, pharmacometabolomics aims to understand the underlying signatures that contribute to individual variations in drug responses and to identify biomarkers that can be helpful for response prediction. To facilitate mining pharmacometabolomic data, we establish an MBA pipeline that has great practical value in the detection and interpretation of signatures, which may potentially indicate a functional basis for the drug response. We illustrate the utility of the pipeline by investigating two scientific questions in an aspirin study: (1) which metabolite changes can be attributed to aspirin intake, and (2) which metabolic signatures can be helpful in predicting aspirin resistance. Results show that the MBA pipeline enables us to identify metabolic signatures that are not found in preliminary single-metabolite analysis.</p>
|