471 |
Methods for the Analysis of Differential Composition of Gene Expression
Dimont, Emmanuel 01 March 2016
Modern next-generation sequencing and microarray-based assays have empowered the computational biologist to measure various aspects of biological activity. This has led to the growth of genomics, transcriptomics and proteomics as fields of study of the complete set of DNA, RNA and proteins in living cells respectively. One major challenge in the analysis of this data, however, has been the widespread lack of sufficiently large sample sizes due to the high cost of new emerging technologies, making statistical inference difficult. In addition, due to the hierarchical nature of the various types of data, it is important to correctly integrate them to make meaningful biological discoveries and better informed decisions for the successful treatment of disease. In this dissertation I propose: (1) a novel method for more powerful statistical testing of differential digital gene expression between two conditions, (2) a framework for the integration of multi-level biologic data, demonstrated with the compositional analysis of gene expression and its link to promoter structure, and (3) an extension to a more complex generalized linear modeling framework, demonstrated with the compositional analysis of gene expression and its link to pathway structure adjusted for confounding covariates.
|
472 |
Quantifying Sources of Variation in High-throughput Biology
Franks, Alexander M. 17 July 2015
One of the central challenges in systems biology research is disentangling relevant and irrelevant sources of variation. While the relevant quantities are always context dependent, an important distinction can be drawn between variability due to biological processes and variability due to measurement error. Biological variability includes variability between mRNA or protein abundances within a well-defined condition, variability of these abundances across conditions (physiological variability), and between-species or between-subject variability. Technical variability includes measurement error, technological bias, and variability due to missing data. In this dissertation, we explore statistical challenges associated with disentangling sources of variability, both biological and technical, in the analysis of high-throughput biological data. In the first chapter, we present a careful meta-analysis of 27 yeast data sets supported by a multilevel model to separate biological variability from structured technical variability. In the second chapter, we suggest a simple and general approach for deconvolving the contributions of orthogonal sources of biological variability, both between and within molecules, across multiple physiological conditions. The results discussed in these two chapters elucidate the relative importance of transcriptional and post-transcriptional regulation of protein levels. Finally, in the third chapter we introduce a novel approach for modeling non-ignorable missing data. We illustrate the utility of this methodology on missing data in mRNA and protein measurements. / Statistics
|
473 |
Non-Canonical Translation in Vertebrates
Chew, Guo-Liang 17 July 2015
Translation is a key process during gene expression: to produce proteins, ribosomes translate the coding sequences of mRNAs. However, vertebrate genomes contain more translation potential than these annotated coding sequences: translation has been detected in many non-coding RNAs and in the non-coding regions of mRNAs. To understand the role of such translation in vertebrates, I investigated: 1) the distribution of translation in vertebrate long non-coding RNAs, and 2) the effects of translation in the 5’ leaders of vertebrate mRNAs.
To quantify and localize translation in a genome-wide manner, we produced and analyzed ribosome profiling data in zebrafish, and analyzed ribosome profiling data produced by others. The nucleotide resolution afforded by ribosome profiling allows localization of translation to individual ORFs within a transcript, while its quantitative nature enables measurement of how much translation occurs within individual ORFs.
We combined ribosome profiling with a machine-learning approach to classify lncRNAs during zebrafish development and in mouse ES cells. We found that dozens of proposed lncRNAs are protein-coding contaminants and that many lncRNAs have ribosome profiles that resemble those of the 5’ leaders of coding mRNAs. These results clarify the annotation of lncRNAs and suggest a potential role for translation in lncRNA regulation.
Because much of the translation in non-coding regions of mRNAs occurs within uORFs, we further examined the effects of their translation on the cognate gene expression. While much is known about the repression of individual genes by their uORFs, how uORF repressiveness varies within a genome and what underlies this variation had not been characterized. To address these questions, we analyzed transcript sequences and ribosome profiling data from human, mouse and zebrafish.
Linear modeling revealed that sequence features at both uORFs and coding sequences contribute similarly and substantially toward modulating uORF repressiveness and coding sequence translational efficiency. Strikingly, uORF sequence features are conserved in mammals, and mediate the conservation of uORF repressiveness in vertebrates. uORFs are depleted near coding sequences and have initiation contexts that diminish their translation. These observations suggest that the prevalence of vertebrate uORFs may be explained by their functional conservation as weak repressors of coding sequence translation. / Biology, Molecular and Cellular
|
474 |
Towards a Systematic Approach for Characterizing Regulatory Variation
Barrera, Luis A. 21 April 2016
A growing body of evidence suggests that genetic variants that alter gene expression are responsible for many phenotypic differences across individuals, particularly for the risk of developing common diseases. However, the molecular mechanisms that underlie the vast majority of associations between genetic variants and their phenotypes remain unknown. An important limiting factor is that genetic variants remain difficult to interpret, particularly in noncoding sequences. Developing truly systematic approaches for characterizing regulatory variants will require: (a) improved annotations for the genomic sequences that control gene expression, (b) a more complete understanding of the molecular mechanisms through which genetic variants, both coding and noncoding, can affect gene expression, and (c) better experimental tools for testing hypotheses about regulatory variants.
In this dissertation, I present conceptual and methodological advances that directly contribute to each of these goals. A recurring theme in all of these developments is the statistical modeling of protein-DNA interactions and its integration with other data types. First, I describe enhancer-FACS-Seq, a high-throughput experimental approach for screening candidate enhancer sequences to test for in vivo, tissue-specific activity. Second, I present an integrative computational analysis of the in vivo binding of NF-kappaB, a key regulator of the immune system, yielding new insights into how genetic variants can affect NF-kappaB binding. Next, I describe the first comprehensive survey of coding variation in human transcription factors and what it reveals about additional sources of genetic variation that can affect gene expression. Finally, I present SIFTED, a statistical framework and web tool for the optimal design of TAL effectors, which have been used successfully in genome editing and can thus be used to test hypotheses about regulatory variants. Together, these developments help fulfill key needs in the quest to understand the molecular basis of human phenotypic variation. / Biophysics
|
475 |
Modeling Rare Protein-Coding Variation to Identify Mutation-Intolerant Genes With Application to Disease
Samocha, Kaitlin E. 25 July 2017
Sequencing exomes—the 1% of the genome that codes for proteins—has increased the rate at which the genetic basis of a patient’s disease is determined. Unfortunately, when a patient does not carry a well-established pathogenic variant, it is extremely challenging to establish which of the tens of thousands of variants identified in that individual is contributing to their disease. In these situations, variants must be prioritized to make further investigation more manageable. In this thesis, we have focused on creating statistical frameworks and models to aid in the interpretation of rare variants and towards establishing gene-level metrics for variant prioritization.
We developed a sensitive and specific workflow to detect newly arising (de novo) variants from exome sequencing data of parent-child trios, and created a sequence-context based mutational model. This mutational model was the basis of a rigorous statistical framework to evaluate the significance of de novo variant burden not only globally, but also per gene. When we applied this framework to de novo variants identified in patients with an autism spectrum disorder, we found a global excess of de novo loss-of-function variants as well as two genes that harbored significantly more de novo loss-of-function variants than expected.
We also used the mutational model to predict the expected number of rare (minor allele frequency < 0.1%) variants in exome sequencing datasets of reference individuals. We found a significant depletion of missense and loss-of-function variants in a subset of genes, indicating that these genes are under strong evolutionary constraint. Specifically, we identified 3,230 genes that are intolerant of loss-of-function variation and that set of genes is enriched for established dominant and haploinsufficient disease genes. Similarly, we searched for regions within genes that were intolerant of missense variation. The most missense depleted 15% of the exome contains 83% of reported pathogenic variants found in haploinsufficient disease genes that cause severe disease. Additionally, both gene-level and region-level constraint metrics highlight a set of de novo variants from patients with a neurodevelopmental disorder that are more likely to be pathogenic, supporting the utility of these metrics when interpreting rare variants within the context of disease. / Medical Sciences
|
476 |
Improved Analysis of Nanopore Sequence Data and Scanning Nanopore Techniques
Szalay, Tamas 25 July 2017
The field of nanopore research has been driven by the need to inexpensively and rapidly sequence DNA. In order to help realize this goal, this thesis describes the PoreSeq algorithm that identifies and corrects errors in real-world nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modeling the possible sources of uncertainty that occur as DNA advances through the nanopore and then using this model to find the sequence that best explains multiple reads of the same region of DNA. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100X coverage. We also use the algorithm to assemble E. coli with 30X coverage and the λ genome at a range of coverages from 3X to 50X. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods.
This thesis also reports preliminary progress towards controlling the motion of DNA using two nanopores instead of one. The speed at which the DNA travels through the nanopore needs to be carefully controlled to facilitate the detection of individual bases. A second nanopore in close proximity to the first could be used to slow or stop the motion of the DNA in order to enable a more accurate readout.
The fabrication process for a new pyramidal nanopore geometry was developed in order to facilitate the positioning of the nanopores. This thesis demonstrates that two of them can be placed close enough to interact with a single molecule of DNA, which is a prerequisite for being able to use the driving force of the pores to exert fine control over the motion of the DNA.
Another strategy for reading the DNA is to trap it completely with one pore and to move the second nanopore instead. To that end, this thesis also shows that a single strand of immobilized DNA can be captured in a scanning nanopore and examined for a full hour, with data from many scans at many different voltages obtained in order to detect a bound protein placed partway along the molecule. / Engineering and Applied Sciences - Applied Physics
|
477 |
Statistical Methods for Large-Scale Integrative Genomics
Li, Yang 25 July 2017
In the past 20 years, we have witnessed significant advances in high-throughput genetic and genomic technologies. With the massive amounts of genomics data now being generated, there is a pressing need for statistical methods that can use them to make quantitative inferences on substantive scientific questions. My research has focused on statistical methods for large-scale integrative genomics. The human genome encodes more than 20,000 genes, yet the functions of about 50% (>10,000) of them remain unknown to date. Determining the functions of these poorly characterized genes is crucial for understanding biological processes and human diseases. In the era of Big Data, the availability of massive genomic data provides an unprecedented opportunity to identify associations between genes and predict their biological functions. Genome sequencing data and mRNA expression data are the two most important classes of genomic data. This thesis presents three research projects in self-contained chapters: (1) a statistical framework for inferring the evolutionary history of human genes and identifying gene modules with shared evolutionary history from genome sequencing data, (2) a statistical method to predict frequent and specific gene co-expression by integrating a large number of mRNA expression datasets, and (3) robust variable and interaction selection for high-dimensional classification problems under the discriminant analysis and logistic regression models.
Chapter 1. Humans have more than 20,000 genes, but the functions of most of them remain uncharacterized. Determining the function of poorly characterized genes is crucial for understanding biological processes and studying human diseases. Functionally associated genes tend to be gained and lost together during evolution, so identifying co-evolution of genes predicts gene-gene associations. In this chapter, we propose a mixture of tree-structured hidden Markov models for the gene evolution process, and a Bayesian model-based clustering algorithm to detect gene modules with shared evolutionary history (termed evolutionary conserved modules, ECMs). A Dirichlet process prior is adopted to estimate the number of gene clusters, and an efficient Gibbs sampler is developed for posterior computation. Through simulation studies and benchmarks on real data sets, we show that our algorithm outperforms traditional methods that use simple metrics (e.g. Hamming distance, Pearson correlation) to measure the similarity between genes' presence/absence patterns. We applied our methods to 1,025 canonical human pathway gene sets and found that a large portion of the detected gene associations are substantiated by other sources of evidence. The remaining genes have predicted functions that are high-priority candidates for verification by further biological experiments.
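As a rough illustration of the simple baseline metrics named above (not the tree-structured hidden Markov model proposed in the thesis), the following Python sketch computes the Hamming distance and Pearson correlation between gene presence/absence profiles. The profiles `gene_a`, `gene_b` and `gene_c` are made-up toy data, with each position standing for one hypothetical species (1 = gene present, 0 = gene absent).

```python
from math import sqrt


def hamming_distance(a, b):
    """Number of species in which the two presence/absence calls differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same species")
    return sum(x != y for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation between two equal-length numeric profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same species")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_a * var_b)


# Toy presence/absence profiles across eight hypothetical species.
gene_a = [1, 1, 0, 0, 1, 0, 1, 1]
gene_b = [1, 1, 0, 0, 1, 0, 1, 0]  # differs from gene_a in one species
gene_c = [0, 0, 1, 1, 0, 1, 0, 0]  # exact complement of gene_a

print(hamming_distance(gene_a, gene_b), pearson_correlation(gene_a, gene_b))
print(hamming_distance(gene_a, gene_c), pearson_correlation(gene_a, gene_c))
```

Both metrics treat species as independent observations and ignore the phylogenetic tree relating them, which is the kind of information a tree-structured model of gene gain and loss can take into account.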
Chapter 2. The availability of gene expression measurements across thousands of experimental conditions provides the opportunity to predict gene function based on shared mRNA expression. While many biological complexes and pathways are coordinately expressed, their genes may be organized into co-expression modules with distinct patterns in certain tissues or conditions, which can provide insight into pathway organization and function. We developed the algorithm CLIC (clustering by inferred co-expression, www.gene-clic.org) that clusters a set of functionally-related genes into co-expressed modules, highlights the most relevant datasets, and predicts additional co-expressed genes. Using a statistical Bayesian partition model, CLIC simultaneously partitions the input gene set into disjoint co-expression modules and weights the most relevant datasets for each module. CLIC then expands each module with additional members that co-express with the module’s genes more than the background model in the weighted datasets. We applied CLIC to (i) model the background correlation in each of 3,662 mouse and human microarray datasets from the Gene Expression Omnibus (GEO), (ii) partition each of 900 annotated complexes/pathways into co-expression modules, and (iii) expand each co-expression module with additional genes showing frequent and specific co-expression over multiple GEO datasets. CLIC provided very strong functional predictions for many completely uncharacterized genes, including a link between protein C7orf55 and the mitochondrial ATP synthase complex that we experimentally validated via CRISPR knock-out. CLIC software is freely available and should become increasingly powerful with the growing wealth of transcriptomic datasets.
Chapter 3. Discriminant analysis and logistic regression are fundamental tools for classification problems. Quadratic discriminant analysis has the ability to exploit interaction effects of predictors, but the selection of interaction terms is non-trivial and the Gaussian assumption is often too restrictive for many real problems. Under the logistic regression framework, we propose a forward-backward method, SODA, for variable selection with both main and quadratic interaction terms. In the forward stage, a stepwise procedure is conducted to screen for important predictors with both main and interaction effects; in the backward stage, SODA removes insignificant terms so as to optimize the extended BIC (EBIC) criterion. Compared with existing methods for variable selection in quadratic discriminant analysis (e.g., Murphy et al., 2010; Zhang and Wang, 2011; Maugis et al., 2011), SODA can deal with high-dimensional data in which the number of predictors is much larger than the sample size, and it does not require the joint normality assumption on predictors, leading to much enhanced robustness. Theoretical analysis establishes the consistency of SODA in the high-dimensional setting. Empirical performance of SODA is assessed on both simulated and real data and is found to be superior to all existing methods we have tested. For all three real datasets we have studied, SODA selected more parsimonious models achieving higher classification accuracies compared to other tested methods. / Statistics
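For readers unfamiliar with the extended BIC mentioned in Chapter 3, the sketch below evaluates the criterion in its commonly cited form, EBIC = -2 log L + k log n + 2γ log C(P, k), where k is the number of selected terms, n the sample size, P the number of candidate terms and γ a tuning constant. This is only an illustration of the criterion, not the SODA implementation: the function name `ebic`, the default `gamma=0.5`, the way candidate terms are counted and the numbers in the usage lines are all assumptions, since the abstract does not specify them.

```python
from math import comb, log


def ebic(log_likelihood, n_selected, n_samples, n_candidates, gamma=0.5):
    """Extended BIC for a model with n_selected of n_candidates terms.

    With gamma = 0 this reduces to the ordinary BIC; larger gamma
    penalizes model size more strongly when many candidates exist.
    """
    if not 0 <= n_selected <= n_candidates:
        raise ValueError("n_selected must lie between 0 and n_candidates")
    bic = -2.0 * log_likelihood + n_selected * log(n_samples)
    # Extra penalty: log of the number of models of this size.
    return bic + 2.0 * gamma * log(comb(n_candidates, n_selected))


# Hypothetical comparison: a 3-term model versus a 10-term model that
# fits slightly better, with 1,000 candidate terms and 50 samples.
small = ebic(log_likelihood=-100.0, n_selected=3, n_samples=50, n_candidates=1000)
large = ebic(log_likelihood=-95.0, n_selected=10, n_samples=50, n_candidates=1000)
print("prefer small model" if small < large else "prefer large model")
```

A backward elimination step of the kind described above would drop a term whenever doing so lowers this score.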
|
478 |
A comparative bioinformatic analysis of zinc binuclear cluster proteins
Mthombeni, Jabulani S January 2005
Members of the zinc binuclear cluster family are important fungal transcriptional regulators sharing a common DNA binding domain. Dal81p is a pleiotropic zinc binuclear cluster protein involved in the induction of the UGA genes required for the γ-aminobutyrate nitrogen catabolic pathway in Saccharomyces cerevisiae. The zinc binuclear cluster domain is dispensable for function in Dal81p, and little is known about other domains in this protein. The aim of the study was to explore the zinc binuclear cluster protein family using comparative bioinformatics as a complement to biochemical and structural approaches. A database of all zinc binuclear cluster proteins was compiled. A total of 118 zinc binuclear cluster proteins are reported in this work. Thirty-nine previously unidentified zinc binuclear cluster proteins were found. Four homologues of Dal81p were identified by homology searching. Important sequence motifs were identified in the aligned sequences of Dal81p and its homologues. The coiled-coil motif found in the Gal4p zinc binuclear cluster protein could not be identified in Dal81p and its homologues. This suggests that Dal81p does not dimerise through this structural motif as other zinc binuclear cluster proteins do. Solvent-accessible sites that could be phosphorylated by protein kinase C or casein kinase II, and the role of such sites in the possible regulation of Dal81p function, are discussed.
|
479 |
Transcriptome-based Gene Networks for Systems-level Analysis of Plant Gene Functions
Gupta, Chirag 17 November 2017
Present day genomic technologies are evolving at an unprecedented rate, allowing interrogation of cellular activities with increasing breadth and depth. However, we know very little about how the genome functions and what the identified genes do. The lack of functional annotations of genes greatly limits the post-analytical interpretation of new high-throughput genomic datasets. For plant biologists, the problem is much more severe. Less than 50% of all the identified genes in the model plant Arabidopsis thaliana, and only about 20% of all genes in the crop model Oryza sativa, have some aspects of their functions assigned. Therefore, there is an urgent need to develop innovative methods to predict and expand on the currently available functional annotations of plant genes. With open access catching the ‘pulse’ of modern day molecular research, an integration of the copious amount of transcriptome datasets allows rapid prediction of gene functions in specific biological contexts, which provides added evidence over traditional homology-based functional inference. The main goal of this dissertation was to develop data analysis strategies and tools broadly applicable in systems biology research.
Two user-friendly interactive web applications are presented: The Rice Regulatory Network (RRN) captures an abiotic-stress conditioned gene regulatory network designed to facilitate the identification of transcription factor targets during induction of various environmental stresses. The Arabidopsis Seed Active Network (SANe) is a transcriptional regulatory network that encapsulates various aspects of seed formation, including embryogenesis, endosperm development and seed-coat formation. Further, an edge-set enrichment analysis algorithm is proposed that uses network density as a parameter to estimate the gain or loss in correlation of pathways between two conditionally independent coexpression networks.
|
480 |
Identifying Genetic Factors Influencing Sperm Mobility Phenotype in Chicken using Genome Wide Association Studies, Primordial Germ Cell Transplantation, and RNAseq
Ojha, Sohita 06 December 2017
Sperm mobility is a major determinant of male fertility in chicken. In spite of the low heritability of reproductive traits, sperm mobility has a high heritability index, which suggests the presence of quantitative trait loci (QTLs) governing the trait. Our research focused on three objectives: i) to identify the QTLs affecting the low mobility phenotype in chicken, ii) to understand the impact of Sertoli cell and germ cell interactions in influencing the mobility phenotype, and iii) to identify the genes and gene networks differentially expressed in male and female PGCs. To detect the QTLs, a genome-wide association study (GWAS) was conducted, which revealed the presence of multiple minor alleles influencing the trait and indicated a role for epistasis. The second section of the research involved isolation, culture and transfer of primordial germ cells (PGCs) to create high line germ line chimera chickens carrying low line PGCs. We established the culture of chicken PGCs isolated from embryonic blood under feeder-free culture conditions but could not detect the presence of the low line genotype in the semen of transgenic males. Our final study involved RNA-sequencing (RNAseq) of male and female PGCs to identify differentially expressed genes from their transcriptomes. We identified five candidate genes: 3-hydroxy-3-methylglutaryl CoA reductase (HMGCA), germ cell-less (GCL), SWIM (zinc finger SWIM domain containing transcription factor), SLC1A1 (solute carrier family 1 member 1), UBE2R2L (ubiquitin conjugating enzyme) and validated their expression levels in male and female PGCs by RT-qPCR. GCL was exclusively expressed in males while SLC1A1 and UBE2R2L were expressed only in female cPGCs. The present study provides novel gender-specific germ cell markers in the broiler chicken. These results will help in elucidating the genetic programming of gender-specific germ line development in broilers.
|