Inferring dinoflagellate genome structure, function, and evolution from short-read high-throughput RNA-Seq
Gibbons, Theodore Robert (19 February 2016)
Dinoflagellates are a diverse and ancient lineage of globally abundant algae that have adapted to fill a wide array of important ecological roles. Despite their importance, dinoflagellate genomes remain relatively poorly understood because of their enormous size. It is suspected that dinoflagellate genomes have expanded through rampant gene duplication, possibly using a lineage-specific mechanism that involves reinsertion of mature transcripts back into the genome, and that may rely on spliced leader trans-splicing for reactivation and processing of recycled transcripts. Draft genomes have recently been published for two extremely small endosymbiotic species. These genomes confirm expansion of nearly 10,000 gene families relative to other eukaryotes. In the more complete genome, evidence for transcript recycling based on relict spliced leader sequences was found in over 5,500 genes. Genomic efforts in larger dinoflagellates have focused instead on transcriptome sequencing, but transcriptomes assembled from short-read HTS data contain very little evidence for rampant gene duplication or for trans-splicing. I have shown that the apparent disagreement with hypotheses of ubiquitous trans-splicing and widespread gene duplication is the result of technological limitations. By leveraging the statistical power of high-throughput sequencing, I found that spliced leader suffixes as short as six nucleotides are sufficient for positive identification. I also found that isoform sequences from families of conserved paralogs are systematically collapsed during assembly, but that many of these consensus sequences can be identified using a custom SNP-calling procedure that can be combined with traditional clustering based on pairwise sequence alignment to obtain a more complete picture of gene duplication in dinoflagellates.
Efficient, automated homology detection based on pairwise sequence alignment is an equally challenging problem for which there is much room for improvement. I explored alternative metrics for scoring alignments between sequences using a popular procedure based on BLAST and Markov clustering, and showed that simplified metrics perform as well as or better than more popular alternatives. I also found that Markov clustering of protein sequences suffers from a serious false positive problem when compared against manual curation, suggesting that it is more appropriate for pre-clustering of very large data sets than as a complete clustering solution.
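As a rough illustration of the Markov clustering step mentioned above (not the dissertation's pipeline), the sketch below alternates expansion (matrix squaring) and inflation (element-wise powering followed by column normalization) on a small similarity matrix. The function names, the inflation value of 2.0, the added self-loops, and the toy scores are all assumptions made for this example; in practice the input would be alignment scores derived from BLAST hits.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic form)."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [
        [matrix[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain square matrix product, adequate for toy-sized inputs."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Cluster nodes of a symmetric similarity matrix with a basic MCL loop."""
    n = len(similarity)
    # Assumption: add a self-loop of weight 1.0 to every node, a common MCL convention.
    m = [
        [similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[x ** inflation for x in row] for row in m]   # inflation
        m = normalize_columns(m)
    # Rows that still carry flow act as attractors; their nonzero columns form a cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical scores: sequences 0-2 resemble each other, as do 3-4.
    toy = [
        [0.0, 1.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 1.0, 0.0],
    ]
    print(markov_cluster(toy))
```

Higher inflation values tend to produce finer clusters, which is one reason results can differ from manually curated families.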
Discovery and characterization of genetic variants associated with extreme longevity
Gurinovich, Anastasia (01 August 2019)
Over the last decade, there have been multiple genome-wide association studies (GWASs) of human extreme longevity (EL). However, only a limited number of genetic variants have been identified as significant, and only a few of these variants have been replicated in independent studies. There are two possible reasons for this limitation. First, genetic variants might have a varying effect on EL in different populations, and a GWAS applied to a dataset as a whole may not pinpoint such differences. Second, EL is a very rare trait in a population, and rare and uncommon variants might be important factors in explaining its heritability, but GWASs have focused on the analysis of variants that are relatively common in the population. In this dissertation, I present three projects that address these issues. First, I propose PopCluster: an algorithm that automatically discovers subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects' ethnicities. Second, I investigate ethnic-specific effects of APOE alleles on EL in Europeans. APOE is a well-studied gene with multiple effects on aging and longevity. The gene has 3 alleles: e2, e3 and e4, whose frequencies vary by ethnicity. I identify several ethnically different clusters in which the effect of the e2 and e4 alleles on EL changes substantially. Furthermore, I investigate the interaction of APOE alleles with the country of residence. Results of this analysis suggest possible interaction of this gene with dietary habits or other environmental factors. For the third project, I perform a GWAS of rare variants and EL in a case-control dataset in which the median age of cases is 104 years. I analyze 4.5 million rare SNPs with minor allele frequency < 0.05, imputed with high quality using the HRC panel.
The analysis replicates all SNPs previously found to be significant at the genome-wide level and identifies a few more potential targets. Additionally, I use serum protein data available for a subset of subjects and find significant pQTLs that have a potential functional role. Based on these analyses, both genetic and environmental factors appear to be important for EL. / 2020-07-31T00:00:00Z
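To make the rare-variant threshold above concrete, here is a minimal sketch of a minor allele frequency (MAF) filter. It assumes genotypes are coded as the count of alternate alleles per individual (0, 1, or 2) with `None` for missing calls; the function names, the variant identifiers, and the toy data are hypothetical and not taken from the dissertation.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one variant from 0/1/2 alternate-allele counts."""
    called = [g for g in genotypes if g is not None]  # skip missing calls
    if not called:
        raise ValueError("no called genotypes for this variant")
    # Each diploid individual contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def select_rare_variants(genotypes_by_variant, threshold=0.05):
    """Keep the variants whose MAF is strictly below the threshold."""
    return [
        variant
        for variant, genotypes in genotypes_by_variant.items()
        if minor_allele_frequency(genotypes) < threshold
    ]


if __name__ == "__main__":
    toy = {
        "variant_a": [0] * 95 + [1] * 5,   # 5 alternate alleles out of 200
        "variant_b": [0] * 50 + [1] * 50,  # 50 alternate alleles out of 200
    }
    print(select_rare_variants(toy))
```

Taking the minimum of the alternate and reference frequencies keeps the filter independent of which allele happens to be labeled as the alternate.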
Approaches for identifying lung cell type responses to perturbation
Corbett, Sean (01 August 2019)
Genomic profiling can reveal the underlying molecular responses to chemical perturbation, and characterizing these responses can increase understanding of the broader physiological effects of an exposure and inform clinical decision making. This approach has proven effective in understanding the effects of environmental exposures such as cigarette smoke on the airway epithelium, and how they may contribute to associated disease pathogenesis. Given the existing body of work that uses genomic profiling to understand environmental exposures, the approach is also applicable to emerging exposures such as electronic cigarettes, whose effects remain poorly understood. Further, current approaches for genomic profiling could be improved through the development of data resources and computational methods that can identify not only tissue- or sample-level molecular responses to perturbation, but also responses specific to individual cells or cell types. In light of these issues, I investigated the molecular response in airway epithelium to a novel inhaled exposure, and developed methods to support more detailed characterization of such effects. In this dissertation, I describe a clinical observational study in which I examined the gene expression effects of electronic cigarettes on the airway epithelium, and compare these effects to those of conventional cigarettes (Aim 1). Next, I describe CELDA, a novel computational method for identifying cell subpopulations and the co-expressed modules of genes that identify them in single cell RNA-seq (scRNA-seq) data (Aim 2). Finally, I describe the Lung Connectivity Map (Lung CMap), a platform for interrogating lung cell type specific responses to a large set of chemical and molecular perturbations (Aim 3).
Collectively, this work encompasses both observational and computational approaches for detailed characterization of the molecular responses to perturbation, and the determination of the relative effects of these novel perturbations versus their better-described counterparts. / 2021-07-31T00:00:00Z
Genomic analyses of transcription elongation factors and intragenic transcription
Chuang, James (01 August 2019)
Transcription of protein-coding genes in eukaryotic cells is carried out by the protein complex RNA polymerase II. During the elongation phase of transcription, RNA polymerase II associates with transcription elongation factors which modulate the activity of the transcription complex and are needed to carry out co-transcriptional processes. Chapters 2 and 3 of this dissertation describe studies of Spt6 and Spt5, two conserved transcription elongation factors. Spt6 is a transcription elongation factor thought to replace nucleosomes in the wake of transcription. Saccharomyces cerevisiae spt6 mutants express elevated levels of intragenic transcripts, transcripts appearing to initiate from within gene bodies. We applied high resolution genomic assays of transcription initiation to an spt6-1004 mutant, allowing us to catalog the full extent of intragenic transcription in spt6-1004 and show for the first time on a genome-wide scale that the intragenic transcripts observed in spt6-1004 are largely explained by new transcription initiation. We also assayed chromatin structure genome-wide in spt6-1004, finding a global depletion and disordering of nucleosomes. In addition to increased intragenic transcription in spt6-1004, our results also reveal an unexpected decrease in expression from most canonical genic promoters. Comparing intragenic and genic promoters, we find that intragenic promoters share some features with genic promoters. Altogether, we propose that the transcriptional changes in spt6-1004 are explained by a competition for transcription initiation factors between genic and intragenic promoters, which is made possible by a global decrease in nucleosome protection of the genome. Spt5 is another transcription elongation factor, important for the processivity of the transcription complex and many transcription-related processes. To study the requirement for Spt5 in vivo, we applied multiple genomic assays to Schizosaccharomyces pombe cells depleted of Spt5. 
Our results reveal an accumulation of RNA polymerase II over the 5′ ends of genes upon Spt5 depletion, and a progressive decrease in transcript abundance towards the 3′ ends of genes. This is consistent with a model in which Spt5 depletion causes transcription elongation defects and increases early termination. We also unexpectedly discover that Spt5 depletion causes hundreds of antisense transcripts to be expressed across the genome, primarily initiating from within the first 500 base pairs of genes. The expression of intragenic transcripts when transcription elongation factors are disrupted suggests that cells have evolved to prevent spurious intragenic transcription. However, some cases of intragenic transcription are consistently detected in wild-type cells, and some of these cases are known to be important for different biological functions. Chapter 4 of this dissertation describes our efforts to better understand the functions of intragenic transcription in wild-type cells by studying uncharacterized instances of intragenic transcription. To discover such instances, we applied high resolution genomic assays of transcription initiation to wild-type Saccharomyces cerevisiae under three stress conditions. For the condition of oxidative stress, we show that intragenic transcripts are generally expressed at lower levels than genic transcripts, and that many intragenic transcripts are likely to be translated at some level. By comparing intragenic transcription in three yeast species, we find that most examples of oxidative-stress-regulated intragenic transcription identified in S. cerevisiae are not conserved. Finally, we show that the expression of an oxidative-stress-induced intragenic transcript at the gene DSK2 is needed for S. cerevisiae to survive in conditions of oxidative stress.
Algorithms for reconstruction and analysis of metabolic networks, with an application to Neurospora crassa
Dreyfuss, Jonathan M. (12 March 2016)
In this work, I have developed optimization-based algorithms to reconstruct and analyze metabolic network models, and I have applied them to the metabolism of the filamentous fungus Neurospora crassa. The developed algorithms are: (1) LInear MEtabolite Dilution Flux Balance Analysis (limed-FBA), which predicts flux while linearly accounting for metabolite dilution; (2) One-step functional Pruning (OnePrune), which removes blocked reactions with a single compact linear program; and (3) Consistent Reproduction Of growth/no-growth Phenotype (CROP), which reconciles differences between in silico and experimental gene essentiality faster than previous approaches. Together, these algorithms comprise Fast Automated Reconstruction of Metabolism (FARM). FARM was applied to reconstruct the first genome-scale model of N. crassa metabolism. This organism has played a central role in the development of twentieth-century genetics, biochemistry and molecular biology, and continues to serve as a model organism for eukaryotic biology. The N. crassa model comprises 836 metabolic genes, 257 pathways, and 6 cellular compartments, and is supported by extensive manual curation of 491 literature citations. Against an independent test set of more than 300 essential/non-essential genes that were not used to train the model, it displays 93% sensitivity and specificity. The model was also used to simulate the biochemical genetics experiments originally performed on N. crassa by comprehensively predicting nutrient rescue of essential genes and synthetic lethal interactions, and providing detailed pathway-based mechanistic explanations of the predictions. The model provides a reliable computational framework for the integration and interpretation of ongoing experimental efforts in N. crassa, and the algorithms will enhance reconstruction and analysis of high-quality genome-scale metabolic models in general.
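For readers unfamiliar with the validation metrics quoted above, the following sketch shows how sensitivity and specificity could be computed when predicted gene essentiality is compared with experimental observations. Treating "essential" as the positive class is an assumption made here, and the function name and toy labels are hypothetical rather than drawn from the dissertation.

```python
def sensitivity_specificity(predicted_essential, observed_essential):
    """Compare predicted and observed essentiality calls (True = essential)."""
    if len(predicted_essential) != len(observed_essential):
        raise ValueError("prediction and observation lists differ in length")
    pairs = list(zip(predicted_essential, observed_essential))
    true_pos = sum(1 for p, o in pairs if p and o)
    false_neg = sum(1 for p, o in pairs if not p and o)
    true_neg = sum(1 for p, o in pairs if not p and not o)
    false_pos = sum(1 for p, o in pairs if p and not o)
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("both essential and non-essential genes are required")
    sensitivity = true_pos / (true_pos + false_neg)  # essential genes recovered
    specificity = true_neg / (true_neg + false_pos)  # non-essential genes recovered
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical calls for five genes.
    predicted = [True, True, False, False, True]
    observed = [True, False, False, False, True]
    print(sensitivity_specificity(predicted, observed))
```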
Transcriptional and translational regulation of sex-specific genes in mouse liver
Steinhardt, George (21 February 2019)
With the advent of high-throughput sequencing technology, modeling molecular mechanisms for gene regulatory networks has expanded to include the epigenome. Using diverse high-throughput DNA sequencing platforms, previous studies have revealed such mechanisms for sex-differential gene regulation in mouse liver. This thesis describes the contribution of transcription factor (TF) HNF6 to these models. Further, the utility of digital genomic footprinting (DGF) using the DNase-I hypersensitive sites sequencing (DNase-Seq) assay or the Assay for Transposase Accessible Chromatin (ATAC-Seq) is demonstrated. Finally, this thesis characterizes the extent of post-translational control of genes active in mouse liver using the ribosome profiling assay (Ribo-Seq), by way of translational efficiency (TE), and uses Ribo-Seq to interrogate open reading frames from previously characterized untranslated regions of protein-coding genes and in a set of liver-expressed long non-coding RNA genes for evidence of translation. First, mouse liver binding sites for HNF6 are integrated for overlap with sex-biased DNase-I hypersensitivity sites, male-biased STAT5, and female-specific CUX2 binding sites. This analysis showed how epigenetic markers, together with HNF6, target specific sets of sex-biased genes, revealing specific mechanisms involving HNF6 that contribute to the sex-specificity of gene expression in mouse liver. Next, the limited utility of the DGF technique to predict TF-DNA interactions was demonstrated using publicly available datasets for 21 TFs using DNase-seq and ATAC-seq datasets and sequencing libraries prepared using chromatin as well as purified DNA. Additionally, a simple model is proposed that benchmarks performance of DNase-seq vs. ATAC-seq for the same set of 21 TFs. Finally, the extent to which liver-expressed genes are regulated by sex-differential TE was investigated using Ribo-Seq. Limited sex-differential TE was found. 
Further, this assay predicted novel peptides found in previously characterized non-coding open reading frames within untranslated regions of genes that may regulate TE of upstream genes, and in a set of liver-expressed long non-coding RNAs. / 2021-02-20T00:00:00Z
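The abstract does not spell out how translational efficiency (TE) is calculated. A common convention, assumed here, is the ratio of ribosome footprint abundance (Ribo-Seq) to mRNA abundance (RNA-Seq) for the same gene, with groups compared on a log2 scale. The sketch below follows that convention; the function names, the optional pseudocount, and the numbers are illustrative only.

```python
import math


def translational_efficiency(ribo_abundance, rna_abundance, pseudocount=0.0):
    """Ratio of normalized Ribo-Seq signal to normalized RNA-Seq signal for one gene."""
    denominator = rna_abundance + pseudocount
    if denominator <= 0:
        raise ValueError("mRNA abundance must be positive to compute TE")
    return (ribo_abundance + pseudocount) / denominator


def log2_te_ratio(te_group_a, te_group_b):
    """Log2 fold difference in TE between two groups (for example, male vs. female)."""
    if te_group_a <= 0 or te_group_b <= 0:
        raise ValueError("TE values must be positive")
    return math.log2(te_group_a / te_group_b)


if __name__ == "__main__":
    # Hypothetical normalized abundances for a single gene in two groups.
    te_a = translational_efficiency(ribo_abundance=20.0, rna_abundance=10.0)
    te_b = translational_efficiency(ribo_abundance=5.0, rna_abundance=10.0)
    print(te_a, te_b, log2_te_ratio(te_a, te_b))
```

Dividing by mRNA abundance separates a change in translation from a change that merely reflects transcript level, which is the distinction a sex-differential TE analysis depends on.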
Pathway activity analysis of bulk and single-cell RNA-Seq data
Jenkins, David (21 February 2019)
Gene expression profiling can produce effective biomarkers that can provide additional information beyond other approaches for characterizing disease. While these approaches are typically performed on standard bulk RNA sequencing data, new methods for RNA sequencing of individual cells have allowed these approaches to be applied at the resolution of a single cell. As these methods enter the mainstream, there is an increased need for user-friendly software that allows researchers without experience in bioinformatics to apply these techniques. In this thesis, I have developed new, user-friendly data resources and software tools to allow researchers to use gene expression signatures in their own datasets. Specifically, I created the Single Cell Toolkit, a user-friendly and interactive toolkit for analyzing single-cell RNA sequencing data and used this toolkit to analyze the pathway activity levels in breast cancer cells before and after cancer therapy. Next, I created and validated a set of activated oncogenic growth factor receptor signatures in breast cancer, which revealed additional heterogeneity within public breast cancer cell line and patient sample RNA sequencing datasets. Finally, I created an R package for rapidly profiling TB samples using a set of 30 existing tuberculosis gene signatures. I applied this tool to look at pathway differences in a dataset of tuberculosis treatment failure samples. Taken together, the results of these studies serve as a set of user-friendly software tools and data sets that allow researchers to rapidly and consistently apply pathway activity methods across RNA sequencing samples.
Immunogenomics of the Rhesus macaque, an animal model for HIV vaccine development
Ramesh, Akshaya (09 March 2017)
Human Immunodeficiency Virus (HIV) is a lentivirus that causes Acquired Immunodeficiency Syndrome (AIDS), resulting in the progressive failure of the immune system. Due to its rapid replication rate and high mutation frequency, the virus is able to evade the immune system and thwart an efficacious response. Current HIV infection prophylaxes and therapeutics are not optimal and there is an urgent need to develop an efficacious HIV vaccine. Recently, high-throughput sequencing of the Immunoglobulin (Ig) repertoire from HIV-infected humans and immunized Rhesus macaques has led to important insights into vaccines against HIV-1. Further elucidation of the antibody response in these crucial animal studies will require substantially greater power to analyze the Ig repertoires than is currently possible. Reliable information on macaque Ig genes is insufficient due to the incompleteness of the whole genome sequence (WGS) and the inherent difficulty of obtaining complete Ig sequences, given the complex and repetitive nature of the Ig loci. To address this issue, we have generated a high quality, annotated WGS with precisely annotated Ig loci from ten macaques. We used low-error, synthetic long reads generated by Illumina TruSeq technology, Illumina 150 bp paired-end reads (110X coverage) and Irys genome mapping technology to assemble the genome de novo. We employed a bait-and-sequence strategy using human Ig probes to capture macaque Ig genes for the accurate assembly and annotation of Ig genes and alleles. Together, these data will generate a complete Rhesus macaque genome with detailed information on allelic diversity at the Ig loci. This study is essential for making the macaque a viable model for adaptive immunity. In addition, it will provide information on the similarities and differences between macaque and human Ig genes that will aid in the design and interpretation of vaccine studies.
Optimization and machine learning methods for Computational Protein Docking
Zarbafian, Shahrooz (23 October 2018)
Computational Protein Docking (CPD) is defined as determining the stable complex of docked proteins given information about two individual partners, called receptor and ligand. The problem is often formulated as an energy/score minimization where the decision variables are the 6 rigid body transformation variables for the ligand in addition to more variables corresponding to flexibilities in the protein structures. The scoring functions used in CPD are highly nonlinear and nonconvex with a very large number of local minima, making the optimization problem particularly challenging. Consequently, most docking procedures employ a multistage strategy of (i) Global Sampling using a coarse scoring function to identify promising areas followed by (ii) a Refinement stage using more accurate scoring functions and possibly allowing more degrees of freedom. In the first part of this work, the problem of local optimization in the refinement stage is addressed. The goal of local optimization is to remove steric clashes between protein partners and obtain more realistic score values. The problem is formulated as optimization on the space of rigid motions of the ligand. Employing a recently introduced representation of the space of rigid motions as a manifold, a new Riemannian metric is introduced that is closely related to the Root Mean Square Deviation (RMSD) distance measure widely used in Protein Docking. It is argued that the new metric puts rotational and translational variables on equal footing as far as local changes in RMSD are concerned. The implications and modifications for gradient-based local optimization algorithms are discussed. In the second part, a new methodology for resampling and refinement of ligand conformations is introduced. The algorithm is a refinement method where the inputs to the algorithm are ensembles of ligand conformations and the goal is to generate new ensembles of refined conformations, closer to the native complex. 
The algorithm builds upon previous work and introduces several innovations: clustering the input conformations, performing dimensionality reduction using Principal Component Analysis (PCA), underestimating the scoring function, and resampling and refining new conformations. The performance of the algorithm on a comprehensive benchmark of protein complexes is reported. The third part of this work focuses on using a machine learning framework to address two specific problems in Protein Docking: (i) Constructing a machine learning model in order to predict whether a given receptor and ligand pair interact. This is of significant importance for constructing the so-called protein interaction networks, a critical step in the Drug Discovery process. The success of the algorithm is verified on a benchmark for discrimination between Biological and Crystallographic Dimers. (ii) A ranking scheme for output predictions of a protein docking server is devised. The machine learning model employs the features of the docking server predictions to produce a ranked list with the top ranked predictions having higher probability of being close to the native solution. Two state-of-the-art approaches to the ranking problem are presented and compared in detail, and the implications of using the superior approach for a structural docking server are discussed.
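The RMSD measure referred to above can be illustrated with a small sketch that applies a rigid motion to a set of ligand coordinates and measures how far the atoms moved. This is not the dissertation's manifold formulation; restricting the rotation to the z-axis, the function names, and the toy coordinates are simplifying assumptions for the example.

```python
import math


def rigid_transform(coords, angle, translation):
    """Rotate points about the z-axis by `angle` radians, then translate them."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [
        (cos_a * x - sin_a * y + tx, sin_a * x + cos_a * y + ty, z + tz)
        for x, y, z in coords
    ]


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (p - q) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for p, q in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Hypothetical two-atom "ligand" centered at the origin.
    ligand = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
    moved = rigid_transform(ligand, angle=0.0, translation=(3.0, 4.0, 0.0))
    rotated = rigid_transform(ligand, angle=math.pi, translation=(0.0, 0.0, 0.0))
    print(rmsd(ligand, moved), rmsd(ligand, rotated))
```

A pure translation changes RMSD by the translation length regardless of ligand shape, whereas the effect of a rotation grows with the ligand's spatial extent. That asymmetry is the kind of imbalance between rotational and translational variables that an RMSD-related metric is meant to even out.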
Combined Host and Microbial Metagenomic Next-Generation Sequencing: Applying Integrated Analysis Approaches for a Comprehensive Evaluation of Infectious Disease Response to Inform Diagnosis, Surveillance, and Treatment
Kalantar, Katrina (13 April 2019)
Infectious diseases are a leading cause of morbidity and mortality worldwide. Despite significant advancement in our understanding of infectious disease biology, existing microbiologic diagnostic tests often fail to identify etiologic pathogens in cases of suspected infection. Metagenomic next-generation sequencing (mNGS) offers the potential for a universal pathogen detection method, but analysis and interpretation of findings are challenging. This is especially true for lower respiratory tract infections (LRTIs) where mNGS data interpretation is complicated by the existence of a respiratory microbiome composed of pathobionts present in both health and disease.
To address the need for improved LRTI diagnostics, we first compared two fluid types commonly used for diagnosis of LRTI, showing that despite moderate microbiome differences, both mini-bronchoalveolar lavage (mBAL) and tracheal aspirate (TA) samples are suitable for identification of pathogens in the context of an infection. Then, we evaluated the utility of mNGS as a diagnostic for LRTI in a cohort of 92 TA samples from adults with acute respiratory failure. We developed methods for sifting putative pathogens from commensal microbiota as well as pathogen, microbiome diversity, and host gene expression metrics to identify LRTI-positive patients and differentiate them from critically ill controls with noninfectious acute respiratory illnesses. We applied the models developed for evaluation of LRTI status to several other cohorts and disease contexts to show their broad applicability.
The low sensitivity of existing clinical diagnostics results in an imperfect gold standard, complicating the development of mNGS-based biomarkers. We explored the impact of label noise on host gene expression classifiers and methods for circumventing the issue. First, we tested whether label-noise robust logistic regression approaches could improve classifier performance by enabling the use of a larger training set. Then, we tested whether variational autoencoders, an unsupervised dimensionality reduction approach, could generate novel insight from combined host and microbial mNGS data. Altogether, this work suggests that a single streamlined protocol offering an integrated genomic portrait of pathogen, microbiome, and host transcriptome may hold promise as a tool for diagnosis of infections and contextualization of patient response.