531 |
Alinhamento múltiplo de genomas de eucariotos com montagens altamente fragmentadas / Multiple alignment of large eukaryotic genomes with highly fragmented assemblies
George Willian Condomitti Epamino 04 August 2017
The advent of Next Generation Sequencing (NGS) in recent years has led to a marked increase in the number of genomic projects. Put simply, sequencing machines generate DNA fragments that are used by genome assembler software. These programs try to join the DNA fragments to obtain a complete representation of the genomic sequence (for example, a chromosome) of the species being sequenced. In some cases the assembly process is easier for organisms with small genomes (e.g. bacteria with a genome of approximately 5 Mbp), through pipelines that automate most of the task. A trickier scenario arises when the species has a very large genome (above 1 Gbp) containing repeated elements, as in the case of some eukaryotes. In those cases the resulting assembly usually consists of thousands of fragments (called contigs), far more than the number of chromosomes estimated for the organism (usually a two-digit number), giving rise to a highly fragmented assembly. A common activity in these projects is comparing the assembly with that of another genome, both as a form of validation and to identify regions conserved between organisms. Although pairwise alignment of large genomes is handled well by existing approaches, multiple alignment of large genomes with highly fragmented assemblies remains a difficult task because of its time and computational requirements. This work presents a methodology for multiple alignment of large eukaryotic genomes with highly fragmented assemblies, a problem that few solutions are able to cope with.
Our implementation, based on star alignment, was able to produce multiple sequence alignments (MSAs) of groups of assemblies with different levels of fragmentation. The largest of them, a set of 5 reptilian genomes in which the B. jararaca assembly (800,000 contigs, N50 of 3.1 Kbp) was used as anchor, took 14 hours of execution time to provide a map of conserved regions among the participating species. The algorithm was implemented in open-source software named FROG (FRagment Overlap multiple Genome alignment), available under the terms of the General Public License v3 (GPLv3).
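The abstract above describes fragmentation through the N50 statistic. As a minimal sketch, assuming the usual definition (the contig length at which the longest contigs together cover at least half of the total assembly), N50 can be computed as follows. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from FROG.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Assumed definition: walk the contigs from longest to shortest and
    return the length of the contig at which the running total first
    reaches half of the summed assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly (lengths in bp), not data from the thesis.
toy_assembly = [100, 80, 60, 40, 20]
print(n50(toy_assembly))
```

A low N50 relative to genome size, as reported for the anchor assembly, indicates that most of the sequence sits in short contigs.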
|
532 |
Comparação de genomas completos de especies da familia Vibrionacea empregando rearranjo de genomas / A rearrangement-based approach to compare whole genomes of Vibrionacea strains
Cogo, Patricia Pilisson 23 February 2007
Advisor: João Meidanis / Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação / Made available in DSpace on 2018-08-11T06:12:01Z (GMT).
Previous issue date: 2008 / Abstract: Advances in genomic sequencing techniques have produced a large amount of genomic data. The challenge now is to analyze these data and extract relevant biological information from them. In this context, taxonomic analysis of complete genome sequences is still an open problem. It is especially critical for prokaryotes, which still lack a clear definition of species and whose taxonomic classification is continually evolving, so that comparison of complete genomes may well play a significant role. In this work, we propose a methodology to compare complete genomes based on genome rearrangement theory. We applied our method to organisms of the Vibrionaceae family, a heterogeneous family that comprises organisms from five different genera, including the agent responsible for cholera, a severe disease that still causes thousands of deaths every year in developing countries. The genomic distances obtained when we analysed each of the two chromosomes that make up the genome of these organisms separately are in agreement with phylogenetic trees built using other genomic comparison methods. This result supports our method and the use of genome rearrangement theory as an alternative for analyzing complete genomes. It may also indicate that the events modelled in this work, such as gene loss, horizontal transfer and reversals, play an important role in the genomic evolution of these organisms. Understanding the dynamics of these events, and combining it with other genomic analysis methods, can play an important role in the construction of a more accurate vibrio phylogeny / Master's / Computational Biology / Master in Computer Science
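To give a feel for what a rearrangement-based distance measures, here is a minimal sketch that counts breakpoints between a signed gene order and the identity order. This is a deliberately simple stand-in and is not the method of the dissertation, which also models events such as gene loss and horizontal transfer; the function name `breakpoints` and the example gene orders are assumptions made for illustration.

```python
def breakpoints(signed_order):
    """Count breakpoints of a signed gene order relative to 1, 2, ..., n.

    The order is framed with 0 and n + 1. A consecutive pair (a, b) is an
    adjacency when b - a == 1, which also holds inside a reversed block
    such as (-3, -2). Every other pair is a breakpoint.
    """
    n = len(signed_order)
    if sorted(abs(g) for g in signed_order) != list(range(1, n + 1)):
        raise ValueError("expected a signed permutation of 1..n")
    framed = [0] + list(signed_order) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


# Hypothetical gene orders: genes 2 and 3 have been reversed as a block.
reference_like = [1, 2, 3, 4, 5]
after_reversal = [1, -3, -2, 4, 5]
print(breakpoints(reference_like), breakpoints(after_reversal))
```

A single reversal creates at most two breakpoints, which is why breakpoint counts are often used as a quick lower-bound style indicator of how many rearrangements separate two gene orders.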
|
533 |
Investigation on Genetic Modifiers of Age at Onset of Major Depressive Disorder
Gedik, Huseyin 01 January 2017
Major Depressive Disorder (MDD) is a complex multifactorial disorder that can lead to disability. Environmental and genetic factors are involved in MDD etiology. The aim of this project was to identify loci modifying age at onset (AAO) of MDD using survival models after adjusting for Childhood Sexual Abuse (CSA). To achieve this aim, a dataset was made available by the China Oxford and VCU Experimental Research on Genetic Epidemiology (CONVERGE) consortium. The study population had 5,220 controls and 5,282 cases with MDD. We performed two univariate association analyses using Cox Proportional Hazard (Cox PH) models: one on the Full Sample (FS) of cases and controls, and one on the Case Cohort (CC) only. No genome-wide significant associations were found in the univariate analyses. Subsequent gene set enrichment analysis showed significant enrichments in neurological Gene Ontology terms and in some novel non-neural pathways. These findings may allow us to better understand MDD pathology.
|
534 |
Métagénomique comparative de novo à grande échelle / Large scale de novo comparative metagenomics
Benoit, Gaëtan 29 November 2017
Metagenomics studies the genomic content of a sample extracted from a natural environment. Among available analyses, comparative metagenomics aims at estimating the similarity between two or more environmental samples at the genomic level.
The traditional approach compares the samples based on their content in known, identified species. However, this method is biased by the incompleteness of reference databases. By contrast, de novo comparative metagenomics does not rely on a priori knowledge. Sample similarity is estimated by counting the number of similar DNA sequences between datasets. A metagenomic project typically generates hundreds of datasets. Each dataset contains tens of millions of short DNA sequences ranging from 100 to 150 base pairs (called reads). When this thesis began, comparing such an amount of data with the usual methods would have taken years. This thesis presents novel de novo approaches to quickly compute the similarity between numerous datasets. The main idea underlying our work is to use the k-mer (word of size k) as the unit of comparison between metagenomes. The main method developed during this thesis, called Simka, computes several similarity measures by replacing the species counts classically used with counts of large k-mers (k > 21). Simka scales to today's metagenomic projects thanks to a new strategy for counting the k-mers of many datasets in parallel. Experiments on data from the Human Microbiome Project and Tara Oceans show that the similarities computed by Simka are well correlated with reference-based and OTU-based similarities. Simka processed these projects (more than 30 billion reads distributed across hundreds of datasets) in a few hours. It is currently the only tool able to scale to such projects while providing precise and extensive comparison results.
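The idea of replacing species counts with k-mer counts can be sketched in a few lines. The example below computes the Bray-Curtis dissimilarity, used here as one representative ecological measure, between two tiny read sets. It is an illustrative sketch only and not Simka's implementation: the function names, the toy reads and the small k are assumptions, and reverse-complement (canonical) k-mers and parallel counting are ignored.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two k-mer count tables.

    0.0 means identical abundance profiles, 1.0 means no shared k-mers.
    """
    shared = sum(min(counts_a[kmer], counts_b[kmer])
                 for kmer in counts_a.keys() & counts_b.keys())
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * shared / total


# Hypothetical toy samples; real projects use k > 21 and millions of reads.
sample_a = kmer_counts(["ACGTAC"], k=3)
sample_b = kmer_counts(["ACGTTT"], k=3)
print(bray_curtis(sample_a, sample_b))
```

Computing this naively for every pair of datasets does not scale, which is the gap the parallel k-mer counting strategy described above is meant to close.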
|
535 |
A Bayesian Group Sparse Multi-Task Regression Model for Imaging Genomics
Greenlaw, Keelin 26 August 2015
Recent advances in technology for brain imaging and high-throughput genotyping have motivated studies examining the influence of genetic variation on brain structure. In this setting, high-dimensional regression for multi-SNP association analysis is challenging because the brain imaging phenotypes are multivariate and there is a desire to incorporate a biological group structure among SNPs based on the genes to which they belong. Wang et al. (Bioinformatics, 2012) have recently developed an approach for simultaneous estimation and SNP selection based on penalized regression with regularization based on a novel group l_{2,1}-norm penalty, which encourages sparsity at the gene level. A problem with the proposed approach is that it only provides a point estimate. We solve this problem by developing a corresponding Bayesian formulation based on a three-level hierarchical model that allows for full posterior inference using Gibbs sampling. For the selection of tuning parameters, we consider techniques based on: (i) a fully Bayes approach with hyperpriors, (ii) empirical Bayes with implementation based on a Monte Carlo EM algorithm, and (iii) cross-validation (CV). When the number of SNPs is greater than the number of observations, we find that both the fully Bayes and empirical Bayes approaches overestimate the tuning parameters, leading to overshrinkage of regression coefficients. To understand this problem we derive an approximation to the marginal likelihood and investigate its shape under different settings. Our investigation sheds some light on the problem and suggests the use of cross-validation or its approximation with WAIC (Watanabe, 2010) when the number of SNPs is relatively large. Properties of our Gibbs-WAIC approach are investigated using a simulation study and we apply the methodology to a large dataset collected as part of the Alzheimer's Disease Neuroimaging Initiative. / Graduate
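To make the gene-level penalty more concrete, the sketch below evaluates a group l_{2,1}-type norm on a small coefficient matrix, assuming the common definition in which the squared coefficients of all SNPs in a gene (across all phenotypes) are summed, square-rooted, and then added up over genes. The function name, the matrix layout (rows are SNPs, columns are phenotypes) and the toy values are assumptions for illustration, not code from the thesis or from Wang et al.

```python
import math


def group_l21_penalty(weights, groups):
    """Sum over gene groups of the Euclidean norm of each group's block.

    weights: list of rows, one row of coefficients per SNP
             (one column per imaging phenotype).
    groups:  dict mapping a gene label to the row indices of its SNPs.
    """
    total = 0.0
    for snp_rows in groups.values():
        block_sq = sum(w * w for row in snp_rows for w in weights[row])
        total += math.sqrt(block_sq)
    return total


# Hypothetical example: 3 SNPs, 2 phenotypes, 2 genes.
toy_weights = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
toy_groups = {"geneA": [0], "geneB": [1, 2]}
print(group_l21_penalty(toy_weights, toy_groups))
```

Because each gene contributes an unsquared norm, the penalty can drive an entire gene's block of coefficients to zero at once, which is the gene-level sparsity mentioned in the abstract.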
|
536 |
Comparative Genomics of Aspergillus flavus S and L Morphotypes Yields Insight into Niche Adaption
Ohkura, Mana January 2017
This dissertation consists of three manuscripts prepared for publication: Appendix A presents a genomic comparison of Aspergillus flavus isolates with different morphologies, and Appendices B and C present the identification and systematics of an emerging snake pathogen, Ophidiomyces ophiodiicola. The comparative genomics project on A. flavus tests the hypothesis that isolates with different morphologies within the species are adapted to different niches. Our results reveal differences in genome structure and protein content that are implicated in niche adaptation to the soil and phyllosphere. The systematics project on O. ophiodiicola was initiated to resolve the frequent misidentification of emerging reptilian diseases that is occurring in the literature. One of these emerging pathogens, O. ophiodiicola, was incorrectly described in the genus Chrysosporium because of its resemblance in spore morphology; therefore, the taxonomy of the genus was revised. We hope the review will aid in accurate identification and tracking of emerging reptilian diseases to better understand their epidemiology.
|
537 |
Evolinc: A Tool for the Identification and Evolutionary Comparison of Long Intergenic Non-coding RNAs
Nelson, Andrew D. L., Devisetty, Upendra K., Palos, Kyle, Haug-Baltzell, Asher K., Lyons, Eric, Beilstein, Mark A. 09 May 2017
Long intergenic non-coding RNAs (lincRNAs) are an abundant and functionally diverse class of eukaryotic transcripts. Reported lincRNA repertoires in mammals vary, but are commonly in the thousands to tens of thousands of transcripts, covering ~90% of the genome. In addition to elucidating function, there is particular interest in understanding the origin and evolution of lincRNAs. Aside from mammals, lincRNA populations have been sparsely sampled, precluding evolutionary analyses focused on their emergence and persistence. Here we present Evolinc, a two-module pipeline designed to facilitate lincRNA discovery and characterize aspects of lincRNA evolution. The first module (Evolinc-I) is a lincRNA identification workflow that also facilitates downstream differential expression analysis and genome browser visualization of identified lincRNAs. The second module (Evolinc-II) is a genomic and transcriptomic comparative analysis workflow that determines the phylogenetic depth to which a lincRNA locus is conserved within a user-defined group of related species. Here we validate lincRNA catalogs generated with Evolinc-I against previously annotated Arabidopsis and human lincRNA data. Evolinc-I recapitulated earlier findings and uncovered an additional 70 Arabidopsis and 43 human lincRNAs. We demonstrate the usefulness of Evolinc-II by examining the evolutionary histories of a public dataset of 5,361 Arabidopsis lincRNAs. We used Evolinc-II to winnow this dataset to 40 lincRNAs conserved across species in Brassicaceae. Finally, we show how Evolinc-II can be used to recover the evolutionary history of a known lincRNA, the human telomerase RNA (TERC). These latter analyses revealed unexpected duplication events as well as the loss and subsequent acquisition of a novel TERC locus in the lineage leading to mice and rats. The Evolinc pipeline is currently integrated in CyVerse's Discovery Environment and is free for use by researchers.
|
538 |
Molecular Characterisation of the Brassinosteroid, Phytosulfokine and cGMP-dependent Responses in Arabidopsis thaliana
Kwezi, Lusisizwe January 2010
Philosophiae Doctor - PhD / In this thesis, we first cloned and expressed the domains that harbour the putative catalytic GC domain in these receptor molecules and demonstrate that these molecules can convert GTP to cGMP in vitro. Second, we show that exogenous application of either Phytosulfokine or Brassinosteroid induces changes in intracellular cGMP levels in Arabidopsis mesophyll protoplasts, demonstrating that these molecules have GC activity in vivo and therefore provide a link, as second messenger, between the hormones and downstream responses. To elucidate the relationship between the kinase and GC domains of the PSK receptor, we used the AtPSKR1 receptor as a model and show that it has Serine/Threonine kinase activity using the Ser/Thr peptide 1 as a substrate. In addition, we show that the receptor's ability to phosphorylate a substrate is affected by the product (cGMP) of its co-domain (GC), and that the receptor autophosphorylates on serine residues, a step that was also observed to be affected by cGMP. When Arabidopsis plants are treated with a cell-permeable analogue of cGMP, we note that this changes the Arabidopsis phosphoproteome, and we therefore conclude that cGMP plays a role in kinase-dependent downstream signalling. The results suggest that the receptor molecules investigated here belong to a novel class of GCs that contain both a cytosolic kinase domain and a GC domain, and thus have a domain organisation not dissimilar to that of the atrial natriuretic peptide receptors NPR1 and NPR2. The findings also strongly suggest that cGMP has a role as a second messenger in both Brassinosteroid and Phytosulfokine signalling. We speculate that other proteins with similar domain organisations may also have dual catalytic activities and that a significant number of GCs, both in plants and animals, remain to be discovered and characterised. / South Africa
|
539 |
Characterisation of a susceptibility locus for inflammatory arthritis
Steel, Kathryn Jean Audrey January 2014
Inflammatory arthritis (IA) types such as rheumatoid arthritis (RA), juvenile idiopathic arthritis (JIA) and psoriatic arthritis (PsA) have been shown to exhibit common clinical features. As complex diseases they have a known genetic component, some of which is known to be shared. The aim of this study was to assess the genetic overlap between 3 types of IA (RA, JIA and PsA) using genotype data generated on the Immunochip array and to select a biologically promising overlapping region for further genetic and functional investigation. Overlap analysis was performed using association data generated for a large cohort of inflammatory arthritis cases and shared controls (11,475 RA; 2,816 JIA; 929 PsA respectively). 50 genetic regions were identified as being associated with more than 1 type of IA (p < 1 × 10^-3), with several interesting similarities and differences observed between the diseases. As several of the overlapping regions detected represented novel disease associations, they required replication in an independent sample cohort. 12 variants were selected for replication in an independent RA cohort of 3,879 cases and 2,561 controls. Of these, 2 variants in the CTLA4 and MTMR3 regions were successfully replicated in RA at p < 0.05. Bioinformatics analysis was performed for the 50 overlapping regions, with one particularly promising region, RUNX1, selected for further investigation. In this region, the same variant (rs9979383) is associated across the 3 diseases, with similar odds ratios (OR 0.8-0.9) observed in each disease. As this region represented a novel IA association and had not been densely genotyped on the Immunochip array, fine mapping was performed by genotyping 51 SNPs in 3,491 cases and 2,359 controls. This resulted in replication of the association at rs9979383 (p = 0.02) with no additional significant genetic effects detected; therefore this variant was selected for further functional analysis.
As rs9979383 lies ~280 kb upstream of the RUNX1 gene, a cis-eQTL analysis was performed to determine whether the variant acts by regulating RUNX1 gene expression. This was performed in whole blood and in CD4+ and CD8+ lymphocytes from 75 (and a subset of 23) healthy volunteers respectively. No significant eQTLs were detected between rs9979383 and RUNX1 in whole blood (p = 0.9) or between rs9979383 and RUNX1 or LOC100506403 in CD4+ and CD8+ lymphocytes (p = 0.1). This study has provided insight into the genetic similarities and differences between different types of inflammatory arthritis, which can be applied to further investigations into disease susceptibility. Although no significant cis-eQTL was detected in any of these tissues with either RUNX1 or the nearby lncRNA LOC100506403, in cells from healthy volunteers under unstimulated conditions, these findings will direct future functional investigations into the role of this overlapping region in susceptibility to IA.
|
540 |
Investigating the Origin and Functions of a Novel Small RNA in Escherichia coli
Kacharia, Fenil Rashmin 08 June 2016
Non-coding small RNAs (sRNAs) regulate various cellular processes in bacteria. They bind to the chaperone protein Hfq for stability and regulate gene expression by base-pairing with target mRNAs. Although the importance of sRNAs in bacteria has been well established, how novel sRNA genes originate is still elusive, mainly because the rapid rate of evolution of sRNAs obscures their original sources. To overcome this impediment, we identified a recently formed sRNA (EcsR2) in E. coli and show that it evolved from a degraded bacteriophage gene. Our analyses also revealed that young sRNAs such as EcsR2 are expressed at low levels and evolve at a rapid rate in comparison to older sRNAs, thereby uncovering a novel process that potentially allows newly emerging (and probably mildly deleterious) sRNAs to persist in bacterial genomes. We also show that even though EcsR2 is slightly deleterious to E. coli, it can bind to Hfq and mRNAs to regulate the expression of several genes. Interestingly, while EcsR2 expression is induced by glucose, the expression of its putative targets is regulated by the transcription factor CRP in response to glucose, indicating that EcsR2 has been incorporated into the carbon regulatory network in E. coli. Collectively, this work provides evidence for the emergence, evolution and functions of a novel "young" sRNA in bacteria.
|