Global ETD Search

1	The OGCleaner: Detecting False-Positive Sequence Homology Fujimoto, Masaki Stanley 01 June 2017 (has links) Within bioinformatics, phylogenetics is the study of the evolutionary relationships between different species and organisms. The genetic revolution has caused an explosion in the amount of raw genomic information that is available to scientists for study. While there has been an explosion in available data, analysis methods have lagged behind. A key task in phylogenetics is identifying homology clusters. Current methods rely on using heuristics based on pairwise sequence comparison to identify homology clusters. We propose the Orthology Group Cleaner (the OGCleaner) as a method to evaluate cluster level verification of putative homology clusters in order to create higher quality phylogenetic tree reconstruction. machine learning orthology clusters phylogenetics Computer Sciences
2	The Orthology Road Hernandez Rosales, Maribel 14 November 2013 (has links) (PDF) The evolution of biological species depends on changes in genes. Among these changes are the gradual accumulation of DNA mutations, insertions and deletions, duplication of genes, movements of genes within and between chromosomes, gene losses and gene transfer. As two populations of the same species evolve independently, they will eventually become reproductively isolated and become two distinct species. The evolutionary history of a set of related species through the repeated occurrence of this speciation process can be represented as a tree-like structure, called a phylogenetic tree or a species tree. Since duplicated genes in a single species also independently accumulate point mutations, insertions and deletions, they drift apart in composition in the same way as genes in two related species. The divergence of all the genes descended from a single gene in an ancestral species can also be represented as a tree, a gene tree that takes into account both speciation and duplication events. In order to reconstruct the evolutionary history from the study of extant species, we use sets of similar genes, with relatively high degree of DNA similarity and usually with some functional resemblance, that appear to have been derived from a common ancestor. The degree of similarity among different instances of the “same gene” in different species can be used to explore their evolutionary history via the reconstruction of gene family histories, namely gene trees. Orthology refers specifically to the relationship between two genes that arose by a speciation event, recent or remote, rather than duplication. Comparing orthologous genes is essential to the correct reconstruction of species trees, so that detecting and identifying orthologous genes is an important problem, and a longstanding challenge, in comparative and evolutionary genomics as well as phylogenetics. 
A variety of orthology detection methods have been devised in recent years. Although many of these methods are dependent on generating gene and/or species trees, it has been shown that orthology can be estimated at acceptable levels of accuracy without having to infer gene trees and/or reconciling gene trees with species trees. Therefore, there is good reason to look at the connection of trees and orthology from a different angle: How much information about the gene tree, the species tree, and their reconciliation is already contained in the orthology relation among genes? Intriguingly, a solution to the first part of this question has already been given by Boecker and Dress [Boecker and Dress, 1998] in a different context. In particular, they completely characterized certain maps which they called symbolic ultrametrics. Semple and Steel [Semple and Steel, 2003] then presented an algorithm that can be used to reconstruct a phylogenetic tree from any given symbolic ultrametric. In this thesis we investigate a new characterization of orthology relations, based on symbolic ultrametrics, for recovering the gene tree. According to Fitch’s definition [Fitch, 2000], two genes are (co-)orthologous if their last common ancestor in the gene tree represents a speciation event. On the other hand, when their last common ancestor is a duplication event, the genes are paralogs. The orthology relation on a set of genes is therefore determined by the gene tree and an “event labeling” that identifies each interior vertex of that tree as either a duplication or a speciation event. In the context of analyzing orthology data, the problem of reconciling event-labeled gene trees with a species tree appears as a variant of the reconciliation problem where gene trees have no labels in their internal vertices. When reconciling a gene tree with a species tree, it can be assumed that the species tree is correct or, in the case of an unknown species tree, it can be inferred. 
Therefore it is crucial to know for a given gene tree whether there even exists a species tree. In this thesis we characterize event-labelled gene trees for which a species tree exists and species trees to which event-labelled gene trees can be mapped. Reconciliation methods are not always the best options for detecting orthology. A fundamental problem is that, aside from multicellular eukaryotes, evolution does not seem to have conformed to the descent-with-modification model that gives rise to tree-like phylogenies. Examples include many cases of prokaryotes and viruses whose evolution involved horizontal gene transfer. To treat the problem of distinguishing orthology and paralogy within a more general framework, graph-based methods have been proposed to detect and differentiate among evolutionary relationships of genes in those organisms. In this work we introduce a measure of orthology that can be used to test graph-based methods and reconciliation methods that detect orthology. Using these results, a new algorithm, BOTTOM-UP, is devised to determine whether a map from the set of vertices of a tree to a set of events is a symbolic ultrametric. Additionally, a simulation environment is presented that is designed to generate large gene families with complex duplication histories, on which reconstruction algorithms can be tested and software tools can be benchmarked. Phylogenetik Phylogenomik Orthologien Phylogenetics Phylogenomics Orthology Reconciliation ddc:500
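Fitch's definition quoted in the abstract above can be illustrated with a small sketch. It is only an illustration, not the thesis's BOTTOM-UP algorithm: the toy tree, the gene names and the function names are all hypothetical. Each pair of genes is classified by the event label of its last common ancestor in an event-labeled gene tree.

```python
from itertools import combinations

# Hypothetical toy gene tree, stored as child -> parent. Genes a1 and a2
# descend from a duplication "d"; "d" and gene b1 descend from a speciation "s".
PARENT = {"a1": "d", "a2": "d", "d": "s", "b1": "s"}
# Hypothetical event labeling of the interior vertices.
EVENT = {"s": "speciation", "d": "duplication"}


def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def last_common_ancestor(x, y):
    """Return the first vertex on y's root path that also lies on x's root path."""
    on_x_path = set(ancestors(x))
    return next(v for v in ancestors(y) if v in on_x_path)


def classify_pairs(leaves):
    """Label every pair of genes as orthologs or paralogs (Fitch's definition)."""
    relation = {}
    for x, y in combinations(leaves, 2):
        event = EVENT[last_common_ancestor(x, y)]
        relation[(x, y)] = "orthologs" if event == "speciation" else "paralogs"
    return relation


relation = classify_pairs(["a1", "a2", "b1"])
```

In this toy tree the pair (a1, a2) meets at the duplication vertex and is therefore paralogous, while both pairs involving b1 meet at the speciation vertex and are orthologous.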
3	The relationship between orthology, protein domain architecture and protein function Forslund, Kristoffer January 2011 (has links) Lacking experimental data, protein function is often predicted from evolutionary and protein structure theory. Under the 'domain grammar' hypothesis the function of a protein follows from the domains it encodes. Under the 'orthology conjecture', orthologs, related through species formation, are expected to be more functionally similar than paralogs, which are homologs in the same or different species descended from a gene duplication event. However, these assumptions have not thus far been systematically evaluated. To test the 'domain grammar' hypothesis, we built models for predicting function from the domain combinations present in a protein, and demonstrated that multi-domain combinations imply functions that the individual domains do not. We also developed a novel gene-tree based method for reconstructing the evolutionary histories of domain architectures, to search for cases of architectures that have arisen multiple times in parallel, and found this to be more common than previously reported. To test the 'orthology conjecture', we first benchmarked methods for homology inference under the obfuscating influence of low-complexity regions, in order to improve the InParanoid orthology inference algorithm. InParanoid was then used to test the relative conservation of functionally relevant properties between orthologs and paralogs at various evolutionary distances, including intron positions, domain architectures, and Gene Ontology functional annotations. We found an increased conservation of domain architectures in orthologs relative to paralogs, in support of the 'orthology conjecture' and the 'domain grammar' hypotheses acting in tandem. However, equivalent analysis of Gene Ontology functional conservation yielded spurious results, which may be an artifact of species-specific annotation biases in functional annotation databases. 
I discuss possible ways of circumventing this bias so the 'orthology conjecture' can be tested more conclusively. / At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 6: Epub ahead of print. homology orthology paralogy gene duplications protein function prediction low-complexity regions protein domains domain architecture evolution introns intron position conservation orthology conjecture domain grammar hypothesis
4	Orthologs, turn-over, and remolding of tRNAs in primates and fruit flies Velandia-Huerto, Cristian A., Berkemer, Sarah J., Hoffmann, Anne, Retzlaff, Nancy, Romero Marroquín, Liiana C., Hernández-Rosales, Maribel, Stadler, Peter F., Bermúdez-Santana, Clara I. 05 September 2016 (has links) (PDF) Background: Transfer RNAs (tRNAs) are ubiquitous in all living organisms. They implement the genetic code so that most genomes contain distinct tRNAs for almost all 61 codons. They behave similarly to mobile elements and proliferate in genomes, spawning both local and non-local copies. Most tRNA families are therefore typically present as multicopy genes. The members of the individual tRNA families evolve under concerted or rapid birth-death evolution, so that paralogous copies maintain almost identical sequences over long evolutionary time-scales. To a good approximation these are functionally equivalent. Individual tRNA copies thus are evolutionarily unstable and easily turn into pseudogenes and disappear. This leads to a rapid turnover of tRNAs and often large differences in the tRNA complements of closely related species. Since tRNA paralogs are not distinguished by sequence, common methods cannot be used to establish orthology between tRNA genes. Results: In this contribution we introduce a general framework to distinguish orthologs and paralogs in gene families that are subject to concerted evolution. It is based on the use of uniquely aligned adjacent sequence elements as anchors to establish syntenic conservation of sequence intervals. In practice, anchors and intervals can be extracted from genome-wide multiple sequence alignments. Syntenic clusters of concertedly evolving genes of different families can then be subdivided by list alignments, leading to usually small clusters of candidate co-orthologs. 
On the basis of recent advances in phylogenetic combinatorics, these candidate clusters can be further processed by cograph editing to recover their duplication histories. We developed a workflow that can be conceptualized as stepwise refinement of a graph of homologous genes. We apply this analysis strategy with different types of synteny anchors to investigate the evolution of tRNAs in primates and fruit flies. We identified a large number of tRNA remolding events concentrated at the tips of the phylogeny. With one notable exception all phylogenetically old tRNA remoldings do not change the isoacceptor class. Conclusions: Gene families evolving under concerted evolution are not amenable to classical phylogenetic analyses since paralogs maintain identical, species-specific sequences, precluding the estimation of correct gene trees from sequence differences. This leaves conservation of syntenic arrangements with respect to "anchor elements" that are not subject to concerted evolution as the only viable source of phylogenetic information. We have demonstrated here that a purely synteny-based analysis of tRNA gene histories is indeed feasible. Although the choice of synteny anchors influences the resolution in particular when tight gene clusters are present, and the quality of sequence alignments, genome assemblies, and genome rearrangements limits the scope of the analysis, largely coherent results can be obtained for tRNAs. In particular, we conclude that a large fraction of the tRNAs are recent copies. This proliferation is compensated by rapid pseudogenization as exemplified by many very recent alloacceptor remoldings. Konzertierte Evolution tRNA Struktur Syntenie Orthologie concerted evolution tRNA remolding synteny orthology ddc:570 ddc:004
5	Prediction of Protein-Protein Interactions in Escherichia coli from Experimental Data in Treponema pallidum Abreu, Marco A 01 January 2015 (has links) Protein-protein interactions (PPIs) are thought to be conserved between species, although this has not been systematically investigated. This problem was explored in Escherichia coli from experimental data in Treponema pallidum by predicting PPIs, focusing on protein domains of little or unknown function. The comparison of T. pallidum to a model organism such as E. coli can not only reveal additional data about T. pallidum but also reveal how E. coli is similar to this distantly related, obligate parasite. A set of novel T. pallidum interactions, enriched for proteins of unknown function, was the basis of over 23,000 predicted homologous E. coli protein-protein and domain-domain interactions. Utilizing computational methods of protein analysis to define identity cross-species comparisons, this work shows that T. pallidum is nearly 61% similar to E. coli by orthologous groups (OG), demonstrating that what we knew of T. pallidum can be applied to E. coli. Observed binary interactions of that same pool of OGs result in only 4.3% shared T. pallidum interactions. Assigning function to proteins of unknown function leads to a greater understanding of how individual proteins relate to the larger interactome, the whole of interactions within a cell. Protein-protein interaction bioinformatics pfam COG OG eggNOG orthology homology Bioinformatics
6	Probability calculations of orthologous genes Lagervik Öster, Alice January 2005 (has links) The aim of this thesis is to formulate and implement an algorithm that calculates the probability for two genes being orthologs, given a gene tree and a species tree. To do this, reconciliations between the gene tree and the species tree are used. A birth and death process is used to model the evolution, and used to calculate the orthology probability. The birth and death parameters are approximated with a Markov Chain Monte Carlo (MCMC). An MCMC framework for probability calculations of reconciliations written by Arvestad et al. (2003) is used. Rules for orthologous reconciliations are developed and implemented to calculate the probability for the reconciliations that have two genes as orthologs. The rules were integrated with the Arvestad et al. (2003) framework, and the algorithm was then validated and tested. Phylogeny Reconciliation Orthology Birth and Death process Markov Chain Monte Carlo Bioinformatics Bioinformatik
7	Probability calculations of orthologous genes Lagervik Öster, Alice January 2005 (has links) The aim of this thesis is to formulate and implement an algorithm that calculates the probability for two genes being orthologs, given a gene tree and a species tree. To do this, reconciliations between the gene tree and the species tree are used. A birth and death process is used to model the evolution, and used to calculate the orthology probability. The birth and death parameters are approximated with a Markov Chain Monte Carlo (MCMC). An MCMC framework for probability calculations of reconciliations written by Arvestad et al. (2003) is used. Rules for orthologous reconciliations are developed and implemented to calculate the probability for the reconciliations that have two genes as orthologs. The rules were integrated with the Arvestad et al. (2003) framework, and the algorithm was then validated and tested. Phylogeny Reconciliation Orthology Birth and Death process Markov Chain Monte Carlo Bioinformatics Bioinformatik
8	Développement de méthodes évolutionnaires d'extraction de connaissance et application à des systèmes biologiques complexes / Development of evolutionary knowledge extraction methods and their application in biological complex systems Linard, Benjamin 15 October 2012 (has links) Systems biology has developed enormously over the last 10 years, with studies covering diverse biological levels (molecule, network, tissue, organism, ecology…). From an evolutionary point of view, systems biology provides unequalled opportunities. 
This thesis describes new methodologies and tools to study the evolution of biological systems, taking into account the multidimensional properties of biological parameters associated with multiple levels. Thus it addresses the clear need for novel methodologies specifically adapted to high-throughput evolutionary systems biology studies. By taking into account the multi-level aspects of biological systems, this work highlights new evolutionary trends associated with both intra- and inter-process constraints. In particular, this thesis includes (i) the development of an algorithm and a bioinformatics tool dedicated to comprehensive orthology inference and analysis for hundreds of species, (ii) the development of an original formalism for the integration of multi-scale variables allowing the synthetic representation of the evolutionary history of a given gene, (iii) the combination of this integrative tool with mathematical knowledge discovery approaches in order to highlight evolutionary perturbations in documented human biological systems (metabolic and signalling pathways...). Orthologie Extraction de connaissance Évolution Réseaux biologiques Orthology Knowledge extraction Evolution Biological networks 576 006.3
9	From Homologous Genes to Phylogenetic Species Trees: On Tree Representations of Binary Relations Wieseke, Nicolas 27 September 2017 (has links) Orthology and paralogy distinguish whether a pair of genes originated by a speciation or a gene duplication event, whereas xenology refers to horizontal gene transfer. These concepts play a key role in phylogenomics and species tree inference is one of its prevalent tasks. Commonly, species tree inference is performed using sequence-based phylogenetic methods which heavily rely on the initial data sets to be solely composed of 1:1 orthologs. Such approaches are strongly restricted to a small set of genes that provide information about the species tree. In this work, it is shown that the restriction to 1:1 orthologs is not necessary to reconstruct a reliable hypothesis on the evolutionary history of species. Besides orthology, knowledge on all three major driving forces of gene evolution can be considered: speciation, gene duplication, and horizontal gene transfer. The corresponding concepts of orthology, paralogy, and xenology imply binary relations on pairs of genes. These relations, in turn, convey meaningful phylogenetic information and allow the inference of plausible phylogenetic species trees. To this end, it is shown that orthology, paralogy, and xenology have to fulfill certain mathematical properties. In particular, they have to be representable as a tree – the so-called gene tree. This work investigates the theoretical concepts of tree representable sets of binary relations to unfold the underlying mathematical structure. Various novel characterizations for those relations are given and the close connection between tree representable sets of binary relations and cographs, symbolic ultrametrics, and so-called unp 2-structures is revealed. Based on the novel characterizations, polynomial-time recognition algorithms for tree representable sets of relations are presented. 
If a set of relations is tree representable, the corresponding tree representation can be found in polynomial time as well. Moreover, for the NP-complete problems of editing a given set of relations to its closest tree representable set, exact algorithms are developed by means of formulations as integer linear programs. Finally, all algorithms have been implemented in the software ParaPhylo, a species tree inference method based on orthology and paralogy data. It is demonstrated on simulated data sets, as well as real-life data sets, that non-trivial phylogenies can indeed be reconstructed from tree-free orthology estimates alone. info:eu-repo/classification/ddc/000 ddc:000
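The connection to cographs mentioned in the abstract above can be made concrete with a minimal sketch. It relies on the standard fact that a graph is a cograph exactly when it contains no induced path on four vertices (P4). The brute-force check below is an illustrative stand-in, not the recognition algorithm developed in the thesis or implemented in ParaPhylo, and all function names are hypothetical.

```python
from itertools import combinations, permutations


def has_induced_p4(vertices, edges):
    """Return True if some four vertices induce a path a-b-c-d."""
    adjacent = {frozenset(e) for e in edges}

    def linked(u, v):
        return frozenset((u, v)) in adjacent

    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            path_edges = linked(a, b) and linked(b, c) and linked(c, d)
            chords = linked(a, c) or linked(a, d) or linked(b, d)
            if path_edges and not chords:
                return True
    return False


def is_cograph(vertices, edges):
    """A graph is a cograph if and only if it has no induced P4."""
    return not has_induced_p4(vertices, edges)


# A path on four genes is the smallest graph that is not a cograph, so it
# cannot be a valid orthology relation for any event-labeled gene tree.
path_graph = is_cograph([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
# A 4-cycle is complete bipartite and therefore a cograph.
cycle_graph = is_cograph([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)])
```

The check examines every 4-subset of vertices, which is adequate for small examples only; linear-time cograph recognition algorithms exist for larger inputs.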
10	From Best Match Graphs to Gene Trees: A new perspective on graph-based orthology inference Geiß, Manuela 11 November 2019 (has links) Orthology detection is an important task within the context of genome annotation, gene nomenclature, and the understanding of gene evolution. With the rapidly accelerating pace at which new genomes become available, highly efficient methods are urgently required. As demonstrated in a large body of literature, reciprocal best match (RBH) methods are reasonably accurate and scale to large data sets. Nevertheless, they are far from perfect and prone to both false positive and false negative orthology calls. This work gives a complete characterization of best match as well as reciprocal best match graphs (BMGs and RBMGs) that arise at the first step of RBH methods. While BMGs as well as RBMGs with at most three species can be recognized in polynomial time, RBMGs with more than three species have a surprisingly complicated structure and it remains an open problem whether there exist polynomial time algorithms for the recognition of these RBMGs. In contrast to RBMGs, for which many (often mutually inconsistent) least resolved trees may exist, there is a unique least resolved tree for BMGs. This tree is a homeomorphic image of the true, but typically unknown, gene tree. Furthermore, in the absence of horizontal gene transfer (HGT), the reciprocal best match graph contains the orthology relation, suggesting that RBMGs can only contain false positive but no false negative orthology assignments. Simulation scenarios reveal that so-called good quartets, a certain graph pattern on four vertices in BMGs, can be used to successfully identify almost all false positive edges in RBMGs. Together with the existence of a unique least resolved tree, this suggests that BMGs contain a lot of valuable information for orthology inference that would be lost by exclusively considering RBMGs. 
These insights motivate the inclusion of additional BMG and RBMG editing steps in orthology detection pipelines, based on the presented theoretical results. Moreover, a workflow is introduced to infer best matches from sequence data by retrieving quartet structures from local information instead of reconstructing the whole gene tree. A crucial prerequisite for this pipeline is the choice of suitable outgroups. However, the empirical simulations also reveal that HGT events cause strong deviations of the orthology relation from the RBMG as well as good quartets that are no longer associated with false positive orthologs, suggesting the need for further investigation of the xenology relation. The directed Fitch xenology relation is characterized in terms of forbidden 3-vertex subgraphs and, moreover, a polynomial time algorithm for the recognition and the reconstruction of a unique least resolved tree is presented. The undirected Fitch relation, in contrast, is shown to be a complete multipartite graph, which does not provide any interesting phylogenetic information. In summary, the results of this work can be used to develop new methods for inferring orthology, paralogy, and HGT. They promise major improvements in the accuracy and the computational performance of RBH-based approaches. info:eu-repo/classification/ddc/000 ddc:000
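The difference between best matches and reciprocal best matches described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions: best matches are approximated here by the highest pairwise similarity score, as in typical RBH pipelines, rather than by the phylogenetic definition used in the thesis, and the scores, gene names and function names are hypothetical.

```python
# Hypothetical similarity scores between genes of species A (a1, a2)
# and genes of species B (b1, b2). Higher means more similar.
SCORES = {
    ("a1", "b1"): 90,
    ("a1", "b2"): 60,
    ("a2", "b1"): 80,
    ("a2", "b2"): 50,
}


def score(x, y):
    """Look up a symmetric similarity score."""
    return SCORES.get((x, y), SCORES.get((y, x)))


def best_matches(queries, targets):
    """Directed arcs (q, t) where t is a top-scoring target for q (ties kept)."""
    arcs = set()
    for q in queries:
        top = max(score(q, t) for t in targets)
        arcs |= {(q, t) for t in targets if score(q, t) == top}
    return arcs


def reciprocal_best_matches(genes_a, genes_b):
    """Keep only pairs that are best matches in both directions."""
    forward = best_matches(genes_a, genes_b)
    backward = best_matches(genes_b, genes_a)
    return {(a, b) for (a, b) in forward if (b, a) in backward}


forward_arcs = best_matches(["a1", "a2"], ["b1", "b2"])
reciprocal = reciprocal_best_matches(["a1", "a2"], ["b1", "b2"])
```

In this toy data the directed best match graph contains the arc from a2 to b1, but only the pair (a1, b1) is reciprocal. The one-directional arcs are the kind of information that is discarded when only the reciprocal graph is kept.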

Search results