1

GeneTUC: Natural Language Understanding in Medical Text

Sætre, Rune January 2006 (has links)
Natural Language Understanding (NLU) is a 50-year-old research field, but its application to molecular biology literature (BioNLU) is less than 10 years old. After the Human Genome Project and Celera published the complete human genome sequence in 2001, research exploded, shifting the NLU focus from domains such as news articles to molecular biology and medical literature. BioNLU is needed because almost 2000 new articles are published and indexed every day, and biologists need to know the existing knowledge relevant to their own research. So far, BioNLU results are not as good as those in other NLU domains, so more research is needed to create useful NLU applications for biologists.

The work in this PhD thesis is a “proof of concept”. It is the first to show that an existing Question Answering (QA) system can be successfully applied in the hard BioNLU domain once the essential challenge of unknown entities is solved. The core contribution is a system that automatically discovers and classifies unknown entities and the relations between them. The World Wide Web (through Google) is used as the main resource. Performance is almost as good as that of other named entity extraction systems, and the approach has the advantage of being much simpler and requiring less manual labor than any comparable system.

The first paper in this collection gives an overview of the field of NLU and shows how the Information Extraction (IE) problem can be formulated with Local Grammars. The second paper uses Machine Learning to automatically recognize protein names based on features from the GSearch Engine. In the third paper, GSearch is replaced with Google, and the task is to extract all unknown names belonging to one of 273 biomedical entity classes, such as genes, proteins, and processes. After promising results with Google, the fourth paper shows that this approach can also be used to retrieve interactions or relationships between the named entities. The fifth paper describes an online implementation of the system and shows that the method scales well to a larger set of entities.

The final paper concludes the “proof of concept” research and shows that the original GeneTUC NLU system has improved from handling 10% of the sentences in a large collection of abstracts in 2001 to 50% in 2006. This is still not good enough for a commercial system, but it is believed that another 40% performance gain can be achieved by importing more verb templates into GeneTUC, just as nouns were imported during this work. Work on this has already begun in the form of a local Master's thesis.
3

Méthodologie semi-formelle pour l’étude de systèmes biologiques : application à l'homéostasie du fer / Semi-formal methodology for biological systems study : application to iron homeostasis

Mobilia, Nicolas 29 September 2015 (has links)
The major part of this PhD is the development of a methodology for modeling biological systems. The methodology considers models based on differential equations and combines formal methods (an interval solver, an STL formula solver), analytical methods (steady-state stability analysis) and numerical methods (optimization algorithms, statistical analysis). It can incorporate many kinds of data, such as the behavioral response to a perturbation or quantitative data (half-lives, concentrations). In collaboration with a team of biologists, the methodology is successfully applied to the iron homeostasis network: we study the intracellular response of the system to iron depletion, mediated by specific regulatory proteins (IRP proteins). A major result of this study is improved knowledge of the intracellular iron concentration that cells need in order to proliferate: this concentration is identified through the study of the model and then confirmed experimentally.

The second part of the PhD is the development of a tool for modeling genetic regulatory networks using Thomas' formalism. This tool, developed in ASP (Answer Set Programming), can integrate many kinds of data, such as data on mutants or the existence of several steady states. It automatically avoids inconsistency when different hypotheses about the system contradict each other. It also infers biological properties such as the ordering of kinetic parameters.
4

Gerenciamento de workflows cientificos em bioinformatica / Management of bioinformatics scientific workflows

Digiampietri, Luciano Antonio 03 August 2007 (has links)
Advisors: João Carlos Setubal, Claudia Bauzer Medeiros. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação, 2007.

Abstract: Bioinformatics activities are growing all over the world, accompanied by a proliferation of data and tools. This brings new challenges, such as how to understand and organize these resources, how to share and reuse successful experiments (tools and data), and how to provide interoperability among data and tools that come from different sites and are used by users with distinct profiles. This thesis proposes a computational infrastructure to solve these problems. The infrastructure supports designing, reusing, annotating, validating, sharing and documenting bioinformatics experiments. Scientific workflows are the mechanism used to represent these experiments. Combining research on databases, scientific workflows, artificial intelligence and the Semantic Web, the infrastructure uses ontologies to support the specification and annotation of bioinformatics workflows and to serve as the basis for traceability mechanisms. Moreover, it uses artificial intelligence planning techniques to support automatic, iterative and supervised composition of tasks, so as to satisfy the needs of different kinds of users. Data integration and interoperability are addressed by combining ontologies, structure mapping and interface matching algorithms. The infrastructure was implemented in a prototype and validated on real bioinformatics data.

Doctorate / Information Systems / Doctor of Computer Science
6

Ett verktyg för konstruktion av ontologier från text / A Tool for Facilitating Ontology Construction from Texts

Chétrit, Héloèise January 2004 (has links)
With the growth of information stored on the Internet, especially in the biological field, and with discoveries being made daily in this domain, scientists are faced with an overwhelming number of articles. Reading all published articles is a tedious and time-consuming process, so a way to summarise the information in the articles is needed. One solution is to derive an ontology that represents the knowledge contained in the set of articles and allows users to browse through them.

In this thesis we present the tool Ontolo, which builds an initial ontology of a domain from a set of articles related to that domain that are inserted into the system. The quality of the ontology construction has been tested by comparing our ontology results for keywords to those provided by the Gene Ontology for the same keywords.

The results are quite promising for a first prototype of the system, as it finds many terms common to both ontologies from just a few hundred inserted articles.
7

Alinhamento múltiplo progressivo de sequências de proteínas / Progressive multiple alignment of protein sequences

Souza, Maria Angélica Lopes de 16 August 2018 (has links)
Advisor: Zanoni Dias. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação, 2010.

Abstract: Multiple sequence alignment is an important task in bioinformatics. It makes it possible to study evolutionary events and structural or functional constraints in protein, DNA, or RNA sequences, which helps in understanding the structure, function, and evolution of the genes that make up an organism. The goal of multiple sequence alignment is the best representation of how the sequences evolved over time, taking into account that different mutation events may have occurred. Finding an optimal multiple sequence alignment is an NP-hard problem. Several approaches have therefore been developed to find a heuristic solution that represents the real evolutionary scenario as well as possible, among them the progressive approach. Progressive alignment is one of the simplest ways to perform multiple alignment because it uses little memory and computation time. It is performed in three main stages: (i) determining the distances between the sequences to be aligned, (ii) constructing a guide tree from the distances, and (iii) building the multiple alignment guided by the tree. This work studied different methods for each stage of progressive alignment, and 342 aligners were built by combining these methods. Suitable input parameters for most aligners were determined empirically. Once suitable parameters had been defined for each type of aligner, the aligners were tested against two reference subsets of BAliBASE. The tests showed that the best aligners were those that use profile alignment to generate the multiple alignment, especially those that use an affine gap penalty function. In addition, among the aligners that group by consensus, those that use a logarithmic gap penalty function performed better.

Master's degree / Bioinformatics / Master of Computer Science
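The affine and logarithmic gap penalty functions mentioned in the abstract can be illustrated with a small sketch. The abstract gives neither the formulas nor the parameter values used in the dissertation, so the forms below are common textbook ones and the names and defaults (`gap_open`, `gap_extend`, `scale`) are assumptions chosen only for illustration.

```python
import math


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty: a fixed opening cost plus a constant cost for
    every additional gap position. Parameter values are illustrative."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def logarithmic_gap_penalty(length, gap_open=10.0, scale=2.0):
    """Logarithmic penalty: a fixed opening cost plus a term that grows
    with the natural log of the gap length, so long gaps cost
    relatively less. Parameter values are illustrative."""
    if length <= 0:
        return 0.0
    return gap_open + scale * math.log(length)


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        print(k, affine_gap_penalty(k), round(logarithmic_gap_penalty(k), 2))
```

Both functions charge the same amount for a gap of length 1 under these defaults; they differ in how quickly the cost grows as the gap gets longer, which is the property an aligner's scoring scheme trades off.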
9

Modélisation, purification et caractérisation des modules et domaines de la PI4KA humaine / Molecular modeling, purification and characterisation of the human PI4KA modules and domains

Taveneau, Cyntia 08 September 2015 (has links)
The eukaryotic lipid kinase phosphatidylinositol 4-kinase III alpha (PI4KA) is a ubiquitous enzyme that synthesizes the plasma membrane pool of phosphatidylinositol 4-phosphate, PtdIns(4)P. This important phosphoinositide has key roles in different signaling pathways, vesicular traffic and organelle identity. Moreover, human PI4KA is an essential factor for hepatitis C virus (HCV) replication: its interaction with the non-structural HCV protein NS5A at the endoplasmic reticulum membrane leads to the formation of a “membranous web” that gives the membrane the signature required to assemble the viral replication machinery.

PI4KA is a large protein (2102 residues, 240 kDa for human PI4KA). Its kinase domain makes up the approximately 400 C-terminal residues and is preceded by an Armadillo repeat domain for which no function is known. There is essentially no structural information about the 1500 N-terminal residues, and the function of most of this region of PI4KA is unknown.

To learn more about the three-dimensional structure of human PI4KA, we used computational methods to delineate and model its modules and domains, and in particular to identify fragments amenable to soluble production in Escherichia coli and insect cells. We cloned and expressed these fragments and evaluated the soluble fraction of each construct. Our results suggest that PI4KA can be described as a two-module protein. The N-terminal module (1100 residues) is composed of two domains, one of which is an alpha solenoid. Their potential arrangement was defined by small-angle X-ray scattering (SAXS). The second module (1000 residues), the C-terminal module, is the core enzyme. Its analysis led us to identify a striking similarity with the PIKK serine/threonine kinases, such as mTOR, which are related to the phosphatidylinositol 3-kinases. Three putative domains were delineated at the beginning of this C-terminal module; we named them DI, DII and DIII. Our collaborators have shown that they are essential for the kinase activity of PI4KA and for HCV replication. The DI domain was characterized and allowed the validation of a new parametrization of the N,N-dimethyl-dodecylamine oxide (LDAO) molecule for molecular dynamics simulations. Finally, full-length human PI4KA was expressed in insect cells and purified, and a first membrane interaction experiment has been initiated.
10

Évolution de familles de gènes par duplications et pertes : algorithmes pour la correction d’arbres bruités

Doroftei, Andrea 02 1900 (has links)
Genes are the segments of a genome that code for proteins. Genes of one or more species can be grouped into gene families based on their sequence similarity. Sequence similarity alone, however, is not enough to determine the functional relationships among the gene copies of a family, because it carries no direct information about how the family evolved through duplication, speciation and loss. It is precisely this information that allows us to distinguish between orthologous gene copies, which have evolved by speciation and are more likely to have preserved a common function, and paralogous gene copies, which have evolved by duplication and have probably acquired new functions.

Given a gene family present in n species, a gene tree (inferred by a standard phylogenetic method) and a phylogenetic tree of those species, reconciliation between the gene tree and the species tree is the most commonly used approach to infer a history of duplications, speciations and losses for the family. The main criticism of reconciliation methods is that the inferred history depends strongly on the gene tree considered. Indeed, just a few misplaced leaves in the gene tree can lead to a completely different history, possibly with significantly more duplications and losses. It is therefore important to have a preliminary method for "correcting" the gene tree, i.e. removing potentially misplaced leaves.

N. El-Mabrouk and C. Chauve introduced "non-apparent duplications" as nodes that are likely to result from the misplacement of a leaf in the gene tree. Simply put, such a node indicates that one or more triplets contradict the phylogeny given by the species tree. This work considers the problem of eliminating the non-apparent duplications from a given gene tree with a minimum number of leaf removals, so that every duplication node inferred by reconciliation is an apparent duplication and the gene tree agrees with the species phylogeny. For a certain class of trees, defined by how these nodes are arranged in the gene tree, the algorithm introduced here is exact and runs in O(n log n) time. The general case is handled by a heuristic.
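The distinction between apparent and non-apparent duplications can be made concrete with a small reconciliation sketch. This is not the thesis's correction algorithm. It is a minimal illustration that assumes the usual definitions from the reconciliation literature, which the abstract does not spell out: each gene tree node is mapped to the lowest species tree node containing all of its species (the LCA mapping), a node is a duplication if it maps to the same species tree node as one of its children, and a duplication is apparent if its two subtrees share at least one species. The nested-tuple tree encoding and the `species_of` naming convention (`"a_1"` is a gene of species `"a"`) are also assumptions made for the example.

```python
def species_clusters(species_tree):
    """Return the leaf set of every node of a species tree given as
    nested tuples of species names."""
    clusters = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        clusters.append(leaves)
        return leaves

    walk(species_tree)
    return clusters


def lca_map(species_set, clusters):
    """LCA mapping: the smallest species tree cluster containing the set."""
    return min((c for c in clusters if species_set <= c), key=len)


def classify_duplications(gene_tree, species_tree,
                          species_of=lambda gene: gene.split("_")[0]):
    """List (node, kind) for every duplication node of a binary gene tree,
    where kind is 'apparent' or 'non-apparent'."""
    clusters = species_clusters(species_tree)
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([species_of(node)])
        left, right = (walk(child) for child in node)
        here = left | right
        mapped = lca_map(here, clusters)
        if mapped in (lca_map(left, clusters), lca_map(right, clusters)):
            kind = "apparent" if left & right else "non-apparent"
            found.append((node, kind))
        return here

    walk(gene_tree)
    return found


if __name__ == "__main__":
    species_tree = (("a", "b"), "c")
    # Two copies of a gene in species a: a duplication visible in the leaves.
    print(classify_duplications((("a_1", "a_2"), "b_1"), species_tree))
    # Leaf c_1 grouped with a_1 contradicts the species tree.
    print(classify_duplications((("a_1", "c_1"), "b_1"), species_tree))
```

In the second gene tree, removing the single leaf `c_1` (or `b_1`) leaves a tree with no duplication node at all, which is the kind of leaf-removal correction the abstract describes.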
