11.
Automatic Discovery of Hidden Associations Using Vector Similarity: Application to Biological Annotation Prediction / Découverte automatique des associations cachées en utilisant la similarité vectorielle : application à la prédiction de l'annotation biologique
Alborzi, Seyed Ziaeddin, 23 February 2018
This thesis presents: 1) the development of a novel approach to find direct associations between pairs of elements linked indirectly through various common features; 2) the use of this approach to directly associate biological functions with protein domains (ECDomainMiner and GODomainMiner) and to discover domain-domain interactions; and 3) the extension of this approach to comprehensively annotate protein structures and sequences from their domains. ECDomainMiner and GODomainMiner are two applications that discover new associations of EC numbers and GO terms, respectively, with protein domains. They find a total of 20,728 and 20,318 non-redundant EC-Pfam and GO-Pfam associations, respectively, with F-measures of more than 0.95 with respect to a “Gold Standard” test set extracted from InterPro, a source of known associations. Compared to around 1500 manually curated associations in InterPro, ECDomainMiner and GODomainMiner yield a 13-fold increase in the number of available EC-Pfam and GO-Pfam associations. These function-domain associations are then used to annotate thousands of protein structures and millions of protein sequences whose domain composition is known but which currently lack experimental functional annotations. Using inferred function-domain associations and considering taxonomy information, thousands of annotation rules were generated automatically. These rules were then used to annotate millions of protein sequences in the TrEMBL database.
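The F-measure quoted above compares a set of predicted associations with a reference set. The following is a minimal sketch of that calculation, assuming the standard definition (the harmonic mean of precision and recall). The abstract does not state how the thesis counts true and false positives, so the function name `f_measure` and the placeholder identifiers are illustrative assumptions, not the author's implementation or data.

```python
def f_measure(predicted, gold):
    """Return (precision, recall, F1) for predicted pairs scored against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)  # pairs present in both sets
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up placeholder pairs (function label, domain label); not real EC-Pfam data.
gold = {("EC_A", "PF_1"), ("EC_B", "PF_2"), ("EC_C", "PF_3")}
predicted = {("EC_A", "PF_1"), ("EC_B", "PF_2"), ("EC_D", "PF_4")}

precision, recall, f1 = f_measure(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case two of the three predicted pairs appear in the reference set, so precision, recall, and F1 all equal 2/3.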
12.
MIDB: a biological data integration model / MIDB: um modelo de integração de dados biológicos
Perlin, Caroline Beatriz, 29 February 2012
In bioinformatics, a huge volume of data is produced on biomolecules and on nucleotide and amino acid sequences, almost all of which is stored in Biological Databases (BDBs). For a given sequence there are several categories of information: genomic data, evolutionary data, structural data, and others. Some BDBs store only one or a few of these categories. These BDBs are hosted on different sites and servers and run on different database management systems with different data models. Moreover, their instances and schemas may be semantically heterogeneous. In this scenario, the objective of this master's project is to propose a biological data integration model that adopts new schema integration and instance integration techniques. The proposed model has a special mechanism for schema integration and another mechanism that performs instance integration (supported by a dictionary), allowing conflicts in attribute values to be resolved; a clustering algorithm is used to group similar entities, and a domain specialist takes part in managing those clusters. The model was validated through a case study focusing on schema and instance integration of nucleotide sequence data for genes from organisms of the genus Actinomyces, obtained from four different data sources. As a result, approximately 97.91% of the attributes were correctly categorized in the schema integration, and the instance integration identified that approximately 50% of the clusters created need support from a specialist, which avoids errors in entity resolution. In addition, several contributions are presented, such as the attribute categorization, the clustering algorithm, the proposed distance functions, and the MIDB model itself.