1 |
Exploitation automatisée des contextes métabolique et génomique pour l'annotation fonctionnelle des génomes prokaryotes / Automatically exploiting genomic and metabolic contexts to aid the functional annotation of prokaryote genomes
Smith, Adam Alexander Thil 03 February 2012
This thesis concerns the development of bioinformatic strategies that exploit genomic and metabolic contextual information to generate functional annotations for prokaryote genes. It comprises two main projects. The first focuses on sequence-orphan enzymatic activities. Roughly 27% of the activities defined by the International Union of Biochemistry and Molecular Biology are still sequence orphans today. For these, traditional bioinformatic approaches cannot propose candidate genes, so alternative, context-based approaches are needed. The CanOE strategy (fishing Candidate genes for Orphan Enzymes) was developed and added to the MicroScope bioinformatics platform for this purpose. It integrates genomic and metabolic information across thousands of prokaryote genomes in order to locate promising candidate genes for orphan activities. The mirror project focuses on protein families of unknown function. A collaborative project was set up at the Genoscope to formalise functional exploration strategies for prokaryote protein families. A pilot version was created for the “DUF849” Pfam family, for which a single enzymatic activity had recently been elucidated. Strategies for proposing alternative enzymatic activities and for establishing isofunctional sub-families were developed as part of this thesis, in order to guide bench experiments and to analyse their results.
|
2 |
Sifter-T: Um framework escalável para anotação filogenômica probabilística funcional de domínios protéicos / Sifter-T: A scalable framework for phylogenomic probabilistic protein domain functional annotation
Silva, Danillo Cunha de Almeida e 25 October 2013
Many software tools fall out of use because they are difficult to use. Even tools known for the quality of their results are abandoned in favour of tools that are faster or simpler to use or install. In the functional annotation field, Sifter (v2.0) is regarded as one of the best tools in terms of annotation quality, and it was recently ranked among the best functional annotation tools in the Critical Assessment of protein Function Annotation (CAFA) experiment. Nevertheless, it is still not widely used, probably because of usability issues and the framework's poor fit to high-throughput use.
The original SIFTER workflow consists of two main steps: retrieving the annotations for a list of genes, and generating a reconciled gene tree for the same list. From the gene tree, Sifter then builds a Bayesian network with the same structure, in which the leaves represent genes. The functional annotations of known genes are attached to these leaves and then propagated probabilistically along the Bayesian network to the leaves that have no prior information. At the end of the process, each gene of unknown function receives a list of putative Gene Ontology functions and their probabilities.
The main goal of this work is to optimize the original source code for better performance, potentially allowing it to be used at genome scale. While studying the data pre-processing workflow, we found opportunities for improvement and devised strategies to address them. The implemented strategies include: parallel threads; processing load balancing; revised algorithms that make better use of disk, memory and runtime; adaptation of the source code to the current formats of biological databases; improved user accessibility; a wider range of accepted input types; automated reconciliation of gene and species trees; sequence filtering to reduce the size of the analysis; and other minor changes. With these changes we obtained a speed-up of up to 87-fold in the annotation retrieval module and a 72.3% speed increase in the gene tree generation module (73.3% according to the Portuguese abstract) on quad-core machines, as well as a significant reduction in memory usage during the realignment phase.
The result is presented as Sifter-T (Sifter Throughput-optimized), an open source tool with better usability, speed and annotation quality than the original implementation of the Sifter workflow. Sifter-T was written in a modular fashion in the Python programming language; it is designed to simplify the annotation of complete genomes and proteomes, and its results are presented so as to make the researcher's work easier.
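The propagation step described in this abstract can be illustrated with a deliberately tiny sketch. It is not Sifter's model or code: the tree, the node names, the two-state "has the function / does not" variable and the values of `P_CHANGE` and `P_ROOT` are all assumptions made for illustration, and inference is done by brute-force enumeration, which is only feasible for toy trees.

```python
from itertools import product

# Hypothetical toy gene tree, stored as child -> parent (the root has no entry).
PARENT = {"n1": "root", "geneC": "root", "geneA": "n1", "geneB": "n1"}
NODES = ["root", "n1", "geneA", "geneB", "geneC"]
P_CHANGE = 0.1  # assumed probability that the function flips along one branch
P_ROOT = 0.5    # assumed prior that the ancestral gene has the function


def joint(state):
    """Joint probability of one full True/False assignment to every node."""
    p = P_ROOT if state["root"] else 1 - P_ROOT
    for child, parent in PARENT.items():
        p *= P_CHANGE if state[child] != state[parent] else 1 - P_CHANGE
    return p


def posterior(query, evidence):
    """P(query has the function | annotated leaves), by exhaustive enumeration."""
    num = den = 0.0
    for values in product([False, True], repeat=len(NODES)):
        state = dict(zip(NODES, values))
        if any(state[gene] != value for gene, value in evidence.items()):
            continue  # assignment contradicts a known annotation
        p = joint(state)
        den += p
        if state[query]:
            num += p
    return num / den


if __name__ == "__main__":
    # geneA is annotated with the function, geneC is annotated without it.
    print(posterior("geneB", {"geneA": True, "geneC": False}))
```

In this toy setting the unannotated leaf `geneB` inherits most of its probability from its sibling `geneA`, which is the qualitative behaviour the abstract describes; a real implementation would use an exact message-passing algorithm rather than enumeration.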
|
3 |
Using functional annotation to characterize genome-wide association results
Fisher, Virginia Applegate 11 December 2018
Genome-wide association studies (GWAS) have successfully identified thousands of variants robustly associated with hundreds of complex traits, but the biological mechanisms driving these results remain elusive. Functional annotation, describing the roles of known genes and regulatory elements, provides additional information about associated variants. This dissertation explores the potential of these annotations to explain the biology behind observed GWAS results.
The first project develops a random-effects approach to genetic fine mapping of trait-associated loci. Functional annotation and estimates of the enrichment of genetic effects in each annotation category are integrated with linkage disequilibrium (LD) within each locus and GWAS summary statistics to prioritize variants with plausible functionality. Applications of this method to simulated and real data show good performance across a wider range of scenarios than previous approaches. The second project focuses on the estimation of enrichment by annotation category. I derive the distribution of GWAS summary statistics as a function of annotations and LD structure and perform maximum likelihood estimation of enrichment coefficients in two simulated scenarios. The resulting estimates are less variable than those of previous methods, but the asymptotic theory of standard errors is often not applicable because the likelihood function is non-convex. In the third project, I investigate the problem of selecting an optimal set of tissue-specific annotations with the greatest relevance to a trait of interest. I consider three selection criteria defined in terms of the mutual information between functional annotations and GWAS summary statistics. These algorithms correctly identify enriched categories in simulated data. In the application to a GWAS of BMI, however, the overall association is weaker and the tissue-specific regulatory features are highly similar, so the penalty for redundant features outweighs the modest relationships with the outcome and the selected feature sets are empty.
All three projects require little in the way of prior hypotheses regarding the mechanism of genetic effects. These data-driven approaches have the potential to illuminate unanticipated biological relationships, but are also limited by the high dimensionality of the data relative to the moderate strength of the signals under investigation. These approaches advance the set of tools available to researchers to draw biological insights from GWAS results.
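To make the idea of a mutual-information criterion with a redundancy penalty concrete, here is a minimal sketch. It is not one of the three criteria from the dissertation, which the abstract does not spell out; it uses a generic relevance-minus-redundancy (mRMR-style) greedy rule, and the feature names and toy data are invented.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def greedy_select(features, outcome):
    """Greedy selection: relevance to the outcome minus mean redundancy.

    features: dict name -> list of discrete values (e.g. 0/1 annotation flags)
    outcome:  list of discrete values (e.g. binned GWAS statistics)
    Stops as soon as no remaining feature has a positive score.
    """
    selected = []
    remaining = set(features)

    def score(name):
        relevance = mutual_information(features[name], outcome)
        if not selected:
            return relevance
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance - redundancy

    while remaining:
        best = max(sorted(remaining), key=score)
        if score(best) <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Invented toy data: one informative annotation, an exact copy of it, and noise.
OUTCOME = [1, 1, 1, 1, 0, 0, 0, 0]
FEATURES = {
    "liver": [1, 1, 1, 0, 0, 0, 0, 0],
    "liver_copy": [1, 1, 1, 0, 0, 0, 0, 0],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
}

if __name__ == "__main__":
    print(greedy_select(FEATURES, OUTCOME))
```

With these toy inputs the duplicated annotation is rejected because its redundancy with the already selected feature exceeds its relevance, and a feature set that carries no information about the outcome yields an empty selection.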
|
4 |
Sifter-T: Um framework escalável para anotação filogenômica probabilística funcional de domínios protéicos / Sifter-T: A scalable framework for phylogenomic probabilistic protein domain functional annotation
Danillo Cunha de Almeida e Silva 25 October 2013
|
5 |
Análise computacional do genoma e transcritoma de Plasmodium vivax: contribuições da bioinformática para o estudo da malária / Computational analysis of the Plasmodium vivax transcriptome and genome: bioinformatics contributions for the malaria investigation
Corrêa, Bruna Renata Silva 02 April 2012
Plasmodium vivax is the human malaria parasite with the widest global distribution, and it impairs the quality of life of millions of people around the world. The general aim of this study was to contribute, through statistical and bioinformatics methodologies, to understanding the mechanism that P. vivax uses to escape elimination by the host spleen. First, we carried out the statistical analysis of a microarray transcriptomics experiment. This experiment had been conducted earlier by the collaborators of this study, using Aotus lemurinus griseimembra as the animal model, in order to identify P. vivax genes expressed only in parasites taken from monkeys with an intact spleen. In the second step, we designed a tiling array containing all the exons and the available 5'UTR and 3'UTR regions of the P. vivax genome, which will be used in further experiments to investigate the influence of the spleen on parasite gene expression. In the last step, we improved the functional annotation of the P. vivax genome through an automated methodology, adding information to support the biological interpretation of the results obtained earlier and of future studies.
|
6 |
Méthodes pour l'identification de domaines protéiques divergents / Functional annotation of divergent genomes : application to Leishmania parasite
Ghouila, Amel 16 December 2013
Determining the domain composition of a protein is a key step in determining its functions. One of the most widely used domain schemes is the Pfam database, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (profile HMM). When a new sequence is analysed, each Pfam HMM is used to compute a score measuring the similarity between the sequence and the domain. Applied to divergent proteins, however, this strategy lacks sensitivity and may miss several domains. This is the case for all eukaryotic pathogens, where no Pfam domains are detected in half or more of their proteins. The main objective of this thesis is to develop methods that improve the sensitivity of Pfam domain detection in divergent proteins. We first adapted the recently proposed CODD method and applied it to the whole set of pathogens in EuPathDB. A public database named EupathDomains (http://www.atgc-montpellier.fr/EuPathDomains/) gathers the known domains and the new domains detected by CODD, along with the associated confidence measurements and GO annotations. We then proposed other methods to further improve domain detection in these organisms. The “CODD_exclusive” algorithm integrates domain exclusion information to prune false positive domains that conflict with other domains of the protein, which improves the precision of the predictions. We also suggested using association rules to determine the co-occurrences between domains that are used in the certification process. The last part of the thesis focuses on the use of profile/profile methods to predict protein domains across a whole genome. Combined with the co-occurrence-based annotation procedure, this approach yields a notable improvement in both the number of certified domains and precision.
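The co-occurrence certification and the exclusion-based pruning described above can be sketched as follows. This is an illustration under assumptions, not the CODD or CODD_exclusive implementation: the function name `certify_domains`, the placeholder domain names and the pair sets are hypothetical, and the statistical tests that would produce the pair sets are left out.

```python
def certify_domains(confident, candidates, cooccurring, exclusive=frozenset()):
    """Return the candidate domains supported by co-occurrence and not excluded.

    confident:   set of domains already detected with a significant score
    candidates:  set of low-scoring candidate domains on the same protein
    cooccurring: set of frozenset({a, b}) pairs assumed to co-occur significantly
    exclusive:   set of frozenset({a, b}) pairs assumed never to occur together
    """
    certified = set()
    for cand in candidates:
        others = (confident | candidates) - {cand}
        # A candidate is supported if it forms a known pair with another domain.
        supported = any(frozenset({cand, o}) in cooccurring for o in others)
        # It is pruned if it conflicts with a confidently detected domain.
        conflicting = any(frozenset({cand, c}) in exclusive for c in confident)
        if supported and not conflicting:
            certified.add(cand)
    return certified


# Placeholder domain names; real input would be Pfam family identifiers.
CONFIDENT = {"DomA", "DomD"}
CANDIDATES = {"DomB", "DomC", "DomE"}
COOCCURRING = {frozenset({"DomA", "DomB"}), frozenset({"DomA", "DomC"})}
EXCLUSIVE = {frozenset({"DomC", "DomD"})}

if __name__ == "__main__":
    print(sorted(certify_domains(CONFIDENT, CANDIDATES, COOCCURRING)))
    print(sorted(certify_domains(CONFIDENT, CANDIDATES, COOCCURRING, EXCLUSIVE)))
```

Keeping the pair sets as sets of `frozenset` objects makes the lookup independent of the order of the two domains, which matches the symmetric notion of co-occurrence used in the abstract.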
|
7 |
Subsequence Feature Maps For Protein Function Annotation
Sarac, Omer Sinan 01 August 2008
With the advances in sequencing technologies, the number of protein sequences with unknown function is increasing rapidly. Hence, computational methods for the functional annotation of these protein sequences are of the utmost importance. In this thesis, we first defined a feature space mapping of protein primary sequences to fixed-dimensional numerical vectors. This mapping, which is called the Subsequence Profile Map (SPMap), takes into account the models of the subsequences of protein sequences. The resulting vectors were used as input to support vector machines (SVM) for the functional classification of proteins. Second, we defined the protein functional annotation problem as a classification problem and constructed a classification framework defined on Gene Ontology (GO) terms. Different classification methods, as well as their combinations, were assessed on this framework, which is based on 300 GO molecular function terms. The results showed that combination enhances the classification accuracy. The resulting system is made publicly available as an online function annotation tool.
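A mapping from variable-length sequences to fixed-dimensional vectors can be sketched as below. The abstract does not give the details of SPMap, so this sketch assumes that each dimension holds the score of the best-matching subsequence under one position-specific profile; the profiles, the `floor` pseudo-probability and the function names are invented for illustration.

```python
from math import log


def window_log_prob(window, profile, floor=1e-3):
    """Log-probability of a subsequence under a position-specific profile.

    profile: list of dicts, one per position, mapping residue -> probability.
    Residues missing from a column receive the assumed `floor` probability.
    """
    return sum(log(col.get(aa, floor)) for aa, col in zip(window, profile))


def feature_vector(sequence, profiles):
    """Map a sequence to one value per profile: its best-scoring window."""
    vector = []
    for profile in profiles:
        k = len(profile)
        windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
        # Sequences shorter than the profile get -inf for that dimension.
        vector.append(
            max((window_log_prob(w, profile) for w in windows),
                default=float("-inf"))
        )
    return vector


# Two hand-made toy profiles of length 2 (hypothetical values).
PROFILES = [
    [{"A": 0.9, "G": 0.1}, {"C": 0.8, "T": 0.2}],
    [{"K": 0.7, "R": 0.3}, {"K": 0.7, "R": 0.3}],
]

if __name__ == "__main__":
    for seq in ("MACD", "MKKLLAC"):
        print(seq, feature_vector(seq, PROFILES))
```

Whatever the sequence length, the output has one entry per profile, so the vectors can be fed to a standard classifier such as a support vector machine.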
|
8 |
Improving Clustering of Gene Expression Patterns
Jonsson, Per January 2000
The central question investigated in this project was whether gene expression patterns could be clustered in a more biologically accurate way by giving the clustering technique additional information about the genes as input, besides the expression levels. By biologically accurate we mean that genes should be clustered together not only according to the similarity of their expression profiles, but also according to their functional similarity in terms of functional annotation and metabolic pathway. The data was collected at AstraZeneca R&D Mölndal, Sweden, and the computational technique applied was self-organising maps. In our experiments we used expression profiles combined with enzyme classification annotation as input for the self-organising maps, instead of the expression profiles alone. The results were evaluated both statistically and biologically. The statistical evaluation showed that our method resulted in a small decrease in compactness and isolation. The biological evaluation showed that our method resulted in clusters with greater functional homogeneity with respect to enzyme classification, functional hierarchy and metabolic pathway annotation.
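One simple way to combine an expression profile with enzyme classification annotation into a single input vector is sketched below. The abstract does not say how the thesis encoded the annotation, so the one-hot encoding of the top-level EC class, the `weight` parameter and the default of six classes are assumptions made for illustration.

```python
def som_input(expression, ec_number=None, n_classes=6, weight=1.0):
    """Concatenate an expression profile with a one-hot top-level EC class.

    expression: sequence of expression levels for one gene
    ec_number:  string such as "2.7.1.1", or None for genes without annotation
    n_classes:  assumed number of top-level EC classes
    weight:     assumed scaling of the annotation part relative to expression
    """
    one_hot = [0.0] * n_classes
    if ec_number is not None:
        top_level = int(ec_number.split(".")[0])
        one_hot[top_level - 1] = weight
    return list(expression) + one_hot


def squared_distance(u, v):
    """Squared Euclidean distance, the usual similarity measure in a SOM."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


if __name__ == "__main__":
    kinase = som_input([0.5, -1.2, 0.3], "2.7.1.1")
    oxidoreductase = som_input([0.5, -1.2, 0.3], "1.1.1.1")
    # Identical expression, different EC class: the distance is no longer zero.
    print(squared_distance(kinase, oxidoreductase))
```

Raising `weight` makes the map favour functional similarity over expression similarity, which is one plausible reading of the trade-off between statistical compactness and functional homogeneity reported in the abstract.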
|
9 |
Improving Clustering of Gene Expression Patterns
Jonsson, Per January 2000
|
10 |
Análise computacional do genoma e transcritoma de Plasmodium vivax: contribuições da bioinformática para o estudo da malária / Computational analysis of the Plasmodium vivax transcriptome and genome: bioinformatics contributions for the malaria investigation
Bruna Renata Silva Corrêa 02 April 2012
|