1 |
Implementing Effective Biocuration Process, Training, and Quality Management Protocols on Undergraduate Biocuration of Amyotrophic Lateral Sclerosis
True, Rachel Wilcox 18 August 2015
Biocuration is the manual collection, annotation, and validation of information from the scientific literature on biological and model organisms into a single database. Successful biocuration processes combine an extensive collection of literature, user-friendly database interfaces for entering and analyzing data from published papers, and highly regulated training and quality assurance protocols. With the rapid expansion of biomedical literature and the magnitude of data now available in published papers, an efficient and accurate biocuration process has become more valuable. As the biocuration process incorporates undergraduates, it is critical that the medium for data collection be simple, ergonomic, and infallible. A reconstructed FileMaker Pro database was introduced to previously trained undergraduate students for process evaluation. The new database interface had two goals: streamlining the biocuration process and grouping the data structure more intuitively. A rigorous training program and a strict quality management protocol are needed to prepare the lab for the introduction of efficient biocuration processes. During the database design process, training protocols were drafted to call the biocurator's attention to important changes in the interface design. After the database was prototyped, entry errors were reviewed, training protocols were adjusted, and quality protocols were drafted. When the combination of undergraduate biocurators and the reconstructed database under these new protocols was compared with statistics from the biocuration field, the results showed an increase in both productivity and accuracy rates. With such efficiency at the undergraduate level, subject matter experts will no longer be required to perform this type of research and can focus on analysis. This will increase research productivity and reduce costs in the overall biocuration process.
With over 12,000 papers on Amyotrophic Lateral Sclerosis published on PubMed in 2014 alone, this revolutionary combination could lead to quickly finding a suitable cure for these patients.
|
2 |
Evaluation of Word and Paragraph Embeddings and Analogical Reasoning as an Alternative to Term Frequency-Inverse Document Frequency-based Classification in Support of Biocuration
Sullivan, Daniel Edward 07 June 2016
This research addresses the question of whether unsupervised learning can generate a representation that improves on the commonly used term frequency-inverse document frequency (TF-IDF) representation by capturing semantic relations. The analysis measures the quality of sentence classification using TF-IDF representations and finds a practical upper limit to precision and recall in a biomedical text classification task (F1-score of 0.85). Arguably, one could use ontologies to supplement TF-IDF, but ontologies are sparse in coverage and costly to create. This prompts a related question: can unsupervised learning capture semantic relations at least as well as existing ontologies, and thus supplement existing sparse ontologies? A shallow neural network implementing the Skip-Gram algorithm is used to generate semantic vectors from a corpus of approximately 2.4 billion words. The ability to capture meaning is assessed by comparing the generated semantic vectors with MeSH. Results indicate that semantic vectors trained by unsupervised methods capture comparable levels of semantic features in some cases, such as amino acids (92% of similarity represented in MeSH), but perform substantially worse on more expansive topics, such as pathogenic bacteria (37.8% of similarity represented in MeSH). Possible explanations for this difference in performance are proposed, along with a method to combine manually curated ontologies with semantic vector spaces to produce a more comprehensive representation than either alone. Semantic vectors are also used as representations for paragraphs, which, when used for classification, achieve an F1-score of 0.92. The results of the classification and analogical reasoning tasks are promising, but a formal model of semantic vectors, subject to the constraints of known linguistic phenomena, is needed.
This research includes initial steps toward a formal model of semantic vectors based on a combination of linear algebra and fuzzy set theory, subject to the semantic molecularism linguistic model. The research is novel in its analysis of semantic vectors applied to the biomedical domain, its analysis of different performance characteristics in biomedical analogical reasoning tasks, its comparison of the semantic relations captured by vectors and by MeSH, and its initial development of a formal model of semantic vectors. / Ph. D.
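For readers unfamiliar with the two quantities this abstract relies on, the following is a minimal sketch of a TF-IDF weighting and of the F1-score. It is not the author's implementation: the weighting variant (relative term frequency times the log of inverse document frequency), the function names, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Documents must be non-empty lists of tokens.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall (assumes non-zero denominators)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus, already tokenized.
docs = [
    ["gene", "encodes", "protein"],
    ["protein", "binds", "dna"],
    ["patient", "cohort", "study"],
]
vectors = tf_idf(docs)
```

In this toy corpus, "protein" occurs in two of the three sentences and therefore receives a lower weight than "gene", which occurs in only one. TF-IDF weights depend only on term counts, so synonyms are treated as unrelated terms, which is the limitation that the semantic vectors are meant to address.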
|
3 |
Modelagem computacional de famílias de proteínas microbianas relevantes para produção de bioenergia / Computational modeling of microbial protein families relevants to bioenergy production process.
Rego, Fernanda Orpinelli Ramos do 17 August 2015
Hidden Markov models (HMMs) are essential tools for the automated annotation of protein sequences. For many years, protein family resources based on HMMs have been made available to the scientific community (e.g. TIGRfams). Much effort has also been devoted to the automated generation of protein family HMMs (e.g. PANTHER). However, manually curated protein family HMMs remain the gold standard for genome annotation. In this context, the main objective of this work was to generate approximately 80 HMM-based families of microbial proteins relevant to bioenergy production. To generate the HMMs, we followed a manual curation protocol developed in this work. We start from a protein that has an experimentally proven function, is associated with a publication, and has been manually annotated with Gene Ontology terms created by the MENGO (Microbial ENergy Gene Ontology) project. The next steps consist of (1) defining selection criteria for including members in the family; (2) searching for members via BLAST; (3) generating the multiple alignment (MUSCLE 3.7) and the HMM (HMMER 3.0); (4) analyzing the results and iterating the process, with the preliminary HMM used in the additional searches; (5) defining a score cutoff for the final HMM; and (6) validating each model individually with tests against NCBI's NR database. The main contributions of this work are 74 manually curated HMMs available on the project website (http://mengofams.lbi.iq.usp.br/), where the models can be searched and downloaded; a detailed protocol for the manual curation of protein family HMMs; and a list of proteins that are candidates for reannotation.
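Step (5) of the protocol, defining a cutoff for the final HMM, can be illustrated with a short sketch. The abstract does not say how the cutoff was chosen, so the midpoint rule, the function names, the protein names, and the scores below are all hypothetical and serve only to show what a score cutoff does.

```python
def choose_cutoff(member_scores, nonmember_scores):
    """Pick a score threshold separating family members from non-members.

    Hypothetical rule (not taken from the thesis): the midpoint between the
    lowest-scoring known member and the highest-scoring known non-member.
    Returns None when the two groups overlap and no clean cutoff exists.
    """
    lowest_member = min(member_scores)
    highest_nonmember = max(nonmember_scores)
    if lowest_member <= highest_nonmember:
        return None
    return (lowest_member + highest_nonmember) / 2


def classify(scores, cutoff):
    """Map each sequence name to True when its score reaches the cutoff."""
    return {name: score >= cutoff for name, score in scores.items()}


# Hypothetical scores for sequences whose family membership is already known.
cutoff = choose_cutoff([300.0, 280.0], [100.0, 60.0])

# Hypothetical scores for new, unclassified sequences.
calls = classify({"protA": 250.0, "protB": 120.0}, cutoff)
```

When the score ranges of members and non-members overlap, the function returns None instead of a number. In a manual curation setting this would be the point where a curator revisits the alignment or the selection criteria.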
|
4 |
Propagação semi-automática de termos Gene Ontology a proteínas com potencial biotecnológico para a produção de bioenergia / Semi-automatic propagation of Gene Ontology terms to proteins with biotechnology potential for bioenergy production
Taniguti, Lucas Mitsuo 18 November 2014
The increase in the volume of biological data, produced mainly by second-generation sequencers, is a challenge for biological databases, which must store the data, make it available and, in the case of secondary databases, propagate biological information to sequences with no experimental characterization. This propagation is crucial because new sequences are deposited at a much higher rate than proteins are experimentally characterized. Like the EC number (Enzyme Commission number), the organization of proteins into families aims to organize and facilitate automatic operations in databases. In this context, the goal of this work was to generate computational models for protein families involved in microbial processes of biotechnological interest for bioenergy production. Reference proteins previously analyzed in collaboration with the MENGO project were chosen as seeds for the statistical models. Starting from each reference protein, UniProtKB was searched for representative proteins of each family and for function descriptions based on the scientific literature. Multiple sequence alignments of the primary sequences of the selected proteins were built with MUSCLE 3.7, and the computational models (profile hidden Markov models, pHMMs) were then generated with the HMMER package. The models were reviewed repeatedly until the curator considered them reliable for the propagation of Gene Ontology (GO) terms. A total of 1,233 proteins from UniProtKB were classified into our families, suggesting that they could be annotated with GO terms using the MENGOfams families. Of these proteins, 79% were not yet annotated in UniProtKB with the MENGO-specific GO terms. To compare the results obtained using only BLAST similarity measures with those obtained using pHMMs, we generated similarity networks with an E-value cutoff of 10^-14. The comparison showed that pHMM classification is valuable for propagating biological annotation because it precisely identifies the members of each family. A second validation was applied to each family, using the respective pHMM to query a collection of sequences generated by a null model. The null model assumed that all sequences were non-homologous and could be represented solely by the amino acid frequencies observed in the Swiss-Prot database. No non-homologous proteins were classified as members by the MENGOfams models, suggesting that the models identify only true member sequences.
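The two validation ideas in this abstract, a similarity network thresholded by E-value and a null model built from background residue frequencies, can be sketched as follows. This is an illustration under stated assumptions, not the thesis pipeline: the uniform frequency table is a placeholder for the Swiss-Prot frequencies, and the hit tuples, protein names, and function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def null_sequences(frequencies, count, length, seed=0):
    """Draw random sequences whose residues follow the given frequencies.

    Each position is sampled independently, so the sequences share
    composition with the background but carry no homology signal.
    """
    rng = random.Random(seed)
    residues = list(frequencies)
    weights = [frequencies[r] for r in residues]
    return [
        "".join(rng.choices(residues, weights=weights, k=length))
        for _ in range(count)
    ]


def similarity_edges(hits, evalue_cutoff=1e-14):
    """Keep (query, subject) pairs whose E-value is at or below the cutoff."""
    return [(query, subject) for query, subject, evalue in hits
            if evalue <= evalue_cutoff]


# Placeholder background: uniform frequencies, NOT the real Swiss-Prot values.
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
decoys = null_sequences(uniform, count=5, length=50)

# Hypothetical pairwise hits as (query, subject, E-value) tuples.
hits = [("P1", "P2", 1e-30), ("P1", "P3", 1e-5), ("P2", "P3", 1e-14)]
edges = similarity_edges(hits)
```

Searching a family model against decoys like these gives a direct check on false positives: any decoy that scores above the family cutoff is by construction not a true member.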
|
5 |
The effect of single nucleotide polymorphisms and metabolic substrates on the cellular distribution of mammalian BK channels
Adeyileka-Tracz, Bernadette Ayokunumi January 2017
Humans are approximately 99% similar, with inter-individual differences caused in part by single-nucleotide polymorphisms (SNPs), which pose a challenge for the effective treatment of disease. Bioinformatics resources can help to store and analyse gene and protein information to address this challenge; however, these resources have limitations, so the collation and biocuration of gene and protein information is required. Using the large conductance calcium- and voltage-activated potassium channel, also known as the Big Potassium (BK) channel, as an example because of its ubiquitous expression and widespread, varied role in human physiology, this study aimed to prioritise SNPs with the potential to affect the function of the channel. Using a BK channel resource created with bioinformatics tools and published literature, the mSlo SNPs H55Q and G57A, located in the S0-S1 linker, were prioritised and selected for lab-based verification. These SNPs flank three cysteine residues proven to modulate channel cellular distribution via palmitoylation, a reversible process shown to increase protein association with the cell membrane. The SNPs alter the predicted palmitoylation status of C56, one of the cysteine residues located in the S0-S1 linker. The cellular distribution of BK channels incorporating the SNPs was assessed using confocal microscopy, which revealed that the direction and magnitude of SNP-mimetic cell membrane expression was closely related to the C56 predicted palmitoylation score; a 'C56 palmitoylation pattern' was observed. Exposure to the metabolic substrates glucose, palmitate and oleate modulated SNP-mimetic cellular distribution and could invert the 'C56 palmitoylation pattern', indicating that there is interplay between the metabolic status of the cell and the amino-acid composition of the channel via palmitoylation.
The creation of a novel BK channel resource in this thesis highlighted the limitations and interdependency of bioinformatics and lab-based experimentation, whilst the SNP verification experiments solidified the link between S0-S1 cysteine residues and BK cellular distribution. BK channel function is linked with a number of physiological processes; thus, the potential clinical consequences of the SNPs prioritised in this thesis require further research.
|
8 |
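Aplicando princípios de aprendizado de máquina na construção de um biocurador automático para o Gene Ontology (GO) / Applying machine learning principles to the construction of an automatic biocurator for the Gene Ontology (GO)
Amaral, Laurence Rodrigues do 08 October 2013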
The amount of biological data made available by universities, hospitals, and research centers has increased exponentially, owing to bioinformatics, with its advanced computational methods and tools, and to high-throughput techniques. This significant increase has created the need for new strategies for capturing, storing and, above all, analyzing these data. In this scenario, a new field of work and research is emerging, called biocuration. Biocuration is becoming a fundamental part of biological and biomedical research; its main function is to structure and organize biological information, making it readable and accessible to humans and computers. To support a fast and reliable understanding of new domains, different initiatives are being proposed, and the Gene Ontology (GO) is one of the main examples. GO is one of the leading initiatives in bioinformatics; its main goal is to standardize the representation of genes and their products, providing interconnections between species and databases. The main objective of this research is therefore to propose a computational architecture that uses principles of never-ending machine learning to help GO biocurators classify new terms, a task that is currently entirely manual. The proposed architecture uses semi-supervised learning, combining different classifiers to label new GO instances. In addition, this research aims to build high-level knowledge in the form of simple IF-THEN rules and decision trees. The knowledge generated can be used by GO biocurators in the search for important patterns present in biological data, revealing concise and relevant information about the application domain.
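One way to picture "combining different classifiers to label new instances" is a voting step in which an unlabeled instance is accepted only when enough classifiers agree. The sketch below is an assumption about the general idea, not the architecture proposed in the thesis: the three IF-THEN rules, the feature names, the labels, and the unanimity threshold are all hypothetical.

```python
from collections import Counter


def agreed_label(classifiers, instance, min_votes):
    """Return the most-voted label if it has at least min_votes, else None."""
    votes = Counter(clf(instance) for clf in classifiers)
    label, n_votes = votes.most_common(1)[0]
    return label if n_votes >= min_votes else None


def label_unlabeled(classifiers, unlabeled, min_votes):
    """Split instances into confidently labeled ones and ones left pending."""
    accepted, pending = [], []
    for instance in unlabeled:
        label = agreed_label(classifiers, instance, min_votes)
        if label is None:
            pending.append(instance)
        else:
            accepted.append((instance, label))
    return accepted, pending


# Three hypothetical IF-THEN rules acting as simple classifiers.
def rule_signal(x):
    return "membrane" if x["has_signal_peptide"] else "cytoplasm"


def rule_length(x):
    return "membrane" if x["length"] > 300 else "cytoplasm"


def rule_hydrophobic(x):
    return "membrane" if x["hydrophobic_fraction"] > 0.4 else "cytoplasm"


rules = [rule_signal, rule_length, rule_hydrophobic]
unlabeled = [
    {"has_signal_peptide": True, "length": 400, "hydrophobic_fraction": 0.5},
    {"has_signal_peptide": True, "length": 100, "hydrophobic_fraction": 0.2},
]
# Require unanimity before an instance joins the labeled set.
accepted, pending = label_unlabeled(rules, unlabeled, min_votes=3)
```

In a semi-supervised loop, the accepted instances would be added to the training data and the classifiers retrained, while pending instances wait for a later round or for a human biocurator.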
|