111 |
Handling domain knowledge in system design models. An ontology based approach. Hacid, Kahina 06 March 2018
Models of complex systems are designed in heterogeneous domains, yet this heterogeneity is rarely considered explicitly when describing and validating processes. Moreover, these systems usually involve several domain experts and several design models corresponding to different analyses (views) of the same system. However, no explicit information is given about the characteristics of either the domain or the system analyses performed. In this thesis, we propose a general framework that offers, first, the formalization of domain knowledge using ontologies and, second, the ability to strengthen design models by making explicit references to the domain knowledge formalized in these ontologies. This framework also provides resources for making the features of an analysis explicit by formalizing them within models qualified as “points of view”. We have set up two deployments of our approach: a Model Driven Engineering (MDE) based deployment and a formal-methods deployment based on proof and refinement. This general framework has been validated on several non-trivial case studies drawn from system engineering.
|
112 |
Effective Gene Expression Annotation Approaches for Mouse Brain Images January 2016
abstract: Understanding the temporal and spatial characteristics of gene expression during brain development is one of the crucial research topics in neuroscience. An accurate description of the locations and expression status of the relevant genes requires extensive experimental resources. The Allen Developing Mouse Brain Atlas provides a large number of in situ hybridization (ISH) images of gene expression over seven mouse brain developmental stages. Studying mouse brain models helps us understand gene expression in human brains. The atlas covers thousands of genes, which are currently annotated manually by biologists. Because manual annotation is labor-intensive, an efficient approach to automated gene expression annotation of mouse brain images is needed. In this thesis, a novel and efficient approach based on a machine learning framework is proposed. Features are extracted from raw brain images, and both binary and multi-class classification models are built with supervised learning methods. One of the most widely adopted methods for generating features in current research is the bag-of-words (BoW) algorithm. However, neither the efficiency nor the accuracy of BoW is outstanding on large-scale data. Therefore, an augmented sparse coding method called Stochastic Coordinate Coding is adopted in this thesis to generate high-level features. In addition, a new multi-label classification model is proposed, in which a label hierarchy is built from the given brain ontology structure. Experiments conducted on the atlas show that this approach is efficient and classifies the images with relatively high accuracy. / Dissertation/Thesis / Masters Thesis Computer Science 2016
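The idea of deriving a label hierarchy from an ontology can be illustrated with a minimal sketch. It is not the thesis's model: the region names and the `PARENT` map are hypothetical, and the rule that a label on a region also implies labels on all of its ancestors is an assumption (a common convention in hierarchical multi-label classification) that the abstract does not state.

```python
# Hypothetical parent links for a tiny brain-region ontology (illustrative names only).
PARENT = {
    "forebrain": "brain",
    "midbrain": "brain",
    "telencephalon": "forebrain",
    "diencephalon": "forebrain",
}


def ancestors(label, parent=PARENT):
    """Return all ancestors of `label`, nearest first."""
    chain = []
    while label in parent:
        label = parent[label]
        chain.append(label)
    return chain


def expand_labels(labels, parent=PARENT):
    """Close a label set upward (assumed rule: a labeled region implies its ancestors)."""
    closed = set(labels)
    for label in labels:
        closed.update(ancestors(label, parent))
    return closed
```

Under this assumption, an image annotated only with `"telencephalon"` would also count as a positive example for `"forebrain"` and `"brain"`, which keeps the predictions consistent with the ontology.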
|
113 |
Drosophila Stage Annotation using Sparse Learning Method January 2012
abstract: Drosophila melanogaster, an important model organism, is used to explore the mechanisms that govern cell differentiation and embryonic development. Understanding these mechanisms will help reveal the effects of genes in other species, including humans. Digital camera techniques now make high-quality imaging of Drosophila gene expression possible, and advances in biology mean that gene expression images revealing spatiotemporal patterns are generated at a high-throughput pace. An automated and efficient system for analyzing gene expression will therefore become a necessary tool for investigating gene functions, interactions and developmental processes. One investigation method is to compare the expression patterns of different developmental stages. Currently, however, expression patterns are annotated manually with rough stage ranges, and this annotation requires the professional knowledge of experienced biologists. Transferring this biological domain knowledge into a system that annotates the patterns automatically is therefore a challenging problem for computer scientists. In this thesis, stage annotation for Drosophila embryos is modeled in a machine learning framework. Three sparse learning algorithms and one ensemble algorithm are used to attack the problem. The sparse algorithms are Lasso, group Lasso and sparse group Lasso. The ensemble algorithm is based on a voting method. The proposed algorithms annotate patterns with individual stages rather than stage ranges with high accuracy, and the decimal stage annotation algorithm offers a novel way to annotate patterns with decimal stages. In addition, the performance of the algorithms is analyzed and corresponding explanations are given. Finally, with the proposed system, all the lateral-view BDGP and FlyFish images are annotated and several interesting applications of the decimal stage value are revealed.
/ Dissertation/Thesis / M.S. Computer Science 2012
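Two building blocks named in this abstract can be sketched briefly: the soft-thresholding operator, which is the standard per-coefficient update behind Lasso and the reason its solutions are sparse, and a majority vote over per-model stage predictions. This is a hedged illustration, not the thesis's implementation; the function names are hypothetical, and the tie-breaking rule (smallest stage wins) is an assumption, since the abstract does not describe how votes are combined.

```python
from collections import Counter


def soft_threshold(value, lam):
    """Shrink `value` toward zero by `lam`; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def majority_vote(predictions):
    """Return the most common predicted stage; ties go to the smallest stage (assumed rule)."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(stage for stage, n in counts.items() if n == best)
```

Because small coefficients are set exactly to zero, only a subset of image features ends up contributing to the stage prediction, which is what makes the learned model sparse.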
|
114 |
Annotation sémantique automatique de textes par exploration contextuelle : application aux relations de localisation en coréen / Automatic Semantic Annotation based on the Exploration Contextual method and Application for localization relations in Korean Text Chai, Hyunzoo 09 July 2009
The work carried out during my thesis falls within the framework of the Semantic Web and addresses semantic annotation. The vision of the Semantic Web is to make information available so that users can exploit it according to their needs. Indeed, in current information systems, whose complexity is reflected in large volumes of data, the challenge is no longer to gather data but to extract relevant information from it. To this end, data must be semantically labeled. Moreover, compared with inflectional languages such as French, the technology for processing agglutinative languages such as Korean still has shortcomings because of the complexity of their morphology and syntax. / We present an automatic semantic annotation system for Korean on the EXCOM (EXploration COntextual for Multilingual) platform. The purpose of natural language processing is to enable computers to understand human language so that they can perform more sophisticated tasks. Accordingly, current research concentrates more and more on extracting semantic information. Semantic processing requires the widespread annotation of documents. However, compared with that for inflectional languages, the technology for processing agglutinative languages such as Korean still has shortcomings. EXCOM identifies semantic information in Korean text using our new method, the Contextual Exploration Method. Our system properly annotates approximately 90% of standard Korean sentences, and this annotation rate holds across text domains.
|
115 |
Computational proteomics for genome annotation Blakeley, Paul January 2013
The field of proteogenomics operates at the interface between proteomics and genomics, and has emerged during the past decade to exploit the vast quantities of high-throughput sequence data. A range of different proteogenomics approaches have been developed, which integrate mass spectrometry data with genome sequence data to provide empirical evidence for protein-coding genes. However, current methods may not be optimized as they do not fully consider the splicing complexity in eukaryotes and there is currently no best practice method. To address this, we investigate the level of proteomics support for Ensembl gene models in human, and a selection of model organisms. We find a disparity between the number of splice variants confirmed by extant data, and the number that can theoretically be confirmed using current proteomics technologies. We then go on to investigate EST-based proteogenomics methods, which enabled the discovery of novel peptide sequences in the chicken genome, which represent hitherto unannotated genes, amended gene models, polymorphisms, and genes missing from the genome assembly. Different approaches for searching mass spectrometry data against transcript sequences are explored, and we show that searching mass spectra against protein sequences predicted by the EORF and ESTScan2 translation tools results in the best sensitivity.
|
116 |
Sifter-T: Um framework escalável para anotação filogenômica probabilística funcional de domínios protéicos / Sifter-T: A scalable framework for phylogenomic probabilistic protein domain functional annotation Danillo Cunha de Almeida e Silva 25 October 2013
It is well known that many software tools fall out of use because of poor usability. Even tools known for the quality with which they perform a task are abandoned in favor of tools that are simpler to use, simpler to install, or faster. In the field of functional annotation, the Sifter tool (v2.0) is considered one of those with the best annotation quality. It was recently ranked among the best functional annotation tools in the Critical Assessment of protein Function Annotation (CAFA) experiment. Despite this, it is still not widely used, probably because of usability issues and the framework's poor fit to large-scale use. The original SIFTER workflow consists of two main steps: retrieving the annotations for a list of genes, and generating a reconciled gene tree for the same list. Then, from the gene tree, Sifter builds a Bayesian network with the same structure, in which the leaves represent the genes. The functional annotations of known genes are attached to these leaves, and the annotations are then propagated probabilistically along the Bayesian network to the leaves that have no prior information. At the end of the process, a list of putative Gene Ontology functions and their probabilities of occurrence is generated for each gene of unknown function. The main goal of this work is to improve the original source code for better performance, potentially allowing it to be used at genomic scale. While studying the data pre-processing workflow, we found opportunities for improvement and devised strategies to address them.
The strategies implemented include: the use of parallel threads; processing load balancing; revised algorithms for better use of disk, memory and run time; adaptation of the source code to biological databases in the formats currently in use; improved accessibility for the user; a broader range of accepted input types; automation of the reconciliation between gene trees and species trees; sequence filtering to reduce the size of the analysis; and other minor implementations. With these we achieved a performance gain of up to 87-fold for annotation retrieval and 73.3% for gene tree reconstruction on quad-core machines, and a significant reduction in memory consumption during the realignment phase. The result of this implementation is presented as Sifter-T (Sifter optimized for Throughput), an open source tool with better usability, speed and annotation quality than the original implementation of the Sifter workflow. Sifter-T was written in a modular fashion in the Python programming language; it was designed to simplify the task of annotating complete genomes and proteomes; and the results are presented so as to ease the researcher's work. / It is known that many software tools are no longer used because of their complex usability. Even tools known for the quality of their task execution are abandoned in favour of tools that are faster or simpler to use or install. In the functional annotation field, Sifter (v2.0) is regarded as one of the best when it comes to annotation quality. Recently it has been considered one of the best tools for functional annotation according to the Critical Assessment of Protein Function Annotation (CAFA) experiment. Nevertheless, it is still not widely used, probably due to issues with usability and with the suitability of the framework for high-throughput scale.
The original SIFTER workflow consists of two main steps: annotation retrieval for a list of genes and generation of a reconciled gene tree for the same list. Next, based on the gene tree, Sifter builds a Bayesian network of the same structure, in which the leaves represent genes. The known functional annotations are associated with these leaves, and the annotations are then probabilistically propagated along the Bayesian network to the leaves without a priori information. At the end of the process, a list of Gene Ontology functions and their occurrence probabilities is generated for each gene of unknown function. The main goal of this work is to optimize the original source code for better performance, potentially allowing it to be used at genome-wide scale. While studying the pre-processing workflow, we found opportunities for improvement and envisioned strategies to address them. The implemented strategies include: the use of parallel threads; CPU load balancing; revised algorithms for better use of disk access, memory and runtime; adaptation of the source code to currently used biological databases; improved user accessibility; a wider range of input types; automatic reconciliation of gene and species trees; sequence filtering to reduce the dimension of the analysis; and other minor implementations. With these implementations we achieved large speed-ups. For example, we obtained an 87-fold performance increase in the annotation retrieval module and a 72.3% speed increase in the gene tree generation module on quad-core machines. Additionally, memory usage during the realignment phase decreased significantly. This implementation is presented as Sifter-T (Sifter Throughput-optimized), an open source tool with better usability, performance and annotation quality than the original implementation of the Sifter workflow.
Sifter-T was written in a modular fashion in the Python programming language; it is designed to simplify the annotation of complete genomes and proteomes, and its outputs are presented so as to make the researcher's work easier.
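The "parallel threads" strategy applied to annotation retrieval can be sketched with Python's standard `concurrent.futures` module. This is only an illustration under stated assumptions, not Sifter-T's code: the `ANNOTATIONS` dictionary is a hypothetical in-memory stand-in for an annotation database, and `fetch_annotations` and `fetch_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an annotation database (gene -> GO terms).
ANNOTATIONS = {"geneA": ["GO:0003824"], "geneB": ["GO:0005515", "GO:0003677"]}


def fetch_annotations(gene_id):
    """Look up GO terms for one gene; a real lookup would read from disk or a database."""
    return gene_id, ANNOTATIONS.get(gene_id, [])


def fetch_all(gene_ids, workers=4):
    """Retrieve annotations for many genes concurrently, keyed by gene ID."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the dict preserves that order.
        return dict(pool.map(fetch_annotations, gene_ids))
```

Threads of this kind mainly help when each lookup waits on I/O; for CPU-bound steps in Python, separate processes are usually the better fit.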
|
117 |
PATO: um ambiente integrado com interface gráfica para a curadoria de dados de sequências biológicas / PATO: an integrated enviroment with GUI to data curation of biological sequences Liliane Santana Oliveira 22 November 2013
The evolution of DNA sequencing technologies has made it possible to determine the genomic sequence of an ever-growing number of organisms. However, obtaining the nucleotide sequence of a genome is only the first step in studying an organism. The annotation process consists of identifying the different regions of interest in the genome and their functions. Several computational tools have been developed to assist the annotation process, but none of them lets the user select sequences, process them to find evidence about genomic regions (such as gene prediction and protein domain prediction), analyze them graphically, and add information about their regions, all in a single environment. The goal of this project was therefore to develop a graphical platform for genome annotation that lets the user carry out the tasks required by the annotation process in a single tool integrated with a database. The idea is to give users the freedom to work with their own dataset, allowing them to select sequences for analysis, build pipelines, process the sequences, and analyze the results in a viewer that lets them add information to regions and curate the sequences. The resulting tool is easily extensible, allowing new annotation functionality to be plugged in as modules, and its structure lets the user work both with expressed-sequence projects and with genome annotation. / The evolution of DNA sequencing technologies has permitted the elucidation of the genomic sequence of an increasing number of organisms. However, obtaining the nucleotide sequence of the genome is only the first step in the study of organisms. The annotation process consists of identifying the different regions of interest in the genome and their features.
Several computational tools have been developed to support the annotation process; however, none allows the user to select sequences, process them, analyze them graphically and add information about their regions in the same environment. Thus, the aim of this project was to develop a graphical platform for genome annotation that allows the user to perform the tasks required by the annotation process in a single tool integrated with a database. The idea is to give users the freedom to work with their own dataset, enabling the selection of sequences for analysis, pipeline construction, processing of the sequences, and analysis of the results in a viewer that allows the user to add information to the regions and to curate the sequences. The resulting tool is easily extensible, allowing the modular addition of new annotation functionality, and its structure allows the user to work both with expressed-sequence projects and with genome annotation.
|
118 |
Contributions to In Silico Genome Annotation Kalkatawi, Manal M. 30 November 2017
Genome annotation is an important topic since it provides information for the foundation
of downstream genomic and biological research. It is considered as a way of summarizing
part of existing knowledge about the genomic characteristics of an organism. Annotating
different regions of a genome sequence is known as structural annotation, while
identifying functions of these regions is considered functional annotation. In silico
approaches can facilitate both tasks, which otherwise would be difficult and time-consuming.
This study contributes to genome annotation by introducing several novel
bioinformatics methods, some based on machine learning (ML) approaches.
First, we present Dragon PolyA Spotter (DPS), a method for accurate identification of the
polyadenylation signals (PAS) within human genomic DNA sequences. For this, we derived
a novel feature-set able to characterize properties of the genomic region surrounding the
PAS, enabling development of high accuracy optimized ML predictive models. DPS
considerably outperformed the state-of-the-art results.
The second contribution concerns developing generic models for structural annotation,
i.e., the recognition of different genomic signals and regions (GSR) within eukaryotic DNA.
We developed DeepGSR, a systematic framework that facilitates generating ML models
to predict GSR with high accuracy. To the best of our knowledge, no generic and
automated method is available for such a task that could facilitate the study of newly sequenced organisms. The prediction module of DeepGSR uses deep learning algorithms
to derive highly abstract features that depend mainly on proper data representation and
hyperparameter calibration. DeepGSR, which was evaluated on recognition of PAS and
translation initiation sites (TIS) in different organisms, yields a simpler and more precise
representation of the problem under study, compared to some other hand-tailored
models, while producing high accuracy prediction results.
Finally, we focus on deriving a model capable of facilitating the functional annotation of
prokaryotes. As far as we know, there is no fully automated system for detailed
comparison of functional annotations generated by different methods. Hence, we
developed BEACON, a method and supporting system that compares gene annotation
from various methods to produce a more reliable and comprehensive annotation. Overall,
our research contributed to different aspects of genome annotation.
|
119 |
Automatic Protein Function Annotation Through Text Mining Toonsi, Sumyyah 25 August 2019
Knowledge of a protein’s function is essential to many studies in molecular biology, genetic experiments and protein-protein interactions. The Gene Ontology (GO) captures the functions of gene products in classes and establishes relationships between them. Manually annotating proteins with GO functions from the biomedical literature is a tedious process that calls for automation. We develop a novel, dictionary-based method to annotate proteins with functions from text. We extract text-based features from words matched against a dictionary of GO. Since a class is included whenever any word matches its class description, negative samples outnumber positive ones. To mitigate this imbalance, we apply strict rules before weakly labeling the dataset according to the curated annotations. Furthermore, we discard samples with low statistical evidence and train a logistic regression classifier. The results of a 5-fold cross-validation show a precision of 91% and an accuracy of 96% in the best-performing fold. The worst fold showed a precision of 80% and an accuracy of 95%. We conclude by explaining how this method can be used for similar annotation problems.
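The dictionary-matching step and the two reported metrics can be illustrated with a small sketch. It makes several assumptions that are not in the abstract: the two-entry `GO_WORDS` dictionary, whitespace tokenization, and the function names are all hypothetical, and the thesis's strict filtering rules and logistic regression classifier are not reproduced here.

```python
# Hypothetical mini-dictionary: GO class ID -> words from its label.
GO_WORDS = {
    "GO:0003677": {"dna", "binding"},
    "GO:0016301": {"kinase", "activity"},
}


def candidate_classes(sentence, go_words=GO_WORDS):
    """Return GO classes sharing at least one word with the sentence."""
    tokens = set(sentence.lower().split())
    return sorted(go for go, words in go_words.items() if tokens & words)


def precision_and_accuracy(y_true, y_pred):
    """Compute precision and accuracy for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    return precision, correct / len(y_true)
```

Matching on any single word yields many candidate classes that are not true annotations, which is the source of the class imbalance described above. With such imbalance, accuracy is dominated by the many correctly rejected negatives, so precision is the more informative of the two figures.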
|
120 |
Molekulargenetische und physiologische Untersuchungen an der Weinhefe Kloeckera apiculata (Hanseniaspora uvarum) / Molecular genetic and physiological studies of the wine yeast Kloeckera apiculata (Hanseniaspora uvarum) Bink, Frauke Julia 24 August 2010
During the first half of wine fermentation, the wine yeast Kloeckera apiculata (perfect form: Hanseniaspora uvarum) dominates, even when a starter culture of the pure-culture yeast Saccharomyces cerevisiae has been added. So-called spontaneous fermentations without such an addition occasionally yield wines of higher quality. In large-scale wine production, however, the risk of stuck fermentations (e.g., due to acetic acid production) and off-flavors in spontaneous fermentation outweighs its potential advantages. There, substantial quality improvements could be achieved by the parallel addition of starter cultures of a modified K. apiculata strain that could improve the aroma composition without negatively affecting product quality.
To this end, a deeper understanding of the genetics and physiology of K. apiculata is needed. Work was therefore begun on establishing a molecular genetic system for this yeast. First, genomic DNA was prepared and used to construct a gene library, from which several markers (HIS3, URA3, TRP1 and LEU2) were obtained by heterologous complementation in the corresponding S. cerevisiae strains. The genes isolated in this way are intended to be used in the future to construct E. coli/K. apiculata shuttle vectors for cloning. In addition, attempts will be made to obtain deletions in the Kloeckera genome in order to switch off metabolic pathways selectively, for example the pathway that leads to acetic acid production.
Initial studies of phosphofructokinase, the first glycolysis-specific enzyme on the route to alcoholic fermentation, suggest a hetero-octameric structure of the enzyme, similar to that in S. cerevisiae. First results from the enzymatic analysis after heterologous expression of the encoding genes in S. cerevisiae are presented.
In addition, sequencing of the complete genome of K. apiculata was begun, and about 90% of the sequence was determined. This allowed the identification of a large number of homologs of genes that encode proteins of known function in other yeast species. Furthermore, the preliminary results suggest that a genome duplication, as postulated for yeasts of the Saccharomyces group, has not yet taken place in K. apiculata.
The first results of comparative FACS analyses indicate that K. apiculata is a diploid organism. It was also possible to separate the chromosomes of K. apiculata using a special gel electrophoresis technique (PFGE). Following Southern blotting, individual genes could moreover be assigned to the corresponding chromosomes by hybridization with probes.
|