  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
91

Approche multi-niveaux pour l'analyse des données textuelles non-standardisées : corpus de textes en moyen français / Multi-level approach for the analysis of non-standardized textual data : corpus of texts in middle french

Aouini, Mourad 19 March 2018
This thesis presents an approach to analysing non-standardized texts. It models a processing chain for automatic text annotation: grammatical annotation by means of morphosyntactic (part-of-speech) tagging, and semantic annotation by means of a named-entity recognition system. In this context, we present an analysis system for Middle French, a language in the midst of change whose spelling, inflectional system and syntax are not stable. Middle French texts are distinguished mainly by the absence of a normalized orthography and by the geographical and chronological variability of medieval lexicons. The main objective is to present a system dedicated to building linguistic resources, in particular electronic dictionaries, based on morphological rules. We then describe the instructions we established for building a morphosyntactic tagger that automatically produces contextual analyses using disambiguation grammars. Finally, we retrace the path that led us to set up local grammars for finding named entities. To this end, we built the MEDITEXT corpus, which gathers Middle French texts written between the end of the thirteenth century and the fifteenth century.
92

Anotação e classificação automática de entidades nomeadas em notícias esportivas em Português Brasileiro / Automatic named entity recognition and classification for brazilian portuguese sport news

Zaccara, Rodrigo Constantin Ctenas 11 July 2012
The aim of this work is to develop a platform for the automatic annotation and classification of named entities in news written in Brazilian Portuguese. To narrow the scope of training and analysis, only sports news about the 2011 Campeonato Paulista (São Paulo Championship) from the UOL (Universo Online) portal was used. The first artefact developed was the WebCorpus tool, whose main purpose is to make it easier to add meta-information to words through a rich web interface designed to keep the work quick and simple. With it, the named entities in the news are annotated and classified manually. The tool's database was fed by a content acquisition and extraction (crawler) tool, also developed for this platform. The second artefact was the UOLCP2011 (UOL Campeonato Paulista 2011) corpus, which was manually annotated and classified with WebCorpus using seven entity types: person, place, organization, team, championship, stadium and fans. To build the automatic named entity annotation and classification engine, three techniques were used: maximum entropy, inverted indexes, and methods that merge the two. Each technique went through three steps: algorithm development, training using machine learning techniques, and analysis of the best results.
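As a rough illustration of how an inverted index can support entity classification, the sketch below maps each token of a manually annotated mention to the entity types it was seen with, then looks up a new mention. It is not the thesis's implementation: the function names and the three example mentions are hypothetical, and only the type labels are taken from the seven listed above.

```python
from collections import defaultdict


def build_inverted_index(annotated_mentions):
    """Map each lower-cased token to the set of entity types it appeared in."""
    index = defaultdict(set)
    for mention, entity_type in annotated_mentions:
        for token in mention.lower().split():
            index[token].add(entity_type)
    return index


def candidate_types(index, mention):
    """Return the entity types shared by every token of the mention."""
    token_sets = [index.get(token, set()) for token in mention.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


# Hypothetical annotations, standing in for a manually tagged corpus.
annotated = [
    ("Corinthians", "team"),
    ("Pacaembu", "stadium"),
    ("Campeonato Paulista", "championship"),
]
index = build_inverted_index(annotated)
known = candidate_types(index, "campeonato paulista")
unknown = candidate_types(index, "Santos")
```

A lookup of this kind only covers mentions already seen during annotation, which is one plausible reason to merge it with a statistical model such as maximum entropy.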
93

Použití hlubokých kontextualizovaných slovních reprezentací založených na znacích pro neuronové sekvenční značkování / Deep contextualized word embeddings from character language models for neural sequence labeling

Lief, Eric January 2019
A family of Natural Language Processing (NLP) tasks, such as part-of-speech (PoS) tagging, Named Entity Recognition (NER), and Multiword Expression (MWE) identification, all involve assigning labels to sequences of words in text (sequence labeling). Most modern machine learning approaches to sequence labeling utilize word embeddings, learned representations of text in which words with similar meanings have similar representations. Quite recently, contextualized word embeddings have garnered much attention because, unlike pretrained context-insensitive embeddings such as word2vec, they are able to capture word meaning in context. In this thesis, I evaluate the performance of different embedding setups (context-sensitive, context-insensitive word, as well as task-specific word, character, lemma, and PoS) on the three abovementioned sequence labeling tasks using a deep learning model (BiLSTM) and Portuguese datasets.
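To make "assigning labels to sequences of words" concrete, the sketch below encodes entity spans as one tag per token using the common BIO scheme. The abstract does not say which tagging scheme the thesis uses, so BIO is an assumption here, and the sentence, spans and function name are made up for illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Illustrative Portuguese sentence: "Maria Silva lives in Lisbon".
tokens = ["Maria", "Silva", "mora", "em", "Lisboa"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (4, 5, "LOC")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence labeling model such as a BiLSTM is then trained to predict one such tag for every input token.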
94

Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision

Täckström, Oscar January 2013
Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. 
Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared to both unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a wide number of target languages, in the setting where no annotated training data is available in the target language.
95

GoPubMed: Ontology-based literature search for the life sciences / GoPubMed: ontologie-basierte Literatursuche für die Lebenswissenschaften

Doms, Andreas 20 January 2009
Background: Most of our biomedical knowledge is accessible only through texts. The biomedical literature grows exponentially, and PubMed comprises over 18,000,000 literature abstracts. Recently, much effort has been put into creating biomedical ontologies that capture biomedical facts. Exploiting ontologies to explore the scientific literature is a new area of research.
Motivation: When people search, they have questions in mind. Answering questions in a domain requires knowledge of that domain's terminology. Classical search engines do not provide background knowledge when presenting search results. Ontology-annotated structured databases allow for data mining; the hypothesis is that ontology-annotated literature databases allow for text mining. The central problem is to associate scientific publications with ontological concepts, which is a prerequisite for ontology-based literature search. The question then is how to answer biomedical questions using ontologies and a literature corpus. Finally, the task is to automate bibliometric analyses on a corpus of scientific publications.
Approach: Recent joint efforts on automatically extracting information from free text showed that the applied methods are complementary. The idea is to employ the rich terminological and relational information stored in biomedical ontologies to mark up biomedical text documents. Based on established semantic links between documents and ontology concepts, the goal is to answer biomedical questions on a corpus of documents. The fully annotated literature corpus makes it possible, for the first time, to automatically generate bibliometric analyses for ontological concepts, authors and institutions.
Results: This work includes a novel framework for annotating free texts with ontological concepts. The framework generates recognition pattern rules from the terminological and relational information in an ontology. Maximum entropy models can be trained to distinguish the meanings of ambiguous concept labels. The framework was used to develop an annotation pipeline for PubMed abstracts with 27,863 Gene Ontology concepts. The evaluation of recognition performance yielded a precision of 79.9% and a recall of 72.7%, improving on the previously used algorithm by 25.7% F-measure. The evaluation was done on a curation corpus, manually created by the original authors, of 689 PubMed abstracts with 18,356 curations of concepts. Methods for reasoning over large amounts of documents with ontologies were developed. The ability to answer questions with the online system was shown on a set of biomedical questions from the TREC Genomics Track 2006 benchmark. This work includes the first ontology-based, large-scale, up-to-date bibliometric analysis available online for topics in molecular biology represented by GO concepts. The automatic bibliometric analysis is in line with existing, but often outdated, manual analyses.
Outlook: A number of promising continuations of this work have been spun off. A freely available online search engine has a growing user community. A spin-off company, funded by the High-Tech Gründerfonds, commercializes the new ontology-based search paradigm. Several offshoots of GoPubMed have been developed, including GoWeb (general web search), Go3R (search in replacement, reduction and refinement methods for animal experiments) and GoGene (search in gene/protein databases).
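For readers who want to relate the reported precision and recall to a single score, the sketch below computes the F-measure from the two figures quoted above. It assumes the balanced F1 variant (beta = 1), which the abstract does not state explicitly; the function name is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted in the abstract: precision 79.9%, recall 72.7%.
f1 = f_measure(0.799, 0.727)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.761"
```

Under that assumption, the reported precision and recall correspond to an F1 of roughly 76.1%.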
96

Hypergraphs and information fusion for term representation enrichment : applications to named entity recognition and word sense disambiguation / Hypergraphes et fusion d’information pour l’enrichissement de la représentation de termes : applications à la reconnaissance d’entités nommées et à la désambiguïsation du sens des mots

Soriano-Morales, Edmundo-Pavel 07 February 2018
Making sense of textual data is essential if computers are to understand our language. To extract actionable information from text, we need to represent it by means of descriptors before applying knowledge discovery techniques. The goal of this thesis is to shed light on heterogeneous representations of words and on how to leverage them while addressing their implicitly sparse nature.
First, we propose a hypergraph network model that holds heterogeneous linguistic data in a single unified model. In other words, we introduce a model that represents words by means of different linguistic properties and links them together according to those properties. Our proposal differs from other types of linguistic networks in that we aim to provide a general structure that can hold several types of descriptive text features, instead of a single one as in most existing representations. This representation may be used to analyze the inherent properties of language from different points of view, or to serve as the starting point of an applied NLP task pipeline.
Secondly, we employ feature fusion techniques to provide a single, final, enriched representation that exploits the heterogeneous nature of the model and alleviates the sparseness of each representation. Techniques of this kind are usually applied only to combining multimedia data. In our approach, we consider different text representations as distinct sources of information that can enrich one another. To the best of our knowledge, this approach has not been explored before.
Thirdly, we propose an algorithm that exploits the characteristics of the network, in particular its real-world properties, to identify and group semantically related words. In contrast with similar methods that are also based on network structure, our algorithm reduces the number of required parameters and, more importantly, allows the use of either lexical or syntactic networks to discover these groups of words, instead of the single type of feature usually employed.
We focus on two natural language processing tasks: Word Sense Induction and Disambiguation (WSI/WSD) and Named Entity Recognition (NER). In total, we test our proposals on four different open-access datasets. The results show the pertinence of our contributions and give some insight into the properties of heterogeneous features and their combination with fusion methods. Specifically, our experiments are twofold. First, we show that fusion-enriched heterogeneous features drawn from our proposed linguistic network outperform single-feature systems and other basic baselines, including simple feature concatenation. We note that using single fusion operators is less effective than using a combination of them to obtain a final space representation, and we show that the features added by each combined fusion operation help the models predict the appropriate classes. We test the enriched representations on both the WSI/WSD and NER tasks. Secondly, we address the WSI/WSD task with our proposed network-based method. Although it builds on previous work, it achieves better overall performance while reducing the number of parameters needed. We also discuss the use of either lexical or syntactic networks to solve the task.
Finally, we parse a corpus based on the English Wikipedia and store it following the proposed network model. The parsed Wikipedia version serves as a linguistic resource for other researchers. Unlike similar resources, it stores not only each word's part-of-speech tag and dependency relations but also the constituency-tree information of each word analyzed. We hope this resource will be used in future developments without the need to compile such a resource from scratch.
97

[en] NAMED ENTITY RECOGNITION FOR PORTUGUESE / [pt] RECONHECIMENTO DE ENTIDADES MENCIONADAS PARA O PORTUGUÊS

DANIEL SPECHT SILVA MENEZES 13 December 2018
The production of, and access to, huge amounts of data is a pervasive element of the Information Age. The volume of available information is unprecedented in human history and is constantly expanding. An opportunity that emerges in this environment is the development of applications capable of structuring the knowledge contained in these data. This is where Natural Language Processing (NLP) fits in: being able to extract structured information efficiently from textual sources. A fundamental step towards this goal is the task of Named Entity Recognition (NER), which consists of delimiting and categorizing mentions of entities in a text. The construction of NLP systems must be accompanied by datasets that express human understanding of the structures of interest, so that the systems' results can be compared with human judgement. These datasets are a scarce resource whose construction is costly in terms of human effort. Recently, the NER task has been approached successfully with artificial neural networks, which require annotated datasets for both training and evaluation. In this work we propose building a large dataset for Portuguese NER in an automated way, minimizing the need for human intervention, using public data sources structured according to the principles of the Semantic Web, namely DBpedia and Wikipédia. We developed a methodology for constructing the corpus and ran experiments on it using the neural network architectures with the best currently reported results. We explored several neural network models and a range of hyperparameter values, obtaining preliminary results, and we also propose architectures specifically aimed at incorporating different data sources for training.
98

Anotação e classificação automática de entidades nomeadas em notícias esportivas em Português Brasileiro / Automatic named entity recognition and classification for brazilian portuguese sport news

Rodrigo Constantin Ctenas Zaccara 11 July 2012
99

Extrakce strukturovaných dat z českého webu s využitím extrakčních ontologií / Extracting Structured Data from Czech Web Using Extraction Ontologies

Pouzar, Aleš January 2012
This thesis deals with automatic information extraction from HTML documents in two selected domains: laptop offers are extracted from e-shops, and freely published job offers are extracted from company sites. The extraction process outputs fine-grained structured data grouped into data records, in which a corresponding semantic label is assigned to each data item. The task was performed using the extraction system Ex, which combines two approaches: manually written rules and supervised machine learning algorithms. Expert knowledge in the form of extraction rules made it possible to overcome the lack of training data. The rules are independent of any specific formatting structure, so a single extraction model can be used for a heterogeneous set of documents. The results achieved for laptop offers showed that an extraction ontology describing one or a few product types could be combined with wrapper induction methods to automatically extract offers for all product types at web scale with minimal human effort.
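The idea of manually written, formatting-independent extraction rules can be sketched with plain regular expressions, as below. This is not the rule language of the Ex system, which the abstract does not describe: the attribute names, patterns and sample offer text are all hypothetical and only show how each extracted value gets a semantic label inside a record.

```python
import re

# Hypothetical hand-written rules: one pattern per semantic label.
RULES = {
    "ram_gb": re.compile(r"(\d+)\s*GB\s+RAM", re.IGNORECASE),
    "screen_inch": re.compile(r'(\d+(?:[.,]\d+)?)\s*(?:"|inch)', re.IGNORECASE),
    "price_czk": re.compile(r"(\d{1,3}(?:\s\d{3})*)\s*Kč"),
}


def extract_record(text):
    """Apply every rule to plain text and collect labelled values."""
    record = {}
    for label, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            record[label] = match.group(1)
    return record


# Made-up offer text, already stripped of HTML markup.
offer = 'Notebook X: 8 GB RAM, 15.6" display, price 12 990 Kč'
record = extract_record(offer)
print(record)
```

Because the rules operate on text content rather than on a particular page layout, the same patterns can in principle be reused across differently formatted sites.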
100

GoPubMed: Ontology-based literature search for the life sciences

Doms, Andreas 06 January 2009
