1 |
CUILESS2016: a clinical corpus applying compositional normalization of text mentions
Osborne, John D., Neu, Matthew B., Danila, Maria I., Solorio, Thamar, Bethard, Steven J. 10 January 2018
Background: Traditionally text mention normalization corpora have normalized concepts to single ontology identifiers ("pre-coordinated concepts"). Less frequently, normalization corpora have used concepts with multiple identifiers ("post-coordinated concepts") but the additional identifiers have been restricted to a defined set of relationships to the core concept. This approach limits the ability of the normalization process to express semantic meaning. We generated a freely available corpus using post-coordinated concepts without a defined set of relationships that we term "compositional concepts" to evaluate their use in clinical text. Methods: We annotated 5397 disorder mentions from the ShARe corpus to SNOMED CT that were previously normalized as "CUI-less" in the "SemEval-2015 Task 14" shared task because they lacked a pre-coordinated mapping. Unlike the previous normalization method, we do not restrict concept mappings to a particular set of the Unified Medical Language System (UMLS) semantic types and allow normalization to occur to multiple UMLS Concept Unique Identifiers (CUIs). We computed annotator agreement and assessed semantic coverage with this method. Results: We generated the largest clinical text normalization corpus to date with mappings to multiple identifiers and made it freely available. All but 8 of the 5397 disorder mentions were normalized using this methodology. Annotator agreement ranged from 52.4% using the strictest metric (exact matching) to 78.2% using a hierarchical agreement that measures the overlap of shared ancestral nodes. Conclusion: Our results provide evidence that compositional concepts can increase semantic coverage in clinical text. To our knowledge we provide the first freely available corpus of compositional concept annotation in clinical text.
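The two ends of the agreement range reported above can be illustrated with a small sketch. The abstract does not give the exact formula for the hierarchical metric, so the version below is an assumption: it scores two annotations by the Jaccard overlap of their ancestor sets. The toy hierarchy and the function names are hypothetical and are not taken from SNOMED CT or from the corpus.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy is-a hierarchy (child -> parents); NOT real SNOMED CT content.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "disorder": {"root"},
    "finding": {"root"},
    "heart_disorder": {"disorder"},
    "valve_disorder": {"heart_disorder"},
    "pain": {"finding"},
}


def with_ancestors(concept: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the concept together with all of its ancestors."""
    seen: Set[str] = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def closure(concepts: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Union of the ancestor sets of every concept in one annotation."""
    result: Set[str] = set()
    for concept in concepts:
        result |= with_ancestors(concept, parents)
    return result


def exact_agreement(a: Iterable[str], b: Iterable[str]) -> float:
    """Strictest metric: 1.0 only if both annotators chose the same identifiers."""
    return 1.0 if set(a) == set(b) else 0.0


def hierarchical_agreement(
    a: Iterable[str], b: Iterable[str], parents: Dict[str, Set[str]]
) -> float:
    """Assumed metric: Jaccard overlap of the shared ancestral nodes."""
    anc_a, anc_b = closure(a, parents), closure(b, parents)
    union = anc_a | anc_b
    return len(anc_a & anc_b) / len(union) if union else 1.0


if __name__ == "__main__":
    first, second = {"valve_disorder"}, {"heart_disorder"}
    print(exact_agreement(first, second))                   # 0.0
    print(hierarchical_agreement(first, second, PARENTS))   # 0.75
```

In this toy case the annotators disagree under exact matching, yet share three of four ancestral nodes, which is why a hierarchical score can sit well above an exact-match score on the same data.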
|
2 |
Contributions to generic and affective visual concept recognition / Contribution à la reconnaissance de concepts visuels génériques et émotionnels
Liu, Ningning 22 November 2013
This Ph.D. thesis is dedicated to visual concept recognition (VCR). Because of the many practical difficulties involved, VCR is still considered one of the most challenging problems in computer vision and pattern recognition. In this context, we have proposed several contributions to VCR, particularly multimodal approaches that efficiently combine visual and textual information.
Firstly, we have proposed semantic features for VCR and have investigated the effectiveness of different types of low-level visual features, including color, texture and shape. Specifically, we believe that different concepts require different features to characterize them well enough for automatic recognition. We have therefore evaluated various visual representations: not only global features such as color, shape and texture, but also state-of-the-art local descriptors such as SIFT, Color SIFT, HOG, DAISY, LBP and Color LBP. To help bridge the semantic gap between low-level visual features and high-level semantic concepts, particularly those related to emotions and feelings, we have proposed mid-level visual features based on the visual harmony and dynamism expressed in images, using Itten's color theory and psychological interpretations. Moreover, we have employed a spatial pyramid decomposition to capture local and spatial information when building the harmony and dynamism features. We have also proposed a new representation based on HSV color histograms that uses a visual attention model to identify the regions of interest in images.
Secondly, we have proposed a novel textual feature designed for VCR. Most photos shared online (on Flickr, Facebook and similar sites) come with textual descriptions in the form of tags or captions. These descriptions are a rich source of semantic information about the visual data, which makes them worth considering for VCR and multimedia information retrieval. We propose Histograms of Textual Concepts (HTC) to capture the semantic relatedness of concepts. The general idea behind HTC is to represent a text document as a histogram of textual concepts over a vocabulary or dictionary, where the value of each bin is the accumulated contribution of every word in the document toward the corresponding concept, according to a predefined semantic similarity measure. Several variants of HTC have been proposed and proved very effective for VCR. Inspired by cepstral analysis of speech, we have also developed Cepstral HTC, which captures both term frequency-based information (like TF-IDF) and the relatedness of semantic concepts in sparse image tags, and thereby overcomes HTC's shortcoming of ignoring term frequency-based information.
Thirdly, we have proposed a fusion scheme, Selective Weighted Later Fusion (SWLF), to combine different sources of information efficiently for VCR. SWLF is designed to select the best features and to weight their scores for each concept to be recognized. It proves particularly efficient for fusing visual and textual modalities in comparison with standard fusion schemes. While late fusion at score level is known as a simple and effective way to fuse features of different natures in machine-learning problems, SWLF builds on two simple insights. First, the score delivered by a feature type should be weighted by its intrinsic quality for the classification problem at hand. Second, in a multi-label scenario where several visual concepts may be assigned to an image, different visual concepts may be best recognized by different features. In addition to SWLF, we also propose a novel combination approach based on Dempster-Shafer evidence theory, whose properties allow ambiguous sources of information to be fused for visual affective recognition. [...]
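The HTC idea described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis implementation: the abstract does not name the semantic similarity measure, so the sketch uses a hypothetical hand-written lookup table (`TOY_SIMILARITY`), and the vocabulary and tags are invented for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical similarity scores between word pairs; a real system would use a
# semantic similarity measure, which the abstract leaves unspecified.
TOY_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("dog", "animal"): 0.8,
    ("puppy", "animal"): 0.7,
    ("beach", "sea"): 0.6,
}


def toy_similarity(word: str, concept: str) -> float:
    """Symmetric lookup: 1.0 for identical strings, 0.0 for unknown pairs."""
    if word == concept:
        return 1.0
    return TOY_SIMILARITY.get(
        (word, concept), TOY_SIMILARITY.get((concept, word), 0.0)
    )


def histogram_of_textual_concepts(
    words: Sequence[str],
    vocabulary: Sequence[str],
    similarity: Callable[[str, str], float],
) -> List[float]:
    """One bin per vocabulary concept; each bin accumulates the contribution
    (similarity) of every word in the document toward that concept."""
    return [sum(similarity(word, concept) for word in words) for concept in vocabulary]


if __name__ == "__main__":
    tags = ["dog", "puppy", "beach"]       # invented image tags
    vocabulary = ["animal", "sea", "car"]  # invented concept dictionary
    print(histogram_of_textual_concepts(tags, vocabulary, toy_similarity))
```

Because each bin sums similarities rather than counting exact word matches, a tag such as "puppy" contributes to the "animal" bin even though the word "animal" never appears, which is the semantic relatedness the abstract refers to.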
|
3 |
GoPubMed: Ontology-based literature search for the life sciences / GoPubMed: ontologie-basierte Literatursuche für die Lebenswissenschaften
Doms, Andreas 20 January 2009
Background: Most of our biomedical knowledge is only accessible through texts. The biomedical literature grows exponentially, and PubMed comprises more than 18,000,000 literature abstracts. Recently, much effort has been put into the creation of biomedical ontologies that capture biomedical facts. The exploitation of ontologies to explore the scientific literature is a new area of research. Motivation: When people search, they have questions in mind. Answering questions in a domain requires knowledge of the terminology of that domain. Classical search engines do not provide background knowledge for the presentation of search results. Ontology-annotated structured databases allow for data mining. The hypothesis is that ontology-annotated literature databases allow for text mining. The central problem is to associate scientific publications with ontological concepts. This is a prerequisite for ontology-based literature search. The question then is how to answer biomedical questions using ontologies and a literature corpus. Finally, the task is to automate bibliometric analyses on a corpus of scientific publications. Approach: Recent joint efforts on automatically extracting information from free text showed that the applied methods are complementary. The idea is to employ the rich terminological and relational information stored in biomedical ontologies to mark up biomedical text documents. Based on established semantic links between documents and ontology concepts, the goal is to answer biomedical questions on a corpus of documents. The fully annotated literature corpus makes it possible, for the first time, to generate bibliometric analyses automatically for ontological concepts, authors and institutions. Results: This work includes a novel framework for annotating free texts with ontological concepts. The framework generates recognition pattern rules from the terminological and relational information in an ontology.
Maximum entropy models can be trained to distinguish the meaning of ambiguous concept labels. The framework was used to develop an annotation pipeline for PubMed abstracts with 27,863 Gene Ontology concepts. The evaluation of the recognition performance yielded a precision of 79.9% and a recall of 72.7%, improving on the previously used algorithm by 25.7% in F-measure. The evaluation was done on a curation corpus of 689 PubMed abstracts with 18,356 curations of concepts, created manually by the original authors. Methods to reason over large numbers of documents with ontologies were developed. The ability to answer questions with the online system was shown on a set of biomedical questions from the TREC Genomics Track 2006 benchmark. This work includes the first ontology-based, large-scale, up-to-date bibliometric analysis available online for topics in molecular biology represented by GO concepts. The automatic bibliometric analysis is in line with existing, but often outdated, manual analyses. Outlook: A number of promising continuations of this work have been spun off. A freely available online search engine has a growing user community. A spin-off company that commercializes the new ontology-based search paradigm was funded by the High-Tech Gründerfonds. Several offshoots of GoPubMed are being developed, including GoWeb (general web search), Go3R (search in replacement, reduction and refinement methods for animal experiments) and GoGene (search in gene/protein databases).
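For readers who want to relate the reported precision and recall to a single score, the sketch below computes the F-measure. It assumes the abstract's "f-measure" is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not say which variant was used, and the function name is introduced here for illustration only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1.0 gives the balanced harmonic mean (F1).

    Assumption: the abstract's "f-measure" is F1. Returns 0.0 when both
    precision and recall are zero to avoid dividing by zero.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Precision and recall as reported in the abstract above.
    score = f_measure(0.799, 0.727)
    print(f"F1 = {score:.3f}")  # approximately 0.761
```

Under that assumption, the reported figures correspond to an F1 of roughly 76.1%. The 25.7% improvement is stated relative to the previous algorithm, whose own score is not given in the abstract.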
|
4 |
GoPubMed: Ontology-based literature search for the life sciences
Doms, Andreas 06 January 2009
Background: Most of our biomedical knowledge is only accessible through texts. The biomedical literature grows exponentially, and PubMed comprises more than 18,000,000 literature abstracts. Recently, much effort has been put into the creation of biomedical ontologies that capture biomedical facts. The exploitation of ontologies to explore the scientific literature is a new area of research. Motivation: When people search, they have questions in mind. Answering questions in a domain requires knowledge of the terminology of that domain. Classical search engines do not provide background knowledge for the presentation of search results. Ontology-annotated structured databases allow for data mining. The hypothesis is that ontology-annotated literature databases allow for text mining. The central problem is to associate scientific publications with ontological concepts. This is a prerequisite for ontology-based literature search. The question then is how to answer biomedical questions using ontologies and a literature corpus. Finally, the task is to automate bibliometric analyses on a corpus of scientific publications. Approach: Recent joint efforts on automatically extracting information from free text showed that the applied methods are complementary. The idea is to employ the rich terminological and relational information stored in biomedical ontologies to mark up biomedical text documents. Based on established semantic links between documents and ontology concepts, the goal is to answer biomedical questions on a corpus of documents. The fully annotated literature corpus makes it possible, for the first time, to generate bibliometric analyses automatically for ontological concepts, authors and institutions. Results: This work includes a novel framework for annotating free texts with ontological concepts. The framework generates recognition pattern rules from the terminological and relational information in an ontology.
Maximum entropy models can be trained to distinguish the meaning of ambiguous concept labels. The framework was used to develop an annotation pipeline for PubMed abstracts with 27,863 Gene Ontology concepts. The evaluation of the recognition performance yielded a precision of 79.9% and a recall of 72.7%, improving on the previously used algorithm by 25.7% in F-measure. The evaluation was done on a curation corpus of 689 PubMed abstracts with 18,356 curations of concepts, created manually by the original authors. Methods to reason over large numbers of documents with ontologies were developed. The ability to answer questions with the online system was shown on a set of biomedical questions from the TREC Genomics Track 2006 benchmark. This work includes the first ontology-based, large-scale, up-to-date bibliometric analysis available online for topics in molecular biology represented by GO concepts. The automatic bibliometric analysis is in line with existing, but often outdated, manual analyses. Outlook: A number of promising continuations of this work have been spun off. A freely available online search engine has a growing user community. A spin-off company that commercializes the new ontology-based search paradigm was funded by the High-Tech Gründerfonds. Several offshoots of GoPubMed are being developed, including GoWeb (general web search), Go3R (search in replacement, reduction and refinement methods for animal experiments) and GoGene (search in gene/protein databases).
|