• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 34
  • 24
  • 1
  • 1
  • 1
  • Tagged with
  • 61
  • 23
  • 19
  • 19
  • 13
  • 12
  • 12
  • 12
  • 11
  • 11
  • 8
  • 7
  • 7
  • 7
  • 6
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
51

Time Dynamic Topic Models

Jähnichen, Patrick 22 March 2016 (has links)
Information extraction from large corpora can be a useful tool for many applications in industry and academia. For instance, political communication science has just recently begun to use the opportunities that come with the availability of massive amounts of information available through the Internet and the computational tools that natural language processing can provide. We give a linguistically motivated interpretation of topic modeling, a state-of-the-art algorithm for extracting latent semantic sets of words from large text corpora, and extend this interpretation to cover issues and issue-cycles as theoretical constructs coming from political communication science. We build on a dynamic topic model, a model whose semantic sets of words are allowed to evolve over time governed by a Brownian motion stochastic process and apply a new form of analysis to its result. Generally this analysis is based on the notion of volatility as in the rate of change of stocks or derivatives known from econometrics. We claim that the rate of change of sets of semantically related words can be interpreted as issue-cycles, the word sets as describing the underlying issue. Generalizing over the existing work, we introduce dynamic topic models that are driven by general (Brownian motion is a special case of our model) Gaussian processes, a family of stochastic processes defined by the function that determines their covariance structure. We use the above assumption and apply a certain class of covariance functions to allow for an appropriate rate of change in word sets while preserving the semantic relatedness among words. Applying our findings to a large newspaper data set, the New York Times Annotated corpus (all articles between 1987 and 2007), we are able to identify sub-topics in time, \\\\textit{time-localized topics} and find patterns in their behavior over time. However, we have to drop the assumption of semantic relatedness over all available time for any one topic. Time-localized topics are consistent in themselves but do not necessarily share semantic meaning between each other. They can, however, be interpreted to capture the notion of issues and their behavior that of issue-cycles.
52

Resting-state functional connectivity in the brain and its relation to language development in preschool children

Xiao, Yaqiong 01 December 2017 (has links)
Human infants have been shown to have an innate capacity to acquire their mother tongue. In recent decades, the advent of the functional magnetic resonance imaging (fMRI) technique has made it feasible to explore the neural basis underlying language acquisition and processing in children, even in newborn infants (for reviews, see Kuhl & Rivera-Gaxiola, 2008; Kuhl, 2010) . Spontaneous low-frequency (< 0.1 Hz) fluctuations (LFFs) in the resting brain have been shown to be physiologically meaningful in the seminal study (Biswal et al., 1995) . Compared to task-based fMRI, resting-state fMRI (rs-fMRI) has some unique advantages in neuroimaging research, especially in obtaining data from pediatric and clinical populations. Moreover, it enables us to characterize the functional organization of the brain in a systematic manner in the absence of explicit tasks. Among brain systems, the language network has been well investigated by analyzing LFFs in the resting brain. This thesis attempts to investigate the functional connectivity within the language network in typically developing preschool children and the covariation of this connectivity with children’s language development by using the rs-fMRI technique. The first study (see Chapter 2.1; Xiao et al., 2016a) revealed connectivity differences in language-related regions between 5-year-olds and adults, and demonstrated distinct correlation patterns between functional connections within the language network and sentence comprehension performance in children. The results showed a left fronto-temporal connection for processing syntactically more complex sentences, suggesting that this connection is already in place at age 5 when it is needed for complex sentence comprehension, even though the whole functional network is still immature. In the second study (see Chapter 2.2; Xiao et al., 2016b), sentence comprehension performance and rs-fMRI data were obtained from a cohort of children at age 5 and a one-year follow-up. This study examined the changes in functional connectivity in the developing brain and their relation to the development of language abilities. The findings showed that the development of intrinsic functional connectivity in preschool children over the course of one year is clearly observable and individual differences in this development are related to the advancement in sentence comprehension ability with age. In summary, the present thesis provides new insights into the relationship between intrinsic functional connectivity in the brain and language processing, as well as between the changes in intrinsic functional connectivity and concurrent language development in preschool children. Moreover, it allows for a better understanding of the neural mechanisms underlying language processing and the advancement of language abilities in the developing brain.
53

Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources

Hänig, Christian 17 April 2013 (has links)
This thesis aims to develop a Relation Extraction algorithm to extract knowledge out of automotive data. While most approaches to Relation Extraction are only evaluated on newspaper data dealing with general relations from the business world their applicability to other data sets is not well studied. Part I of this thesis deals with theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data. Instead, sophisticated methods have to be employed to accurately extract knowledge from text which then can be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations and general language use within the given domain. Thus, this task has to be solved for each domain employing thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data. The task of Relation Extraction can be basically approached by pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents machine learning approaches and linguistic foundations being essential for syntactic annotation of textual data and Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms of POS tagging, syntactic parsing and Relation Extraction on automotive data. The findings are: supervised methods trained on newspaper corpora do not achieve accurate results when being applied on automotive data. This is grounded in various reasons. Besides low-quality text, the nature of automotive relations states the main challenge. Automotive relation types of interest (e. g. component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on. Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task and algorithms achieving accurate tagging exist. All of them do not disambiguate high frequency words, only out-of-lexicon words are disambiguated. Most high frequency words bear syntactic information and thus, it is very important to differentiate between their different functions. Especially domain languages contain ambiguous and high frequent words bearing semantic information (e. g. pump). In order to improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering which is used to detect a word type’s different syntactic functions. Evaluation shows that tagging accuracy is raised significantly. An approach to unsupervised syntactic parsing (Chapter 8) is developed in order to suffice the requirements of Relation Extraction. These requirements include high precision results on nominal and prepositional phrases as they contain the entities being relevant for Relation Extraction. Furthermore, accurate shallow parsing is more desirable than deep binary parsing as it facilitates Relation Extraction more than deep parsing. Endocentric and exocentric constructions can be distinguished and improve proper phrase labeling. unsuParse is based on preferred positions of word types within phrases to detect phrase candidates. Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all demanded criteria and achieves competitive results on standard evaluation setups. Syntactic Relation Extraction (Chapter 9) is an approach exploiting syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on entity distributions given in a corpus and thus, provides a possibility to extend text mining processes to new data in an unsupervised manner. Evaluation on two different languages and two different text types of the automotive domain shows that it achieves accurate results on repair order data. Results are less accurate on internet data, but the task of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important as it provides useful insight into the customer\''s thoughts. To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction – except for the highly domain-dependent Entity Detection task – improving performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real world data unveiling challenges that do not occur in high quality newspaper corpora.
54

Referenzielle Kohärenz im Erstspracherwerb: Untersuchungen zur Verarbeitung und Produktion anaphorischer Referenz

Lehmkuhle, Ina 13 May 2022 (has links)
Die Bezugnahme auf bereits in den Diskurs eingeführte Referenten durch Anaphern stellt ein zentrales Mittel zur Etablierung referenzieller Kohärenz dar. Der Gebrauch anaphorischer Referenzausdrücke hängt dabei mit dem Grad der Zugänglichkeit der mentalen Repräsentation eines Diskursreferenten zusammen: Referenzausdrücke wie Pronomen spiegeln beispielsweise einen hohen Zugänglichkeitsgrad wider, wohingegen Referenzausdrücke wie Nominalphrasen und Eigennamen einen niedrigeren Zugänglichkeitsgrad signalisieren (u.a. Ariel, 1990). Der relative Zugänglichkeitsgrad von Diskursreferenten wird dabei von verschiedenen Zugänglichkeitsfaktoren auf unterschiedlichen sprachlichen Ebenen beeinflusst (für einen Überblick: Arnold, 2010). Diese Dissertation beschäftigt sich mit dem Erwerb referenzieller Kohärenz durch deutschsprachige Kinder. Dabei geht es sowohl um die Verarbeitung als auch um die Produktion anaphorischer Referenzausdrücke. Zum einen stellt sich hierbei die Frage, inwiefern Kinder in der Lage sind, anaphorische Referenzausdrücke als Hinweis auf den Grad der Zugänglichkeit von Diskursreferenten online zu verarbeiten und offline zu interpretieren. Zum anderen wird untersucht, inwiefern Kinder beim Gebrauch anaphorischer Referenzausdrücke den relativen Zugänglichkeitsgrad von Diskursreferenten in narrativen Textproduktionen berücksichtigen. Um zu überprüfen, auf welche Weise sich Kinder hierbei von Erwachsenen unterscheiden, werden die Ergebnisse der kindlichen Gruppe jeweils mit denen einer erwachsenen Kontrollgruppe verglichen. Die Ergebnisse der Untersuchungen in dieser Arbeit zeigen, dass deutschsprachige Kinder sowohl bei der Verarbeitung als auch bei der Produktion von anaphorischen Referenzausdrücken sensibel auf den relativen Zugänglichkeitsgrad von Diskursreferenten reagieren: Drei- bis vierjährige Kinder bevorzugen bei der Online-Verarbeitung Personalpronomen gegenüber wiederholten Eigennamen, wenn sich diese auf höchst zugängliche Diskursreferenten beziehen (Experiment 1, Eyetracking). Dies wird als Hinweis darauf gewertet, dass sie verstehen, dass sich gewisse Ausdrücke besser als andere dazu eignen, auf höchst zugängliche Diskursreferenten zu verweisen. Zudem berücksichtigen neun- bis zehnjährige Kinder bei der Produktion anaphorischer Referenzausdrücke in narrativen Textproduktionen lokale und globale Zugänglichkeitsfaktoren (Studie, Bildergeschichte). Als lokaler Zugänglichkeitsfaktor wird hier die referenzielle Funktion (Erhalt vs. Wiederaufnahme) betrachtet; der Charaktertyp (Hauptcharakter vs. Nebencharakter) repräsentiert hingegen einen globalen Zugänglichkeitsfaktor (Vogels, 2014). In Übereinstimmung mit den Präferenzen Erwachsener benutzen die Kinder überwiegend Pronomen zum Erhalt und Nominalphrasen zur Wiederaufnahme von Diskursreferenten. Ein Unterschied zu den Erwachsenen besteht jedoch im Hinblick auf den globalen Zugänglichkeitsfaktor des Charaktertyps: Anders als die Erwachsenen verweisen die Kinder vorzugsweise mit Pronomen auf Hauptcharaktere und mit Nominalphrasen auf Nebencharaktere. Erwachsene scheinen den globalen Zugänglichkeitsfaktor des Charaktertyps hingegen erst dann zu berücksichtigen, wenn lokale Diskursanforderungen erfüllt sind. Dies legt nahe, dass Kinder Zugänglichkeitsfaktoren zum Teil anders gewichten als Erwachsene. Für eine ähnliche Interpretation spricht auch das Verhalten acht- bis neunjähriger Kinder bei der Online-Verarbeitung von Personalpronomen und definiten Nominalphrasen (Experiment 2, Eyetracking). Während die Erwachsenen Personalpronomen gegenüber Nominalphrasen bevorzugen, wenn sich diese in der Funktion des Topikerhalts auf höchst zugängliche Diskursreferenten beziehen, zeigen die Kinder keinen Unterschied in ihrem Blickverhalten in Bezug auf diese beiden referenziellen Ausdrucksformen. Dies spricht dafür, dass die Kinder die informationsstrukturelle Funktion dieser beiden Referenzausdrücke im Gegensatz zu den Erwachsenen unberücksichtigt lassen. Obwohl Kinder bei der Verarbeitung und Produktion anaphorischer Referenzausdrücke bereits vielfach den relativen Zugänglichkeitsgrad von Diskursreferenten berücksichtigen, scheint der Erwerb anaphorischer Referenz auch noch am Ende der Grundschule nicht abgeschlossen zu sein.
55

Automatic Translation of Clinical Trial Eligibility Criteria into Formal Queries: Extended Version

Xu, Chao, Forkel, Walter, Borgwardt, Stefan, Baader, Franz, Zhou, Beihai 29 December 2023 (has links)
Selecting patients for clinical trials is very labor-intensive. Our goal is to develop an automated system that can support doctors in this task. This paper describes a major step towards such a system: the automatic translation of clinical trial eligibility criteria from natural language into formal, logic-based queries. First, we develop a semantic annotation process that can capture many types of clinical trial criteria. Then, we map the annotated criteria to the formal query language. We have built a prototype system based on state-of-the-art NLP tools such as Word2Vec, Stanford NLP tools, and the MetaMap Tagger, and have evaluated the quality of the produced queries on a number of criteria from clinicaltrials.gov. Finally, we discuss some criteria that were hard to translate, and give suggestions for how to formulate eligibility criteria to make them easier to translate automatically.
56

Zooming in on speech production: Cumulative semantic interference and the processing of compounds

Döring, Anna-Lisa 25 April 2023 (has links)
Diese Dissertation untersucht einige ungeklärten Aspekte der Sprachproduktion. Das erste Ziel war es zu klären, wie Komposita (z.B. Goldfisch) auf der lexikalisch-syntaktischen Ebene unseres Sprachproduktionssystems repräsentiert sind. Gibt es dort einen einzelnen lexikalischen Eintrag für das gesamte Kompositum (GOLDFISCH) oder mehrere Einträge für jedes seiner Konstituenten (GOLD und FISCH), welche beim Sprechen zusammengesetzt werden? Zur Beantwortung dieser Frage wurde die sogenannte kumulative semantische Interferenz (KSI) verwendet. Dieser semantische Kontexteffekt beschreibt die Beobachtung, dass die Benennlatenzen von Sprechern systematisch länger werden, wenn diese eine Reihe von semantisch verwandten Bildern benennen. Obwohl KSI bereits viel als Instrument in der Sprachproduktionsforschung genutzt wird, sind einige Fragen rund um den Effekt selbst noch offen. Das zweite Ziel dieser Dissertation war es daher einige dieser Fragen mit Hilfe von behavioralen und elektrophysiologischen Maßen zu beantworten, um so unser Verständnis von KSI zu erweitern. Die Ergebnisse deuten darauf hin, dass KSI ihren Ursprung auf der konzeptuellen Ebene des Sprachproduktionssystems hat und dass sie nicht von der morphologischen Komplexität der verwendeten Begriffe moduliert wird, aber davon, wie häufig diese benannt werden. Diese Erkenntnisse ermöglichen es in der Zukunft zielgenauere Vorhersagen zu machen, wenn KSI als Forschungsinstrument verwendet wird. Die Ergebnisse zeigen zudem, dass die Konstituenten von Komposita während deren Produktion aktiviert werden. Dies belegt, dass Komposita in einer komplexen Struktur repräsentiert sind, die aus einem Eintrag für das ganze Kompositum und zusätzlichen Einträgen für die Konstituenten besteht. Somit zeigen diese Ergebnisse, dass die Morphologie bereits die Repräsentationen auf der lexikalisch-syntaktischen Ebene beeinflusst und erweitern somit unser Wissen über den Aufbau unseres Sprachproduktionssystems. / This dissertation addresses unresolved issues concerning speech production processes and the cognitive architecture of our speech production system. The first aim was to answer the question how compounds (e.g., goldfish) are represented on the lexical-syntactic level of our speech production system. Is there a single entry for the whole compound (GOLDFISH) or multiple ones for each of its constituents (GOLD and FISH), which are assembled for each use? To investigate this question, we used the cumulative semantic interference (CSI) effect. This semantic context effect describes the observation that speakers’ naming latencies systematically increase when naming a sequence of semantically related pictures. Although CSI has been extensively used as a tool in language production research, several aspects of it are not fully understood. Thus, the second aim of this dissertation was to close some of these knowledge gaps and gain a more comprehensive understanding of CSI. In three studies, we first investigated the CSI effect, before using it as a tool to study the lexical representation of compounds. Behavioural and electrophysiological data from the first two studies point to a purely conceptual origin of CSI. Furthermore, they revealed that CSI is not influenced by the items’ morphological complexity but affected by item repetition. These findings advance our understanding of CSI and thus allow us to make more informed predictions when using CSI as a research tool. The last study showed that the compounds’ constituents are activated during compound production, which provides evidence for a complex lexical-syntactic representation of compounds, consisting of one entry for the holistic compound and additional entries for each of its constituents. This dissertation thus reveals that the morphological complexity of compounds affects the lexical-syntactic level during speech production and thus advances our understanding of the architecture of our speech production system.
57

Text Mining for Pathway Curation

Weber-Genzel, Leon 17 November 2023 (has links)
Biolog:innen untersuchen häufig Pathways, Netzwerke von Interaktionen zwischen Proteinen und Genen mit einer spezifischen Funktion. Neue Erkenntnisse über Pathways werden in der Regel zunächst in Publikationen veröffentlicht und dann in strukturierter Form in Lehrbüchern, Datenbanken oder mathematischen Modellen weitergegeben. Deren Kuratierung kann jedoch aufgrund der hohen Anzahl von Publikationen sehr aufwendig sein. In dieser Arbeit untersuchen wir wie Text Mining Methoden die Kuratierung unterstützen können. Wir stellen PEDL vor, ein Machine-Learning-Modell zur Extraktion von Protein-Protein-Assoziationen (PPAs) aus biomedizinischen Texten. PEDL verwendet Distant Supervision und vortrainierte Sprachmodelle, um eine höhere Genauigkeit als vergleichbare Methoden zu erreichen. Eine Evaluation durch Expert:innen bestätigt die Nützlichkeit von PEDLs für Pathway-Kurator:innen. Außerdem stellen wir PEDL+ vor, ein Kommandozeilen-Tool, mit dem auch Nicht-Expert:innen PPAs effizient extrahieren können. Drei Kurator:innen bewerten 55,6 % bis 79,6 % der von PEDL+ gefundenen PPAs als nützlich für ihre Arbeit. Die große Anzahl von PPAs, die durch Text Mining identifiziert werden, kann für Forscher:innen überwältigend sein. Um hier Abhilfe zu schaffen, stellen wir PathComplete vor, ein Modell, das nützliche Erweiterungen eines Pathways vorschlägt. Es ist die erste Pathway-Extension-Methode, die auf überwachtem maschinellen Lernen basiert. Unsere Experimente zeigen, dass PathComplete wesentlich genauer ist als existierende Methoden. Schließlich schlagen wir eine Methode vor, um Pathways mit komplexen Ereignisstrukturen zu erweitern. Hier übertrifft unsere neue Methode zur konditionalen Graphenmodifikation die derzeit beste Methode um 13-24% Genauigkeit in drei Benchmarks. Insgesamt zeigen unsere Ergebnisse, dass Deep Learning basierte Informationsextraktion eine vielversprechende Grundlage für die Unterstützung von Pathway-Kurator:innen ist. / Biological knowledge often involves understanding the interactions between molecules, such as proteins and genes, that form functional networks called pathways. New knowledge about pathways is typically communicated through publications and later condensed into structured formats such as textbooks, pathway databases or mathematical models. However, curating updated pathway models can be labour-intensive due to the growing volume of publications. This thesis investigates text mining methods to support pathway curation. We present PEDL (Protein-Protein-Association Extraction with Deep Language Models), a machine learning model designed to extract protein-protein associations (PPAs) from biomedical text. PEDL uses distant supervision and pre-trained language models to achieve higher accuracy than the state of the art. An expert evaluation confirms its usefulness for pathway curators. We also present PEDL+, a command-line tool that allows non-expert users to efficiently extract PPAs. When applied to pathway curation tasks, 55.6% to 79.6% of PEDL+ extractions were found useful by curators. The large number of PPAs identified by text mining can be overwhelming for researchers. To help, we present PathComplete, a model that suggests potential extensions to a pathway. It is the first method based on supervised machine learning for this task, using transfer learning from pathway databases. Our evaluations show that PathComplete significantly outperforms existing methods. Finally, we generalise pathway extension from PPAs to more realistic complex events. Here, our novel method for conditional graph modification outperforms the current best by 13-24% accuracy on three benchmarks. We also present a new dataset for event-based pathway extension. Overall, our results show that deep learning-based information extraction is a promising basis for supporting pathway curators.
58

Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data

Hellmann, Sebastian 12 January 2015 (has links) (PDF)
This thesis is a compendium of scientific works and engineering specifications that have been contributed to a large community of stakeholders to be copied, adapted, mixed, built upon and exploited in any way possible to achieve a common goal: Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data The explosion of information technology in the last two decades has led to a substantial growth in quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked with each other and the last few years have seen the emergence of numerous approaches in various disciplines concerned with linguistic resources and NLP tools. It is the challenge of our time to store, interlink and exploit this wealth of data accumulated in more than half a century of computational linguistics, of empirical, corpus-based study of language, and of computational lexicography in all its heterogeneity. The vision of the Giant Global Graph (GGG) was conceived by Tim Berners-Lee aiming at connecting all data on the Web and allowing to discover new relations between this openly-accessible data. This vision has been pursued by the Linked Open Data (LOD) community, where the cloud of published datasets comprises 295 data repositories and more than 30 billion RDF triples (as of September 2011). RDF is based on globally unique and accessible URIs and it was specifically designed to establish links between such URIs (or resources). This is captured in the Linked Data paradigm that postulates four rules: (1) Referred entities should be designated by URIs, (2) these URIs should be resolvable over HTTP, (3) data should be represented by means of standards such as RDF, (4) and a resource should include links to other resources. Although it is difficult to precisely identify the reasons for the success of the LOD effort, advocates generally argue that open licenses as well as open access are key enablers for the growth of such a network as they provide a strong incentive for collaboration and contribution by third parties. In his keynote at BNCOD 2011, Chris Bizer argued that with RDF the overall data integration effort can be “split between data publishers, third parties, and the data consumer”, a claim that can be substantiated by observing the evolution of many large data sets constituting the LOD cloud. As written in the acknowledgement section, parts of this thesis has received numerous feedback from other scientists, practitioners and industry in many different ways. The main contributions of this thesis are summarized here: Part I – Introduction and Background. During his keynote at the Language Resource and Evaluation Conference in 2012, Sören Auer stressed the decentralized, collaborative, interlinked and interoperable nature of the Web of Data. The keynote provides strong evidence that Semantic Web technologies such as Linked Data are on its way to become main stream for the representation of language resources. The jointly written companion publication for the keynote was later extended as a book chapter in The People’s Web Meets NLP and serves as the basis for “Introduction” and “Background”, outlining some stages of the Linked Data publication and refinement chain. Both chapters stress the importance of open licenses and open access as an enabler for collaboration, the ability to interlink data on the Web as a key feature of RDF as well as provide a discussion about scalability issues and decentralization. Furthermore, we elaborate on how conceptual interoperability can be achieved by (1) re-using vocabularies, (2) agile ontology development, (3) meetings to refine and adapt ontologies and (4) tool support to enrich ontologies and match schemata. Part II - Language Resources as Linked Data. “Linked Data in Linguistics” and “NLP & DBpedia, an Upward Knowledge Acquisition Spiral” summarize the results of the Linked Data in Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in 2013 and give a preview of the MLOD special issue. In total, five proceedings – three published at CEUR (OKCon 2011, WoLE 2012, NLP & DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012) and one journal special issue (Multilingual Linked Open Data, MLOD to appear) – have been (co-)edited to create incentives for scientists to convert and publish Linked Data and thus to contribute open and/or linguistic data to the LOD cloud. Based on the disseminated call for papers, 152 authors contributed one or more accepted submissions to our venues and 120 reviewers were involved in peer-reviewing. “DBpedia as a Multilingual Language Resource” and “Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked Data Cloud” contain this thesis’ contribution to the DBpedia Project in order to further increase the size and inter-linkage of the LOD Cloud with lexical-semantic resources. Our contribution comprises extracted data from Wiktionary (an online, collaborative dictionary similar to Wikipedia) in more than four languages (now six) as well as language-specific versions of DBpedia, including a quality assessment of inter-language links between Wikipedia editions and internationalized content negotiation rules for Linked Data. In particular the work described in created the foundation for a DBpedia Internationalisation Committee with members from over 15 different languages with the common goal to push DBpedia as a free and open multilingual language resource. Part III - The NLP Interchange Format (NIF). “NIF 2.0 Core Specification”, “NIF 2.0 Resources and Architecture” and “Evaluation and Related Work” constitute one of the main contribution of this thesis. The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The core specification is included in and describes which URI schemes and RDF vocabularies must be used for (parts of) natural language texts and annotations in order to create an RDF/OWL-based interoperability layer with NIF built upon Unicode Code Points in Normal Form C. In , classes and properties of the NIF Core Ontology are described to formally define the relations between text, substrings and their URI schemes. contains the evaluation of NIF. In a questionnaire, we asked questions to 13 developers using NIF. UIMA, GATE and Stanbol are extensible NLP frameworks and NIF was not yet able to provide off-the-shelf NLP domain ontologies for all possible domains, but only for the plugins used in this study. After inspecting the software, the developers agreed however that NIF is adequate enough to provide a generic RDF output based on NIF using literal objects for annotations. All developers were able to map the internal data structure to NIF URIs to serialize RDF output (Adequacy). The development effort in hours (ranging between 3 and 40 hours) as well as the number of code lines (ranging between 110 and 445) suggest, that the implementation of NIF wrappers is easy and fast for an average developer. Furthermore the evaluation contains a comparison to other formats and an evaluation of the available URI schemes for web annotation. In order to collect input from the wide group of stakeholders, a total of 16 presentations were given with extensive discussions and feedback, which has lead to a constant improvement of NIF from 2010 until 2013. After the release of NIF (Version 1.0) in November 2011, a total of 32 vocabulary employments and implementations for different NLP tools and converters were reported (8 by the (co-)authors, including Wiki-link corpus, 13 by people participating in our survey and 11 more, of which we have heard). Several roll-out meetings and tutorials were held (e.g. in Leipzig and Prague in 2013) and are planned (e.g. at LREC 2014). Part IV - The NLP Interchange Format in Use. “Use Cases and Applications for NIF” and “Publication of Corpora using NIF” describe 8 concrete instances where NIF has been successfully used. One major contribution in is the usage of NIF as the recommended RDF mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard and the conversion algorithms from ITS to NIF and back. One outcome of the discussions in the standardization meetings and telephone conferences for ITS 2.0 resulted in the conclusion there was no alternative RDF format or vocabulary other than NIF with the required features to fulfill the working group charter. Five further uses of NIF are described for the Ontology of Linguistic Annotations (OLiA), the RDFaCE tool, the Tiger Corpus Navigator, the OntosFeeder and visualisations of NIF using the RelFinder tool. These 8 instances provide an implemented proof-of-concept of the features of NIF. starts with describing the conversion and hosting of the huge Google Wikilinks corpus with 40 million annotations for 3 million web sites. The resulting RDF dump contains 477 million triples in a 5.6 GB compressed dump file in turtle syntax. describes how NIF can be used to publish extracted facts from news feeds in the RDFLiveNews tool as Linked Data. Part V - Conclusions. provides lessons learned for NIF, conclusions and an outlook on future work. Most of the contributions are already summarized above. One particular aspect worth mentioning is the increasing number of NIF-formated corpora for Named Entity Recognition (NER) that have come into existence after the publication of the main NIF paper Integrating NLP using Linked Data at ISWC 2013. These include the corpora converted by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP Interchange Format for Open German Governmental Data, N^3 – A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format and Global Intelligent Content: Active Curation of Language Resources using Linked Data as well as an early implementation of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr. Further funding for the maintenance, interlinking and publication of Linguistic Linked Data as well as support and improvements of NIF is available via the expiring LOD2 EU project, as well as the CSA EU project called LIDER, which started in November 2013. Based on the evidence of successful adoption presented in this thesis, we can expect a decent to high chance of reaching critical mass of Linked Data technology as well as the NIF standard in the field of Natural Language Processing and Language Resources.
59

A Fine-Grain Scalable and Channel-Adaptive Hybrid Speech Coding Scheme for Voice over Wireless IP / Improvements Through Multiple Description Coding / Ein feingradig skalierbares und kanaladaptives hybrides Sprachkodierungsverfahren für Voice over Wireless IP / Verbesserungen durch Multiple Description Coding

Zibull, Marco 30 October 2006 (has links)
No description available.
60

Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data

Hellmann, Sebastian 09 January 2014 (has links)
This thesis is a compendium of scientific works and engineering specifications that have been contributed to a large community of stakeholders to be copied, adapted, mixed, built upon and exploited in any way possible to achieve a common goal: Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data The explosion of information technology in the last two decades has led to a substantial growth in quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked with each other and the last few years have seen the emergence of numerous approaches in various disciplines concerned with linguistic resources and NLP tools. It is the challenge of our time to store, interlink and exploit this wealth of data accumulated in more than half a century of computational linguistics, of empirical, corpus-based study of language, and of computational lexicography in all its heterogeneity. The vision of the Giant Global Graph (GGG) was conceived by Tim Berners-Lee aiming at connecting all data on the Web and allowing to discover new relations between this openly-accessible data. This vision has been pursued by the Linked Open Data (LOD) community, where the cloud of published datasets comprises 295 data repositories and more than 30 billion RDF triples (as of September 2011). RDF is based on globally unique and accessible URIs and it was specifically designed to establish links between such URIs (or resources). This is captured in the Linked Data paradigm that postulates four rules: (1) Referred entities should be designated by URIs, (2) these URIs should be resolvable over HTTP, (3) data should be represented by means of standards such as RDF, (4) and a resource should include links to other resources. Although it is difficult to precisely identify the reasons for the success of the LOD effort, advocates generally argue that open licenses as well as open access are key enablers for the growth of such a network as they provide a strong incentive for collaboration and contribution by third parties. In his keynote at BNCOD 2011, Chris Bizer argued that with RDF the overall data integration effort can be “split between data publishers, third parties, and the data consumer”, a claim that can be substantiated by observing the evolution of many large data sets constituting the LOD cloud. As written in the acknowledgement section, parts of this thesis has received numerous feedback from other scientists, practitioners and industry in many different ways. The main contributions of this thesis are summarized here: Part I – Introduction and Background. During his keynote at the Language Resource and Evaluation Conference in 2012, Sören Auer stressed the decentralized, collaborative, interlinked and interoperable nature of the Web of Data. The keynote provides strong evidence that Semantic Web technologies such as Linked Data are on its way to become main stream for the representation of language resources. The jointly written companion publication for the keynote was later extended as a book chapter in The People’s Web Meets NLP and serves as the basis for “Introduction” and “Background”, outlining some stages of the Linked Data publication and refinement chain. Both chapters stress the importance of open licenses and open access as an enabler for collaboration, the ability to interlink data on the Web as a key feature of RDF as well as provide a discussion about scalability issues and decentralization. Furthermore, we elaborate on how conceptual interoperability can be achieved by (1) re-using vocabularies, (2) agile ontology development, (3) meetings to refine and adapt ontologies and (4) tool support to enrich ontologies and match schemata. Part II - Language Resources as Linked Data. “Linked Data in Linguistics” and “NLP & DBpedia, an Upward Knowledge Acquisition Spiral” summarize the results of the Linked Data in Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in 2013 and give a preview of the MLOD special issue. In total, five proceedings – three published at CEUR (OKCon 2011, WoLE 2012, NLP & DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012) and one journal special issue (Multilingual Linked Open Data, MLOD to appear) – have been (co-)edited to create incentives for scientists to convert and publish Linked Data and thus to contribute open and/or linguistic data to the LOD cloud. Based on the disseminated call for papers, 152 authors contributed one or more accepted submissions to our venues and 120 reviewers were involved in peer-reviewing. “DBpedia as a Multilingual Language Resource” and “Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked Data Cloud” contain this thesis’ contribution to the DBpedia Project in order to further increase the size and inter-linkage of the LOD Cloud with lexical-semantic resources. Our contribution comprises extracted data from Wiktionary (an online, collaborative dictionary similar to Wikipedia) in more than four languages (now six) as well as language-specific versions of DBpedia, including a quality assessment of inter-language links between Wikipedia editions and internationalized content negotiation rules for Linked Data. In particular the work described in created the foundation for a DBpedia Internationalisation Committee with members from over 15 different languages with the common goal to push DBpedia as a free and open multilingual language resource. Part III - The NLP Interchange Format (NIF). “NIF 2.0 Core Specification”, “NIF 2.0 Resources and Architecture” and “Evaluation and Related Work” constitute one of the main contribution of this thesis. The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The core specification is included in and describes which URI schemes and RDF vocabularies must be used for (parts of) natural language texts and annotations in order to create an RDF/OWL-based interoperability layer with NIF built upon Unicode Code Points in Normal Form C. In , classes and properties of the NIF Core Ontology are described to formally define the relations between text, substrings and their URI schemes. contains the evaluation of NIF. In a questionnaire, we asked questions to 13 developers using NIF. UIMA, GATE and Stanbol are extensible NLP frameworks and NIF was not yet able to provide off-the-shelf NLP domain ontologies for all possible domains, but only for the plugins used in this study. After inspecting the software, the developers agreed however that NIF is adequate enough to provide a generic RDF output based on NIF using literal objects for annotations. All developers were able to map the internal data structure to NIF URIs to serialize RDF output (Adequacy). The development effort in hours (ranging between 3 and 40 hours) as well as the number of code lines (ranging between 110 and 445) suggest, that the implementation of NIF wrappers is easy and fast for an average developer. Furthermore the evaluation contains a comparison to other formats and an evaluation of the available URI schemes for web annotation. In order to collect input from the wide group of stakeholders, a total of 16 presentations were given with extensive discussions and feedback, which has lead to a constant improvement of NIF from 2010 until 2013. After the release of NIF (Version 1.0) in November 2011, a total of 32 vocabulary employments and implementations for different NLP tools and converters were reported (8 by the (co-)authors, including Wiki-link corpus, 13 by people participating in our survey and 11 more, of which we have heard). Several roll-out meetings and tutorials were held (e.g. in Leipzig and Prague in 2013) and are planned (e.g. at LREC 2014). Part IV - The NLP Interchange Format in Use. “Use Cases and Applications for NIF” and “Publication of Corpora using NIF” describe 8 concrete instances where NIF has been successfully used. One major contribution in is the usage of NIF as the recommended RDF mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard and the conversion algorithms from ITS to NIF and back. One outcome of the discussions in the standardization meetings and telephone conferences for ITS 2.0 resulted in the conclusion there was no alternative RDF format or vocabulary other than NIF with the required features to fulfill the working group charter. Five further uses of NIF are described for the Ontology of Linguistic Annotations (OLiA), the RDFaCE tool, the Tiger Corpus Navigator, the OntosFeeder and visualisations of NIF using the RelFinder tool. These 8 instances provide an implemented proof-of-concept of the features of NIF. starts with describing the conversion and hosting of the huge Google Wikilinks corpus with 40 million annotations for 3 million web sites. The resulting RDF dump contains 477 million triples in a 5.6 GB compressed dump file in turtle syntax. describes how NIF can be used to publish extracted facts from news feeds in the RDFLiveNews tool as Linked Data. Part V - Conclusions. provides lessons learned for NIF, conclusions and an outlook on future work. Most of the contributions are already summarized above. One particular aspect worth mentioning is the increasing number of NIF-formated corpora for Named Entity Recognition (NER) that have come into existence after the publication of the main NIF paper Integrating NLP using Linked Data at ISWC 2013. These include the corpora converted by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP Interchange Format for Open German Governmental Data, N^3 – A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format and Global Intelligent Content: Active Curation of Language Resources using Linked Data as well as an early implementation of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr. Further funding for the maintenance, interlinking and publication of Linguistic Linked Data as well as support and improvements of NIF is available via the expiring LOD2 EU project, as well as the CSA EU project called LIDER, which started in November 2013. Based on the evidence of successful adoption presented in this thesis, we can expect a decent to high chance of reaching critical mass of Linked Data technology as well as the NIF standard in the field of Natural Language Processing and Language Resources.:CONTENTS i introduction and background 1 1 introduction 3 1.1 Natural Language Processing . . . . . . . . . . . . . . . 3 1.2 Open licenses, open access and collaboration . . . . . . 5 1.3 Linked Data in Linguistics . . . . . . . . . . . . . . . . . 6 1.4 NLP for and by the Semantic Web – the NLP Inter- change Format (NIF) . . . . . . . . . . . . . . . . . . . . 8 1.5 Requirements for NLP Integration . . . . . . . . . . . . 10 1.6 Overview and Contributions . . . . . . . . . . . . . . . 11 2 background 15 2.1 The Working Group on Open Data in Linguistics (OWLG) 15 2.1.1 The Open Knowledge Foundation . . . . . . . . 15 2.1.2 Goals of the Open Linguistics Working Group . 16 2.1.3 Open linguistics resources, problems and chal- lenges . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.4 Recent activities and on-going developments . . 18 2.2 Technological Background . . . . . . . . . . . . . . . . . 18 2.3 RDF as a data model . . . . . . . . . . . . . . . . . . . . 21 2.4 Performance and scalability . . . . . . . . . . . . . . . . 22 2.5 Conceptual interoperability . . . . . . . . . . . . . . . . 22 ii language resources as linked data 25 3 linked data in linguistics 27 3.1 Lexical Resources . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Linguistic Corpora . . . . . . . . . . . . . . . . . . . . . 30 3.3 Linguistic Knowledgebases . . . . . . . . . . . . . . . . 31 3.4 Towards a Linguistic Linked Open Data Cloud . . . . . 32 3.5 State of the Linguistic Linked Open Data Cloud in 2012 33 3.6 Querying linked resources in the LLOD . . . . . . . . . 36 3.6.1 Enriching metadata repositories with linguistic features (Glottolog → OLiA) . . . . . . . . . . . 36 3.6.2 Enriching lexical-semantic resources with lin- guistic information (DBpedia (→ POWLA) → OLiA) . . . . . . . . . . . . . . . . . . . . . . . . 38 4 DBpedia as a multilingual language resource: the case of the greek dbpedia edition. 39 4.1 Current state of the internationalization effort . . . . . 40 4.2 Language-specific design of DBpedia resource identifiers 41 4.3 Inter-DBpedia linking . . . . . . . . . . . . . . . . . . . 42 4.4 Outlook on DBpedia Internationalization . . . . . . . . 44 5 leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic linked data cloud 47 5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2 Problem Description . . . . . . . . . . . . . . . . . . . . 50 5.2.1 Processing Wiki Syntax . . . . . . . . . . . . . . 50 5.2.2 Wiktionary . . . . . . . . . . . . . . . . . . . . . . 52 5.2.3 Wiki-scale Data Extraction . . . . . . . . . . . . . 53 5.3 Design and Implementation . . . . . . . . . . . . . . . . 54 5.3.1 Extraction Templates . . . . . . . . . . . . . . . . 56 5.3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . 56 5.3.3 Language Mapping . . . . . . . . . . . . . . . . . 58 5.3.4 Schema Mediation by Annotation with lemon . 58 5.4 Resulting Data . . . . . . . . . . . . . . . . . . . . . . . . 58 5.5 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . 60 5.6 Discussion and Future Work . . . . . . . . . . . . . . . 60 5.6.1 Next Steps . . . . . . . . . . . . . . . . . . . . . . 61 5.6.2 Open Research Questions . . . . . . . . . . . . . 61 6 nlp & dbpedia, an upward knowledge acquisition spiral 63 6.1 Knowledge acquisition and structuring . . . . . . . . . 64 6.2 Representation of knowledge . . . . . . . . . . . . . . . 65 6.3 NLP tasks and applications . . . . . . . . . . . . . . . . 65 6.3.1 Named Entity Recognition . . . . . . . . . . . . 66 6.3.2 Relation extraction . . . . . . . . . . . . . . . . . 67 6.3.3 Question Answering over Linked Data . . . . . 67 6.4 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.4.1 Gold and silver standards . . . . . . . . . . . . . 69 6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 iii the nlp interchange format (nif) 73 7 nif 2.0 core specification 75 7.1 Conformance checklist . . . . . . . . . . . . . . . . . . . 75 7.2 Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 7.2.1 Definition of Strings . . . . . . . . . . . . . . . . 78 7.2.2 Representation of Document Content with the nif:Context Class . . . . . . . . . . . . . . . . . . 80 7.3 Extension of NIF . . . . . . . . . . . . . . . . . . . . . . 82 7.3.1 Part of Speech Tagging with OLiA . . . . . . . . 83 7.3.2 Named Entity Recognition with ITS 2.0, DBpe- dia and NERD . . . . . . . . . . . . . . . . . . . 84 7.3.3 lemon and Wiktionary2RDF . . . . . . . . . . . 86 8 nif 2.0 resources and architecture 89 8.1 NIF Core Ontology . . . . . . . . . . . . . . . . . . . . . 89 8.1.1 Logical Modules . . . . . . . . . . . . . . . . . . 90 8.2 Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . 91 8.2.1 Access via REST Services . . . . . . . . . . . . . 92 8.2.2 NIF Combinator Demo . . . . . . . . . . . . . . 92 8.3 Granularity Profiles . . . . . . . . . . . . . . . . . . . . . 93 8.4 Further URI Schemes for NIF . . . . . . . . . . . . . . . 95 8.4.1 Context-Hash-based URIs . . . . . . . . . . . . . 99 9 evaluation and related work 101 9.1 Questionnaire and Developers Study for NIF 1.0 . . . . 101 9.2 Qualitative Comparison with other Frameworks and Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 9.3 URI Stability Evaluation . . . . . . . . . . . . . . . . . . 103 9.4 Related URI Schemes . . . . . . . . . . . . . . . . . . . . 104 iv the nlp interchange format in use 109 10 use cases and applications for nif 111 10.1 Internationalization Tag Set 2.0 . . . . . . . . . . . . . . 111 10.1.1 ITS2NIF and NIF2ITS conversion . . . . . . . . . 112 10.2 OLiA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 10.3 RDFaCE . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 10.4 Tiger Corpus Navigator . . . . . . . . . . . . . . . . . . 121 10.4.1 Tools and Resources . . . . . . . . . . . . . . . . 122 10.4.2 NLP2RDF in 2010 . . . . . . . . . . . . . . . . . . 123 10.4.3 Linguistic Ontologies . . . . . . . . . . . . . . . . 124 10.4.4 Implementation . . . . . . . . . . . . . . . . . . . 125 10.4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . 126 10.4.6 Related Work and Outlook . . . . . . . . . . . . 129 10.5 OntosFeeder – a Versatile Semantic Context Provider for Web Content Authoring . . . . . . . . . . . . . . . . 131 10.5.1 Feature Description and User Interface Walk- through . . . . . . . . . . . . . . . . . . . . . . . 132 10.5.2 Architecture . . . . . . . . . . . . . . . . . . . . . 134 10.5.3 Embedding Metadata . . . . . . . . . . . . . . . 135 10.5.4 Related Work and Summary . . . . . . . . . . . 135 10.6 RelFinder: Revealing Relationships in RDF Knowledge Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 10.6.1 Implementation . . . . . . . . . . . . . . . . . . . 137 10.6.2 Disambiguation . . . . . . . . . . . . . . . . . . . 138 10.6.3 Searching for Relationships . . . . . . . . . . . . 139 10.6.4 Graph Visualization . . . . . . . . . . . . . . . . 140 10.6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . 141 11 publication of corpora using nif 143 11.1 Wikilinks Corpus . . . . . . . . . . . . . . . . . . . . . . 143 11.1.1 Description of the corpus . . . . . . . . . . . . . 143 11.1.2 Quantitative Analysis with Google Wikilinks Cor- pus . . . . . . . . . . . . . . . . . . . . . . . . . . 144 11.2 RDFLiveNews . . . . . . . . . . . . . . . . . . . . . . . . 144 11.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . 145 11.2.2 Mapping to RDF and Publication on the Web of Data . . . . . . . . . . . . . . . . . . . . . . . . . 146 v conclusions 149 12 lessons learned, conclusions and future work 151 12.1 Lessons Learned for NIF . . . . . . . . . . . . . . . . . . 151 12.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 151 12.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 153

Page generated in 0.0812 seconds