51.
Time Dynamic Topic Models
Jähnichen, Patrick, 22 March 2016 (has links)
Information extraction from large corpora is a useful tool for many applications in industry and academia. For instance, political communication science has only recently begun to use the opportunities that come with the availability of massive amounts of information on the Internet and with the computational tools that natural language processing provides. We give a linguistically motivated interpretation of topic modeling, a state-of-the-art algorithm for extracting latent semantic sets of words from large text corpora, and extend this interpretation to cover issues and issue-cycles as theoretical constructs from political communication science. We build on a dynamic topic model, a model whose semantic sets of words are allowed to evolve over time, governed by a Brownian motion stochastic process, and apply a new form of analysis to its result. This analysis is based on the notion of volatility, known from econometrics as the rate of change of stocks or derivatives. We claim that the rate of change of sets of semantically related words can be interpreted as issue-cycles, and the word sets as describing the underlying issue. Generalizing over existing work, we introduce dynamic topic models that are driven by general Gaussian processes (Brownian motion is a special case of our model), a family of stochastic processes defined by the function that determines their covariance structure. We use the above assumption and apply a certain class of covariance functions to allow for an appropriate rate of change in word sets while preserving the semantic relatedness among words. Applying our findings to a large newspaper data set, the New York Times Annotated Corpus (all articles between 1987 and 2007), we are able to identify sub-topics in time, "time-localized topics", and find patterns in their behavior over time. However, we have to drop the assumption of semantic relatedness over all available time for any one topic. Time-localized topics are consistent in themselves but do not necessarily share semantic meaning with each other. They can, however, be interpreted to capture the notion of issues, and their behavior that of issue-cycles.
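As a rough illustration of the volatility idea (a sketch, not the thesis's exact estimator), the following Python snippet measures a topic's rate of change from the distances between its word distributions in consecutive time slices; the array layout and the distance metric are assumptions made for the example.

```python
import numpy as np

def topic_volatility(topic_word_dists):
    """Rate of change of a topic over time, in the spirit of econometric
    volatility: variability of per-step distances between the word
    distributions of consecutive time slices.

    topic_word_dists: array of shape (T, V), one word distribution per
    time slice (illustrative layout).
    """
    diffs = np.diff(topic_word_dists, axis=0)    # (T-1, V) per-step changes
    step_change = np.linalg.norm(diffs, axis=1)  # L2 change per time step
    return step_change.std()                     # volatility of the changes

# Illustrative use: one topic over 3 time slices, 4-word vocabulary.
topic = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1]])
print(topic_volatility(topic))
```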
52.
Resting-state functional connectivity in the brain and its relation to language development in preschool children
Xiao, Yaqiong, 01 December 2017 (has links)
Human infants have been shown to have an innate capacity to acquire their mother tongue. In recent decades, the advent of functional magnetic resonance imaging (fMRI) has made it feasible to explore the neural basis of language acquisition and processing in children, even in newborn infants (for reviews, see Kuhl & Rivera-Gaxiola, 2008; Kuhl, 2010).
Spontaneous low-frequency (< 0.1 Hz) fluctuations (LFFs) in the resting brain were shown to be physiologically meaningful in a seminal study (Biswal et al., 1995). Compared to task-based fMRI, resting-state fMRI (rs-fMRI) has some unique advantages in neuroimaging research, especially for obtaining data from pediatric and clinical populations. Moreover, it enables us to characterize the functional organization of the brain in a systematic manner in the absence of explicit tasks. Among brain systems, the language network has been well investigated by analyzing LFFs in the resting brain.
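As an illustration of the basic quantity involved, the following Python sketch computes functional connectivity as the Pearson correlation between two regional BOLD time series; the synthetic data stand in for band-pass-filtered (< 0.1 Hz) signals and are not from the thesis.

```python
import numpy as np

def functional_connectivity(ts_a, ts_b):
    """Resting-state functional connectivity as the Pearson correlation
    between the (already band-pass-filtered) mean BOLD time series of
    two brain regions."""
    return np.corrcoef(ts_a, ts_b)[0, 1]

# Illustrative: two synthetic regional time series, 200 volumes, with
# a shared fluctuation so that the regions are functionally coupled.
rng = np.random.default_rng(0)
shared = rng.standard_normal(200)
region_a = shared + 0.5 * rng.standard_normal(200)
region_b = shared + 0.5 * rng.standard_normal(200)
print(functional_connectivity(region_a, region_b))  # high positive correlation
```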
This thesis investigates the functional connectivity within the language network in typically developing preschool children and the covariation of this connectivity with children's language development, using rs-fMRI. The first study (see Chapter 2.1; Xiao et al., 2016a) revealed connectivity differences in language-related regions between 5-year-olds and adults, and demonstrated distinct correlation patterns between functional connections within the language network and sentence comprehension performance in children. The results showed a left fronto-temporal connection for processing syntactically more complex sentences, suggesting that this connection is already in place at age 5, when it is needed for complex sentence comprehension, even though the functional network as a whole is still immature. In the second study (see Chapter 2.2; Xiao et al., 2016b), sentence comprehension performance and rs-fMRI data were obtained from a cohort of children at age 5 and at a one-year follow-up. This study examined the changes in functional connectivity in the developing brain and their relation to the development of language abilities. The findings showed that the development of intrinsic functional connectivity in preschool children over the course of one year is clearly observable and that individual differences in this development are related to the advancement of sentence comprehension ability with age.
In summary, the present thesis provides new insights into the relationship between intrinsic functional connectivity in the brain and language processing, as well as between the changes in intrinsic functional connectivity and concurrent language development in preschool children. Moreover, it allows for a better understanding of the neural mechanisms underlying language processing and the advancement of language abilities in the developing brain.
53.
Unsupervised Natural Language Processing for Knowledge Extraction from Domain-specific Textual Resources
Hänig, Christian, 17 April 2013 (has links)
This thesis aims to develop a Relation Extraction algorithm that extracts knowledge from automotive data. While most approaches to Relation Extraction are evaluated only on newspaper data dealing with general relations from the business world, their applicability to other data sets is not well studied.
Part I of this thesis deals with the theoretical foundations of Information Extraction algorithms. Text mining cannot be seen as the simple application of data mining methods to textual data. Instead, sophisticated methods have to be employed to accurately extract knowledge from text, which can then be mined using statistical methods from the field of data mining. Information Extraction itself can be divided into two subtasks: Entity Detection and Relation Extraction. The detection of entities is very domain-dependent due to terminology, abbreviations, and general language use within the given domain. Thus, this task has to be solved for each domain, employing thesauri or another type of lexicon. Supervised approaches to Named Entity Recognition will not achieve reasonable results unless they have been trained for the given type of data.
The task of Relation Extraction can basically be approached by pattern-based and kernel-based algorithms. The latter achieve state-of-the-art results on newspaper data and point out the importance of linguistic features. In order to analyze relations contained in textual data, syntactic features like part-of-speech tags and syntactic parses are essential. Chapter 4 presents machine learning approaches and linguistic foundations essential for the syntactic annotation of textual data and for Relation Extraction. Chapter 6 analyzes the performance of state-of-the-art algorithms for POS tagging, syntactic parsing, and Relation Extraction on automotive data. The finding is that supervised methods trained on newspaper corpora do not achieve accurate results when applied to automotive data. This has various reasons. Besides low-quality text, the nature of automotive relations poses the main challenge. Automotive relation types of interest (e.g., component – symptom) are rather arbitrary compared to well-studied relation types like is-a or is-head-of. In order to achieve acceptable results, algorithms have to be trained directly on this kind of data. As the manual annotation of data for each language and data type is too costly and inflexible, unsupervised methods are the ones to rely on.
Part II deals with the development of dedicated algorithms for all three essential tasks. Unsupervised POS tagging (Chapter 7) is a well-studied task, and algorithms achieving accurate tagging exist. None of them, however, disambiguates high-frequency words; only out-of-lexicon words are disambiguated. Most high-frequency words bear syntactic information, and thus it is very important to differentiate between their different functions. Domain languages in particular contain ambiguous, high-frequency words that bear semantic information (e.g., pump). In order to improve POS tagging, an algorithm for disambiguation is developed and used to enhance an existing state-of-the-art tagger. This approach is based on context clustering, which is used to detect a word type's different syntactic functions. Evaluation shows that tagging accuracy is raised significantly.
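A minimal sketch of the context-clustering idea, assuming bag-of-words context vectors and k-means as the clustering method (the thesis's actual features and clustering algorithm may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

# Context clustering for an ambiguous high-frequency word (e.g. "pump":
# noun vs. verb). Each occurrence is represented by a bag-of-words
# vector over its neighbouring words; clusters are then taken as the
# word's distinct syntactic functions. Vectors are illustrative.
contexts = np.array([
    [1, 0, 1, 0],   # "the pump is broken"  -> noun-like context
    [1, 0, 1, 1],   # "a pump was replaced" -> noun-like context
    [0, 1, 0, 1],   # "to pump the fuel"    -> verb-like context
    [0, 1, 0, 0],   # "we pump water"       -> verb-like context
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(contexts)
print(labels)  # two context clusters ~ two syntactic functions of "pump"
```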
An approach to unsupervised syntactic parsing (Chapter 8) is developed to satisfy the requirements of Relation Extraction. These requirements include high-precision results on nominal and prepositional phrases, as they contain the entities relevant for Relation Extraction. Furthermore, accurate shallow parsing is more desirable than deep binary parsing, as it facilitates Relation Extraction better. Endocentric and exocentric constructions can be distinguished, which improves phrase labeling. unsuParse is based on preferred positions of word types within phrases to detect phrase candidates. Iterating the detection of simple phrases successively induces deeper structures. The proposed algorithm fulfills all demanded criteria and achieves competitive results on standard evaluation setups.
Syntactic Relation Extraction (Chapter 9) is an approach exploiting syntactic statistics and text characteristics to extract relations between previously annotated entities. The approach is based on entity distributions given in a corpus and thus provides a way to extend text mining processes to new data in an unsupervised manner. Evaluation on two different languages and two different text types from the automotive domain shows that it achieves accurate results on repair order data. Results are less accurate on internet data, but the tasks of sentiment analysis and extraction of the opinion target can be mastered. Thus, the incorporation of internet data is possible and important, as it provides useful insight into the customer's thoughts.
To conclude, this thesis presents a complete unsupervised workflow for Relation Extraction – except for the highly domain-dependent Entity Detection task – improving the performance of each of the involved subtasks compared to state-of-the-art approaches. Furthermore, this work applies Natural Language Processing methods and Relation Extraction approaches to real-world data, unveiling challenges that do not occur in high-quality newspaper corpora.
54.
Referenzielle Kohärenz im Erstspracherwerb: Untersuchungen zur Verarbeitung und Produktion anaphorischer Referenz
Lehmkuhle, Ina, 13 May 2022 (has links)
Referring back to referents already introduced into the discourse by means of anaphors is a central device for establishing referential coherence. The use of anaphoric referring expressions is tied to the degree of accessibility of the mental representation of a discourse referent: referring expressions such as pronouns reflect a high degree of accessibility, whereas referring expressions such as noun phrases and proper names signal a lower degree of accessibility (among others, Ariel, 1990). The relative degree of accessibility of discourse referents is influenced by various accessibility factors on different linguistic levels (for an overview: Arnold, 2010). This dissertation deals with the acquisition of referential coherence by German-speaking children, addressing both the processing and the production of anaphoric referring expressions. One question is to what extent children are able to process anaphoric referring expressions online, and to interpret them offline, as a cue to the degree of accessibility of discourse referents. A further question is to what extent children take the relative degree of accessibility of discourse referents into account when using anaphoric referring expressions in narrative text production. To determine how children differ from adults in this respect, the results of the child groups are compared with those of adult control groups. The results of the studies in this thesis show that German-speaking children are sensitive to the relative degree of accessibility of discourse referents both in the processing and in the production of anaphoric referring expressions: in online processing, three- to four-year-old children prefer personal pronouns over repeated proper names when these refer to highly accessible discourse referents (Experiment 1, eye tracking). This is taken as evidence that they understand that certain expressions are better suited than others to refer to highly accessible discourse referents. Moreover, nine- to ten-year-old children take local and global accessibility factors into account when producing anaphoric referring expressions in narrative texts (production study, picture story). The referential function (maintenance vs. reintroduction) is considered a local accessibility factor here, whereas the character type (main character vs. secondary character) represents a global accessibility factor (Vogels, 2014). In line with adults' preferences, the children predominantly use pronouns for the maintenance and noun phrases for the reintroduction of discourse referents. They differ from adults, however, with respect to the global accessibility factor of character type: unlike the adults, the children preferentially refer to main characters with pronouns and to secondary characters with noun phrases. Adults, by contrast, appear to take the global accessibility factor of character type into account only once local discourse requirements are met. This suggests that children weight accessibility factors partly differently than adults do. The behavior of eight- to nine-year-old children in the online processing of personal pronouns and definite noun phrases points to a similar interpretation (Experiment 2, eye tracking).
While the adults prefer personal pronouns over noun phrases when these refer, in the function of topic maintenance, to highly accessible discourse referents, the children show no difference in their gaze behavior with respect to these two referential forms. This suggests that, unlike the adults, the children disregard the information-structural function of these two referring expressions. Although children already take the relative degree of accessibility of discourse referents into account in many respects when processing and producing anaphoric referring expressions, the acquisition of anaphoric reference does not appear to be complete even at the end of primary school.
55.
Automatic Translation of Clinical Trial Eligibility Criteria into Formal Queries: Extended Version
Xu, Chao; Forkel, Walter; Borgwardt, Stefan; Baader, Franz; Zhou, Beihai, 29 December 2023 (has links)
Selecting patients for clinical trials is very labor-intensive. Our goal is to develop an automated system that can support doctors in this task. This paper describes a major step towards such a system: the automatic translation of clinical trial eligibility criteria from natural language into formal, logic-based queries. First, we develop a semantic annotation process that can capture many types of clinical trial criteria. Then, we map the annotated criteria to the formal query language. We have built a prototype system based on state-of-the-art NLP tools such as Word2Vec, Stanford NLP tools, and the MetaMap Tagger, and have evaluated the quality of the produced queries on a number of criteria from clinicaltrials.gov. Finally, we discuss some criteria that were hard to translate, and give suggestions for how to formulate eligibility criteria to make them easier to translate automatically.
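A toy sketch of the pipeline shape described above; the annotation tuples and the query syntax below are invented for illustration, while the actual system builds on Word2Vec, the Stanford NLP tools, and the MetaMap Tagger.

```python
# Stage 1 stands in for the semantic annotation step; stage 2 stands in
# for the mapping to a logic-based query language. Concept names and
# the query syntax are hypothetical.
def annotate(criterion: str):
    text = criterion.lower()
    if "diabetes" in text:
        negated = "no history" in text or "without" in text
        return [("Condition", "DiabetesMellitus", negated)]
    return []

def to_query(annotations):
    parts = []
    for kind, concept, negated in annotations:
        atom = f"exists({kind}, {concept})"
        parts.append(f"not {atom}" if negated else atom)
    return " and ".join(parts)

criterion = "No history of diabetes mellitus"
print(to_query(annotate(criterion)))  # -> not exists(Condition, DiabetesMellitus)
```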
56.
Zooming in on speech production: Cumulative semantic interference and the processing of compounds
Döring, Anna-Lisa, 25 April 2023 (has links)
This dissertation addresses unresolved issues concerning speech production processes and the cognitive architecture of our speech production system. The first aim was to answer the question of how compounds (e.g., goldfish) are represented on the lexical-syntactic level of our speech production system. Is there a single entry for the whole compound (GOLDFISH) or multiple ones for each of its constituents (GOLD and FISH), which are assembled for each use? To investigate this question, we used the cumulative semantic interference (CSI) effect. This semantic context effect describes the observation that speakers' naming latencies systematically increase when naming a sequence of semantically related pictures. Although CSI has been extensively used as a tool in language production research, several aspects of it are not fully understood. Thus, the second aim of this dissertation was to close some of these knowledge gaps and gain a more comprehensive understanding of CSI. In three studies, we first investigated the CSI effect before using it as a tool to study the lexical representation of compounds.
Behavioural and electrophysiological data from the first two studies point to a purely conceptual origin of CSI. Furthermore, they revealed that CSI is not influenced by the items' morphological complexity but is affected by item repetition. These findings advance our understanding of CSI and allow us to make more informed predictions when using CSI as a research tool. The last study showed that the compounds' constituents are activated during compound production, which provides evidence for a complex lexical-syntactic representation of compounds, consisting of one entry for the holistic compound and additional entries for each of its constituents. This dissertation thus reveals that the morphological complexity of compounds affects the lexical-syntactic level during speech production and thereby advances our understanding of the architecture of our speech production system.
57.
Text Mining for Pathway Curation
Weber-Genzel, Leon, 17 November 2023 (has links)
Biological knowledge often involves understanding the interactions between molecules, such as proteins and genes, that form functional networks called pathways. New knowledge about pathways is typically communicated through publications and later condensed into structured formats such as textbooks, pathway databases or mathematical models. However, curating updated pathway models can be labour-intensive due to the growing volume of publications. This thesis investigates text mining methods to support pathway curation. We present PEDL (Protein-Protein-Association Extraction with Deep Language Models), a machine learning model designed to extract protein-protein associations (PPAs) from biomedical text. PEDL uses distant supervision and pre-trained language models to achieve higher accuracy than the state of the art. An expert evaluation confirms its usefulness for pathway curators. We also present PEDL+, a command-line tool that allows non-expert users to efficiently extract PPAs. When applied to pathway curation tasks, 55.6% to 79.6% of PEDL+ extractions were found useful by curators. The large number of PPAs identified by text mining can be overwhelming for researchers. To help, we present PathComplete, a model that suggests potential extensions to a pathway. It is the first method based on supervised machine learning for this task, using transfer learning from pathway databases. Our evaluations show that PathComplete significantly outperforms existing methods. Finally, we generalise pathway extension from PPAs to more realistic complex events. Here, our novel method for conditional graph modification outperforms the current best by 13-24% accuracy on three benchmarks. We also present a new dataset for event-based pathway extension.
Overall, our results show that deep learning-based information extraction is a promising basis for supporting pathway curators.
58.
Weighted Parsing Formalisms Based on Regular Tree Grammars
Mörbitz, Richard, 06 November 2024 (has links)
This thesis is situated at the boundary between formal language theory, algebra, and natural language processing (NLP).
NLP employs a wide range of language models, from simple n-gram models to the recently successful large language models (LLMs).
Formal approaches to NLP view natural languages as formal languages, i.e., infinite sets of strings, where each phrase is seen as a string, and they seek finite descriptions of these sets.
Beyond language modeling, NLP deals with tasks such as syntactic analysis (or parsing), translation, information retrieval, and many others.
Solving such tasks using language models involves two steps:
Given a phrase of natural language, the model first builds a representation of the phrase and then computes the solution from that representation.
Formal language models usually employ trees or similar structures as representations, whose evaluation to output values can be elegantly described using algebra.
Chomsky introduced phrase structure grammars, which describe a process of generating strings using rewriting rules.
For modeling natural language, these rules follow an important aspect of its syntax: constituency, i.e., the hierarchical structure of phrases.
The best known grammar formalism is given by context-free grammars (CFG).
However, CFG fail to model discontinuities in constituency, where several non-adjacent parts of a phrase form a subphrase.
For instance, the German sentence “ich war auch einkaufen” can be understood such that “ich auch” forms a noun phrase; it is discontinuous because it is interrupted by the verb “war”.
This problem can be solved by employing more expressive grammar formalisms such as linear context-free rewriting systems (LCFRS).
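For illustration, a hypothetical LCFRS rule (not taken from the thesis) with a fan-out-2 noun phrase covers exactly this kind of discontinuity:

```latex
% NP contributes two separated spans x_1 ("ich") and x_2 ("auch");
% the S-rule interleaves them with y ("war") and z ("einkaufen").
\[
  S(x_1\, y\, x_2\, z) \;\to\; \mathrm{NP}(x_1, x_2)\quad \mathrm{V}(y)\quad \mathrm{VP}(z)
\]
```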
There are also grammar formalisms that generate sets of trees, e.g., regular tree grammars (RTG).
A similar formalism exists in finite-state tree automata (FTA), whose semantics is defined in terms of accepting an input rather than generating it; FTA and RTG have the same expressiveness.
Universal algebra lets us view trees as elements of a term algebra, which can be evaluated to values in another algebra by applying a unique homomorphism.
For instance, the strings generated by a CFG can be obtained by evaluating trees over the rules of the CFG in this way.
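This evaluation is the standard initial algebra semantics: for a ranked alphabet Σ and a Σ-algebra A, the unique homomorphism ⟦·⟧ from the term algebra to A satisfies, for every k-ary σ ∈ Σ:

```latex
\[
  \llbracket \sigma(t_1, \ldots, t_k) \rrbracket
  \;=\; \sigma_A\bigl(\llbracket t_1 \rrbracket, \ldots, \llbracket t_k \rrbracket\bigr)
\]
```

For a CFG, σ_A concatenates the strings of the subtrees according to the rule σ, so evaluating a derivation tree bottom-up yields the derived string.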
Parsing is the problem of computing the constituency structure of a given phrase. Due to the ambiguity of natural language, several such structures may exist.
This problem can be extended by weights such as probabilities in order to compute, for instance, the best constituency structure.
The framework of semiring parsing abstracts from particular weights and is instead parameterized by a semiring, whereby many NLP problems can be obtained by plugging in an appropriate semiring.
However, the semiring parsing algorithm is only applicable to some problem instances. Weighted deductive parsing is a similar framework that employs a different algorithm, and thus its applicability differs.
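As a minimal sketch of this parameterization (illustrative Python, not the framework's actual interface), the same combination routine yields different NLP quantities depending on the plugged-in semiring:

```python
from dataclasses import dataclass
from typing import Callable

# The parsing machinery is written once; the semiring decides what is
# computed. Viterbi and inside are standard instances; the item
# weights below are illustrative.
@dataclass
class Semiring:
    plus: Callable[[float, float], float]   # aggregate alternative derivations
    times: Callable[[float, float], float]  # combine sub-derivations
    zero: float
    one: float

viterbi = Semiring(max, lambda x, y: x * y, 0.0, 1.0)                 # best derivation
inside = Semiring(lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)   # total mass

def combine(sr: Semiring, alternatives):
    """Each alternative is a list of sub-derivation weights for one item."""
    total = sr.zero
    for alt in alternatives:
        w = sr.one
        for v in alt:
            w = sr.times(w, v)
        total = sr.plus(total, w)
    return total

alts = [[0.5, 0.4], [0.3, 0.9]]   # two derivations of the same item
print(combine(viterbi, alts))     # ~0.27: weight of the best derivation
print(combine(inside, alts))      # ~0.47: summed weight of all derivations
```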
We introduce a very general language model in the form of the RTG-based language model (RTG-LM) which consists of an RTG and a language algebra.
The RTG generates the constituency structures of a language and, inspired by the initial algebra semantics, the language algebra evaluates these structures to elements of the modeled language; we call these elements syntactic objects.
Through the free choice of the language algebra, many common grammar formalisms, such as CFG and LCFRS, are covered.
We add multioperator monoids, a generalization of semirings, as a weight algebra to RTG-LM and obtain weighted RTG-based language models (wRTG-LM).
This lets us define an abstract weighted parsing problem, called the M-monoid parsing problem.
Its inputs are a wRTG-LM 𝐺 and a syntactic object 𝑎, and it asks to compute all representations that 𝐺 has for 𝑎 under the language algebra.
Then, these representations are evaluated to values in the weight algebra, and the values of all these representations are summed to a single output value.
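Schematically, and with notation assumed here rather than quoted from the thesis, the problem computes a single value of the form:

```latex
% reps_G(a) is the set of trees generated by the RTG of G that the
% language algebra evaluates to a, wt(t) is the value of t in the
% weight algebra, and the big operator is the monoid summation.
\[
  \mathrm{parse}(G, a) \;=\; \bigoplus_{t \,\in\, \mathrm{reps}_G(a)} \mathrm{wt}(t)
\]
```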
We propose the M-monoid parsing algorithm to solve this problem. It generalizes both the semiring parsing algorithm and the weighted deductive parsing algorithm in a way that is inspired by Mohri's single-source shortest distance algorithm.
We prove two sufficient conditions for the termination and correctness of our algorithm.
We show that our framework covers semiring parsing, weighted deductive parsing, and other problems from NLP and beyond.
In the second part of this thesis, we explore constituent tree automata (CTA), a generalization of FTA, as a language model tailored towards modeling discontinuity.
We show several properties of CTA, including that their constituency parsing problem is an instance of our M-monoid parsing problem and can, for a large class of CTA, be solved by the M-monoid parsing algorithm.
This thesis aims to contribute a unifying formal framework for the specification of language models and NLP tasks.
Through our general M-monoid parsing algorithm, we also provide a means of investigating the algorithmic solvability of problems within this field.
59.
Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data
Hellmann, Sebastian, 12 January 2015 (has links) (PDF)
This thesis is a compendium of scientific works and engineering
specifications that have been contributed to a large community of
stakeholders to be copied, adapted, mixed, built upon and exploited in
any way possible to achieve a common goal: Integrating Natural Language
Processing (NLP) and Language Resources Using Linked Data.
The explosion of information technology in the last two decades has led
to a substantial growth in quantity, diversity and complexity of
web-accessible linguistic data. These resources become even more useful
when linked with each other and the last few years have seen the
emergence of numerous approaches in various disciplines concerned with
linguistic resources and NLP tools. It is the challenge of our time to
store, interlink and exploit this wealth of data accumulated in more
than half a century of computational linguistics, of empirical,
corpus-based study of language, and of computational lexicography in all
its heterogeneity.
The vision of the Giant Global Graph (GGG) was conceived by Tim
Berners-Lee, aiming at connecting all data on the Web and allowing new
relations between this openly accessible data to be discovered. This vision
has been pursued by the Linked Open Data (LOD) community, where the
cloud of published datasets comprises 295 data repositories and more
than 30 billion RDF triples (as of September 2011).
RDF is based on globally unique and accessible URIs and it was
specifically designed to establish links between such URIs (or
resources). This is captured in the Linked Data paradigm that postulates
four rules: (1) Referred entities should be designated by URIs, (2)
these URIs should be resolvable over HTTP, (3) data should be
represented by means of standards such as RDF, (4) and a resource should
include links to other resources.
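As a small illustration of these four rules, the following Python sketch (using the rdflib library; the example URIs are illustrative) publishes one resource as RDF and links it to another resource:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS, OWL

EX = Namespace("http://example.org/resource/")   # rules 1/2: HTTP URIs name entities

g = Graph()
g.add((EX.Leipzig, RDFS.label, Literal("Leipzig")))          # rule 3: data as RDF
g.add((EX.Leipzig, OWL.sameAs,
       URIRef("http://dbpedia.org/resource/Leipzig")))       # rule 4: link out
print(g.serialize(format="turtle"))
```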
Although it is difficult to precisely identify the reasons for the
success of the LOD effort, advocates generally argue that open licenses
as well as open access are key enablers for the growth of such a network
as they provide a strong incentive for collaboration and contribution by
third parties. In his keynote at BNCOD 2011, Chris Bizer argued that
with RDF the overall data integration effort can be “split between data
publishers, third parties, and the data consumer”, a claim that can be
substantiated by observing the evolution of many large data sets
constituting the LOD cloud.
As written in the acknowledgement section, parts of this thesis have
received extensive feedback from other scientists, practitioners and
industry in many different ways. The main contributions of this thesis
are summarized here:
Part I – Introduction and Background.
During his keynote at the Language Resource and Evaluation Conference in
2012, Sören Auer stressed the decentralized, collaborative, interlinked
and interoperable nature of the Web of Data. The keynote provides strong
evidence that Semantic Web technologies such as Linked Data are on their
way to becoming mainstream for the representation of language resources.
The jointly written companion publication for the keynote was later
extended as a book chapter in The People’s Web Meets NLP and serves as
the basis for “Introduction” and “Background”, outlining some stages of
the Linked Data publication and refinement chain. Both chapters stress
the importance of open licenses and open access as an enabler for
collaboration, the ability to interlink data on the Web as a key feature
of RDF as well as provide a discussion about scalability issues and
decentralization. Furthermore, we elaborate on how conceptual
interoperability can be achieved by (1) re-using vocabularies, (2) agile
ontology development, (3) meetings to refine and adapt ontologies and
(4) tool support to enrich ontologies and match schemata.
Part II - Language Resources as Linked Data.
“Linked Data in Linguistics” and “NLP & DBpedia, an Upward Knowledge
Acquisition Spiral” summarize the results of the Linked Data in
Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in
2013 and give a preview of the MLOD special issue. In total, five
proceedings – three published at CEUR (OKCon 2011, WoLE 2012, NLP &
DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012)
and one journal special issue (Multilingual Linked Open Data, MLOD to
appear) – have been (co-)edited to create incentives for scientists to
convert and publish Linked Data and thus to contribute open and/or
linguistic data to the LOD cloud. Based on the disseminated call for
papers, 152 authors contributed one or more accepted submissions to our
venues and 120 reviewers were involved in peer-reviewing.
“DBpedia as a Multilingual Language Resource” and “Leveraging the
Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked
Data Cloud” contain this thesis’ contribution to the DBpedia Project in
order to further increase the size and inter-linkage of the LOD Cloud
with lexical-semantic resources. Our contribution comprises extracted
data from Wiktionary (an online, collaborative dictionary similar to
Wikipedia) in more than four languages (now six) as well as
language-specific versions of DBpedia, including a quality assessment of
inter-language links between Wikipedia editions and internationalized
content negotiation rules for Linked Data. In particular, this work
created the foundation for a DBpedia Internationalisation
Committee with members from over 15 different languages with the common
goal to push DBpedia as a free and open multilingual language resource.
Part III - The NLP Interchange Format (NIF).
“NIF 2.0 Core Specification”, “NIF 2.0 Resources and Architecture” and
“Evaluation and Related Work” constitute one of the main contribution of
this thesis. The NLP Interchange Format (NIF) is an RDF/OWL-based format
that aims to achieve interoperability between Natural Language
Processing (NLP) tools, language resources and annotations. The core
specification describes which URI schemes and RDF
vocabularies must be used for (parts of) natural language texts and
annotations in order to create an RDF/OWL-based interoperability layer
with NIF built upon Unicode Code Points in Normal Form C. The classes
and properties of the NIF Core Ontology are described to formally define
the relations between text, substrings and their URI schemes, and the
evaluation of NIF is also included.
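To give a flavor of the format, here is a toy Python sketch using rdflib: a substring is addressed by a character-offset URI (RFC 5147 style, as in NIF) and annotated in RDF. The document URI is fictitious, only a small subset of nif: terms is shown, and a real NIF document also needs context and type information omitted here.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# NIF Core namespace (NIF 2.0).
NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")

doc = "http://example.org/doc1"                 # fictitious document URI
text = "Diego Maradona played football."

g = Graph()
substring = URIRef(f"{doc}#char=0,14")          # addresses "Diego Maradona"
g.add((substring, RDF.type, NIF.String))
g.add((substring, NIF.beginIndex, Literal(0)))
g.add((substring, NIF.endIndex, Literal(14)))
g.add((substring, NIF.anchorOf, Literal(text[0:14])))
print(g.serialize(format="turtle"))
```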
In a questionnaire, we asked 13 developers using NIF. UIMA,
GATE and Stanbol are extensible NLP frameworks, and NIF was not yet able
to provide off-the-shelf NLP domain ontologies for all possible domains,
but only for the plugins used in this study. After inspecting the
software, the developers agreed, however, that NIF is adequate to
provide generic RDF output using literal objects for
annotations. All developers were able to map the internal data structure
to NIF URIs to serialize RDF output (Adequacy). The development effort
in hours (ranging between 3 and 40 hours) as well as the number of code
lines (ranging between 110 and 445) suggest that the implementation of
NIF wrappers is easy and fast for an average developer. Furthermore, the
evaluation contains a comparison to other formats and an evaluation of
the available URI schemes for web annotation.
In order to collect input from the wide group of stakeholders, a total
of 16 presentations were given with extensive discussions and feedback,
which has led to a constant improvement of NIF from 2010 until 2013.
After the release of NIF (Version 1.0) in November 2011, a total of 32
vocabulary employments and implementations for different NLP tools and
converters were reported (8 by the (co-)authors, including Wiki-link
corpus, 13 by people participating in our survey and 11 more, of
which we have heard). Several roll-out meetings and tutorials were held
(e.g. in Leipzig and Prague in 2013) and are planned (e.g. at LREC
2014).
Part IV - The NLP Interchange Format in Use.
“Use Cases and Applications for NIF” and “Publication of Corpora using
NIF” describe 8 concrete instances where NIF has been successfully used.
One major contribution is the usage of NIF as the recommended RDF
mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard
and the conversion algorithms from ITS to NIF and back. One outcome
of the discussions in the standardization meetings and telephone
conferences for ITS 2.0 resulted in the conclusion there was no
alternative RDF format or vocabulary other than NIF with the required
features to fulfill the working group charter. Five further uses of NIF
are described for the Ontology of Linguistic Annotations (OLiA), the
RDFaCE tool, the Tiger Corpus Navigator, the OntosFeeder and
visualisations of NIF using the RelFinder tool. These 8 instances
provide an implemented proof-of-concept of the features of NIF.
The latter starts with describing the conversion and hosting of the huge Google
Wikilinks corpus with 40 million annotations for 3 million web sites.
The resulting RDF dump contains 477 million triples in a 5.6 GB
compressed dump file in Turtle syntax. We also describe how NIF can be used to
publish extracted facts from news feeds in the RDFLiveNews tool as
Linked Data.
Part V - Conclusions.
This part provides lessons learned for NIF, conclusions and an outlook on future
work. Most of the contributions are already summarized above. One
particular aspect worth mentioning is the increasing number of
NIF-formatted corpora for Named Entity Recognition (NER) that have come
into existence after the publication of the main NIF paper Integrating
NLP using Linked Data at ISWC 2013. These include the corpora converted
by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an
OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of
three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP
Interchange Format for Open German Governmental Data, N^3 – A Collection
of Datasets for Named Entity Recognition and Disambiguation in the NLP
Interchange Format and Global Intelligent Content: Active Curation of
Language Resources using Linked Data as well as an early implementation
of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr.
Further funding for the maintenance, interlinking and publication of
Linguistic Linked Data as well as support and improvements of NIF is
available via the expiring LOD2 EU project, as well as the CSA EU
project called LIDER, which started in November 2013. Based on the
evidence of successful adoption presented in this thesis, we can expect
a decent to high chance of reaching critical mass of Linked Data
technology as well as the NIF standard in the field of Natural Language
Processing and Language Resources.
60.
A Fine-Grain Scalable and Channel-Adaptive Hybrid Speech Coding Scheme for Voice over Wireless IP: Improvements Through Multiple Description Coding
Zibull, Marco, 30 October 2006 (has links)
No description available.