  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
371

Sprache und Geschlecht?

Švitek, Mihael 02 May 2023
The contribution from German linguistics, Mihael Švitek's (M.A.) metacritical study Sprache und Geschlecht? Dekonstruktive Lesarten (in) der linguistischen Genderforschung, positions itself as 'boundary-crossing'. Švitek's aim is to use a corpus of introductory linguistics texts to critically expose the substantive, rhetorical, and methodological abstinence of gender linguistics (itself polymethodical) toward deconstructive gender theories. The author finds this deficiency all the more problematic because linguistics thereby excludes precisely those gender-theoretical approaches that insist on the inescapably linguistic nature of world, bodies, gender, perception, and knowledge. Taking himself 'fully at his word', Švitek discursivizes and performs his arguments in a "balancing act between disciplinary rigor and deconstructive gesture", in a simultaneity of critical re-reading ('Wiederlesen'), deconstructive reading against the grain ('Widerlesen'), and constructive further thinking. Formulated for now as a desideratum, this further thinking aims at developing a cross-disciplinary and intersectional analytical apparatus that enables nothing less than a "more holistic treatment of people". Self-critically, however, the contribution closes with the doubt "whether forcing real human life into analytical categories can ever draw an adequate picture of reality, or whether an irresolvable 'etc.' must always remain."
372

Swedish Modal Particles / Analyses of ju, väl, nog and visst

Abendroth Scherf, Nathalie Katharina 07 November 2019
This thesis addresses the question of whether modal particles (MPs) in Swedish differ from sentence adverbs at the level of syntax. It shows that MPs do differ from sentence adverbs and, further, that they must be divided into two types. I present a syntactic analysis that accounts for the two types as elements of distinct phrasal status. The analysis is tested empirically in six experiments, whose results confirm it. Further, I show that accounting for the linearisation of MPs, DPs, and object pronouns in the middle field requires considering not only syntactic but also phonological properties of the elements involved.
373

Gender Agreement Patterns in Heritage Russian

Krüger, Irina 27 July 2021
In this dissertation, I raise the issue of grammatical gender in Russian as a heritage language. In particular, the thesis aims to determine the major principles governing the use of gender agreement patterns with four classes of exceptional animate nouns in heritage Russian: hybrid nouns referring to females (e.g. doktor 'doctor'), common gender nouns (e.g. sirota 'orphan'), female names ending in -ik/-ok (e.g. Irch-ik), and male terms and male names ending in -a/-ja. For this purpose, I conducted an experimental study on gender agreement with monolingual and bilingual native speakers of Russian, consisting of two large tasks: a translation task and a multiple-choice task. A detailed analysis of the results has led to the following conclusions. Advanced heritage speakers are able to achieve target-like proficiency in gender agreement in transparent contexts and in some situations of form-meaning mismatch. The use of agreement patterns strongly depends on the speakers' language proficiency. Less proficient speakers tend to have more problems with referential nouns. Importantly, this dissertation provides evidence for the role of variability in successful heritage language acquisition. Variability of grammatical structures leads to inconsistent input, which makes these structures harder for heritage speakers to acquire and leads to incomplete acquisition. As a result, heritage speakers fail to acquire the generic component of the semantic structure of hybrid nouns. This in turn results in a divergence between monolingual and bilingual speakers in the use of agreement patterns with those exceptional nouns that allow variability (hybrids, female names ending in -ik/-ok). This divergence is realised without overt errors and represents an example of covert language restructuring. Apart from that, the thesis touches upon the development of standard Russian: the monolingual speakers' reliance on the lexical criterion when choosing an agreement pattern provides evidence for the increase of analytic features in the Russian language.
374

Speech Act Deixis / A situated dynamic account for observational and experimental insights into spoken German

Buch, Friederike Linde 24 May 2024
This dissertation provides evidence that reference to speech acts is deictic, not anaphoric, and introduces a formal model of speech act reference. Corpus data from spoken German as well as observational data from German TV talk shows demonstrate that speech acts are referred to exclusively by demonstrative expressions. Additionally, evidence from two experiments supports this observation and shows that speech acts are not referred to by personal pronouns. German native speakers rarely interpret a given personal pronoun as referring to a speech act, and they rarely choose a personal pronoun to refer to a given speech act referent when forced to choose between personal and demonstrative pronouns. Demonstratives are strongly preferred. In German, demonstrative pronouns rather than personal pronouns are used to refer to objects external to the discourse. Consequently, speech acts should be modeled as events in the utterance context rather than as parts of linguistic form and meaning. Existing theories of discourse structure lack a distinction between anaphora and deixis, while theories of reference do not integrate the concept of a speech act. Segmented Discourse Representation Theory (SDRT) introduces non-linguistic entities into discourse structure. These include speech acts, which are introduced as discourse referents, which in turn predicts anaphoric reference to speech acts. Since reference to speech acts with weakly referential expressions such as personal pronouns does not occur, the status of speech acts in SDRT must be redefined. As a variant of SDRT, I propose a discourse model that comprises two information sources, a) the semantic content of utterances and b) the immediate physical environment of the interlocutors (i.e. their joint attention), which are represented as a pair of DRSs. This model systematically distinguishes between anaphora and deixis, and therefore between reference to linguistic content and reference to linguistic containers: speech acts.
375

Topoi in der Esoterik: Linguistische Perspektiven auf esoterische Sprache

Baumgertel, Leander 22 July 2024
No description available.
376

Ausgewählte Begriffe in der Teilchenphysik: Eine qualitative Inhaltsanalyse unter Verwendung von Ansätzen der Kognitiven Linguistik

Stieler, Tom 20 August 2024
In particle physics, central concepts often have synonyms. For example, Botenteilchen (messenger particle) is also called Austauschteilchen (exchange particle) or Eichboson (gauge boson). English analogously has force carrier, messenger particles, or gauge bosons. So far there are no overviews of the use of individual terms, of how these terms are used in German and English, or of which justifications and contexts of understanding play a role. From a linguistic perspective, what counts is not only the individual term but in particular the frame (= conceptual unit of knowledge) that it evokes. The aim of this thesis was to develop a qualitative content analysis of how four central terms (Austausch 'exchange', Wechselwirkung 'interaction', Umwandlung 'transformation', Stoß 'collision') are used in particle physics. The research design follows methods of qualitative research. Data were collected in two ways: first, through a computer-assisted document analysis of selected particle physics textbooks in German and English; second, through guideline-based expert interviews conducted on the sidelines of the 26th IPPOG meeting at CERN in 2023.
377

Sprache und Ideologie: Entwurf und Kritik einer linguistischen Untersuchung von Bedeutungssystemen

Švitek, Mihael 22 October 2024
A specter is haunting the world: the specter of ideology. After various discursive shifts in recent years, this venerable term is once again being used more frequently, whether as a stigmatizing label for a political opponent's position, as a technical term in political editorials, or simply as filler or cliché. At the same time, the epistemologically oriented concept of ideology has experienced an unexpected renaissance in English-language research, maturing into a powerful analytical tool for understanding (political) worldviews and everyday beliefs. This doctoral project attempted to re-establish the term for linguistics. The concept of ideology is operationalized for research with a cultural studies focus through deep-historical theoretical reflection and applied linguistic modeling, with the aim of providing a novel method for the qualitative and quantitative investigation of worldviews. To this end, a self-developed procedure is employed that seeks to avoid the pitfalls of previous linguistic work on the concept of ideology. The examined "continents of meaning" include a large corpus of anarchist texts, formative texts of Maoism, the complete works of Rosa Luxemburg, the far-right online encyclopedia Metapedia, and the Catechism of the Catholic Church. For the first time in linguistics, the method of diffractive reading proposed by Karen Barad was tested: when excerpts from texts of various origins were read through one another, surprising and unpredictable results were achieved.
378

Measuring coselectional constraint in learner corpora: A graph-based approach

Shadrova, Anna Valer'evna 24 July 2020
This corpus-linguistic thesis analyzes the acquisition of coselectional constraints by learners of German as a foreign language in a quasi-longitudinal design based on the Kobalt corpus. Supplemented by a number of statistical analyses, the thesis primarily develops a graph-based analysis built on the graph metric Louvain modularity. The metric is computed for a range of subcorpora selected by various criteria and is extensively validated internally using several sampling techniques. The results robustly indicate a dependency of modularity on learners' proficiency level, higher modularity in L1 than in L2 speakers, lower modularity in Belarusian than in Chinese learners, and a U-shaped acquisition curve in Belarusian but not in Chinese learners. Group differences are discussed from typological, cognitive, discursive-cultural, and register perspectives. Finally, proposals for applying graph-based modeling to core-linguistic research questions are developed. In addition, gaps in the usage-based theoretical description of coselection phenomena (phraseology, idiomaticity, collocation) are identified, and a multidimensional functional model is proposed as an alternative.
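As a rough illustration of the graph metric named in this abstract, the sketch below computes Newman modularity for a fixed partition of a small undirected graph. It is not the thesis's pipeline: the Louvain method searches for a partition that maximizes this score, whereas the code here only evaluates a partition supplied by hand. The lemma graph, the community labels, and the function name `modularity` are hypothetical choices made for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a fixed partition (undirected, unweighted, non-empty graph)."""
    m = len(edges)
    intra = defaultdict(int)    # edges with both endpoints in the same community
    degree = defaultdict(int)   # summed node degree per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Q = sum over communities of (share of intra-community edges - expected share)
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical toy co-occurrence graph: two tightly knit lemma clusters joined by one edge.
edges = [
    ("Entscheidung", "treffen"), ("treffen", "Wahl"), ("Entscheidung", "Wahl"),
    ("Frage", "stellen"), ("stellen", "Antrag"), ("Frage", "Antrag"),
    ("Wahl", "Frage"),
]
two_clusters = {
    "Entscheidung": 0, "treffen": 0, "Wahl": 0,
    "Frage": 1, "stellen": 1, "Antrag": 1,
}
one_cluster = {word: 0 for word in two_clusters}

q_split = modularity(edges, two_clusters)   # partition that follows the two clusters
q_merged = modularity(edges, one_cluster)   # everything in a single community
print(f"split: {q_split:.3f}, merged: {q_merged:.3f}")
```

In this toy graph the partition that follows the two clusters scores higher than the single-community partition, which is the sense in which a higher modularity value indicates more strongly clustered coselection.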
379

Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data

Hellmann, Sebastian 09 January 2014
This thesis is a compendium of scientific works and engineering specifications that have been contributed to a large community of stakeholders to be copied, adapted, mixed, built upon and exploited in any way possible to achieve a common goal: Integrating Natural Language Processing (NLP) and Language Resources Using Linked Data. The explosion of information technology in the last two decades has led to a substantial growth in the quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked with each other, and the last few years have seen the emergence of numerous approaches in various disciplines concerned with linguistic resources and NLP tools. It is the challenge of our time to store, interlink and exploit this wealth of data accumulated in more than half a century of computational linguistics, of empirical, corpus-based study of language, and of computational lexicography in all its heterogeneity. The vision of the Giant Global Graph (GGG) was conceived by Tim Berners-Lee with the aim of connecting all data on the Web and allowing new relations to be discovered among these openly accessible data. This vision has been pursued by the Linked Open Data (LOD) community, where the cloud of published datasets comprises 295 data repositories and more than 30 billion RDF triples (as of September 2011). RDF is based on globally unique and accessible URIs and was specifically designed to establish links between such URIs (or resources). This is captured in the Linked Data paradigm, which postulates four rules: (1) referred entities should be designated by URIs, (2) these URIs should be resolvable over HTTP, (3) data should be represented by means of standards such as RDF, and (4) a resource should include links to other resources.
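The four Linked Data rules can be made concrete with a small sketch. The code below represents RDF-style statements as plain (subject, predicate, object) tuples and checks the parts of the rules that can be checked offline: whether a term is an absolute HTTP(S) URI (rules 1 and 2, without actually dereferencing it) and whether a resource links to other resources (rule 4). The helper names `is_http_uri` and `links_out` and the two sample statements are illustrative assumptions, not material from the thesis.

```python
from urllib.parse import urlparse


def is_http_uri(term):
    """True if term parses as an absolute http(s) URI (rules 1 and 2, scheme check only)."""
    parts = urlparse(term)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def links_out(subject, triples):
    """Sorted URIs of other resources that the subject links to (rule 4)."""
    return sorted(
        {o for s, _, o in triples if s == subject and is_http_uri(o) and o != subject}
    )


# Illustrative statements in triple form (rule 3): one literal value, one link.
leipzig = "http://dbpedia.org/resource/Leipzig"
triples = [
    (leipzig, "http://www.w3.org/2000/01/rdf-schema#label", "Leipzig"),
    (leipzig, "http://dbpedia.org/ontology/country", "http://dbpedia.org/resource/Germany"),
]

print(is_http_uri(leipzig))          # the subject is an HTTP URI
print(links_out(leipzig, triples))   # outgoing links to other resources
```

A literal such as "Leipzig" fails the URI check and is therefore not counted as a link, while the object of the second statement is.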
Although it is difficult to precisely identify the reasons for the success of the LOD effort, advocates generally argue that open licenses as well as open access are key enablers for the growth of such a network, as they provide a strong incentive for collaboration and contribution by third parties. In his keynote at BNCOD 2011, Chris Bizer argued that with RDF the overall data integration effort can be “split between data publishers, third parties, and the data consumer”, a claim that can be substantiated by observing the evolution of many large data sets constituting the LOD cloud. As written in the acknowledgement section, parts of this thesis have received extensive feedback from other scientists, practitioners and industry in many different ways.
The main contributions of this thesis are summarized here:
Part I - Introduction and Background. During his keynote at the Language Resource and Evaluation Conference in 2012, Sören Auer stressed the decentralized, collaborative, interlinked and interoperable nature of the Web of Data. The keynote provides strong evidence that Semantic Web technologies such as Linked Data are on their way to becoming mainstream for the representation of language resources. The jointly written companion publication for the keynote was later extended as a book chapter in The People’s Web Meets NLP and serves as the basis for “Introduction” and “Background”, outlining some stages of the Linked Data publication and refinement chain. Both chapters stress the importance of open licenses and open access as an enabler for collaboration and the ability to interlink data on the Web as a key feature of RDF, and they discuss scalability issues and decentralization. Furthermore, we elaborate on how conceptual interoperability can be achieved by (1) re-using vocabularies, (2) agile ontology development, (3) meetings to refine and adapt ontologies and (4) tool support to enrich ontologies and match schemata.
Part II - Language Resources as Linked Data. “Linked Data in Linguistics” and “NLP & DBpedia, an Upward Knowledge Acquisition Spiral” summarize the results of the Linked Data in Linguistics (LDL) Workshop in 2012 and the NLP & DBpedia Workshop in 2013 and give a preview of the MLOD special issue. In total, five proceedings have been (co-)edited to create incentives for scientists to convert and publish Linked Data and thus to contribute open and/or linguistic data to the LOD cloud: three published at CEUR (OKCon 2011, WoLE 2012, NLP & DBpedia 2013), one Springer book (Linked Data in Linguistics, LDL 2012) and one journal special issue (Multilingual Linked Open Data, MLOD, to appear). Based on the disseminated calls for papers, 152 authors contributed one or more accepted submissions to our venues and 120 reviewers were involved in peer-reviewing. “DBpedia as a Multilingual Language Resource” and “Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked Data Cloud” contain this thesis’ contribution to the DBpedia Project in order to further increase the size and inter-linkage of the LOD Cloud with lexical-semantic resources. Our contribution comprises data extracted from Wiktionary (an online, collaborative dictionary similar to Wikipedia) in more than four languages (now six) as well as language-specific versions of DBpedia, including a quality assessment of inter-language links between Wikipedia editions and internationalized content negotiation rules for Linked Data. In particular, this work created the foundation for a DBpedia Internationalisation Committee with members from over 15 different languages with the common goal of pushing DBpedia as a free and open multilingual language resource.
Part III - The NLP Interchange Format (NIF). “NIF 2.0 Core Specification”, “NIF 2.0 Resources and Architecture” and “Evaluation and Related Work” constitute one of the main contributions of this thesis.
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The core specification describes which URI schemes and RDF vocabularies must be used for (parts of) natural language texts and annotations in order to create an RDF/OWL-based interoperability layer, with NIF built upon Unicode Code Points in Normal Form C. The classes and properties of the NIF Core Ontology are described to formally define the relations between text, substrings and their URI schemes. “Evaluation and Related Work” contains the evaluation of NIF. In a questionnaire, we surveyed 13 developers using NIF. UIMA, GATE and Stanbol are extensible NLP frameworks, and NIF was not yet able to provide off-the-shelf NLP domain ontologies for all possible domains, but only for the plugins used in this study. After inspecting the software, however, the developers agreed that NIF is adequate to provide a generic RDF output using literal objects for annotations. All developers were able to map the internal data structure to NIF URIs to serialize RDF output (Adequacy). The development effort in hours (ranging between 3 and 40 hours) as well as the number of code lines (ranging between 110 and 445) suggest that implementing NIF wrappers is easy and fast for an average developer. Furthermore, the evaluation contains a comparison to other formats and an evaluation of the available URI schemes for web annotation. In order to collect input from the wide group of stakeholders, a total of 16 presentations were given with extensive discussions and feedback, which has led to a constant improvement of NIF from 2010 until 2013.
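To make the role of Unicode Normal Form C concrete, here is a minimal sketch of how an offset-based URI for a substring might be derived. The abstract does not spell out the URI scheme, so the `#char=begin,end` fragment pattern (in the style of RFC 5147) is an assumption, as are the function name `nif_string_uri` and the `http://example.org/doc1` base URI. The point of the example is only that offsets are counted in code points after NFC normalization, so that two encodings of the same text yield the same offsets.

```python
import unicodedata


def nif_string_uri(base_uri, text, substring, start=0):
    """Offset-based URI for the first occurrence of substring in the NFC-normalized text."""
    nfc_text = unicodedata.normalize("NFC", text)
    nfc_sub = unicodedata.normalize("NFC", substring)
    begin = nfc_text.index(nfc_sub, start)  # Python str indices count Unicode code points
    end = begin + len(nfc_sub)
    return f"{base_uri}#char={begin},{end}"  # assumed fragment pattern, see lead-in


# "é" written as "e" + combining acute accent: 13 code points before NFC, 12 after.
decomposed = "Cafe\u0301 au lait"
composed = unicodedata.normalize("NFC", decomposed)

uri_from_decomposed = nif_string_uri("http://example.org/doc1", decomposed, "lait")
uri_from_composed = nif_string_uri("http://example.org/doc1", composed, "lait")
print(uri_from_decomposed)
```

Without the normalization step, the decomposed input would place "lait" one code point later than the composed input, and two tools annotating the same document could disagree on the URI of the same substring.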
After the release of NIF (Version 1.0) in November 2011, a total of 32 vocabulary employments and implementations for different NLP tools and converters were reported (8 by the (co-)authors, including the Wiki-link corpus, 13 by people participating in our survey and 11 more of which we have heard). Several roll-out meetings and tutorials were held (e.g. in Leipzig and Prague in 2013) and are planned (e.g. at LREC 2014).
Part IV - The NLP Interchange Format in Use. “Use Cases and Applications for NIF” and “Publication of Corpora using NIF” describe 8 concrete instances where NIF has been successfully used. One major contribution is the usage of NIF as the recommended RDF mapping in the Internationalization Tag Set (ITS) 2.0 W3C standard and the conversion algorithms from ITS to NIF and back. One outcome of the discussions in the standardization meetings and telephone conferences for ITS 2.0 was the conclusion that no RDF format or vocabulary other than NIF had the required features to fulfill the working group charter. Five further uses of NIF are described for the Ontology of Linguistic Annotations (OLiA), the RDFaCE tool, the Tiger Corpus Navigator, the OntosFeeder and visualisations of NIF using the RelFinder tool. These 8 instances provide an implemented proof-of-concept of the features of NIF. “Publication of Corpora using NIF” starts by describing the conversion and hosting of the huge Google Wikilinks corpus with 40 million annotations for 3 million web sites. The resulting RDF dump contains 477 million triples in a 5.6 GB compressed dump file in Turtle syntax. A further section describes how NIF can be used to publish extracted facts from news feeds in the RDFLiveNews tool as Linked Data.
Part V - Conclusions. This part provides lessons learned for NIF, conclusions and an outlook on future work. Most of the contributions are already summarized above.
One particular aspect worth mentioning is the increasing number of NIF-formatted corpora for Named Entity Recognition (NER) that have come into existence after the publication of the main NIF paper Integrating NLP using Linked Data at ISWC 2013. These include the corpora converted by Steinmetz, Knuth and Sack for the NLP & DBpedia workshop and an OpenNLP-based CoNLL converter by Brümmer. Furthermore, we are aware of three LREC 2014 submissions that leverage NIF: NIF4OGGD - NLP Interchange Format for Open German Governmental Data, N^3 - A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format and Global Intelligent Content: Active Curation of Language Resources using Linked Data, as well as an early implementation of a GATE-based NER/NEL evaluation framework by Dojchinovski and Kliegr. Further funding for the maintenance, interlinking and publication of Linguistic Linked Data as well as support and improvements of NIF is available via the expiring LOD2 EU project, as well as the CSA EU project called LIDER, which started in November 2013. Based on the evidence of successful adoption presented in this thesis, we can expect a decent to high chance of reaching critical mass of Linked Data technology as well as the NIF standard in the field of Natural Language Processing and Language Resources.
CONTENTS
I Introduction and Background
1 Introduction
1.1 Natural Language Processing
1.2 Open licenses, open access and collaboration
1.3 Linked Data in Linguistics
1.4 NLP for and by the Semantic Web - the NLP Interchange Format (NIF)
1.5 Requirements for NLP Integration
1.6 Overview and Contributions
2 Background
2.1 The Working Group on Open Data in Linguistics (OWLG)
2.1.1 The Open Knowledge Foundation
2.1.2 Goals of the Open Linguistics Working Group
2.1.3 Open linguistics resources, problems and challenges
2.1.4 Recent activities and on-going developments
2.2 Technological Background
2.3 RDF as a data model
2.4 Performance and scalability
2.5 Conceptual interoperability
II Language Resources as Linked Data
3 Linked Data in Linguistics
3.1 Lexical Resources
3.2 Linguistic Corpora
3.3 Linguistic Knowledgebases
3.4 Towards a Linguistic Linked Open Data Cloud
3.5 State of the Linguistic Linked Open Data Cloud in 2012
3.6 Querying linked resources in the LLOD
3.6.1 Enriching metadata repositories with linguistic features (Glottolog → OLiA)
3.6.2 Enriching lexical-semantic resources with linguistic information (DBpedia (→ POWLA) → OLiA)
4 DBpedia as a Multilingual Language Resource: the Case of the Greek DBpedia Edition
4.1 Current state of the internationalization effort
4.2 Language-specific design of DBpedia resource identifiers
4.3 Inter-DBpedia linking
4.4 Outlook on DBpedia Internationalization
5 Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Linked Data Cloud
5.1 Related Work
5.2 Problem Description
5.2.1 Processing Wiki Syntax
5.2.2 Wiktionary
5.2.3 Wiki-scale Data Extraction
5.3 Design and Implementation
5.3.1 Extraction Templates
5.3.2 Algorithm
5.3.3 Language Mapping
5.3.4 Schema Mediation by Annotation with lemon
5.4 Resulting Data
5.5 Lessons Learned
5.6 Discussion and Future Work
5.6.1 Next Steps
5.6.2 Open Research Questions
6 NLP & DBpedia, an Upward Knowledge Acquisition Spiral
6.1 Knowledge acquisition and structuring
6.2 Representation of knowledge
6.3 NLP tasks and applications
6.3.1 Named Entity Recognition
6.3.2 Relation extraction
6.3.3 Question Answering over Linked Data
6.4 Resources
6.4.1 Gold and silver standards
6.5 Summary
III The NLP Interchange Format (NIF)
7 NIF 2.0 Core Specification
7.1 Conformance checklist
7.2 Creation
7.2.1 Definition of Strings
7.2.2 Representation of Document Content with the nif:Context Class
7.3 Extension of NIF
7.3.1 Part of Speech Tagging with OLiA
7.3.2 Named Entity Recognition with ITS 2.0, DBpedia and NERD
7.3.3 lemon and Wiktionary2RDF
8 NIF 2.0 Resources and Architecture
8.1 NIF Core Ontology
. . . . . . . . . . . . . . . . 89 8.1.1 Logical Modules . . . . . . . . . . . . . . . . . . 90 8.2 Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . 91 8.2.1 Access via REST Services . . . . . . . . . . . . . 92 8.2.2 NIF Combinator Demo . . . . . . . . . . . . . . 92 8.3 Granularity Profiles . . . . . . . . . . . . . . . . . . . . . 93 8.4 Further URI Schemes for NIF . . . . . . . . . . . . . . . 95 8.4.1 Context-Hash-based URIs . . . . . . . . . . . . . 99 9 evaluation and related work 101 9.1 Questionnaire and Developers Study for NIF 1.0 . . . . 101 9.2 Qualitative Comparison with other Frameworks and Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 9.3 URI Stability Evaluation . . . . . . . . . . . . . . . . . . 103 9.4 Related URI Schemes . . . . . . . . . . . . . . . . . . . . 104 iv the nlp interchange format in use 109 10 use cases and applications for nif 111 10.1 Internationalization Tag Set 2.0 . . . . . . . . . . . . . . 111 10.1.1 ITS2NIF and NIF2ITS conversion . . . . . . . . . 112 10.2 OLiA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 10.3 RDFaCE . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 10.4 Tiger Corpus Navigator . . . . . . . . . . . . . . . . . . 121 10.4.1 Tools and Resources . . . . . . . . . . . . . . . . 122 10.4.2 NLP2RDF in 2010 . . . . . . . . . . . . . . . . . . 123 10.4.3 Linguistic Ontologies . . . . . . . . . . . . . . . . 124 10.4.4 Implementation . . . . . . . . . . . . . . . . . . . 125 10.4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . 126 10.4.6 Related Work and Outlook . . . . . . . . . . . . 129 10.5 OntosFeeder – a Versatile Semantic Context Provider for Web Content Authoring . . . . . . . . . . . . . . . . 131 10.5.1 Feature Description and User Interface Walk- through . . . . . . . . . . . . . . . . . . . . . . . 132 10.5.2 Architecture . . . . . . . . . . . . . . . . . . . . . 134 10.5.3 Embedding Metadata . . . . . . . . . . . . . . . 
135 10.5.4 Related Work and Summary . . . . . . . . . . . 135 10.6 RelFinder: Revealing Relationships in RDF Knowledge Bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 10.6.1 Implementation . . . . . . . . . . . . . . . . . . . 137 10.6.2 Disambiguation . . . . . . . . . . . . . . . . . . . 138 10.6.3 Searching for Relationships . . . . . . . . . . . . 139 10.6.4 Graph Visualization . . . . . . . . . . . . . . . . 140 10.6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . 141 11 publication of corpora using nif 143 11.1 Wikilinks Corpus . . . . . . . . . . . . . . . . . . . . . . 143 11.1.1 Description of the corpus . . . . . . . . . . . . . 143 11.1.2 Quantitative Analysis with Google Wikilinks Cor- pus . . . . . . . . . . . . . . . . . . . . . . . . . . 144 11.2 RDFLiveNews . . . . . . . . . . . . . . . . . . . . . . . . 144 11.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . 145 11.2.2 Mapping to RDF and Publication on the Web of Data . . . . . . . . . . . . . . . . . . . . . . . . . 146 v conclusions 149 12 lessons learned, conclusions and future work 151 12.1 Lessons Learned for NIF . . . . . . . . . . . . . . . . . . 151 12.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 151 12.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 153
380

Universality and variability in the statistics of data with fat-tailed distributions: the case of word frequencies in natural languages

Gerlach, Martin 10 March 2016
Natural language is a remarkable example of a complex dynamical system that combines variation and universal structure emerging from the interaction of millions of individuals. Understanding the statistical properties of texts is not only crucial in applications of information retrieval and natural language processing, e.g. search engines, but also allows deeper insights into the organization of knowledge in the form of written text. In this thesis, we investigate the statistical and dynamical processes underlying the co-existence of universality and variability in word statistics. We combine a careful statistical analysis of large empirical databases on language usage with analytical and numerical studies of stochastic models. We find that the fat-tailed distribution of word frequencies is best described by a generalized Zipf’s law characterized by two scaling regimes, in which the values of the parameters are extremely robust with respect to time as well as the type and the size of the database under consideration, depending only on the particular language. We provide an interpretation of the two regimes in terms of a distinction of words into a finite core vocabulary and a (virtually) infinite noncore vocabulary. Proposing a simple generative process of language usage, we establish the connection to the problem of vocabulary growth, i.e. how the number of different words scales with the database size, from which we obtain a unified perspective on different universal scaling laws simultaneously appearing in the statistics of natural language. On the one hand, our stochastic model accurately predicts the expected number of different items as measured in empirical data spanning hundreds of years and 9 orders of magnitude in size, showing that the supposed vocabulary growth over time is mainly driven by database size and not by a change in vocabulary richness.
On the other hand, analysis of the variation around the expected size of the vocabulary shows anomalous fluctuation scaling, i.e. the vocabulary is a non-self-averaging quantity, and therefore fluctuations are much larger than expected. We derive how this results from topical variations in a collection of texts coming from different authors, disciplines, or times, which manifest as correlations between the frequencies of different words due to their semantic relation. We explore the consequences of topical variation in applications to language change and topic models, emphasizing the difficulties (and presenting possible solutions) that arise because the statistics of word frequencies are characterized by a fat-tailed distribution. First, we propose an information-theoretic measure, based on the Shannon-Gibbs entropy and suitable generalizations, that quantifies the similarity between different texts and allows us to determine how fast the vocabulary of a language changes over time. Second, we combine topic models from machine learning with concepts from community detection in complex networks in order to infer large-scale (mesoscopic) structures in a collection of texts. Finally, we study language change of individual words on historical time scales, i.e. how a linguistic innovation spreads through a community of speakers, providing a framework to quantitatively combine microscopic models of language change with empirical data that are only available on a macroscopic level (i.e. averaged over the population of speakers).
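
As an illustration of two quantities the abstract refers to, the following minimal Python sketch measures vocabulary growth (the number of different words as a function of text size) and an entropy-based dissimilarity between two texts. It is not taken from the thesis: the function names, the toy sentences, and the use of the standard Jensen-Shannon divergence (rather than the generalizations the abstract mentions) are assumptions made for illustration only.

```python
import math
from collections import Counter


def vocabulary_growth(tokens):
    """Return V(n), the number of distinct words among the first n tokens, for n = 1..len(tokens)."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def shannon_entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def jensen_shannon_divergence(tokens_a, tokens_b):
    """Standard Jensen-Shannon divergence (in bits) between the word-frequency
    distributions of two non-empty token lists. Assumption: plain JSD, not the
    generalized measure proposed in the thesis."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocabulary = set(counts_a) | set(counts_b)
    # Counter returns 0 for words that are missing from one of the texts.
    p = [counts_a[word] / total_a for word in vocabulary]
    q = [counts_b[word] / total_b for word in vocabulary]
    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return shannon_entropy(mixture) - (shannon_entropy(p) + shannon_entropy(q)) / 2


# Hypothetical toy texts; real analyses would use large corpora.
text_a = "the cat sat on the mat".split()
text_b = "the dog sat on the log".split()

print(vocabulary_growth(text_a))  # [1, 2, 3, 4, 4, 5]
print(jensen_shannon_divergence(text_a, text_b))
```

With this definition the divergence is 0 for identical word-frequency distributions and 1 bit for texts that share no words. Because word frequencies are fat-tailed, estimates of this kind from finite samples can be strongly affected by rare words, which is the sort of difficulty the abstract points to.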
