11 |
Morphosyntactic Corpora and Tools for Persian
Seraji, Mojgan, January 2015
This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian. In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible. Given these requirements, the thesis investigates two main research questions. The first is how can we develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse? The approach taken is to accept the tokenization variations in the corpora to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variations of writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them. Therefore, they should not be trained on idealized data. The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. 
The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
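The labeled and unlabeled parsing accuracies reported above are presumably the usual labeled and unlabeled attachment scores; the abstract does not define them, so that reading is an assumption. Under it, a minimal sketch of how such scores are computed from gold and predicted (head, label) pairs might look as follows. The function name and the toy sentence are hypothetical and are not taken from the thesis.

```python
def attachment_scores(gold, predicted):
    """Return (unlabeled, labeled) attachment scores for one token sequence.

    Both arguments are lists of (head_index, relation_label) pairs, one per
    token, with head index 0 standing for the artificial root.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # Unlabeled: only the head has to match.
    unlabeled = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labeled: head and relation label both have to match.
    labeled = sum(g == p for g, p in zip(gold, predicted))
    return unlabeled / len(gold), labeled / len(gold)


# Toy four-token sentence (hypothetical data).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
print(attachment_scores(gold, pred))  # (0.75, 0.5)
```

The labeled score can never exceed the unlabeled one, which is consistent with the gap between the two figures quoted in the abstract.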
|
12 |
Syntaktická analýza textů se střídáním kódů / Syntactic analysis of texts with code-switching
Ravishankar, Vinit, January 2018
(English) Vinit Ravishankar, July 2018. The aim of this thesis is twofold. First, we attempt to dependency parse existing code-switched corpora solely by training on monolingual dependency treebanks. To do so, we design a dependency parser and experiment with a variety of methods to improve upon the baseline established by raw training on monolingual treebanks; these methods range from treebank modification to network modification. We obtain state-of-the-art results on most evaluation criteria for our evaluation language pairs, Hindi/English and Komi/Russian. We beat our own baselines by a significant margin, whilst simultaneously beating most scores on similar tasks in the literature. The second part of the thesis introduces the relatively understudied task of predicting code-switching points in a monolingual utterance; we present several architectures that attempt to do so and offer one of them as our baseline, in the hope that it will remain state of the art in future tasks.
|
13 |
A text-mining based approach to capturing the NHS patient experience
Bahja, Mohammed, January 2017
An important issue for healthcare service providers is to achieve high levels of patient satisfaction. Collecting patient feedback about their experience in hospital enables providers to analyse their performance in terms of the levels of satisfaction and to identify the strengths and limitations of their service delivery. A common method of collecting patient feedback is via online portals and the forums of the service provider, where the patients can rate and comment on the service received. A challenge in analysing patient experience collected via online portals is that the amount of data can be huge and hence prohibitive to analyse manually. In this thesis, an automated approach to patient experience analysis via Sentiment Analysis, Topic Modelling, and Dependency Parsing methods is presented. The patient experience data collected from the National Health Service (NHS) online portal in the United Kingdom is analysed in the study to understand this experience. The study was carried out in three iterations: (1) In the first, the Sentiment Analysis method was applied, which identified whether a given patient feedback item was positive or negative. (2) The second iteration involved applying Topic Modelling methods to automatically identify themes and topics from the patient feedback. Further, the outcomes of the Sentiment Analysis study from the first iteration were utilised to identify the patient sentiment regarding the topic being discussed in a given comment. (3) In the third iteration of the study, Dependency Parsing methods were employed for each patient feedback item and the topics identified. A method was devised to summarise the reason for a particular sentiment about each of the identified topics. The outcomes of the study demonstrate that text-mining methods can be effectively utilised to identify patients’ sentiment in their feedback as well as to identify the themes and topics discussed in it.
The approach presented in the study proved capable of analysing the NHS patient feedback database automatically and effectively. Specifically, it can provide an overview of the positive and negative sentiment rates, identify the frequently discussed topics, and summarise individual patient feedback items. Moreover, an API visualisation tool is introduced to make the outcomes more accessible to health care providers.
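The combination of the first two iterations (a sentiment label and a topic for each feedback item) can be illustrated with a small aggregation step. The sketch below assumes the classifier and the topic model have already run and that each item arrives as a (topic, sentiment) pair; the function name, the label strings, and the sample data are hypothetical and do not come from the thesis.

```python
from collections import defaultdict


def positive_rate_by_topic(items):
    """Return the share of positive feedback items for each topic.

    `items` is an iterable of (topic, sentiment) pairs, where sentiment is
    the string "positive" or "negative" (an assumed label scheme).
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for topic, sentiment in items:
        counts[topic][sentiment] += 1
    return {
        topic: c["positive"] / (c["positive"] + c["negative"])
        for topic, c in counts.items()
    }


# Hypothetical output of the sentiment and topic steps.
feedback = [
    ("waiting time", "negative"),
    ("waiting time", "negative"),
    ("waiting time", "positive"),
    ("staff", "positive"),
    ("staff", "positive"),
    ("staff", "negative"),
]
rates = positive_rate_by_topic(feedback)
```

Here `rates["staff"]` is 2/3 and `rates["waiting time"]` is 1/3, which is the kind of per-topic overview the abstract describes.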
|
14 |
From Intent to Code: Using Natural Language Processing
Byström, Adam, January 2017
Programming, and the ability to express one’s intent to a machine, is becoming a very important skill in our digitalizing society. Today, instructing a machine, such as a computer, to perform actions is done through programming. What if this could be done with human language? This thesis examines how new technologies and methods in the form of Natural Language Processing can be used to make programming more accessible by translating intent expressed in natural language into code that a computer can execute. Related research has studied using natural language as a programming language and using natural language to instruct robots. These studies have shown promising results but are hindered by strict syntaxes, limited domains and an inability to handle ambiguity. Studies have also been conducted using Natural Language Processing to analyse source code, turning code into natural language. This thesis takes the reverse approach. By utilizing Natural Language Processing techniques, an intent can be translated into code containing concepts such as sequential execution, loops and conditional statements. In this study, a system for converting intent, expressed in English sentences, into code is developed. To analyse this approach to programming, an evaluation framework is developed, evaluating the system during the development process as well as usage of the final system. The results show that this way of programming might have potential but conclude that the accuracy of the Natural Language Processing models is still too low. Further research is required to increase this accuracy and to further assess the potential of this way of programming.
|
15 |
Lexical selection for machine translation
Sabtan, Yasser Muhammad Naguib Mahmoud, January 2011
Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words. This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English. This is because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries. This points to the robustness of the adopted approach. In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon and at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information. The point of the work reported here is to see how well one can do without any such manual intervention. 
With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic. This POS tagger uses a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques. Similarly, we use a lexicon-free POS tagger for English. We use the two POS taggers to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English. The two parsers are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach. These preprocessing steps pave the way for the main system (or proposer) whose task is to extract translational equivalents from the parallel corpus. The framework starts with automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again. The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89.
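The lexicon-extraction step can be illustrated with a small co-occurrence sketch. The abstract does not say which statistic the thesis uses, so the Dice coefficient below is an assumed stand-in, and the function names and the transliterated toy bitext are hypothetical. The idea shown is only the general one: count how often a source word and a target word appear in the same aligned segment, score the pair, and select the best-scoring candidate.

```python
from collections import Counter
from itertools import product


def extract_lexicon(bitext):
    """Score every (source word, target word) pair by the Dice coefficient.

    `bitext` is a list of (source_tokens, target_tokens) pairs, one per
    aligned segment. Returns {source_word: {target_word: score}}.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in bitext:
        # Count each word type once per segment.
        src_types, tgt_types = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        pair_freq.update(product(src_types, tgt_types))
    lexicon = {}
    for (src, tgt), joint in pair_freq.items():
        dice = 2 * joint / (src_freq[src] + tgt_freq[tgt])
        lexicon.setdefault(src, {})[tgt] = dice
    return lexicon


def select_equivalent(lexicon, word):
    """Pick the highest-scoring candidate for `word`, or None if unseen."""
    candidates = lexicon.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Toy transliterated bitext (hypothetical data, three aligned segments).
bitext = [
    (["kitab", "jadid"], ["new", "book"]),
    (["kitab", "qadim"], ["old", "book"]),
    (["bayt", "jadid"], ["new", "house"]),
]
lexicon = extract_lexicon(bitext)
print(select_equivalent(lexicon, "kitab"))  # book
```

No hand-built dictionary is involved, which matches the lexicon-free spirit of the abstract; a real system would also need the preprocessing and bootstrapping steps the abstract describes.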
|
16 |
Textual entailment for modern standard Arabic
Alabbas, Maytham Abualhail Shahed, January 2013
This thesis explores a range of approaches to the task of recognising textual entailment (RTE), i.e. determining whether one text snippet entails another, for Arabic, where we are faced with an exceptional level of lexical and structural ambiguity. To the best of our knowledge, this is the first attempt to carry out this task for Arabic. Tree edit distance (TED) has been widely used as a component of natural language processing (NLP) systems that attempt to achieve the goal above, with the distance between pairs of dependency trees being taken as a measure of the likelihood that one entails the other. Such a technique relies on having accurate linguistic analyses. Obtaining such analyses for Arabic is notoriously difficult. To overcome these problems we have investigated strategies for improving tagging and parsing based on system combination techniques. These strategies lead to substantially better performance than any of the contributing tools. We also describe a semi-automatic technique for creating a first dataset for RTE for Arabic using an extension of the ‘headline-lead paragraph’ technique because there are, again to the best of our knowledge, no such datasets available. We sketch the difficulties inherent in judgments made by volunteer annotators, and describe a regime to ameliorate some of these. The major contribution of this thesis is the introduction of two ways of improving the standard TED: (i) we present a novel approach, extended TED (ETED), for extending the standard TED algorithm for calculating the distance between two trees by allowing operations to apply to subtrees, rather than just to single nodes. This leads to useful improvements over the performance of the standard TED for determining entailment. The key here is that subtrees tend to correspond to single information units.
By treating operations on subtrees as less costly than the corresponding set of individual node operations, ETED concentrates on entire information units, which are a more appropriate granularity than individual words for considering entailment relations; and (ii) we use the artificial bee colony (ABC) algorithm to automatically estimate the cost of edit operations for single nodes and subtrees and to determine thresholds, since assigning an appropriate cost to each edit operation manually can become a tricky task. The current findings are encouraging. These extensions can substantially improve the F-score and accuracy and achieve a better RTE model when compared with a number of string-based algorithms and the standard TED approaches. The relative performance of the standard techniques on our Arabic test set replicates the results reported for these techniques for English test sets. We have also applied ETED with ABC to the English RTE2 test set, where it again outperforms the standard TED.
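The idea that a subtree operation should cost less than the equivalent run of single-node operations can be illustrated with a small ordered tree edit distance. The sketch below is not the thesis's ETED algorithm; it is a plain recursive forest distance with unit node costs, to which a whole-subtree delete/insert is added whose cost is controlled by an assumed, hand-set `subtree_discount` parameter (the thesis instead estimates costs with the ABC algorithm). Trees are written as `(label, children)` tuples, and all names here are hypothetical.

```python
from functools import lru_cache


def tree_size(tree):
    """Number of nodes in a (label, children) tree."""
    _, children = tree
    return 1 + sum(tree_size(child) for child in children)


def make_tree_distance(subtree_discount=1.0):
    """Build a distance function; a discount of 1.0 gives standard unit-cost TED."""

    def subtree_cost(tree):
        # Root at full cost, every further node at the discounted cost.
        return 1.0 + subtree_discount * (tree_size(tree) - 1)

    @lru_cache(maxsize=None)
    def forest_distance(f, g):
        if not f and not g:
            return 0.0
        options = []
        if f:
            _, v_children = f[-1]
            # Delete the rightmost root only; its children move up.
            options.append(forest_distance(f[:-1] + v_children, g) + 1.0)
            # Delete the whole rightmost subtree in one operation.
            options.append(forest_distance(f[:-1], g) + subtree_cost(f[-1]))
        if g:
            _, w_children = g[-1]
            # Insert the rightmost root only, or its whole subtree at once.
            options.append(forest_distance(f, g[:-1] + w_children) + 1.0)
            options.append(forest_distance(f, g[:-1]) + subtree_cost(g[-1]))
        if f and g:
            (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
            relabel = 0.0 if v_label == w_label else 1.0
            # Match the two rightmost roots and recurse on both parts.
            options.append(
                forest_distance(v_children, w_children)
                + forest_distance(f[:-1], g[:-1])
                + relabel
            )
        return min(options)

    def distance(t1, t2):
        return forest_distance((t1,), (t2,))

    return distance


# "John saw the big dog" versus "John saw" as toy dependency trees.
full = ("saw", (("John", ()), ("dog", (("the", ()), ("big", ())))))
short = ("saw", (("John", ()),))

standard = make_tree_distance(subtree_discount=1.0)
discounted = make_tree_distance(subtree_discount=0.5)
print(standard(full, short))    # 3.0
print(discounted(full, short))  # 2.0
```

With the discount, dropping the whole object phrase counts as one cheaper unit rather than three independent deletions, which is the intuition behind treating subtrees as information units.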
|
17 |
Exploiting Vocabulary, Morphological, and Subtree Knowledge to Improve Chinese Syntactic Analysis / 語彙的、形態的、および部分木知識を用いた中国語構文解析の精度向上
Shen, Mo, 23 March 2016
In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Kyoto University's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink. / Kyoto University / 0048 / New system, course-based doctorate / Doctor of Informatics / 甲第19848号 / 情博第599号 / 新制||情||104 (University Library) / 32884 / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Associate Professor Daisuke Kawahara, Professor Sadao Kurohashi, Professor Hisashi Kashima / Conferred under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM
|
18 |
Neural Approaches for Syntactic and Semantic Analysis / 構文・意味解析に対するニューラルネットワークを利用した手法
Kurita, Shuhei, 25 March 2019
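Kyoto University / 0048 / New system, course-based doctorate / Doctor of Informatics / 甲第21911号 / 情博第694号 / 新制||情||119 (University Library) / Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University / (Chief examiner) Professor Sadao Kurohashi, Professor Hisashi Kashima, Associate Professor Daisuke Kawahara / Conferred under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Informatics / Kyoto University / DFAM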
|
19 |
Predicative Analysis for Information Extraction: application to the biology domain / Analyse prédicative pour l'extraction d'information : application au domaine de la biologie
Ratkovic, Zorana, 11 December 2014
The abundance of texts in the biomedical domain calls for automatic processing methods to improve the retrieval of precise information. Information extraction (IE) aims precisely to extract relevant information from unstructured data. A large share of the methods in this field concentrate on machine learning approaches that draw on deep linguistic processing. Syntactic analysis in particular plays an important role, providing a precise analysis of the relations between the elements of the sentence. This thesis studies the role of dependency parsing in IE applications in the biomedical domain. It includes the evaluation of different parsers as well as a detailed error analysis. Once the most suitable parser has been selected, the different linguistic processing steps needed to achieve high-quality, syntax-based IE are addressed: these include pre-processing steps (word segmentation) and higher-level linguistic processing (related to semantics and coreference analysis). This thesis also explores how the different levels of linguistic processing can be represented and then exploited by the learning algorithm. Finally, starting from the observation that the biomedical domain is in fact extremely diverse, this thesis explores the adaptation of the techniques to different subdomains, using already existing knowledge and resources. The methods and approaches described are explored using two different biomedical corpora, showing how the IE results are used in concrete tasks. / The abundance of biomedical information expressed in natural language has resulted in the need for methods to process this information automatically.
In the field of Natural Language Processing (NLP), Information Extraction (IE) focuses on the extraction of relevant information from unstructured data in natural language. Many IE methods today focus on Machine Learning (ML) approaches that rely on deep linguistic processing in order to capture the complex information contained in biomedical texts. In particular, syntactic analysis and parsing have played an important role in IE, by helping capture how words in a sentence are related. This thesis examines how dependency parsing can be used to facilitate IE. It focuses on a task-based approach to dependency parsing evaluation and parser selection, including a detailed error analysis. In order to achieve a high quality of syntax-based IE, different stages of linguistic processing are addressed, including both pre-processing steps (such as tokenization) and the use of complementary linguistic processing (such as the use of semantics and coreference analysis). This thesis also explores how the different levels of linguistic processing can be represented for use within an ML-based IE algorithm, and how the interface between these two is of great importance. Finally, biomedical data is very heterogeneous, encompassing different subdomains and genres. This thesis explores how subdomain adaptation can be achieved by using already existing subdomain knowledge and resources. The methods and approaches described are explored using two different biomedical corpora, demonstrating how the IE results are used in real-life tasks.
|
20 |
Tvorba závislostního korpusu pro jorubštinu s využitím paralelních dat / Building a dependency treebank for Yoruba using parallel data
Oluokun, Adedayo, January 2018
The goal of this thesis is to create a dependency treebank for Yorùbá, a language with very few pre-existing machine-readable resources. The treebank follows the Universal Dependencies (UD) annotation standard; in addition, certain language-specific guidelines for Yorùbá were specified. Known techniques for porting resources from resource-rich languages were tested, in particular projection of annotation across parallel bilingual data. Manual annotation is not the main focus of this thesis; nevertheless, a small portion of the data was verified manually in order to evaluate the annotation quality. Also, a model was trained on the manual annotation using UDPipe.
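Projection of annotation across parallel data can be sketched in its simplest form: given a parsed source sentence and a one-to-one word alignment, each dependency arc is copied to the target side whenever both ends are aligned. The function below is a minimal illustration under those assumptions; the abstract does not describe the projection procedure actually used, and the function name, the alignment format, and the toy sentence are hypothetical.

```python
def project_dependencies(src_heads, src_labels, alignment, tgt_len):
    """Project source dependency arcs onto a target sentence.

    src_heads[i] is the 1-based head of source token i+1 (0 for the root),
    src_labels[i] is its relation label, and `alignment` maps 1-based source
    positions to 1-based target positions (assumed one-to-one).
    Target tokens that receive no arc keep None for head and label.
    """
    tgt_heads = [None] * tgt_len
    tgt_labels = [None] * tgt_len
    for src_pos, (head, label) in enumerate(zip(src_heads, src_labels), start=1):
        tgt_pos = alignment.get(src_pos)
        if tgt_pos is None:
            continue  # the dependent has no counterpart in the target
        if head == 0:
            tgt_heads[tgt_pos - 1] = 0
        elif head in alignment:
            tgt_heads[tgt_pos - 1] = alignment[head]
        else:
            continue  # the head has no counterpart, so the arc is dropped
        tgt_labels[tgt_pos - 1] = label
    return tgt_heads, tgt_labels


# Source: "She reads the books"; the determiner is unaligned in a
# three-token target sentence (hypothetical alignment).
src_heads = [2, 0, 4, 2]
src_labels = ["nsubj", "root", "det", "obj"]
alignment = {1: 1, 2: 2, 4: 3}
print(project_dependencies(src_heads, src_labels, alignment, 3))
# ([2, 0, 2], ['nsubj', 'root', 'obj'])
```

Arcs left as `None` are exactly the places where projection gives no answer, which is one reason a manually verified sample, as mentioned in the abstract, is useful for judging annotation quality.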
|