21

Neural Networks for Part-of-Speech Tagging

Strandqvist, Wiktor January 2016 (has links)
The aim of this thesis is to explore the viability of artificial neural networks using a purely contextual word representation as a solution for part-of-speech tagging. Furthermore, the effects of deep learning and of increased contextual information are explored. This was achieved by creating an artificial neural network written in Python, with input vectors created by Word2Vec. The system was compared, with respect to accuracy and precision, to a baseline tagger using handcrafted features. The results show that artificial neural networks using a purely contextual word representation show promise, but ultimately fall roughly two percent short of the baseline. The suspected reason for this is the suboptimal representation of rare words. Deeper network architectures yield only an insignificant improvement, indicating that the data sets used might be too small. Additional context information provided higher accuracy, but accuracy began to decline beyond a context size of one.
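As a rough illustration of the kind of system described above (not the thesis implementation; the toy corpus, window size, vector dimensions and classifier below are assumptions), a purely contextual tagger can be sketched as a small feedforward classifier over concatenated Word2Vec vectors of a context window:

```python
# Hedged sketch: a toy feedforward POS tagger whose only features are the
# Word2Vec vectors of the focus word and its neighbours. Assumes gensim >= 4
# and scikit-learn; the tagged sentences are invented toy data.
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier

tagged = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
]
sentences = [[w for w, _ in sent] for sent in tagged]

# Purely contextual word representation: Word2Vec trained on the raw text only.
w2v = Word2Vec(sentences, vector_size=25, window=2, min_count=1, sg=1, seed=1)
pad = np.zeros(w2v.vector_size)

def vec(word):
    return w2v.wv[word] if word in w2v.wv else pad

def features(words, i, context=1):
    # Concatenate the vectors of a symmetric window around position i.
    idx = range(i - context, i + context + 1)
    return np.concatenate([vec(words[j]) if 0 <= j < len(words) else pad for j in idx])

X = np.array([features([w for w, _ in s], i) for s in tagged for i in range(len(s))])
y = [t for s in tagged for _, t in s]

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X, y)

test = ["a", "dog", "sleeps"]
pred = clf.predict(np.array([features(test, i) for i in range(len(test))]))
print(list(zip(test, pred)))   # e.g. [('a', 'DET'), ('dog', 'NOUN'), ('sleeps', 'VERB')]
```

Widening the `context` parameter is the analogue of the context-size experiments mentioned in the abstract.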
22

Lexical selection for machine translation

Sabtan, Yasser Muhammad Naguib Mahmoud January 2011 (has links)
Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words. This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English. This is because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries. This points to the robustness of the adopted approach. In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon and at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information. The point of the work reported here is to see how well one can do without any such manual intervention. With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic. This POS tagger uses a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques. Similarly, we use a lexicon-free POS tagger for English. We use the two POS taggers to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English. The two parsers are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach. These preprocessing steps pave the way for the main system (or proposer), whose task is to extract translational equivalents from the parallel corpus. The framework starts with automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again.
The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89.
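As a hedged illustration of the first stage of such a proposer (lexicon-free extraction of translational candidates from co-occurrence), the sketch below scores source/target word pairs from a toy sentence-aligned corpus with the Dice coefficient and then prefers the most frequent target among the candidates. The transliterated tokens are invented, and the thesis's tagging, dependency-labelling and bootstrapping stages are omitted:

```python
# Hedged sketch of co-occurrence-based candidate extraction and frequency-based
# selection over a toy sentence-aligned bitext (not the thesis pipeline).
from collections import Counter

bitext = [  # (source tokens, target tokens): invented toy data
    ("kitab jadid".split(), "new book".split()),
    ("kitab qadim".split(), "old book".split()),
    ("bayt jadid".split(),  "new house".split()),
]

src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
for src, tgt in bitext:
    for s in set(src):
        src_freq[s] += 1
        for t in set(tgt):
            pair_freq[(s, t)] += 1
    for t in set(tgt):
        tgt_freq[t] += 1

def dice(s, t):
    # Dice coefficient over sentence-level co-occurrence counts.
    return 2 * pair_freq[(s, t)] / (src_freq[s] + tgt_freq[t])

def candidates(s, k=2):
    scored = sorted(((dice(s, t), t) for t in tgt_freq), reverse=True)
    return [t for _, t in scored[:k]]

def select(s):
    # Among the top co-occurrence candidates, back off to raw target frequency.
    return max(candidates(s), key=lambda t: tgt_freq[t])

print(candidates("kitab"), select("kitab"))   # ['book', 'old'] 'book'
```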
23

Modelagem de contextos para aprendizado automático aplicado à análise morfossintática / Modeling contexts for automatic learning applied to morphosyntactic analysis

Fábio Natanael Kepler 28 May 2010 (has links)
Part-of-speech tagging involves assigning to the words in a sentence their part-of-speech class based on the contexts they appear in. Variable-Length Markov Chains (VLMCs) offer a way of modeling contexts longer than trigrams without suffering too much from data sparsity and state-space complexity. Even so, two Portuguese words show a high degree of ambiguity: 'que' and 'a'. The number of errors tagging these words corresponds to a quarter of the total errors made by a VLMC-based tagger. Moreover, these words seem to show two different types of ambiguity: one depending on non-local context and one on right context. We explored ways of expanding the VLMC-based model with a number of different models and methods in order to tackle these issues. The approaches showed varying degrees of success, with one particular method (Guided Learning) solving much of the ambiguity of 'a'. We discuss reasons why this happened. Regarding 'que', throughout this thesis we propose and test various methods for learning contextual information in order to try to disambiguate it. We show how, in all of them, the level of ambiguity shown by 'que' remains practically constant.
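To make the modelling idea concrete, here is a hedged, much-simplified sketch of variable-length context prediction: back off from the longest tag history seen in training to progressively shorter ones (a real VLMC additionally prunes its context tree statistically). The two tagged sentences are invented toy data, not the thesis corpus:

```python
# Hedged sketch: predict the next tag from a variable-length tag history with
# simple back-off to shorter histories when the longer one is unseen.
from collections import Counter, defaultdict

tagged = [
    [("a", "DET"), ("casa", "NOUN"), ("que", "PRON"), ("vi", "VERB")],
    [("a", "PREP"), ("partir", "VERB"), ("de", "PREP"), ("hoje", "ADV")],
]

MAX_ORDER = 3
ctx = defaultdict(Counter)                   # tag history (tuple) -> next-tag counts
for sent in tagged:
    tags = [t for _, t in sent]
    for i in range(len(tags)):
        for k in range(MAX_ORDER + 1):       # record histories of length 0..MAX_ORDER
            if i - k >= 0:
                ctx[tuple(tags[i - k:i])][tags[i]] += 1

def p_next(history, tag):
    """P(tag | history), backing off to the longest history that has counts."""
    for k in range(min(MAX_ORDER, len(history)), -1, -1):
        h = tuple(history[-k:]) if k else ()
        if ctx[h]:
            return ctx[h][tag] / sum(ctx[h].values())
    return 0.0

print(p_next(["DET", "NOUN"], "PRON"))   # 1.0: after DET NOUN we only saw PRON
print(p_next(["ADJ", "NOUN"], "PRON"))   # 1.0 via back-off to the history ("NOUN",)
print(p_next([], "PREP"))                # 0.25: unigram tag distribution
```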
24

Utilizing Human-Computer Interactions to Improve Text Annotation

Carmen, Marc A. 08 July 2010 (has links) (PDF)
The need for annotated corpora in a variety of different types of research grows constantly. Unfortunately, creating annotated corpora is frequently cost-prohibitive due to the number of person-hours required. This project investigates one solution that helps to reduce the cost of creating annotated corpora: a new user interface that includes a specially built framework and component for annotating part-of-speech information, together with the implementation of a dictionary. The project reports on a user study performed to determine the effect of dictionaries with different levels of coverage on a part-of-speech annotation task. Based on a pilot study with thirty-three participants, the analysis shows that a part-of-speech tag dictionary with at least 60% coverage helps to reduce the time required to complete the part-of-speech annotation task while maintaining high levels of accuracy.
25

Utvärdering av Part-of-Speech tagging som metod för identifiering av nyckelord i dialog / Evaluation of Part-of-Speech tagging as a method for identification of keywords in dialogs

He, Jeannie, Norström, Matthew January 2019 (has links)
This study presents Part-of-Speech tagging as a method for keyword spotting, together with market research for a conversational robot intended to lead language cafés. The results are evaluated using answers from questionnaires sent to 30 anonymous native speakers of Swedish. The results show that the method is plausible and could be implemented in a conversational robot to increase its understanding of the spoken language that occurs in language cafés. The market research indicates that there is a market for the conversational robot; however, the robot needs improvement before it can substitute for human language leaders at language cafés.
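As a hedged sketch of the underlying idea (POS tagging as keyword spotting), the snippet below tags an utterance and keeps the content-word classes as keyword candidates. NLTK's English tagger stands in for the Swedish tagger such a robot would need, and the example utterance is invented:

```python
# Hedged sketch: keep nouns, verbs and adjectives as keyword candidates.
# Resource names may vary slightly across NLTK versions.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

KEEP = ("NN", "VB", "JJ")   # Penn Treebank prefixes for nouns, verbs, adjectives

def keywords(utterance):
    tokens = nltk.word_tokenize(utterance)
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith(KEEP)]

print(keywords("I really enjoy speaking Swedish at the language cafe on Mondays"))
# -> e.g. ['enjoy', 'speaking', 'Swedish', 'language', 'cafe', 'Mondays']
```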
26

Hybrid models for Chinese unknown word resolution

Lu, Xiaofei 12 September 2006 (has links)
No description available.
27

Applying particle filtering to unsupervised part-of-speech induction

Dubbin, Gregory January 2014 (has links)
Statistical Natural Language Processing (NLP) lies at the intersection of Computational Linguistics and Machine Learning. As linguistic models incorporate more subtle nuances of language and its structure, standard inference techniques can fall behind. One such application is research on the unsupervised induction of part-of-speech tags. It has the potential to improve both our understanding of the plausibility of theories of first language acquisition, and Natural Language Processing applications such as Speech Recognition and Machine Translation. Sequential Monte Carlo (SMC) approaches, i.e. particle filters, are well suited to approximating such models. This thesis seeks to determine whether one application of SMC methods, particle Gibbs sampling, is capable of performing inference in otherwise intractable NLP applications. Specifically, this research analyses the benefits and drawbacks of relying on particle Gibbs to perform unsupervised part-of-speech induction without the flawed one-tag-per-type assumption of similar approaches. Additionally, this thesis explores the effects of type-based supervision with tag dictionaries extracted from annotated corpora or from Wiktionary. The semi-supervised tag dictionary improves the performance of the local Gibbs PYP-HMM sampler enough to nearly match the performance of the particle Gibbs type-sampler. Finally, this thesis also extends the Pitman-Yor HMM tagger of Blunsom and Cohn (2011) to include an explicit model of the lexicon which encodes those tags from which a word-type may be generated. This has the effect of both biasing the model to produce fewer tags per type and modelling the tendency for open class words to be ambiguous between only a subset of the available tags. Furthermore, I extend the type-based particle Gibbs inference algorithm to simultaneously resample the ambiguity class as well as tags for all of the tokens of a given word type. The result is a principled probabilistic model of part-of-speech induction that achieves state-of-the-art performance. Overall, the experiments and contributions of this thesis demonstrate the applicability of the particle Gibbs sampler and particle methods in general to otherwise intractable problems in NLP.
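For readers unfamiliar with this family of samplers, the hedged sketch below shows the simplest relative of the models above: token-level collapsed Gibbs sampling for an HMM with symmetric Dirichlet priors. It is not the PYP-HMM or the particle Gibbs type-sampler of the thesis; the corpus, tag count and hyperparameters are toy choices, and the conditional omits the small count adjustments an exact collapsed sampler applies when neighbouring tags coincide:

```python
# Hedged sketch: approximate token-level collapsed Gibbs sampling for a
# Dirichlet-HMM tag inducer on a toy corpus (a simplification, not the thesis model).
import random
from collections import Counter, defaultdict

random.seed(0)
corpus = [s.split() for s in [
    "the dog chased the cat", "a cat saw the dog", "the dog saw a bird",
]]
K, ALPHA, BETA = 3, 0.1, 0.1              # number of induced tags, Dirichlet priors
V = len({w for s in corpus for w in s})   # vocabulary size
BOS = -1                                  # sentence-boundary pseudo-tag

tags = [[random.randrange(K) for _ in s] for s in corpus]
trans, emit = Counter(), defaultdict(Counter)
tag_tot, out_tot = Counter(), Counter()   # emission totals / outgoing-transition totals

for si, sent in enumerate(corpus):        # initial counts: each transition counted once
    prev = BOS
    for j, w in enumerate(sent):
        t = tags[si][j]
        trans[(prev, t)] += 1
        emit[t][w] += 1
        tag_tot[t] += 1
        if j + 1 < len(sent):
            out_tot[t] += 1
        prev = t

def update(si, j, delta):
    # Add (delta=+1) or remove (delta=-1) every count that depends on tags[si][j].
    t, w = tags[si][j], corpus[si][j]
    prev = tags[si][j - 1] if j > 0 else BOS
    trans[(prev, t)] += delta
    if j + 1 < len(corpus[si]):
        trans[(t, tags[si][j + 1])] += delta
        out_tot[t] += delta
    emit[t][w] += delta
    tag_tot[t] += delta

def resample(si, j):
    update(si, j, -1)                     # collapse: drop this token's counts
    w = corpus[si][j]
    prev = tags[si][j - 1] if j > 0 else BOS
    nxt = tags[si][j + 1] if j + 1 < len(corpus[si]) else None
    weights = []
    for t in range(K):
        p = (emit[t][w] + BETA) / (tag_tot[t] + BETA * V)         # emission
        p *= trans[(prev, t)] + ALPHA                             # incoming transition
        if nxt is not None:                                       # outgoing transition
            p *= (trans[(t, nxt)] + ALPHA) / (out_tot[t] + ALPHA * K)
        weights.append(p)
    tags[si][j] = random.choices(range(K), weights=weights)[0]
    update(si, j, +1)                     # add counts back under the new tag

for sweep in range(200):                  # a few hundred Gibbs sweeps on toy data
    for si in range(len(corpus)):
        for j in range(len(corpus[si])):
            resample(si, j)

for s, t in zip(corpus, tags):
    print(list(zip(s, t)))                # induced (word, tag-id) pairs
```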
28

Intégration de ressources lexicales riches dans un analyseur syntaxique probabiliste / Integration of lexical resources in a probabilistic parser

Sigogne, Anthony 03 December 2012 (has links)
This thesis focuses on the integration of lexical and syntactic resources for French into two fundamental tasks of Natural Language Processing (NLP): probabilistic part-of-speech tagging and probabilistic parsing. For French there is a wealth of lexical and syntactic data created by automatic processes or by linguists, and a number of experiments have shown the value of using such resources in tagging or parsing, since they can significantly improve system performance. In this thesis, we use these resources to address two problems, described briefly below: data sparseness and the automatic segmentation of texts. Thanks to increasingly sophisticated parsing algorithms, parsing accuracy is becoming higher for many languages, including French. However, several problems are inherent in the mathematical formalisms used to model the task statistically (grammars, discriminative models, ...). Data sparseness is one of these problems, and is mainly caused by the small size of the annotated corpora available for the language. It refers to the difficulty of estimating the probability of syntactic phenomena that appear in the texts to be analyzed but are rare or absent from the corpus used to train the parsers. Moreover, sparseness is in part a lexical problem: the richer the morphology of a language, the sparser the lexicons built from a treebank for that language. Our first concern is therefore to mitigate the negative impact of lexical data sparseness on parsing performance. To this end, we investigated a method called word clustering, which consists in grouping the words of the corpus and of the texts into clusters. These clusters reduce the number of unknown words, and therefore the number of rare or unknown lexicon-related syntactic phenomena, in the texts to be analyzed. Our goal is to propose word-clustering methods based on syntactic information from French lexicons and to observe their impact on parsing accuracy. Furthermore, most evaluations of probabilistic tagging and parsing have been performed with a perfect segmentation of the text, identical to that of the evaluated corpus. In real applications, however, the segmentation of a text is rarely available, and current automatic segmenters fall far short of producing a high-quality segmentation, because of the presence of many multi-word units (compound words, named entities, ...). In this thesis, we focus on continuous multi-word units, called compound words, which form lexical units to which a part-of-speech tag can be assigned; for example, "cordon bleu" is a compound noun and "tout à fait" a compound adverb. The task of identifying compound words can be viewed as a text-segmentation task. Our second concern is therefore the automatic segmentation of French texts and its impact on the performance of automatic processes. To address it, we investigated an approach that couples, in a single probabilistic model, the recognition of compound words with another task, in our case tagging or parsing. The recognition of compound words is thus performed within the probabilistic process rather than in a preliminary phase. Our goal is to propose innovative strategies for integrating compound-word resources into two probabilistic processes that combine tagging or parsing with text segmentation.
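As a hedged sketch of the segmentation problem at stake (and only of the baseline the thesis improves on, since the thesis recognises compound words jointly inside the probabilistic model rather than in a pre-processing step), longest-match merging of compounds against a small hand-listed lexicon might look like this; the lexicon entries and example sentence are invented, reusing the compounds cited above:

```python
# Hedged sketch: greedy longest-match merging of multi-word units before tagging.
MWE_LEXICON = {
    ("cordon", "bleu"): "NC",        # compound noun
    ("tout", "à", "fait"): "ADV",    # compound adverb
}
MAX_LEN = max(len(k) for k in MWE_LEXICON)

def merge_compounds(tokens):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):   # longest match first
            span = tuple(tokens[i:i + n])
            if span in MWE_LEXICON:
                out.append(("_".join(span), MWE_LEXICON[span]))
                i += n
                break
        else:
            out.append((tokens[i], None))                       # tag left to the tagger
            i += 1
    return out

print(merge_compounds("il est tout à fait un cordon bleu".split()))
# [('il', None), ('est', None), ('tout_à_fait', 'ADV'), ('un', None), ('cordon_bleu', 'NC')]
```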
29

Text Augmentation: Inserting markup into natural language text with PPM Models

Yeates, Stuart Andrew January 2006 (has links)
This thesis describes a new optimisation and new heuristics for automatically marking up XML documents using PPM models, together with CEM, a Java implementation. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags and using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including the bibliography corpus of 14,682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. The other corpora are the ROCLING Chinese text segmentation corpus, the Computists' Communique corpus and the Reuters corpus. A detailed examination is presented of methods for evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
30

Part-of-Speech Bootstrapping Using Lexically-Specific Frames

Leibbrandt, Richard Eduard, richard.leibbrandt@flinders.edu.au January 2009 (has links)
The work in this thesis presents and evaluates a number of strategies by which English-learning children might discover the major open-class parts-of-speech in English (nouns, verbs and adjectives) on the basis of purely distributional information. Previous work has shown that parts-of-speech can be readily induced from the distributional patterns in which words occur. The research reported in this thesis extends and improves on this previous work in two major ways, related to the constructional status of the utterance contexts used for distributional analysis, and to the way in which previous studies have dealt with categorial ambiguity. Previous studies that have induced parts-of-speech from word distributions have done so on the basis of fixed “windows” of words that occur before and after the word in focus. These contexts are often not constructions of the language in question, and hence have dubious status as elements of linguistic knowledge. A great deal of recent evidence (e.g. Lieven, Pine & Baldwin, 1997; Tomasello, 1992) has suggested that children’s early language may be organized around a number of lexically-specific constructional frames with slots, such as “a X”, “you X it”, “draw X on X”. The work presented here investigates the possibility that constructions such as these may be a more appropriate domain for the distributional induction of parts-of-speech. This would open up the possibility of a treatment of part-of-speech induction that is more closely integrated with the acquisition of syntax. Three strategies to discover lexically-specific frames in the speech input to children are presented. Two of these strategies are based on the interplay between more and less frequent words in English utterances: the more frequent words, which are typically function words or light verbs, are taken to provide the schematic “backbone” of an utterance. The third strategy is based around pairs of words in which the occurrence of one word is highly predictable from that of the other, but not vice versa; from these basic slot-filler relationships, larger frames are assembled. These techniques were implemented computationally and applied to a corpus of child-directed speech. Each technique yielded a large set of lexically-specific frames, many of which could plausibly be regarded as constructions. In a comparison with a manual analysis of the same corpus by Cameron-Faulkner, Lieven and Tomasello (2003), it is shown that most of the constructional frames identified in the manual analysis were also produced by the automatic techniques. After the identification of potential constructional frames, parts-of-speech were formed from the patterns of co-occurrence of words in particular constructions, by means of hierarchical clustering. The resulting clusters produced are shown to be quite similar to the major English parts-of-speech of nouns, verbs and adjectives. Each individual word token was assigned a part-of-speech on the basis of its constructional context. This categorization was evaluated empirically against the part-of-speech assigned to the word in question in the original corpus. The resulting categorization is shown to be, to a great extent, in agreement with the manual categorization. These strategies deal with the categorial ambiguity of words, by allowing the frame context to determine part-of-speech. However, many of the frames produced were themselves ambiguous cues to part-of-speech. For this reason, strategies are presented to deal with both word and context ambiguity. 
Three such strategies are proposed. One considers membership of a part-of-speech to be a matter of degree for both word and contextual frame. A second strategy attempts to discretely assign multiple parts-of-speech to words and constructions in a way that imposes internal consistency in the corpus. The third strategy attempts to assign only the minimally-required multiple categories to words and constructions so as to provide a parsimonious description of the data. Each of these techniques was implemented and applied to each of the three frame discovery techniques, thereby providing category information about both the frame and the word. The subsequent assignment of parts-of-speech was done by combining word and frame information, and is shown to be far more accurate than the categorization based on frames alone. This approach can be regarded as addressing certain objections against the distributional method that have been raised by Pinker (1979, 1984, 1987). Lastly, a framework for extending this research is outlined that allows semantic information to be incorporated into the process of category induction.
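As a hedged illustration of the clustering step alone (the frames below are simply (previous word, next word) pairs, not the lexically-specific constructional frames discovered by the three strategies, and the child-directed-style sentences are invented), words can be grouped by hierarchical clustering over their frame-count vectors:

```python
# Hedged sketch: represent each word by the counts of the frames it fills and
# group words by hierarchical (Ward) clustering over those count vectors.
from collections import Counter, defaultdict
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

sentences = [s.split() for s in [
    "you want the ball", "you want the cup", "you push the ball",
    "you push the cup", "the ball is nice", "the cup is nice",
]]

frame_counts = defaultdict(Counter)          # word -> Counter of (left, right) frames
for s in sentences:
    padded = ["<s>"] + s + ["</s>"]
    for i in range(1, len(padded) - 1):
        frame_counts[padded[i]][(padded[i - 1], padded[i + 1])] += 1

words = sorted(frame_counts)
frames = sorted({f for c in frame_counts.values() for f in c})
X = np.array([[frame_counts[w][f] for f in frames] for w in words], dtype=float)

labels = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
clusters = defaultdict(list)
for w, lab in zip(words, labels):
    clusters[lab].append(w)
print(dict(clusters))   # 'ball'/'cup' cluster together, as do 'want'/'push'
```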
