Global ETD Search

141	POSITION CLASS PRECLUSION: A COMPUTATIONAL RESOLUTION OF MUTUALLY EXCLUSIVE AFFIX POSITIONS Hale, Rebecca O 01 January 2014 In Paradigm Function Morphology, it is usual to model affix position classes with an ordered sequence of inflectional rule blocks. Each rule block determines how (or whether) a particular affix position is filled. In this model, competition among inflectional rules is assumed to be limited to members of the same rule block; thus, the appearance of an affix in one position cannot be precluded by the appearance of an affix in another position. I present evidence that apparently disconfirms this restriction and suggests that a more general conception of rule competition is necessary. The data appear to imply that an affixation rule may in some cases override a rule introducing an affix occupying another, distinct position. I propose that each inflectional rule R carry two indices: the first, as usual, specifies the position of the affix introduced by R, while the second specifies the position(s) that R satisfies. By default, these two indices identify the same position. However, where one affix precludes another, the second index of the appearing affix specifies two affix positions: the one in which it appears and the one which it precludes. With both blocks satisfied, no other rules which fill either may be applied. Paradigm Function Morphology inflectional morphology morphology computational linguistics affixation Computational Linguistics Morphology
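The two-index proposal in record 141 can be illustrated with a minimal sketch. This is not the author's implementation: the `Rule` class, the `inflect` function, the affixes `-a`, `-b`, `-c`, and the feature names are all hypothetical, and the choice to try more specific rules first is an assumption added here so that a precluding rule gets the chance to apply before the rule it precludes.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Set


@dataclass
class Rule:
    affix: str
    position: int                               # first index: slot where the affix appears
    features: FrozenSet[str]                    # properties the rule realizes
    satisfies: Optional[FrozenSet[int]] = None  # second index: slot(s) the rule satisfies

    def __post_init__(self) -> None:
        # By default, both indices identify the same position.
        if self.satisfies is None:
            self.satisfies = frozenset({self.position})


def inflect(stem: str, features: Iterable[str], rules: List[Rule]) -> str:
    feats = frozenset(features)
    satisfied: Set[int] = set()   # positions already satisfied by some rule
    slots: Dict[int, str] = {}    # position -> affix actually inserted
    # Assumption: rules realizing more features are tried first.
    for rule in sorted(rules, key=lambda r: -len(r.features)):
        if rule.features <= feats and rule.position not in satisfied:
            slots[rule.position] = rule.affix
            satisfied |= rule.satisfies
    # Hypothetical suffixing language: affixes follow the stem in slot order.
    return stem + "".join(slots[p] for p in sorted(slots))


# Hypothetical rule set: "-c" appears in slot 2 but satisfies slots 1 and 2,
# so it precludes the slot-1 affix "-a".
RULES = [
    Rule("-a", 1, frozenset({"pl"})),
    Rule("-b", 2, frozenset({"neg"})),
    Rule("-c", 2, frozenset({"neg", "pl"}), frozenset({1, 2})),
]

print(inflect("stem", {"pl"}, RULES))         # stem-a
print(inflect("stem", {"neg"}, RULES))        # stem-b
print(inflect("stem", {"neg", "pl"}, RULES))  # stem-c
```

In the last call, `-c` marks both positions as satisfied, so the slot-1 rule for `-a` can no longer apply even though its features match.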
142	Recycling Translations : Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing Tiedemann, Jörg January 2003 The focus of this thesis is on re-using translations in natural language processing. It involves the collection of documents and their translations in an appropriate format, the automatic extraction of translation data, and the application of the extracted data to different tasks in natural language processing. Five parallel corpora containing more than 35 million words in 60 languages have been collected within co-operative projects. All corpora are sentence aligned and parts of them have been analyzed automatically and annotated with linguistic markup. Lexical data are extracted from the corpora by means of word alignment. Two automatic word alignment systems have been developed, the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for the combination of statistical and linguistic resources in aligning single words and multi-word units. Both aligners have been applied to several corpora. Detailed evaluations of the alignment results have been carried out for three of them using fine-grained evaluation techniques. A corpus processing toolbox, Uplug, has been developed. It includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner. It can be used via an experimental web interface (UplugWeb). Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies. In a third study, the feasibility of using the extracted data in interactive machine translation has been demonstrated. Finally, extracted lexical data have been used for enhancing the lexical components of two machine translation systems. Computational linguistics word alignment parallel corpora translation corpora computational lexicography machine translation Datorlingvistik Computational linguistics Datorlingvistik
143	Creation of a customised character recognition application Sandgren, Frida January 2005 This master’s thesis describes the creation of a customised optical character recognition (OCR) application, intended for use in the digitisation of theses submitted to Uppsala University in the 18th and 19th centuries. For this purpose, open source software called Gamera has been used for recognition and classification of the characters in the documents. The software provides specific algorithms for analysis of heritage documents and is designed to be used as a tool for creating domain-specific (i.e. customised) recognition applications. By using the Gamera classifier training interface, classifier data was created which reflects the characters in the particular theses. The data can then be used in automatic recognition of ‘new’ characters, by loading it into one of Gamera’s classifiers. The output of Gamera consists of sets of classified glyphs (i.e. small images of characters), stored in an XML-based format. However, as OCR typically involves translation of images of text into a machine-readable format, a complementary OCR module was needed. For this purpose, an external Gamera module for page segmentation was modified and used. In addition, a script for control of the OCR process was created, which initiates the page segmentation on Gamera-classified glyphs. The result is written to text files. Finally, in a test of recognition accuracy, one of the theses was used both for creating training data and for testing. The results of the test show an average accuracy rate of 82% and indicate a need for a better pre-processing module which removes more noise from the images and recognises different character sizes in the images before they are run through the OCR process. Computational linguistics OCR Digitisation Character recognition Classification Heritage documents Datorlingvistik Computational linguistics Datorlingvistik
144	Classification into Readability Levels : Implementation and Evaluation Larsson, Patrik January 2006 A readability classification model is mainly used as an integrated part of an information retrieval system. By matching the user's readability requirements to documents with the corresponding readability, the classification model can further improve the results of, for example, a search engine. This thesis presents a new solution for classification into readability levels for Swedish. The results of the thesis are a number of classification models. The models were induced by training a Support Vector Machines classifier on features that previous research has established as good measurements of readability. The features were extracted from a corpus annotated with three readability levels. Natural Language Processing tools for tagging and parsing were used to analyze the corpus and enable the extraction of the features. Different feature combinations were tested empirically to optimize the classification model. The classification models render a good and stable classification. The best model obtained a precision score of 90.21% and a recall score of 89.56% on the test set, which corresponds to an F-score of 89.88. / The thesis describes the development of a classification model for Swedish texts according to their readability. The main area of use for a readability classification model is in information retrieval systems. The model can increase the accuracy of the documents that a search engine deems relevant by matching the user's readability requirements with the readability of the indexed documents. The result of the thesis is a number of models for classifying text by readability. The models were produced by training a Support Vector Machines classifier on a number of features that previous research has established as good measures of readability. The features were extracted from a corpus annotated with three readability levels. Language technology tools for tagging and parsing were used to enable the extraction of the features. The features were evaluated empirically in different feature combinations to optimize the models. The models were tested and evaluated with good results. The best model had a precision of 90.21 and a recall of 89.56, which gives an F-score of 89.88. The thesis presents suggestions for further development as well as potential areas of use. readability information retrieval search engines Computational linguistics läsbarhet sökmotorer informationssökning maskininlärning språkteknologi datorlingvistik Computational linguistics Datorlingvistik
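The F-score reported in record 144 is consistent with the standard balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition (the abstract does not name the formula); the function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
print(round(f_score(90.21, 89.56), 2))  # 89.88
```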
146	A comparative study of the grammatical gender systems of languages by means of analysing word embeddings Veeman, Hartger January 2020 The creation of word embeddings is one of the key breakthroughs in natural language processing. Word embeddings allow words to be represented semantically, opening the way to many new deep learning methods. Understanding what information is in word embeddings will help in understanding the behaviour of embeddings in natural language processing tasks, and also allows for the quantitative study of linguistic features such as grammatical gender. This thesis explores how grammatical gender is encoded in word embeddings by analysing the performance of a neural network classifier on the classification of nouns by gender. This analysis is done in three experiments: an analysis of contextualized embeddings, an analysis of embeddings learned from modified corpora and an analysis of aligned embeddings in many languages. The contextualized word embedding model ELMo has multiple output layers with a gradually increasing presence of semantic information in the embedding. This differing presence of semantic information was used to test the classifier's reliance on semantic information. Swedish, German, Spanish and Russian embeddings were classified at all layers of a three-layered ELMo model. The word representation layer without any contextualization was found to produce the best accuracy, indicating that the noise introduced by the contextualization was more impactful than any potential extra semantic information. Swedish embeddings were learned from a corpus stripped of articles and from a stemmed corpus. Both sets of embeddings showed a drop of about 6% in accuracy in comparison with the embeddings from a non-augmented corpus, indicating that agreement plays a large role in the classification. Aligned multilingual embeddings were used to measure the accuracy of a grammatical gender classifier in 24 languages. The classifier models were applied to data of other languages to determine the similarity of the encoding of grammatical gender in these embeddings. Correcting the results with a random guessing baseline shows that transferred models can be highly accurate in certain language combinations and in some cases almost approach the accuracy of the model on its source data. A comparison between transfer accuracy and phylogenetic distance showed that the model transferability follows a pattern that resembles the phylogenetic distance. word embeddings grammatical gender computational linguistics language representations
147	Compound Processing for Phrase-Based Statistical Machine Translation Stymne, Sara January 2009 In this thesis I explore how compound processing can be used to improve phrase-based statistical machine translation (PBSMT) between English and German/Swedish. Both German and Swedish generally use closed compounds, which are written as one word without spaces or other indicators of word boundaries. Compounding is both common and productive, which makes it problematic for PBSMT, mainly due to sparse data problems. The adopted strategy for compound processing is to split compounds into their component parts before training and translation. For translation into Swedish and German the parts are merged after translation. I investigate the effect of different splitting algorithms for translation between English and German, and of different merging algorithms for German. I also apply these methods to a different language pair, English-Swedish. Overall the studies show that compound processing is useful, especially for translation from English into German or Swedish. There are improvements for translation into English as well, such as a reduction of unknown words. I show that for translation between English and German different splitting algorithms work best for different translation directions. I also design and evaluate a novel merging algorithm based on part-of-speech matching, which outperforms previous methods for compound merging, showing the need for information that is carried through the translation process, rather than only external knowledge sources such as word lists. Most of the methods for compound processing were originally developed for German. I show that these methods can be applied to Swedish as well, with similar results. Machine translation compounds factored translation statistical machine translation computational linguistics Computer Sciences Datavetenskap (datalogi)
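The split-before-translation, merge-after-translation strategy in record 147 can be sketched roughly as follows. This is a toy illustration only, not one of the thesis's algorithms: the word-list lookup, the two-part limit, the `+` marker, and all function names are assumptions, and it ignores linking elements and capitalization. The thesis's own merging method relies on part-of-speech matching, which is not modelled here.

```python
from typing import Iterable, List, Set

MARK = "+"  # hypothetical marker appended to non-final compound parts


def split_compound(word: str, vocab: Set[str], min_len: int = 3) -> List[str]:
    """Split a closed compound into two known parts, if such a split exists."""
    w = word.lower()
    if w in vocab:
        return [w]
    for i in range(min_len, len(w) - min_len + 1):
        head, tail = w[:i], w[i:]
        if head in vocab and tail in vocab:
            return [head + MARK, tail]
    return [w]


def split_sentence(tokens: Iterable[str], vocab: Set[str]) -> List[str]:
    """Apply compound splitting to every token before training or translation."""
    out: List[str] = []
    for tok in tokens:
        out.extend(split_compound(tok, vocab))
    return out


def merge_compounds(tokens: Iterable[str]) -> List[str]:
    """Rejoin marked parts with the token that follows them after translation."""
    merged: List[str] = []
    prefix = ""
    for tok in tokens:
        if tok.endswith(MARK):
            prefix += tok[: -len(MARK)]
        else:
            merged.append(prefix + tok)
            prefix = ""
    if prefix:  # a marked part with nothing after it is kept as a word
        merged.append(prefix)
    return merged


vocab = {"en", "bok", "hylla"}  # toy Swedish word list
parts = split_sentence(["en", "bokhylla"], vocab)
print(parts)                   # ['en', 'bok+', 'hylla']
print(merge_compounds(parts))  # ['en', 'bokhylla']
```

Marking the non-final parts is one way to carry information through the translation step, so that merging does not depend only on an external word list.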
148	Exploring nature of the structured data in GP electronic patient records Ranandeh Kalankesh, Leila January 2011 No description available. 610.21
149	Temporal processing of news : annotation of temporal expressions, verbal events and temporal relations Marsic, Georgiana January 2011 The ability to capture the temporal dimension of a natural language text is essential to many natural language processing applications, such as Question Answering, Automatic Summarisation, and Information Retrieval. Temporal processing is a field of Computational Linguistics which aims to access this dimension and derive a precise temporal representation of a natural language text by extracting time expressions, events and temporal relations, and then representing them according to a chosen knowledge framework. This thesis focuses on the investigation and understanding of the different ways time is expressed in natural language, on the implementation of a temporal processing system in accordance with the results of this investigation, on the evaluation of the system, and on the extensive analysis of the errors and challenges that appear during system development. The ultimate goal of this research is to develop the ability to automatically annotate temporal expressions, verbal events and temporal relations in a natural language text. Temporal expression annotation involves two stages: temporal expression identification, concerned with determining the textual extent of a temporal expression, and temporal expression normalisation, which finds the value that the temporal expression designates and represents it using an annotation standard. The research presented in this thesis approaches these tasks with a knowledge-based methodology that tackles temporal expressions according to their semantic classification. Several knowledge sources and normalisation models are experimented with to allow an analysis of their impact on system performance. The annotation of events expressed using either finite or non-finite verbs is addressed with a method that overcomes the drawback of existing methods, which associate an event with the class that is most frequently assigned to it in a corpus and are limited in coverage by the small number of events present in the corpus. This limitation is overcome in this research by annotating each WordNet verb with an event class that best characterises that verb. This thesis also describes an original methodology for the identification of temporal relations that hold among events and temporal expressions. The method relies on sentence-level syntactic trees and a propagation of temporal relations between syntactic constituents, by analysing syntactic and lexical properties of the constituents and of the relations between them. The detailed evaluation and error analysis of the methods proposed for solving different temporal processing tasks form an important part of this research. Various corpora widely used by researchers studying different temporal phenomena are employed in the evaluation, thus enabling comparison with the state of the art in the field. The detailed error analysis targeting each temporal processing task helps identify not only problems of the implemented methods, but also reliability problems of the annotated resources, and encourages potential reexaminations of some temporal processing tasks. 006.35
150	Discovering patterns in databases: the cases for language, music, and unstructured data 葉立志, Yip, Chi-lap. January 2000 published_or_final_version / Computer Science and Information Systems / Doctoral / Doctor of Philosophy Data mining. Computer algorithms. Computational linguistics. Music - Data processing.

Search results