281

Cross-Lingual and Low-Resource Sentiment Analysis

Farra, Noura January 2019 (has links)
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages. This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language. Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis. To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments. The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language. In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
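As a rough illustration of the transfer setup described above, the sketch below trains a sentiment classifier on source-language sentences represented in a shared bilingual embedding space and applies it directly to the target language. The `bilingual_vecs` lookup, the helper names, and the use of scikit-learn logistic regression are assumptions for illustration; the thesis model itself is richer (bilingual sentiment embeddings, target-language lexicalization).

```python
# Minimal sketch of cross-lingual transfer via a shared bilingual embedding space.
# Assumption: `bilingual_vecs` maps both source- and target-language words into
# one common vector space (e.g. learned from a small in-domain parallel corpus).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens, bilingual_vecs, dim=300):
    """Average the shared-space vectors of the tokens we have embeddings for."""
    known = [bilingual_vecs[t] for t in tokens if t in bilingual_vecs]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def train_source_classifier(source_sentences, labels, bilingual_vecs):
    """Train on the high-resource (source) language only."""
    X = np.vstack([sentence_vector(s, bilingual_vecs) for s in source_sentences])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def predict_target(clf, target_sentences, bilingual_vecs):
    """Apply the same classifier directly to the low-resource language:
    no translation is needed because both languages share the embedding space."""
    X = np.vstack([sentence_vector(s, bilingual_vecs) for s in target_sentences])
    return clf.predict(X)
```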
282

Using web texts for word sense disambiguation

Wang, Yuanyong, Computer Science & Engineering, Faculty of Engineering, UNSW January 2007 (has links)
In all natural languages, ambiguity is a universal phenomenon. A word that has multiple meanings depending on its context is called an ambiguous word. The process of determining the correct meaning of a word (formally, its word sense) in a given context is word sense disambiguation (WSD). WSD is one of the most fundamental problems in natural language processing. If properly addressed, it could lead to revolutionary advancement in many other technologies such as text search engines, automatic text summarization and classification, automatic lexicon construction, machine translation and automatic learning agents. One difficulty that has always confronted WSD researchers is the lack of high-quality sense-specific information. For example, if the word "power" immediately precedes the word "plant", it strongly constrains the meaning of "plant" to be "an industrial facility". If "power" is replaced by the phrase "root of a", then the sense of "plant" is dictated to be "an organism" of the kingdom Plantae. It is obvious that manually building a comprehensive sense-specific information base for each sense of each word is impractical. Researchers have also tried to extract such information from large dictionaries as well as manually sense-tagged corpora. Most of the dictionaries used for WSD were not built for this purpose and carry many inherited peculiarities. While manual tagging is slow and costly, automatic tagging has not delivered reliable performance. Furthermore, it is often the case that for a randomly chosen word (to be disambiguated), the sense-specific context corpora that can be collected from dictionaries are not large enough. Therefore, manually building sense-specific information bases or extracting such information from dictionaries are not effective approaches to obtaining sense-specific information. Web text, due to its vast quantity and wide diversity, is an ideal source for extracting large quantities of sense-specific information. In this thesis, the impact of Web texts on various aspects of WSD has been investigated. New measures and models are proposed to tame the enormous amount of Web text for the purpose of WSD. They are formally evaluated by measuring their disambiguation performance on about 70 ambiguous nouns. The results are very encouraging and have helped reveal the great potential of using Web texts for WSD. The results are published in three papers at Australian national and international level (Wang & Hoffmann, 2004, 2005, 2006) [42][43][44].
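The core intuition, collecting sense-specific contexts (for example from the Web) and matching a new context against them, can be sketched as follows. The tiny hand-written contexts and the simple count-overlap score are stand-ins for illustration only; the thesis proposes its own measures and models for taming Web-scale text.

```python
# Toy illustration of web-text-based WSD: build a bag-of-words profile for each
# sense from sense-specific contexts, then pick the sense whose profile best
# overlaps with the context of the ambiguous word. All example texts are invented.
from collections import Counter

def sense_profile(contexts):
    """Aggregate word counts over the contexts gathered for one sense."""
    profile = Counter()
    for text in contexts:
        profile.update(text.lower().split())
    return profile

def disambiguate(context, profiles):
    """Score each sense by how often its profile has seen the context words."""
    words = context.lower().split()
    scores = {sense: sum(p[w] for w in words) for sense, p in profiles.items()}
    return max(scores, key=scores.get)

profiles = {
    "plant/facility": sense_profile([
        "the power plant generates electricity for the grid",
        "workers at the chemical plant went on strike",
    ]),
    "plant/organism": sense_profile([
        "the root of a plant absorbs water from the soil",
        "this plant flowers in early spring",
    ]),
}
print(disambiguate("engineers shut down the plant after the power failure", profiles))
```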
283

Incremental knowledge acquisition for natural language processing

Pham, Son Bao, Computer Science & Engineering, Faculty of Engineering, UNSW January 2006 (has links)
Linguistic patterns have been used widely in shallow methods to develop numerous NLP applications. Approaches for acquiring linguistic patterns can be broadly categorised into three groups: supervised learning, unsupervised learning and manual methods. In supervised learning approaches, a large annotated training corpus is required for the learning algorithms to achieve decent results. However, annotated corpora are expensive to obtain and usually available only for established tasks. Unsupervised learning approaches usually start with a few seed examples and gather some statistics based on a large unannotated corpus to detect new examples that are similar to the seed ones. Most of these approaches either populate lexicons for predefined patterns or learn new patterns for extracting general factual information; hence they are applicable to only a limited number of tasks. Manually creating linguistic patterns has the advantage of utilising an expert's knowledge to overcome the scarcity of annotated data. In tasks with no annotated data available, the manual way seems to be the only choice. One typical problem that occurs with manual approaches is that the combination of multiple patterns, possibly being used at different stages of processing, often causes unintended side effects. Existing approaches, however, do not focus on the practical problem of acquiring those patterns but rather on how to use linguistic patterns for processing text. A systematic way to support the process of manually acquiring linguistic patterns in an efficient manner is long overdue. This thesis presents KAFTIE, an incremental knowledge acquisition framework that strongly supports experts in creating linguistic patterns manually for various NLP tasks. KAFTIE addresses difficulties in manually constructing knowledge bases of linguistic patterns, or rules in general, often faced in existing approaches by: (1) offering a systematic way to create new patterns while ensuring they are consistent; (2) alleviating the difficulty in choosing the right level of generality when creating a new pattern; (3) suggesting how existing patterns can be modified to improve the knowledge base's performance; (4) making the effort in creating a new pattern, or modifying an existing pattern, independent of the knowledge base's size. KAFTIE, therefore, makes it possible for experts to efficiently build large knowledge bases for complex tasks. This thesis also presents the KAFDIS framework for discourse processing using new representation formalisms: the level-of-detail tree and the discourse structure graph.
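A very rough sketch of incremental, consistency-checked rule acquisition in this spirit is given below. The rule representation (a predicate plus a conclusion and a stored "cornerstone" case) is a simplification invented for illustration and is not KAFTIE's actual formalism.

```python
# Minimal sketch of adding pattern rules incrementally while guarding against
# unintended side effects: each rule remembers the case that prompted it, and a
# new rule is rejected if it would change the label of any stored case.
class RuleBase:
    def __init__(self):
        self.rules = []  # list of (predicate, conclusion, cornerstone_text)

    def classify(self, text):
        for predicate, conclusion, _ in reversed(self.rules):  # newest rule wins
            if predicate(text):
                return conclusion
        return None

    def add_rule(self, predicate, conclusion, cornerstone_text):
        """Add a rule only if it leaves all earlier cornerstone cases unchanged."""
        for _, _, old_case in self.rules:
            if predicate(old_case) and conclusion != self.classify(old_case):
                raise ValueError("new rule contradicts an existing cornerstone case")
        self.rules.append((predicate, conclusion, cornerstone_text))

kb = RuleBase()
kb.add_rule(lambda t: "we show that" in t, "POSITIVE_RESULT",
            "we show that the method outperforms the baseline")
print(kb.classify("in this paper we show that accuracy improves"))
```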
284

Lexical approaches to backoff in statistical parsing

Lakeland, Corrin, n/a January 2006 (has links)
This thesis develops a new method for predicting probabilities in a statistical parser so that more sophisticated probabilistic grammars can be used. A statistical parser uses a probabilistic grammar derived from a training corpus of hand-parsed sentences. The grammar is represented as a set of constructions - in a simple case these might be context-free rules. The probability of each construction in the grammar is then estimated by counting its relative frequency in the corpus. A crucial problem when building a probabilistic grammar is to select an appropriate level of granularity for describing the constructions being learned. The more constructions we include in our grammar, the more sophisticated a model of the language we produce. However, if too many different constructions are included, then our corpus is unlikely to contain reliable information about the relative frequency of many constructions. In existing statistical parsers two main approaches have been taken to choosing an appropriate granularity. In a non-lexicalised parser constructions are specified as structures involving particular parts-of-speech, thereby abstracting over individual words. Thus, in the training corpus two syntactic structures involving the same parts-of-speech but different words would be treated as two instances of the same event. In a lexicalised grammar the assumption is that the individual words in a sentence carry information about its syntactic analysis over and above what is carried by its part-of-speech tags. Lexicalised grammars have the potential to provide extremely detailed syntactic analyses; however, Zipf's law makes it hard for such grammars to be learned. In this thesis, we propose a method for optimising the trade-off between informative and learnable constructions in statistical parsing. We implement a grammar which works at a level of granularity in between single words and parts-of-speech, by grouping words together using unsupervised clustering based on bigram statistics. We begin by implementing a statistical parser to serve as the basis for our experiments. The parser, based on that of Michael Collins (1999), contains a number of new features of general interest. We then implement a model of word clustering, which we believe is the first to deliver vector-based word representations for an arbitrarily large lexicon. Finally, we describe a series of experiments in which the statistical parser is trained using categories based on these word representations.
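The backoff idea, grouping words by their bigram behaviour and using the resulting cluster as an intermediate category between word and part-of-speech, can be sketched as follows. The toy corpus and the use of k-means are stand-ins; the thesis develops its own clustering model over a full lexicon.

```python
# Sketch: represent each word by its right-neighbour bigram counts, cluster the
# vectors, and use the cluster identity as a backoff category that sits between
# the raw word and its part-of-speech tag. The corpus below is invented.
from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

corpus = [["the", "dog", "barked"], ["the", "cat", "slept"], ["a", "dog", "slept"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count right-neighbour bigrams for each word.
counts = defaultdict(lambda: np.zeros(len(vocab)))
for sent in corpus:
    for left, right in zip(sent, sent[1:]):
        counts[left][index[right]] += 1

X = np.vstack([counts[w] for w in vocab])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
word_class = dict(zip(vocab, clusters))  # backoff category per word
print(word_class)
```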
285

Maori language integration in the age of information technology: a computational approach

Laws, Mark R., n/a January 2001 (has links)
A multidisciplinary approach involving language universals, linguistic discourse analysis and computer information technology supports the descriptive nature of this research dissertation. Comparative methods are used to determine rudimentary language structures which reflect both the scientific and historic parameters embedded in all languages. From hypothesis to proof of concept, a multitude of computer applications have been used to test these language models, templates and frameworks. The entire approach is best described as "designing then building the theoretical, experimental, and practical projects that form the structural network of the Maori language system". The focus on methods for integrating the language is to investigate shared characteristics between Maori and New Zealand English. This has provided a complete methodology for a bilingual system with text and speech for language generation and classification. The approach draws on existing computational linguistic and information processing techniques for the analysis of each language's phenomena, where data from basic units to higher-order linguistic knowledge has been analysed in terms of similar and/or dissimilar features. The notion that some language units can have similar acoustic sounds, structures or even meanings in other languages is plausible; how these are identified was the key concept in building an integrated language system. This research has permitted further examination into developing a new series of phonological and lexical self-organising maps of Maori, using phoneme and word maps spatially organised around lower- to higher-order concepts such as 'sounds like'. To meet the high demands placed on very large data stores, a speech database management system containing phonological, phonetic, lexical, semantic, and other language frameworks was further developed. This database has helped to examine how effectively Maori has been integrated into an existing English framework. The bilingual system will allow full interaction with a computer-based speech architecture and will contribute to the existing knowledge being constructed by the many different disciplines associated with languages, whether naturally or artificially derived. Evolving connectionist systems are new tools that are trained in an unsupervised manner to be both adaptable and flexible. This hybrid approach improves on past methods by providing more effective and efficient ways of solving applied problems in speech data analysis, classification, rule extraction, information retrieval and knowledge acquisition. A preliminary study applies bilingual data to an 'evolving clustering method' algorithm that returns a structure containing acoustic clusters plotted using visualisation techniques. In the practical sense, the complete bilingual system has taken a bi-directional approach: both languages have undergone similar data analysis, language modelling, data access, text and speech processing, and human-computer network interface interaction.
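One building block mentioned above, the self-organising map, can be sketched in a few lines of NumPy: feature vectors are pulled toward the best-matching grid unit and its neighbours, so that similar items end up spatially close ("sounds like" neighbourhoods). The random feature matrix and all parameters below are placeholders; the thesis trains much richer maps over Maori and New Zealand English data.

```python
# Bare-bones self-organising map: each training vector updates the best-matching
# unit and its grid neighbours, with learning rate and neighbourhood shrinking
# over time. Feature data here is random and purely illustrative.
import numpy as np

def train_som(data, grid=(8, 8), epochs=200, lr=0.5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.random((grid[0], grid[1], data.shape[1]))
    ys, xs = np.meshgrid(np.arange(grid[0]), np.arange(grid[1]), indexing="ij")
    for epoch in range(epochs):
        frac = epoch / epochs
        cur_lr, cur_sigma = lr * (1 - frac), sigma * (1 - frac) + 0.5
        for x in data:
            dist = np.linalg.norm(weights - x, axis=2)
            by, bx = np.unravel_index(dist.argmin(), dist.shape)  # best matching unit
            grid_dist2 = (ys - by) ** 2 + (xs - bx) ** 2
            influence = np.exp(-grid_dist2 / (2 * cur_sigma ** 2))[..., None]
            weights += cur_lr * influence * (x - weights)
    return weights

features = np.random.default_rng(1).random((100, 12))  # e.g. 12 acoustic features
som = train_som(features)
```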
286

Question Classification in Question Answering Systems

Sundblad, Håkan January 2007 (has links)
Question answering systems can be seen as the next step in information retrieval, allowing users to pose questions in natural language and receive succinct answers. In order for a question answering system as a whole to be successful, research has shown that the correct classification of questions with regard to the expected answer type is imperative. Question classification has two components: a taxonomy of answer types, and a machinery for making the classifications. This thesis focuses on five different machine learning algorithms for the question classification task: k nearest neighbours, naïve Bayes, decision tree learning, sparse network of winnows, and support vector machines. These algorithms have been applied to two different corpora, one of which has been used extensively in previous work and was constructed for a specific agenda. The other corpus is drawn from a set of users' questions posed to a running online system. The results showed that the performance of the algorithms on the different corpora differs both in absolute terms and with regard to their relative ranking. On the novel corpus, naïve Bayes, decision tree learning, and support vector machines perform on par with each other, while on the biased corpus there is a clear difference between them, with support vector machines being the best and naïve Bayes being the worst. The thesis also presents an analysis of questions that are problematic for all learning algorithms. The errors can roughly be attributed to categories with few members, variations in question formulation, the actual usage of the taxonomy, keyword errors, and spelling errors. A large portion of the errors were also hard to explain. / Report code: LiU-Tek-Lic-2007:29.
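The experimental setup, bag-of-words features feeding different classifiers over answer-type labels, can be sketched with two of the five algorithms as follows. The example questions and the three answer types are invented for illustration; the thesis uses full answer-type taxonomies and two real corpora.

```python
# Compact sketch of question classification: bag-of-words features with two of
# the compared algorithms (naïve Bayes and a linear support vector machine).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

questions = [
    "who wrote the origin of species",
    "when did the berlin wall fall",
    "where is the highest mountain on earth",
    "who painted the mona lisa",
    "when was the telephone invented",
    "where does the amazon river begin",
]
labels = ["PERSON", "DATE", "LOCATION", "PERSON", "DATE", "LOCATION"]

for name, clf in [("naïve Bayes", MultinomialNB()), ("linear SVM", LinearSVC())]:
    model = make_pipeline(CountVectorizer(), clf).fit(questions, labels)
    print(name, model.predict(["who discovered penicillin"]))
```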
287

Evaluating Readability on Mobile Devices

Öquist, Gustav January 2006 (has links)
The thesis presents findings from five readability studies performed on mobile devices. The dynamic Rapid Serial Visual Presentation (RSVP) format has been enhanced with regard to linguistic adaptation and segmentation as well as eye movement modeling. The novel formats have been evaluated against other common presentation formats including Paging, Scrolling, and Leading in latin-square balanced repeated-measurement studies with 12-16 subjects. Apart from monitoring Reading speed, Comprehension, and Task load (NASA-TLX), Eye movement tracking has been used to learn more about how the text presentation affects reading. The Page format generally offered the best readability. Reading on a mobile phone decreased reading speed by 10% compared to reading on a Personal Digital Assistant (PDA), an interesting finding given that the display area of the mobile phone was 50% smaller. Scrolling, the most commonly used presentation format on mobile devices today, proved inferior to both Paging and RSVP. Leading, the most widely known dynamic format, caused very unnatural eye movements for reading. This seems to have increased task load, but not affected reading speed to a similar extent. The RSVP format displaying one word at a time was found to reduce eye movements significantly, but contrary to common claims, this resulted in decreased reading speed and increased task load. In the last study, Predictive Text Presentation (PTP) was introduced. The format is based on RSVP and combines linguistic chunking and adaptation with eye movement modeling to achieve a reading experience that can rival traditional text presentation. It is explained why readability on mobile devices is important, how it may be evaluated in an efficient and yet reliable manner, and PTP is pinpointed as the format with greatest potential for improvement. The methodology used in the evaluations and the shortcomings of the studies are discussed. Finally, a hyper-graeco-latin-square experimental design is proposed for future evaluations.
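As a toy illustration of the RSVP format itself, the loop below flashes one word at a time at a fixed rate. The pacing adjustments are guesses for illustration only; the thesis's formats, and PTP in particular, adapt exposure time using linguistic chunking and eye movement modeling.

```python
# Toy RSVP loop: words are displayed one at a time in a fixed screen position,
# removing the need for eye movements across the line.
import sys
import time

def rsvp(text, words_per_minute=300):
    delay = 60.0 / words_per_minute
    for word in text.split():
        sys.stdout.write("\r" + word.ljust(20))
        sys.stdout.flush()
        # give longer words and clause-final punctuation a little extra time
        extra = 0.02 * max(len(word) - 6, 0) + (0.3 * delay if word[-1] in ".,;:" else 0)
        time.sleep(delay + extra)
    sys.stdout.write("\n")

rsvp("Reading one word at a time removes the need for eye movements across the line.")
```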
288

Utveckling av ett svensk-engelskt lexikon inom tåg- och transportdomänen (Development of a Swedish-English Lexicon for the Train and Transport Domain)

Axelsson, Hans, Blom, Oskar January 2006 (has links)
This paper describes the process of building a machine translation lexicon for use in the train and transport domain with the machine translation system MATS. The lexicon consists of a Swedish part, an English part and links between them, and is derived from a Trados translation memory which is split into a training (90%) part and a testing (10%) part. The task is carried out mainly by using existing word-linking software and recycling previous machine translation lexicons from other domains. To do this, a method is developed where the focus lies on automation by means of both existing and self-developed software, in combination with manual interaction. The domain-specific lexicon is then extended with a domain-neutral core lexicon and a less domain-neutral general lexicon. The different lexicons are automatically and manually evaluated through machine translation on the test corpus. The automatic evaluation of the largest lexicon yielded a NEVA score of 0.255 and a BLEU score of 0.190. The manual evaluation found 34% of the segments correctly translated, 37% not correct but perfectly understandable, and 29% difficult to understand.
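The automatic evaluation step, scoring machine-translated test segments against reference translations, can be sketched with BLEU as follows (NEVA is not available in standard toolkits, so only BLEU is shown). The two segments are invented; the thesis evaluates the held-out 10% slice of the translation memory.

```python
# Sketch of corpus-level BLEU scoring with NLTK over tokenized segments.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["the", "train", "departs", "from", "platform", "two"]],
    [["check", "the", "brake", "system", "before", "departure"]],
]
hypotheses = [
    ["the", "train", "leaves", "from", "platform", "two"],
    ["check", "the", "brake", "system", "before", "departure"],
]
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```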
289

Hybrid models for Chinese unknown word resolution

Lu, Xiaofei. January 2006 (has links)
Thesis (Ph. D.)--Ohio State University, 2006. / Title from first page of PDF file. Includes bibliographical references (p. 143-155).
