Hybrid Methods for Coreference Resolution in Swedish

Nilsson, Kristina January 2010
The aim of this thesis is to improve coreference resolution in Swedish by providing a hybrid approach based on combining data-driven methods and linguistic knowledge. Coreference resolution here consists in identifying all expressions in a text that have the same referent, for example, a person or an object. The linguistic knowledge is based on Accessibility Theory (Ariel 1990). This is used for guiding the selection of likely anaphor-antecedent pairs from the set of all possible such pairs in a text. The data-driven method adopted is Memory-Based Learning (MBL), a supervised method based on the idea that learning means storing experiences in memory, and that new problems are solved by reusing solutions from similar experiences (Daelemans and Van den Bosch 2005). The referring expressions covered by the system are names, definite descriptions, and pronouns. In order to maximize performance, we use different classifiers with a specific set of linguistically motivated features for each type of expression. The great majority of features used for classification are domain- and language-independent. We demonstrate two ways of using this method of linguistically motivated selection of anaphor-antecedent pairs. First, the number of training examples stored in memory is reduced. We find that for coreference resolution of definite descriptions and names, the amount of training data can thereby be reduced with only a minor loss in performance, but for pronoun resolution there is a negative effect. Second, selection can be used for improving on coreference resolution results. This is the first step in our hybrid approach to coreference resolution, where the second step is the application of an MBL classifier for determining coreference between the selected pairs. Results indicate that this hybrid approach is advantageous for coreference resolution of definite descriptions and names. 
For pronoun resolution, there is a negative effect on recall along with a positive effect on precision. / To order the book, send an e-mail to exp@ling.su.se
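The Memory-Based Learning idea mentioned in this abstract (store training examples, then classify a new case by its most similar stored cases) can be illustrated with a minimal sketch. The feature tuple, the labels, and the function names below are hypothetical and chosen only for illustration; they are not the feature set or the software used in the thesis.

```python
from collections import Counter

def overlap_distance(a, b):
    """Count the feature positions where two instances disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(memory, instance, k=1):
    """Label `instance` by majority vote among its k nearest stored examples."""
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical anaphor-antecedent pair features (assumed, not from the thesis):
# (anaphor type, gender agreement, number agreement, same sentence)
memory = [
    (("pronoun", "agree", "agree", "yes"), "coreferent"),
    (("pronoun", "clash", "agree", "no"), "not_coreferent"),
    (("name", "agree", "agree", "no"), "coreferent"),
    (("definite", "clash", "clash", "no"), "not_coreferent"),
]

# A new pair is classified by reusing the labels of similar stored pairs.
print(classify(memory, ("pronoun", "agree", "agree", "no"), k=3))
```

Under this reading, "training" amounts to appending labelled pairs to `memory`, which is why pre-selecting likely anaphor-antecedent pairs directly reduces what has to be stored.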

Question Classification in Question Answering Systems

Sundblad, Håkan January 2007
<p>Question answering systems can be seen as the next step in information retrieval, allowing users to pose questions in natural language and receive succinct answers. In order for a question answering system as a whole to be successful, research has shown that the correct classification of questions with regard to the expected answer type is imperative. Question classification has two components: a taxonomy of answer types, and a machinery for making the classifications.</p><p>This thesis focuses on five different machine learning algorithms for the question classification task. The algorithms are k-nearest neighbours, naïve Bayes, decision tree learning, sparse network of winnows, and support vector machines. These algorithms have been applied to two different corpora, one of which has been used extensively in previous work and has been constructed for a specific agenda. The other corpus is drawn from a set of users' questions posed to a running online system. The results showed that the performance of the algorithms on the different corpora differs both in absolute terms and with regard to their relative ranking. On the novel corpus, naïve Bayes, decision tree learning, and support vector machines perform on par with each other, while on the biased corpus there is a clear difference between them, with support vector machines being the best and naïve Bayes being the worst.</p><p>The thesis also presents an analysis of questions that are problematic for all learning algorithms. The errors can roughly be divided into those due to categories with few members, variations in question formulation, the actual usage of the taxonomy, keyword errors, and spelling errors. A large portion of the errors were also hard to explain.</p> / Report code: LiU-Tek-Lic-2007:29.

Evaluating Readability on Mobile Devices

Öquist, Gustav January 2006
<p>The thesis presents findings from five readability studies performed on mobile devices. The dynamic Rapid Serial Visual Presentation (RSVP) format has been enhanced with regard to linguistic adaptation and segmentation as well as eye movement modeling. The novel formats have been evaluated against other common presentation formats including Paging, Scrolling, and Leading in Latin-square balanced repeated-measurement studies with 12-16 subjects. Apart from monitoring Reading speed, Comprehension, and Task load (NASA-TLX), Eye movement tracking has been used to learn more about how the text presentation affects reading.</p><p>The Page format generally offered the best readability. Reading on a mobile phone decreased reading speed by 10% compared to reading on a Personal Digital Assistant (PDA), an interesting finding given that the display area of the mobile phone was 50% smaller. Scrolling, the most commonly used presentation format on mobile devices today, proved inferior to both Paging and RSVP. Leading, the most widely known dynamic format, caused very unnatural eye movements for reading. This seems to have increased task load, but not affected reading speed to a similar extent. The RSVP format displaying one word at a time was found to reduce eye movements significantly, but contrary to common claims, this resulted in decreased reading speed and increased task load. In the last study, Predictive Text Presentation (PTP) was introduced. The format is based on RSVP and combines linguistic chunking and adaptation with eye movement modeling to achieve a reading experience that can rival traditional text presentation.</p><p>It is explained why readability on mobile devices is important, how it may be evaluated in an efficient and yet reliable manner, and PTP is pinpointed as the format with the greatest potential for improvement. The methodology used in the evaluations and the shortcomings of the studies are discussed. 
Finally, a hyper-Graeco-Latin-square experimental design is proposed for future evaluations.</p>
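As an illustration of the Latin-square balancing mentioned in this abstract, the sketch below builds a simple cyclic Latin square that assigns each subject an order of presentation formats, so that every format appears exactly once in every serial position. The function name and the choice of a cyclic construction are assumptions for illustration; the abstract does not state which construction the studies used.

```python
def latin_square(conditions):
    """Return a cyclic Latin square: row i is the condition list rotated by i."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

# Format names are taken from the abstract; the ordering scheme is assumed.
formats = ["Paging", "Scrolling", "Leading", "RSVP"]

# Each row is the presentation order for one subject (or one group of subjects).
for subject, order in enumerate(latin_square(formats), start=1):
    print(subject, order)
```

A cyclic square balances serial position only; it does not balance which format immediately precedes which, so designs that also control carry-over effects need a different construction.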

Cross-language Ontology Learning : Incorporating and Exploiting Cross-language Data in the Ontology Learning Process

Hjelm, Hans January 2009
An ontology is a knowledge-representation structure, where words, terms or concepts are defined by their mutual hierarchical relations. Ontologies are becoming ever more prevalent in the world of natural language processing, where we currently see a tendency towards using semantics for solving a variety of tasks, particularly tasks related to information access. Ontologies, taxonomies and thesauri (all related notions) are also used in various forms by humans, to standardize business transactions or for finding conceptual relations between terms in, e.g., the medical domain. The acquisition of machine-readable, domain-specific semantic knowledge is time-consuming and prone to inconsistencies. The field of ontology learning therefore provides tools for automating the construction of domain ontologies (ontologies describing the entities and relations within a particular field of interest), by analyzing large quantities of domain-specific texts. This thesis studies three main topics within the field of ontology learning. First, we examine which sources of information are useful within an ontology learning system and how the information sources can be combined effectively. Second, we do this with a special focus on cross-language text collections, to see if we can learn more from studying several languages at once than we can from a single-language text collection. Finally, we investigate new approaches to formal and automatic evaluation of the quality of a learned ontology. We demonstrate how to combine information sources from different languages and use them to train automatic classifiers to recognize lexico-semantic relations. The cross-language data is shown to have a positive effect on the quality of the learned ontologies. We also give theoretical and experimental results, showing that our ontology evaluation method is a good complement to, and in some respects improves on, the evaluation measures in use today. 
/ To order the book, send an e-mail to exp@ling.su.se

Tolkning av spansk känsloprosodi

Olavison, Jari January 2003
<p>Text-to-speech systems are becoming increasingly common in everyday life, and considerable research is also devoted to developing speech-to-speech translation systems. Many companies increasingly use telephone services in which automatic systems with synthetic speech and speech recognition replace humans. For us as consumers to feel comfortable using these services and to understand the messages, it is important that the synthetic voices sound as natural as possible. What makes a voice natural is its prosody, i.e. its non-segmental aspects such as intonation, intensity and tempo, to name a few. Prosody does not only have linguistic functions; it also signals the speaker's emotions and attitudes. Who wants to listen to a synthetic voice that sounds very sad or angry, for example when the car's GPS navigator sorrowfully tells us to take the next exit to the right?</p><p>Emotions are normally signalled both auditorily and visually: a happy person often has a smile on their lips and speaks in a way that gives us as listeners the impression that the person is happy. This study concerns precisely the auditory signalling of emotions, which I call emotional prosody.</p><p>It is not self-evident that speakers of different languages signal emotions in the same way, even though many linguists, like me, are convinced that a certain universality exists, which should be taken into account in speech-to-speech translation systems. For this reason I have chosen in my study to compare Swedish auditory expressions of emotion with Spanish ones.</p><p>I did this by conducting perception tests of Spanish voices and comparing the results with an earlier study by Åsa Abelin and Jens Allwood at Göteborgs universitet (1999), who carried out a similar study using Swedish voices. Comparisons of misinterpretations of intended emotions indicate, among other things, that certain emotions appear to be expressed differently in Spanish and Swedish. 
This is clearest for "surprise", which in both studies was to a large extent misinterpreted by informants with a different native language than the speaker; "disgust" also appears to be expressed somewhat differently. Other results are that Swedish speakers often misinterpret "anger" (Spanish) as "joy", which can be compared with Spanish speakers misinterpreting "joy" (Swedish) as "sadness". The study also shows that emotions that are confused with one another are often acoustically similar in expression and also share some semantic similarities.</p>

'Consider' and its Swedish equivalents in relation to machine translation

Andersson, Karin January 2007
<p>This study describes the English verb ’consider’ and the characteristics of some of its senses. An investigation of this kind may be useful, since a machine translation program, SYSTRAN, has invariably translated ’consider’ with the Swedish verbs ’betrakta’ (Eng: ’view’, ’regard’) and ’anse’ (Eng: ’regard’). This handling of ’consider’ is not satisfactory in all contexts.</p><p>Since ’consider’ is a cogitative verb, it is fascinating to observe that both the theory of semantic primes and universals and conceptual semantics are concerned with cogitation in various ways. Anna Wierzbicka, who is one of the advocates of semantic primes and universals, argues that THINK should be considered as a semantic prime. Moreover, one of the prime issues of conceptual semantics is to describe how thoughts are constructed by virtue of e.g. linguistic components, perception and experience.</p><p>In order to define and clarify the distinctions between the different senses, we have taken advantage of the theory of mental spaces.</p><p>This thesis is structured according to the meanings of ’consider’ listed in WordNet. Consequently, the senses of ’consider’ have been organized into the following groups: ’Observation’, ’Opinion’ together with its sub-group ’Likelihood’, and ’Cogitation’ followed by its sub-group ’Attention/Consideration’.</p><p>A concordance tool, http://www.nla.se/culler, provided us with 90 literary quotations that were collected in a corpus. Afterwards, these citations were distributed between the groups mentioned above and translated into Swedish by SYSTRAN.</p><p>Furthermore, the meanings of ’consider’ have also been related to the senses recorded by the FrameNet scholars. Here, ’consider’ is regarded as a verb of ’Cogitation’ and ’Categorization’.</p><p>When this study was completed, it could be inferred that certain senses are connected to specific syntactic constructions. 
In other cases, however, the distinctions between various meanings can only be explained by virtue of semantics.</p><p>To conclude, it appears likely that implementation is facilitated if a specific syntactic construction can be tied to a particular sense. This may be the case concerning some meanings of ’consider’. Machine translation is presumably a much more laborious task if one is solely governed by semantic conditions.</p>

Disfluency in Swedish human–human and human–machine travel booking dialogues

Eklund, Robert January 2004
This thesis studies disfluency in spontaneous Swedish speech, i.e., the occurrence of hesitation phenomena like eh, öh, truncated words, repetitions and repairs, mispronunciations and so on. The thesis is divided into three parts: PART I provides the background, covering scientific, personal and industrial–academic aspects, in the Tuning in quotes, the Preamble and the Introduction (chapter 1). PART II consists of one chapter only, chapter 2, which dives into the etiology of disfluency. Accordingly, it describes previous research on disfluencies, also including areas that are not the main focus of the present tome, like stuttering, psychotherapy, philosophy, neurology, discourse perspectives, speech production, application-driven perspectives, cognitive aspects, and so on. A discussion on terminology and definitions is also provided. The goal of this chapter is to provide as broad a picture as possible of the phenomenon of disfluency, and how all those different and varying perspectives are related to each other. PART III describes the linguistic data studied and analyzed in this thesis, with the following structure: Chapter 3 describes how the speech data were collected, and for what reason. Sum totals of the data and the post-processing method are also described. Chapter 4 describes how the data were transcribed, annotated and analyzed. The labeling method is described in detail, as is the method employed to do frequency counts. Chapter 5 presents the analysis and results for all different categories of disfluencies. Besides general frequency and distribution of the different types of disfluencies, both inter- and intra-corpus results are presented, as are co-occurrences of different types of disfluencies. Also, inter- and intra-speaker differences are discussed. Chapter 6 discusses the results, mainly in light of previous research. 
Reasons for the observed frequencies and distribution are proposed, as is their relation to language typology, as well as syntactic, morphological and phonetic reasons for the observed phenomena. Future work is also envisaged: work that is possible on the present data set, work that is possible on the present data set given extended labeling, and work that I think should be carried out, but where the present data set fails, in one way or another, to meet the requirements of such studies. Appendices 1–4 list the sum total of all data analyzed in this thesis (apart from Tok Pisin data). Appendix 5 provides an example of a full human–computer dialogue. / The electronic version of the printed dissertation is a corrected version where typos as well as phrases have been corrected. A list with the corrections is presented in the errata list above.

Utveckling av ett svensk-engelskt lexikon inom tåg- och transportdomänen

Axelsson, Hans, Blom, Oskar January 2006
This paper describes the process of building a machine translation lexicon for use in the train and transport domain with the machine translation system MATS. The lexicon consists of a Swedish part, an English part and links between them, and is derived from a Trados translation memory that is split into a training part (90%) and a testing part (10%). The task is carried out mainly by using existing word linking software and recycling previous machine translation lexicons from other domains. In order to do this, a method is developed where focus lies on automation by means of both existing and self-developed software, in combination with manual interaction. The domain-specific lexicon is then extended with a domain-neutral core lexicon and a less domain-neutral general lexicon. The different lexicons are automatically and manually evaluated through machine translation on the test corpus. The automatic evaluation of the largest lexicon yielded a NEVA score of 0.255 and a BLEU score of 0.190. The manual evaluation found 34% of the segments correctly translated, 37% not correct but perfectly understandable, and 29% difficult to understand.
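The 90%/10% division of the translation memory into training and testing parts can be sketched as follows. This is a minimal illustration only: the function name, the random shuffling, the seed, and the toy segment pairs are assumptions, since the abstract does not say how the segments were assigned to each part.

```python
import random

def split_segments(segments, test_fraction=0.1, seed=0):
    """Shuffle segment pairs reproducibly and hold out a fraction for testing."""
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled pairs form the test part, the rest the training part.
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for aligned Swedish-English translation memory segments (invented).
pairs = [(f"svensk mening {i}", f"English sentence {i}") for i in range(20)]

train, test = split_segments(pairs)
print(len(train), len(test))
```

Keeping the test part out of lexicon construction is what makes scores computed on it a fair estimate of translation quality on unseen segments.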

Grundtonsstrategier vid tonlösa segment

von Kartaschew, Filip January 2007
Prosody models, which can be used in speech synthesis among other things, are often based on analyses of speech consisting solely of voiced segments. Before a voiceless consonant, the fundamental frequency (F0) contour of a vowel segment has no possible continuation and also becomes shorter. This is usually handled by truncating the F0 contour. Earlier studies have shown, in brief, that differences other than truncation can arise in a vowel's F0 contour depending on whether the following segment is voiced or voiceless. Taking these studies as a starting point, this degree project examines the F0 contour in Swedish sentences. The results of this study likewise show that different strategies are used in the F0 contour, and that truncation is not sufficient to explain what happens to the F0 contour in these contexts. In general, the results indicate that it appears important to the subjects to preserve the information that the F0 contour provides in the form of maximum and minimum values, and that falls and rises are maintained as far as possible.

