161 |
Collocation Segmentation for Text Chunking / Teksto skaidymas pastoviųjų junginių segmentais
Daudaravičius, Vidas 04 February 2013
Segmentation is a widely used paradigm in text processing. Rule-based, statistical and hybrid methods are employed to perform the segmentation. This dissertation introduces a new type of segmentation - collocation segmentation - and a new method to perform it, and applies them to three different text processing tasks. In lexicography, collocation segmentation makes it possible to use large corpora to evaluate the usage and importance of terminology over time. Text categorization results can be improved using collocation segmentation. The study shows that collocation segmentation, without any other language resources, achieves better results than the widely used n-gram techniques together with POS (Part-of-Speech) processing tools. Also, preprocessing the data with collocation segmentation and then integrating these segments into a Statistical Machine Translation system improves the translation results. Different word combinability measures influence the final collocation segmentation in different ways and, thus, the translation results. The new collocation segmentation method is simple, efficient and applicable to language processing for diverse applications. / Methods for splitting text into segments of various types are widely used in text processing. Both statistical and formal methods are used for segmentation. The dissertation presents a new segmentation type and method - collocation segmentation - and its applications in several areas of text processing. Applying collocation segmentation in lexicography shows how very large text archives can be analyzed objectively and quickly, detecting the terminology in use and the importance and change over time of these automatically identified terms. This analysis makes it possible to quickly identify important methodological shifts in the history of research and to determine the research areas that have recently become topical.
In the text classification application, it is shown how collocation segmentation can improve classification results. Collocation segmentation is also shown to slightly improve the quality of statistical machine translation, and the influence of various word combinability measures on collocation segmentation is examined. The new collocation segmentation method opens new possibilities for improving text processing results in various applications and languages.
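The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of the general idea, assuming the Dice coefficient as the word combinability measure and a fixed threshold for placing segment boundaries. Both choices, the function names `dice_scores` and `segment`, and the toy corpus are illustrative assumptions, not the dissertation's actual method or data.

```python
from collections import Counter


def dice_scores(tokens):
    """Dice coefficient for every adjacent word pair seen in the token list.

    Assumption: Dice stands in for the 'word combinability measure';
    the dissertation compares several such measures.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): 2 * n / (unigrams[a] + unigrams[b])
        for (a, b), n in bigrams.items()
    }


def segment(tokens, scores, threshold=0.8):
    """Start a new segment wherever the adjacent-pair score is below threshold.

    Assumption: a fixed threshold is the only boundary rule in this sketch.
    """
    if not tokens:
        return []
    segments = [[tokens[0]]]
    for prev, word in zip(tokens, tokens[1:]):
        if scores.get((prev, word), 0.0) >= threshold:
            segments[-1].append(word)  # strong association: extend the segment
        else:
            segments.append([word])    # weak association: place a boundary
    return segments


# Hypothetical toy corpus; real use would estimate scores on a large corpus.
corpus = "machine translation improves when machine translation uses segments".split()
scores = dice_scores(corpus)
for seg in segment(corpus, scores):
    print(" ".join(seg))
```

On a corpus this small, accidental pairs also score highly, which is why the measure needs to be estimated from large corpora before the boundaries become meaningful.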
|
163 |
英文介系詞片語定位與英文介系詞推薦 / Attachment of English prepositional phrases and suggestions of English prepositions
蔡家琦, Tsai, Chia Chi Unknown Date
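Prepositions in English sentences usually let a prepositional phrase modify its context more precisely, and native speakers of English use them intuitively. Computers, however, do not understand semantics, so they cannot easily determine what a prepositional phrase modifies; non-native speakers, in turn, find it hard to choose the correct preposition intuitively. This study therefore focuses on prepositional phrase attachment and preposition suggestion.
This study abstracts the two problems into a single decision problem and proposes a generalized solution. What the two problems share is the verb phrase: a simple verb phrase contains the four most important headwords, namely the verb, the first noun, the preposition, and the second noun. Starting from these four headwords, we make hierarchical selections through WordNet, look for semantic commonalities across a large number of cases, and then use machine learning to build a generalized model. In addition, for the attachment problem, we selected the more challenging prepositions for our experiments.
Using real-life corpora, our method handles prepositional phrase attachment better than a Maximum Entropy method that likewise considers the four headwords, and performs about as well as the Stanford parser, which considers the full context. For preposition suggestion, a comprehensive baseline for comparison is hard to find, but our method reaches an accuracy of 53.14%.
This study finds that high-level semantics gives the classifier good performance and that selecting semantics hierarchically improves it further. This shows that a set of rules for both preposition problems can indeed be induced from semantics. We believe the results will be of value to future research on machine translation and text proofreading. / This thesis focuses on problems of attachment of prepositional phrases (PPs) and problems of prepositional suggestions. Determining the correct PP attachment is not easy for computers. Using correct prepositions is not easy for learners of English as a second language.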
I transform the problems of PP attachment and prepositional suggestion into an abstract model, and apply the same computational procedures to solve these two problems. The common model features four headwords, i.e., the verb, the first noun, the preposition, and the second noun in the prepositional phrases. My methods consider the semantic features of the headwords in WordNet to train classification models, and apply the learned models to the attachment and suggestion problems. This exploration of PP attachment problems is special in that only those PPs that are almost equally likely to attach to the verb and to the first noun were used in the study.
The proposed models achieve satisfactory performance while considering only four headwords. In experiments for PP attachment, my methods outperformed a Maximum Entropy classifier which also considered four headwords. The performance of my methods was similar to that of the Stanford parsers, even though the Stanford parsers had access to the complete sentences to judge the attachments. In experiments for prepositional suggestions, my methods found the correct prepositions 53.14% of the time, which is not as good as the best performing system today.
This study reconfirms that semantic information is instrumental for both PP attachment and prepositional suggestions. High-level semantic information helped to offer good performance, and hierarchical semantic synsets helped to improve the observed results. I believe that the reported results are valuable for future studies of PP attachment and prepositional suggestions, which are key components for machine translation and text proofreading.
|
164 |
Fluency enhancement : applications to machine translation : thesis for Master of Engineering in Information & Telecommunications Engineering, Massey University, Palmerston North, New Zealand
Manion, Steve Lawrence January 2009
The quality of Machine Translation (MT) can often be poor because the output appears incoherent and lacks fluency. These problems consist of word ordering, awkward use of words and grammar, and translating text too literally. However, we should not consider translations such as these failures until we have done our best to enhance their quality, or more simply, their fluency. In the same way that various processes can be applied to touch up a photograph, various processes can also be applied to touch up a translation. This research outlines the improvement of MT quality through the application of Fluency Enhancement (FE), which is a process we have created that reforms and evaluates text to enhance its fluency. We have tested our FE process on our own MT system, which operates on what we call the SAM fundamentals, which are as follows: Simplicity - to be simple in design in order to be portable across different language pairs, Adaptability - to compensate for the evolution of language, and Multiplicity - to determine a final set of translations from as many candidate translations as possible. Based on our research, the SAM fundamentals are the key to developing a successful MT system, and are what have piloted the success of our FE process.
|
166 |
A generic and open framework for multiword expressions treatment : from acquisition to applications
Ramisch, Carlos Eduardo January 2012
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation.
For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.
|
167 |
Enfrentamento do problema das divergências de tradução por um sistema de tradução automática : um exercício exploratório
Oliveira, Mirna Fernanda de. January 2006
Advisor: Bento Carlos Dias da Silva / Committee: Beatriz Nunes de Oliveira Longo / Committee: Dirce Charara Monteiro / Committee: Gladis Maria de Barcellos Almeida / Committee: Heronides Maurílio de Melo Moura / Summary: The aim of this thesis is to develop an exploratory linguistic-computational study of a specific problem that machine translation systems must face: translation divergences, whether syntactic or lexical-semantic in nature, that occur between pairs of sentences of different natural languages. To that end, it draws on the interdisciplinary research methodology in NLP (Natural Language Processing) of Dias-da-Silva (1996, 1998 and 2003) and on the linguistic-computational theory underlying Dorr's (1993) UNITRAN machine translation system, which in turn is supported by Chomsky's (1981) syntactic theory of Principles and Parameters and Jackendoff's (1990) semantic theory of Conceptual Structures. As its contribution, the thesis describes the composition and operation of UNITRAN, designed to handle part of the problem posed by translation divergences, and illustrates the possibility of including Portuguese in this system by examining some types of divergence that occur between English and Portuguese sentences. / Abstract: This dissertation aims to develop an exploratory linguistic and computational study of a specific type of problem that machine translation systems must face: translation divergences, whether syntactic or lexical-semantic, that can be found between sentences of distinct natural languages.
To achieve this aim, this work is based on the interdisciplinary research methodology of the NLP (Natural Language Processing) field developed by Dias-da-Silva (1996, 1998 & 2003) and on the computational-linguistic theory behind UNITRAN, a machine translation system developed by Dorr (1993), which is in turn based on Chomsky's syntactic theory of Government and Binding (1981) and Jackendoff's semantic theory of Conceptual Structures (1990). As a contribution to the field of NLP, this dissertation describes the machinery of UNITRAN, designed to deal with part of the problem of translation divergences, and it illustrates the possibility of including Brazilian Portuguese in the system through the investigation of certain kinds of divergences that can be found between English and Brazilian Portuguese sentences. / Doctorate
|
169 |
Increasing Willingness and Opportunities to Communicate in a Foreign Language with Machine Translation and Instant Messaging
Tekwa, Kizito 05 April 2018
Advances in technology over the last few decades have led to significant changes in the way we communicate. Technological innovation has been one of the reasons for the development of computer-mediated communication (CMC), which has had far-reaching implications in the private and professional lives of many people. Instant messaging (IM), which is one form of computer-mediated communication, has significantly gained popularity over the years and many scholars have examined its influence in areas including business and academics. Initially developed to enhance communication between users who understood the same language, some IM clients including Wechat (www.wechat.com), QQ International (www.imqq.com), and Skype Translator (www.skype.com) have integrated a built-in translation application that facilitates communication among users that speak different languages.
The current research project explores the relationship between machine translation, IM, and foreign language (FL) learning. In particular, it investigates whether machine-translated IM could improve the willingness to communicate (WTC) of beginner FL learners and whether the IM translation tool offers learners opportunities to communicate (OTC) in the FL. To answer these questions, China-based beginner FL learners were recruited and paired with native and near-native English speakers based in Canada. China-based participants completed two questionnaires and also exchanged (machine-translated) IM on selected topics with Canada-based participants for a period of ten weeks. Some China-based participants communicated with the help of the IM translation tool, while the others communicated without the tool.
After analyzing the data gathered during the study, we found that WTC increased more for participants with the IM translation tool than for participants without the IM translation tool. Our analysis also indicated that the IM translation tool offered participants OTC in English. This was illustrated in various conversation aspects including number of words and turns exchanged, synchronous exchanges, ownership, conversation enhancement, topics discussed, tasks undertaken, and requests for paraphrase, repetition and explanation.
In discussing the implications of our findings, we outline how the research project reinforced our understanding of the concept of WTC in a technology-driven FL learning environment. We also discuss the implications of our findings for machine translation (MT), FL, and translation studies. Our discussion focuses on the debate over which tools to use and what content to teach in translator and FL training environments, as well as on various concepts in translation studies including MT quality, writing for MT, fit-for-purpose MT, collaboration and MT post-editing. This project enables us to test the applicability of MT in a different context using a novel group of users. The project therefore contributes to ongoing research on the relationship between CMC (specifically IM), MT, and FL learning, as well as to our knowledge of applications and perceptions of MT.
|
170 |
A criação de um sistema híbrido de tradução automática para a conversão de expressões nominais da língua inglesa / The Creation of a Hybrid Machine Translation for the Conversion of Nominal Expressions from English
Tiago Martins da Cunha 18 December 2013
Coordenação de Aperfeiçoamento de Nível Superior / Deutscher Akademischer Austausch Dienst / Machine translation (MT) has had much of its credibility questioned by professional translators for many years. Nevertheless, the use of MT systems has become a necessity in order to organize and speed up the translation process. Most users, professional or not, know nothing about the design of the tools that make up the system they use. An MT system is designed as a chain of tools that form its engine. We therefore propose to describe and build a translation tool able to handle nominal expressions from English into Portuguese. Nominal expressions in English may contain elements such as genitives and gerunds, which have no counterparts in Portuguese and thus cause difficulties for MT systems. Our goal is to create an MT system that handles this problem satisfactorily. The system developed and described in this thesis was trained on nominal expressions from the Europarl corpus and tested on nominal expressions discussed in the literature on noun phrase syntax. It produced results that we consider satisfactory according to the scores obtained in manual and automatic evaluation, compared with the results of other freely available MT systems. / Machine translation (MT) had much of its credibility questioned by professional translators for many years. However, the use of MT systems has become a necessity in order to organize and accelerate the translation process. Most users, professionals or not, have no knowledge about the design of the tools that make up the system they use. The design of an MT system consists of a pipeline of tools that form the system's engine.
Thus, we propose the description and the creation of a translation tool that would be able to handle nominal expressions from English to Portuguese. Nominal expressions in English may be composed of elements such as genitives and gerunds, which lack Portuguese correspondents. Thus, these elements cause difficulties for MT systems. Our goal is to create an MT system that is able to deal satisfactorily with this problem. The system developed and described in this thesis was trained with nominal expressions from the Europarl corpus and tested with nominal expressions handled in the literature on noun phrase syntax. Our system showed what we consider satisfactory results according to the scores in the manual and automatic evaluations when compared with the results of other MT systems freely available for use.
|