Global ETD Search

Domain and genre dependency in Statistical Machine Translation

Statistical Machine Translation (SMT) is currently the most promising and widely studied paradigm in the broader field of Machine Translation. It is continuously explored in order to improve its performance and to find solutions to its current shortcomings, in particular the scarcity of large bilingual corpora, across a variety of domains and genres, that can be used as training data. However, while one of the main trends is still to rely as much as possible on large collections of data that are already available, even when their content is not closely related to the specific translation task, the possibility of using smaller but appropriately selected training sets, chosen case by case according to the textual variety of the documents that need to be translated, has not been extensively explored so far. The goal of this research is to investigate whether the lack of large quantities of assorted data can be addressed by applying strategies commonly used in genre and domain classification (including unsupervised topic modeling and document dissimilarity techniques), in particular by performing subsampling experiments on bilingual corpora in order to obtain a good fit between the training data and the texts that need to be translated with SMT. For the purposes of this study, existing freely available large corpora were found to be unsuitable for the selection of domain- or document-specific subsamples, so two new parallel corpora, English-Italian and English-German, were compiled by employing the "web as corpus" approach on websites containing translated content. Tests were then carried out on documents belonging to different varieties, which were translated with SMT systems built on subsamples of the training data selected using document dissimilarity measures, so as to pick the most suitable documents as training data.
This method has shown that the choice of subsampling strategy depends heavily on the text variety of each document considered, but it has also shown that better translation results can be obtained from small samples of the training set than from all the available data, which also brings benefits in terms of shorter training times and lower use of computational resources.

http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.643605

Identifier	oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:643605
Date	January 2014
Creators	Brunello, Marco
Contributors	Sharoff, Serge ; Babych, Bogdan ; Thomas, Martin
Publisher	University of Leeds
Source Sets	Ethos UK
Detected Language	English
Type	Electronic Thesis or Dissertation
Source	http://etheses.whiterose.ac.uk/8420/
