Domain and genre dependency in Statistical Machine Translation

Statistical Machine Translation (SMT) is currently the most promising and widely studied paradigm in the broader field of Machine Translation. It is continuously explored in order to improve its performance and to find solutions to its current shortcomings, in particular the scarcity of large bilingual corpora, across a variety of domains and genres, that can be used as training data. One of the main trends is still to rely as much as possible on large collections of data that are already available, even when their content is not closely related to the specific translation task. The alternative of using smaller but appropriately selected training sets, chosen case by case according to the textual variety of the documents to be translated, has not been extensively explored so far. The goal of this research is to investigate whether the lack of large quantities of assorted data can be addressed by applying strategies commonly used in genre and domain classification (including unsupervised topic modeling and document dissimilarity techniques), in particular by performing subsampling experiments on bilingual corpora in order to obtain a good fit between the training data and the texts that need to be translated with SMT. For the purposes of this study, existing freely available large corpora were found to be unsuitable for the selection of domain- or document-specific subsamples, so two new parallel corpora (English-Italian and English-German) were compiled by applying the "web as corpus" approach to websites containing translated content. Tests were then carried out on documents belonging to different varieties, which were translated with SMT systems built from subsamples of training data selected using document dissimilarity measures, so as to pick the most suitable documents as training data.
This method has shown that the choice of subsampling strategy depends heavily on the text variety of each document considered. It has also proven that better translation results can be obtained from small samples of the training set than from all the available data, which also brings benefits in terms of quicker training times and lower use of computational resources.
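
As a rough illustration of the subsampling idea described above, the following sketch ranks training documents by their dissimilarity to the document to be translated and keeps only the closest ones. The abstract does not say which dissimilarity measures were used, so cosine dissimilarity over raw word counts is an assumption made here for simplicity. The function names and the toy corpus are also hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map each lowercased, whitespace-separated token to its count."""
    return Counter(text.lower().split())


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity between two term-frequency Counters.

    Assumption: cosine over raw counts stands in for whatever
    dissimilarity measure is actually appropriate for the task.
    """
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def select_training_subsample(test_doc, training_docs, k):
    """Return the k (source, target) pairs closest to test_doc.

    Only the source side of each pair is compared with the test document.
    """
    test_tf = term_frequencies(test_doc)
    ranked = sorted(
        training_docs,
        key=lambda pair: cosine_dissimilarity(test_tf, term_frequencies(pair[0])),
    )
    return ranked[:k]


# Toy English-Italian parallel corpus, invented for illustration only.
corpus = [
    ("the patient received a dose of the drug",
     "il paziente ha ricevuto una dose del farmaco"),
    ("the court rejected the appeal",
     "la corte ha respinto il ricorso"),
    ("the drug reduced symptoms in every patient",
     "il farmaco ha ridotto i sintomi in ogni paziente"),
]

subsample = select_training_subsample("a new drug for the patient", corpus, k=2)
for source, target in subsample:
    print(source, "|", target)
```

In this toy run the two medical sentence pairs are kept and the legal one is dropped. In a real setting the selected pairs would then be used to train the SMT system, and the subsample size `k` would need to be tuned per document.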

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:643605
Date: January 2014
Creators: Brunello, Marco
Contributors: Sharoff, Serge ; Babych, Bogdan ; Thomas, Martin
Publisher: University of Leeds
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: http://etheses.whiterose.ac.uk/8420/
