Machine translation quality has improved drastically in recent years, and as a result machine translation followed by manual post-editing is slowly but surely replacing manual translation from scratch in the language industry. This thesis studies the applicability of machine translation to product descriptions of clothing items. The focus lies on two questions: whether automatic post-editing is a viable approach for improving baseline translations when new training data becomes available, and whether an existing quality estimation system can reliably assign quality scores to machine-translated texts. Machine translation is shown to be a promising approach for the target domain: according to the human evaluation carried out, the majority of the systems tested generate translations that are on average of almost publishable quality, meaning that only light post-editing is needed before the translations can be published. Automatic post-editing is shown to improve the worst baseline translations, but it struggles to improve overall translation quality because it tends to overcorrect good translations. Nevertheless, one of the trained post-editing systems is still rated higher than the baseline by human evaluators. A new finding is that training a post-editing model on more data with worse translations leads to better performance than training on less but higher-quality data. None of the quality estimation systems tested shows a strong correlation with human evaluation results, which is why it is suggested not to provide the confidence scores of the baseline model to the human evaluators responsible for correcting and approving translations.
The main contributions of this work are threefold: showing that the target domain of product descriptions is suitable for integrating machine translation into the translation workflow; proposing a translation workflow that is more automated than the current one; and finding that, when training an automatic post-editing system, it is better to use more data with poorer translations than less data with higher-quality translations.
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-477750 |
Date | January 2022 |
Creators | Kukk, Kätriin |
Publisher | Uppsala universitet, Institutionen för lingvistik och filologi |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |