The majority of the world's languages are poorly represented in informational media such as radio, television, newspapers, and the Internet. Translation into and out of these languages may offer their speakers a way to interact with the wider world, but current statistical machine translation models are effective only with a large corpus of parallel texts (texts in two languages that are translations of one another), which most languages lack. This thesis describes the Babylon project, which attempts to alleviate this shortage by supplementing existing parallel texts with texts gathered automatically from the Web, specifically targeting pages that contain text in a pair of languages. Results indicate that parallel texts gathered from the Web can be used effectively as training data for machine translation and can significantly improve translation quality for text in a similar domain. However, the small quantity of high-quality parallel text available on the Web for low-density languages remains a significant obstacle.
Identifier | oai:union.ndltd.org:unt.edu/info:ark/67531/metadc5140 |
Date | 12 1900 |
Creators | Mohler, Michael Augustine Gaylord |
Contributors | Mihalcea, Rada, 1974-; Tarau, Paul; Chen, Jiangping |
Publisher | University of North Texas |
Source Sets | University of North Texas |
Language | English |
Detected Language | English |
Type | Thesis or Dissertation |
Format | Text |
Rights | Public, Copyright, Mohler, Michael Augustine Gaylord, Copyright is held by the author, unless otherwise noted. All rights reserved. |