The enhancement of machine translation for low-density languages using Web-gathered parallel texts.

The majority of the world's languages are poorly represented in informational media such as radio, television, newspapers, and the Internet. Translation into and out of these languages may offer their speakers a way to interact with the wider world, but current statistical machine translation models are effective only with a large corpus of parallel texts (texts in two languages that are translations of one another), which most languages lack. This thesis describes the Babylon project, which attempts to alleviate this shortage by supplementing existing parallel texts with texts gathered automatically from the Web, specifically targeting pages that contain text in a pair of languages. Results indicate that parallel texts gathered from the Web can be used effectively as training data for machine translation and can significantly improve translation quality for text in a similar domain. However, the small quantity of high-quality parallel text available on the Web for low-density languages remains a significant obstacle.

Identifier: oai:union.ndltd.org:unt.edu/info:ark/67531/metadc5140
Date: 12 1900
Creators: Mohler, Michael Augustine Gaylord
Contributors: Mihalcea, Rada, 1974-, Tarau, Paul, Chen, Jiangping
Publisher: University of North Texas
Source Sets: University of North Texas
Language: English
Detected Language: English
Type: Thesis or Dissertation
Format: Text
Rights: Public, Copyright, Mohler, Michael Augustine Gaylord, Copyright is held by the author, unless otherwise noted. All rights reserved.