In this thesis we present the idea of using parallel phrases for word alignment. Each parallel phrase is extracted from a set of manual word alignments and contains a number of source and target words and their corresponding alignments. If a parallel phrase matches a new sentence pair, its word alignments can be applied to the new sentence. There are several advantages of using phrases for word alignment. First, longer text segments include more context and will be more likely to produce correct word alignments than shorter segments or single words. More importantly, the use of longer phrases makesit possible to generalize words in the phrase by replacing words by parts-of-speech or other grammatical information. In this way, the number of words covered by the extracted phrases can go beyond the words and phrases that were present in the original set of manually aligned sentences. We present experiments with phrase-based word alignment on three types of English–Swedish parallel corpora: a software manual, a novel and proceedings of the European Parliament. In order to find a balance between improved coverage and high alignment accuracy we investigated different properties of generalised phrases to identify which types of phrases are likely to produce accurate alignments on new data. Finally, we have compared phrase-based word alignments to state-of-the-art statistical alignment with encouraging results. We show that phrase-based word alignments can be used to enhance statistical word alignment. To evaluate word alignments an English–Swedish reference set for the Europarl corpus was constructed. The guidelines for producing this reference alignment are presented in the thesis.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-15462 |
Date | January 2008 |
Creators | Holmqvist, Maria |
Publisher | Linköpings universitet, NLPLAB - Laboratoriet för databehandling av naturligt språk, Linköpings universitet, Tekniska högskolan, Linköping : LIU-tryck |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Licentiate thesis, monograph, info:eu-repo/semantics/masterThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Relation | Linköping Studies in Science and Technology. Thesis, 0280-7971 ; 1392 |
Page generated in 0.002 seconds