Automatické vytváření slovníků z paralelních korpusů / Automatic dictionary acquisition from parallel corpora

In this work, an extensible word-alignment framework is implemented from scratch. It is based on a discriminative method that combines a wide range of lexical association measures and other features, and it requires a small amount of manually word-aligned data to optimize the model parameters. The optimal alignment is found as a minimum-weight edge cover; selected suboptimal alignments are used to estimate the confidence of each alignment link. The feature combination is tuned over many experiments according to the evaluation results, which are compared with those of GIZA++. The best trained model is used to word-align a large Czech-English parallel corpus, and a bilingual lexicon is extracted from the links of highest confidence. Single-word translation equivalents are sorted by their significance, and lexicons of different sizes are extracted by taking the top N translations. The precision of the lexicons is evaluated both automatically and manually, by judging random samples.
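The lexicon-extraction step described in the abstract (keep the highest-confidence links, rank the translation pairs, take the top N) might be sketched roughly as follows. This is only an illustration, not the thesis implementation: the function name `extract_lexicon`, the `(czech, english, confidence)` tuple layout, the 0.9 threshold, and the use of raw link counts as the "significance" score are all assumptions made for the example, and the sample data is invented.

```python
from collections import Counter


def extract_lexicon(links, min_confidence=0.9, top_n=2):
    """Return the top_n translation pairs from confidence-scored links.

    links: iterable of (czech_word, english_word, confidence) tuples.
    Significance is approximated here by how often a pair occurs among
    the high-confidence links (an assumption, not the thesis's measure).
    """
    # Keep only links whose confidence reaches the (hypothetical) threshold.
    counts = Counter(
        (cs, en) for cs, en, conf in links if conf >= min_confidence
    )
    # Rank by count, descending; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [pair for pair, _ in ranked[:top_n]]


# Invented toy alignment links, for illustration only.
links = [
    ("pes", "dog", 0.97),
    ("pes", "dog", 0.95),
    ("kočka", "cat", 0.93),
    ("pes", "cat", 0.40),
    ("dům", "house", 0.99),
    ("dům", "house", 0.91),
    ("dům", "house", 0.96),
]

lexicon = extract_lexicon(links, min_confidence=0.9, top_n=2)
print(lexicon)
```

Varying `top_n` yields lexicons of different sizes from the same ranked list, which mirrors the trade-off between lexicon size and precision that the abstract describes.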

http://www.nusl.cz/ntk/nusl-300414

Identifier	oai:union.ndltd.org:nusl.cz/oai:invenio.nusl.cz:300414
Date	January 2011
Creators	Popelka, Jan
Contributors	Pecina, Pavel; Mareček, David
Source Sets	Czech ETDs
Language	Czech
Detected Language	English
Type	info:eu-repo/semantics/masterThesis
Rights	info:eu-repo/semantics/restrictedAccess
