Part-of-speech (POS) tagging is one of the most fundamental and crucial tasks in Natural Language Processing. Chinese POS tagging is challenging because it also involves word segmentation. In this report, research will be focused on how to improve unsupervised Part-of-Speech (POS) tagging using Hidden Markov Models and the Expectation Maximization parameter estimation approach (EM-HMM). The traditional EM-HMM system uses a dictionary, which is used to constrain possible tag sequences and initialize the model parameters. This is a very crude initialization: the emission
parameters are set uniformly in accordance with the tag dictionary. To improve this, word alignments can be used. Word alignments are the word-level translation correspondent pairs generated from parallel text between two languages. In this report, Chinese-English word alignment is used. The performance is expected to be better, as these two tasks are complementary to each other. The dictionary provides information on word types, while word alignment provides information on word tokens. However, it is found to be of limited benefit.
In this report, another method is proposed. To improve the dictionary coverage and get better POS distribution, Modified Adsorption, a label propagation algorithm is used. We construct a graph connecting word tokens to feature types (such as word unigrams and bigrams) and connecting those tokens to information from knowledge sources, such as a small tag dictionary, Wiktionary, and word alignments. The core idea is to use a small amount of supervision, in the form of a tag dictionary and acquire POS distributions for each word (both known and unknown) and provide this as an improved initialization for EM learning for HMM. We find this strategy to work very well, especially when we have a small tag dictionary. Label propagation provides a better initialization for the EM-HMM method, because it greatly increases the coverage of the dictionary. In addition, label propagation is quite flexible to incorporate many kinds of knowledge. However, results also show that some resources, such as the word alignments, are not easily exploited with label propagation. / text
Identifer | oai:union.ndltd.org:UTEXAS/oai:repositories.lib.utexas.edu:2152/ETD-UT-2011-05-3193 |
Date | 02 February 2012 |
Creators | Ding, Weiwei, 1985- |
Source Sets | University of Texas |
Language | English |
Detected Language | English |
Type | thesis |
Format | application/pdf |
Page generated in 0.0019 seconds