
Applying particle filtering to unsupervised part-of-speech induction

Statistical Natural Language Processing (NLP) lies at the intersection of Computational Linguistics and Machine Learning. As linguistic models incorporate more subtle nuances of language and its structure, standard inference techniques can fall behind. One such case is the unsupervised induction of part-of-speech tags, research on which has the potential to improve both our understanding of the plausibility of theories of first language acquisition and NLP applications such as Speech Recognition and Machine Translation. Sequential Monte Carlo (SMC) approaches, i.e. particle filters, are well suited to approximate inference in such models. This thesis seeks to determine whether one application of SMC methods, particle Gibbs sampling, is capable of performing inference in otherwise intractable NLP applications. Specifically, this research analyses the benefits and drawbacks of relying on particle Gibbs to perform unsupervised part-of-speech induction without the flawed one-tag-per-type assumption of similar approaches. Additionally, this thesis explores the effects of type-based supervision with tag dictionaries extracted from annotated corpora or from Wiktionary. The semi-supervised tag dictionary improves the performance of the local Gibbs PYP-HMM sampler enough to nearly match the performance of the particle Gibbs type-sampler. Finally, this thesis extends the Pitman-Yor HMM tagger of Blunsom and Cohn (2011) to include an explicit model of the lexicon which encodes those tags from which a word type may be generated. This has the effect of both biasing the model to produce fewer tags per type and modelling the tendency for open-class words to be ambiguous between only a subset of the available tags. Furthermore, I extend the type-based particle Gibbs inference algorithm to resample the ambiguity class together with the tags for all of the tokens of a given word type.
The result is a principled probabilistic model of part-of-speech induction that achieves state-of-the-art performance. Overall, the experiments and contributions of this thesis demonstrate the applicability of the particle Gibbs sampler and particle methods in general to otherwise intractable problems in NLP.

Identifier: oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:655048
Date: January 2014
Creators: Dubbin, Gregory
Contributors: Blunsom, Phil; Pulman, Stephen
Publisher: University of Oxford
Source Sets: Ethos UK
Detected Language: English
Type: Electronic Thesis or Dissertation
Source: http://ora.ox.ac.uk/objects/uuid:48caedb6-478f-4bb0-8ca7-975ee7fe5e38
