• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 1
  • Tagged with
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Generalized Probabilistic Topic and Syntax Models for Natural Language Processing

Darling, William Michael 14 September 2012 (has links)
This thesis proposes a generalized probabilistic approach to modelling document collections along the combined axes of both semantics and syntax. Probabilistic topic (or semantic) models view documents as random mixtures of unobserved latent topics which are themselves represented as probabilistic distributions over words. They have grown immensely in popularity since the introduction of the original topic model, Latent Dirichlet Allocation (LDA), in 2004, and have seen successes in computational linguistics, bioinformatics, political science, and many other fields. Furthermore, the modular nature of topic models allows them to be extended and adapted to specific tasks with relative ease. Despite the recorded successes, however, there remains a gap in combining axes of information from different sources and in developing models that are as useful as possible for specific applications, particularly in Natural Language Processing (NLP). The main contributions of this thesis are two-fold. First, we present generalized probabilistic models (both parametric and nonparametric) that are semantically and syntactically coherent and contain many simpler probabilistic models as special cases. Our models are consistent along both axes of word information in that an LDA-like component sorts words that are semantically related into distinct topics and a Hidden Markov Model (HMM)-like component determines the syntactic parts-of-speech of words so that we can group words that are both semantically and syntactically affiliated in an unsupervised manner, leading to such groups as verbs about health care and nouns about sports. Second, we apply our generalized probabilistic models to two NLP tasks. Specifically, we present new approaches to automatic text summarization and unsupervised part-of-speech (POS) tagging using our models and report results commensurate with the state-of-the-art in these two sub-fields. Our successes demonstrate the general applicability of our modelling techniques to important areas in computational linguistics and NLP.

Page generated in 0.0757 seconds