
Comparison of Description Length for Text Corpus

Huang, Chung-Hsiang 24 May 2012
In this thesis, we compare the description lengths of different grammars and extend our research on automatic grammar learning to grammars produced by the Stanford parser. In our earlier research, we showed how to minimize the description length of a grammar generated from the Academia Sinica Balanced Corpus (ASBC). Based on the concept of data compression, the encoding method in that research is effective in reducing the description length of a text corpus. We also discussed the description length of two special cases of context-free grammars: exhaustive and recursive. An exhaustive grammar derives every distinct sentence in the corpus, whereas a recursive grammar covers all strings. In this thesis, we use a parsing tool called the Stanford parser to parse sentences and generate grammar rules. We also compare the description length of the grammar produced by machine parsing with that of the manually corrected grammar. In one experiment, we use the Stanford parser to parse the ASBC corpus; the resulting description length is 53.0Mb, of which the description length of the rules is only 52,683. In the other experiment, we use the Stanford parser to parse the Sinica Treebank and compare the description length of the generated grammar with that of the original. The results show that the description length of the original Sinica Treebank grammar is 2.76Mb, while that of the grammar generated by the Stanford parser is 4.02Mb.
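
The contrast between the exhaustive and recursive grammars can be illustrated with a toy two-part description length (grammar bits plus data bits). The sketch below is not the encoding method used in the thesis, which the abstract does not specify. It assumes uniform fixed-length codes, one rule per distinct sentence for the exhaustive grammar, and the rules `S -> w S | w` for every word `w` in the recursive grammar. The function names `bits_for`, `exhaustive_dl`, and `recursive_dl` are hypothetical.

```python
import math


def bits_for(n_choices):
    """Bits for a uniform fixed-length code over n_choices options (assumed coding)."""
    return math.ceil(math.log2(n_choices)) if n_choices > 1 else 0


def exhaustive_dl(corpus):
    """Toy description length when each distinct sentence gets its own rule.

    corpus is a list of sentences, each a tuple of tokens.
    Returns (grammar_bits, data_bits).
    """
    vocab = {w for sentence in corpus for w in sentence}
    distinct = set(corpus)
    symbol_bits = bits_for(len(vocab) + 1)  # words plus an end-of-rule marker
    # Each rule spells out its sentence, followed by the end marker.
    grammar_bits = sum((len(s) + 1) * symbol_bits for s in distinct)
    # Each corpus sentence is then just the index of its rule.
    data_bits = len(corpus) * bits_for(len(distinct))
    return grammar_bits, data_bits


def recursive_dl(corpus):
    """Toy description length for S -> w S | w over every word w.

    This grammar covers all non-empty strings over the vocabulary.
    Returns (grammar_bits, data_bits).
    """
    vocab = {w for sentence in corpus for w in sentence}
    symbol_bits = bits_for(len(vocab) + 2)  # words, the symbol S, end marker
    # Per word: "w S <end>" is 3 symbols and "w <end>" is 2 symbols.
    grammar_bits = len(vocab) * (3 + 2) * symbol_bits
    # Every token requires choosing one of the 2 * |vocab| rules.
    rule_bits = bits_for(2 * len(vocab))
    data_bits = sum(len(s) for s in corpus) * rule_bits
    return grammar_bits, data_bits


if __name__ == "__main__":
    toy_corpus = [("a", "b"), ("a", "b"), ("b", "a", "a")]
    print("exhaustive:", exhaustive_dl(toy_corpus))
    print("recursive: ", recursive_dl(toy_corpus))
```

Under these assumptions the exhaustive grammar puts most of its cost into the grammar and little into the data, while the recursive grammar has a small grammar but pays for every token in the data. That is the trade-off a description-length comparison is meant to expose.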
