Comparison of Description Length for Text Corpus

In this thesis, we compare the description lengths of different grammars and extend our research on automatic grammar learning to the grammars produced by the Stanford parser. In our previous research, we showed how to minimize the description length of a grammar generated from the Academia Sinica Balanced Corpus (ASBC). Based on the concept of data compression, the encoding method in that research is effective in reducing the description length of a text corpus. We also discussed the description length of two special cases of context-free grammars: exhaustive and recursive. The exhaustive grammar derives every distinct sentence in the corpus, and the recursive grammar covers all strings. In this thesis, we use a parsing tool, the Stanford parser, to parse sentences and generate grammar rules. We also compare the description length of the machine-parsed grammar with that of the manually corrected grammar. In one experiment, we use the Stanford parser to parse the ASBC, and the resulting description length is 53.0 Mb; the description length of the rules alone is only 52,683. In the other experiment, we use the Stanford parser to parse the Sinica Treebank and compare the description length of the generated grammar with that of the original grammar. The results show that the description length of the Sinica Treebank grammar is 2.76 Mb, while that of the grammar generated by the Stanford parser is 4.02 Mb.
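The contrast between the exhaustive and recursive grammars can be illustrated with a toy two-part description length (bits for the grammar plus bits for the corpus given the grammar). The sketch below is not the encoding used in the thesis, which the abstract does not specify; the uniform symbol code, the end-of-rule marker, the one-bit continue/stop choice, and the function names `exhaustive_dl` and `recursive_dl` are all assumptions made for illustration.

```python
import math


def _uniform_bits(alphabet_size):
    """Bits per symbol under an assumed uniform code over `alphabet_size` symbols."""
    return math.log2(alphabet_size)


def exhaustive_dl(corpus):
    """Toy description length for a grammar with one rule S -> sentence
    per distinct sentence. Returns (grammar_bits, data_bits)."""
    vocab = {tok for sent in corpus for tok in sent}
    distinct = {tuple(sent) for sent in corpus}
    # Assumed alphabet for writing rules: the vocabulary plus an end-of-rule marker.
    per_symbol = _uniform_bits(len(vocab) + 1)
    grammar_bits = sum((len(rule) + 1) * per_symbol for rule in distinct)
    # Each corpus sentence is encoded as a uniform index into the rule list.
    data_bits = len(corpus) * _uniform_bits(len(distinct))
    return grammar_bits, data_bits


def recursive_dl(corpus):
    """Toy description length for the grammar S -> w S | w for every word w,
    which covers all non-empty strings. Returns (grammar_bits, data_bits)."""
    vocab = {tok for sent in corpus for tok in sent}
    # Assumed alphabet for writing rules: vocabulary, the symbol S, and an end marker.
    per_symbol = _uniform_bits(len(vocab) + 2)
    # Per word: "w S <end>" (3 symbols) and "w <end>" (2 symbols).
    grammar_bits = len(vocab) * 5 * per_symbol
    # Per token: a uniform word choice plus one bit for continue/stop.
    per_token = _uniform_bits(len(vocab)) + 1
    data_bits = sum(len(sent) for sent in corpus) * per_token
    return grammar_bits, data_bits


if __name__ == "__main__":
    toy_corpus = [["a", "b"], ["a", "b"], ["b", "a"]]  # hypothetical data
    for name, fn in (("exhaustive", exhaustive_dl), ("recursive", recursive_dl)):
        g_bits, d_bits = fn(toy_corpus)
        print(f"{name}: grammar={g_bits:.2f} bits, data={d_bits:.2f} bits")
```

Under these assumptions, the exhaustive grammar spends its bits on the rules and encodes each sentence cheaply, whereas the recursive grammar has a small rule set but pays for every token in the corpus.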

Identifier: oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0524112-180320
Date: 24 May 2012
Creators: Huang, Chung-Hsiang
Contributors: Chung-Hsien Wu, Chia-Ping Chen, Hsin-Min Wang, Liang-Chih Yu
Publisher: NSYSU
Source Sets: NSYSU Electronic Thesis and Dissertation Archive
Language: Cholon
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0524112-180320
Rights: user_define, Copyright information available at source archive
