Global ETD Search

Return to search

Comparison of Description Length for Text Corpus

In this thesis, we compare the description length of different grammars, and extend the research of automatic grammar learning to the grammar production of Stanford parser. In our research before, we have introduced that how to minimize the description length of the grammar which is generated from the Academia Sinica Balanced Corpus. Based on the concept of data compression, the encoding method in our research is effective in reducing the description length of a text corpus. Moreover, we further discussed about the description length of two special cases of context-free grammars: exhaustive and recursive. The exhaustive grammar is that for every distinct sentence in the corpus is derived, and the recursive one covers all strings. In our research of this thesis, we use a parsing tool called "Stanford parser" to parse sentences and generate grammar rules. We also compare the description length of the grammar parsed by machine with the grammar fixed by artificial. In one of the experiments, we use Stanford parser to parse ASBC corpus, and the description length is 53.0Mb. The description length of rule is only 52,683. In the other experiment, we use Stanford parser to parse Sinica Treebank and compare the description length of the generated grammar with the origin. The result shows that the description length of grammar of the Sinica Treebank is 2.76Mb, and the grammar generated by Stanford parser is 4.02Mb.

http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0524112-180320

Identifer	oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0524112-180320
Date	24 May 2012
Creators	Huang, Chung-Hsiang
Contributors	Chung-Hsien Wu, Chia-Ping Chen, Hsin-Min Wang, Liang-Chih Yu
Publisher	NSYSU
Source Sets	NSYSU Electronic Thesis and Dissertation Archive
Language	Cholon
Detected Language	English
Type	text
Format	application/pdf
Source	http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0524112-180320
Rights	user_define, Copyright information available at source archive

Page generated in 0.0025 seconds

Comparison of Description Length for Text Corpus

Description

Links & Downloads

Tags

Additional Fields