Global ETD Search

A partial syntactic analysis-based pre-processor for automatic indexing and retrieval of Chinese texts

Automatic indexing is the automatic creation of a text surrogate, normally keywords or phrases, to represent the original text. In current English text retrieval systems, this process of content representation is accomplished by extracting words using spaces and punctuation as word delimiters. The same technique cannot easily be applied to Chinese texts, which contain no obvious word boundaries; they appear as a linear sequence of non-spaced or equally spaced ideographic characters, and the number of characters in a word varies. The solution to the problem lies in morphological and syntactic analyses of Chinese morphemes, words and phrases. The idea is inspired by experiments on English computational morphology and its application to English text retrieval, mainly automatic compound and phrase indexing. These areas are particularly germane to Chinese because typographically there are no morph and phrase boundaries in either Chinese or English texts. The experiment is based on the hypothesis that words and phrases exceeding two Chinese characters can be characterised by a grammar that describes the concatenation behaviour of morphological and syntactic categories. This is examined using the following three procedures: (1) text segmentation - texts are divided into one- and two-character segments by searching a dictionary containing over 17,000 morphemes and words, which are tagged with morphological and syntactic categories; (2) category disambiguation - for the resulting morphemes and words tagged with more than one category, the correct one is selected based on context; (3) parsing - the segments are analysed using the grammar, which combines them into compound and complex words and phrases for indexing and retrieval. The utilities employed in the experiment include CCDOS, an extended version of MS-DOS providing a Chinese I/O system, Chinese WordStar for text input, and Chinese dBASE III for dictionary construction. The source code is written in Turbo BASIC, including its database toolbox. Thirty texts are drawn randomly from newspapers to form the sample for the experiment. The results show that the partial syntactic analysis-based approach can extract keywords with a good degree of accuracy.
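
The three procedures named in the abstract (segmentation, category disambiguation, parsing) can be pictured with a small sketch. This is an illustration only, not the thesis's implementation: it uses Python rather than Turbo BASIC, and the lexicon entries, the category labels (N, V, ADJ, PART), the greedy two-character lookup, the context rule and the run-merging "grammar" are all assumptions made up for the example.

```python
# Hypothetical toy lexicon: one- or two-character entries, each tagged with
# one or more categories. The real dictionary held over 17,000 entries.
LEXICON = {
    "中文": ["N"],
    "文本": ["N"],
    "的": ["PART"],
    "自动": ["ADJ"],
    "索引": ["V", "N"],  # ambiguous entry, resolved by context below
}


def segment(text, lexicon):
    """Split text into one- and two-character segments by dictionary lookup.

    Assumed strategy: try the two-character entry first, else emit one character.
    """
    segments = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in lexicon:
            segments.append(pair)
            i += 2
        else:
            segments.append(text[i])
            i += 1
    return segments


def disambiguate(segments, lexicon):
    """Choose one category per segment.

    Assumed toy rule: after a noun or adjective, prefer the noun reading;
    otherwise take the first category listed in the lexicon.
    """
    tagged = []
    for seg in segments:
        cats = lexicon.get(seg, ["UNK"])
        prev = tagged[-1][1] if tagged else None
        if len(cats) > 1 and prev in ("N", "ADJ") and "N" in cats:
            cat = "N"
        else:
            cat = cats[0]
        tagged.append((seg, cat))
    return tagged


def parse(tagged):
    """Merge runs of adjacent ADJ/N segments into candidate index phrases.

    Assumed stand-in for the grammar: only runs of two or more segments
    are kept, i.e. units longer than a single dictionary entry.
    """
    phrases, run = [], []
    for seg, cat in tagged + [("", "END")]:  # sentinel flushes the last run
        if cat in ("N", "ADJ"):
            run.append(seg)
        else:
            if len(run) > 1:
                phrases.append("".join(run))
            run = []
    return phrases


text = "中文文本的自动索引"
segments = segment(text, LEXICON)
tagged = disambiguate(segments, LEXICON)
index_terms = parse(tagged)
print(segments)
print(tagged)
print(index_terms)
```

In this sketch the particle 的 ends one run and starts another, so the two candidate index phrases are built from the segments on either side of it; a real grammar over morphological and syntactic categories would be considerably richer.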

http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.587865

005.7

Identifier	oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:587865
Date	January 1992
Creators	Wu, Zimin
Publisher	Loughborough University
Source Sets	Ethos UK
Detected Language	English
Type	Electronic Thesis or Dissertation
Source	https://dspace.lboro.ac.uk/2134/13685
