An Enhanced Conditional Random Field Model for Chinese Word Segmentation

In Chinese, the smallest meaningful unit is the word, which is composed of a sequence of
characters. A Chinese sentence is a sequence of words with no separators between them. In
information retrieval and data mining, a sequence of Chinese characters must therefore be
segmented before the resulting segments can be used. This process is called Chinese word
segmentation. Chinese word segmentation has been studied for many years. Although some
recent work has achieved very high performance, the recall on words that are not in the
dictionary reaches only sixty to seventy percent. The approach described in this paper uses
linear-chain conditional random fields (CRFs) to segment Chinese text more accurately. Our
study uses a discriminatively trained model with two of our proposed feature templates to
decide the boundaries between characters. We also propose three further methods, namely
duplicate word repartition, date representation repartition, and segment refinement, to
improve the accuracy of the resulting segments. In the experiments, we test several
approaches and compare the results with those of Li et al. and Lau and King on three
Chinese word corpora. The results show that the improved feature template, which uses
prefix and postfix information, increases both recall and precision. For example, the
F-measure reaches 0.964 on the MSR dataset. By detecting repeated characters, duplicated
characters can be repartitioned better without extra resources. Wrongly segmented dates can
be repartitioned better by the proposed method, which handles numbers, dates, and measure
words. If a word is segmented differently from the corresponding standard segmentation
corpus, a proper segment can be produced by repartitioning the assembled segment, which is
composed of the current segment and the adjacent segment. In summary, for CRF-based Chinese
word segmentation, we propose a feature template that gives better results and three
methods that address other specific segmentation problems.
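
The recall, precision, and F-measure figures quoted in the abstract are commonly computed by comparing word spans in a predicted segmentation against a reference segmentation. The following is a minimal sketch of that calculation, not the thesis's own evaluation script; the function names `word_spans` and `segmentation_scores` and the example sentence are assumptions made for illustration.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character offsets they cover."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_scores(gold_words, predicted_words):
    """Return (precision, recall, F-measure) of a predicted segmentation.

    A predicted word counts as correct only if both of its boundaries
    match a word in the reference (gold) segmentation.
    """
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the date "2010年" is wrongly split into two segments.
gold = ["我们", "在", "2010年", "毕业"]
predicted = ["我们", "在", "2010", "年", "毕业"]
precision, recall, f_measure = segmentation_scores(gold, predicted)
print(precision, recall, round(f_measure, 3))
```

In this example three of the five predicted words match the reference, so the wrongly split date lowers both precision and recall, which is the kind of error the date representation repartition is meant to reduce.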

Identifier: oai:union.ndltd.org:NSYSU/oai:NSYSU:etd-0203110-093833
Date: 03 February 2010
Creators: Huang, Jhao-ming
Contributors: Chia-ping Chen, Tsung-nan Li, Chia-hua Lin
Publisher: NSYSU
Source Sets: NSYSU Electronic Thesis and Dissertation Archive
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.lib.nsysu.edu.tw/ETD-db/ETD-search/view_etd?URN=etd-0203110-093833
Rights: off_campus_withheld, Copyright information available at source archive