
An Enhanced Conditional Random Field Model for Chinese Word Segmentation

Huang, Jhao-ming 03 February 2010
In Chinese, the smallest meaningful unit is the word, which is composed of a sequence of characters. A Chinese sentence is a sequence of words with no separators between them. In information retrieval and data mining, a sequence of Chinese characters must therefore be segmented into words before it can be used. This process is called Chinese word segmentation. Chinese word segmentation has been studied for many years. Although some recent work has achieved very high performance, the recall on words that are not in the dictionary reaches only sixty to seventy percent. The approach described in this paper uses linear-chain conditional random fields (CRFs) to segment Chinese words more accurately. Our study uses a discriminatively trained model with two proposed feature templates to decide the boundaries between characters. We also propose three other methods, namely duplicate word repartition, date representation repartition, and segment refinement, to improve the accuracy of the resulting segments. In the experiments, we test several different approaches on three Chinese word corpora and compare the results with those of Li et al. and Lau and King. The results show that the improved feature template, which uses prefix and postfix information, can increase both recall and precision. For example, the F-measure reaches 0.964 on the MSR dataset. By detecting repeated characters, duplicated characters can also be repartitioned more accurately without extra resources. Wrongly segmented dates can be repartitioned more accurately by the proposed method, which handles numbers, dates, and measure words. If a word is segmented differently from the corresponding standard segmentation corpus, a proper segment can be produced by repartitioning the assembled segment formed from the current segment and the adjacent segment. In summary, for Chinese word segmentation with conditional random fields, we have proposed a feature template that gives better results and three methods that address other specific segmentation problems.
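
To make the ideas of boundary decisions between characters and of precision, recall, and F-measure concrete, here is a minimal Python sketch. It is not the thesis's implementation: the abstract does not state its tag set, feature templates, or scoring tool, so the two-tag scheme (`B` for a word's first character, `I` for the rest), the span-based scoring, and all function names below are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """Label each character: 'B' starts a word, 'I' continues it (assumed tag set)."""
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty strings so every tag maps to one character
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild a segmentation from characters and their boundary tags."""
    words: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # open a new word
        else:
            words[-1] += ch    # extend the current word
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent each word as a (start, end) character offset pair."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Score a predicted segmentation against a reference one.

    A predicted word counts as correct only if both of its boundaries
    match a word in the reference segmentation.
    """
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example sentence, not taken from the thesis corpora.
gold = ["我们", "喜欢", "北京"]
pred = ["我们", "喜", "欢", "北京"]
p, r, f = precision_recall_f(gold, pred)
```

In this hypothetical example the prediction splits one reference word in two, so two of the four predicted words are correct (precision 0.5) and two of the three reference words are recovered (recall about 0.667), giving an F-measure of about 0.571. A CRF tagger would predict the `B`/`I` sequence from character features; the conversion functions show how such a tag sequence maps back to words.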
