Chinese differs from alphabetic languages such as English in that written Chinese has no delimiters between words. Word segmentation is therefore an important step in most Chinese natural language processing (NLP) tasks.
We propose a tightness continuum for Chinese semantic units. The continuum is constructed from statistical information. Based on this continuum, sequences can be segmented dynamically, and the resulting information can then be exploited in a number of information retrieval (IR) tasks.
To show that our tightness continuum is useful for NLP tasks, we propose two methods that exploit it within IR systems. The first method refines the output of a general Chinese word segmenter. The second method embeds the tightness value into IR score functions. Experimental results show that our tightness measure is reasonable and improves the performance of IR systems.
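The idea of segmenting dynamically along a tightness continuum can be illustrated with a minimal sketch. This is not the measure defined in the thesis, which the abstract does not specify. As a stand-in, it assumes tightness is the fraction of a unit's occurrences that a segmenter kept as a single token. The function names, the threshold, and the counts are all hypothetical and invented for illustration.

```python
from typing import Dict, List, Tuple


def tightness(whole_count: int, split_count: int) -> float:
    """Stand-in tightness score (an assumption, not the thesis's definition):
    the share of occurrences in which the unit was kept as one token."""
    total = whole_count + split_count
    return whole_count / total if total else 0.0


def segment(unit: str, parts: List[str], score: float,
            threshold: float = 0.5) -> List[str]:
    """Keep the unit whole if it is tight enough, otherwise split it."""
    return [unit] if score >= threshold else list(parts)


# Invented toy counts: unit -> (parts, times kept whole, times split).
toy_counts: Dict[str, Tuple[List[str], int, int]] = {
    "机器学习": (["机器", "学习"], 80, 20),
    "学习方法": (["学习", "方法"], 10, 90),
}

# Segment each unit according to its position on the continuum.
result = {
    unit: segment(unit, parts, tightness(whole, split))
    for unit, (parts, whole, split) in toy_counts.items()
}
```

Because the score is continuous, the threshold can be tuned per task. An IR system could, for example, raise it to favour recall on loosely bound units, or use the raw score directly as a weight instead of making a hard decision.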
Identifier | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:AEU.10048/1123 |
Date | 06 1900 |
Creators | Xu, Ying |
Contributors | Goebel, Randy (Computing Science), Ringlstetter, Christoph (Center of Language and Information Processing, University of Munich), Kondrak, Greg (Computing Science), Zhao, Dangzhi (School of Library and Information Science) |
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada |
Language | English |
Detected Language | English |
Type | Thesis |
Format | 1004486 bytes, application/pdf |
Relation | Ying Xu, Christoph Ringlstetter and Randy Goebel. A Continuum-based Approach for Tightness Analysis of Chinese Semantic Units. PACLIC 23. 2009 |