1 |
中文動詞自動分類研究 / Automatic Classification of Chinese Unknown Verbs
曾慧馨 (Tseng, Hui-Hsin), Unknown Date
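This thesis proposes a rule-based method and a similarity-based (instance-based) method for automatically classifying Chinese unknown verbs into the verb classification tags of the Academia Sinica CKIP group (中研院詞庫小組, 1993). The rules are learned from a training corpus and supplemented with the reduplication patterns of unknown verbs; they cover about 25% of unknown verbs, with an accuracy of about 86.86%~91.32%. The strength of the rule-based method is its high accuracy; its weakness is that it can handle too few unknown verbs. The similarity-based method guesses the likely class of an unknown verb from examples similar to it, computing similarity from word-internal information: the part of speech of the word's bases, their semantic classes, and the word's structure. It can handle all unknown verbs, but it is easily misled by mistagged examples in the training corpus and is sensitive to the size of that corpus. Its accuracy is 67.86%~70.92%, and the classifier that combines the two methods predicts unknown-verb classes with an accuracy of about 72%.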
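The similarity-based step described in this abstract can be sketched roughly as a nearest-neighbor vote over word-internal features. The abstract does not give the similarity formula, so everything below is an assumption made for illustration: the feature names (`base_pos`, `sem_class`, `structure`), the weights, the value of `k`, and the toy training instances are all invented, and the class labels are used only as placeholders.

```python
from collections import Counter

# Toy training instances: (word-internal features, verb class).
# All feature values and labels here are invented for illustration.
TRAIN = [
    ({"base_pos": "V", "sem_class": "motion", "structure": "VV"}, "VA"),
    ({"base_pos": "V", "sem_class": "motion", "structure": "VR"}, "VA"),
    ({"base_pos": "V", "sem_class": "action", "structure": "VO"}, "VC"),
    ({"base_pos": "N", "sem_class": "state", "structure": "AV"}, "VH"),
]

# Hypothetical feature weights; the thesis may weight features differently.
WEIGHTS = {"base_pos": 1.0, "sem_class": 2.0, "structure": 1.0}


def similarity(a, b):
    """Sum the weights of the features on which two instances agree."""
    return sum(w for f, w in WEIGHTS.items() if a.get(f) == b.get(f))


def classify(features, train=TRAIN, k=3):
    """Guess a class by a similarity-weighted vote among the k nearest examples."""
    ranked = sorted(train, key=lambda inst: similarity(features, inst[0]),
                    reverse=True)
    votes = Counter()
    for feats, label in ranked[:k]:
        votes[label] += similarity(features, feats)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    unknown = {"base_pos": "V", "sem_class": "motion", "structure": "VO"}
    print(classify(unknown))
```

A vote weighted by similarity also makes the weakness named in the abstract visible: a single mistagged training example that closely resembles the unknown verb can dominate the vote, and a small training set leaves few close neighbors to outvote it.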
|
2 |
應用平行語料建構中文斷詞組件 / Applications of Parallel Corpora for Chinese Segmentation
王瑞平 (Wang, Jui Ping), Unknown Date
In this thesis, we build a Chinese word segmentation system based on Chinese-English parallel corpora and use it to segment corpora from different domains. Given Chinese-English parallel corpora from a domain, the system automatically produces training data of good quality, saving the time and labor that manual segmentation would require. A segmentation model can then be trained on the produced data.
To produce the training data, the system first generates the possible segmentations of every Chinese sentence in the parallel corpus by looking words up in a Chinese dictionary. It then uses the Chinese translations of the words in the parallel English sentence to resolve overlapping ambiguities and discard incorrect segmentations. We also extract new Chinese-English translation pairs and unknown words from the parallel corpus and add them to the English-Chinese dictionary module and the Chinese dictionary module, respectively, to improve segmentation performance.
We evaluate segmentation performance in two experiments, using test data from three different domains. The first experiment uses manually segmented Chinese sentences as test data. The second evaluates segmentation indirectly, through the quality of Chinese-to-English translation. The experimental results show that our system achieves acceptable segmentation performance.
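The training-data step described in this abstract (enumerate dictionary-based segmentations, then use translation information to discard wrong ones) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the tiny `LEXICON` and `TRANSLATIONS` tables, the example sentence, and the scoring rule (count the words that have a translation present in the English sentence) are all hypothetical.

```python
# Hypothetical Chinese dictionary and English-Chinese translation table.
LEXICON = {"研究", "研究生", "生命", "命", "起源"}
TRANSLATIONS = {
    "研究": {"study", "research"},
    "研究生": {"graduate"},
    "生命": {"life"},
    "命": {"fate"},
    "起源": {"origin"},
}


def segmentations(sentence, lexicon):
    """Enumerate every way to split the sentence into dictionary words."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if word in lexicon:
            for rest in segmentations(sentence[end:], lexicon):
                results.append([word] + rest)
    return results


def score(segmentation, english_tokens):
    """Count the words with at least one translation in the English sentence."""
    return sum(
        1 for word in segmentation
        if TRANSLATIONS.get(word, set()) & english_tokens
    )


def best_segmentation(chinese, english):
    """Keep the candidate best supported by the parallel English sentence."""
    tokens = set(english.lower().split())
    candidates = segmentations(chinese, LEXICON)
    return max(candidates, key=lambda seg: score(seg, tokens))


if __name__ == "__main__":
    print(best_segmentation("研究生命起源", "study the origin of life"))
```

In this toy case the dictionary alone allows both 研究/生命/起源 and 研究生/命/起源, which is an overlapping ambiguity; the English side supports three words of the first candidate and only one of the second, so the first is kept.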
|