Traditional Chinese and Simplified Chinese differ not only in their glyphs and character encodings but also in the usage of some vocabulary. Words that are used differently on the two sides yet carry the same meaning are called synonyms (equivalent terms) between Traditional and Simplified Chinese. These synonyms can become obstacles in cultural exchange: in conversation, in the translation of documents and books, or in the conversion of software systems, they easily lead to misunderstandings of meaning. At present the differing vocabulary is handled manually, which is time-consuming, laborious, and prone to omissions. If a computer could identify Traditional-Simplified synonyms automatically, the identified synonyms could be offered as hints, resolving the misunderstandings caused by these differences in meaning.
Following the experimental design, we first build Traditional and Simplified Chinese corpora in two categories, a computer category and a general category, as the basis for identification. We then establish the research framework, which comprises two stages and three methods. The first stage applies the first method, N-gram, to identify synonyms and evaluates whether this single method can identify them effectively. The second stage applies the second method, PMI-IR & LC-IR, and the third method, Context Vector, and evaluates whether they improve synonym identification.
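As a rough illustration of the first stage, the sketch below extracts character n-grams from a Traditional and a Simplified sentence and lists the n-grams that appear on only one side after character-level normalization. It is a minimal sketch under stated assumptions, not the thesis's implementation: the mapping `T2S`, the helper names, and the one-sentence corpora are all hypothetical.

```python
from collections import Counter

# Hypothetical toy Traditional -> Simplified character map; a real system
# would need a complete conversion table.
T2S = {"裝": "装", "軟": "软", "體": "体"}


def to_simplified_chars(text):
    """Map each character to its Simplified form, leaving unknown ones as-is."""
    return "".join(T2S.get(ch, ch) for ch in text)


def char_ngrams(text, n):
    """Return every contiguous character n-gram in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(sentences, n):
    """Count character n-grams over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_ngrams(sentence, n))
    return counts


def vocabulary_gaps(trad_sentences, simp_sentences, n=2):
    """Return n-grams found on only one side after character normalization."""
    trad = ngram_counts([to_simplified_chars(s) for s in trad_sentences], n)
    simp = ngram_counts(simp_sentences, n)
    trad_only = sorted(g for g in trad if g not in simp)
    simp_only = sorted(g for g in simp if g not in trad)
    return trad_only, simp_only


# "Software" is 軟體 in Traditional usage and 软件 in Simplified usage.
trad_only, simp_only = vocabulary_gaps(["安裝軟體"], ["安装软件"])
print(trad_only, simp_only)
```

The n-grams left on each side are only candidates; pairing them into synonyms would still need a scoring step such as the second-stage methods named above.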
In line with the research purpose of having a computer automatically identify Traditional-Simplified Chinese synonyms in corpora, this study proposes a new identification framework: N-gram produces an initial set of synonyms, and the PMI-IR & LC-IR and Context Vector methods then raise Precision by amounts ranging from about 0% to 20%. The study concludes that, working with corpora in different languages, N-gram can identify synonyms, and that combining it with PMI-IR & LC-IR and Context Vector strengthens this ability, addressing the limited identification ability of a single method.
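To make the idea of a second-stage filter raising Precision concrete, the sketch below re-scores candidate pairs by the cosine similarity of simple character context vectors and compares Precision before and after filtering. Everything here is an assumption for illustration: the window size, the 0.5 threshold, the toy sentences (taken to be already normalized to Simplified characters), and the gold set are hypothetical, and the resulting numbers are not results from the thesis.

```python
import math
from collections import Counter


def context_vector(term, sentences, window=2):
    """Count characters within `window` positions of each occurrence of `term`."""
    vec = Counter()
    for s in sentences:
        start = s.find(term)
        while start != -1:
            end = start + len(term)
            vec.update(s[max(0, start - window):start] + s[end:end + window])
            start = s.find(term, start + 1)
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def precision(predicted, gold):
    """Fraction of predicted pairs that are in the gold set."""
    predicted = set(predicted)
    return len(predicted & gold) / len(predicted) if predicted else 0.0


# Toy corpora, assumed already normalized to Simplified characters.
trad_side = ["安装软体后重启", "下载软体并安装"]
simp_side = ["安装软件后重启", "下载软件并安装"]

candidates = [("软体", "软件"), ("软体", "重启")]  # hypothetical first-stage output
gold = {("软体", "软件")}                          # hypothetical answer key

THRESHOLD = 0.5  # assumed cut-off, chosen only for this example
kept = [
    (t, s)
    for t, s in candidates
    if cosine(context_vector(t, trad_side), context_vector(s, simp_side)) >= THRESHOLD
]

print(precision(candidates, gold), precision(kept, gold))
```

In this toy case the filter drops the pair whose contexts differ and keeps the pair whose contexts match, so Precision rises; on real corpora the effect would depend on the threshold and on how much context each term has.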
Identifier | oai:union.ndltd.org:CHENGCHI/G0094971010 |
Creators | 黃群弼 |
Publisher | 國立政治大學 |
Source Sets | National Chengchi University Libraries |
Language | Chinese |
Detected Language | English |
Type | text |
Rights | Copyright © nccu library on behalf of the copyright holders |