With the rapid growth of the World Wide Web, information extraction has become an important technology. The goal of information extraction is to turn unstructured text into structured data on a specific topic. This involves analyzing document content, then filtering and extracting the relevant text together with its meaning. Most information extraction systems to date have focused on English text, and research on Chinese information extraction is only now gathering pace. Given that at least one-fifth of the world's population speaks Chinese, research on Chinese information extraction is increasingly important.
Chinese differs from English in many respects. In English, words are separated by spaces, so a computer can easily identify each word in an input string. In Chinese, there are no explicit boundaries between words. The usual solution is to match the characters of the input string against the entries of a dictionary to find word boundaries. However, characters combine into words in highly variable and ambiguous ways, so segmentation errors remain likely. In this thesis, we therefore propose an approach to Chinese information extraction that combines pattern matching with finite state automata and relies on neither word segmentation nor part-of-speech tagging. We evaluated the approach using "government personnel directives in official gazettes" (『總政府人事任免公報』) as test data, where it achieved 98% precision and 97% recall. We also applied the approach to other data domains. The results show initial progress toward a multi-domain Chinese information extraction system and support the feasibility of the approach for Chinese information extraction.
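To make the idea of extraction without word segmentation more concrete, here is a minimal sketch of a character-level finite state automaton. It is not the thesis's actual system. The trigger words, the linking character 為, the three states, and the sample sentences are all hypothetical choices made for illustration only.

```python
from typing import Dict, List

# Hypothetical trigger words that open a pattern (not taken from the thesis).
TRIGGERS = {"任命": "appoint", "特派": "assign"}
LINK = "為"  # hypothetical marker between the person's name and the position
END = "。"   # sentence-final full stop closes the pattern


def extract(text: str) -> List[Dict[str, str]]:
    """Scan raw characters with a three-state automaton: SEEK -> NAME -> POSITION.

    No word segmentation or part-of-speech tagging is performed; the automaton
    reacts only to literal trigger strings and delimiter characters.
    """
    records: List[Dict[str, str]] = []
    state = "SEEK"
    action = name = position = ""
    i = 0
    while i < len(text):
        if state == "SEEK":
            # Look for any trigger string starting at position i.
            for trigger, label in TRIGGERS.items():
                if text.startswith(trigger, i):
                    action, name, position = label, "", ""
                    state = "NAME"
                    i += len(trigger)
                    break
            else:
                i += 1  # no trigger here, move on one character
        elif state == "NAME":
            ch = text[i]
            if ch == LINK:
                state = "POSITION"
            elif ch == END:
                state = "SEEK"  # pattern was incomplete, discard it
            else:
                name += ch
            i += 1
        else:  # state == "POSITION"
            ch = text[i]
            if ch == END:
                records.append({"action": action, "name": name, "position": position})
                state = "SEEK"
            else:
                position += ch
            i += 1
    # Accept a pattern that reaches end of input without a closing full stop.
    if state == "POSITION" and position:
        records.append({"action": action, "name": name, "position": position})
    return records


if __name__ == "__main__":
    # Made-up sentence in the style of a personnel directive.
    print(extract("總統令:任命王小明為教育部部長。"))
```

A sketch this small has obvious limits: for example, a name that itself contains 為 would be split in the wrong place. A practical system would need a richer set of patterns and states.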
Identifier | oai:union.ndltd.org:CHENGCHI/G0090753018 |
Creators | 翁嘉緯, Chia-Wei Weng |
Publisher | 國立政治大學 |
Source Sets | National Chengchi University Libraries |
Language | Chinese |
Detected Language | English |
Type | text |
Rights | Copyright © nccu library on behalf of the copyright holders |