資訊擷取是從自然語言文本中辨識出特定的主題或事件的描述,進而萃取出相關主題或事件元素中的對應資訊,再將其擷取之結果彙整至資料庫中,便能將自然語言文件轉換成結構化的核心資訊。然而資訊擷取技術的結果會有錯誤情況發生,若單只依靠人工檢查及更正錯誤的方式進行,將會是耗費大量人力及時間的工作。
在本研究論文中,我們提出字串圖形結構與字串特徵值兩種錯誤資料偵測方法。前者是透過圖形結構比對各資料內字元及字元間關聯,接著由公式計算出每筆資料的比對分數,藉由分數高低可判斷是否為錯誤資料;後者則是利用字串特徵值,來描述字串外表特徵,再透過SVM和C4.5機器學習分類方法歸納出決策樹,進而分類正確與錯誤二元資料。而此兩種偵測方法的差異在於前者隱含了圖學理論之節點位置與鄰點概念,直接比對原始字串內容;後者則是將原始字串轉換成特徵數值,進行分類等動作。
在實驗方面,我們以「總統府人事任免公報」之資訊擷取成果資料庫作為測試資料。實驗結果顯示,本研究所提出的錯誤偵測方法可以有效偵測出不合格的值組,不但能節省驗證資料所花費的成本,甚至可確保高資料品質的資訊擷取成果產出,促使資訊擷取技術更廣泛的實際應用。 / Given a targeted subject and a text collection, information extraction techniques provide the capability to populate a database in which each record entry is a subject instance documented in the text collection. However, even with the state-of-the-art IE techniques, IE task results are expected to contain errors. Manual error detection and correction are labor intensive and time consuming. This validation cost remains a major obstacle to actual deployment of practical IE applications with high validity requirement.
In this paper, we propose string graph structure and string feature-based methods. The former takes advantage of graph structure to compare characters and the relation between characters. Next step, we count the corresponding score via formula, and then the scores are takes to estimate the data correctness. The latter uses string features to describe a certain characteristics of each string, after that decision tree is generated by the C4.5 and SVM machine learning algorithms. And then classify the data is valid or not. These two detection methods have the ability to describe the feature of data and verify the correctness further. The difference between these two methods is that, we deal with string of row data directly in the previous method. Besides, it indicates the concept of node position and neighbor node in graphic theory. By contrast, the row string was transformed into feature value, and then be classified in the latter method.
In our experiments, we use IE task results of government personnel directives as test data. We conducted experiments to verify that effective detection of IE invalid values can be achieved by using the string graph structure and string feature-based methods. The contribution of our work is to reduce validation cost and enhance the quality of IE results, even provide both analytical and empirical evidences for supporting the effective enhancement of IE results usability as well.
Identifer | oai:union.ndltd.org:CHENGCHI/G0093753006 |
Creators | 鄭雍瑋, Cheng, Yung-Wei |
Publisher | 國立政治大學 |
Source Sets | National Chengchi University Libraries |
Language | 中文 |
Detected Language | English |
Type | text |
Rights | Copyright © nccu library on behalf of the copyright holders |
Page generated in 0.0131 seconds