數位典藏與數位學習國家型科技計畫14年來透過數位化方式典藏國家文物,例如:生物、考古、地質等15項主題,為了能讓數位典藏資料與時事互動故使用關鍵字作為數位典藏資料與時事的橋樑,由於時事資料會出現新字詞,因此,本研究將提出一個演算法在不使用詞庫或字典的情況下對單一篇中文文章擷取主題關鍵字,此演算法是以Bigram的方式斷詞因此字詞最小單位為二個字,例如:「中文」,隨後挑選出頻率詞並採用分群的方式將頻率詞進行分群最後計算每個字詞的卡方值並產生主題關鍵字,在文章中字詞共現的分佈是很重要的,假設一字詞與所有頻率詞的機率分佈中,此字詞與幾個頻率詞的機率分佈偏差較大,則此字詞極有可能為一關鍵字。在字詞的呈現方面,中文句子裡不像英文句子裡有明顯的分隔符號隔開每一個字詞,造成中文在斷詞處理上產生了極大的問題,與英文比較起來中文斷詞明顯比英文來的複雜許多,在本研究將會比較以Bigram、CKIP和史丹佛中文斷詞器為斷詞的工具,分別進行過濾或不過濾字詞與對頻率詞分群或不分群之步驟,再搭配計算卡方值或詞頻後所得到的主題關鍵字之差異,實驗之資料將採用中央研究院數位典藏資源網的文章,文章的標準答案則來自於中央研究院資訊科學研究所電腦系統與通訊實驗室所開發的撈智網。從實驗結果得知使用Bigram斷詞所得到的主題關鍵字部分和使用CKIP或史丹佛中文斷詞器所得到的主題關鍵字相同,且部分關鍵字與文章主題的關聯性更強,而使用Bigram斷詞的主要優點在於不用詞庫。最後,本研究所提出之演算法是基於能將數位典藏資料推廣出去的前提下所發展,希望未來透過此演算法能從當下熱門話題的文章擷取出主題關鍵字,並透過主題關鍵字連結到相關的數位典藏資料,進而帶動新一波「數典潮」。 / In the past 14 years, Taiwan e-Learning and Digital Archives Program has developed digital archives of organism, archaeology, geology, etc. There are 15 topics in the digital archives. The goal of the work presented in this thesis is to automatically extract keyword s in documents in digital archives, and the techniques developed along with the work can be used to build a connection between digital archives and news articles. Because there are always new words or new uses of words in news articles, in this thesis we propose an algorithm that can automatically extract keywords from a single Chinese document without using a corpus or dictionary. Given a document in Chinese, initially the algorithm uses a bigram-based approach to divide it into bigrams of Chinese characters. Next, the algorithm calculates term frequencies of bigrams and filters out those with low term frequencies. Finally, the algorithm calculates chi-square values to produce keywords that are most related to the topic of the given document. The co-occurrence of words can be used as an indicator for the degree of importance of words. If a term and some frequent terms have similar distributions of co-occurrence, it would probably be a keyword. Unlike English word segmentation which can be done by using word delimiters, Chinese word segmentation has been a challenging task because there are no spaces between characters in Chinese. The proposed algorithm performs Chinese word segmentation by using a bigram-based approach, and we compare the segmented words with those given by CKIP and Stanford Chinese Segmenter. In this thesis, we present comparisons for different settings: One considers whether or not infrequent terms are filtered out, and the other considers whether or not frequent terms are clustered by a clustering algorithm. The dataset used in experiments is downloaded from the Academia Sinica Digital Resources and the ground truth is provided by Gainwisdom, which is developed by Computer Systems and Communication Lab in Academia Sinica. According to the experimental results, some of the segmented words given by the bigram-based approach adopted in the proposed algorithm are the same as those given by CKIP or Stanford Chinese Segmenter, while some of the segmented words given by the bigram-based approach have stronger connections to topics of documents. The main advantage of the bigram-based approach is that it does not require a corpus or dictionary.
Identifer | oai:union.ndltd.org:CHENGCHI/G0100971017 |
Creators | 吳泰勳, Wu, Tai Hsun |
Publisher | 國立政治大學 |
Source Sets | National Chengchi University Libraries |
Language | 中文 |
Detected Language | English |
Type | text |
Rights | Copyright © nccu library on behalf of the copyright holders |
Page generated in 0.0022 seconds