Global ETD Search

1	適用於中文史料文本之標記式主題模型分析方法研究 / An Enhanced Topic Model Based on Labeled LDA for Chinese Historical Corpora
陳奕安, Unknown Date (has links)

Abstract: This thesis proposes a topic analysis method for Chinese historical texts. It builds on the Labeled Latent Dirichlet Allocation (LLDA) algorithm, which uses manually labeled Chinese texts to find the words related to specific topics. The proposed algorithm adds topic seed-word information to strengthen the LDA clustering results, so that the clustered words are more closely associated with their topics.

With the spread of the Internet, the rapid development of information retrieval, and the growth of digital archives, more and more physical books are being edited into digital editions and annotated with metadata. How to apply text mining to these valuable historical texts has therefore become an important research question. Identifying the topics of articles in large historical corpora is of particular interest to many scholars, and the LDA topic model is a classic text mining method for this task. In this study, we find that traditional LDA has problems in describing the clustered topics, namely the high randomness of the topic categories and the low readability of individual topics, which make subsequent interpretation very difficult. We therefore adopt Labeled LDA, a labeled topic model derived from LDA, which restricts the topic categories that can be generated and so reduces this randomness. We further take into account the length of Chinese words and user-defined seed words, so that the clustered topic words are more relevant to their topics and easier to describe.

In the experiments, the topic words extracted by the improved algorithm were labeled manually by historical experts, and these labels served as the ground truth for information retrieval measures such as Mean Average Precision (MAP). The results show that clustering that takes long words and seed words into account outperforms the traditional topic model. We also compare the final results with words ranked by TF-IDF weighting, and the proposed method outperforms the TF-IDF method significantly.

Keywords: topic model (主題模型), labeled topic model (標記式主題模型), latent Dirichlet allocation (隱含狄利克雷分布)
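The Mean Average Precision measure named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's evaluation code: the function names, the example words, and the choice to normalize each topic's average precision by the number of expert-labeled relevant words are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked_words: List[str], relevant: Set[str]) -> float:
    """Average precision for one topic.

    Sums precision@k at every rank k where a relevant word appears and
    divides by the number of relevant words (an assumed normalization).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, word in enumerate(ranked_words, start=1):
        if word in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of the per-topic average precision over all topics in `rankings`."""
    if not rankings:
        return 0.0
    scores = [
        average_precision(words, judgments.get(topic, set()))
        for topic, words in rankings.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical data: ranked topic words from a model, and expert labels.
rankings = {
    "topic_a": ["w1", "w2", "w3", "w4"],
    "topic_b": ["w5", "w6"],
}
judgments = {
    "topic_a": {"w1", "w3"},
    "topic_b": {"w6"},
}
map_score = mean_average_precision(rankings, judgments)
```

Under these assumptions, a topic scores higher the nearer its expert-approved words sit to the top of the ranked list, which is why the measure suits comparing ranked topic words from different models.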
