
中文文本探勘工具:主題分析、詞組關聯強度、相關句擷取 / Tools for Chinese Text Mining: Topic Analysis, Association Strengths of Collocations, Extraction of Relevant Statements

現今資料大量且快速數位化的時代,各領域對資訊探勘分析技術越趨倚重。而在數位人文領域中,從2009年「數位典藏與數位人文國際研討會」開始,此議題逐漸受到重視,主要目的為將數位文物結合資訊分析與圖像化輔助,透過不同層面的詮釋建構出更完整的文物資訊。
本研究建構一個針對各種中文語料分析的工具,藉由latent semantic analysis、pointwise mutual information、Pearson’s chi-squared test、typed dependencies distance、word2vec、Gibbs sampling for latent Dirichlet allocation等計算語料中關鍵詞彙關聯強度的方法,並結合分群方法找出可能的主題,最後擷取符合分群結果的相關句子予以輔助人文學者分析詮釋。透過提供各種觀察語料的面向,進而提升語料相關研究學者的效率。
我們利用《人民日報》、《新青年》、《聯合報》、《中國時報》作為實驗與測試的中文語料。且將《新青年》藉由此套工具分析後的結果提供給專業人文學者,做為分析詮釋的參考資訊與佐證依據,並在「2015年數位典藏與數位人文國際研討會」中發表論文。目前我們透過各種中文語料評估工具的效能,且在未來將公開此套工具提供給更多學者使用,節省對於語料分析的時間。 / In recent years, a wide variety of text documents have been digitized, and data mining techniques have become increasingly popular for analyzing such data in many research fields. The digital humanities have gradually been taken more seriously since the "International Conference of Digital Archives and Digital Humanities" began in 2009. The main aim is to combine digital heritage with information analysis and visualization, so that interpretation at different levels builds up more complete information about cultural artifacts.
In this study, we construct a set of tools for Chinese text mining. The tools calculate the association strengths of collocations using latent semantic analysis, pointwise mutual information, Pearson’s chi-squared test, typed dependencies distance, word2vec, and Gibbs sampling for latent Dirichlet allocation. They then apply clustering methods to identify possible topics and extract the relevant statements that match the clustering results. By offering a variety of perspectives on the corpora, the clusters and relevant statements help humanities scholars analyze and interpret the texts more efficiently.
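As an illustration of two of the association measures named above, the following is a minimal sketch of sentence-level pointwise mutual information and Pearson’s chi-squared statistic for a word pair. It is not the thesis implementation: the function names, the choice of the sentence as the co-occurrence window, and the tiny pre-segmented corpus are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """Sentence-level PMI for every unordered word pair that co-occurs.

    `sentences` is a list of token lists (already word-segmented).
    Probabilities are estimated as fractions of sentences.
    """
    n = len(sentences)
    word_df = Counter()  # number of sentences containing each word
    pair_df = Counter()  # number of sentences containing both words
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (a, b), n_ab in pair_df.items():
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) )
        scores[(a, b)] = math.log2((n_ab * n) / (word_df[a] * word_df[b]))
    return scores


def chi_squared(n_ab, n_a, n_b, n):
    """Pearson's chi-squared statistic for the 2x2 co-occurrence table."""
    o11 = n_ab                    # sentences with both words
    o12 = n_a - n_ab              # first word only
    o21 = n_b - n_ab              # second word only
    o22 = n - n_a - n_b + n_ab    # neither word
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den if den else 0.0


# Hypothetical pre-segmented sentences, for illustration only.
sentences = [
    ["民主", "科學", "青年"],
    ["民主", "科學"],
    ["青年", "文學"],
    ["文學", "革命"],
]
scores = pmi_scores(sentences)
pair = tuple(sorted(("民主", "科學")))
print(pair, scores[pair], chi_squared(2, 2, 2, len(sentences)))
```

Pairs are stored as sorted tuples so that each unordered pair has a single key. On a real corpus, word segmentation and a minimum-frequency cutoff would be needed first, since PMI tends to overrate rare pairs.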
At the experimental stage of this study, we used the "People's Daily", "New Youth", "United Daily News", and "China Times" as the corpora for testing. We provided the results of analyzing "New Youth" with the tools to humanities scholars as reference information and supporting evidence for their interpretation, and a paper was published at the "2015 International Conference of Digital Archives and Digital Humanities". Currently, we are assessing the effectiveness of the tools on a variety of Chinese corpora. In the future, we will make the tools freely available on the Internet for Chinese text mining. We hope these tools will save time for humanities scholars studying Chinese corpora.

Identifier: oai:union.ndltd.org:CHENGCHI/G0102753020
Creators: 林書佑, Lin, Shu Yu
Publisher: 國立政治大學
Source Sets: National Chengchi University Libraries
Language: 中文
Detected Language: English
Type: text
Rights: Copyright © nccu library on behalf of the copyright holders
