探索性資料分析方法在文本資料中的應用─以「新青年」雜誌為例 / A Study of Exploratory Data Analysis on Text Data: A Case Study Based on New Youth Magazine

With economic growth and the rapid development of the internet, enormous amounts of data are produced online and offline at every moment. About 80 percent of these data are unstructured, such as text and images. Whether valuable information can be extracted and used effectively depends on how such data are quantified and which analysis methods are applied. For text data, this thesis proposes an exploratory data analysis (EDA) procedure and uses language change in New Youth magazine as a case study to show how text features are selected, quantified, and analyzed.
First, taking the volume as the unit of analysis, we quantify the textual structure of each volume of New Youth from several perspectives: character usage, sentence usage, the use of classical and vernacular function words, and the share of common characters and words between volumes. Combining several kinds of charts, we trace how the magazine's language changed and what characterized the change. This covers both the transition from classical Chinese (文言文) to vernacular Chinese (白話文) and the evolution of the vernacular itself. Second, based on these per-volume results, we look for text feature variables that separate the classical and vernacular forms. Using the first and seventh volumes as training data, we combine principal components with logistic regression to train a classifier for articles in the two forms, and test it on the fourth volume. The results confirm that the extracted variables distinguish classical from vernacular articles effectively.
Finally, drawing on the exploratory results and the experience of humanities scholars, we examine the later change in the magazine's language, from the vernacular of the May Fourth Movement period to the vernacular characterized as "Red Chinese" (the vernacular used in China after the Second World War). A classifier trained on the seventh and eleventh volumes confirms that the language of these two volumes differs clearly. Applying the same model to the United Daily News from Taiwan and the People's Daily from Mainland China shows that the two newspapers differ markedly: the style of the United Daily News is closer to that of the seventh volume, while the style of the People's Daily is closer to that of the eleventh volume. The author takes this as an indication that the People's Daily is likely to have been influenced by the Soviet Union, and the difference merits further study.
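The feature quantification step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual feature set or code: the particle lists, the function name `text_features`, and the choice of three features are assumptions made only for illustration. It computes, for one text, the rate of a few classical function characters, the rate of a few vernacular function characters, and a rough mean sentence length.

```python
from collections import Counter

# Illustrative particle lists (assumptions, not the thesis's actual variables).
CLASSICAL_PARTICLES = "之乎者也矣焉哉"
VERNACULAR_PARTICLES = "的了嗎呢們麼"
SENTENCE_ENDINGS = "。!?!?"


def text_features(text):
    """Return simple per-text features for one article or volume.

    Rates are counts divided by the number of non-whitespace characters.
    Mean sentence length is characters (punctuation included) per
    sentence-ending mark.
    """
    chars = [c for c in text if not c.isspace()]
    counts = Counter(chars)
    total = len(chars) or 1  # avoid division by zero on empty input
    classical = sum(counts[c] for c in CLASSICAL_PARTICLES)
    vernacular = sum(counts[c] for c in VERNACULAR_PARTICLES)
    sentences = sum(counts[c] for c in SENTENCE_ENDINGS) or 1
    return {
        "classical_rate": classical / total,
        "vernacular_rate": vernacular / total,
        "mean_sentence_length": total / sentences,
    }


# Two toy sentences, one classical in style and one vernacular.
print(text_features("學而時習之,不亦說乎?"))
print(text_features("我們的雜誌出版了。"))
```

Feature vectors of this kind, computed per article, are the sort of input that a principal components step followed by logistic regression could then classify; that modelling step is left out here because it would need a third-party library.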

Identifier: oai:union.ndltd.org:CHENGCHI/G0102354031
Creators: 潘艷艷 (Pan, Yan Yan)
Publisher: 國立政治大學 (National Chengchi University)
Source Sets: National Chengchi University Libraries
Language: Chinese
Detected Language: English
Type: text
Rights: Copyright © nccu library on behalf of the copyright holders
