應用文字探勘技術於英文文章難易度分類 / The Classification of the Difficulty of English Articles with Text Mining
許珀豪, Hsu, Po Hao
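With Internet access now widespread, how English learners can pick articles whose difficulty matches their own reading ability is a question worth studying. To improve the accuracy of difficulty classification, recent studies have classified articles using many difficulty features. This study classifies articles by linguistic difficulty features alone, by text features alone, and by the two combined, compares each result with the articles' official levels, and tests whether combining linguistic and text features improves accuracy.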
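This study uses articles from GEPT practice tests as the basis for classification. The framework has three parts: classification by linguistic difficulty features, classification by text features, and classification by the two combined. First, the linguistic difficulty features form the dimensions of the feature vector; once each feature value is computed, kNN classifies each article as elementary, intermediate, or high-intermediate, and this result serves as the baseline for comparing accuracy. Next, the GEPT articles are tokenized, selected feature words become the vector dimensions, and TF-IDF weights serve as the feature values for text-feature classification. Finally, the two feature sets are combined as the classification criterion. The first two approaches yield F-measures of 0.61 and 0.47, respectively; the combined approach performs best, with an F-measure of 0.68.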
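As a rough illustration of how an F-measure over three difficulty levels can be computed, the sketch below macro-averages per-class F1 scores. The abstract does not say which averaging scheme was used, so macro-averaging is an assumption here, and the function name `macro_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter

def macro_f_measure(gold, predicted):
    """Macro-averaged F1 over every label that occurs in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)

# Toy labels invented for illustration; they are not data from the study.
gold = ["elementary", "elementary", "intermediate",
        "intermediate", "high-intermediate", "high-intermediate"]
predicted = ["elementary", "intermediate", "intermediate",
             "intermediate", "high-intermediate", "elementary"]
score = macro_f_measure(gold, predicted)
```

Macro-averaging weights the three levels equally regardless of how many articles each contains; a micro-averaged score would instead weight every article equally.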
The goal this thesis hopes to reach in the future, building on the combined linguistic and text features, is to help learners find articles suited to their level among the large volume of English text so that they can learn step by step. In future work, the two feature sets could be combined to classify English articles by both topic category and difficulty level, letting users choose what to read and thereby improving learning outcomes. / With the Internet now widespread, how to gauge the difficulty of English articles so that learners can efficiently choose ones that match their proficiency is an important issue. Recently, researchers have selected many difficulty features in order to improve classification accuracy. This study aims to simplify the previously complicated procedures of article classification by classifying articles with linguistic difficulty features, with text features, and with the combination of both, and by comparing the results in the hope of raising classification accuracy.
The classification in this study is based on articles from official GEPT practice exams. The study has three parts: classification by linguistic difficulty features, classification by text features, and classification by the combination of both. First, the linguistic difficulty features serve as the dimensions of the linguistic feature vectors, and the kNN method classifies each article as elementary, intermediate, or high-intermediate; this result is the baseline for comparing classification accuracy. Second, the GEPT articles are tokenized into words; the selected feature words serve as the dimensions of the text vectors, and their TF-IDF weights serve as the values. Third, the articles are classified using the combination of the two feature sets. Of the three, the combined method performs best, with an F-measure of 0.68.
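The three-part setup described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract does not list the linguistic features, the feature-word selection, the distance metric, or the value of k, so the two features in `linguistic_features`, the use of every training term as vocabulary, Euclidean distance, and `k=3` are all hypothetical choices, as are the function names and toy documents.

```python
import math
from collections import Counter

def linguistic_features(tokens):
    """Two hypothetical difficulty features: mean word length and token count."""
    return [sum(len(t) for t in tokens) / len(tokens), float(len(tokens))]

def idf_table(train_docs):
    """Inverse document frequency for every term seen in the training documents."""
    n = len(train_docs)
    df = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(tokens, vocabulary, idf):
    """TF-IDF value of each vocabulary term for one document."""
    counts = Counter(tokens)
    return [counts[term] / len(tokens) * idf[term] for term in vocabulary]

def combined_vector(tokens, vocabulary, idf):
    """Concatenate the linguistic features with the TF-IDF text features."""
    return linguistic_features(tokens) + tfidf_vector(tokens, vocabulary, idf)

def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy tokenized documents invented for illustration; not data from the study.
train = [
    (["the", "cat", "sat", "on", "the", "mat"], "elementary"),
    (["a", "dog", "ran", "to", "the", "cat"], "elementary"),
    (["economic", "policy", "affects", "international", "trade"], "high-intermediate"),
    (["international", "policy", "shapes", "economic", "growth"], "high-intermediate"),
]
train_docs = [tokens for tokens, _ in train]
train_labels = [label for _, label in train]

idf = idf_table(train_docs)
vocabulary = sorted(idf)  # assumption: every training term is a feature word
train_vectors = [combined_vector(d, vocabulary, idf) for d in train_docs]

query = ["the", "dog", "sat", "on", "a", "mat"]
prediction = knn_predict(train_vectors, train_labels,
                         combined_vector(query, vocabulary, idf))
```

In this sketch the raw linguistic features dominate the distance because they are much larger than the TF-IDF weights; a real system would probably rescale the two feature groups before concatenating them.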
The study hopes its results can help people choose suitable English articles and learn English step by step. In the future, articles could be classified by combining linguistic difficulty features with text features, sorting them not only into difficulty levels but also into topic categories. Learners could then choose articles they like that also match their level, which should improve learning outcomes.