1 |
雲端運算服務環境下運用文字探勘於語意註解網頁文件分析之研究 / Extraction of semantic annotation document using text mining techniques in cloud computing environment 黃孝文 Unknown Date
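With the rapid growth of the Internet, the datasets that data mining and text mining must analyze keep getting larger. Running such analyses on a single machine is constrained by its memory size and computing power: computation time rises sharply, and the size of the dataset that can be analyzed is limited. Semantic annotation extracts the important content of a document and highlights its topics, which strengthens the results of data mining and text mining. Data mining, text mining, and semantic annotation all involve large-scale data processing. Cloud computing balances the load by distributing the computation across every machine in a cluster, which speeds up computation and storage and lowers overall risk.
This study uses Hadoop to implement a cloud-based text mining platform for distributed text mining and result analysis. The empirical analysis uses the Reuters-21578 dataset, which contains 21,578 news documents, split into a training set and a test set according to the Mod Apte split for document classification. Classification proceeds in several parts: data preprocessing, which converts data formats; semantic annotation, which adds more detailed links and descriptions to the document content; classifiers (Naive Bayes and Complement Naive Bayes), which produce the predictive model; and an evaluator, which assesses the classification results. After stopword removal, the addition of semantic annotation data, and grouping by document length in terms, the Reuters dataset is used to train the Naive Bayes and Complement Naive Bayes classifiers, and classification accuracy on the test set is compared as the empirical result.
Based on the experimental results, this study examines how stopword removal, semantic annotation, the classification algorithm, and document length affect classification accuracy: (1) removing stopwords prevents high-frequency stopwords from harming classification predictions; (2) semantic annotation, as a way of obtaining metadata, improves document classification; (3) the Complement Naive Bayes classifier reduces the misclassifications that skewed data cause; (4) longer documents dominate the predictions to some degree, causing misclassifications and lowering accuracy. Adjusting these factors improves classification accuracy.
The proposed system runs document classification algorithms in a cloud computing environment, so results on large datasets can be obtained more quickly. Using semantic annotation as a source of metadata gives the model-building process more information to analyze and improves the accuracy of machine judgment; documents can also be converted into Semantic Web documents for querying by Semantic Web search engines. Future work should add data from sites with large amounts of unstructured data, such as Twitter or Facebook, so that the platform can analyze data at a larger scale; should consider how the concentration of the class distribution affects classification accuracy; and should implement better-performing classification algorithms to improve the system's overall results. / Businesses that perform data mining and text mining now need to handle large-scale datasets. The computational resources of servers are often limited and too inefficient for analytical jobs. Running data mining jobs on a cloud computing cluster lets them obtain results quickly on a large dataset without "out of memory" problems.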
In this paper, a series of experiments measures and analyzes the accuracy of classification algorithms implemented on Hadoop, using the Reuters-21578 dataset. The text mining process consists of four stages: (1) data preprocessing, (2) semantic annotation, (3) classification, and (4) evaluation. Reuters-21578 was divided into a training set and a test set according to the Mod Apte split, processed with stopword removal, augmented with semantic annotations as metadata, and split into several subsets by document size. The experiments outline several issues to consider when conducting text mining.
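The preprocessing steps named above (stopword removal, appending semantic annotations as metadata, and splitting by document size) can be sketched as follows. This is only an illustration under stated assumptions: the stopword list, the `topic:grain` annotation token, and the length boundaries are hypothetical, and the abstract does not say how the Hadoop implementation represents any of them.

```python
from collections import defaultdict

# Hypothetical stopword list; the abstract does not say which list was used.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "said"}


def preprocess(text, annotations=()):
    """Lowercase, split on whitespace, drop stopwords, append annotation tokens."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    # Annotations are treated as extra tokens (an assumption for illustration).
    return tokens + list(annotations)


def bucket_by_length(docs, boundaries=(50, 200)):
    """Group token lists into subsets by length; the boundaries are illustrative."""
    buckets = defaultdict(list)
    for tokens in docs:
        # Bucket index = number of boundaries the document length exceeds.
        index = sum(len(tokens) > b for b in boundaries)
        buckets[index].append(tokens)
    return dict(buckets)
```

Each bucket could then be classified and evaluated separately to see how document size relates to accuracy.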
According to the experimental results, stopword removal, semantic annotation, the choice of classification algorithm, and document size all affect classification accuracy. First, stopword removal keeps common words from becoming noise that harms the classification result. Second, semantic annotation, as extra information, can improve the result. Third, the Complement Naive Bayes algorithm can address the decision boundary problem that Naive Bayes cannot handle. Fourth, long documents can dominate the classification results. Fifth, the class imbalance problem can lower classification accuracy. Text mining results can be improved by adjusting the factors identified above.
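To make the Complement Naive Bayes idea concrete, here is a minimal single-machine sketch. It is not the Hadoop implementation from the thesis: the function names, the smoothing parameter `alpha`, and the toy documents are assumptions for illustration. The key point is that each class's term weights are estimated from the documents of all other classes, which makes the estimates less sensitive to skewed class sizes.

```python
import math
from collections import Counter


def train_complement_nb(docs, labels, alpha=1.0):
    """Return per-class log weights estimated from all *other* classes."""
    vocab = sorted({t for d in docs for t in d})
    per_class = {c: Counter() for c in set(labels)}
    for tokens, label in zip(docs, labels):
        per_class[label].update(tokens)
    total = Counter()
    for counts in per_class.values():
        total.update(counts)
    weights = {}
    for c, counts in per_class.items():
        # Term counts in the complement of class c.
        comp = {t: total[t] - counts[t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[c] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return weights


def predict_complement_nb(weights, tokens):
    """Pick the class whose complement explains the document worst."""
    def score(c):
        # Tokens outside the training vocabulary are ignored (weight 0.0).
        return sum(weights[c].get(t, 0.0) for t in tokens)
    return min(weights, key=score)


# Toy, deliberately imbalanced data (hypothetical, not from Reuters-21578).
docs = [["wheat", "grain"], ["wheat", "crop"], ["oil", "crude"]]
labels = ["grain", "grain", "crude"]
model = train_complement_nb(docs, labels)
```

A document is assigned to the class with the lowest complement score, since a low score means the document looks unlike everything outside that class.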
|
2 |
基於語意框架之讀者情緒偵測研究 / Semantic Frame-based Approach for Reader-Emotion Detection 陳聖傑, Chen, Cen Chieh Unknown Date
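Past research on emotion analysis has rarely focused on reader emotion and has mostly studied writer emotion. Reader emotion is the emotional response a reader has after reading an article. The same article may evoke several emotional reactions in readers, or even emotions quite different from the writer's, which makes reader-emotion analysis a more complex problem. This study aims to identify the exact emotion a reader feels after reading an article. Document classification can be applied effectively to reader-emotion detection: besides identifying the correct reader emotion, it retains the relevant content of reader-emotion documents. However, current information retrieval systems still lack the ability to recognize documents with implicit emotion, especially reader emotion. In addition, machine learning-based methods are hard for humans to understand, and it is hard to find the cause of a recognition failure, so one cannot tell what kind of article triggers a particular reader emotion. This study therefore proposes a semantic frame-based approach (FBA) to reader-emotion detection. FBA simulates the way humans read articles and effectively builds basic knowledge of reader emotion into a reader-emotion knowledge base. FBA extracts the basic knowledge of semantic concepts in a highly automated way: in addition to syntactic structure features, it considers the surrounding context and semantic associations, integrates similar knowledge into discriminative semantic frames, and matches reader-emotion documents through sequence alignment. Experimental results show that the method can be applied effectively to reader-emotion detection. / Previous studies on emotion classification mainly focus on the writer's emotional state. By contrast, this research emphasizes emotion detection from the reader's perspective. The classification of documents into reader-emotion categories can be applied in several ways; one application is to retain only the documents that evoke the desired emotions, so that users can retrieve documents that contain relevant content and at the same time instill the proper emotions. However, current IR systems lack the ability to discern emotion within texts, and reader-emotion detection has yet to achieve comparable performance. Moreover, previous machine learning-based approaches are generally not human-understandable, so it is difficult to pinpoint the reason for recognition failures and to understand what emotions articles trigger in their readers.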
We propose a flexible semantic frame-based approach (FBA) for reader-emotion detection that simulates this process in human perception. FBA is a highly automated process that incorporates various knowledge sources to learn, from raw text, semantic frames that characterize an emotion and are comprehensible to humans. The generated frames are used to predict readers' emotions through an alignment-based matching algorithm that allows a semantic frame to be partially matched through a statistical scoring scheme. Experimental results demonstrate that our approach can effectively detect readers' emotions by exploiting the syntactic structures and semantic associations in the context, and that it outperforms currently well-known statistical text classification methods and the state-of-the-art reader-emotion detection method.
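The alignment-based partial matching described above can be illustrated with a small dynamic-programming sketch. This is not the thesis's scoring scheme, which the abstract does not specify: the fixed `match`, `mismatch`, `insertion`, and `deletion` scores and the bracketed slot names are hypothetical stand-ins for the statistical scores and learned frames.

```python
def align_score(frame, tokens, match=2.0, mismatch=-1.0,
                insertion=-0.5, deletion=-1.0):
    """Best score for aligning every frame slot against some span of tokens."""
    n, m = len(frame), len(tokens)
    # best[i][j]: best score aligning frame[:i] with a span ending at tokens[:j].
    # Row 0 stays at zero, so tokens before the matched span cost nothing.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + deletion
        for j in range(1, m + 1):
            pair = match if frame[i - 1] == tokens[j - 1] else mismatch
            best[i][j] = max(
                best[i - 1][j - 1] + pair,   # slot aligned to this token
                best[i - 1][j] + deletion,   # slot missing from the text
                best[i][j - 1] + insertion,  # extra token inside the span
            )
    # Taking the maximum over the last row makes trailing tokens free as well.
    return max(best[n])


# Hypothetical frame and labelled sentence, for illustration only.
frame = ["[government]", "raise", "[tax]"]
sentence = ["today", "[government]", "will", "raise", "[tax]", "again"]
score = align_score(frame, sentence)
```

Under this sketch, a document would be assigned the emotion whose frames reach the highest alignment score; a frame with a missing or displaced slot still contributes, only with a lower score.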
|