In recent years, Linked Open Data (LOD) has been recognized as holding substantial potential value. Collecting and integrating diverse LOD sources, and making them available to data analysts for extraction and analysis, has become an important research challenge. LOD is represented in the Resource Description Framework (RDF) format and can be queried with SPARQL. For large volumes of RDF data, however, there is still no high-performance, easily scalable system that integrates storage, querying, and analysis, and research on analytics pipelines for big RDF data remains incomplete. This study addresses these issues using movie box office prediction as a case study. It uses the DBpedia LOD dataset linked to external movie databases (for example, IMDb) and performs large-scale graph analytics on the Apache Spark platform. Prediction models are first built with two algorithms, Naïve Bayes and Bayesian networks, and the Bayesian Information Criterion (BIC) is used to find the best Bayesian network structure. Multi-class ROC curves and AUC values are then computed to evaluate the predictive accuracy of the models.
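The multi-class ROC/AUC evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a one-vs-rest, macro-averaged AUC, which is one common way to extend AUC to several classes; the abstract does not say which averaging scheme the thesis used. The function names, class labels ("low", "mid", "high"), and probabilities below are hypothetical, and the code is plain Python rather than the Spark pipeline described above.

```python
from typing import List, Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive example is scored
    above a random negative one (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(
    probs: Sequence[Sequence[float]],
    y_true: Sequence[str],
    classes: Sequence[str],
) -> float:
    """Macro-averaged one-vs-rest AUC: treat each class in turn as the
    positive class, compute its binary AUC, then average."""
    aucs: List[float] = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in probs]          # predicted P(class = cls)
        labels = [1 if y == cls else 0 for y in y_true]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    # Hypothetical box-office classes and predicted class probabilities.
    classes = ["low", "mid", "high"]
    probs = [
        [0.8, 0.1, 0.1],
        [0.2, 0.7, 0.1],
        [0.1, 0.2, 0.7],
        [0.6, 0.3, 0.1],
    ]
    y_true = ["low", "mid", "high", "mid"]
    print(macro_ovr_auc(probs, y_true, classes))
```

The pairwise formulation is quadratic in the number of examples, so it is only suitable for small illustrations; a rank-based computation would be preferable at scale.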
Identifier | oai:union.ndltd.org:CHENGCHI/G0103753023 |
Creators | 劉文友, Liu, Wen Yu |
Publisher | National Chengchi University (國立政治大學) |
Source Sets | National Chengchi University Libraries |
Language | Chinese |
Detected Language | English |
Type | text |
Rights | Copyright © nccu library on behalf of the copyright holders |