透過Spark平台實現大數據分析與建模的比較:以微博為例 / Accomplish Big Data Analytic and Modeling Comparison on Spark: Weibo as an Example. 潘宗哲 (Pan, Zong Jhe). Unknown date.
The rapid growth and change of data, together with fast-evolving analysis tools, make data analysis increasingly challenging. This study presents a complete machine learning pipeline as a reference blueprint for academic institutions and companies that are adopting big data analytics. We use Apache Spark as the computing framework and build machine learning models with the two MLlib packages, Spark.ml and Spark.mllib, to address problems that can arise in traditional data analysis. Along the way, we compare the scenarios in which the different Spark analysis modules are suitable. Development is done first on a local cluster, and the jobs are then submitted to an Amazon cloud cluster (AWS EC2) to speed up modeling and analysis. We demonstrate the pipeline on Weibo, using the 2012 mainland China Weibo dataset provided by the Journalism and Media Studies Centre of the University of Hong Kong. We extract features from users' posts with RDD, Spark SQL, and GraphX, and build a Random Forest model for a binary classification task: predicting whether a user is officially verified.