應用記憶體內運算於多維度多顆粒度資料探勘之研究―以醫療服務創新為例 / A Research Into In-memory Computing In Multidimensional, Multi-granularity Data Mining ― With Healthcare Services Innovation
朱家棋 (Chu, Chia-Chi)
Under the pressure of global population aging and continued population growth, demand for healthcare services keeps rising. In the healthcare domain, data mining with association-rule analysis is commonly used to uncover knowledge hidden in large medical databases and to support clinical decision-making or innovative medical services. As new healthcare services and applications emerge (for example, electronic health records and mobile health), and as medical institutions must preserve large volumes of patient data over the long term to comply with government policy, the field faces the question of how to process such massive data effectively.
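As context for the association-rule terminology above: a rule A → B is judged by its support (how often A and B co-occur in the records) and its confidence (how often B appears given A does). The following minimal Python sketch computes both metrics on a toy, entirely hypothetical set of "medical record" transactions; neither the data nor the rule comes from the thesis.

# Minimal sketch of association-rule metrics on toy transactions.
# The data and the example rule are hypothetical illustrations.
transactions = [
    {"diabetes", "hypertension", "statin"},
    {"diabetes", "statin"},
    {"hypertension"},
    {"diabetes", "hypertension"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(A ∪ B) / support(A): how often B co-occurs given A."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# Rule {diabetes} -> {hypertension}: support = 0.5, confidence = 2/3.
print(support({"diabetes", "hypertension"}, transactions))
print(confidence({"diabetes"}, {"hypertension"}, transactions))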
Traditional association-rule algorithms, however, are severely limited in performance at this scale. Many studies have therefore parallelized association-rule mining in distributed environments on the Hadoop MapReduce framework, which does run dramatically faster than a single-node implementation. In practice, though, MapReduce is ill suited to association-rule algorithms that require intensive iterative computation, since each iteration becomes a separate job whose intermediate results are written back to disk.
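To illustrate the contrast, the sketch below shows the iterative, Apriori-style counting pattern on Spark, where the transaction RDD is cached in memory once and then reused by each counting pass; under MapReduce, each pass would be a separate job re-reading its input from disk. This is only a schematic PySpark example with toy data, not a reproduction of the thesis's own multidimensional, multi-granularity algorithm.

from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="apriori-sketch")

# Toy transactions; real input would be loaded from HDFS or similar.
transactions = sc.parallelize([
    ["diabetes", "hypertension", "statin"],
    ["diabetes", "statin"],
    ["hypertension"],
    ["diabetes", "hypertension"],
]).cache()  # cached in memory once, reused by every pass below

min_count = 2

# Pass 1: frequent single items.
freq_items = set(
    transactions.flatMap(lambda t: [(i, 1) for i in t])
                .reduceByKey(lambda a, b: a + b)
                .filter(lambda kv: kv[1] >= min_count)
                .keys()
                .collect()
)

# Pass 2: frequent pairs built from pass-1 survivors. Under MapReduce this
# pass would re-read the raw data from disk; here the cached RDD is served
# from memory. (At scale, freq_items would be a broadcast variable.)
freq_pairs = (
    transactions.flatMap(lambda t: combinations(
                    sorted(i for i in t if i in freq_items), 2))
                .map(lambda p: (p, 1))
                .reduceByKey(lambda a, b: a + b)
                .filter(lambda kv: kv[1] >= min_count)
                .collect()
)
print(freq_pairs)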
This study uses the Spark in-memory computing framework to mine multidimensional, multi-granularity association rules in parallel on a distributed cluster. The experimental results can be summarized in three points. First, when the data set is small, parallelization yields little benefit, because splitting the computation into Map and Reduce stages adds overhead. Second, when the data set is large, the parallel strategy differs sharply from the single-machine version, with overall running times differing by a factor of up to 100; once the number of items exceeds 10,000, the single-machine version fails with insufficient memory, while the parallel strategy still runs. Third, although Spark is slightly slower than the single-machine version on small-scale data, its running time is still under a quarter of the Hadoop version's, and on large-scale data Spark again outperforms Hadoop. For large-scale data, then, Spark is the best solution in terms of both computational performance and scalability.