1 |
行動應用軟體在迭代分群行為之研究 / Iterative Clustering on Behaviors of App Executables
邱莉晴, Chiu, Li Ching (Unknown Date)
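Mobile devices are ubiquitous today, and we need a way to explore what apps do behind the scenes.
This study proposes an unsupervised clustering approach to investigate whether the code inside apps can serve as a basis for clustering them by behavior.
We apply iterative clustering to analyze apps and examine whether the resulting clusters are appropriate.
In our experiments, we downloaded and analyzed hundreds of apps from the App Store, and found that the proposed approach performs well and yields correct clusters. / Smart devices are everywhere nowadays. Mobile application (app) development has become one of the mainstream activities in the software industry, with millions of apps developed and published to billions of users.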
It is essential to have a systematic way to analyze apps, preferably on their executables, which in most cases are the only publicly available source of information about apps.
In this work, we propose to apply unsupervised clustering to mobile applications based on their system call distributions. We first adopt a static binary analysis that reverse engineers app executables to find the method call and call sequence counts embedded in them. Apps are then clustered iteratively on this information to reveal implicit relationships among apps based on function call similarity. The GHSOM (Growing Hierarchical Self-Organizing Map), an unsupervised learning tool, is integrated to cluster apps based on information resolved directly from their executables.
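As a rough illustration of the feature-extraction step, the sketch below turns per-app method-call lists into aligned count vectors. It assumes the calls have already been recovered by static analysis; the app names, method names, and the choice to normalize counts into a distribution are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def call_count_features(apps):
    """Turn per-app method-call lists into aligned, normalized count vectors."""
    # Vocabulary: every method name seen in any app, in sorted order.
    vocab = sorted({method for calls in apps.values() for method in calls})
    vectors = {}
    for name, calls in apps.items():
        counts = Counter(calls)
        total = sum(counts.values()) or 1
        # Assumption: normalize to a distribution so that large and small
        # apps are comparable; raw counts would also be a valid feature.
        vectors[name] = [counts[method] / total for method in vocab]
    return vocab, vectors


# Hypothetical extracted calls; a real pipeline would obtain these from
# static binary analysis of each executable.
apps = {
    "app_a": ["openURL:", "drawRect:", "drawRect:"],
    "app_b": ["openURL:", "sendMessage:"],
}
vocab, vectors = call_count_features(apps)
```

Each app becomes one row of a feature matrix whose columns are the methods in `vocab`, which is the form a clustering tool such as a GHSOM expects as input.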
We use types of methods and sequences as features. Running the clustering algorithm on apps, however, immediately raises a problem: the large number of attributes and samples leads to long or even infeasible analysis times with GHSOMs. We propose a new iterative approach to overcome this problem, combined with dimension reduction by principal component analysis, which cuts attributes with limited information loss.
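To make the dimension-reduction step concrete, here is a minimal sketch of principal component analysis restricted to the leading component, computed by power iteration on the covariance matrix. It is a standard-library illustration under simplifying assumptions, not the thesis's implementation; the sample `rows` are made up, and a production pipeline would normally use a numerical library and keep several components.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iters=200):
    """Power iteration: converges to the eigenvector with the largest
    eigenvalue, unless the start vector is orthogonal to it."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Coordinate of each centered row along the component."""
    return [sum(a * b for a, b in zip(r, component)) for r in centered]


# Made-up data: the first two attributes vary together, the third is constant.
rows = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0], [4.0, 4.0, 0.0]]
centered = center(rows)
component = leading_component(covariance(centered))
scores = project(centered, component)
```

Here three attributes collapse to one score per sample with no loss, because all the variance lies along a single direction; on real call-count data some variance is discarded, which is the "limited information loss" trade-off.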
In preliminary results from analyzing hundreds of apps downloaded directly from the Apple App Store, the proposed clustering works well and reveals some interesting information. Apps developed by the same company are clustered in the same group. Apps with similar behaviors, e.g., the same functions for games, painting, or socializing, are clustered together.
|
2 |
基於大數據資料的非監督分散式分群演算法 / An Effective Distributed GHSOM Algorithm for Unsupervised Clustering on Big Data
邱垂暉, Chiu, Chui Hui (Unknown Date)
Techniques that cluster samples by attribute similarity have been widely applied in many fields, such as pattern recognition, feature extraction, and malicious behavior detection. Because of their importance, many clustering techniques have been reimplemented on distributed frameworks, for example K-means with Hadoop on the Apache Mahout platform. Since K-means requires the number of clusters to be defined in advance and the self-organizing map (SOM) requires the map size to be defined in advance, the GHSOM (Growing Hierarchical Self-Organizing Map), which automatically clusters samples according to a tolerance on the variation between samples, offers an attractive unsupervised learning method for data with incomplete information. However, GHSOM is not currently a distributed algorithm, which limits its application to big data. In this thesis, we propose a new distributed GHSOM algorithm. We use Scala's Actor Model to implement a distributed GHSOM system, handing the horizontal and vertical expansion steps of the GHSOM algorithm to actors, and show a significant performance improvement. To evaluate the proposed method, we collected and analyzed the real-world execution behaviors of thousands of malware samples and derived malware behavior detection rules after unsupervised clustering over millions of samples, demonstrating the performance improvement, the effectiveness of the rules, and potential uses in practice. / Clustering techniques that group samples based on attribute similarity have been widely used in many fields, such as pattern recognition, feature extraction, and malicious behavior characterization. Because of their importance, various clustering techniques have been implemented on distributed frameworks for scalable computation, such as K-means with Hadoop in Apache Mahout. While K-means requires the number of clusters and self-organizing maps (SOM) require the map size to be given in advance, GHSOM (growing hierarchical self-organizing maps), which clusters samples dynamically to satisfy a tolerance on the variation between samples, offers an attractive unsupervised learning solution for data with too little information to decide the number of clusters in advance. However, its sequential computation does not scale, which limits its application to big data. In this paper, we present a novel distributed GHSOM algorithm. We take advantage of parallel computation with the Scala actor model for GHSOM construction, distributing vertical and horizontal expansion tasks to actors and showing significant performance improvement. To evaluate the presented approach, we collect and analyze the real-life execution behaviors of thousands of malware samples and derive detection rules by applying the presented unsupervised clustering to millions of samples, showing its performance improvement, rule effectiveness, and potential usage in practice.
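The growth decisions that such a distributed GHSOM hands out as tasks can be sketched as follows. This is an illustration only: it uses a Python thread pool as a stand-in for Scala actors, the names `tau1` and `tau2` follow the usual GHSOM convention for the horizontal and vertical tolerance parameters rather than anything stated in the abstract, and `expand_unit` is a hypothetical placeholder that merely computes the centroid from which a child map could start.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def mean_quantization_error(weight, samples):
    """Average distance between a unit's weight vector and its samples."""
    if not samples:
        return 0.0
    return sum(math.dist(weight, s) for s in samples) / len(samples)


def needs_horizontal_growth(unit_errors, parent_error, tau1):
    """Keep growing the map while its mean error is too large
    relative to the error of the parent unit."""
    map_error = sum(unit_errors) / len(unit_errors)
    return map_error >= tau1 * parent_error


def units_to_expand(unit_errors, root_error, tau2):
    """Units still too coarse relative to the layer-0 error get a child map."""
    return [i for i, e in enumerate(unit_errors) if e >= tau2 * root_error]


def expand_unit(task):
    """Hypothetical placeholder for training a child map: returns the
    centroid of the unit's samples as a starting weight vector."""
    index, samples = task
    dims = len(samples[0])
    centroid = [sum(s[c] for s in samples) / len(samples) for c in range(dims)]
    return index, centroid


# Made-up trained layer: three units with the samples mapped to each.
units = [
    {"weight": [0.0, 0.0], "samples": [[0.0, 1.0], [0.0, -1.0]]},
    {"weight": [5.0, 5.0], "samples": [[5.0, 5.0]]},
    {"weight": [10.0, 0.0], "samples": [[10.0, 3.0], [10.0, -3.0], [13.0, 0.0]]},
]
root_error = 4.0  # assumed layer-0 error for this illustration
errors = [mean_quantization_error(u["weight"], u["samples"]) for u in units]
grow = needs_horizontal_growth(errors, root_error, tau1=0.3)
to_expand = units_to_expand(errors, root_error, tau2=0.5)

# Each vertical expansion is independent, so the tasks can run concurrently.
with ThreadPoolExecutor() as pool:
    expanded = list(pool.map(expand_unit,
                             [(i, units[i]["samples"]) for i in to_expand]))
```

The point of the sketch is that expansions of different units share no state, which is what makes them natural units of work for actors or any other parallel worker.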
|
3 |
Extending the Growing Hierarchical Self Organizing Maps for a Large Mixed-Attribute Dataset Using Spark MapReduce
Malondkar, Ameya Mohan (January 2015)
In this thesis work, we propose a Map-Reduce variant of the Growing Hierarchical Self Organizing Map (GHSOM) called MR-GHSOM, which is capable of handling mixed attribute datasets of massive size. The Self Organizing Map (SOM) has proved to be a useful unsupervised data analysis algorithm. It projects high-dimensional data onto a lower-dimensional grid of neurons. However, the SOM has some limitations owing to its static structure and its inability to mirror the hierarchical relations in the data. The GHSOM overcomes these shortcomings by providing a dynamic structure that adapts its shape to the input data. It can grow dynamically both in the size of the individual neuron layers, to represent data at the desired granularity, and in depth, to model the hierarchical relations in the data.
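For readers unfamiliar with the SOM, a single online training step can be sketched as below: find the best-matching neuron for an input, then pull that neuron and its grid neighbors toward the input. The 2x2 grid, the Gaussian neighborhood, and the parameter values are assumptions chosen for illustration and do not come from the thesis.

```python
import math


def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def som_update(weights, x, learning_rate, radius):
    """One online SOM step: move every neuron toward x, scaled by a
    Gaussian of its grid distance to the best-matching unit."""
    bmu = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_dist_sq = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
        h = math.exp(-grid_dist_sq / (2 * radius ** 2))
        weights[pos] = [wi + learning_rate * h * (xi - wi)
                        for wi, xi in zip(w, x)]
    return bmu


# A 2x2 grid of neurons holding 3-dimensional weight vectors.
weights = {
    (0, 0): [0.0, 0.0, 0.0],
    (0, 1): [0.0, 0.0, 1.0],
    (1, 0): [1.0, 0.0, 0.0],
    (1, 1): [1.0, 1.0, 1.0],
}
x = [0.9, 0.9, 0.9]
before = {pos: math.dist(w, x) for pos, w in weights.items()}
bmu = som_update(weights, x, learning_rate=0.5, radius=1.0)
after = {pos: math.dist(w, x) for pos, w in weights.items()}
```

The grid size is fixed before training starts, which is the static structure that the GHSOM relaxes by growing maps and adding layers.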
However, training the GHSOM requires multiple passes over the input dataset, which makes it difficult to use on massive datasets. The MR-GHSOM addresses this: it is implemented on the Apache Spark cluster computing engine and leverages the popular Map-Reduce programming model. This enables us to exploit the usefulness and dynamic capabilities of the GHSOM even for large datasets.
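One common way to fit SOM training into the Map-Reduce model is the batch update, in which each partition accumulates partial sums that are then added together. The sketch below simulates this in plain Python on a one-dimensional map; it is not the thesis's Spark code, and the function names, the Gaussian neighborhood, and the toy data are hypothetical.

```python
import math
from functools import reduce


def map_partition(partition, weights, radius):
    """Map step: per-neuron weighted sums (numerators) and neighborhood
    totals (denominators) for one partition of the data."""
    k, d = len(weights), len(weights[0])
    num = [[0.0] * d for _ in range(k)]
    den = [0.0] * k
    for x in partition:
        bmu = min(range(k), key=lambda j: math.dist(weights[j], x))
        for j in range(k):
            h = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
            den[j] += h
            for c in range(d):
                num[j][c] += h * x[c]
    return num, den


def combine(a, b):
    """Reduce step: element-wise addition of two partial results."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def batch_epoch(partitions, weights, radius):
    """One batch SOM epoch: new weight = weighted mean of all inputs."""
    num, den = reduce(combine,
                      (map_partition(p, weights, radius) for p in partitions))
    return [[v / dj for v in nj] if dj > 0 else wj
            for nj, dj, wj in zip(num, den, weights)]


data = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
weights = [[0.0, 0.0], [10.0, 10.0]]
single = batch_epoch([data], weights, radius=0.5)
split = batch_epoch([data[:1], data[1:3], data[3:]], weights, radius=0.5)
```

Because the partial sums simply add, the epoch result does not depend on how the data is partitioned (up to floating-point rounding), so each pass over a massive dataset can be spread across workers.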
Moreover, the conventional GHSOM algorithm can handle datasets with numeric attributes only, because it relies heavily on Euclidean dissimilarity measures between attribute vectors. The MR-GHSOM further extends the GHSOM to handle datasets with mixed (numeric and categorical) attributes. It accomplishes this by adopting the distance hierarchy approach to managing mixed attribute datasets.
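The idea behind a distance hierarchy can be sketched in simplified form: categorical values sit at the leaves of a concept tree, and the distance between two values is the length of the path connecting them through their lowest common ancestor. The hierarchy below, the unit link weights, and the way numeric and categorical distances are combined are all assumptions made for illustration; the thesis's actual scheme may differ.

```python
import math

# Hypothetical concept hierarchy (child -> parent); "Any" is the root.
# Every link is assumed to have weight 1.
PARENT = {
    "Coke": "Drink",
    "Pepsi": "Drink",
    "Bread": "Food",
    "Drink": "Any",
    "Food": "Any",
}


def path_to_root(node):
    """Nodes visited when walking from a value up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def categorical_distance(a, b):
    """Number of links on the path between a and b through their
    lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    for steps_from_b, n in enumerate(path_to_root(b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("values do not share a hierarchy")


def mixed_distance(x, y):
    """Euclidean combination of per-attribute distances: hierarchy path
    length for categorical values, absolute difference for numeric ones."""
    total = 0.0
    for xv, yv in zip(x, y):
        if isinstance(xv, str):
            d = categorical_distance(xv, yv)
        else:
            d = abs(xv - yv)
        total += d * d
    return math.sqrt(total)


same_branch = categorical_distance("Coke", "Pepsi")
cross_branch = categorical_distance("Coke", "Bread")
combined = mixed_distance(("Coke", 3.0), ("Pepsi", 0.0))
```

Unlike a plain match/mismatch measure, the hierarchy makes "Coke" closer to "Pepsi" than to "Bread". In practice the numeric attributes would also need to be scaled so that neither kind of attribute dominates the combined distance.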
The proposed MR-GHSOM is thus capable of handling massive datasets containing mixed attributes. To demonstrate the effectiveness of the MR-GHSOM in terms of clustering of mixed attribute datasets, we present the results produced by the MR-GHSOM on some popular datasets. We further train our MR-GHSOM on a Census dataset containing mixed attributes and provide an analysis of the results.
|
4 |
財務報表舞弊之探索研究 / Exploring financial reporting fraud
徐國英 (Unknown Date)
Financial reporting fraud leads not only to significant investment risks for external stockholders, but also to financial crises for the capital market. Although fraudulent financial reporting has drawn much attention, there is far less research on it than on predicting financial distress or bankruptcy. Furthermore, one purpose of exploring the various forms of financial reporting fraud is to obtain a better understanding of a corporation by investigating its financial and corporate governance indicators. This study addresses the challenge by proposing an approach with the following four phases: (1) to identify a set of financial and corporate governance indicators that are significantly correlated with financial reporting fraud; (2) to use the Growing Hierarchical Self-Organizing Map (GHSOM) to cluster data from normal and fraudulent listed corporations; (3) to extract knowledge about financial reporting fraud by observing the hierarchical relationships displayed in the trained GHSOM; and (4) to justify the extracted knowledge. The proposed approach is feasible because researchers claim that the GHSOM can discover hidden hierarchical relationships in high-dimensional data.
|