61

Reprezentace textu a její vliv na kategorizaci / Representation of Text and Its Influence on Categorization

Šabatka, Ondřej January 2010 (has links)
The thesis deals with machine processing of textual data. The theoretical part describes issues related to natural language processing and introduces different approaches to pre-processing and representing text. The thesis also focuses on the use of N-grams as features for document representation and describes algorithms used for their extraction. The next part outlines the classification methods used. In the practical part, an application for pre-processing and for creating different representations of textual data is designed and implemented. In the experiments performed, the influence of these representations on the accuracy of classification algorithms is analysed.
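
A minimal sketch of the kind of pipeline the abstract describes, comparing N-gram representations by the classification accuracy they yield. It assumes scikit-learn; the tiny corpus, the labels, and the choice of logistic regression are illustrative assumptions, not the thesis's actual setup.

# Compare bag-of-n-gram representations by cross-validated accuracy.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat", "dogs bark at night",
        "stock prices fell sharply", "markets rallied on earnings"]
labels = [0, 0, 1, 1]  # 0 = animals, 1 = finance (hypothetical classes)

for ngram_range in [(1, 1), (1, 2), (2, 3)]:
    # ngram_range controls the granularity of the extracted features.
    pipeline = make_pipeline(CountVectorizer(ngram_range=ngram_range),
                             LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipeline, docs, labels, cv=2)
    print(ngram_range, scores.mean())
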
62

Semantic Topic Modeling and Trend Analysis

Mann, Jasleen Kaur January 2021 (has links)
This thesis focuses on finding an end-to-end unsupervised solution to a two-step problem: extracting semantically meaningful topics from a large temporal text corpus and analysing the trends of these topics. To achieve this, the focus is on using the latest developments in Natural Language Processing (NLP) related to pre-trained language models like Google's Bidirectional Encoder Representations from Transformers (BERT) and other BERT-based models. These transformer-based pre-trained language models provide word and sentence embeddings based on the context of the words. The results are then compared with traditional machine learning techniques for topic modeling. This is done to evaluate whether the quality of topic models has improved and how dependent the techniques are on manually defined model hyperparameters and data preprocessing. These topic models provide a good mechanism for summarizing and organizing a large text corpus and give an overview of how the topics evolve with time. In the context of research publications or scientific journals, such analysis of the corpus can give an overview of research and scientific interest areas and how these interests have evolved over the years. The dataset used for this thesis consists of research articles and papers from one journal, the 'Journal of Cleaner Production', which contained more than 24,000 research articles at the time of this project. We started by implementing Latent Dirichlet Allocation (LDA) topic modeling. In the next step, we implemented LDA along with document clustering to get topics within these clusters. This gave us an idea of the dataset and also gave us a benchmark. After obtaining these base results, we explored transformer-based contextual word and sentence embeddings to evaluate whether this leads to more meaningful, contextual, and semantic topics. For document clustering, we have used K-means clustering. In this thesis, we also discuss methods to optimally visualize the topics and the trend changes of these topics over the years. Finally, we conclude with a method for leveraging contextual embeddings using BERT and Sentence-BERT to solve this problem and achieve semantically meaningful topics. We also discuss the results from traditional machine learning techniques and their limitations.
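
The two modeling routes compared in the thesis can be sketched roughly as follows, assuming scikit-learn and the sentence-transformers package; the checkpoint name 'all-MiniLM-L6-v2', the toy abstracts, and the cluster counts are assumptions for illustration, not the thesis's configuration.

# Route 1: an LDA baseline on bag-of-words counts.
# Route 2: contextual sentence embeddings clustered with K-means.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

abstracts = ["recycling of industrial waste water",
             "life cycle assessment of solar panels",
             "supply chain emissions reporting"]

terms = CountVectorizer(stop_words="english")
counts = terms.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
vocab = terms.get_feature_names_out()
for topic in lda.components_:
    print([vocab[i] for i in topic.argsort()[-3:]])  # top 3 words per topic

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = encoder.encode(abstracts)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings))
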
63

Investigating the Correlation Between Marketing Emails and Receivers Using Unsupervised Machine Learning on Limited Data : A comprehensive study using state of the art methods for text clustering and natural language processing / Undersökning av samband mellan marknadsföringsemail och dess mottagare med hjälp av oövervakad maskininlärning på begränsad data

Pettersson, Christoffer January 2016 (has links)
The goal of this project is to investigate any correlation between marketing emails and their receivers using machine learning and only a limited amount of initial data. The data consists of roughly 1,200 emails and 98,000 receivers of these. Initially, the emails are grouped together based on their content using text clustering. They contain no information regarding prior labeling or categorization, which creates a need for an unsupervised learning approach using solely the raw text-based content as data. The project investigates state-of-the-art concepts like bag-of-words for calculating term importance and the gap statistic for determining an optimal number of clusters. The data is vectorized using term frequency-inverse document frequency (TF-IDF) to determine the importance of terms relative to the document and to all documents combined. An inherent problem of this approach is high dimensionality, which is reduced using latent semantic analysis in conjunction with singular value decomposition. Once the resulting clusters have been obtained, the most frequently occurring terms for each cluster are analyzed and compared. Due to the absence of initial labeling, an alternative approach is required to evaluate the clusters' validity. To do this, the receivers in each cluster who actively opened an email are collected and investigated. Each receiver has different attributes regarding their purpose of using the service and some personal information. Once these were gathered and analyzed, the conclusion could be drawn that it is possible to find distinguishable connections between the resulting email clusters and their receivers, but only to a limited extent. Receivers from the same cluster showed similar attributes that were distinguishable from the receivers of other clusters. Hence, the resulting email clusters and their receivers are specific enough to distinguish themselves from each other but too general to handle more detailed information. With more data, this could become a useful tool for determining which users of a service should receive a particular email to increase the conversion rate and thereby reach more relevant people based on previous trends.
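
A compact sketch of the clustering pipeline outlined above (TF-IDF vectorization, LSA via truncated SVD, then K-means), assuming scikit-learn. The placeholder emails and the fixed k are assumptions; the thesis selects k with the gap statistic, which scikit-learn does not provide out of the box.

# TF-IDF -> LSA (truncated SVD) -> K-means on a toy email corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

emails = ["summer sale on shoes", "discount on running shoes",
          "your invoice is attached", "payment receipt for order"]

tfidf = TfidfVectorizer().fit_transform(emails)     # term importance weights
lsa = TruncatedSVD(n_components=2, random_state=0)  # reduce dimensionality
reduced = lsa.fit_transform(tfidf)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)  # cluster assignment per email
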
64

Performance Benchmarking and Cost Analysis of Machine Learning Techniques : An Investigation into Traditional and State-Of-The-Art Models in Business Operations / Prestandajämförelse och kostnadsanalys av maskininlärningstekniker : en undersökning av traditionella och toppmoderna modeller inom affärsverksamhet

Lundgren, Jacob, Taheri, Sam January 2023 (has links)
As society becomes more data-driven, Artificial Intelligence (AI) and Machine Learning are revolutionizing how companies operate and evolve. This study explores the use of AI, Big Data, and Natural Language Processing (NLP) in improving business operations and intelligence in enterprises. The primary objective of this thesis is to examine whether the current classification process at the host company can be maintained with reduced operating costs, specifically lower cloud GPU costs. This could improve the classification method, enhance the product the company offers its customers through increased classification accuracy, and strengthen its value proposition. Furthermore, three approaches are evaluated against each other, and the implementations showcase the evolution within the field. The models compared in this study include traditional machine learning methods such as Support Vector Machine (SVM) and Logistic Regression, alongside state-of-the-art transformer models like BERT, both pre-trained and fine-tuned. The thesis shows a trade-off between performance and cost, illustrating the problem that many companies, like Valu8, face when evaluating which approach to implement. This trade-off is discussed and analyzed in further detail to explore possible compromises from each perspective and to strike a balanced solution that combines performance efficiency and cost-effectiveness.
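
The cost dimension of this trade-off can be made concrete with the traditional side of the comparison: a linear model over TF-IDF features trains in milliseconds on a CPU, while a GPU-hosted BERT model sits at the other end. A minimal sketch assuming scikit-learn; the texts, labels, and class meanings are invented placeholders.

# Time the cheap CPU baselines weighed against a cloud-GPU transformer.
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

texts = ["invoice overdue", "meeting at noon", "payment failed", "lunch plans"]
labels = [1, 0, 1, 0]  # 1 = finance-related, 0 = other (hypothetical classes)

features = TfidfVectorizer().fit_transform(texts)
for model in (LinearSVC(), LogisticRegression(max_iter=1000)):
    start = time.perf_counter()
    model.fit(features, labels)
    elapsed = time.perf_counter() - start
    # Training-set accuracy, only as an illustration of the output.
    print(type(model).__name__, f"train time: {elapsed:.4f}s",
          "accuracy:", model.score(features, labels))
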
65

Regroupement de textes avec des approches simples et efficaces exploitant la représentation vectorielle contextuelle SBERT / Text Clustering with Simple and Effective Approaches Exploiting the SBERT Contextual Vector Representation

Petricevic, Uros 12 1900 (has links)
Clustering is an unsupervised task that gathers similar elements in the same cluster and different elements in distinct clusters. Text clustering is performed by representing texts in a vector space and studying their similarity in this space. The best results are obtained using neural models that fine-tune contextual embeddings in an unsupervised manner. However, these techniques can require a significant amount of training time, and their performance is not compared against simpler techniques that do not require training neural models. In this master's thesis, we propose a study of the current state of the field. First, we study the best evaluation metrics for text clustering. Then, we evaluate the state of the art and take a critical look at its training protocol. We also propose an analysis of some implementation choices in text clustering, such as the choice of clustering algorithm, similarity measure, contextual embeddings, or unsupervised fine-tuning of the contextual embeddings. Finally, we test the combination of contextual embeddings with techniques that do not require training, such as data preprocessing, dimensionality reduction, or the inclusion of Tf-idf. Our experiments demonstrate some shortcomings in the state of the art regarding the choice of evaluation metrics and the training protocol. Furthermore, we demonstrate that the use of simple techniques yields results that are better than or similar to those of sophisticated methods requiring the training of neural models. Our experiments are evaluated on eight benchmark corpora from different domains.
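
A sketch of the kind of simple, training-free recipe the thesis argues for: SBERT embeddings, optionally augmented with Tf-idf, clustered with K-means. It assumes the sentence-transformers package; the checkpoint name and the toy texts are assumptions, not the thesis's exact choices.

# SBERT embeddings, optionally combined with Tf-idf, then K-means.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["the match ended in a draw", "the striker scored twice",
         "parliament passed the bill", "the senate vote was delayed"]

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint
dense = sbert.encode(texts, normalize_embeddings=True)
sparse = TfidfVectorizer().fit_transform(texts).toarray()

# One "simple technique" to test: augment SBERT vectors with Tf-idf features.
combined = np.hstack([dense, sparse])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(combined))
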
66

Help Document Recommendation System

Vijay Kumar, Keerthi, Mary Stanly, Pinky January 2023 (has links)
Help documents are important for an organization's use of the technology applications licensed from a vendor. Customers and internal employees frequently use and interact with the help documents section to operate the applications and to learn about new features and developments in them. Help documents consist of various knowledge-base materials, question-and-answer documents, and help content. In day-to-day life, customers go through these documents to set up, install, or use the product. Recommending similar documents to customers can increase customer engagement with the product and can also help them proceed without hurdles. The main aim of this study is to build a recommendation system, exploring different machine-learning techniques, that recommends the most relevant and similar help document to the user. To achieve this, a hybrid recommendation system for help documents is proposed in which documents are recommended based on the similarity of their content, using content-based filtering, and on the similarity between users, using collaborative filtering. Finally, the recommendations from content-based filtering and collaborative filtering are combined and ranked to form a comprehensive list of recommendations; a rough sketch of such a scheme follows. The proposed approach is evaluated by the company's internal employees and by external users. Our experimental results demonstrate that the proposed approach is feasible and provides an effective way to recommend help documents.
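
A rough sketch of such a hybrid scheme, assuming scikit-learn and NumPy; the documents, the interaction matrix, and the blending weight alpha are illustrative assumptions rather than the system described in the thesis.

# Blend a content-based score (cosine over TF-IDF) with a collaborative score.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["how to install the client", "troubleshooting login errors",
        "release notes for version 2", "setting up single sign-on"]
# Rows = users, columns = documents; 1 means the user opened the document.
interactions = np.array([[1, 1, 0, 0],
                         [0, 1, 0, 1],
                         [0, 0, 1, 0]])

content_sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
user_sim = cosine_similarity(interactions)

def recommend(user, alpha=0.5):
    # Content score: similarity of docs to those the user already read.
    content = interactions[user] @ content_sim
    # Collaborative score: what similar users read.
    collab = user_sim[user] @ interactions
    combined = alpha * content + (1 - alpha) * collab
    combined[interactions[user] == 1] = -np.inf  # don't re-recommend
    return np.argsort(combined)[::-1]            # ranked document indices

print(recommend(0))
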
67

Mining of Textual Data from the Web for Speech Recognition

Kubalík, Jakub January 2010 (has links)
The initial goal of this project was to study the problem of language modeling for speech recognition and techniques for obtaining textual data from the Web. The text introduces the basic techniques of speech recognition and describes in more detail language models based on statistical methods. In particular, the work deals with criteria for evaluating the quality of language models and speech recognition systems. The text further describes models and techniques of data mining, especially information retrieval. Problems associated with obtaining data from the web are then presented, and the Google search engine is introduced in contrast to them. Part of the project was the design and implementation of a system for obtaining text from the web, whose detailed description receives due attention. The main goal of the work, however, was to verify whether data obtained from the Web can bring any benefit to speech recognition. The described techniques therefore attempt to find the optimal way to use data obtained from the Web to improve both sample language models and models deployed in real recognition systems.
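
One way to make the evaluation of language-model quality concrete is perplexity, the standard criterion in this area (the abstract does not name its exact criteria). A toy bigram model with add-one smoothing, using only the Python standard library; both corpora are placeholders, and real models for speech recognition are far larger with more refined smoothing.

# Perplexity of a bigram language model with add-one (Laplace) smoothing.
import math
from collections import Counter

train = "the weather is nice today and the weather will stay nice".split()
test = "the weather is nice".split()

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
vocab = len(unigrams)

def bigram_prob(w1, w2):
    # Add-one smoothing avoids zero probabilities for unseen bigrams.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

log_prob = sum(math.log(bigram_prob(w1, w2)) for w1, w2 in zip(test, test[1:]))
perplexity = math.exp(-log_prob / (len(test) - 1))
print(f"perplexity: {perplexity:.2f}")  # lower is better
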
68

Impacts of Climate Change on IDF Relationships for Design of Urban Stormwater Systems

Saha, Ujjwal January 2014 (has links) (PDF)
Increasing global mean temperature, or global warming, has the potential to affect the hydrologic cycle. According to the UN Intergovernmental Panel on Climate Change (IPCC), alterations in the frequency and magnitude of high-intensity rainfall events are very likely in the 21st century. An increasing trend of urbanization across the globe is noticeable simultaneously. These changes will have a great impact on water infrastructure as well as the environment in urban areas. One of the impacts may be an increase in the frequency and extent of flooding. India, in recent years, has witnessed a number of urban floods that have resulted in huge economic losses, an instance being the flooding of Mumbai in July 2005. To prevent catastrophic flood damage, it has become increasingly important to understand the likely changes in extreme rainfall in the future, their effect on urban drainage systems, and the measures that can be taken to prevent or reduce the damage. Reliable estimation of future design rainfall intensity, accounting for uncertainties due to climate change, is an important research issue. In this context, rainfall intensity-duration-frequency (IDF) relationships are among the most extensively used hydrologic tools in the planning, design, and operation of drainage-related infrastructure in urban areas. There is thus a need for a study that investigates the potential effects of climate change on IDF relationships. The main aim of the research reported in this thesis is to investigate the effect of climate change on the intensity-duration-frequency relationship in an urban area. Rainfall in Bangalore City is used as a case study to demonstrate the applications of the methodologies developed in the research. Before studying future changes, it is essential to investigate the signature of changes in the observed hydrological and climatological data series. Initially, the yearly mean temperature records are studied to find the signature of global warming. It is observed that the temperature of Bangalore City shows evidence of a warming trend at a statistical confidence level of 99.9%, and that the warming effect is visible as an increase in minimum temperature at a rate higher than that of maximum temperature. Interdependence studies between temperature and extreme rainfall reveal that, up to a certain range, an increase in temperature intensifies short-duration rainfall at a rate greater than that of average rainfall. From these two findings, it is clear that short-duration rainfall intensities may intensify in the future due to global warming and the urban heat island effect. Possible urbanization signatures in extreme rainfall, in terms of intensification in the evenings and on weekends, are also inferred, although inconclusively. IDF relationships are developed with historical data, and changes in the characteristics of long-term daily rainfall extremes are studied. Multidecadal oscillations in the daily rainfall extreme series are also examined. Further, non-parametric trend analyses of various indices of extreme rainfall are carried out, confirming a trend of increase in extreme rainfall amount and frequency; it is therefore essential to study the effects of climate change on the IDF relationships of Bangalore City. Estimation of future changes in rainfall at the hydrological scale generally relies on simulations of future climate provided by Global Climate Models (GCMs).
Due to spatial and temporal resolution mismatch, GCM results need to be downscaled to obtain information at the station scale and at the time resolutions necessary in the context of urban flooding. Downscaling extreme rainfall characteristics at an urban station scale poses the following challenges: (1) the downscaling methodology should be efficient enough to simulate rainfall at the tail of the rainfall distribution (e.g., annual maximum rainfall); (2) downscaling at hourly, or up to a few minutes', temporal resolution is required; and (3) various uncertainties, such as GCM uncertainties, future scenario uncertainties, and uncertainties due to the statistical methodologies themselves, need to be addressed. To overcome the first challenge, a stochastic rainfall generator is developed for spatial downscaling of GCM precipitation flux information to the station scale, yielding the daily annual maximum rainfall series (AMRS). Although Regional Climate Models (RCMs) are meant to simulate precipitation at regional scales, they fail to simulate extreme events accurately. Transfer-function-based methods and weather-typing techniques are also generally inefficient in simulating extreme events. Due to its stochastic nature, a rainfall generator is better suited for extreme event generation. An algorithm for stochastic simulation of rainfall, which simulates both mean and extreme rainfall satisfactorily, is developed in the thesis and used for future projection of rainfall by perturbing the parameters of the rainfall generator for future time periods. In this study, instead of the customary two-state (rain/dry) Markov chain, a three-state hybrid Markov chain is developed. The three states used in the Markov chain are: dry day, moderate rain day, and heavy rain day. The model first decides whether a day is dry or rainy, like the traditional weather generator (WGEN), using two transition probabilities: the probability of a rain day following a dry day (P01) and the probability of a rain day following a rain day (P11). The state of a rain day is then further classified as a moderate rain day or a heavy rain day; for this purpose, rainfall above the 90th percentile of the non-zero precipitation distribution is termed a heavy rain day. The state of a day is assigned based on these transition probabilities and a uniform random number. The rainfall amount is generated by the Monte Carlo method for moderate and heavy rain days separately, with two different gamma distributions fitted for the two classes. Segregating rain days into two classes improves the generation of extreme rainfall. To overcome the second challenge, i.e., the requirement of finer temporal scales, the daily-scale IDF ordinates are disaggregated into hourly and sub-hourly durations. Disaggregating a continuous rainfall time series at a sub-hourly scale requires continuous rainfall data at a fine scale (15 minutes), which is not available for most Indian rain gauge stations. Hence, the scale invariance properties of extreme rainfall time series over various rainfall durations are investigated through the scaling behavior of the non-central moments (NCMs) of the generalized extreme value (GEV) distribution. These scale invariance properties are then used to disaggregate the distributional properties of daily rainfall to hourly and sub-hourly scales.
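
A toy version of the three-state generator described above, assuming NumPy; all parameter values (P01, P11, the heavy-day fraction, and the gamma parameters) are illustrative assumptions, not calibrated values from the thesis.

# Three-state hybrid Markov chain rainfall generator: dry / moderate / heavy.
import numpy as np

rng = np.random.default_rng(0)
P01, P11 = 0.3, 0.6        # P(rain | dry yesterday), P(rain | rain yesterday)
p_heavy = 0.1              # heavy rain day = above the 90th percentile
moderate_shape, moderate_scale = 0.8, 6.0   # gamma params, moderate days (mm)
heavy_shape, heavy_scale = 2.0, 25.0        # gamma params, heavy days (mm)

def simulate(n_days):
    rain = np.zeros(n_days)
    wet_yesterday = False
    for day in range(n_days):
        # Occurrence: dry or rainy, from the transition probabilities.
        p_wet = P11 if wet_yesterday else P01
        wet_yesterday = rng.random() < p_wet
        if wet_yesterday:
            # Amount: separate gamma distributions for the two rain states.
            if rng.random() < p_heavy:
                rain[day] = rng.gamma(heavy_shape, heavy_scale)   # heavy day
            else:
                rain[day] = rng.gamma(moderate_shape, moderate_scale)
    return rain

series = simulate(365)
print("wet days:", (series > 0).sum(), "annual max (mm):", series.max().round(1))
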
Assuming the scaling relationships to be stationary, future sub-hourly and hourly IDF relationships are developed. Uncertainties associated with climate change impacts arise from the existence of several GCMs developed by different institutes across the globe, from the climate simulations available for different representative concentration pathway (RCP) scenarios, and from the diverse statistical techniques available for downscaling. Downscaled output from a single GCM with a single emission scenario represents only one trajectory of all possible future climate realizations and cannot be representative of the full extent of climate change. Therefore, a comprehensive assessment of future projections should use the collective information from an ensemble of GCM simulations. In this study, 26 different GCMs and 4 RCP scenarios are taken into account to arrive at a range of IDF curves for different future time periods. The reliability ensemble averaging (REA) method is used to obtain a weighted average from the ensemble of projections. Scenario uncertainty is not addressed in this study. Two different downscaling techniques (viz., delta change and the stochastic rainfall generator) are used to assess the uncertainty due to downscaling techniques. From the results, it can be concluded that the delta change method underestimated extreme rainfall compared to the rainfall generator approach. This study also confirms that the delta change method is not suitable for impact studies related to changes in extreme events, a finding similar to some earlier studies. Thus, mean IDF relationships for three different future periods and four RCP scenarios are simulated using the rainfall generator, the scaling GEV method, and the REA method. The results suggest that shorter-duration rainfall will intensify more due to climate change, with the change in rainfall intensities likely to be in the range of 20% to 80% across all durations. Finally, the projected future rainfall intensities are used to investigate the possible impact of climate change on the existing drainage system of the Challaghatta valley in Bangalore City by running the Storm Water Management Model (SWMM) for the historical period and for the best- and worst-case scenarios for three future time periods: 2021–2050, 2051–2080, and 2071–2100. The results indicate that the existing drainage is inadequate for the current condition as well as for future scenarios. The number of flooded nodes will increase over time, and a large change in runoff volume is projected. Modifications of the drainage system are suggested, providing storage ponds to store the excess high-speed runoff in order to restrict the width of the drains. The main research contribution of this thesis thus comes from an analysis of trends of extreme rainfall in an urban area, followed by projection of changes in the IDF relationships under climate change scenarios and quantification of the uncertainties in the projections.
69

Využití metod dolování dat pro analýzu sociálních sítí / Using of Data Mining Method for Analysis of Social Networks

Novosad, Andrej January 2013 (has links)
The thesis discusses data mining of social media. It gives an introduction to the topic of data mining and possible mining methods. The thesis also explores social media and social networks, what they are able to offer, and what problems they bring. The APIs of three social networking sites are examined, along with the opportunities they provide for data mining. Techniques of text mining and document classification are explored. The implementation of a web application that mines data from the social site Twitter using the SVM algorithm is described. The implemented application classifies tweets based on their text, where the classes represent the tweets' continents of origin. Several experiments, executed both in the RapidMiner software and in the implemented web application, are then presented and their results examined.
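
A minimal sketch of the classification step, assuming scikit-learn: an SVM over TF-IDF features assigning tweets to a continent of origin. The tweets and labels are invented placeholders, not data from the thesis.

# SVM text classification: tweet text -> continent of origin.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

tweets = ["enjoying carnival in rio", "traffic jam in sao paulo",
          "rainy monday in london", "tube strike again in london"]
continents = ["South America", "South America", "Europe", "Europe"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(tweets, continents)
print(model.predict(["sunny afternoon in paris"]))
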
70

Studying the effectiveness of dynamic analysis for fingerprinting Android malware behavior / En studie av effektivitet hos dynamisk analys för kartläggning av beteenden hos Android malware

Regard, Viktor January 2019 (has links)
Android is the second most targeted operating system for malware authors, and to counter the development of Android malware, more knowledge about its behavior is needed. There are mainly two approaches to analyzing Android malware, namely static and dynamic analysis. In 2017, a study and a well-labeled dataset named AMD (Android Malware Dataset), consisting of over 24,000 malware samples, were released. It is divided into 135 varieties based on similar malicious behavior, retrieved through static analysis of the file classes.dex in the APK of each malware sample, whereas the labeled features were determined by manual inspection of three samples in each variety. However, static analysis is known to be weak against obfuscation techniques, such as repackaging or dynamic loading, which can be exploited to avoid the analysis. In this study the second approach is utilized, and all malware in the dataset are analyzed at run-time in order to monitor their dynamic behavior. However, analyzing malware at run-time has known weaknesses as well, as it can be evaded through, for instance, anti-emulator techniques. Therefore, the study aimed to explore the available sandbox environments for dynamic analysis, to study the effectiveness of fingerprinting Android malware using one of the tools, and to investigate whether the static features from AMD and the dynamic analysis correlate, for instance by attempting to classify the samples based on similar dynamic features and by calculating the Pearson correlation coefficient (r) for all combinations of features from AMD and the dynamic analysis. The comparison of tools for dynamic analysis showed a need for development, as the most popular tools were released long ago and their common factor is a lack of continuous maintenance. As a result, the sandbox environment chosen for this study was DroidBox, because of aspects like ease of installation and use and easy adaptability for large-scale analysis. Based on the dynamic features extracted with DroidBox, it could be shown that Android malware samples are most similar to the varieties to which they belong. The best metric for classifying samples into varieties, out of the four investigated, turned out to be cosine similarity, which achieved an accuracy of 83.6% for the entire dataset. The high accuracy indicated a correlation between the dynamic features and the static features on which the varieties are based. Furthermore, the Pearson correlation coefficient confirmed that the manually extracted features used to describe the varieties and the dynamic features are correlated to some extent, which could be partially confirmed by a manual inspection at the end of the study.
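
The two measurements reported above can be sketched as follows, assuming scikit-learn and SciPy: assigning a sample to the variety whose mean dynamic-feature vector is closest by cosine similarity, and correlating a dynamic feature with a static one via Pearson's r. All feature vectors are illustrative assumptions, not values from the dataset.

# Cosine-similarity variety assignment plus Pearson correlation of features.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

# Rows = malware samples, columns = counts of dynamic behaviors
# (e.g. file writes, sent SMS, crypto calls) observed in the sandbox.
samples = np.array([[5.0, 0.0, 2.0], [4.0, 1.0, 2.0], [0.0, 7.0, 1.0]])
variety_means = np.array([[4.5, 0.5, 2.0],   # variety A centroid
                          [0.0, 6.0, 1.0]])  # variety B centroid

assigned = cosine_similarity(samples, variety_means).argmax(axis=1)
print("assigned varieties:", assigned)

# Correlation between one dynamic feature and one static feature.
dynamic_feature = samples[:, 1]
static_feature = np.array([0.0, 1.0, 6.0])   # hypothetical static count
r, p_value = pearsonr(dynamic_feature, static_feature)
print(f"Pearson r = {r:.2f}, p = {p_value:.2f}")
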
