1 |
Language Classification of Music Using Metadata. Roxbergh, Linus, January 2019.
The purpose of this study was to investigate how metadata from Spotify could be used to identify the language of songs in a dataset containing nine languages. Features based on song name, album name, genre, regional popularity, and vectors describing songs, playlists and users were analysed individually and in combination with each other in different classifiers. In addition, this report explored how different levels of prediction confidence affect performance, and how the approach compared to a classifier based on audio input. A random forest classifier proved to have the best performance, with an accuracy of 95.4% for the whole dataset. Performance was also investigated when the confidence of the model was taken into account: when only the more confident predictions were kept, accuracy was higher. When keeping the 70% most confident predictions, an accuracy of 99.4% was achieved. The model also proved to be robust to input in languages other than those it was trained on, and managed to filter out unwanted records not matching the languages of the model. A comparison was made to a classifier based on audio input, where the model using metadata performed better on the training and test set used. Finally, a number of possible improvements and future work were suggested.
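The confidence-based filtering described above can be sketched as follows; this is an illustrative reconstruction, not the thesis code, and the class probabilities and labels are made up:

```python
# Sketch: keep only the most confident predictions and measure accuracy
# on that subset (illustrative data, not the thesis dataset).

def accuracy_at_confidence(probs, labels, keep_fraction):
    """probs: list of per-class probability lists; labels: true class indices.
    Keeps the keep_fraction most confident predictions (by max probability)
    and returns accuracy on that subset."""
    scored = []
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=lambda i: p[i])  # argmax class
        scored.append((max(p), pred == y))
    scored.sort(reverse=True)                  # most confident first
    k = max(1, int(len(scored) * keep_fraction))
    kept = scored[:k]
    return sum(correct for _, correct in kept) / len(kept)

# Toy example: four predictions over three classes (languages).
probs = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.4, 0.35, 0.25], [0.8, 0.1, 0.1]]
labels = [0, 1, 0, 0]                          # the second prediction is wrong
print(accuracy_at_confidence(probs, labels, 1.0))   # -> 0.75 (all kept)
print(accuracy_at_confidence(probs, labels, 0.5))   # -> 1.0 (top half only)
```

As in the thesis result, accuracy rises as the kept fraction shrinks, because low-confidence errors are filtered out first.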
|
2 |
Ranking Aspect-Based Features in Restaurant Reviews. Chan, Jacob Ling Hang, 7 December 2020.
Consumers continuously review products and services on the internet, and others frequently rely on those reviews in making purchasing decisions. Review texts are usually free-form and associated with a star rating on a 5-point scale. The majority of restaurants receive a 3.5 or 4 star rating on average, so a standalone star rating does not provide adequate information for readers to make a decision. Many researchers have approached the problem with sentiment analysis, classifying a sentence or a text as expressing a positive or a negative review. Sentiment analysis, even at the fine-grained level, can only provide classification of positive and negative judgments on any particular aspect under consideration. The novel method proposed in this thesis provides insight into which aspects reviewers deem relevant when assigning a star rating to restaurants. This is accomplished using an interpretable star rating classification method that predicts the star rating based on aspect and polarity scores from the review. The model first assigns a polarity score to each aspect in the review text, then predicts a star rating, and outputs a ranked list of aspect importance according to a widely used restaurant reviews dataset. The results of this thesis suggest that the classification model is able to output a reliable ranking from the review texts.
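The idea of an interpretable star-rating model built on aspect polarity scores can be sketched as below. The aspect names, weights, and baseline are invented for illustration; the thesis model learns its ranking from a restaurant reviews dataset rather than using fixed weights:

```python
# Illustrative sketch, not the thesis model: predict a star rating from
# per-aspect polarity scores with a linear scorer, then rank aspects by
# the magnitude of their (here hand-picked) weights.

WEIGHTS = {"food": 1.2, "service": 0.8, "price": 0.4, "ambience": 0.2}
BASELINE = 3.0  # a neutral review maps to 3 stars

def predict_stars(polarity):
    """polarity: aspect -> sentiment score in [-1, 1]."""
    raw = BASELINE + sum(WEIGHTS[a] * s for a, s in polarity.items())
    return max(1, min(5, round(raw)))          # clamp to the 5-point scale

def rank_aspects():
    """Aspect importance: larger absolute weight = more influence on stars."""
    return sorted(WEIGHTS, key=lambda a: abs(WEIGHTS[a]), reverse=True)

review = {"food": 0.9, "service": -0.5, "price": 0.2}
print(predict_stars(review))   # -> 4
print(rank_aspects())          # food ranks first
```

Because the scorer is linear, the ranking is read directly off the weights; that is the sense in which such a model is interpretable.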
|
3 |
Estimating Brain Maturation in Very Preterm Neonates: An Explainable Machine Learning Approach / Estimering av hjärnmognad i mycket prematura spädbarn: En ansats att tillämpa förklarbar maskininlärning. Svensson, Patrik, January 2023.
Introduction: Assessing brain maturation in preterm neonates is essential for the health of the neonates. Machine learning methods have been introduced as a prospective assessment tool for neonatal electroencephalogram (EEG) signals. Explainable methods are essential in the medical field, and more research regarding explainability is needed in the use of machine learning for neonatal EEG analysis. Methodology: This thesis develops an explainable machine learning model that estimates postmenstrual age in very preterm neonates from EEG signals and investigates the importance of the features used in the model. Dual-channel EEG signals were collected from 14 healthy preterm neonates of postmenstrual age spanning 25 to 32 weeks. The signals were converted to amplitude-integrated EEG (aEEG), and a set of features was extracted from the signals. A regression tree model was developed, and the feature importance of the model was assessed using permutation importance and Shapley additive explanations. Results: The model had an RMSE of 1.73 weeks (R² = 0.45, PCC = 0.676). The best feature was the mean amplitude of the lower envelope of the signal, followed by the time the signal spent above 100 µV. Conclusion: The model performs comparably to human experts, and as it can be improved in multiple ways, this result indicates a promising outlook for explainable machine learning applications in neonatal EEG analysis.
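Of the two attribution methods mentioned, permutation importance is simple enough to sketch directly: permute one feature, re-score the model, and treat the increase in error as that feature's importance. The toy model and data below are invented, and a fixed reversal stands in for a random shuffle so the example is reproducible:

```python
# Minimal permutation-importance sketch (illustrative, not the thesis code).
# A real implementation permutes the column randomly, often several times;
# here a deterministic reversal keeps the example reproducible.

def model(x):
    # Toy stand-in for a fitted regressor: depends strongly on feature 0,
    # weakly on feature 1.
    return 2.0 * x[0] + 0.1 * x[1]

def rmse(X, y):
    return (sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y)) ** 0.5

def permutation_importance(X, y, col):
    base = rmse(X, y)
    permuted = [x[col] for x in X][::-1]    # fixed permutation of one column
    Xp = [list(x) for x in X]
    for x, v in zip(Xp, permuted):
        x[col] = v
    return rmse(Xp, y) - base               # error increase = importance

X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [model(x) for x in X]                   # perfect fit: baseline error is 0
print(permutation_importance(X, y, 0))      # large: breaking feature 0 hurts
print(permutation_importance(X, y, 1))      # -> 0.3: feature 1 matters little
```

The same recipe applies to the regression tree in the thesis: whichever feature's permutation degrades the PMA estimate most is ranked most important.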
|
4 |
Statistical Tools for Efficient Confirmation of Diagnosis in Patients with Suspected Primary Central Nervous System Vasculitis. Brooks, John, 27 April 2023.
The management of missing data is a major concern in classification model generation in all fields, but poses a particular challenge when only a small quantity of sparse data is available. In the field of medicine, this is not an uncommon problem. While widely adopted methodologies like logistic regression can, with minor modifications and potentially much labor, provide reasonable insights from the larger and less sparse datasets anticipated when analyzing the diagnosis of common conditions, there are a multitude of rare conditions of interest. Primary angiitis of the central nervous system (PACNS) is a rare but devastating entity that, given its range of presenting symptoms, can be suspected in a variety of circumstances. It unfortunately continues to be a diagnosis that is hard to make. Aside from some general frameworks, there is no rigorously defined diagnostic approach, as there is for other, more common neuroinflammatory conditions like multiple sclerosis. Instead, clinicians currently rely on experience and clinical judgement to guide the reasonable exclusion of potential inciting entities and mimickers. In effect, this results in a small quantity of heterogeneous data that may not be optimally suited for more traditional classification methodology (e.g., logistic regression) without substantial contemplation and justification of appropriate data cleaning / preprocessing. It is therefore challenging to devise and analyze systematic approaches that could direct clinicians in a way that standardizes patient care.
In this thesis, a machine learning approach was presented to derive quantitatively justified insights into the factors that are most important to consider during the diagnostic process for conditions like PACNS. Modern categorization techniques (i.e., random forest and support vector machines) were used to generate diagnostic models identifying cases of PACNS, from which key elements of diagnostic importance could be identified. A novel variant of a random forest (RF) approach was also demonstrated as a means of managing missing data in a small sample, a significant problem encountered when exploring data on rare conditions without clear diagnostic frameworks. A reduced need to hypothesize the reasons for missingness when generating and applying the novel variant was discussed. The application of such tools to diagnostic model generation for PACNS and other rare and/or emerging diseases, and their use in providing objective feedback, was explored. This primarily centered around a structured assessment of how to prioritize testing to rapidly rule out conditions that require alternative management, and could be used to support future guidelines that optimize the care of these patients.
The material presented herein had three components. The first centered around the example of PACNS. It described, in detail, an example of a relevant medical condition and explored why the associated data are both rare and sparse. Furthermore, the reasons for the sparsity are heterogeneous or non-monotonic (i.e., not conducive to modelling with a singular model). This component concluded with a search for candidate variables to diagnose the condition, by means of a scoping review, for subsequent comparative demonstration of the proposed novel variant of random forest construction. The second component discussed machine learning model development and simulated data with varying degrees and patterns of missingness to demonstrate how the models could be applied to data with properties similar to those expected of PACNS-related data. Finally, the described techniques were applied to separate a subset of patients with suspected PACNS from those with diagnosed PACNS using institutional data, and future study was proposed to expand upon and ultimately verify these insights. Further development of the novel random forest approach was also discussed.
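One common baseline for handling missing values inside tree construction (not the thesis's novel variant, which is not reproduced here) is to route rows with a missing feature to whichever child yields the lower impurity. A minimal sketch with invented data:

```python
# Hedged sketch of one standard way tree learners cope with missing values:
# rows whose split feature is missing are assigned to whichever child gives
# the lower weighted Gini impurity. Illustrative only.

def gini(labels):
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)          # fraction of positive labels
    return 2 * p * (1 - p)

def split_impurity(rows, labels, col, threshold):
    left, right, missing = [], [], []
    for x, y in zip(rows, labels):
        if x[col] is None:
            missing.append(y)
        elif x[col] <= threshold:
            left.append(y)
        else:
            right.append(y)

    def weighted(l, r):
        n = len(l) + len(r)
        return (len(l) * gini(l) + len(r) * gini(r)) / n

    # Try the missing rows on each side; keep the better assignment.
    return min(weighted(left + missing, right), weighted(left, right + missing))

rows = [[1.0], [2.0], [None], [8.0], [9.0]]
labels = [0, 0, 0, 1, 1]
print(split_impurity(rows, labels, 0, 5.0))  # -> 0.0: missing row fits left
```

The appeal of approaches in this family is the one the thesis highlights: no explicit model of *why* the value is missing is required at split time.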
|
5 |
Ransomware Detection Using Windows API Calls and Machine Learning. Karanam, Sanjula, 31 May 2023.
Ransomware is an ever-growing issue that has been affecting individuals and corporations since its inception, leading to losses on the order of billions each year. This research builds upon the existing body of research on ransomware detection for Windows-based platforms through behavioral analysis using sandboxing techniques and classification using machine learning (ML), considering the various predefined function calls, known as API (Application Programming Interface) calls, made by ransomware and benign samples as classifying features. The primary aim of this research is to study how the frequency of API calls, made by benign samples and by ransomware samples spanning a large number of families with varied behavior, affects the classification accuracy of various ML algorithms. In an experiment based on this, a quantitative analysis of the ML classification algorithms was performed for input based on the frequency of API calls and for binary input based on the existence of an API call, concluding that considering the frequency of API calls marginally improves the ransomware recall rate. The secondary research question aims to justify the ML classification of ransomware by conducting behavioral analysis of ransomware and goodware in the context of the API calls that had a major effect on the classification. This research was able to provide meaningful insights into the runtime behavior of ransomware and goodware, and how such behavior, including API calls and their frequencies, was in line with the ML-based classification of ransomware. / Master of Science / Ransomware is an ever-growing issue that has been affecting individuals and corporations since its inception, leading to losses on the order of billions each year.
It infects a user machine, encrypts user files or locks the user out of their machine, or both, demanding ransom in exchange for decrypting or unlocking user data. Analyzing ransomware, either statically or behaviorally, is a prerequisite for building detection and countering mechanisms. Behavioral analysis of ransomware is the basis for this research, wherein ransomware is analyzed by executing it in a safe sandboxed environment, such as a virtual machine, to avoid infecting a real user machine, and its runtime characteristics are extracted for analysis. Among these characteristics, the various predefined function calls, known as API (Application Programming Interface) calls, made to the system by ransomware serve as the basis for the classification of ransomware and benign software. After analyzing ransomware samples across various families, as well as benign samples, in a sandboxed environment, and considering API calls as features, the curated dataset was fed to a set of ML algorithms that can extract useful information from the dataset to take classification decisions without human intervention. The research considers the importance of the frequency of API calls for classification accuracy and also states the most important APIs for classification, along with their potential use in the context of ransomware and goodware, to justify the ML classification. Zero-day detection, which refers to testing the accuracy of the trained ML models on unknown ransomware samples and families, was also performed.
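The two input encodings compared above, API call frequencies versus binary presence, can be sketched as follows. The API names and the trace are illustrative, not taken from the thesis dataset:

```python
# Illustrative sketch of the two feature encodings compared in the thesis:
# frequency counts of API calls versus binary presence/absence.
from collections import Counter

API_VOCAB = ["CreateFileW", "WriteFile", "CryptEncrypt", "RegSetValueW"]

def frequency_vector(trace):
    """How many times each vocabulary API was called."""
    counts = Counter(trace)
    return [counts[api] for api in API_VOCAB]

def binary_vector(trace):
    """Whether each vocabulary API was called at all."""
    seen = set(trace)
    return [1 if api in seen else 0 for api in API_VOCAB]

# A ransomware-like trace: repeated encrypt/write calls in a loop.
trace = ["CreateFileW", "CryptEncrypt", "WriteFile",
         "CryptEncrypt", "WriteFile", "CryptEncrypt", "WriteFile"]
print(frequency_vector(trace))  # -> [1, 3, 3, 0]
print(binary_vector(trace))     # -> [1, 1, 1, 0]
```

The frequency encoding preserves the loop-like repetition typical of file-encrypting behavior, which is the signal the thesis finds marginally improves recall.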
|
6 |
Explainability in Deep Reinforcement Learning. Keller, Jonas, 29 October 2024.
With the combination of Reinforcement Learning (RL) and Artificial Neural Networks (ANNs), Deep Reinforcement Learning (DRL) agents have shifted towards being non-interpretable black-box models. Developers of DRL agents, however, could benefit from enhanced interpretability of the agents' behavior, especially during the training process. Improved interpretability could enable developers to make informed adaptations, leading to better overall performance. The explainability methods Partial Dependence Plot (PDP), Accumulated Local Effects (ALE) and SHapley Additive exPlanations (SHAP) were considered to provide insights into how an agent's behavior evolves during training. Additionally, a decision tree as a surrogate model was considered to enhance the interpretability of a trained agent. In a case study, the methods were tested on a Deep Deterministic Policy Gradient (DDPG) agent trained in an Obstacle Avoidance (OA) scenario. PDP, ALE and SHAP were evaluated with respect to their ability to provide explanations, as well as the feasibility of their application in terms of computational overhead. The decision tree was evaluated with respect to its ability to approximate the agent's policy as a post-hoc method. Results demonstrated that PDP, ALE and SHAP were able to provide valuable explanations during training, each method contributing additional information with its individual advantages. However, the decision tree failed to approximate the agent's actions effectively enough to be used as a surrogate model.
List of Figures
List of Tables
List of Abbreviations
1 Introduction
2 Foundations
2.1 Machine Learning
2.1.1 Deep Learning
2.2 Reinforcement Learning
2.2.1 Markov Decision Process
2.2.2 Limitations of Optimal Solutions
2.2.3 Deep Reinforcement Learning
2.3 Explainability
2.3.1 Obstacles for Explainability Methods
3 Applied Explainability Methods
3.1 Real-Time Methods
3.1.1 Partial Dependence Plot
3.1.1.1 Incremental Partial Dependence Plots for Dynamic Modeling Scenarios
3.1.1.2 PDP-based Feature Importance
3.1.2 Accumulated Local Effects
3.1.3 SHapley Additive exPlanations
3.2 Post-Hoc Method: Global Surrogate Model
4 Case Study: Obstacle Avoidance
4.1 Environment Representation
4.2 Agent
4.3 Application Settings
5 Results
5.1 Problems of the Incremental Partial Dependence Plot
5.2 Real-Time Methods
5.2.1 Feature Importance
5.2.2 Computational Overhead
5.3 Global Surrogate Model
6 Discussion
7 Conclusion
Bibliography
Appendix
A Incremental Partial Dependence Results
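Of the real-time methods listed in the contents above, the Partial Dependence Plot is the simplest to sketch: fix one input feature at each grid value, average the model's output over the dataset, and read off how the prediction depends on that feature. The toy model and data below are invented stand-ins, not the DDPG agent from the case study:

```python
# Minimal partial-dependence sketch (illustrative; not the thesis setup).

def model(x):
    # Toy stand-in for an agent/regressor: output rises with feature 0.
    return 3.0 * x[0] + x[1]

def partial_dependence(X, col, grid):
    """Average model output over X with feature `col` forced to each grid value."""
    curve = []
    for v in grid:
        total = 0.0
        for x in X:
            x_mod = list(x)
            x_mod[col] = v                 # force the feature to v
            total += model(x_mod)
        curve.append(total / len(X))       # marginalize over the data
    return curve

X = [[0.0, 1.0], [0.0, 3.0]]               # feature 1 averages to 2.0
print(partial_dependence(X, 0, [0.0, 1.0, 2.0]))  # -> [2.0, 5.0, 8.0]
```

Plotting the returned curve against the grid gives the PDP; repeating this as training progresses is, roughly, how such a method can track an agent's evolving behavior.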
|
7 |
The Impact of the COVID-19 Lockdown on the Urban Air Quality: A Machine Learning Approach. Bobba, Srinivas, January 2021.
SARS-CoV-2, which is responsible for the current pandemic of COVID-19 disease, was first reported from Wuhan, China, on 31 December 2019. Since then, to prevent its propagation around the world, a set of rapid and strict countermeasures has been taken. While most researchers around the world initiated studies on the effect of the COVID-19 lockdown on air quality and concluded that pollution was reduced, the most reliable methods for quantifying the reduction of pollutants in the air are still under debate. In this study, we analysed how COVID-19 lockdown procedures impacted the air quality in selected cities around the world, i.e. New Delhi, Diepkloof, Wuhan, and London. The results show that the air quality index (AQI) improved by 43% in New Delhi, 18% in Wuhan, 15% in Diepkloof, and 12% in London during the initial lockdown, from 19 March 2020 to 31 May 2020, compared to the four-year pre-lockdown period. Furthermore, the concentrations of four main pollutants, i.e., NO2, CO, SO2, and PM2.5, were analyzed before and during the lockdown in India. The quantification of the pollution drop is supported by statistical measurements such as the ANOVA test and the permutation test. Overall, decreases of 58%, 61%, 18%, and 55% are observed in NO2, CO, SO2, and PM2.5 concentrations, respectively. To check whether a change in weather played any role in the reduction of pollution levels, we analyzed how weather factors are correlated with pollutants using a correlation matrix. Finally, machine learning regression models are constructed to assess the lockdown impact on air quality in India by incorporating weather data. Gradient Boosting performed well in predicting the drop in PM2.5 concentration for individual cities in India.
Comparing the feature importance rankings from the regression models with the correlation factors for PM2.5, this study concludes that the COVID-19 lockdown had a significant effect on the natural environment and on air quality improvement.
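The permutation test mentioned above can be sketched as follows; the pollutant values are invented, and an exact test over all group assignments stands in for the usual Monte Carlo sampling:

```python
# Sketch of a two-sample permutation test on the difference of means,
# as used to check whether a drop in mean pollutant concentration is
# significant. Data are illustrative, not the study's measurements.
import itertools

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(before, after):
    """Exact one-sided permutation test: p-value for mean(before) > mean(after)."""
    observed = mean(before) - mean(after)
    pooled = before + after
    n, count, total = len(before), 0, 0
    # Enumerate every way to relabel the pooled data into a "before" group.
    for idx in itertools.combinations(range(len(pooled)), n):
        group = [pooled[i] for i in idx]
        rest = [pooled[i] for i in range(len(pooled)) if i not in idx]
        if mean(group) - mean(rest) >= observed:
            count += 1
        total += 1
    return count / total

before = [80.0, 95.0, 90.0, 85.0]      # e.g. pre-lockdown PM2.5 readings
after = [40.0, 55.0, 50.0, 45.0]       # lockdown readings
p = permutation_test(before, after)
print(p)                               # only 1 of C(8,4)=70 splits is as extreme
```

With every pre-lockdown reading above every lockdown reading, only the observed labelling reaches the observed difference, so the p-value is 1/70.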
|
8 |
Predicting Customer Churn in E-commerce Using Statistical Modeling and Feature Importance Analysis: A Comparison of Random Forest and Logistic Regression Approaches. Rudälv, Amanda, January 2023.
While operating in online markets offers opportunities for expanded assortment and convenience, it also poses challenges such as increased competition and the need to build personal relationships with customers. Customer retention becomes crucial in maintaining a successful business, emphasizing the importance of understanding customer behavior. Traditionally, customer behavior analysis has focused on transactional behavior, such as purchase frequency and spending amounts. However, there has been a shift towards non-transactional behavior, driven by the popularity of loyalty programs that reward customers beyond transactions and aim to make customers feel appreciated and included, regardless of their spending power. This study is conducted at a global retailer with the aim of enhancing the understanding of how non-transactional customer behavior influences customer churn. The approach in this study is to understand such behavior by developing a statistical model and analyzing statistical approaches to feature importance. Two types of approaches for statistical modeling, each with four variations, are assessed: (1) random forest; and (2) logistic regression. Furthermore, three different feature importance methods are considered: (1) Gini importance; (2) permutation importance; and (3) coefficient importance. The results showed that this approach can be used to analyze customer behavior and gain a better understanding of the driving factors for churn. Furthermore, the results showed that random forest approaches outperform logistic regression. With the definition of churn constructed in this study, the most important factors affecting the probability of churn are the customer's number of sessions and inter-session interval.
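The two features the study identifies as most important, the number of sessions and the inter-session interval, can be derived from raw session timestamps as sketched below (illustrative code and data, not the retailer's pipeline):

```python
# Illustrative sketch: derive session-count and mean inter-session-interval
# features from a customer's session timestamps (here, in days).

def session_features(timestamps):
    ts = sorted(timestamps)
    n_sessions = len(ts)
    gaps = [b - a for a, b in zip(ts, ts[1:])]      # days between sessions
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return {"n_sessions": n_sessions, "inter_session_interval": mean_gap}

active = session_features([1, 3, 6, 8, 10])    # frequent visitor
lapsing = session_features([2, 30, 90])        # long gaps: higher churn risk
print(active)    # -> {'n_sessions': 5, 'inter_session_interval': 2.25}
print(lapsing)   # -> {'n_sessions': 3, 'inter_session_interval': 44.0}
```

Features like these are non-transactional: they capture engagement rather than spending, which is the behavioral shift the study focuses on.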
|
9 |
On the impact of geospatial features in real estate appraisal with interpretable algorithms / Om påverkan av geospatiala variabler i fastighetsvärdering med tolkbara algoritmer. Jäger, Simon, January 2021.
Real estate appraisal is the means of defining the market value of land and the property affixed to it. Many different features determine the market value of a property; for example, the distance to the nearest park or the travel time to the central business district may be significant. The use of machine learning in real estate appraisal requires algorithm accuracy and interpretability. Related research often defines these two properties as a trade-off and suggests that more complex algorithms may outperform intrinsically interpretable algorithms. This study tests these claims by examining the impact of geospatial features on interpretable algorithms in real estate appraisal. The experiments use property transactions from Oslo, Norway, and add relative and global geospatial features for all properties using geocoding and spherical distance calculations, such as the distance to the nearest park or to the city center. The experiment implements three intrinsically interpretable algorithms: a linear regression algorithm, a decision tree algorithm, and a RuleFit algorithm. For comparison, it also implements two artificial neural network algorithms as a baseline. This study measures the impact of geospatial features by comparing algorithm performance, in terms of the coefficient of determination and the mean absolute error, without and with geospatial features. Then, the individual impact of each geospatial feature is measured using four feature importance measures: mean decrease impurity, input variable importance, mean decrease accuracy, and Shapley values. The statistically significant results show that geospatial features improve algorithm performance. The improvement is not unique to interpretable algorithms but occurs for all algorithms. Furthermore, the results show that interpretable algorithms are not axiomatically inferior to the tested artificial neural network algorithms.
The distance to the city center and to a nearby hospital are, on average, the most important geospatial features. While important for algorithm performance, precisely what the geospatial features capture remains for future examination.
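The spherical distance calculation used to construct such geospatial features is presumably the standard haversine formula; a sketch with approximate, illustrative coordinates:

```python
# Sketch of the spherical-distance step behind features such as "distance
# to the city center". The haversine formula is standard; the coordinates
# below are rough, illustrative points in Oslo.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Roughly central Oslo to a point a few kilometres west (approximate).
d = haversine_km(59.9115, 10.7340, 59.9270, 10.7000)
print(round(d, 1))                     # a couple of kilometres
```

Computing this distance from each property to a set of geocoded landmarks (park, city center, hospital) yields exactly the kind of relative feature the study analyzes.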
|
10 |
DO FEATURE IMPORTANCE AND FEATURE CENTRALITY DIFFERENTIALLY INFLUENCE SEMANTIC KNOWLEDGE IN INDIVIDUALS WITH APHASIA? Cox, Violet O., 30 November 2009.
No description available.
|