Recent decades have seen an increase in both the volume and complexity of the data used in modern business and technology. A key element in managing these data sets is the use of machine learning algorithms to process structures and find patterns. Variable selection is applied to facilitate and improve these processes by finding and removing redundant variables. One way to achieve this is to eliminate variables based on how strongly they correlate, which is the premise of this thesis. This study examines how removing correlated variables affects the predictive accuracy of six machine learning algorithms. Two restrictions are imposed: first, the correlation between the explanatory variables is set to a high level, and second, each variable’s correlation with the dependent variable is set to a modest level. The hypothesis is that removing highly correlated explanatory variables should not negatively affect accuracy. By conducting a Monte Carlo simulation with three models, each containing a different number of correlated variables, the change in accuracy could be compared and evaluated. The results show a decrease in accuracy for all algorithms except one. The differences are relatively small, the largest being a decrease of 5.49 percentage points. The conclusion is that the hypothesis does not hold when the explanatory variables have only a modest correlation with the dependent variable.
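The elimination step described above — dropping explanatory variables that correlate strongly with one another — can be sketched as follows. This is a minimal illustration, not the thesis's actual procedure: the function name, the greedy keep-first strategy, and the 0.9 threshold are all assumptions for the example; the thesis fixes its own correlation levels.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedily keep columns of X, skipping any column whose absolute
    Pearson correlation with an already-kept column exceeds `threshold`.
    Returns the reduced matrix and the indices of the kept columns."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(corr.shape[0]):
        # keep column j only if it is not highly correlated
        # with any column retained so far
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], kept

# Example: x2 is a near-duplicate of x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # corr(x1, x2) ~ 1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

X_reduced, kept = drop_correlated(X)
print(kept)  # [0, 2] — the near-duplicate column is removed
```

A greedy pass like this keeps the first variable of each correlated pair; other removal orders are possible and can change which variable survives, which is one reason the effect on downstream accuracy is worth measuring.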
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-466389 |
Date | January 2021 |
Creators | Johansson Lannge, Elsa |
Publisher | Uppsala universitet, Statistiska institutionen |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |