Global ETD Search

Return to search

Predicting the Unobserved : A statistical analysis of missing data techniques for binary classification

The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differ, given an imbalanced data set with MCAR missing data. The performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one with a simulated data set and one application using data from the Swedish population registries. The simulation study is created to have the same class imbalance at 1:5. The missing values are handled using three different techniques: complete case analysis, predictive mean matching and mean imputation. The thesis concludes that logistic regression and random forest are on average equally accurate, with some instances of random forest outperforming logistic regression. Logistic regression consistently outperforms random forest with regards to sensitivity. This implies that logistic regression may be the best option for studies where the goal is to accurately predict outcomes in the minority class. None of the missing data techniques stood out in terms of performance.

http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388581

Probability Theory and Statistics

Sannolikhetsteori och statistik

Identifer	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-388581
Date	January 2019
Creators	Säfström, Stella
Publisher	Uppsala universitet, Statistiska institutionen
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess

Page generated in 0.0016 seconds

Predicting the Unobserved : A statistical analysis of missing data techniques for binary classification

Description

Links & Downloads

Tags

Additional Fields