
Categorical data imputation using non-parametric or semi-parametric imputation methods

A research report submitted to the Faculty of Science, University of the Witwatersrand, for the degree of Master of Science by Coursework and Research Report.

Researchers and data analysts often encounter problems when analysing data with missing values. Methods for imputing continuous data are well developed in the literature, but methods for imputing categorical data are not well established. This research report focuses on categorical data imputation using non-parametric and semi-parametric methods. The aims of the study are to compare different imputation methods for categorical data and to assess the quality of the imputation. Three imputation methods are compared, namely multiple imputation, hot deck imputation and random forest imputation. Missing data are created on a complete data set using the missing completely at random mechanism. The imputed data sets are compared with the original complete data set, and the imputed values that match the values in the original data set are counted. The analysis revealed that the hot deck imputation method is more precise than the random forest and multiple imputation methods. Logistic regression is fitted on the imputed data sets and on the original data set, and the resulting models are compared. The analysis shows that the multiple imputation method negatively affects the model fit of the logistic regression.
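The evaluation procedure described in the abstract (delete values completely at random from a complete data set, impute them, then count how many imputed values match the originals) can be illustrated with a minimal sketch. This is not the author's implementation: the function names `make_mcar`, `hot_deck_impute` and `agreement_rate`, the synthetic toy data, the 20% missingness rate, and the choice of a simple random hot deck (donor drawn from the observed values of the same variable) are all assumptions made for illustration.

```python
import random


def make_mcar(rows, rate, rng):
    """Return a copy of rows in which each cell is set to None with
    probability `rate`, independently of all data (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def hot_deck_impute(rows, rng):
    """Simple random hot deck (illustrative assumption): fill each missing
    cell with a value drawn from the observed donors in the same column."""
    n_cols = len(rows[0])
    donors = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    imputed = []
    for row in rows:
        new_row = []
        for j, v in enumerate(row):
            if v is None and donors[j]:
                new_row.append(rng.choice(donors[j]))
            else:
                # Observed value, or no donor available: keep as is.
                new_row.append(v)
        imputed.append(new_row)
    return imputed


def agreement_rate(original, masked, imputed):
    """Share of deleted cells whose imputed value equals the original."""
    total = hits = 0
    for o_row, m_row, i_row in zip(original, masked, imputed):
        for o, m, i in zip(o_row, m_row, i_row):
            if m is None:
                total += 1
                hits += int(i == o)
    return hits / total if total else float("nan")


# Hypothetical toy data: two categorical variables, 200 complete records.
rng = random.Random(0)
complete = [[rng.choice("ABC"), rng.choice(["yes", "no"])] for _ in range(200)]

masked = make_mcar(complete, 0.2, rng)
imputed = hot_deck_impute(masked, rng)
rate = agreement_rate(complete, masked, imputed)
print(f"Share of imputed values matching the original: {rate:.2f}")
```

Only the deleted cells enter the agreement count, so the observed cells, which every method leaves unchanged, do not inflate the score. Comparing several methods under this scheme would amount to applying each imputation function to the same `masked` data set and comparing the resulting rates.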

Identifier: oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:wits/oai:wiredspace.wits.ac.za:10539/20380
Date: 11 May 2016
Creators: Khosa, Floyd Vukosi
Source Sets: South African National ETD Portal
Language: English
Detected Language: English
Type: Thesis
Format: application/pdf, application/pdf, application/pdf, application/pdf
