Data sets with missing values are a pervasive problem within medical research. Building lifetime prediction models based solely upon complete-case data can bias the results, so imputation is preferred over listwise deletion. In this thesis, artificial neural networks (ANNs) are used as a prediction model on simulated data with which to compare various imputation approaches. The construction and optimization of ANNs is discussed in detail, and some guidelines are presented for activation functions, number of hidden layers and other tunable parameters. For the simulated data, binary lifetime prediction at five years was examined. The ANNs here performed best with tanh activation, binary cross-entropy loss with softmax output and three hidden layers of between 15 and 25 nodes. The imputation methods examined are random, mean, missing forest, multivariate imputation by chained equations (MICE), pooled MICE with imputed target and pooled MICE with non-imputed target. Random and mean imputation performed poorly compared to the others and were used as a baseline comparison case. The other algorithms all performed well up to 50% missingness. There were no statistical differences between these methods below 30% missingness, however missing forest had the best performance above this amount. It is therefore the recommendation of this thesis that the missing forest algorithm is used to impute missing data when constructing ANNs to predict breast cancer patient survival at the five-year mark.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-150339 |
Date | January 2018 |
Creators | Raoufi-Danner, Torrin |
Publisher | Linköpings universitet, Statistik och maskininlärning |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.002 seconds