The accuracy of parameter estimates and coverage probability of population values in regression models upon different treatments of systematically missing data

Several methods are available for the treatment of missing data. Most of the methods are
based on the assumption that data are missing completely at random (MCAR). However, data
sets that are MCAR are rare in psycho-educational research. This gives rise to the need for
investigating the performance of missing data treatments (MDTs) with non-randomly or
systematically missing data, an area that has not received much attention by researchers in the
past.
In the current simulation study, the performance of four MDTs, namely, mean
substitution (MS), pairwise deletion (PW), expectation-maximization method (EM), and
regression imputation (RS), was investigated in a linear multiple regression context. Four
investigations were conducted involving four predictors under low and high multiple R², and nine
predictors under low and high multiple R². In addition, each investigation was conducted under
three different sample size conditions (94, 153, and 265). The design factors were missing
pattern (2 levels), percent missing (3 levels), and non-normality (4 levels). Crossed with the three
sample sizes, this design gave rise to 72 treatment conditions (3 × 2 × 3 × 4). The sampling was
replicated one thousand times in each condition.
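
As a rough illustration of two of the four treatments named above, the sketch below applies mean substitution and a simple regression imputation to a toy variable. It is not the thesis's implementation: the function names, the single-predictor setup, and the toy data are assumptions made only for this example, and the exact RS variant used in the study is not stated in the abstract.

```python
import statistics

def mean_substitution(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

def regression_imputation(x, y):
    """Fill missing y entries (None) with predictions from a least-squares
    line y ~ x fitted on the complete cases. x is assumed fully observed."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = statistics.fmean([p[0] for p in pairs])
    my = statistics.fmean([p[1] for p in pairs])
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None else yi
            for xi, yi in zip(x, y)]

# Hypothetical toy data: x1 is complete, x2 is missing its largest value,
# a crude stand-in for systematically (non-randomly) missing data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, None]

print(mean_substitution(x2))          # fills the gap with the observed mean
print(regression_imputation(x1, x2))  # fills the gap from the fitted line
```

On these toy values, mean substitution pulls the filled entry toward the centre of the observed data, while regression imputation extrapolates along the fitted line, which is one reason the two treatments can yield different regression estimates.
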
MDTs were evaluated based on the accuracy of parameter estimates. In addition, the bias in
parameter estimates and the coverage probability of regression coefficients were computed.
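
The two accuracy criteria can be sketched as follows, assuming the usual definitions: mean absolute error (MAE) as the average absolute deviation of the replicated estimates from the population value, and bias as the mean estimate minus the population value. The abstract does not spell out its formulas, so these definitions, the function names, and the numbers are illustrative assumptions.

```python
import statistics

def mean_absolute_error(estimates, true_value):
    """Average absolute distance between each estimate and the population value."""
    return statistics.fmean([abs(e - true_value) for e in estimates])

def bias(estimates, true_value):
    """Mean of the estimates minus the population value (signed)."""
    return statistics.fmean(estimates) - true_value

# Hypothetical R-squared estimates from four replications,
# with an assumed population R-squared of 0.20.
r2_hats = [0.18, 0.25, 0.22, 0.15]
population_r2 = 0.20

print(mean_absolute_error(r2_hats, population_r2))
print(bias(r2_hats, population_r2))
```

In this toy case the over- and underestimates cancel, so the bias is essentially zero while the MAE is not, which shows why a treatment can be the least biased without being the most accurate.
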
The effects of missing pattern, percent missing, and non-normality on the absolute error of the
R² estimate were of practical significance. In the estimation of R², EM was the most accurate under
the low R² condition, and PW was the most accurate under the high R² condition. No MDT was
consistently least biased under low R² condition. However, with nine predictors under the high
R² condition, PW was generally the least biased, with a tendency to overestimate population R².
The mean absolute error (MAE) tended to increase with increasing non-normality and increasing
percent missing. Also, the MAE in R² estimate tended to be smaller under monotonic pattern than
under non-monotonic pattern. MDTs were most differentiated at the highest level of percent
missing (20%), and under non-monotonic missing pattern.
In the estimation of regression coefficients, RS generally outperformed the other MDTs
with respect to accuracy of regression coefficients as measured by MAE. However, EM was
competitive under the four-predictor, low R² condition. MDTs were differentiated only in
the estimation of β₁, the coefficient of the variable with no missing values. MDTs were
undifferentiated in their performance in the estimation of β₂, ..., βₚ (p = 4 or 9), although the MAE
remained much the same across all the regression coefficients. The MAE increased with
increasing non-normality and percent missing, but decreased with increasing sample size. The
MAE was generally greater under non-monotonic pattern than under monotonic pattern. With
four predictors, the least bias was under RS regardless of the magnitude of population R². Under
nine predictors, the least bias was under PW regardless of population R².
The results for coverage probabilities were generally similar to those for the estimation of
regression coefficients, with coverage probabilities closest to the nominal level under RS. As
expected, coverage probabilities decreased with increasing non-normality for each MDT, with
values being closest to the nominal value for normal data. MDTs were more differentiated with
respect to coverage probabilities under non-monotonic pattern than under monotonic pattern.
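
Coverage probability can be estimated by simulation as the share of replications whose confidence interval contains the population coefficient. The sketch below does this for the slope of a one-predictor regression on complete, normal data; it only illustrates how coverage is tallied and does not reproduce the study's design. The function names, the population slope of 0.5, and the normal critical value 1.96 (an approximation to the t critical value) are assumptions.

```python
import random
import statistics

def slope_ci(x, y, z=1.96):
    """Approximate 95% confidence interval for the least-squares slope of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (sse / (n - 2) / sxx) ** 0.5  # standard error of the slope
    return slope - z * se, slope + z * se

def coverage_probability(true_slope=0.5, n=94, replications=1000, seed=0):
    """Proportion of replications whose interval contains the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [true_slope * xi + rng.gauss(0, 1) for xi in x]
        lower, upper = slope_ci(x, y)
        hits += lower <= true_slope <= upper
    return hits / replications

print(coverage_probability())  # should land near the nominal 0.95
```

With complete normal data the estimate should sit close to 0.95; in a study like this one, the same tally would be repeated after each missing data treatment to see how far coverage drifts from the nominal value.
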
The results have several important implications for researchers. First, the choice of
MDT was found to depend on the magnitude of population R², number of predictors, as well as
on the parameter estimate of interest. With the estimation of R² as the goal of analysis, use of EM
is recommended if the anticipated R² is low (about .2). However, if the anticipated R² is high
(about .6), use of PW is recommended. With the estimation of regression coefficients as the goal
of analysis, the choice of MDT was found to be most crucial for the variable with no missing
data. The RS method is most recommended with respect to estimation accuracy of regression
coefficients, although greater bias was recorded under RS than under PW or MS when the
number of predictors was large (i.e., nine predictors). Second, the choice of MDT seems to be of
little concern if the proportion of missing data is 10 percent, and also if the missing pattern is
monotonic rather than non-monotonic. Third, the proportion of missing data seems to have less
impact on the accuracy of parameter estimates under monotonic missing pattern than under non-monotonic
missing pattern. Fourth, to control Type
I error rates under the low R² condition, researchers should use the EM method, as it produced coverage
probabilities of regression coefficients closest to the nominal value at the .05 level. However, to
control Type I error rates under the high R² condition, the RS method is recommended.
Considering that simulated data were used in the present study, it is suggested that future research
should attempt to validate the findings of the present study using real field data. Also, a future
investigator could modify the number of predictors as well as the confidence interval in the
calculation of coverage probabilities to extend generalization of results.

Education, Faculty of / Educational and Counselling Psychology, and Special Education (ECPS), Department of / Graduate

http://hdl.handle.net/2429/9556

Regression analysis

Statistics

Error analysis (Mathematics)

Identifier	oai:union.ndltd.org:UBC/oai:circle.library.ubc.ca:2429/9556
Date	11 1900
Creators	Othuon, Lucas Onyango A.
Source Sets	University of British Columbia
Language	English
Detected Language	English
Type	Text, Thesis/Dissertation
Format	8836458 bytes, application/pdf
Rights	For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
