
Comparison of the Item Response Theory with Covariates Model and Explanatory Cognitive Diagnostic Model for Detecting and Explaining Differential Item Functioning

In psychometrics, a central concern is whether an assessment is fair to all students who take it. Fairness can be evaluated in several ways, including the examination of differential item functioning (DIF). An item exhibits DIF if one subgroup has a lower probability of answering the item correctly than another subgroup after the groups are matched on academic achievement. Subgroups may be defined by race, spoken language, disability status, or sex. Under item response theory (IRT), each student receives a single score because the model assumes the assessment measures only one construct. Under cognitive diagnostic modeling (CDM), in contrast, an assessment measures multiple specific constructs and classifies students as having mastered each construct or not. Several methods exist to detect DIF under both types of models, but most cannot conduct explanatory modeling, which consists of predicting item responses and latent traits from relevant observed or latent covariates. When an item exhibits DIF that disadvantages a subgroup, covariates can be modeled to indicate whether the difference is true or spurious: if statistically significant DIF becomes nonsignificant after the explanatory variables are modeled, the DIF is explained and considered spurious; if it remains significant, there is stronger evidence that the DIF is real. When an item exhibits DIF, the validity of inferences drawn from the assessment is threatened and group comparisons become inappropriate.
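
To make this detect-then-explain logic concrete, the sketch below simulates responses to a single hypothetical item and runs likelihood-ratio DIF checks with a logistic-regression stand-in for the matching model. It is not the IRT-C or E-CDM specification used in the dissertation; the variable names (theta as the ability proxy, group as language status, covariate as an explanatory variable such as home resources) are illustrative assumptions.

```python
# Minimal sketch of flagging DIF and then explaining it with a covariate.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 2000
theta = rng.normal(size=n)                    # ability proxy used for matching
group = rng.integers(0, 2, size=n)            # hypothetical coding: 0 = EFL, 1 = ML
covariate = 0.6 * group + rng.normal(size=n)  # explanatory variable correlated with group

# Simulate an item whose apparent group gap is carried by the covariate, not by group itself.
logit = 1.2 * theta - 0.8 * covariate
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

def lr_test(y, X_reduced, X_full, df):
    """Likelihood-ratio test comparing nested logistic regressions."""
    llf_r = sm.Logit(y, X_reduced).fit(disp=0).llf
    llf_f = sm.Logit(y, X_full).fit(disp=0).llf
    stat = 2 * (llf_f - llf_r)
    return stat, chi2.sf(stat, df)

# Step 1: flag DIF -- does group add anything beyond the matching variable?
X0 = sm.add_constant(theta)
X1 = sm.add_constant(np.column_stack([theta, group]))
print("DIF flag (group given ability):", lr_test(y, X0, X1, df=1))

# Step 2: explain DIF -- is group still needed once the covariate is modeled?
X2 = sm.add_constant(np.column_stack([theta, covariate]))
X3 = sm.add_constant(np.column_stack([theta, covariate, group]))
print("DIF after explanatory covariate:", lr_test(y, X2, X3, df=1))
```

In this toy setup the group effect is significant when ability alone is controlled but vanishes once the covariate enters the model, mirroring the "explained, hence spurious" conclusion described above.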
This study evaluated the presence of DIF on the Trends in International Mathematics and Science Study (TIMSS) between students in the USA who speak English as a first language (EFL) and students who do not (multilingual learners [ML]). The 8th grade science data from 2011 were analyzed because science achievement remains understudied, the 8th grade is a critical turning point for K-12 students, and 2011 was the most recent year for which item content from this assessment was available. The item response theory with covariates (IRT-C) model served as the explanatory IRT model, while the reparameterized deterministic-input, noisy "and" gate (RDINA) model served as the explanatory CDM (E-CDM). All released items were analyzed for DIF under both models with language status as the key grouping variable. Items that exhibited significant DIF were analyzed further by including relevant covariates, and items that still exhibited DIF had their content evaluated to determine why a group was disadvantaged.
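
As a rough sketch of the CDM side, the function below computes RDINA-style response probabilities, assuming the common reparameterization in which the log-odds of a correct response is an item intercept plus a discrimination term that switches on only when a student has mastered every attribute the Q-matrix requires for the item. The Q-matrix, attribute profiles, and parameter values are invented for illustration; the explanatory version used in the study would add group and covariate terms to this logit.

```python
# Sketch of an RDINA-style item response function (illustrative parameters only).
import numpy as np

def rdina_prob(alpha, q, f, d):
    """P(correct) for attribute profiles alpha (N x K), Q-matrix q (J x K),
    item intercepts f (J,) and item discriminations d (J,)."""
    # eta[i, j] = 1 if student i has mastered all attributes item j requires
    eta = np.all(alpha[:, None, :] >= q[None, :, :], axis=2).astype(float)
    logit = f[None, :] + d[None, :] * eta
    return 1.0 / (1.0 + np.exp(-logit))

q = np.array([[1, 0], [1, 1]])              # two items, two attributes (hypothetical)
alpha = np.array([[0, 0], [1, 0], [1, 1]])  # three attribute profiles
f = np.array([-1.5, -2.0])                  # guessing level on the logit scale
d = np.array([3.0, 4.0])                    # boost for mastering all required attributes
print(rdina_prob(alpha, q, f, d))
```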
Several items exhibited significant DIF under both the IRT-C and the E-CDM, and most of these items disadvantaged ML students. Under the IRT-C, the DIF in two items was explained by quantitative covariates, while two items that had not exhibited significant nonuniform DIF became significant after explanation. Whether a student had repeated elementary school was the strongest explanatory covariate, while confidence in science explained the most items. Under the E-CDM, five items initially exhibited significant uniform DIF, one of which also exhibited nonuniform DIF. After scale purification, two items exhibited significant uniform DIF and one exhibited marginally significant DIF. After explanatory modeling, no items exhibited significant uniform DIF and only one item exhibited marginally significant nonuniform DIF. Among the covariates, home educational resources explained the most items (ten) and was the strongest positive covariate, while having repeated elementary school had the strongest effect in absolute value. Of the 14 items whose content was examined, most had no identifiable causal explanation for the DIF; for four items a causal mechanism was identified, and those items were concluded to exhibit item bias. An item's cognitive domain was related to DIF, with 79% of the flagged items falling under the Knowing domain. Based on these results, DIF that disadvantaged ML students was present in several items on this science assessment: both the IRT-C and the E-CDM identified items exhibiting DIF, quantitative covariates explained several of them, and item bias was discovered in several items.
Following up on this empirical study, a simulation study was conducted to evaluate the DIF detection power and Type I error rates of the Wald and likelihood ratio (LR) tests, as well as item parameter recovery when subgroups were ignored, under the compensatory reparameterized unified model (C-RUM). The factors were sample size, DIF magnitude, DIF type, Q-matrix complexity, their interactions, and p-value adjustment.
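
A schematic of how such a simulation tallies power and Type I error is sketched below. A simple logistic model stands in for the C-RUM (fitting the actual C-RUM is beyond this sketch), the factor levels are invented for illustration, and only uniform DIF via a group main effect is simulated: conditions with zero DIF feed the Type I error rate and the rest feed power, with the Wald test taken from the fitted group coefficient and the LR test from nested model fits.

```python
# Schematic rejection-rate tally for a two-factor slice of the design (toy stand-in for the C-RUM).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)

def one_replication(n, dif):
    theta = rng.normal(size=n)
    group = rng.integers(0, 2, size=n)
    p = 1.0 / (1.0 + np.exp(-(theta + dif * group)))   # uniform DIF of size `dif`
    y = rng.binomial(1, p)
    X_full = sm.add_constant(np.column_stack([theta, group]))
    X_red = sm.add_constant(theta)
    full = sm.Logit(y, X_full).fit(disp=0)
    red = sm.Logit(y, X_red).fit(disp=0)
    wald_p = full.pvalues[-1]                           # Wald test of the group term
    lr_p = chi2.sf(2 * (full.llf - red.llf), df=1)      # LR test of the same term
    return wald_p, lr_p

def rejection_rates(n, dif, reps=200, alpha=0.05):
    pvals = np.array([one_replication(n, dif) for _ in range(reps)])
    return (pvals < alpha).mean(axis=0)                 # (Wald rate, LR rate)

for n in (500, 2000):
    for dif in (0.0, 0.5):                              # 0.0 -> Type I error, 0.5 -> power
        wald, lr = rejection_rates(n, dif)
        print(f"n={n:4d} dif={dif:.1f}  Wald={wald:.3f}  LR={lr:.3f}")
```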
Evaluating DIF under the C-RUM, the DIF detection method had the largest effect on Type I error rates, with the Wald test maintaining the nominal rate much better than the LR test. For power, DIF magnitude was the most important factor, followed by Q-matrix complexity: power increased as DIF magnitude increased and Q-matrix complexity decreased. For parameter recovery, DIF type had the strongest effect, followed by Q-matrix complexity: parameters were recovered better under nonuniform DIF than under uniform DIF, and items measuring fewer attributes showed better recovery. In summary, DIF detection power and Type I error were affected by the detection method, DIF magnitude, and Q-matrix complexity, while parameter recovery was affected by DIF type, Q-matrix complexity, and DIF magnitude.

Doctor of Philosophy

Academic assessments are a necessary tool for evaluating students' educational progress in different subjects across school years. They are needed to establish student proficiency within schools, districts, states, and countries. The results can be broken down to make comparisons by race, ethnicity, gender, language status, school, or any other demographic, or against a proficiency standard or passing rate. Making comparisons between groups is important so that any disparities or achievement gaps can be identified and rectified.
This study evaluated achievement gaps between multilingual learner (ML) students and English first language (EFL) students on individual items of an 8th-grade international science assessment. This subject and grade level are crucial for students preparing for college and beginning their career development. Every test item was analyzed to determine whether there was an achievement gap and whether the item was biased against a group based on first language. Several follow-up analyses were conducted on every item to ensure that the results were as accurate as possible and that there were no other plausible explanations. Several explanatory factors were evaluated, including home educational resources, confidence in science, liking to learn science, repeating elementary school, being bullied at school, and time spent on science homework. For items that had achievement gaps based on language, further analysis was conducted to ensure that the gaps were not due to other student characteristics. Based on that analysis, the item content was examined by me and a content expert to evaluate whether characteristics of the item led to the language achievement gap and whether the item was biased against either ML or EFL students.
Fourteen items exhibited achievement gaps based on language status. Most disadvantaged ML students, and the gaps ranged from small to large. This initial analysis was followed by more extensive analyses to rule out other potential causes of the gaps. Repeating elementary school had the strongest relationship with these items, while confidence in science was related to the largest number of items exhibiting achievement gaps. For two items, the language achievement gap was explained by a combination of factors, leading to the conclusion that those items did not truly have a gap. The remaining items still exhibited achievement gaps, which led to an analysis of their content. For four items, the causes of the remaining achievement gaps were identified; for the rest, no clear reason for the achievement gaps or potential item bias was found.
This study was followed by a simulation study to evaluate a new method of detecting achievement gaps. This was done by generating data so that the true values were known. Sample size, test item complexity, achievement gap size and direction, and the gap detection method were evaluated. These conditions and their values were chosen to reflect realistic testing scenarios and to provide a better understanding of the previous study's results.
The results indicated that one achievement gap detection method had higher detection rates than the other across all conditions. Achievement gaps were found more often when sample sizes and achievement gaps were larger, when test items were less complex, and when one group was disadvantaged across all ability levels. When the estimated statistics were compared with the true values, there were large deviations when one group was disadvantaged at different proficiency levels. The deviations between true and estimated statistics were also larger when items were more complex and sample sizes were smaller than when items were simpler and sample sizes were larger.

Identifier: oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/116425
Date: 06 October 2023
Creators: Krost, Kevin Andrew
Contributors: Educational Research and Evaluation, Skaggs, Gary E., Miyazaki, Yasuo, Williams, Thomas O., Kniola, David John
Publisher: Virginia Tech
Source Sets: Virginia Tech Theses and Dissertation
Language: English
Detected Language: English
Type: Dissertation
Format: ETD, application/pdf
Rights: Creative Commons Attribution-ShareAlike 4.0 International, http://creativecommons.org/licenses/by-sa/4.0/
