The effects of item difficulty and examinee ability on the distribution and effectiveness of LZ and ECIZ4 appropriateness indices.

Test scores are intended to provide an estimate of an examinee's ability. High ability examinees are expected to get few easy items wrong and low ability examinees are expected to get few difficult items right. But there are occasions when the test-taking behavior of some atypical examinees may be so unusual that their test scores cannot be regarded as an appropriate measure of ability. An atypical examinee can have a spuriously low or a spuriously high score. However, appropriateness indices can be used to identify examinees with potentially inaccurate total scores. Appropriateness indices provide quantitative measures of response pattern atypicality. These indices fall into two major categories: (a) IRT-based and (b) non-IRT based indices. The dependency of non-IRT based indices on the item difficulty order of a particular group has rendered them inadequate for detecting aberrant response patterns. IRT-based indices are group invariant. Researchers have investigated the effectiveness and the distributions of these indices under varying conditions of testing. However, some test situations might require efficient and accurate indices of appropriateness measurement for restricted samples. It might be helpful, for example, to accurately identify examinees with potentially spuriously low scores falling just below the criterion of a minimum competency test; on a certification test, it might be helpful to concentrate on identifying examinees with spuriously high scores. Therefore, the effects of item difficulty and examinee ability distributions on the effectiveness and the distributional characteristics of LZ and ECIZ4 (IRT-based) appropriateness indices were investigated in this study.
To examine the effects of item difficulty and ability distributions on the distributional characteristics of LZ and ECIZ4, data were generated in nine combinations of item difficulty and ability distributions to simulate the responses of 2000 examinees to 60 test items according to the three-parameter model. Three uniform distributions of item difficulty were used. Items typical of diagnostic tests were generated in the interval -3.0 to +1.2; items typical of power tests were generated in the interval -3.0 to +3.0; and items typical of certification and licensing tests were generated in the interval -1.2 to +3.0. Three distributions of ability were used. Thetas typical of low, medium, and high ability examinees were generated to have normal distributions with means of -1.2, 0.0, and +1.2 respectively, each with a standard deviation of 0.6. The mean, standard deviation, skewness, kurtosis, and the percentile estimates of LZ and ECIZ4 were significantly affected by the variations of item difficulty and ability distributions. The distributions of the two indices approximated a normal distribution when the ability estimates matched the item difficulty. Overall, the distributions of LZ approximated a normal distribution better than the distributions of ECIZ4. To examine the effectiveness of LZ and ECIZ4 in detecting aberrant response patterns, two samples, each consisting of 500 response patterns (one for spuriously low and one for spuriously high scores), were generated for each of the nine combinations of item difficulty and ability distribution and subjected to spurious treatments. Twenty percent and 10% spuriously high scores were created by randomly selecting 20% or 10% of the original responses and changing incorrect answers to correct. Twenty percent and 10% spuriously low scores were created by randomly selecting 20% or 10% of the original responses and changing correct answers to incorrect.
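The data-generation design described above can be illustrated with a minimal sketch. It is not the thesis's own program: the abstract does not report the discrimination or guessing parameters, so the values `a=1.0` and `c=0.2`, the scaling constant 1.7, and all function names here are assumptions chosen only for illustration.

```python
import math
import random

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response.
    The scaling constant 1.7 is an assumption, not stated in the abstract."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def simulate(n_examinees, n_items, b_range, theta_mean, theta_sd, rng,
             a=1.0, c=0.2):
    """Return (thetas, difficulties, 0/1 response matrix).
    a and c are hypothetical fixed values; the abstract does not give them."""
    bs = [rng.uniform(*b_range) for _ in range(n_items)]
    thetas = [rng.gauss(theta_mean, theta_sd) for _ in range(n_examinees)]
    responses = [[1 if rng.random() < p_correct(t, a, b, c) else 0 for b in bs]
                 for t in thetas]
    return thetas, bs, responses

def make_spurious(pattern, fraction, direction, rng):
    """Pick `fraction` of the responses at random and force them
    correct ('high') or incorrect ('low'); others are left unchanged."""
    k = round(fraction * len(pattern))
    target = 1 if direction == "high" else 0
    out = list(pattern)
    for i in rng.sample(range(len(pattern)), k):
        out[i] = target
    return out

# One of the nine cells: diagnostic-type items, low ability examinees.
rng = random.Random(0)
thetas, bs, responses = simulate(2000, 60, (-3.0, 1.2), -1.2, 0.6, rng)
high = make_spurious(responses[0], 0.20, "high", rng)
```

Responses that were already correct stay correct under the "high" treatment, so fewer than 20% of the entries may actually change; the same holds in reverse for the "low" treatment.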
The percentile estimates obtained were used as cutoff points to classify response patterns as aberrant or non-aberrant. Spuriously low aberrant response patterns were easier for the two indices to detect under low item difficulty, and spuriously high aberrant response patterns were easier to detect under high item difficulty. At low (0.01 and 0.05) false positive rates, LZ had higher detection rates of spuriously high and spuriously low aberrant response patterns than ECIZ4 under high item difficulty, and ECIZ4 had higher detection rates than LZ under medium and low item difficulty. The 20% treatment samples were easier for the two indices to detect than the 10% treatment samples.
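The percentile-cutoff classification can be sketched for LZ as follows. The formula used is the standardized log-likelihood index as it is commonly defined in the appropriateness-measurement literature; the abstract itself does not state it, so treat it as an assumption. The sketch also evaluates the index at the true theta rather than an estimate, fixes hypothetical item parameters (`a=1.0`, `c=0.2`), and covers LZ only.

```python
import math
import random

def p3pl(theta, b, a=1.0, c=0.2):
    """3PL probability of a correct response (a, c, and 1.7 are assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def lz(pattern, probs):
    """Standardized log-likelihood of a 0/1 pattern given model probabilities."""
    l0 = sum(math.log(p) if u else math.log(1.0 - p)
             for u, p in zip(pattern, probs))
    mean = sum(p * math.log(p) + (1.0 - p) * math.log(1.0 - p) for p in probs)
    var = sum(p * (1.0 - p) * math.log(p / (1.0 - p)) ** 2 for p in probs)
    return (l0 - mean) / math.sqrt(var)

def percentile(values, q):
    """Empirical quantile for a fraction 0 < q < 1, nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def detection_rate(normal_scores, aberrant_scores, false_positive_rate):
    """Flag patterns whose index is at or below the lower-tail cutoff
    taken from the non-aberrant sample; return the share flagged."""
    cutoff = percentile(normal_scores, false_positive_rate)
    return sum(1 for s in aberrant_scores if s <= cutoff) / len(aberrant_scores)

rng = random.Random(0)
bs = [rng.uniform(-3.0, 3.0) for _ in range(60)]  # power-test-type items

def sample_scores(n, spurious_low_fraction):
    """LZ values for n simulated examinees, with an optional 'low' treatment."""
    scores = []
    for _ in range(n):
        theta = rng.gauss(0.0, 0.6)
        probs = [p3pl(theta, b) for b in bs]
        pattern = [1 if rng.random() < p else 0 for p in probs]
        k = round(spurious_low_fraction * len(bs))
        for i in rng.sample(range(len(bs)), k):
            pattern[i] = 0  # force the selected responses to incorrect
        scores.append(lz(pattern, probs))
    return scores

normal = sample_scores(2000, 0.0)
aberrant = sample_scores(500, 0.20)
rate = detection_rate(normal, aberrant, 0.05)
```

Large negative LZ values indicate patterns that are unlikely under the model, which is why the cutoff is taken from the lower tail of the non-aberrant distribution.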

Identifier: oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/10488
Date: January 1992
Creators: Korir, Daniel K.
Publisher: University of Ottawa (Canada)
Source Sets: Université d’Ottawa
Detected Language: English
Type: Thesis
Format: 100 p.