
Predicting the Distribution of a Goodness-Of-Fit Statistic Appropriate For Use With Performance-Based Assessments

One aspect of evaluating model-data fit in the context of Item Response Theory involves assessing item fit using chi-square goodness-of-fit tests. In the current study, a goodness-of-fit statistic appropriate for assessing item fit on performance-based assessments was investigated.
The statistic utilized a pseudo-observed score distribution that used examinees' entire posterior distributions of ability to form item fit tables. Due to dependencies in the pseudo-observed score distribution, or pseudocounts, the statistic could not be tested for significance using a theoretical chi-square distribution. However, past research suggested that the Pearson and likelihood ratio forms of the pseudocounts-based statistic (χ²* and G²*) may follow scaled chi-square distributions.
The purpose of this study was to determine whether item and sample characteristics could be used to predict the scaling corrections needed to rescale χ²* and G²* statistics, so that significance tests against theoretical chi-square distributions were possible. Test length (12, 24, and 36 items) and number of item score category levels (2- to 5-category items) were manipulated. Sampling distributions of χ²* and G²* statistics were generated, and scaling corrections obtained using the method of moments were applied to the simulated distributions.
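As a rough illustration of what a method-of-moments scaling correction can look like, the sketch below assumes that a fit statistic is distributed as a scale factor times a chi-square variable with some degrees of freedom. Under that assumption the mean equals scale × df and the variance equals 2 × scale² × df, so both quantities can be recovered from the sample mean and variance. The function name, the simulated data, and the "true" values 1.5 and 4 are hypothetical and are not taken from the study.

```python
import random
import statistics


def method_of_moments_scaling(stats):
    """Estimate (scale, df) so that stats / scale is roughly chi-square(df).

    Assumes stats ~ scale * chi-square(df), which implies
    mean = scale * df and variance = 2 * scale**2 * df.
    """
    mean = statistics.fmean(stats)
    var = statistics.variance(stats)
    scale = var / (2.0 * mean)      # scale = Var / (2 * Mean)
    df = 2.0 * mean ** 2 / var      # df = 2 * Mean^2 / Var
    return scale, df


# Hypothetical sampling distribution: 1.5 times a chi-square with 4 df.
# A chi-square(df) variable is a Gamma variable with shape df/2 and scale 2.
rng = random.Random(0)
true_scale, true_df = 1.5, 4.0
simulated = [true_scale * rng.gammavariate(true_df / 2.0, 2.0)
             for _ in range(20000)]

scale, df = method_of_moments_scaling(simulated)

# Rescaled statistics can then be referred to a chi-square(df) distribution.
rescaled = [x / scale for x in simulated]
print(f"estimated scale = {scale:.3f}, estimated df = {df:.3f}")
```

In this sketch the corrections are estimated directly from a simulated sampling distribution of one item's statistic; predicting them from item and sample characteristics instead would replace the call to `method_of_moments_scaling` with a fitted prediction equation.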
Two multilevel equations for predicting the scaling corrections (a scaling factor and degrees of freedom value for each item) were then estimated from the simulated data.
Overall, when scaling corrections were obtained with the method of moments, sampling distributions of rescaled χ²* and G²* statistics closely approximated theoretical chi-square distributions across test configurations.
Scaling corrections obtained using multilevel prediction equations did not adequately rescale simulated χ²* distributions for 2- to 5-category tests, or simulated G²* distributions for 2- and 3-category tests. Applications to real items showed that the prediction equations were inadequate across score category levels when χ²* was used, and for 2- and 3-category items when G²* was used.
However, for 4- and 5-category tests, the predicted scaling corrections did adequately rescale empirical sampling distributions of G²* statistics. In addition, applications to real items indicated that use of the multilevel prediction equations with G²* would result in correct identification of item misfit for 5-category, and potentially 4-category, items.

Identifier: oai:union.ndltd.org:PITT/oai:PITTETD:etd-12112004-230948
Date: 13 December 2004
Creators: Hansen, Mary A
Contributors: Clement A. Stone, Suzanne Lane, Carol E. Baker, James J. Irrgang
Publisher: University of Pittsburgh
Source Sets: University of Pittsburgh
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: http://etd.library.pitt.edu/ETD/available/etd-12112004-230948/
Rights: unrestricted. I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to University of Pittsburgh or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.
