Tests consisting of both multiple-choice and constructed-response items have gained popularity in recent years. Evidence shows that many assessment programs have administered these two item formats in the same test. However, linking these two item formats on a common scale has not been thoroughly studied. Even though several methods for linking scales under item response theory (IRT) have been developed, most studies have addressed only multiple-choice items, and only a few have addressed constructed-response items. No linking studies have addressed both item formats in the same assessment. The purpose of this study was to investigate the effects of several factors on the accuracy of linking item parameter estimates onto a common scale, using the three-parameter logistic (3-PL) model for multiple-choice items in combination with the graded response model (GRM) for constructed-response items. Working with an anchor-test design, the factors considered were: (1) test length, (2) proportion of items of each format in the test, (3) anchor test length, (4) sample size, (5) ability distributions, and (6) method of equating. The data for dichotomous and polytomous responses for unique and anchor items were simulated to vary as a function of these factors. The main findings were as follows: the constructed-response items had a large influence on parameter estimation for both item formats. Generally, the slope parameters were estimated with small bias but large variance. Threshold parameters were also estimated with small bias but large variance for constructed-response items. However, the opposite results were obtained for multiple-choice items. For the guessing parameter estimates, recovery was relatively good. The coefficients of transformation were also relatively well estimated.
Overall, it was found that the following conditions led to more effective results: (1) a long test, (2) a large proportion of multiple-choice items in the test, (3) a long anchor test, (4) a large sample size, (5) no ability differences between the groups used in linking the two tests, and (6) the method of concurrent calibration. Further research is needed to expand the conditions under which linking of item formats to a common scale is evaluated, for example by introducing multidimensional data.
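For readers unfamiliar with the two models named above, the following is a minimal sketch of their standard textbook forms and of the linear transformation that places item parameters on a common scale. It is not taken from the dissertation: the function names, the omission of the scaling constant D, and the example values are assumptions made for illustration only.

```python
import math


def p_3pl(theta, a, b, c):
    """3-PL model: probability of a correct multiple-choice response.

    a = slope, b = threshold, c = guessing parameter (textbook form,
    logistic metric without the D = 1.7 constant; an assumption here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def grm_category_probs(theta, a, thresholds):
    """Graded response model: probabilities of each ordered score category.

    `thresholds` must be increasing; m thresholds give m + 1 categories.
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (k = 0) and 0.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def rescale(a, locations, A, B):
    """Move a slope and its threshold(s) to the scale theta* = A * theta + B.

    A and B play the role of the coefficients of transformation; the
    guessing parameter is unchanged by this transformation.
    """
    return a / A, [A * b + B for b in locations]


if __name__ == "__main__":
    # Hypothetical item parameters and transformation coefficients.
    A, B = 1.1, -0.3
    theta = 0.7
    a_new, b_new = rescale(1.2, [0.5], A, B)
    print(p_3pl(theta, 1.2, 0.5, 0.2))
    print(p_3pl(A * theta + B, a_new, b_new[0], 0.2))
    print(grm_category_probs(theta, 1.0, [-1.0, 0.0, 1.0]))
```

The two 3-PL calls in the usage block should agree up to floating-point error, which is the property that linking relies on: response probabilities are unchanged when abilities and item parameters are transformed together.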
Identifier | oai:union.ndltd.org:UMASS/oai:scholarworks.umass.edu:dissertations-1806 |
Date | 01 January 2000 |
Creators | Bastari, B |
Publisher | ScholarWorks@UMass Amherst |
Source Sets | University of Massachusetts, Amherst |
Language | English |
Detected Language | English |
Type | text |
Source | Doctoral Dissertations Available from Proquest |