1. Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT. Andrews, Benjamin James, 01 July 2011.
The equity properties can be used to assess the quality of an equating. The degree to which expected scores conditional on ability are similar between test forms is referred to as first-order equity. Second-order equity is the degree to which conditional standard errors of measurement are similar between test forms after equating. The purpose of this dissertation was to investigate the use of a multidimensional IRT framework for assessing first- and second-order equity of mixed format tests.
Both real and simulated data were used for assessing the equity properties for mixed-format tests. Using real data from three Advanced Placement (AP) exams, five different equating methods were compared in their preservation of first- and second-order equity. Frequency estimation, chained equipercentile, unidimensional IRT true score, unidimensional IRT observed score, and multidimensional IRT observed score equating methods were used. Both a unidimensional IRT framework and a multidimensional IRT framework were used to assess the equity properties. Two simulation studies were also conducted. The first investigated the accuracy of expected scores and conditional standard errors of measurement as tests became increasingly multidimensional using both a unidimensional IRT framework and multidimensional IRT framework. In the second simulation study, the five different equating methods were compared in their ability to preserve first- and second-order equity as tests became more multidimensional and as differences in group ability increased.
Results from the real data analyses indicated that the performance of the equating methods based on first- and second-order equity varied depending on which framework was used to assess equity and which test was used. Some tests showed similar preservation of equity for both frameworks while others differed greatly in their assessment of equity. Results from the first simulation study showed that estimates of expected scores had lower mean squared error values when the unidimensional framework was used compared to when the multidimensional framework was used when the correlation between abilities was high. The multidimensional IRT framework had lower mean squared error values for conditional standard errors of measurement when the correlation between abilities was less than .95. In the second simulation study, chained equating performed better than frequency estimation for first-order equity. Frequency estimation better preserved second-order equity compared to the chained method. As tests became more multidimensional or as group differences increased, the multidimensional IRT observed score equating method tended to perform better than the other methods.
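The two equity properties can be expressed directly in IRT terms: first-order equity compares the forms' expected scores at each ability level (their test characteristic curves), and second-order equity compares their conditional standard errors of measurement. A minimal unidimensional 2PL sketch with hypothetical item parameters (not the AP data studied in the dissertation):

```python
import math

def p_2pl(theta, a, b):
    # 2PL item response function: probability of a correct answer
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_score(theta, items):
    # Test characteristic curve: expected number-correct score at theta
    return sum(p_2pl(theta, a, b) for a, b in items)

def csem(theta, items):
    # Conditional SEM for locally independent dichotomous items
    var = sum(p * (1.0 - p) for p in (p_2pl(theta, a, b) for a, b in items))
    return math.sqrt(var)

# Hypothetical (a, b) parameters for two short alternate forms
form_x = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
form_y = [(1.1, -0.4), (0.9, 0.1), (1.0, 0.4)]

# First-order equity holds to the extent the expected-score gap is small
# at every theta; second-order equity, to the extent the CSEM gap is small
for theta in (-1.0, 0.0, 1.0):
    foe_gap = abs(expected_score(theta, form_x) - expected_score(theta, form_y))
    soe_gap = abs(csem(theta, form_x) - csem(theta, form_y))
```

After equating, the same comparison is made between the old form and the equated new form; assessing the properties under a multidimensional framework replaces these scalar-theta curves with integrals over the joint ability space.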
2. A comparison of Van der Linden's conditional equipercentile equating method with other equating methods under the random groups design. Shin, Seonho, 01 July 2011.
To ensure test security and fairness, alternative forms of the same test are administered in practice. Although alternative forms are designed to be as parallel as possible, they generally do not have exactly the same difficulty, so equating is used to adjust for differences in difficulty among forms. Six traditional equating methods are considered in this study: equipercentile equating without smoothing, equipercentile equating with pre-smoothing and with post-smoothing, IRT true-score (IRT-T) and observed-score (IRT-O) equating, and kernel equating. A common feature of all the traditional procedures is that the end result of equating is a single transformation (or conversion table) applied to all examinees who take the same test. Van der Linden has proposed conditional equipercentile (or local) equating (CEE), which reduces the error contained in the traditional procedures by equating at the individual level. CEE is conceptually closest to IRT-T in that it conditions on a type of true score (θ, or proficiency), but it shares similarities with IRT-O in that it uses an estimated observed score distribution at each θ to equate scores equipercentile-style.
No real-data study has yet compared van der Linden's CEE with each of the traditional equating procedures; indeed, even among the traditional procedures, no study has compared all six simultaneously. In addition to van der Linden's CEE, two variations of CEE are considered: CEE using maximum likelihood estimation (CEE-MLE) and CEE using the test characteristic curve (CEE-TCC). The focus of this study is on comparing results from CEE vis-à-vis the traditional procedures, as opposed to answering a “best-procedure” question, which would require a common conception of “true” equating. Although the results of the traditional equating methods are quite similar, kernel equating and equipercentile equating with log-linear presmoothing generally show better fit to the original form's statistical moments under various data conditions. Although IRT-T and IRT-O are usually least favorable in terms of statistical moments under all circumstances, their equated raw score difference distributions are more stable than those of the other traditional methods. It was also found that the number of examinees at a particular score point does not influence results for CEE as much as it does for the traditional equatings. CEE-EAP and CEE-MLE are very similar to one another, and their equated score difference distributions are similar to those of IRT-O. Because CEE-TCC involves part of the IRT-T procedure, it behaves somewhat similarly to IRT-T. Although CEE results are less desirable in terms of maintaining statistical moments, the equated score differences are more consistent and stable than for the traditional equating methods.
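The traditional equipercentile transformation that CEE localizes equates a form-X score to the form-Y score with the same percentile rank, e_Y(x) = Q_Y(P_X(x)). A discrete sketch with illustrative frequency distributions (operational implementations continuize by interpolation and typically presmooth; CEE applies the same idea to the estimated observed score distribution at each θ):

```python
def percentile_rank(score, freqs):
    # Midpoint convention: all frequency below the score plus half at it
    n = sum(freqs)
    return 100.0 * (sum(freqs[:score]) + 0.5 * freqs[score]) / n

def equipercentile(x, freqs_x, freqs_y):
    # Map x to the form-Y score whose percentile rank is closest
    # (a discrete stand-in for the continuized inverse Q_Y)
    target = percentile_rank(x, freqs_x)
    return min(range(len(freqs_y)),
               key=lambda y: abs(percentile_rank(y, freqs_y) - target))

# Illustrative frequency distributions over scores 0..4
fx = [5, 10, 20, 10, 5]
fy = [2, 8, 20, 15, 5]
conversion = [equipercentile(x, fx, fy) for x in range(5)]
```

When the two distributions are identical, the conversion is the identity, which is the expected sanity check for any equipercentile implementation.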
3. A comparison of equating/linking using the Stocking-Lord method and concurrent calibration with mixed-format tests in the non-equivalent groups common-item design under IRT. Tian, Feng, January 2011.
Thesis advisor: Larry Ludlow. There has been a steady increase in the use of mixed-format tests, that is, tests consisting of both multiple-choice and constructed-response items, in both classroom and large-scale assessments. This calls for appropriate equating methods for such tests. As item response theory (IRT) has rapidly become the mainstream theoretical basis for measurement, different IRT equating methods have been developed. This study used simulated data to investigate the performance of two IRT equating methods: linking following separate calibration (using the Stocking-Lord method) and concurrent calibration. The findings show that concurrent calibration generally performs better in recovering the item parameters and, more importantly, produces more accurate estimated scores than linking following separate calibration. Limitations and directions for future research are discussed. Thesis (PhD), Boston College, 2011. Submitted to: Boston College, Lynch School of Education. Discipline: Educational Research, Measurement, and Evaluation.
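The Stocking-Lord step in linking after separate calibration chooses the linear scale transformation θ_old = A·θ_new + B that minimizes the squared distance between the common items' test characteristic curves on the two scales. A grid-search sketch under the 2PL with hypothetical parameters (operational implementations optimize numerically and handle mixed item formats):

```python
import math

def p_2pl(theta, a, b):
    # 2PL item response function
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    # Test characteristic curve of the common items
    return sum(p_2pl(theta, a, b) for a, b in items)

def stocking_lord(old_items, new_items, thetas, grid_a, grid_b):
    # Under theta_old = A * theta_new + B, new-form parameters transform
    # as a -> a / A and b -> A * b + B; pick (A, B) minimizing the
    # squared TCC discrepancy over a set of quadrature points
    best = None
    for A in grid_a:
        for B in grid_b:
            trans = [(a / A, A * b + B) for a, b in new_items]
            loss = sum((tcc(t, old_items) - tcc(t, trans)) ** 2
                       for t in thetas)
            if best is None or loss < best[0]:
                best = (loss, A, B)
    return best[1], best[2]

# Common items on the old scale, and the same items as they would appear
# on a new scale related by A = 1.2, B = 0.5
old = [(1.0, -0.5), (1.4, 0.2), (0.9, 0.8)]
new = [(a * 1.2, (b - 0.5) / 1.2) for a, b in old]
A, B = stocking_lord(old, new, [i / 2.0 - 2.0 for i in range(9)],
                     [0.8, 1.0, 1.2, 1.4], [-0.5, 0.0, 0.5])
```

Because the grid contains the true transformation, the search recovers A = 1.2 and B = 0.5 exactly; concurrent calibration avoids this step by estimating all forms on a common scale in one run.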
4. Beta observed score and true score equating methods. Wang, Shichao, 01 August 2019.
Equating is a statistical process used to adjust scores on test forms so that scores on the forms can be used interchangeably. This dissertation offered an intensive investigation of beta true score and beta observed score methods, comparing them to existing traditional and IRT equating methods under multiple designs and various conditions using real data, pseudo-test data, and simulated data. Weighted and conditional bias, the standard error of equating, and root mean squared error were used to evaluate the accuracy of equating results obtained from the pseudo-test and simulated data analyses. Single group equipercentile equating based on large sample sizes was used as the criterion equating. Overall, results showed that of the methods examined, the IRT methods performed best, followed by the chained equipercentile methods. Results from the beta methods showed different trends from the traditional and IRT methods for both the random groups and common-item nonequivalent groups designs. Beta true score methods were less sensitive to group differences than the traditional methods, and the length of the common-item set played an important role in the stability of their results.
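Beta methods rest on a strong true-score model: true proportion-correct is assumed to follow a beta distribution, and observed scores given the true score are binomial, so the marginal observed-score distribution is beta-binomial. A sketch of that marginal with hypothetical parameters (the dissertation's beta methods generalize this two-parameter version):

```python
import math

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(x, n, alpha, beta):
    # P(X = x) when p ~ Beta(alpha, beta) and X | p ~ Binomial(n, p):
    # C(n, x) * B(alpha + x, beta + n - x) / B(alpha, beta)
    return math.comb(n, x) * math.exp(
        log_beta(alpha + x, beta + n - x) - log_beta(alpha, beta))

# Modeled observed-score distribution for a hypothetical 10-item test
dist = [beta_binomial_pmf(x, 10, 4.0, 6.0) for x in range(11)]
```

Once each form's observed-score distribution is modeled this way, observed score equating proceeds equipercentile-style on the smoothed distributions, and true score equating works on the modeled true-score scale.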
5. Factors affecting accuracy of comparable scores for augmented tests under Common Core State Standards. Kim, Ja Young, 01 May 2013.
Under the Common Core State Standard (CCSS) initiative, states that voluntarily adopt the common core standards work together to develop a common assessment in order to supplement and replace existing state assessments. However, the common assessment may not cover all state standards, so states within the consortium can augment the assessment using locally developed items that align with state-specific standards to ensure that all necessary standards are measured. The purpose of this dissertation was to evaluate the linking accuracy of the augmented tests using the common-item nonequivalent groups design.
Pseudo-test analyses were conducted by splitting a large-scale math assessment in half to create two parallel common assessments and by augmenting them with two sets of state-specific items drawn from a large-scale science assessment. A simulation study, based on modifications of the pseudo-data, was also conducted.
For the pseudo-test analyses, three factors were investigated: (1) the difference in ability between the new and old test groups, (2) the differential effect size for the common assessment and state-specific item set, and (3) the number of common items. For the simulation analyses, the latent-trait correlations between the common assessment and state-specific item set as well as the differential latent-trait correlations between the common assessment and state-specific item set were used in addition to the three factors considered for the pseudo-test analyses. For each of the analyses, four equating methods were used: the frequency estimation, chained equipercentile, item response theory (IRT) true score, and IRT observed score methods.
The main findings of this dissertation were as follows: (1) as the group ability difference increased, bias also increased; (2) when the effect sizes differed for the common assessment and state-specific item set, larger bias was observed; (3) increasing the number of common items resulted in less bias, especially for the frequency estimation method when the group ability differed; (4) the frequency estimation method was more sensitive to the group ability difference than the differential effect size, while the IRT equating methods were more sensitive to the differential effect size than the group ability difference; (5) higher latent-trait correlation between the common assessment and state-specific item set was associated with smaller bias, and if the latent-trait correlation exceeded 0.8, the four equating methods provided adequate linking unless the group ability difference was large; (6) differential latent-trait correlations for the old and new tests resulted in larger bias than the same latent-trait correlations for the old and new tests, and (7) when the old and new test groups were equivalent, the frequency estimation method provided the least bias, but IRT true score and observed score equating resulted in smaller bias than the frequency estimation and chained equipercentile methods when group ability differed.
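The bias comparisons above can be operationalized as a weighted discrepancy between each equating and the criterion equating, weighting each score point by its relative frequency. A minimal sketch with illustrative numbers (not the dissertation's data):

```python
import math

def weighted_bias_rmse(equated, criterion, weights):
    # Weighted bias and RMSE of an equating against a criterion equating,
    # weighting each score point by its relative frequency
    total = float(sum(weights))
    diffs = [e - c for e, c in zip(equated, criterion)]
    bias = sum(w * d for w, d in zip(weights, diffs)) / total
    rmse = math.sqrt(sum(w * d * d for w, d in zip(weights, diffs)) / total)
    return bias, rmse

# Illustrative: a 5-point conversion table vs. the criterion conversion
bias, rmse = weighted_bias_rmse([0.2, 1.1, 2.0, 2.9, 4.1],
                                [0.0, 1.0, 2.0, 3.0, 4.0],
                                [5, 10, 20, 10, 5])
```

Signed bias shows systematic over- or under-equating (finding 1 above), while RMSE also captures random equating error; both are computed per studied condition when comparing the four methods.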
6. Exploring the Efficacy of Pre-Equating a Large Scale Criterion-Referenced Assessment with Respect to Measurement Equivalence. Domaleski, Christopher Stephen, 12 September 2006.
This investigation examined the practice of relying on field test item calibrations, in advance of the operational administration of a large-scale assessment, for purposes of equating and scaling. The effectiveness of this method, often termed “pre-equating,” is explored for a statewide, high-stakes assessment in grades three, five, and seven in the content areas of language arts, mathematics, and social studies. Pre-equated scaling was based on Rasch-model item calibrations from an off-grade field test event in which the students tested were one grade higher than the target population. These calibrations were compared to those obtained from post-equating, which used the full statewide population of examinees. Item difficulty estimates and test characteristic curves (TCCs) were compared for each approach and found to be similar. The root mean square error (RMSE) of the theta estimates across approaches ranged from .02 to .12. Moreover, classification accuracy for the pre-equated approach was generally high compared to results from post-equating: only 3 of the 9 tests examined showed differences in the percent of students classified as passing, with errors ranging from 1.7 percent to 3 percent. Measurement equivalence between the field test and operational assessment was also explored using the Differential Functioning of Items and Tests (DFIT) framework. Overall, about 20 to 40 percent of the items on each assessment exhibited statistically significant differential item functioning (DIF), and differential test functioning (DTF) was significant for 7 of the tests. There was a positive relationship between the magnitude of DTF and the degree of incongruence between pre-equating and post-equating. Item calibrations, score consistency, and measurement equivalence were also explored for a test calibrated with the one-, two-, and three-parameter logistic models, using the TCC equating method. Measurement equivalence and score table incongruence were slightly more pronounced with this approach. It was hypothesized that differences between the field test and operational tests resulted from (1) recency of instruction, (2) cognitive growth, and (3) motivation factors. Additional research related to these factors is suggested.
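The pre-/post-equating comparison largely reduces to comparing two sets of Rasch item calibrations and the test characteristic curves they imply. A sketch of those core quantities with hypothetical difficulty estimates (not the statewide data):

```python
import math

def rasch_p(theta, b):
    # Rasch model: probability of a correct response
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def tcc(theta, difficulties):
    # Test characteristic curve from a set of Rasch difficulties
    return sum(rasch_p(theta, b) for b in difficulties)

def rmse(xs, ys):
    # Root mean square difference between two sets of estimates
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical field-test (pre) vs. operational (post) calibrations
pre = [-1.2, -0.4, 0.1, 0.7, 1.3]
post = [-1.1, -0.5, 0.2, 0.6, 1.4]
item_rmse = rmse(pre, post)
tcc_gap = max(abs(tcc(t / 2.0 - 2.0, pre) - tcc(t / 2.0 - 2.0, post))
              for t in range(9))
```

Small item-level RMSE and a small maximum TCC gap are what would justify using the pre-equated score table operationally; large gaps signal the kind of incongruence the DFIT analysis detected.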
7. Effect of Sample Size on IRT Equating of Unidimensional Tests in the Common Item Non-Equivalent Group Design: A Monte Carlo Simulation Study. Wang, Xiangrong, 03 May 2012.
Test equating is important to large-scale testing programs for two reasons: strict test security is a key concern for high-stakes tests, and the fairness of equating matters to test takers. The question of adequate sample size often arises in test equating, but most recommendations in the existing literature are based on classical equating methods; very few studies have systematically investigated the minimal sample size that yields reasonably accurate equating results under item response theory (IRT). The main purpose of this study was to examine the minimal sample size needed for desired IRT equating accuracy in the common-item nonequivalent groups design under various conditions. Accuracy was determined by examining the relative magnitude of six accuracy statistics. Two IRT equating methods were carried out on simulated tests with combinations of test length, test format, group ability difference, similarity of form difficulty, and parameter estimation method, for 14 sample sizes, using Monte Carlo simulations with 1,000 replications per cell. Observed score equating and true score equating were compared to the criterion equating to obtain the accuracy statistics. The results suggest that sample size requirements differ across test lengths, test formats, and parameter estimation methods. Additionally, the results show the following. First, the results for true score equating and observed score equating are very similar. Second, the longer test yields less accurate equating than the shorter one at the same sample size, and the gap widens as sample size decreases. Third, concurrent parameter estimation produced less equating error than separate estimation at the same sample size, and the difference grows as sample size is reduced. Fourth, the conditions with group ability differences have larger and less stable error than the base condition and the conditions with different test difficulty, especially with separate parameter estimation and sample sizes below 750. Last, the mixed-format test is equated more accurately than the single-format one at the same sample size. Ph.D.
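The data-generation core of a Monte Carlo study like this draws dichotomous responses from the IRT model at each replication. A 2PL sketch with hypothetical parameters (a full study would wrap this in calibration, equating, and the accuracy statistics, repeated 1,000 times per cell):

```python
import math
import random

def simulate_responses(thetas, items, seed=0):
    # One replication: 0/1 responses drawn under the 2PL model
    rng = random.Random(seed)
    data = []
    for theta in thetas:
        row = [1 if rng.random() < 1.0 / (1.0 + math.exp(-a * (theta - b)))
               else 0
               for a, b in items]
        data.append(row)
    return data

# 500 simulated examinees from N(0, 1) and 5 hypothetical items
rng = random.Random(42)
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]
items = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.1, 0.5), (0.9, 1.0)]
data = simulate_responses(thetas, items, seed=7)
```

Sample-size conditions are created simply by varying the number of simulated examinees, and group-ability-difference conditions by shifting the mean of the ability distribution for one group.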
8. Equating Accuracy Using Small Samples in the Random Groups Design. Heh, Victor K., 24 August 2007.
No description available.
9. Observed score equating with covariates. Bränberg, Kenny, January 2010.
In test score equating the focus is on the problem of finding the relationship between the scales of different test forms. This can be done only if data are collected in such a way that the effect of differences in ability between groups taking different test forms can be separated from the effect of differences in test form difficulty. In standard equating procedures this problem has been solved by using common examinees or common items. With common examinees, as in the equivalent groups design, the single group design, and the counterbalanced design, the examinees taking the test forms are either exactly the same, i.e., each examinee takes both test forms, or random samples from the same population. Common items (anchor items) are usually used when the samples taking the different test forms are assumed to come from different populations. The thesis consists of four papers, and the main theme in three of them is the use of covariates, i.e., background variables correlated with the test scores, in observed score equating. We show how covariates can be used to adjust for systematic differences between samples in a non-equivalent groups design when there are no anchor items. We also show how covariates can be used to decrease the equating error in an equivalent groups design or in a non-equivalent groups design. The first paper, Paper I, is the only paper whose focus is on something other than the incorporation of covariates in equating. The paper is an introduction to test score equating and presents the author's thoughts on its foundations. There are a number of different definitions of test score equating in the literature; some of these are presented, and the similarities and differences between them are discussed. An attempt is also made to clarify the connection between the definitions and the most commonly used equating functions. In Paper II a model is proposed for observed score linear equating with background variables.
The idea presented in the paper is to adjust for systematic differences in ability between groups in a non-equivalent groups design by using information from background variables correlated with the observed test scores. It is assumed that, conditional on the background variables, the two samples can be seen as random samples from the same population; the background variables thus explain the systematic differences in ability between the populations. The proposed model consists of a linear regression model connecting the observed scores with the background variables and a linear equating function connecting observed scores on one test form to observed scores on the other. Maximum likelihood estimators of the model parameters are derived under an assumption of normally distributed test scores, and data from two administrations of the Swedish Scholastic Assessment Test are used to illustrate the use of the model. In Paper III we use the model presented in Paper II with two different data collection designs: the non-equivalent groups design (with and without anchor items) and the equivalent groups design. Simulated data are used to examine the effect of including covariates on the bias, variance, and mean squared error of the estimators. With the equivalent groups design, the results show that using covariates can increase the accuracy of the equating. With the non-equivalent groups design, the results show that using an anchor test together with covariates is the most efficient way of reducing the mean squared error of the estimators. Furthermore, with no anchor test, the background variables can be used to adjust for the systematic differences between the populations and produce unbiased estimators of the equating relationship, provided that the “right” variables are used, i.e., the variables explaining those differences.
In Paper IV we explore the idea of using covariates as a substitute for an anchor test with a non-equivalent groups design in the framework of Kernel Equating. Kernel Equating can be seen as a method including five different steps: presmoothing, estimation of score probabilities, continuization, equating, and calculating the standard error of equating. For each of these steps we give the theoretical results when observations on covariates are used as a substitute for scores on an anchor test. It is shown that we can use the method developed for Post-Stratification Equating in the non-equivalent groups with anchor test design, but with observations on the covariates instead of scores on an anchor test. The method is illustrated using data from the Swedish Scholastic Assessment Test.
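The continuization step that characterizes Kernel Equating replaces each discrete score distribution with a smooth CDF by placing a Gaussian kernel of bandwidth h at every score point. A simplified sketch with an illustrative distribution (the full method rescales the argument so the continuized distribution preserves the mean and variance of the discrete one):

```python
import math

def phi(z):
    # Standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def continuized_cdf(x, scores, probs, h):
    # Simplified kernel continuization: F_h(x) = sum_j p_j * Phi((x - x_j) / h)
    return sum(p * phi((x - xj) / h) for xj, p in zip(scores, probs))

# Illustrative discrete score distribution on 0..4
scores = list(range(5))
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
curve = [continuized_cdf(x / 10.0, scores, probs, h=0.6) for x in range(41)]
```

With both forms continuized this way, the equating function is the usual equipercentile composition of one continuized CDF with the inverse of the other; in the covariate variant of Paper IV, the score probabilities entering this step come from post-stratification on covariates rather than on anchor scores.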
10. Comparison of MIRT observed score equating methods under the common-item nonequivalent groups design. Choi, Jiwon, 01 May 2019.
For equating tests that measure several distinct proficiencies, procedures that reflect the multidimensional structure of the data are needed. Although there exist a few equating procedures developed under the multidimensional item response theory (MIRT) framework, there is a need for further research in this area. Therefore, the primary objectives of this dissertation are to consolidate and expand MIRT observed score equating research with a specific focus on the common-item nonequivalent groups (CINEG) design, which requires scale linking. Content areas and item types are two focal points of dimensionality. This dissertation uses two studies with different data types and comparison criteria to address the research objectives.
In general, the comparison between unidimensional item response theory (UIRT) and MIRT methods suggested that the MIRT methods outperform UIRT: both the simple structure (SS) and full MIRT methods showed more accurate equating results than UIRT. In terms of calibration, concurrent calibration outperformed separate calibration for all equating methods under most of the studied conditions.
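The compensatory MIRT model underlying these methods gives each item a vector of discriminations over the distinct proficiencies, so equating quantities such as expected scores are computed over the joint θ space rather than a single scale. A two-dimensional M2PL sketch with hypothetical parameters (a simple-structure test, where each item loads on exactly one dimension):

```python
import math

def m2pl_p(thetas, a_vec, d):
    # Compensatory M2PL: the logit is a weighted sum of abilities
    # plus an intercept
    z = sum(a * t for a, t in zip(a_vec, thetas)) + d
    return 1.0 / (1.0 + math.exp(-z))

def expected_score(thetas, items):
    # Expected number-correct at one point in the 2-D ability space
    return sum(m2pl_p(thetas, a_vec, d) for a_vec, d in items)

# Simple-structure items: each loads on one dimension only
items = [((1.2, 0.0), 0.3), ((0.9, 0.0), -0.2),
         ((0.0, 1.1), 0.1), ((0.0, 0.8), 0.4)]
score = expected_score((0.0, 0.0), items)
```

MIRT observed score equating then integrates the conditional score distributions over the estimated joint ability distribution, which is also where the CINEG design's scale linking must align both dimensions across groups.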