Detecting rater effects in trend scoring

Trend scoring is often used in large-scale assessments to monitor for rater drift when the same constructed-response items are administered in multiple test administrations. In trend scoring, a set of responses from Time A is rescored by raters at Time B. The purpose of this study is to examine the ability of trend-monitoring statistics to detect rater effects in the context of trend scoring. The study examines the percent of exact agreement and Cohen’s kappa as interrater agreement measures, and the paired t-test and Stuart’s Q as marginal homogeneity measures. Data containing specific rater effects are simulated under two frameworks: the generalized partial credit model and the latent-class signal detection theory model.
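As an illustration of the four statistics named above, the following is a minimal sketch, not the dissertation's own code. It assumes that each rescored paper has one integer score from Time A and one from Time B, and that "Stuart’s Q" refers to the Stuart-Maxwell test of marginal homogeneity; the abstract does not give the exact formulation. All function names are hypothetical, and the sample scores are made up for the example.

```python
from math import sqrt

def cross_table(time_a, time_b, categories):
    """Cross-tabulate Time A scores (rows) against Time B rescores (columns)."""
    idx = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    table = [[0] * k for _ in range(k)]
    for a, b in zip(time_a, time_b):
        table[idx[a]][idx[b]] += 1
    return table

def _margins(table):
    k = len(table)
    rows = [sum(table[i]) for i in range(k)]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]
    return rows, cols

def exact_agreement(table):
    """Proportion of papers given the same score at both times."""
    n = sum(map(sum, table))
    return sum(table[i][i] for i in range(len(table))) / n

def cohen_kappa(table):
    """Agreement corrected for the agreement expected by chance."""
    n = sum(map(sum, table))
    rows, cols = _margins(table)
    p_obs = exact_agreement(table)
    p_exp = sum(r * c for r, c in zip(rows, cols)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

def paired_t(time_a, time_b):
    """Paired t statistic for the mean of (Time B - Time A) differences."""
    diffs = [b - a for a, b in zip(time_a, time_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n)

def stuart_q(table):
    """Stuart-Maxwell Q = d' S^-1 d (assumed form; chi-square with k-1 df)."""
    rows, cols = _margins(table)
    m = len(table) - 1  # drop the last category to avoid a singular matrix
    d = [rows[i] - cols[i] for i in range(m)]
    s = [[rows[i] + cols[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(m)] for i in range(m)]
    # Solve S x = d by Gaussian elimination with partial pivoting.
    aug = [s[i] + [d[i]] for i in range(m)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, m):
            f = aug[r][c] / aug[c][c]
            for j in range(c, m + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (aug[i][m] - tail) / aug[i][i]
    return sum(di * xi for di, xi in zip(d, x))

# Made-up scores for eight rescored papers on a 0-2 scale.
time_a = [0, 0, 1, 1, 2, 2, 1, 0]
time_b = [0, 1, 1, 2, 2, 2, 1, 0]
table = cross_table(time_a, time_b, categories=[0, 1, 2])
print(exact_agreement(table), cohen_kappa(table),
      paired_t(time_a, time_b), stuart_q(table))
```

The first two functions summarize how often the two scorings agree, while the last two ask whether the Time B score distribution has shifted relative to Time A. With only two score categories, the Q computed here reduces to McNemar's statistic.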
The findings indicate that the percent of exact agreement, the paired t-test, and Stuart’s Q showed high Type I error rates under a rescore design in which half of the rescore papers have a uniform score distribution and the other half have a score distribution proportional to the population papers at Time A. These Type I error rates were reduced under a rescore design in which all rescore papers have a score distribution proportional to the population papers at Time A. For this second rescore design, results indicate that the ability of the percent of exact agreement, Cohen’s kappa, and the paired t-test to detect rater effects varied across items, sample sizes, and types of rater effect. The only statistic that always detected every level of rater effect across items and frameworks was Stuart’s Q.
Although advances have been made in the automated scoring field, many testing programs still require humans to score constructed-response items. Previous research indicates that rater effects are common in constructed-response scoring. In testing programs that track trends in data across time, changes in scoring across time confound the measurement of change in student performance. Therefore, the study of methods that ensure rating consistency across time, such as trend scoring, is needed to ensure fairness and validity.

Rater drift

Rater effects

Trend scoring

Type I error and power analysis

Identifier	oai:union.ndltd.org:uiowa.edu/oai:ir.uiowa.edu:etd-8192
Date	01 May 2019
Creators	Abdalla, Widad
Contributors	Yarbrough, Donald B., Harris, Deborah J.
Publisher	University of Iowa
Source Sets	University of Iowa
Language	English
Detected Language	English
Type	dissertation
Format	application/pdf
Source	Theses and Dissertations
Rights	Copyright © 2019 Widad Abdalla
