Global ETD Search

1	Detecting rater effects in trend scoring
Abdalla, Widad
01 May 2019

Trend scoring is often used in large-scale assessments to monitor for rater drift when the same constructed response items are administered in multiple test administrations. In trend scoring, a set of responses from Time A is rescored by raters at Time B. The purpose of this study is to examine the ability of trend-monitoring statistics to detect rater effects in the context of trend scoring. The study examines the percent of exact agreement and Cohen’s kappa as interrater agreement measures, and the paired t-test and Stuart’s Q as marginal homogeneity measures. Data containing specific rater effects are simulated under two frameworks: the generalized partial credit model and the latent-class signal detection theory model.

The findings indicate that the percent of exact agreement, the paired t-test, and Stuart’s Q showed high Type I error rates under a rescore design in which half of the rescore papers have a uniform score distribution and the other half have a score distribution proportional to the population papers at Time A. These Type I error rates were all reduced under a second rescore design in which all rescore papers have a score distribution proportional to the population papers at Time A. Under this second design, the ability of the percent of exact agreement, Cohen’s kappa, and the paired t-test to detect rater effects varied across items, sample sizes, and types of rater effect. The only statistic that detected every level of rater effect across all items and frameworks was Stuart’s Q.

Although advances have been made in automated scoring, many testing programs still require humans to score constructed response items. Previous research indicates that rater effects are common in constructed response scoring. In testing programs that track trends across time, changes in scoring across time confound the measurement of change in student performance. The study of methods that ensure rating consistency across time, such as trend scoring, is therefore needed to ensure fairness and validity.

Keywords: Rater drift; Rater effects; Trend scoring; Type I error and power analysis
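The four trend-monitoring statistics named in the abstract can be sketched in a few lines of Python. This is an illustration only, not the thesis's code: the function names and the sample scores are hypothetical, and it assumes that "Stuart’s Q" refers to Stuart's marginal homogeneity statistic (often called the Stuart-Maxwell test), which is compared against a chi-square distribution with k - 1 degrees of freedom for k score categories.

```python
from collections import Counter
from math import sqrt

def exact_agreement(a, b):
    """Proportion of responses given the same score at Time A and Time B."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Chance-corrected agreement (undefined if chance agreement equals 1)."""
    n = len(a)
    po = exact_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

def paired_t(a, b):
    """Paired t statistic for the mean score difference (Time B minus Time A)."""
    d = [y - x for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / sqrt(var / n)

def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination (A assumed non-singular)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def stuart_q(a, b, categories):
    """Stuart-Maxwell marginal homogeneity statistic, Q = d' S^-1 d."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = [[0] * k for _ in range(k)]          # k x k cross-classification
    for x, y in zip(a, b):
        n[idx[x]][idx[y]] += 1
    row = [sum(n[i]) for i in range(k)]
    col = [sum(n[i][j] for i in range(k)) for j in range(k)]
    # Drop the last category: the marginal differences sum to zero.
    d = [row[i] - col[i] for i in range(k - 1)]
    S = [[row[i] + col[i] - 2 * n[i][i] if i == j else -(n[i][j] + n[j][i])
          for j in range(k - 1)] for i in range(k - 1)]
    z = solve(S, d)
    return sum(di * zi for di, zi in zip(d, z))

# Hypothetical scores on a 0-2 scale for ten rescored responses.
time_a = [0, 1, 1, 2, 2, 2, 1, 0, 2, 1]
time_b = [0, 1, 2, 2, 2, 1, 1, 1, 2, 2]
print("exact agreement:", exact_agreement(time_a, time_b))
print("Cohen's kappa:  ", cohen_kappa(time_a, time_b))
print("paired t:       ", paired_t(time_a, time_b))
print("Stuart's Q:     ", stuart_q(time_a, time_b, [0, 1, 2]))
```

The first two functions measure agreement between the two scorings of the same responses, while the last two ask whether the overall score distribution shifted between Time A and Time B. With two score categories, `stuart_q` reduces to McNemar's statistic.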
2	Detecting Rater Centrality Effect Using Simulation Methods and Rasch Measurement Analysis
Yue, Xiaohui
01 September 2011

This dissertation illustrates how to detect the rater centrality effect in a simulation study that approximates data collected in large-scale performance assessment settings. It addresses three research questions: (1) which of several centrality-detection indices are most sensitive to the difference between effect raters and non-effect raters; (2) how accurate (and inaccurate), in terms of Type I error rate and statistical power, each centrality-detection index is in flagging effect raters; and (3) how the features of the data collection design (i.e., the independent variables, including the level of centrality strength, the double-scoring rate, and the number of raters and ratees) influence the accuracy of rater classifications by these centrality-detection indices.

The results reveal that the measure-residual correlation, the expected-residual correlation, and the standardized deviation of assigned scores perform better than the point-measure correlation. The mean-square fit statistics, traditionally viewed as potential indicators of rater centrality, perform poorly at differentiating central raters from normal raters. Along with the rater slope index, the mean-square fit statistics did not appear to be sensitive to the rater centrality effect. All of these indices provided reasonable protection against Type I errors when all responses were double scored, and higher statistical power was achieved when responses were 100% double scored than when only 10% were double scored. To balance Type I error against statistical power, I recommend the measure-residual correlation and the expected-residual correlation for detecting the centrality effect. I suggest using the point-measure correlation only when responses are 100% double scored.

The four parameters evaluated in the experimental simulations had different impacts on the accuracy of rater classification. The results show that improving the classification accuracy for non-effect raters may come at the cost of reducing the classification accuracy for effect raters. Simple guidelines summarized from the analyses, covering the expected impact on classification accuracy when a higher-order interaction exists, offer a glimpse of the "pros" and "cons" of adjusting the magnitude of the four experimental parameters.

Ph. D.

Keywords: ANOVA; Rasch measurement; centrality; rater effects; Type I and Type II errors; performance assessment; statistical power; logistic regression
