The number of performance assessment tasks has increased over the years because some constructs are best assessed in this manner. Though there are benefits to using performance tasks, there are also drawbacks. The problems with performance assessments include scoring time, scoring costs, and problems with human raters. One solution for overcoming these drawbacks is the use of automated scoring programs. Several automated scoring programs are designed to score essays and other constructed responses. Much research has been conducted on these programs by their developers; however, relatively little research has used external criteria to evaluate them. The purpose of this study was to evaluate two popular automated scoring programs. The programs were evaluated with respect to several criteria: the percentage of exact and adjacent agreements, kappa coefficients, correlations, differences in score distributions, discrepant scoring, analysis of variance, and generalizability theory. The scoring results from the two automated scoring programs were compared to the scores from operational scoring and from an expert panel of judges. The results indicated close similarity between the two scoring programs in how they scored the essays. However, the results also revealed some subtle but important differences between the programs. One program exhibited higher correlations and agreement indices with both the operational and expert committee scores, although the magnitude of the difference was small. Differences were also noted in the scores assigned to fake essays designed to trick the programs into providing a higher score. These results were consistent for both the full set of 500 scored essays and the subset of essays reviewed by the expert committee. Overall, both automated scoring programs performed well against the criteria; however, one program did slightly better. The G-studies indicated that there were small differences among the raters and that the amount of error in the models was reduced as the number of human raters and automated scoring programs was increased. In summary, the results suggest that automated scoring programs can approximate scores given by human raters, but they differ in their proximity to operational and expert scores and in their ability to identify dubious essays.
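For readers unfamiliar with the agreement criteria named in the abstract, the sketch below (illustrative only, not taken from the dissertation) shows one way exact agreement, adjacent agreement, and Cohen's kappa might be computed between human and automated scores. The score data and the 1-6 rubric scale are hypothetical.

```python
# Illustrative sketch (hypothetical data): agreement indices between two
# sets of essay scores, assuming integer scores on a 1-6 rubric scale.
from collections import Counter

def agreement_indices(scores_a, scores_b):
    """Return exact agreement, adjacent (within one point) agreement,
    and Cohen's kappa for two parallel lists of scores."""
    n = len(scores_a)
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(scores_a, scores_b)) / n

    # Cohen's kappa: exact agreement corrected for the agreement expected
    # by chance from each rater's marginal score distribution.
    marg_a = Counter(scores_a)
    marg_b = Counter(scores_b)
    p_e = sum((marg_a[s] / n) * (marg_b[s] / n)
              for s in set(scores_a) | set(scores_b))
    kappa = (exact - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return exact, adjacent, kappa

# Hypothetical human vs. automated scores for ten essays
human = [4, 3, 5, 2, 4, 6, 3, 4, 5, 2]
machine = [4, 3, 4, 2, 5, 6, 3, 3, 5, 2]
print(agreement_indices(human, machine))
```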
Identifier | oai:union.ndltd.org:UMASS/oai:scholarworks.umass.edu:dissertations-2321 |
Date | 01 January 2004 |
Creators | Khaliq, Shameem Nyla |
Publisher | ScholarWorks@UMass Amherst |
Source Sets | University of Massachusetts, Amherst |
Language | English |
Detected Language | English |
Type | text |
Source | Doctoral Dissertations Available from Proquest |