
Integrated listening-to-write assessments: an investigation of score generalizability and raters’ decision-making processes

In measuring second language learners’ writing proficiency, test takers’ performance on a particular assessment task is evaluated by raters using a set of criteria to generate writing scores. Teachers, students, and parents then use these scores to make inferences about test takers’ performance levels in real-life writing situations. To examine the accuracy of such inferences, it is imperative to investigate the sources of measurement error involved in writing scores. It is also important to ensure rater consistency, both within a single rater and between raters, to provide evidence that the scores are valid indicators of the tested constructs.
This mixed methods research addressed the validity of integrated listening-to-write (L-W) scores. More specifically, it examined the generalizability of L-W scores, raters’ decision-making processes, and raters’ scoring challenges. A total of 198 high school English learners in Taiwan completed up to two L-W tasks, each of which required them to listen to an academic lecture and respond to a related writing prompt in English. Nine raters with experience teaching English evaluated each student’s written materials using a holistic scale.
This study employed a univariate two-facet random effects generalizability study (p × t × r) to investigate the effects of tasks and raters on score variance. Subsequent decision studies (p × T × R) estimated the standard error of measurement and generalizability coefficients. Post-rating stimulated recall interview data were analyzed qualitatively to explore raters’ alignment with the rating scale descriptors, their decision-making behaviors, and their scoring challenges.
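For context, in a random p × T × R decision study the relative error variance and generalizability coefficient are conventionally computed from the G-study variance components as sketched below; the notation follows standard generalizability theory rather than values reported in this abstract, with n_t and n_r denoting the numbers of tasks and raters assumed in the decision study:

\sigma^2_\delta = \frac{\sigma^2_{pt}}{n_t} + \frac{\sigma^2_{pr}}{n_r} + \frac{\sigma^2_{ptr,e}}{n_t\, n_r}, \qquad E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}

Because the person-by-task component σ²_pt is divided only by n_t, a large task interaction shrinks the error variance and raises Eρ² more quickly when tasks, rather than raters, are added, which is the pattern reported below.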
The results indicated that the majority of score variance was explained by differences in test takers’ academic writing proficiency. The raters were similar in stringency and contributed little to score variance. Because of the relatively large person-by-task interaction effect, increasing the number of tasks, rather than the number of raters, resulted in a much lower degree of error and a higher degree of score generalizability. An assessment procedure that achieves an acceptable level of score generalizability would be to administer two L-W tasks, each scored by two raters.
When evaluating written materials for the L-W tasks, the nine raters primarily focused on the content of the essays and paid less attention to language-related features. The raters did not give equal consideration to all aspects of essay quality described in the holistic rubric. The most prominent scoring challenges included (1) assigning a holistic score while balancing students’ listening comprehension skills against their writing proficiency and (2) assessing the degree to which students successfully reproduced the lecture content. The findings of this study have practical and theoretical implications for integrated writing assessment of high school EFL learners.

Identifier: oai:union.ndltd.org:uiowa.edu/oai:ir.uiowa.edu:etd-7564
Date: 01 May 2018
Creators: Ohta, Renka
Contributors: Plakans, Lia
Publisher: University of Iowa
Source Sets: University of Iowa
Language: English
Type: dissertation
Format: application/pdf
Source: Theses and Dissertations
Rights: Copyright © 2018 Renka Ohta
