1 |
Multiple Choice and Constructed Response Tests: Do Test Format and Scoring Matter? / Kastner, Margit; Stangl, Barbara (10 March 2011)
Problem Statement: Nowadays, multiple choice (MC) tests are very common and replace many constructed response (CR) tests. However, the literature reveals no consensus on whether the two test formats are equally suitable for measuring students' ability or knowledge. This may be because studies comparing test formats often mention neither the type of MC question nor the scoring rule used. Hence, educators have no guidelines on which test format or scoring rule is appropriate.
Purpose of Study: The study focuses on the comparison of CR and MC tests. More precisely, short answer questions are contrasted with equivalent MC questions with multiple responses, which are graded with three different scoring rules.
Research Methods: An experiment was conducted based on three instruments: a CR test and an MC test using similar stems to ensure that the questions are of an equivalent level of difficulty, which enables comparison of the scores students achieved in the two forms of examination, plus a questionnaire handed out for further insight into students' learning strategies, test preferences, motivation, and demographics. In contrast to previous studies, the present study applies the many-facet Rasch measurement approach to the data, which improves the reliability of the assessment and allows small datasets to be used.
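For reference, a generic statement of the many-facet Rasch model (following Linacre, 1989) is sketched below; the specific facets used in this study (e.g., student, item, and scoring condition) are an illustrative assumption, since the abstract does not spell them out.

```latex
% Many-facet Rasch model for a rating in category k of a rating scale
% (generic form; the facets named here are illustrative assumptions):
% B_n = ability of examinee n, D_i = difficulty of item i,
% C_j = severity of rater or scoring condition j, F_k = step difficulty of category k.
\[
  \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]
```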
Findings: Results indicate that CR tests are equal to MC tests with multiple responses if Number Correct (NC) scoring is used. An explanation seems straightforward: the grader of the CR tests did not penalize wrong answers and rewarded partially correct answers, i.e., s/he applied the same logic as NC scoring. The other scoring methods, such as the All-or-Nothing or University-Specific rule, neither reward partial knowledge nor penalize guessing. These methods are therefore stricter than NC scoring or CR tests and cannot be used interchangeably with them.
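As a rough illustration of how these rules differ, the sketch below scores a single multiple-response item under Number Correct and All-or-Nothing logic. The point values are simplified assumptions, not the authors' exact rubric, and the University-Specific rule is not shown because it is not described in the abstract.

```python
# Illustrative scoring of one multiple-response MC item (5 answer options,
# several of them correct). These rules are simplified assumptions.

def number_correct(key: set, chosen: set, options: int = 5) -> float:
    """Number Correct: credit for each option classified correctly
    (selected-and-true or unselected-and-false); partial knowledge earns
    partial credit."""
    correct_classifications = sum(
        1 for opt in range(options) if (opt in key) == (opt in chosen)
    )
    return correct_classifications / options

def all_or_nothing(key: set, chosen: set) -> float:
    """All-or-Nothing: full credit only for the exactly correct response set."""
    return 1.0 if chosen == key else 0.0

key = {0, 2, 3}        # the 1st, 3rd, and 4th options are correct
student = {0, 2}       # partial knowledge: two of three correct, no wrong picks
print(number_correct(key, student))   # 0.8 -> partial credit
print(all_or_nothing(key, student))   # 0.0 -> no credit
```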
Conclusions: CR tests can be replaced by MC tests with multiple responses if NC scoring is used, since the multiple-response format measures more complex thinking skills than conventional MC questions. Hence, educators can take advantage of lower grading costs, consistent grading, the absence of scoring bias, and greater coverage of the syllabus, while students benefit from timely feedback. (authors' abstract)
2 |
The Development of Second Language Writing Across the Lexical and Communicative Dimensions of Performance / Denison, George Clinton, 0009-0003-0615-3489 (08 1900)
Enabling learners to successfully use their second language (L2) in meaningful ways is a critical goal of instruction. Ultimately, most learners want to meet the L2 demands of the contexts in which they will use the language. To accomplish this, learners must develop linguistic knowledge and apply it in a manner that is contextually appropriate considering the requirements of the task at hand. In other words, learners must develop their L2 across both linguistic and communicative aspects to use it successfully. However, there is a paucity of L2 research in which linguistic and communicative performance have been simultaneously investigated.
In this study, I investigated the development of English as a Foreign Language (EFL) learners’ L2 written production across the lexical and communicative dimensions of performance. The study involved 290 Japanese participants recruited from 20 intact EFL classes at four tertiary educational institutions in Japan, representing a wide range of L2 English proficiency levels and instructional contexts. The study used a non-intervention, repeated-measures design, allowing for the general development of participants’ L2 English writing to be examined. Participants’ L2 English written responses were collected using four argumentative writing tasks, which were administered at the beginning and end of the first and second semesters in a counterbalanced manner. Although a total of 952 responses were collected, responses shorter than 50 words were removed, leaving 775 responses written by 250 participants. The 775 responses included in the primary analyses constituted a corpus of 89,122 words.
Twenty-four single-word, multi-word, and lexical variation measures were calculated for the responses and subjected to an exploratory factor analysis. Seven latent lexical factors were identified in the data: High-Frequency Trigrams, Lexical Clarity, High-Frequency Bigrams, Lexical Variation, Lexical Breadth, Low-Frequency N-Grams, and Directional Association Strength of N-Grams. In addition, raters scored the responses for functional adequacy (i.e., Content, Comprehensibility, Organization, and Task Completion) and Lexical Appropriateness. The scores were analyzed using many-facet Rasch measurement, which converted the ordinal scores into equal-interval measures that had been adjusted for the influences of task and rater severity. The lexical factor scores and communicative Rasch measures were examined using linear mixed modeling, dominance analysis, and latent growth modeling to investigate (a) if and how lexical development had occurred, (b) if and how communicative development had occurred, (c) the relationships between the lexical and communicative components, and (d) the relationship between lexical and communicative growth.
For the lexical factors, the results indicated that Directional Association Strength of N-Grams scores increased in a linear manner. Directional Association Strength of N-Grams comprised ΔP scores, which indicate the degree to which the first word(s) are predictive of the following word(s) in two- and three-word combinations. Thus, the results indicated that participants’ use of multi-word expressions improved. On the other hand, Lexical Clarity, which comprised imageability, concreteness, meaningfulness, and hypernymy scores, showed quadratic change, with scores improving and then regressing. Thus, the findings provide evidence of differing developmental trends for lexical aspects of L2 writing.
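For context, ΔP for a two-word sequence is conventionally defined as below (e.g., Ellis, 2006; Gries, 2013); the dissertation's exact operationalization may differ in detail.

```latex
% Directional association for the bigram (w1, w2), read left to right:
% how much more likely w2 is when w1 has just occurred than when it has not.
\[
  \Delta P(w_2 \mid w_1) \;=\; P(w_2 \mid w_1) \;-\; P(w_2 \mid \neg w_1)
  \;=\; \frac{a}{a+b} \;-\; \frac{c}{c+d}
\]
% where a = freq(w1 w2), b = freq(w1 followed by anything else),
% c = freq(w2 preceded by anything other than w1), d = all remaining bigrams.
```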
For the communicative measures, the results indicated that Comprehensibility, Organization, and Lexical Appropriateness changed substantially over time. Improvement of Task Completion was dependent on the university context, and little change was observed for Content. Lexical Appropriateness showed the most improvement, with evidence of both linear and quadratic change. Bias interaction analyses also confirmed the presence of linear and quadratic trends for the communicative measures. Thus, the findings provide evidence of differing developmental trends for communicative aspects of L2 writing.
For the relationships between the lexical and communicative components, the results indicated that two lexical factors were of key importance: Lexical Variation and Directional Association Strength of N-Grams. Lexical Variation was found to predict Content, Organization, and Task Completion; and Directional Association Strength of N-Grams was found to predict Comprehensibility and Lexical Appropriateness. The findings suggest that L2 performance assessments should not conflate measurement of lexical variation and use of multi-word expressions because they diverge in terms of the communicative outcomes they predict.
The results also indicated a positive relationship between lexical and communicative development. A parallel process latent growth model was constructed that related lexical and communicative growth. The paths tested in the model suggest participants who had lower initial communicative scores were able to increase their lexical scores at faster rates, which in turn leveraged communicative growth. The findings highlight the potential for learners to improve their communicative ability through a targeted focus on multi-word expressions. / Applied Linguistics
3 |
The Effect of Raters and Rating Conditions on the Reliability of the Missionary Teaching Assessment / Ure, Abigail Christine (17 December 2010)
This study investigated how 2 different rating conditions, the controlled rating condition (CRC) and the uncontrolled rating condition (URC), affected rater behavior and the reliability of a performance assessment (PA) known as the Missionary Teaching Assessment (MTA). The CRC gives raters the capability to manipulate (pause, rewind, fast-forward) video recordings of an examinee's performance as they rate, while the URC does not (i.e., the rater must watch the recording straight through without making any manipulations). Few studies have compared the effect of these two rating conditions on ratings. Ryan et al. (1995) analyzed the impact of the CRC and URC on the accuracy of ratings, but few, if any, have analyzed their impact on reliability. The Missionary Teaching Assessment is a performance assessment used to assess the teaching abilities of missionaries for the Church of Jesus Christ of Latter-day Saints at the Missionary Training Center. In this study, 32 missionaries taught a 10-minute lesson that was recorded and later rated by trained raters based on a rubric containing 5 criteria. Each teaching sample was rated by 4 of 6 raters; 2 of the 4 ratings were produced using the CRC and 2 using the URC. Camtasia Studio (2010), a screen-capture software package, was used to record when raters used any type of manipulation. The recordings were used to analyze whether raters manipulated the recordings and, if so, when and how frequently. Raters also performed think-alouds following a random sample of the ratings performed using the CRC. These data revealed that when raters had access to the CRC they took advantage of it the majority of the time, but they differed in how frequently they manipulated the recordings. The CRC did not add an exorbitant amount of time to the rating process. The reliability of the ratings was analyzed using both generalizability theory (G theory) and many-facet Rasch measurement (MFRM). Results indicated that, in general, the reliability of the ratings obtained from the 2 rating conditions was not statistically significantly different. The implications of these findings are addressed.
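By way of illustration, the simplest generalizability coefficient for a persons-by-raters design is shown below; the actual G-study here also involved criteria and rating conditions as facets, so this is only a reduced sketch.

```latex
% Relative generalizability coefficient for a one-facet p x r design:
% sigma^2_p    = variance due to examinees (missionaries),
% sigma^2_pr,e = residual person-by-rater variance (confounded with error),
% n_r          = number of raters whose scores are averaged.
\[
  E\rho^2 \;=\; \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_{pr,e}}{n_r}}
\]
```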
4 |
Rubric Rating with MFRM vs. Randomly Distributed Comparative Judgment: A Comparison of Two Approaches to Second-Language Writing Assessment / Sims, Maureen Estelle (01 April 2018)
The purpose of this study is to explore a potentially more practical approach to direct writing assessment using computer algorithms. Traditional rubric rating (RR) is a common yet highly resource-intensive evaluation practice when performed reliably. This study compared the traditional rubric model of ESL writing assessment and many-facet Rasch modeling (MFRM) to comparative judgment (CJ), the new approach, which shows promising results in terms of reliability and validity. We employed two groups of raters, novice and experienced, and used essays that had been previously double-rated, analyzed with MFRM, and selected with fit statistics. We compared the results of the novice and experienced groups against the initial ratings using raw scores, MFRM, and a modern form of CJ, randomly distributed comparative judgment (RDCJ). Results showed that the CJ approach, though not appropriate for all contexts, can be valid and as reliable as RR while requiring less time to generate procedures, train and norm raters, and rate the essays. Additionally, the CJ approach is more easily transferable to novel assessment tasks while still providing context-specific scores. Results from this study will not only inform future studies but can help guide ESL programs to determine which rating model best suits their specific needs.
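Comparative judgment scores are typically derived by fitting a Bradley-Terry-style model to the pairwise decisions; the study's RDCJ engine is not described in detail here, so the sketch below (using Hunter's 2004 MM updates) is only an assumed, minimal illustration of the idea.

```python
# A minimal sketch of scaling comparative-judgment data with a Bradley-Terry
# model via MM updates. This is an assumed illustration, not the RDCJ engine
# actually used in the study.
from collections import defaultdict
import math

def bradley_terry(judgments, n_iter=200):
    """judgments: list of (winner, loser) essay-ID pairs from pairwise judging."""
    essays = {e for pair in judgments for e in pair}
    wins = defaultdict(int)          # number of comparisons each essay won
    pair_counts = defaultdict(int)   # number of times each pair was compared
    for winner, loser in judgments:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    quality = {e: 1.0 for e in essays}       # initial quality parameters
    for _ in range(n_iter):
        updated = {}
        for i in essays:
            denom = 0.0
            for j in essays:
                if j == i:
                    continue
                pair = frozenset((i, j))
                if pair in pair_counts:
                    denom += pair_counts[pair] / (quality[i] + quality[j])
            updated[i] = wins[i] / denom if denom > 0 else quality[i]
        total = sum(updated.values())
        quality = {e: v * len(essays) / total for e, v in updated.items()}
    # Report on a logit scale, loosely comparable to Rasch-style measures.
    return {e: math.log(v) for e, v in quality.items()}

# Toy usage: each essay wins and loses at least once so estimates stay finite.
decisions = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "B"), ("B", "A")]
print(bradley_terry(decisions))
```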
5 |
[en] PROPOSAL OF A METHODOLOGY FOR THE PRODUCTION AND INTERPRETATION OF EDUCATIONAL MEASURES IN LARGE-SCALE ASSESSMENT BY USING RASCH MODELING WITH TWO OR MORE FACETS / [pt] PROPOSTA DE UMA METODOLOGIA PARA A PRODUÇÃO E INTERPRETAÇÃO DE MEDIDAS EDUCACIONAIS EM AVALIAÇÃO EM LARGA ESCALA POR MEIO DA UTILIZAÇÃO DA MODELAGEM RASCH COM DUAS OU MAIS FACETAS / WELLINGTON SILVA (18 February 2020)
[en] In this thesis, we worked with Rasch modeling, aiming to present alternatives that are more practical and of better measurement quality for two different scenarios. The first relates to the fact that measuring knowledge is very complex and difficult to understand for professionals outside the field of psychometrics. Through experiments involving models of the Rasch family, we present the applicability and potential of this modeling to meet the new demands of large-scale assessment in Brazil. The second scenario relates to the attempt to measure, as impartially as possible, written production items, in which the grade received by students is influenced by the subjectivity (severity) of the raters; that is, lenient raters benefit students and severe raters penalize them.
In view of these two scenarios, this thesis has the following objectives: (i) to bring to the evaluations carried out in Brazil a simpler mathematical model than the one currently adopted, aiming at better communication with teachers; and (ii) to make it possible to operate not only with multiple-choice items, scored automatically, but also with written production items, in which the subjectivity (severity) of the raters is controlled by the psychometric model, generating measures of better quality. To this end, many-facet Rasch measurement was used, and practical cases were employed to show its advantages over other methodologies currently adopted in the country. To reach the first objective, we compared many-facet Rasch measurement with the three-parameter logistic model in a study of context effects in tests composed of different booklet designs with more than one discipline evaluated per booklet; for the second, we compared proficiency measures obtained through many-facet Rasch measurement with the average scores of the double corrections given by the raters to students on essay-type tests. From the results, we conclude that many-facet Rasch measurement can be used as an alternative to, or alongside, evaluations that use the three-parameter logistic model, producing results that are faster to obtain and easier for teachers to understand, and that, in the case of essays, the proficiency measures obtained by many-facet Rasch measurement showed better reliability and validity indicators than scores obtained through Classical Test Theory, and are therefore a more viable alternative for this type of evaluation. The thesis concludes by presenting situations in which the methodologies studied can be employed.
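For readers comparing the two item response models mentioned, their standard forms for a correct response to dichotomous item i by examinee n are sketched below; the facets extension simply adds further additive terms (e.g., rater severity) to the Rasch logit, as in the many-facet formula given earlier.

```latex
% Rasch model (one parameter per item): theta_n = examinee ability, b_i = item difficulty.
\[
  P(X_{ni}=1) \;=\; \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
\]
% Three-parameter logistic (3PL) model: a_i = discrimination, c_i = pseudo-guessing.
\[
  P(X_{ni}=1) \;=\; c_i + (1 - c_i)\,\frac{\exp\!\big(a_i(\theta_n - b_i)\big)}{1 + \exp\!\big(a_i(\theta_n - b_i)\big)}
\]
```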