31

Effects of performance appraisal purpose and rater expertise on rating error

Weyhrauch, William S. January 1900
Master of Science / Department of Psychology / Satoris S. Culbertson / Performance appraisals are an important component of any organization's performance management system. They require supervisors to observe and retain information regarding employee performance. This study sought to investigate the effects of appraisal purpose in this process. This extension and replication of Williams, DeNisi, Meglino, and Cafferty's (1986) lab study of appraisal purpose investigated whether designating an employee for a positive outcome results in lenient performance ratings, and vice versa for a negative designation. This outcome would indicate assimilation, whereby the designation acts as an anchor, creating bias in the direction of the anchor. However, the negative and positive designations may both result in leniency, indicating a universal tendency toward leniency when memory for performance is limited. Furthermore, I investigated whether making a deservedness rating for each employee would result in less lenient or severe ratings relative to the designation conditions. Finally, I investigated whether self-reported rater expertise would moderate the assimilation effect. A total of 108 undergraduate students from a large Midwestern university viewed confederates performing cardiopulmonary resuscitation (CPR) on a dummy and were instructed to observe performance in order to make a designation (positive or negative) or a deservedness rating, or were given no instructions (control). They made an initial decision and were then asked to return two days later and rate each confederate's performance again. Consistent with previous findings, raters making positive designations tended to give lenient ratings relative to other conditions. Furthermore, as expected, those making negative designations gave relatively severe ratings. Finally, the results also partially supported my expectation that rater expertise in the performance domain moderates the biasing effects of appraisal purpose. Implications for practice and recommendations for future research are discussed.
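For readers who want to see what this kind of condition comparison looks like in practice, here is a minimal Python sketch. The rating scale, condition means, and sample values are invented for illustration; the study's actual data and analysis are not reproduced.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings (1-7 scale) from three appraisal-purpose conditions;
# the real study had 108 raters observing CPR performance.
positive_designation = np.array([6.1, 5.8, 6.4, 5.9, 6.2])
negative_designation = np.array([3.2, 3.8, 3.1, 3.5, 3.4])
control = np.array([4.6, 4.9, 4.4, 4.7, 4.5])

# One-way ANOVA: do mean ratings differ across purpose conditions?
f_stat, p_value = stats.f_oneway(positive_designation,
                                 negative_designation, control)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Leniency/severity expressed relative to the control condition's mean.
baseline = control.mean()
print("positive-condition leniency:", positive_designation.mean() - baseline)
print("negative-condition leniency:", negative_designation.mean() - baseline)
```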
32

Further Evaluating the Effect of Behavioral Observability and Overall Impressions on Rater Agreement: A Replication Study

Sizemore, Patrick 01 May 2015
This replication study sought to analyze the effects of behavioral observability and overall impressions on rater agreement, as recently examined by Roch, Paquin, and Littlejohn (2009) and Scott (2012). Results from the study performed by Roch et al. indicated that raters are more likely to agree when items are either more difficult to rate or less observable. In the replication study conducted by Scott, the results did not support the relationship Roch et al. found between observability and rater agreement, but did support the relationship previously found between item difficulty and rater agreement. The four objectives of this replication study were to determine whether rater agreement is negatively related to item observability (Hypothesis 1) and positively related to difficulty (Hypothesis 2), and whether item performance ratings are closer to overall impressions when items are less observable (Hypothesis 3) and more difficult to rate (Hypothesis 4). The sample comprised 152 undergraduate students tasked with providing performance ratings of an individual depicted in a video of a discussion group. Results indicated that agreement was negatively correlated with both observability (supporting Hypothesis 1) and difficulty (not supporting Hypothesis 2), and that ratings were closer to overall impressions when items were less observable (supporting Hypothesis 3) but not when items were more difficult to rate (not supporting Hypothesis 4).
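A rough sketch of how an item-level agreement/observability correlation can be computed. The data are simulated, and operationalizing agreement as the negative of the across-rater standard deviation is an assumption made here for illustration, not the measure used in the study.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical data: 20 items rated by 10 raters on a 5-point scale,
# plus an observability rating per item (all values are invented).
ratings = rng.integers(1, 6, size=(10, 20)).astype(float)
observability = rng.uniform(1, 5, size=20)

# Per-item agreement: negative across-rater standard deviation,
# so a higher value means raters agreed more on that item.
agreement = -ratings.std(axis=0)

rho, p = spearmanr(agreement, observability)
print(f"agreement vs. observability: rho = {rho:.2f}, p = {p:.3f}")
```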
33

The impact of rater characteristics on oral assessments of second language proficiency

Su, Yi-Wen 10 October 2014
This literature review sets out to revisit the studies exploring the impact of rater characteristics on oral language assessments. Three categories of rater background are identified: occupation, accent familiarity, and native language; each is addressed in turn in the following sections. The results show that no consensus has yet been reached on how raters' occupational background, linguistic background, or native-speaker status affects their ratings. However, this review highlights current testing situations, raises limitations of previous studies, provides implications for both teachers and raters, and aims to shed light on future research.
34

Personality and Rater Leniency: Comparison of Broad and Narrow Measures of Conscientiousness and Agreeableness

Grahek, Myranda 05 1900
Performance appraisal ratings provide the basis for numerous employment decisions, including retention, promotion, and salary increases. Thus, understanding the factors affecting the accuracy of these ratings is important to organizations and employees. Leniency, one rater error, is a tendency to assign higher ratings in appraisal than actual performance warrants. This study examined how the personality factors Agreeableness and Conscientiousness relate to rater leniency, and whether narrower facets of personality account for more variance in rater leniency than the broad factors. The study used undergraduates' (n = 226) evaluations of instructor performance to test its hypotheses. In addition to the personality variables, students' social desirability tendency and attitudes toward the instructor were predicted to be related to rater leniency. Partial support for the study's hypotheses was found. The Agreeableness factor and three of its facets (Trust, Altruism, and Tender-Mindedness) were positively related to rater leniency, as predicted. The hypotheses that the Conscientiousness factor and three of its facets (Order, Dutifulness, and Deliberation) would be negatively related to rater leniency were not supported. In the current sample, the single narrow facet Altruism accounted for more variance in rater leniency than the broad Agreeableness factor. While social desirability did not account for a significant amount of variance in rater leniency, attitude toward the instructor had a significant positive relationship, accounting for the largest amount of variance in rater leniency.
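The broad-versus-narrow comparison amounts to comparing the variance a factor explains against the variance one of its facets explains. A hedged sketch with simulated data follows; the effect sizes and the correlation between factor and facet are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 226  # matches the study's sample size; the data themselves are simulated

# Simulated predictors: a broad Agreeableness factor and its Altruism facet.
agreeableness = rng.normal(size=n)
altruism = 0.7 * agreeableness + rng.normal(scale=0.7, size=n)
leniency = 0.3 * altruism + rng.normal(size=n)

# Compare R^2: broad factor alone vs. the narrow facet alone.
for name, x in [("Agreeableness", agreeableness), ("Altruism", altruism)]:
    model = sm.OLS(leniency, sm.add_constant(x)).fit()
    print(f"{name}: R^2 = {model.rsquared:.3f}")
```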
35

Inter-Rater Reliability of the Texas Teacher Appraisal System

Crain, John Allen 05 1900
The purpose of this study was to determine the inter-rater reliability of the Texas Teacher Appraisal System instrument. The performance indicators, criteria, domains, and total instrument were analyzed for inter-rater reliability. Five videotaped teaching episodes were viewed and scored by 557 to 881 school administrators trained to utilize the Texas Teacher Appraisal System. The fifty-five performance indicators were analyzed for simple percentage of agreement. The ten criteria, four performance domains, and the whole instrument were analyzed utilizing Kuder-Richardson Formula 20. Indicators were judged reliable if there was 75 percent or greater agreement on four of the five videotaped exercises. Criteria, domains, and the whole instrument were judged reliable if they yielded a Kuder-Richardson Formula 20 score of .75 or greater on four of the five exercises. Based on the findings of this study, the following conclusions were drawn: 1. Forty-eight of the fifty-five performance indicators were reliable in evaluating teacher performance. 2. Seven of the performance indicators were unreliable in evaluating teacher performance. 3. None of the ten performance criteria appeared to be reliable in evaluating teacher performance. 4. None of the four performance domains appeared to be reliable in evaluating teacher performance. 5. The whole instrument was reliable in evaluating teacher performance. 6. Reliability problems with the criteria and domains appeared to stem from the Kuder-Richardson Formula 20's underestimate of reliability.
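The reliability criterion above rests on the Kuder-Richardson Formula 20, KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / var(X)), where k is the number of items, p_i the proportion scoring 1 on item i, q_i = 1 - p_i, and var(X) the variance of total scores. A minimal sketch of the computation on invented dichotomous indicator data:

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """Kuder-Richardson Formula 20 for dichotomous (0/1) item scores.

    scores: raters x indicators matrix of 0/1 judgments.
    """
    k = scores.shape[1]                          # number of indicators
    p = scores.mean(axis=0)                      # proportion scoring 1 per indicator
    item_var = (p * (1 - p)).sum()               # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Invented example: 8 raters judging 10 dichotomous performance indicators.
# Purely random data, so the coefficient will come out low.
rng = np.random.default_rng(2)
data = (rng.random((8, 10)) > 0.4).astype(int)
print(f"KR-20 = {kr20(data):.3f}  (the study's reliability threshold was .75)")
```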
36

Multisource Feedback Leadership Ratings: Analyzing for Measurement Invariance and Comparing Rater Group Implicit Leadership Theories

Gower, Kim 07 May 2012
This research outlines a conceptual framework and data analysis process for examining multisource feedback (MSF) rater group differences from a leadership assessment survey, after testing the measures for equivalence. MSF gathers and compares ratings from supervisors, peers, followers, and the self, and is the predominant leadership assessment tool in the United States. The results of MSF determine significant professional outcomes such as leadership development opportunities, promotions, and compensation. An underlying belief behind the extensive use of MSF is that each rater group has a different set of implicit leadership theories (ILTs) it uses when assessing the leader, and therefore each group is able to contribute unique insight. If this were true, research findings would show rater-group consistency in leadership assessment outcomes, but they do not. A review of group comparison research reveals that most empirical MSF studies fail to perform preliminary data exploration, employ consistent models, or adequately test for measurement equivalence (ME); yet industry standards strongly suggest exploratory methods whenever data sets undergo changes, and misspecified models cause biased results. Finally, ME testing is critical to ascertain whether rater groups have similar conceptualizations of the factors and items in an MSF survey. If conceptual ME is not established, substantive group comparisons cannot be made. This study draws on the extant MSF, ILT, and ME literature and analyzes rater group data from a large, application-based MSF leadership database. After exploring the data and running the requisite measurement invariance (MI) tests, I found that the measures upheld measurement invariance and were suitable for group comparison. Additional tests of the substantive hypotheses found that significant mean differences did exist among certain rater groups and dimensions, but only direct report and peer groups were consistently significantly different in all four dimensions (analytical, interpersonal, courageous, and leadership effectiveness). Additionally, the interpersonal dimension was the most highly correlated with leadership effectiveness in all five rater groups. The overall findings of this study address the importance of MSF data exploration, offer alternative explanations for the disparate leadership MSF research findings to date, and question the application use of MSF tools in their current form.
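A full measurement invariance analysis requires multi-group confirmatory factor models (configural, metric, scalar), which is beyond a short sketch. The following illustrates only the downstream group-mean comparison the abstract reports, on invented dimension scores, using a plain t-test as a stand-in for the model-based comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Invented dimension scores for two MSF rater groups. A real analysis
# would only compare means after invariance has been established.
groups = {
    "direct_reports": rng.normal(3.9, 0.5, size=120),
    "peers": rng.normal(3.6, 0.5, size=120),
}

t, p = stats.ttest_ind(groups["direct_reports"], groups["peers"])
print(f"interpersonal dimension, direct reports vs peers: "
      f"t = {t:.2f}, p = {p:.4f}")
```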
37

Evaluierung bestehender Prüfungsmodalitäten in der Zahnärztlichen Vorprüfung und Implementierung neuer Prüfungsstrukturen / The evaluation of existing examination procedures of the dental preliminary exam and the implementation of a novel assessment tool

Ellerbrock, Maike 02 April 2019
No description available.
38

On Rank-invariant Methods for Ordinal Data

Yang, Yishen January 2017
Data from rating scale assessments have rank-invariant properties only, which means that the data represent an ordering but lack standardized magnitude, inter-categorical distances, and linearity. Even though the judgments are often coded with natural numbers, they are not really metric. The aim of this thesis is to further develop the nonparametric rank-based Svensson methods for paired ordinal data, which are based on the rank-invariant properties only. The thesis consists of five papers. In Paper I the asymptotic properties of the measure of systematic disagreement in paired ordinal data, the Relative Position (RP), and of the difference in RP between groups were studied. Based on the findings of asymptotic normality, two tests for analyses of change within group and between groups were proposed. In Paper II the asymptotic properties of rank-based measures, e.g. Svensson's measures of systematic disagreement and of additional individual variability, were discussed, and a numerical method for approximation was suggested. In Paper III the asymptotic properties of the measures for paired ordinal data discussed in Paper II were verified by simulations. Furthermore, the Spearman rank-order correlation coefficient (rs) and Svensson's augmented rank-order agreement coefficient (ra) were compared. By demonstrating how and why they differ, the paper emphasizes that they measure different things. In Paper IV the test proposed in Paper I for comparing systematic changes in paired ordinal data between two groups was compared with other nonparametric tests for group changes, under different approaches to categorising changes. The simulations reveal that the proposed test works better for small and unbalanced samples. Paper V demonstrates that rank-invariant approaches can also be used in the analysis of ordinal data from multi-item scales, an appealing and appropriate alternative to calculating sum scores.
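As a rough illustration of the rank-invariant idea, here is a sketch of a relative-position measure for paired ordinal data. The formula used, RP = P(Y > X) - P(X > Y) computed from the two marginal distributions, is a simplified reading of Svensson's measure and should be treated as an assumption, not a verified implementation:

```python
import numpy as np

def relative_position(x: np.ndarray, y: np.ndarray, categories: int) -> float:
    """Sketch of a relative-position measure for paired ordinal ratings.

    Assumes RP = P(Y > X) - P(X > Y) with X and Y drawn independently from
    the two marginal distributions, a simplified reading of Svensson's
    measure rather than a verified implementation.
    """
    px = np.bincount(x, minlength=categories + 1)[1:] / len(x)
    py = np.bincount(y, minlength=categories + 1)[1:] / len(y)
    p_y_gt_x = sum(py[j] * px[:j].sum() for j in range(categories))
    p_x_gt_y = sum(px[j] * py[:j].sum() for j in range(categories))
    return p_y_gt_x - p_x_gt_y

# Invented paired ratings on a 1-5 scale (e.g., before/after assessments).
rng = np.random.default_rng(4)
before = rng.integers(1, 6, size=50)
after = np.clip(before + rng.integers(0, 2, size=50), 1, 5)
print(f"RP = {relative_position(before, after, 5):.3f}")
```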
39

Investigating the Effects of Raters' Second Language Learning Background and Familiarity with Test-Takers' First Language on Speaking Test Scores

Zhao, Ksenia 01 March 2017
Prior studies suggest that raters' familiarity with test-takers' first language (L1) can be a potential source of bias in rating speaking tests. However, there is still no consensus among researchers on how and to what extent that familiarity affects scores. This study investigates raters' performance, focusing not only on how raters' second language (L2) proficiency level interacts with examinees' L1, but also on whether raters' teaching experience has any effect on the scores. Speaking samples of 58 ESL learners with L1s of Spanish (n = 30) and three Asian languages (Korean, n = 12; Chinese, n = 8; and Japanese, n = 8), at different levels of proficiency, were rated by 16 trained raters with varying levels of Spanish proficiency (Novice to Advanced) and different amounts of teaching experience (between one and over 10 semesters). The ratings were analyzed using Many-Facet Rasch Measurement (MFRM). The results suggest that extensive rater training can be quite effective: there was no significant effect on the scores of either raters' familiarity with examinees' L1 or raters' teaching experience. However, even after training, the raters still exhibited different degrees of leniency/severity. Therefore, the main conclusion of this study is that even trained raters may consistently rate differently. The recommendation is to (a) provide further rater training and calibration; and/or (b) use MFRM with fair averages to compensate for the variance.
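MFRM itself estimates rater severity on a logit scale and is not reproduced here; the sketch below illustrates the underlying idea of a "fair average" with a crude mean-deviation severity adjustment on invented, fully crossed scores (every rater scores every examinee):

```python
import numpy as np

# A full Many-Facet Rasch Measurement model estimates rater severity on a
# logit scale; this uses a simple mean-deviation stand-in to convey the
# idea of a "fair average" only. All values are simulated.
rng = np.random.default_rng(5)
true_ability = rng.normal(0, 1, size=58)   # 58 examinees, as in the study
severity = rng.normal(0, 0.4, size=16)     # 16 raters' leniency/severity
scores = (true_ability[None, :] - severity[:, None]
          + rng.normal(0, 0.3, (16, 58)))

grand_mean = scores.mean()
est_severity = grand_mean - scores.mean(axis=1)  # severe raters score low overall
fair = scores + est_severity[:, None]            # add back each rater's severity
print("raw examinee means (first 3):", np.round(scores.mean(axis=0)[:3], 2))
print("fair averages (first 3):", np.round(fair.mean(axis=0)[:3], 2))
```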
40

A comparability study on differences between scores of handwritten and typed responses on a large-scale writing assessment

Rankin, Angelica Desiree 01 July 2015
As the use of technology for personal, professional, and learning purposes increases, more and more assessments are transitioning from a traditional paper-based testing format to a computer-based one. During this transition, some assessments are offered in both paper and computer formats in order to accommodate examinees and testing center capabilities. Scores on the paper-based test are often intended to be directly comparable to the computer-based scores, but such claims of comparability are often unsupported by research specific to that assessment. Not only should the scores be examined for differences, but the thought processes used by raters while scoring those assessments should also be studied to better understand why raters might score response modes differently. Previous comparability literature can be informative, but more contemporary, test-specific research is needed in order to fully support the direct comparability of scores. The goal of this thesis was to form a more complete understanding of why analytic scores on a writing assessment might differ, if at all, between handwritten and typed responses. A representative sample of responses to the writing composition portion of a large-scale high school equivalency assessment was used. Six trained raters analytically scored approximately six hundred examinee responses each. Half of those responses were typed, and the other half were the transcribed handwritten duplicates. Multiple methods were used to examine why differences between response modes might exist. A MANOVA framework was applied to examine score differences between response modes, and systematic analyses of think-alouds and interviews were used to explore differences in rater cognition. The results of these analyses indicated that response mode was of no practical significance, meaning that domain scores did not notably depend on whether a response was presented as typed or handwritten. Raters, on the other hand, had a more substantial effect on scores. Comments from the think-alouds and interviews suggest that, while the scores were not affected by response mode, raters tended to consider certain aspects of typed responses differently than handwritten responses. For example, raters treated typographical errors differently from other conventional errors when scoring typed responses, but not while scoring the handwritten duplicates. Raters also indicated that they preferred scoring typed responses over handwritten ones, but felt they could overcome their personal preferences to score both response modes similarly. Empirical investigations of the comparability of scores, combined with the analysis of raters' thought processes, helped to provide a more evidence-based answer to the question of why scores might differ between response modes. Such information could be useful for test developers when making decisions about which mode options to offer and how best to train raters to score such assessments. The design of this study could itself be useful for testing organizations and future research endeavors, as a guide for exploring score differences and the human-based reasons behind them.
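A hedged sketch of the kind of MANOVA comparison described, using statsmodels on simulated domain scores; the domain names ("organization", "conventions") and all values are invented for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(6)
n = 300  # invented sample; the real study used ~600 responses per rater

# Two analytic domain scores per response, plus the response mode.
df = pd.DataFrame({
    "organization": rng.normal(3.0, 0.8, size=n),
    "conventions": rng.normal(3.1, 0.8, size=n),
    "mode": rng.choice(["typed", "handwritten"], size=n),
})

# MANOVA: does the vector of domain scores differ by response mode?
maov = MANOVA.from_formula("organization + conventions ~ mode", data=df)
print(maov.mv_test())
```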
