1. Differentiating Rater Accuracy Training Programs
Sinclair, Andrea L., 24 October 2000
Prior investigation of a new rater training paradigm, rater variability training (RVT), found no clear empirical distinction between RVT and the more established frame-of-reference training (FOR) (Hauenstein, Facteau, & Schmidt, 1999). The purpose of the present study was to expand upon this previous investigation by including a purpose manipulation, alternative operationalizations of Cronbach's accuracy components, finer-grained distinctions in the rating stimuli, and a second control group receiving quantitative accuracy feedback devoid of a substantive training lecture. Results indicate that finer-grained distinctions in the rating stimuli result in the best differential elevation accuracy for RVT trainees. Furthermore, RVT may be best suited for improving raters' abilities to accurately evaluate average-performing ratees when the performance appraisal is used for an administrative purpose. Evidence also suggests that in many cases, the use of Cronbach's accuracy components obscures underlying patterns of rating accuracy. Finally, there is evidence to suggest that accuracy feedback without a training lecture improves some types of rating accuracy. / Master of Science
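For readers unfamiliar with Cronbach's accuracy components referenced in this abstract, the sketch below shows one common squared-difference operationalization for a single rater; the function name and the (ratees x dimensions) layout are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def cronbach_components(ratings, true_scores):
    """Squared-difference versions of Cronbach's (1955) accuracy components for one
    rater. `ratings` and `true_scores` are (ratees x dimensions) arrays; layout and
    naming are illustrative, not the thesis's own operationalization."""
    x = np.asarray(ratings, float)
    t = np.asarray(true_scores, float)
    d = x - t                                                 # rater-minus-target differences
    elevation = d.mean() ** 2                                 # overall (grand-mean) discrepancy
    diff_elev = ((d.mean(axis=1) - d.mean()) ** 2).mean()     # ratee main effects
    stereo_acc = ((d.mean(axis=0) - d.mean()) ** 2).mean()    # dimension main effects
    resid = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    diff_acc = (resid ** 2).mean()                            # ratee-by-dimension residuals
    return {"elevation": elevation, "differential_elevation": diff_elev,
            "stereotype_accuracy": stereo_acc, "differential_accuracy": diff_acc}
```

The four components sum to the overall mean squared difference between ratings and target scores, which is what makes the decomposition useful for asking which kind of accuracy a training program actually improves.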
2. Perceptual Agreement Between Multi-rater Feedback Sources in the Federal Bureau of Investigation
Corderman, David Sandt, 04 May 2004
The use of multi-rater feedback as a way to analyze perceptions within the context of job performance and leadership in the Federal Bureau of Investigation (FBI) was examined. Research in this domain is notable because this type of evaluation is now being done with regularity in the private sector and is starting to be utilized more extensively in the public sector, but is still used to only a limited extent in law enforcement. The research examined differences between self-assessments and assessments by others (peers and subordinates) on dimensions of leadership, as measured by the same multi-rater instrument at two points in time. This research effort made use of a multi-rater survey instrument called the "Leadership Commitments and Credibility Inventory System (LCCIS)," designed by Keilty, Goldsmith, and Company, which is used in multiple industries and was expanded to capture characteristics considered important to FBI leaders. Results showed high ratings on a five-point Likert scale, as indicated by the mean ratings of self and others. Additionally, Z scores, t tests, and ANCOVA indicated that FBI supervisors did not overestimate their leadership, as evidenced by (1) an overall leadership measure at time two compared to time one, (2) greater perceptual agreement between others and self on the second multi-rater assessments than on the initial assessments, and (3) the pattern of statistical differences of means across all measured categories at time two versus time one. Various subcategories of the assessment showed a mixture of non-significant results and indicated that subordinates and peers perceived leaders differently. Further, analysis of two unique dimensions of the LCCIS, "Manage Diversity" and "Build Public Trust," showed exceptionally high results. The implications of the present research are that leadership in the FBI, as measured by different dimensions, is strong. Yet there is no evidence that leaders or others in this organization change their perceptions over time. These findings may point to the need for multi-rater instruments to be used in concert with personal development plans in order to improve the perception of leadership. / Ph. D.
3. The Effects of Incomplete Rating Designs on Results from Many-Facets-Rasch Model Analyses
McEwen, Mary R., 01 February 2018
A rating design is a pre-specified plan for collecting ratings. The best design for a rater-mediated assessment, both psychometrically and from the perspective of fairness, is a fully crossed design in which all objects are rated by all raters. An incomplete rating design is one in which not all objects are rated by all raters; instead, each object is rated by an assigned subset of raters, usually to reduce the time and/or cost of the assessment. Human raters have varying propensities to rate severely or leniently. One method of compensating for rater severity is the many-facets Rasch model (MFRM). However, unless the incomplete rating design used to gather the ratings is appropriately linked, the results of the MFRM analysis may not be on the same scale and therefore may not be fairly compared. Given non-trivial numbers of raters and/or objects to rate, there are numerous possible incomplete designs with various levels of linkage. The literature provides little guidance on the extent to which differently linked rating designs might affect the results of an MFRM analysis. Eighty different subsets of data were extracted from a pre-existing fully crossed rating data set originally gathered from 24 essays rated by eight raters. These subsets represented 20 different incomplete rating designs and four specific assignments of raters to essays. The subsets of rating data were analyzed in Facets software to investigate the effects of incomplete rating designs on the MFRM results. The design attributes related to linkage that were varied in the incomplete designs were (a) rater coverage: the number of raters per essay, (b) repetition size: the number of essays rated in one repetition of the sub-design pattern, (c) design structure: the linking network structure of the incomplete design, and (d) rater order: the specific assignment of raters to essays. A number of plots and graphs were used to visualize the incomplete designs and the rating results. Several measures, including the observed and fair averages for raters and essays from the 80 MFRM analyses, were compared against the fair averages for the fully crossed design. Results varied widely depending on the combination of design attributes and rater orders. Rater coverage had the largest overall effect, with rater order producing larger ranges of values for sparser designs. Many of the observed averages for raters and essays approximated the results from the fully crossed design more closely than did the adjusted fair averages, particularly for the more sparsely linked designs. The stability of relative standing measures was unexpectedly low.
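As background for the severity adjustment described above, the many-facets Rasch model is usually written in log-odds form. The two-facet (essay and rater) rating-scale version below is the standard textbook formulation rather than anything specific to this thesis:

```latex
\log\!\left(\frac{P_{njk}}{P_{nj(k-1)}}\right) = \theta_n - \alpha_j - \tau_k
```

Here $P_{njk}$ is the probability that rater $j$ awards essay $n$ category $k$ rather than $k-1$, $\theta_n$ is the essay measure, $\alpha_j$ is the rater's severity, and $\tau_k$ is the threshold between adjacent categories. A rating design is linked when the essay-rater pairings form a connected network, so that all $\theta_n$ and $\alpha_j$ can be placed on this common logit scale; the "fair average" reported by the Facets software is, roughly, the rating expected once each rater's severity is replaced by the average severity.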
4. The Inter-rater Reliability of the Psychopathy Checklist-Revised in Practical Field Settings
Matsushima, Yuko, 01 May 2016
This paper examined the inter-rater reliability of psychological assessments in a practical field setting, using the PCL-R scores of 42 inmates. The study showed ICC and SEM values similar to those reported in the PCL-R manual. Concerning the PCL-R structure, factor 2 showed a higher ICC value than factor 1, and facet 4 showed a higher ICC value than facets 1, 2, or 3. In particular, facet 2 showed a low ICC value. These findings are consistent with previous studies. However, the ICC yielded by factor 2 alone and the ICC yielded by factors 1 and 2 combined were similar. Considering theoretical and clinical aspects, it is advisable to use the PCL-R total score for risk assessment, although interpreting facet 2 requires caution. Concerning rater characteristics, the most influential factor in maintaining PCL-R reliability was conducting the assessment on a regular basis, rather than licensure status. Due to the small sample size, it was difficult to examine whether sign-off contributes to maintaining sufficient reliability. In the regression model, none of the rater-related variables was significantly correlated with the change in PCL-R scores between the two assessment occasions. PCL-R scores at Time 1 were moderately and negatively correlated with the score change, indicating natural regression toward the mean. It is desirable to conduct additional studies after obtaining a larger sample and more rater-related information, such as clinical experience. Additionally, caution is needed in applying the findings of this study to female psychopathic subjects. As a policy implication, it is recommended that the personnel division allow psychologists to remain in their psychological work.
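Since the abstract leans on ICC and SEM values, the following is a minimal sketch of one common pair of computations: ICC(2,1) from a two-way ANOVA decomposition, and SEM = SD * sqrt(1 - reliability). The thesis does not state which ICC form or software was used, so treat the choice of ICC(2,1) and the pooled SD as assumptions made purely for illustration.

```python
import numpy as np

def icc_2_1_and_sem(scores):
    """ICC(2,1) (two-way random effects, absolute agreement, single rater) and the
    standard error of measurement from an (inmates x raters) matrix of PCL-R totals.
    Illustrative sketch only; the exact ICC form used in the thesis is not reported."""
    x = np.asarray(scores, float)
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between-subject MS
    ms_cols = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between-rater MS
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))               # residual MS
    icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    sem = x.std(ddof=1) * np.sqrt(1 - icc)                          # SEM = SD * sqrt(1 - reliability)
    return icc, sem
```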
5. Differences Between Teacher-Reports on Universal Risk Assessments: Exploring the Teacher's Role in Universal Screening of Student Behavior
Millman, Marissa Kate, January 2015
No description available.
6. Detecting rater effects in trend scoring
Abdalla, Widad, 01 May 2019
Trend scoring is often used in large-scale assessments to monitor for rater drift when the same constructed-response items are administered in multiple test administrations. In trend scoring, a set of responses from Time A is rescored by raters at Time B. The purpose of this study is to examine the ability of trend-monitoring statistics to detect rater effects in the context of trend scoring. The present study examines the percent of exact agreement and Cohen's kappa as interrater agreement measures, and the paired t-test and Stuart's Q as marginal homogeneity measures. Data containing specific rater effects are simulated under two frameworks: the generalized partial credit model and the latent-class signal detection theory model.
The findings indicate that the percent of exact agreement, the paired t-test, and Stuart's Q showed high Type I error rates under a rescore design in which half of the rescore papers have a uniform score distribution and the other half have a score distribution proportional to the population of papers at Time A. All of these Type I error rates were reduced under a rescore design in which all rescore papers have a score distribution proportional to the population of papers at Time A. For the second rescore design, results indicate that the ability of the percent of exact agreement, Cohen's kappa, and the paired t-test to detect various effects varied across items, sample sizes, and types of rater effect. The only statistic that detected every level of rater effect across items and frameworks was Stuart's Q.
Although advances have been made in automated scoring, many testing programs still require humans to score constructed-response items, and previous research indicates that rater effects are common in constructed-response scoring. In testing programs that track trends across time, changes in scoring confound the measurement of change in student performance. The study of methods that maintain rating consistency across time, such as trend scoring, is therefore important for ensuring fairness and validity.
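All of the trend-monitoring statistics this study compares can be computed from a cross-tabulation of Time A and Time B scores on the rescored papers. The sketch below shows the percent of exact agreement, Cohen's kappa, and a Stuart-Maxwell statistic; reading "Stuart's Q" as the Stuart-Maxwell test of marginal homogeneity is an assumption here, as are the variable names.

```python
import numpy as np

def trend_monitoring_stats(time_a, time_b, categories):
    """Percent exact agreement, Cohen's kappa, and a Stuart-Maxwell (marginal
    homogeneity) statistic for rescored trend papers. Illustrative sketch only."""
    a, b = np.asarray(time_a), np.asarray(time_b)
    k = len(categories)
    # k x k table of counts: rows = Time A score, columns = Time B rescore
    n = np.array([[np.sum((a == r) & (b == c)) for c in categories] for r in categories], float)
    total = n.sum()
    exact = np.trace(n) / total                          # percent of exact agreement
    p = n / total
    p_chance = p.sum(axis=1) @ p.sum(axis=0)             # chance agreement from the marginals
    kappa = (np.trace(p) - p_chance) / (1 - p_chance)    # Cohen's kappa
    # Stuart-Maxwell test on the first k-1 categories (one is dropped to avoid singularity)
    d = (n.sum(axis=1) - n.sum(axis=0))[: k - 1]         # row minus column marginals
    s = -(n + n.T)[: k - 1, : k - 1]
    np.fill_diagonal(s, (n.sum(axis=1) + n.sum(axis=0) - 2 * np.diag(n))[: k - 1])
    q = d @ np.linalg.solve(s, d)                        # approx. chi-square with k-1 df
    return exact, kappa, q
```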
7. A Monte Carlo Approach for Exploring the Generalizability of Performance Standards
Coraggio, James Thomas, 16 April 2008
While each phase of the test development process is crucial to the validity of the examination, one phase tends to stand out among the others: the standard setting process. The standard setting process is a time-consuming and expensive endeavor. While it has received the most attention in the literature among the technical issues related to criterion-referenced measurement, little research attention has been given to generalizing the resulting performance standards. Such generalization has the potential to improve the standard setting process by limiting the number of items rated and the number of individual rater decisions, with profound implications from both a psychometric and a practical standpoint. This study was conducted to evaluate the extent to which minimal competency estimates derived from a subset of multiple choice items using the Angoff standard setting method would generalize to the larger item set. Individual item-level estimates of minimal competency were simulated from existing and simulated item difficulty distributions. The study was designed to examine the characteristics of item sets and of the standard setting process that could impact the ability to generalize a single performance standard. The characteristics of, and the relationship between, the two item sets comprised three factors: (a) the item difficulty distributions, (b) the location of the 'true' performance standard, and (c) the number of items randomly drawn in the sample. The characteristics of the standard setting process comprised four factors: (d) the number of raters, (e) the percentage of unreliable raters, (f) the magnitude of 'unreliability' in unreliable raters, and (g) the directional influence of group dynamics and discussion. The aggregated simulation results were evaluated in terms of the location (bias) and the variability (mean absolute deviation, root mean square error) of the estimates. The simulation results suggest that the model of using partial item sets may have some merit, as the resulting performance standard estimates may 'adequately' generalize to those set with larger item sets. The simulation results also suggest that elements such as the distribution of item difficulty parameters and the potential for directional group influence may impact the ability to generalize performance standards and should be carefully considered.
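Since the abstract describes the simulation only at a high level, here is a minimal Monte Carlo sketch of the core idea: derive an Angoff cut score from a random subset of items and see how far it drifts from the cut score implied by the full item set. The parameter names, the noise model for unreliable raters, and the use of the mean item probability as the standard are assumptions for illustration, not the study's actual design.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_angoff(full_item_ps, n_sampled, n_raters, rater_sd=0.08, n_reps=1000):
    """Monte Carlo sketch: how well does an Angoff standard set on a sampled item
    subset generalize to the full item set? `full_item_ps` are 'true' probabilities
    that a minimally competent examinee answers each item correctly."""
    full_item_ps = np.asarray(full_item_ps, float)
    true_standard = full_item_ps.mean()                  # cut score implied by the full set
    errors = np.empty(n_reps)
    for rep in range(n_reps):
        items = rng.choice(full_item_ps, size=n_sampled, replace=False)
        # each rater's item estimate = true probability + judgement noise, kept in [0, 1]
        panel = np.clip(items + rng.normal(0.0, rater_sd, (n_raters, n_sampled)), 0.0, 1.0)
        errors[rep] = panel.mean() - true_standard       # panel cut score minus 'true' standard
    bias = errors.mean()
    mad = np.abs(errors).mean()                          # mean absolute deviation
    rmse = np.sqrt((errors ** 2).mean())                 # root mean square error
    return bias, mad, rmse
```

Extending this skeleton with a directional shift added to every rater's estimates would mimic the group-discussion influence the study manipulates.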
8. Complete denture occlusion: intra and inter observer analysis
Mpungose, Sandile Khayalethu Derrick, January 2014
Magister Scientiae Dentium - MSc(Dent) / Aim: The aim of this study was to investigate the accuracy and the intra- and inter-observer reliability of identifying occlusal markings made intra-orally by articulating paper on complete dentures. Methods: A series of photographs of 14 tissue-borne complete dentures with occlusal markings was obtained. Articulating paper was used intra-orally at the delivery visit to make the occlusal markings. The denture sets were divided into two groups. Group 1 comprised pictures of the 14 complete lower dentures on their own, and group 2 comprised pictures of the same 14 lower dentures together with their opposing upper denture. The two groups of images were loaded into a Microsoft PowerPoint presentation as well as Keynote. Two experienced observers analysed the complete dentures independently and noted the number and distribution of the markings that they felt required adjustment. Their assessments differed, but they discussed the differences and reached consensus; these data served as the control. Three groups of observers (10 per group) were then asked to analyse the occlusal markings of the two groups of denture images twice, with a two-week interval between assessments. Before each subsequent assessment, the images were randomised by means of a computer-generated random number sequence. The mean number of markings was established for each group and compared with the control mean. Intra-rater reliability was established by comparing the difference of the means of sequential observations for each rater and establishing the z-value. Inter-rater reliability within each group was established by means of analysis of variance. Results: Considering all the data, in only 17 instances (of a possible 60) did observers' mean scores not differ from the control mean scores while also showing good intra-rater reliability. In the other 43 instances the observers' mean scores differed from the control mean scores and/or displayed poor intra-rater reliability. Considerable variation in inter-rater reliability was also found within every group of observers. Conclusion: The results indicate that observers are generally unable to reliably identify the occlusal markings, made by articulating paper on a lower complete denture, that warrant occlusal adjustment. Clinical significance: Articulating paper should not be used intra-orally when delivering removable complete dentures.
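The "z-value" step in the methods is not spelled out in the abstract; one plausible reading, since each observer scored the same 14 dentures twice, is a paired z statistic on the difference between the two passes, sketched below as an assumption rather than the thesis's actual computation.

```python
import numpy as np

def intra_rater_z(first_pass, second_pass):
    """Paired z statistic for the difference between an observer's two sequential
    counts of markings on the same dentures. One plausible reading of the thesis's
    'z-value' step, shown only for illustration."""
    d = np.asarray(first_pass, float) - np.asarray(second_pass, float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))   # mean difference over its standard error
```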
9. "She is such a B!" – "Really? How can you tell?": A qualitative study into inter-rater reliability in grading EFL writing in a Swedish upper-secondary school
Mård Grinde, Josefin, January 2019
This project investigates the extent to which EFL teachers' assessment practices differ when assessing two students' written texts in a Swedish upper-secondary school. It also seeks to understand the factors influencing the teachers with regard to inter-rater reliability in their assessment and marking process. The results show inconsistencies in the summative grades given by the raters; these inconsistencies include differences in what the raters deem important in the rubric. The actual assessment process, however, was very similar across raters. Based on the themes found in the content analysis of the perceived factors affecting the raters, the results showed that peer assessment, assessment training, context, and time were of importance to the raters. Emerging themes indicate that the interpretation of rubrics, which should matter the most in assessment, causes inconsistencies in summative marking, even when raters use the same rubrics, criteria, and instructions. The results suggest a need for peer assessment as a tool in the assessment and marking of students' texts to ensure inter-rater reliability, which would mean that more time needs to be allocated to grading.
10. Relatively idiosyncratic: exploring variations in assessors' performance judgements within medical education
Yeates, Peter, January 2013
Background: Whilst direct-observation, workplace-based (or performance) assessments sit at the conceptual epitome of assessment within medical education, their overall utility is limited by high inter-assessor score variability. We conceptualised this issue as one of problematic judgements by assessors. Existing literature and evidence about judgements within performance appraisal and impression formation, as well as the small evolving literature on raters' cognition within medical education, provided the theoretical context for studying assessors' judgement processes. Methods and Results: In this thesis we present three studies. The first study adopted an exploratory approach to studying assessors' judgements in direct-observation performance assessments by asking assessors to describe their thoughts whilst assessing standard videoed performances by junior doctors. Comments and follow-up interviews were analysed qualitatively using grounded theory principles. Results showed that assessors attributed different levels of salience to different aspects of performances, understood criteria differently (often comparing performance against other trainees), and expressed their judgements in unique narrative language. Consequently, assessors' judgements were comparatively idiosyncratic, or unique. The two subsequent follow-up studies used experimental, internet-based designs to further investigate the comparative judgements demonstrated in study 1. In study 2, participants were primed with either good or poor performances prior to watching intermediate (borderline) performances. In study 3 a similar design was employed, but participants watched identical performances in either increasing or decreasing levels of proficiency. Collectively, the results of these two studies showed that recent experiences influenced assessors' judgements, repeatedly producing a contrast effect (performances were scored unduly differently from earlier performances). These effects were greater than participants' consistent tendency to be either lenient or stringent and occurred at multiple levels of performance. The effect appeared to be robust despite our attempts to reduce participants' reliance on the immediate context. Moreover, assessors appeared to lack insight into the effect on their judgements. Discussion: Collectively, these results indicate that variation in assessors' scores can be substantially explained by idiosyncrasy in cognitive representations of the judgement task and by susceptibility to contrast effects through comparative judgements. Moreover, assessors appear to be incapable of judging in absolute terms, instead judging normatively. These findings have important implications for theory and practice and suggest numerous further lines of research.