81

Equating tests under the Generalized Partial Credit Model

Swediati, Nonny 01 January 1997 (has links)
The efficacy of several procedures for equating tests scored under the Partial Credit item response theory model was investigated. A simulation study was conducted to examine the effect of several factors on the accuracy of equating for tests calibrated with the Partial Credit Model. The factors manipulated were the number of anchor items, the difference in the ability distributions of the examinee groups taking alternate forms of a test, the sample size of the groups taking the tests, and the equating method. The data for this study were generated according to the Generalized Partial Credit Model. Test lengths of 5 and 20 items were studied. The number of anchor items ranged from two to four for the five-item test and from two to eight for the twenty-item test. Two levels of sample size (500 and 1000) and two levels of ability distribution (equal and unequal) were studied. The equating methods studied were four variations of the Mean and Sigma method and the characteristic curve method. The results showed that the characteristic curve method was the most accurate equating method under all conditions studied. The second most effective method was the Mean and Sigma variation that used all of the step difficulty parameter estimates in the computation of the equating constants. In general, all equating methods produced reasonably accurate equating with long tests and a large number of anchor items when there was no mean difference in ability between the two groups. When there was a large ability difference between the two groups of examinees, item parameters were estimated poorly, particularly in short tests, and this in turn affected the equating methods adversely. The conclusion is that poor parameter estimation makes it difficult to equate tests administered to examinee groups that differ greatly in ability, especially when the tests are relatively short and the number of anchor items is small.
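As a rough illustration of the Mean and Sigma linking referenced in this abstract, the sketch below computes the linear transformation constants from anchor-item step difficulty estimates obtained in two separate calibrations. It is a minimal sketch, not the author's code; the step difficulty arrays are hypothetical, and a full equating would also rescale discriminations and abilities as noted in the comments.

```python
import numpy as np

def mean_sigma_constants(b_target, b_source):
    """Mean and Sigma linking: find A, B so that A * b_source + B
    matches the target-scale step difficulties of the anchor items."""
    A = np.std(b_target, ddof=1) / np.std(b_source, ddof=1)
    B = np.mean(b_target) - A * np.mean(b_source)
    return A, B

# Hypothetical anchor-item step difficulty estimates from two calibrations.
b_old_scale = np.array([-1.2, -0.4, 0.3, 0.9, 1.6])   # old form (target scale)
b_new_scale = np.array([-1.0, -0.1, 0.5, 1.2, 1.9])   # new form (to be rescaled)

A, B = mean_sigma_constants(b_old_scale, b_new_scale)

# Rescale the new-form parameters: step difficulties transform as A*b + B,
# discriminations divide by A, and abilities transform the same way as b.
b_rescaled = A * b_new_scale + B
print(A, B, b_rescaled)
```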
82

An investigation of alternative approaches to scoring multiple response items on a certification exam

Ma, Xiaoying 01 January 2004 (has links)
Multiple-response (MR) items are items that have more than one correct answer. This item type is often used in licensure and achievement tests to accommodate situations where identification of a single correct answer no longer suffices or where multiple steps are required in solving a problem. MR items can be scored either dichotomously or polytomously. Polytomous scoring of MR items often employs some type of option weighting to assign differential point values to each of the response options. Weights for each option are defined a priori by expert judgment or derived empirically from item analysis. Studies examining the reliability and validity of differential option weighting methods have been based on classical test theory. Little or no research has been done to examine the usefulness of item response theory (IRT) models for deriving empirical weights, or to compare the effectiveness of different option weighting methods. The purposes of this study, therefore, were to investigate polytomous scoring methods for MR items and to evaluate the impact different scoring methods may have on the reliability of the test scores, on item and test information functions, and on measurement efficiency and classification accuracy. Results from this study indicate that polytomous scoring of the MR items did not significantly increase the reliability of the test, nor did it increase the test information functions drastically, probably because two-thirds of the items were multiple-choice items scored the same way across comparisons. However, a substantial increase in the test information function at the lower end of the score scale was observed under the polytomous scoring schema. With respect to classification accuracy, the results were inconsistent across different samples; therefore, further study is needed. In summary, findings from this study suggest that polytomous scoring of MR items has the potential to increase the efficiency of measurement (as shown by the increase in test information functions) and the accuracy of classification. Realizing these advantages, however, will be contingent on the quality and quantity of the MR items on the test. Further research is needed to evaluate the quality of the MR items and its effect on the effectiveness of polytomous scoring.
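To make the dichotomous-versus-polytomous contrast concrete, here is a minimal sketch of scoring a single multiple-response item both ways. The item, its key, and the a priori option weights are hypothetical, standing in for the expert-defined weights described above.

```python
# Hypothetical multiple-response item: options A-E, with B, C, and E keyed correct.
keyed = {"A": 0, "B": 1, "C": 1, "D": 0, "E": 1}
# Hypothetical a priori option weights for polytomous scoring (expert-defined).
weights = {"A": -0.5, "B": 1.0, "C": 1.0, "D": -0.5, "E": 0.5}

def score_dichotomous(selected):
    """1 point only if the examinee selects exactly the keyed options."""
    keyed_set = {opt for opt, k in keyed.items() if k == 1}
    return int(set(selected) == keyed_set)

def score_polytomous(selected, floor=0.0):
    """Sum of option weights over the selected options, floored at zero."""
    return max(floor, sum(weights[opt] for opt in selected))

response = ["B", "C", "D"]            # two keyed options and one distractor
print(score_dichotomous(response))    # 0 under all-or-nothing scoring
print(score_polytomous(response))     # 1.5 under option weighting
```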
83

An evaluation of automated scoring programs designed to score essays

Khaliq, Shameem Nyla 01 January 2004 (has links)
The number of performance assessment tasks has increased over the years because some constructs are best assessed in this manner. Though there are benefits to using performance tasks, there are also drawbacks. The problems with performance assessments include scoring time, scoring costs, and problems with human raters. One solution for overcoming the drawbacks of performance assessments is the use of automated scoring programs. There are several automated scoring programs designed to score essays and other constructed responses. Much research has been conducted on these programs by their developers; however, relatively little research has used external criteria to evaluate automated programs. The purpose of this study was to evaluate two popular automated scoring programs. The programs were evaluated with respect to several criteria: the percent of exact and adjacent agreements, kappas, correlations, differences in score distributions, discrepant scoring, analysis of variance, and generalizability theory. The scoring results from the two automated scoring programs were compared to the scores from operational scoring and from an expert panel of judges. The results indicated close similarity between the two scoring programs in how they scored the essays. However, the results also revealed some subtle, but important, differences between the programs. One program exhibited higher correlations and agreement indices with both the operational and expert committee scores, although the magnitude of the difference was small. Differences were also noted in the scores assigned to fake essays designed to trick the programs into providing a higher score. These results were consistent for both the full set of 500 scored essays and the subset of essays reviewed by the expert committee. Overall, both automated scoring programs did well on these criteria; however, one program did slightly better. The G-studies indicated that there were small differences among the raters and that the amount of error in the models was reduced as the number of human raters and automated scoring programs was increased. In summary, the results suggest that automated scoring programs can approximate scores given by human raters, but that they differ in their proximity to operational and expert scores and in their ability to identify dubious essays.
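The agreement criteria named above (exact and adjacent agreement, kappa, correlation) are standard rater-agreement statistics; a minimal sketch of how they might be computed for two sets of essay scores follows. The score vectors and the 1-6 score scale are hypothetical.

```python
import numpy as np

def agreement_stats(scores_a, scores_b, categories):
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    exact = np.mean(a == b)
    adjacent = np.mean(np.abs(a - b) <= 1)          # exact or off by one point
    r = np.corrcoef(a, b)[0, 1]

    # Quadratically weighted kappa from the observed and expected proportion matrices.
    k = len(categories)
    obs = np.zeros((k, k))
    for x, y in zip(a, b):
        obs[x - categories[0], y - categories[0]] += 1
    obs /= obs.sum()
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    w = np.array([[(i - j) ** 2 for j in range(k)] for i in range(k)]) / (k - 1) ** 2
    kappa = 1 - (w * obs).sum() / (w * exp).sum()
    return exact, adjacent, r, kappa

# Hypothetical human and automated scores on a 1-6 essay scale.
human = [4, 3, 5, 2, 4, 6, 3, 4]
machine = [4, 3, 4, 2, 5, 6, 3, 3]
print(agreement_stats(human, machine, categories=range(1, 7)))
```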
84

Using performance level descriptors to ensure consistency and comparability in standard-setting

Khembo, Dafter January 01 January 2004 (has links)
The need for fair and comparable performance standards in high-stakes examinations cannot be overstated. For examination results to be comparable over time, uniform performance standards need to be applied to different cohorts of students taking different forms of the examination. The motivation to conduct a study on maintaining the Malawi School Certificate of Education (MSCE) performance standards arose from the observation by the Presidential Commission of Enquiry into the MSCE Results that the examination was producing fluctuating results whose cause could not be identified or explained, beyond blaming the standard setting procedure in use. This study was conducted with the following objectives: (1) to determine whether the use of performance level descriptors could ensure consistency in examination standards; (2) to assess the role of training judges in standard setting; and (3) to examine the impact of judges' participation in scoring students' written answers prior to being involved in setting examination standards. Maintaining examination standards over years means assessing different cohorts of students taking different forms of the examination using common criteria. In this study, common criteria, in the form of performance level descriptors, were developed and applied to the 2002 and 2003 MSCE Mathematics examinations using the item score string estimation (ISSE) standard setting method. Twenty MSCE mathematics experts were purposively selected and trained to use the method. Results from the study demonstrated that performance level descriptors, especially when used in concert with test equating, can greatly help in establishing grading standards that can be maintained from year to year, by reducing the variability in performance standards that arises from ambiguity about what it means to achieve each grade category. The study also showed that preparing judges to set performance standards is an important factor in producing quality standard setting results. At the same time, the results did not support a recommendation for judges to gain experience as scorers prior to participating in standard setting activities.
85

Validity issues in standard setting

Meara, Kevin Charles 01 January 2001 (has links)
Standard setting is an important yet controversial aspect of testing. In credentialing, pass-fail decisions must be made to determine who is competent to practice in a particular profession. In education, decisions based on standards can have tremendous consequences for students, parents, and teachers. Standard setting is controversial due to the judgmental nature of the process. In addition, the nature of testing is changing: with the increased use of computer-based testing and new item formats, test-centered methods may no longer be applicable. How are testing organizations currently setting standards? How can organizations gather validity evidence to support their standards? This study consisted of two parts. The purpose of the first part was to learn about the procedures credentialing organizations use to set standards on their primary exam. A survey was developed and mailed to 98 credentialing organizations. Fifty-four percent of the surveys were completed and returned. The results indicated that most organizations used a modified Angoff method; however, no two organizations used exactly the same procedure. In addition, the use of computer-based testing (CBT) and new item formats has increased during the past ten years. The results were discussed in terms of ways organizations can alter their procedures to gather additional validity evidence. The purpose of the second part was to conduct an evaluation of the standard-setting process used by a state department of education. Two activities were conducted: first, the documentation was evaluated, and second, secondary data analyses (i.e., a contrasting groups analysis and a cluster analysis) were conducted on data made available by the state. The documentation and the contrasting groups analysis indicated that the standards were set with care and diligence. The contrasting groups results, however, also indicated that the standards in some categories might be somewhat high, and some of the score categories were somewhat narrow in range. The information covered in this paper might be useful for practitioners who must validate the standards they create.
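The contrasting groups analysis mentioned above can be sketched briefly: given total scores for examinees independently judged competent or not competent, a cutscore is chosen to minimize total misclassification. This is a minimal sketch under that assumption; the score data and judged groups below are hypothetical.

```python
import numpy as np

def contrasting_groups_cut(competent_scores, noncompetent_scores):
    """Return the cutscore that minimizes misclassification: competent examinees
    falling below the cut plus non-competent examinees at or above it."""
    competent = np.asarray(competent_scores)
    noncompetent = np.asarray(noncompetent_scores)
    candidates = np.arange(min(competent.min(), noncompetent.min()),
                           max(competent.max(), noncompetent.max()) + 1)
    errors = [(competent < c).sum() + (noncompetent >= c).sum() for c in candidates]
    return candidates[int(np.argmin(errors))]

# Hypothetical total scores for the two judged groups.
competent = [38, 42, 45, 40, 47, 44, 39, 46]
noncompetent = [30, 34, 36, 33, 31, 37, 35, 32]
print(contrasting_groups_cut(competent, noncompetent))   # 38 for these data
```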
86

Development and evaluation of test assembly procedures for computerized adaptive testing

Robin, Frederic 01 January 2001 (has links)
Computerized adaptive testing provides a flexible and efficient framework for the assembly and administration of on-demand tests. However, the development of practical test assembly procedures that can ensure desired measurement, content, and security objectives for all individual tests has proved difficult. To address this challenge, desirable test specifications, such as minimum test information targets, minimum and maximum test content attributes, and item exposure limits, were identified. Five alternative test assembly procedures were then implemented, and extensive computerized adaptive testing simulations were conducted under various test security and item pool size conditions. All five procedures were modeled on the weighted deviation model and optimized to produce the most acceptable compromise between testing objectives. As expected, the random (RD) and maximum information (MI) test assembly procedures resulted in the least acceptable tests, producing either the most informative but least secure and efficient tests or the most efficient and secure but least informative tests, illustrating the need for compromise between competing objectives. The combined maximum information item selection and Sympson-Hetter unconditional exposure control procedure (MI-SH) allowed a more acceptable compromise between testing objectives but demonstrated only moderate levels of test security and efficiency. The more sophisticated combination of maximum information selection with the Stocking and Lewis conditional exposure control procedure (MI-SLC) demonstrated both high levels of test security and efficiency while providing acceptable measurement. Results obtained with the combined maximum information and stochastic conditional exposure control procedure (MI-SC) were similar to those obtained with MI-SLC. However, MI-SC offers the advantage of not requiring extensive preliminary simulations and allows more flexibility in the removal or replacement of faulty items from operational pools. The importance of including minimum test information targets among the testing objectives was supported by the relatively large variability of test information observed for all the test assembly procedures studied. Failure to take this problem into account when test assembly procedures are operationalized is likely to result in the administration of sub-standard tests to many examinees. Concerning pool management, it was observed that increasing pool size beyond what is needed to satisfy all testing objectives actually reduced testing efficiency.
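A compact sketch of maximum-information item selection combined with Sympson-Hetter unconditional exposure control (the MI-SH combination named above) follows. The 3PL item pool, the exposure control parameters k, and the ability value are all hypothetical; in practice the k values come from preliminary simulations, which is the limitation the abstract contrasts with MI-SC.

```python
import numpy as np

rng = np.random.default_rng(7)

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta (D = 1.7 scaling)."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p

def select_item_mi_sh(theta, pool, administered):
    """Walk down the items in decreasing information order and administer item i
    with probability k_i, the Sympson-Hetter exposure control parameter."""
    info = np.array([info_3pl(theta, it["a"], it["b"], it["c"]) for it in pool])
    for idx in np.argsort(-info):
        if idx in administered:
            continue
        if rng.random() <= pool[idx]["k"]:      # probabilistic exposure filter
            return idx
    return None                                 # pool exhausted for this examinee

# Hypothetical 3PL pool with exposure control parameters k.
pool = [{"a": 1.2, "b": -0.5, "c": 0.20, "k": 0.6},
        {"a": 0.8, "b": 0.3, "c": 0.20, "k": 1.0},
        {"a": 1.5, "b": 0.1, "c": 0.25, "k": 0.3}]
print(select_item_mi_sh(theta=0.0, pool=pool, administered=set()))
```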
87

Investigation of the validity of the Angoff standard setting procedure for multiple-choice items

Mattar, John D 01 January 2000 (has links)
Setting passing standards is one of the major challenges in the implementation of valid assessments for high-stakes decision making in testing situations such as licensing and certification. If high-stakes pass-fail decisions are to be made from test scores, the passing standards must be valid for the assessment itself to be valid. Multiple-choice test items continue to play an important role in measurement, and the Angoff (1971) procedure continues to be widely used to set standards on multiple-choice examinations. This study focuses on the internal consistency, or underlying validity, of Angoff standard setting ratings. The Angoff procedure requires judges to estimate the proportion of borderline candidates who would answer each test question correctly. If the judges are successful at estimating the difficulty of items for borderline candidates, that suggests an underlying validity to the procedure. This study examines the question by evaluating the relationships between Angoff standard setting ratings and actual candidate performance on professional certification tests. For each test, a borderline group of candidates was defined as those near the cutscore. The analyses focus on three aspects of the judges' ratings with respect to item difficulties for the borderline group: accuracy, correlation, and variability. The results of this study provide some evidence for the validity of the Angoff standard setting procedure. For two of the three examinations studied, judges were accurate and consistent in rating the difficulty of items for borderline candidates. However, the study also shows that the procedure may be less successful in some applications. These results indicate that the procedure can be valid, but that its validity should be checked for each application; practitioners should not assume that the Angoff method is valid. The results also show some limitations of the procedure even when the overall results are positive: judges are less successful at rating very difficult or very easy test items. The validity of the Angoff procedure may be enhanced by further study of methods designed to ameliorate those limitations.
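The three aspects of the judges' ratings examined above (accuracy, correlation, variability) can be summarized with a short sketch that compares mean Angoff ratings with empirical item difficulties for a borderline group. The rating matrix and the borderline-group response matrix below are hypothetical.

```python
import numpy as np

def angoff_vs_borderline(ratings, borderline_responses):
    """ratings: judges x items matrix of estimated proportions correct for
    borderline candidates; borderline_responses: candidates x items 0/1 matrix
    for candidates scoring near the cutscore."""
    mean_ratings = ratings.mean(axis=0)                  # judges' consensus per item
    p_borderline = borderline_responses.mean(axis=0)     # empirical item difficulty

    accuracy = np.mean(np.abs(mean_ratings - p_borderline))    # mean absolute error
    correlation = np.corrcoef(mean_ratings, p_borderline)[0, 1]
    variability = ratings.std(axis=0, ddof=1).mean()            # judge disagreement
    return accuracy, correlation, variability

rng = np.random.default_rng(1)
ratings = np.clip(rng.normal(0.6, 0.1, size=(10, 8)), 0, 1)     # 10 judges, 8 items
borderline_responses = rng.binomial(1, 0.55, size=(40, 8))      # 40 borderline candidates
print(angoff_vs_borderline(ratings, borderline_responses))
```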
88

An Analysis of FEMA Curricular Outcomes in an Emergency Management and Homeland Security Certificate Program: A Case Study Exploring Instructional Practice

Unknown Date (has links)
In the United States, the higher education community is charged with the academic education of emergency management professionals. The present rate of natural disasters and the evolving threat of terrorist attacks have created a demand for practitioners who are solidly educated in emergency management knowledge, skills, and abilities. These conditions have in turn precipitated the aggressive growth of emergency management and homeland security academic programs in higher education, characterized as the most relevant development in the field of emergency management (Darlington, 2008). With the goal of accelerating the professionalization of emergency management occupations through higher education, the Federal Emergency Management Agency's (FEMA) Higher Education Program focused its research efforts on developing a set of evidence-based competencies for academic programs, outlined in FEMA's Curricular Outcomes (2011). This study explored how these evidence-based competencies are manifested in emergency management and homeland security academic programs, and it contributes to filling the gap in the literature on the implementation of FEMA's professional competencies in academic programs, a gap that is a consequence of legal constraints prohibiting the direct collection of implementation data by federal agencies. The results indicated that a wide range of competencies was represented in program coursework, with gaps in alignment identified in the five competency areas. The analysis also revealed the exclusion of homeland security topics from Curricular Outcomes (2011), which led to issues of operationalization. Lastly, instructors shared feedback for improving alignment, and the researcher discusses key conditions for similar use of a responsive evaluation framework in academic programs. / A Dissertation submitted to the Department of Educational Leadership and Policy Studies in partial fulfillment of the requirements for the degree of Doctor of Education. / Spring Semester 2018. / March 6, 2018. / curriculum alignment, curriculum mapping, emergency management academic programs, evidence-based practice, professional competencies, responsive evaluation / Includes bibliographical references. / Linda Schrader, Professor Co-Directing Dissertation; Tamara Bertrand Jones, Professor Co-Directing Dissertation; Ralph Brower, University Representative; Marytza Gawlik, Committee Member; Motoko Akiba, Committee Member.
89

Assessing fit of item response theory models

Lu, Ying 01 January 2006 (has links)
Item response theory (IRT) modeling is a statistical technique that is widely applied in the field of educational and psychological testing. The usefulness of IRT models, however, depends on the extent to which they effectively reflect the data, and it is necessary that model-data fit be evaluated before model application by accumulating a wide variety of evidence that supports the proposed uses of the model with a particular set of data. This thesis addressed issues in the collection of two major sources of fit evidence to support IRT model application: evidence based on model-data congruence, and evidence based on intended uses of the model and practical consequences. Specifically, the study (a) proposed a new goodness-of-fit procedure, examined its performance using fitting and misfitting data, and compared its behavior with that of commonly used goodness-of-fit procedures, and (b) investigated through simulations the consequences of model misfit for two major IRT applications: equating and computerized adaptive testing. In all simulation studies, the 3PLM was assumed to be the true IRT model, while the 1PLM and 2PLM were treated as misfitting models. The study found that the proposed goodness-of-fit statistic correlated consistently higher with the true size of misfit than the commonly used fit statistics, making it a useful index for estimating the degree of misfit, which is often of interest but unknown in practice. A major issue with the new statistic is its inappropriately defined null distribution and critical values; as a result, the new statistical test appeared to be less powerful, but also less susceptible to type I error. In examining the consequences of model-data misfit, the study showed that although theoretically the 2PLM could not provide a perfect fit to 3PLM data, there was minimal consequence if the 2PLM was used to equate 3PLM data and number-correct scores were to be reported. This, however, was not true in CAT, given the significant bias the 2PLM produced. The study further emphasized the importance of fit evaluation through both goodness-of-fit statistical tests and examination of the practical consequences of misfit.
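The abstract does not specify the proposed goodness-of-fit statistic, but a generic residual-based item-fit check of the kind commonly used in IRT can be sketched to show what "model-data congruence" evidence looks like in practice. This is an illustrative sketch only, not the thesis's procedure: the data are simulated from a hypothetical 3PL item and checked against hypothetical 2PL estimates, mirroring the misfit scenario described above.

```python
import numpy as np

def item_fit_residuals(theta, responses, prob_fn, n_groups=10):
    """Generic residual-based item-fit check: group examinees by ability,
    compare the observed proportion correct in each group with the model-implied
    probability at the group's mean ability, and standardize the difference."""
    order = np.argsort(theta)
    groups = np.array_split(order, n_groups)
    std_residuals = []
    for g in groups:
        observed = responses[g].mean()
        expected = prob_fn(theta[g].mean())
        se = np.sqrt(expected * (1 - expected) / len(g))
        std_residuals.append((observed - expected) / se)
    return np.array(std_residuals)

# Hypothetical data: responses generated from a 3PL item, checked against a 2PL fit.
rng = np.random.default_rng(0)
theta = rng.normal(size=2000)
a, b, c = 1.1, 0.2, 0.2                                    # true 3PL parameters
p_true = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
responses = rng.binomial(1, p_true)

def fitted_2pl(t):
    # Hypothetical 2PL parameter estimates (a = 1.0, b = 0.0), no guessing term.
    return 1 / (1 + np.exp(-1.7 * 1.0 * (t - 0.0)))

print(item_fit_residuals(theta, responses, fitted_2pl))
```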
90

Detecting exposed items in computer-based testing

Han, Ning 01 January 2006 (has links)
More and more testing programs are moving from traditional paper-and-pencil administration to computer-based administration. Common practice in computer-based testing is that test items are reused repeatedly over a short time period to support large volumes of examinees, which makes disclosed items a concern for the validity and fairness of test scores. Most current research focuses on controlling item exposure rates, which minimizes the probability that some items are overused, but there is no common understanding of issues such as how long an item pool should be used, what the pool size should be, and what exposure rates are acceptable. A different approach to addressing overexposure of test items is to focus on generating and investigating item statistics that reveal whether test items are known to examinees before they see the tests. A method was proposed in this study to detect disclosed items by monitoring the moving averages of some common item statistics. Three simulation studies were conducted to investigate and evaluate the usefulness of the method. The statistics investigated included classical item difficulty, IRT-based item raw residuals, and three kinds of IRT-based standardized item residuals. The detection statistic used in Study 1 was the classical item difficulty statistic. Study 2 investigated classical item difficulty, IRT-based item residuals, and the best known of the IRT-based standardized residuals. Study 3 investigated three different standardizations of the residuals. Other variables in the simulations included window sizes, item characteristics, ability distributions, and the extent of item disclosure. Empirical type I error and power of the method were computed for different situations. The results showed that, with reasonable window sizes (about 200 examinees), the IRT-based statistics produced the most promising results under a wide variety of conditions and seem ready for immediate implementation. Difficult and discriminating items were the easiest to spot when they had been exposed, and it is the most discriminating items that contribute most to proficiency estimation with multi-parameter IRT models; therefore, early detection of these items is especially important. The applicability of the approach to large-scale testing programs was also addressed.
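A minimal sketch of the moving-average idea described above, using the simplest of the statistics studied (classical item difficulty): track the running proportion correct on an item over the most recent window of examinees and flag the item when that average drifts well above its baseline. The window size, threshold, and response stream are hypothetical.

```python
import random
from collections import deque

def monitor_item(responses, baseline_p, window=200, threshold=0.10):
    """Flag the item when the moving average of correct responses over the last
    `window` examinees exceeds the baseline proportion correct by `threshold`."""
    recent = deque(maxlen=window)
    flags = []
    for i, r in enumerate(responses):
        recent.append(r)
        if len(recent) == window and sum(recent) / window > baseline_p + threshold:
            flags.append(i)          # examinee index at which the item looks exposed
    return flags

# Hypothetical response stream: the item becomes easier after disclosure.
random.seed(3)
responses = [int(random.random() < 0.50) for _ in range(600)]     # pre-disclosure
responses += [int(random.random() < 0.75) for _ in range(400)]    # post-disclosure
print(monitor_item(responses, baseline_p=0.5)[:1])   # first flagged position, if any
```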
