21. Investigation of the validity of the Angoff standard setting procedure for multiple-choice items. Mattar, John D. 01 January 2000.
Setting passing standards is one of the major challenges in implementing valid assessments for high-stakes decision making in testing situations such as licensing and certification. If high-stakes pass-fail decisions are to be made from test scores, the passing standards must be valid for the assessment itself to be valid. Multiple-choice test items continue to play an important role in measurement, and the Angoff (1971) procedure continues to be widely used to set standards on multiple-choice examinations. This study focuses on the internal consistency, or underlying validity, of Angoff standard setting ratings. The Angoff procedure requires judges to estimate the proportion of borderline candidates who would answer each test question correctly. If the judges are successful at estimating the difficulty of items for borderline candidates, that success suggests an underlying validity to the procedure. This study examines the question by evaluating the relationships between Angoff standard setting ratings and actual candidate performance on professional certification tests. For each test, a borderline group of candidates was defined as those near the cutscore. The analyses focus on three aspects of judges' ratings with respect to item difficulties for the borderline group: accuracy, correlation, and variability. The results of this study provide some evidence for the validity of the Angoff standard setting procedure. For two of the three examinations studied, judges were accurate and consistent in rating the difficulty of items for borderline candidates. However, the study also shows that the procedure may be less successful in some applications. These results indicate that the procedure can be valid, but that its validity should be checked for each application; practitioners should not assume that the Angoff method is valid. The results also show some limitations of the procedure even when the overall results are positive: judges are less successful at rating very difficult or very easy test items. The validity of the Angoff procedure may be enhanced by further study of methods designed to ameliorate those limitations.
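To make the three evaluation criteria concrete, here is a minimal Python sketch (with made-up data and variable names of my own) of how judges' Angoff ratings can be compared with the empirical item difficulties of a borderline group. It illustrates the general approach described in the abstract, not the study's actual analysis.

```python
# Compare judges' Angoff ratings with a borderline group's observed item
# difficulties on three criteria: accuracy, correlation, and variability.
import numpy as np

rng = np.random.default_rng(0)
n_judges, n_items = 10, 40

# Hypothetical inputs: one row of probability ratings per judge, plus the
# borderline group's proportion-correct (p-value) on each item.
ratings = rng.uniform(0.3, 0.9, size=(n_judges, n_items))
borderline_p = rng.uniform(0.3, 0.9, size=n_items)

mean_rating = ratings.mean(axis=0)              # panel's estimate of each item's difficulty

accuracy = np.mean(mean_rating - borderline_p)  # signed error: positive means judges overestimate
correlation = np.corrcoef(mean_rating, borderline_p)[0, 1]
variability = ratings.std(axis=0, ddof=1).mean()  # average judge-to-judge spread per item

cutscore = mean_rating.sum()                    # Angoff cutscore: sum of mean item ratings
print(accuracy, correlation, variability, cutscore)
```

Accuracy here is the signed difference between the panel's mean ratings and the borderline-group p-values; the correlation and the judge-to-judge variability correspond to the other two criteria named in the abstract.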
22. An Analysis of FEMA Curricular Outcomes in an Emergency Management and Homeland Security Certificate Program: A Case Study Exploring Instructional Practice. Unknown Date.
In the United States, the higher education community is charged with the academic education of emergency management professionals. The present rate of natural disasters, as well as the evolving threat of terrorist attacks, has created a demand for practitioners who are solidly educated in emergency management knowledge, skills, and abilities. These conditions have in turn precipitated the aggressive growth of emergency management and homeland security academic programs in higher education, characterized as the most relevant development in the field of emergency management (Darlington, 2008). With the goal of accelerating professionalization of emergency management occupations through higher education, the Federal Emergency Management Agency's (FEMA) Higher Education Program focused its research efforts on developing a set of evidence-based competencies for academic programs, outlined in FEMA's Curricular Outcomes (2011). This study explored how these evidence-based competencies are manifested in emergency management and homeland security academic programs, and it contributes to filling the gap in the literature on the implementation of FEMA's professional competencies in academic programs, a gap that results from legal constraints prohibiting federal agencies from directly collecting implementation data. The results indicated that a wide range of competencies were represented in program coursework, with gaps in alignment identified in the five competency areas. The analysis also revealed the exclusion of homeland security topics from Curricular Outcomes (2011), which led to issues of operationalization. Lastly, instructors shared feedback for improving alignment, and the researcher discusses key conditions for similar use of a responsive evaluation framework in academic programs.
A Dissertation submitted to the Department of Educational Leadership and Policy Studies in partial fulfillment of the requirements for the degree of Doctor of Education. Spring Semester 2018. March 6, 2018. Keywords: curriculum alignment, curriculum mapping, emergency management academic programs, evidence-based practice, professional competencies, responsive evaluation. Includes bibliographical references. Committee: Linda Schrader, Professor Co-Directing Dissertation; Tamara Bertrand Jones, Professor Co-Directing Dissertation; Ralph Brower, University Representative; Marytza Gawlik, Committee Member; Motoko Akiba, Committee Member.
23. Assessing fit of item response theory models. Lu, Ying. 01 January 2006.
Item response theory (IRT) modeling is a statistical technique that is widely applied in the field of educational and psychological testing. The usefulness of IRT models, however, depends on the extent to which they effectively reflect the data, and model-data fit must be evaluated before model application by accumulating a wide variety of evidence that supports the proposed uses of the model with a particular set of data. This thesis addressed issues in the collection of two major sources of fit evidence to support IRT model application: evidence based on model-data congruence, and evidence based on intended uses of the model and practical consequences. Specifically, the study (a) proposed a new goodness-of-fit procedure, examined its performance using fitting and misfitting data, and compared its behavior with that of commonly used goodness-of-fit procedures, and (b) investigated through simulations the consequences of model misfit on two major IRT applications: equating and computerized adaptive testing (CAT). In all simulation studies, the three-parameter logistic model (3PLM) was assumed to be the true IRT model, while the one- and two-parameter models (1PLM and 2PLM) were treated as misfitting models. The study found that the new goodness-of-fit statistic correlated more strongly with the true size of misfit than the commonly used fit statistics did, making it a useful index for estimating the degree of misfit, which is often of interest but unknown in practice. A major issue with the new statistic is its inappropriately defined null distribution and critical values; as a result, the new statistical test appeared to be less powerful, but also less susceptible to Type I error. In examining the consequences of model-data misfit, the study showed that although theoretically the 2PLM could not provide a perfect fit to 3PLM data, there was minimal consequence if the 2PLM was used to equate 3PLM data and number-correct scores were to be reported. This, however, was not true in CAT, given the significant bias the 2PLM produced. The study further emphasized the importance of fit evaluation through both goodness-of-fit statistical tests and examination of the practical consequences of misfit.
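For reference, the 3PLM response function and its relation to the constrained models is standard IRT (standard notation, not specific to this thesis):

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp[-D a_i(\theta - b_i)]},$$

where $a_i$, $b_i$, and $c_i$ are the item discrimination, difficulty, and pseudo-guessing parameters and $D$ is a scaling constant (often 1.7). The 2PLM fixes $c_i = 0$ and the 1PLM additionally constrains all $a_i$ to a common value, which is why data generated with nonzero lower asymptotes cannot, in theory, be fit perfectly by the 1PLM or 2PLM.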
24. Detecting exposed items in computer-based testing. Han, Ning. 01 January 2006.
More and more testing programs are moving from traditional paper-and-pencil to computer-based administrations. Common practice in computer-based testing is that test items are used repeatedly over a short time period to support large volumes of examinees, which makes disclosed items a concern for the validity and fairness of test scores. Most current research focuses on controlling item exposure rates, which minimizes the probability that some items are overused, but there is no common understanding about issues such as how long an item pool should be used, what the pool size should be, and what exposure rates are acceptable. A different approach to addressing overexposure of test items is to focus on generating and investigating item statistics that reveal whether test items are known to examinees before they see the tests. This study proposed a method to detect disclosed items by monitoring the moving averages of some common item statistics. Three simulation studies were conducted to investigate and evaluate the usefulness of the method. The statistics investigated included classical item difficulty, IRT-based item raw residuals, and three kinds of IRT-based standardized item residuals. The detection statistic used in Study 1 was the classical item difficulty statistic. Study 2 investigated classical item difficulty, IRT-based item residuals, and the best known of the IRT-based standardized residuals. Study 3 investigated three different standardizations of residuals. Other variables in the simulations included window sizes, item characteristics, ability distributions, and the extent of item disclosure. Empirical Type I error rates and power of the method were computed for different situations. The results showed that, with reasonable window sizes (about 200 examinees), the IRT-based statistics produced the most promising results under a wide variety of conditions and seem ready for immediate implementation. Difficult and discriminating items were the easiest to spot when they had been exposed, and it is the most discriminating items that contribute most to proficiency estimation with multi-parameter IRT models; therefore, early detection of these items is especially important. The applicability of the approach to large-scale testing programs was also addressed.
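As an illustration of the moving-average idea (a sketch under assumed data, not the study's code), the classical difficulty of an item can be tracked over a sliding window of recent examinees and flagged when it drifts well above its pretest baseline. The 200-examinee window follows the abstract; the 0.10 tolerance is an arbitrary choice for the example.

```python
# Flag a possibly disclosed item by monitoring the moving average of its
# proportion-correct over a window of recent examinees.
import numpy as np

def flag_exposed(responses, baseline_p, window=200, threshold=0.10):
    """responses: 0/1 vector for one item, in administration order."""
    responses = np.asarray(responses, dtype=float)
    flags = []
    for start in range(0, len(responses) - window + 1):
        moving_p = responses[start:start + window].mean()
        if moving_p - baseline_p > threshold:   # performance jump beyond tolerance
            flags.append(start + window)        # position at which the item is flagged
    return flags

# Hypothetical data: the item becomes known after the 1000th administration.
rng = np.random.default_rng(1)
clean = rng.binomial(1, 0.55, size=1000)
leaked = rng.binomial(1, 0.85, size=1000)
print(flag_exposed(np.concatenate([clean, leaked]), baseline_p=0.55)[:1])
```

The same monitoring loop could track an IRT-based raw or standardized residual instead of the classical p-value, which is the comparison the simulation studies make.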
25. Equating high-stakes educational measurements: A study of design and consequences. Chulu, Bob Wajizigha. 01 January 2006.
The practice of equating educational and psychological tests to create comparable and interchangeable scores is becoming increasingly appealing to most testing and credentialing agencies. However, the Malawi National Examinations Board (MANEB) and many other testing organizations in Africa and Europe do not conduct equating, and the consequences of not equating tests have not been clearly documented. Furthermore, there are no suitable equating designs for some agencies to employ because they administer tests annually to different examinee populations and disclose all items after each administration. Therefore, the purposes of this study were to: (1) determine whether it was necessary to equate MANEB tests; (2) investigate the consequences of not equating educational tests; and (3) explore the possibility of using an external anchor test, administered separately from the target tests, to equate scores. The study used 2003, 2004, and 2005 Primary School Leaving Certificate (PSLCE) Mathematics scores for two randomly equivalent groups of eighth-grade examinees drawn from 12 primary schools in the Zomba district in Malawi. In the first administration, group A took the 2004 test while group B took the 2003 form. In the second administration both groups took an external anchor test, and five weeks later they both took the 2005 test. Data were analyzed using identity and log-linear methods, t-tests, decision consistency analyses, and classification consistency analyses, and by computing reduction-in-uncertainty and root mean square difference indices. Both linear and post-smoothed equipercentile methods were used to equate test scores. The study revealed that: (1) score distributions and test difficulties were dissimilar across test forms, signifying that equating is necessary; (2) classifications of students into grade categories across forms were different before equating but similar after equating; and (3) the external anchor test design performed in the same way as the random groups design. The results suggest that MANEB should equate test scores to improve the consistency of decisions and to match score distributions and difficulty levels across forms. Given the current policy of exam disclosure, the use of an external anchor test that is administered separately from the operational form to equate scores is recommended.
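For context, here is a minimal sketch of random-groups linear equating, one of the methods named above (hypothetical scores; the equipercentile and smoothing steps used in the study are omitted): scores on the new form are placed on the reference form's scale by matching means and standard deviations across randomly equivalent groups.

```python
# Random-groups linear equating: map a form-X raw score onto the form-Y scale.
import numpy as np

def linear_equate(x_scores, y_scores, x):
    """Return the form-Y equivalent of raw score x on form X."""
    mx, sx = np.mean(x_scores), np.std(x_scores, ddof=1)
    my, sy = np.mean(y_scores), np.std(y_scores, ddof=1)
    return my + (sy / sx) * (x - mx)

rng = np.random.default_rng(2)
group_a_on_x = rng.normal(28, 6, size=500)   # group A took the new form
group_b_on_y = rng.normal(31, 5, size=500)   # randomly equivalent group B took the old form
print(linear_equate(group_a_on_x, group_b_on_y, x=30))
```

When the groups are not randomly equivalent, an anchor test (internal or, as studied here, external and separately administered) supplies the link between the two score distributions.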
26. Small-sample item parameter estimation in the three-parameter logistic model: Using collateral information. Keller Stowe, Lisa Ann. 01 January 2002.
The appeal of computer adaptive testing (CAT) is growing in the licensure, credentialing, and educational fields. A major promise of CAT is the more efficient measurement of an examinee's ability. However, for CAT to be successful, a large calibrated item bank is essential. Because item selection depends on the proper calibration of items and accurate estimation of the item information functions, obtaining accurate and stable estimates of item parameters is paramount. However, concerns about item exposure and test security require item parameter estimation with much smaller samples than is typically recommended. Therefore, the development of methods for small-sample estimation is essential. The purpose of this study was to investigate a technique to improve small-sample estimation of item parameters, as well as recovery of item information functions, by using auxiliary information about items in the estimation process. A simulation study was conducted to examine the improvements in both item parameter and item information recovery. Several conditions were simulated, including sample size, test length, and quality of collateral information. The collateral information was used to set prior distributions on the item parameters. Several prior distributions were placed on both the a- and b-parameters and were compared to each other as well as to the default options in BILOG. The results indicate that with relatively good collateral information, nontrivial gains in both item parameter and item information recovery can be made. The current literature on automatic item generation indicates that such information is available for the prediction of item difficulty. The largest improvements were in the bias of both the a-parameters and the information functions. The implication is that more accurate item selection can occur, leading to more accurate estimates of examinee ability.
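The following sketch illustrates the general idea of using collateral information to set item-parameter priors (the prior families, values, and function names are my own illustrative choices, not the specific priors or BILOG options examined in the study): a difficulty prediction from item features centers a normal prior on b, a lognormal-style prior is placed on a, and the priors enter estimation as a penalty added to the item log-likelihood.

```python
# Prior-augmented (MAP-style) objective for one 3PL item, with the b prior
# centered on a difficulty value predicted from collateral item information.
import numpy as np

def log_prior(a, b, b_predicted, sd_b=0.5, mu_log_a=0.0, sd_log_a=0.3):
    lp_b = -0.5 * ((b - b_predicted) / sd_b) ** 2           # normal prior on b
    lp_a = -0.5 * ((np.log(a) - mu_log_a) / sd_log_a) ** 2  # lognormal-style prior on a
    return lp_a + lp_b

def item_loglik_3pl(a, b, c, theta, u, D=1.7):
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))

def map_objective(a, b, c, theta, u, b_predicted):
    # Log-likelihood plus log-prior informed by collateral information.
    return item_loglik_3pl(a, b, c, theta, u) + log_prior(a, b, b_predicted)

# Hypothetical small-sample data for a single item.
rng = np.random.default_rng(3)
theta = rng.normal(size=300)
u = rng.binomial(1, 0.6, size=300)
print(map_objective(a=1.1, b=0.2, c=0.2, theta=theta, u=u, b_predicted=0.0))
```

Maximizing such an objective shrinks small-sample estimates toward the values suggested by the collateral information, which is the mechanism behind the reported gains in parameter and information recovery.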
27. Evaluating the consistency and accuracy of proficiency classifications using item response theory. Li, Shuhong. 01 January 2006.
As demanded by the No Child Left Behind (NCLB) legislation, state-mandated testing has increased dramatically, and almost all of these tests report examinees' performance in terms of several ordered proficiency categories. Like licensure exams, these assessments often have high-stakes consequences, such as graduation requirements and school accountability. These tests must therefore be of high quality, and their quality can be assessed, in part, through decision accuracy (DA) and decision consistency (DC) indices. With the growing popularity of item response theory (IRT), an increasing number of testing programs use IRT for test development, test score equating, and other data analyses, which naturally calls for approaches to evaluating DA and DC within the IRT framework. However, it is still common to see all data analyses carried out with IRT while the reported DA and DC indices are derived in the framework of classical test theory. This situation highlights the need to explore ways of quantifying DA and DC under IRT. The current project addressed several possible methods for estimating DA and DC in the framework of IRT, with a specific focus on tests involving both dichotomous and polytomous items. It consisted of several simulation studies in which all the IRT methods introduced were evaluated with simulated data; the methods were also applied in a real-data context to demonstrate their use in practice. Overall, the results provided evidence supporting the use of the three IRT methods introduced in this project for estimating DA and DC indices in most of the simulated situations; in most cases the three IRT methods produced results that were close to the "true" DA and DC values and consistent with (sometimes even better than) those from the commonly used Livingston and Lewis (L&L) method. The IRT methods appeared more robust to distribution shape than to test length. Implications for educational measurement and directions for future studies in this area are also discussed.
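As a conceptual illustration only (not one of the three IRT methods the dissertation introduces), DA and DC can be defined through simulation: decision accuracy is the agreement between observed-score classifications and the true classification implied by theta, and decision consistency is the agreement between classifications from two independent parallel administrations. The 2PL items, cut points, and sample sizes below are arbitrary.

```python
# Simulation-based definitions of decision accuracy (DA) and decision
# consistency (DC) for a pass/fail classification under a 2PL test.
import numpy as np

rng = np.random.default_rng(4)
n_examinees, n_items = 5000, 40
a = rng.lognormal(0.0, 0.3, n_items)
b = rng.normal(0.0, 1.0, n_items)

theta = rng.normal(size=n_examinees)
p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))        # 2PL response probabilities

score_1 = rng.binomial(1, p).sum(axis=1)               # first administration
score_2 = rng.binomial(1, p).sum(axis=1)               # independent parallel replication

theta_cut, raw_cut = 0.0, 20                           # hypothetical passing standards
true_pass = theta >= theta_cut
da = np.mean((score_1 >= raw_cut) == true_pass)        # accuracy: agreement with true status
dc = np.mean((score_1 >= raw_cut) == (score_2 >= raw_cut))  # consistency across replications
print(da, dc)
```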
28. USING RESIDUAL ANALYSES TO ASSESS ITEM RESPONSE MODEL-TEST DATA FIT (MEASUREMENT TESTING). MURRAY, LINDA NORINE. 01 January 1985.
Statistical tests are commonly used for studying item response model-test data fit. However, many of these tests have well-known problems, the biggest being the confounding of sample size in the interpretation of fit results. In this study, the fit of three item response models was investigated using a different approach: exploratory residual procedures. These residual techniques rely on judgment for interpreting the size and direction of discrepancies between observed and expected examinee performance. The objectives of the study were to investigate whether exploratory procedures involving residuals are valuable for judging model-data fit, and to examine the fit of the one-parameter, two-parameter, and three-parameter logistic models to National Assessment of Educational Progress (NAEP) and Maryland Functional Reading Test (MFRT) data. The objectives were addressed by determining whether judgments about model-data fit are altered when different variations of residuals are used in the analysis, and by examining fit at the item, ability, and overall test levels using plots and simple summary statistics. Reasons for model misfit were sought by analyzing associations between the residuals and important item variables. The results showed that statistics based on average raw and standardized residuals both provided useful fit information, but that the statistics based on standardized residuals presented a more accurate picture of model-data fit and therefore provided the best overall fit information. Other results revealed that, with the NAEP and MFRT types of items, failure to consider variations in item discriminating power resulted in the one-parameter model providing substantially poorer fits to the data sets. Also, guessing on difficult NAEP multiple-choice items affected the degree of model-data fit. The main recommendation from the study is that, because the residual analyses provide substantial amounts of empirical evidence about fit, practitioners should consider these procedures as one of several strategies to employ when dealing with the goodness-of-fit question.
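One common way to define the residuals referred to above (notation mine): for examinees grouped at ability level $\theta_j$ (group size $N_j$), the raw and standardized residuals for item $i$ are

$$r_{ij} = O_{ij} - E_{ij}, \qquad z_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}\,(1 - E_{ij})/N_j}},$$

where $O_{ij}$ is the observed proportion correct in the group and $E_{ij} = P_i(\theta_j)$ is the proportion predicted by the fitted item response model. Dividing by the binomial standard error puts residuals from different items and ability levels on a comparable scale, which is consistent with the finding that standardized residuals gave the more accurate picture of fit.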
29. INVESTIGATION OF JUDGES' ERRORS IN ANGOFF AND CONTRASTING-GROUPS CUT-OFF SCORE METHODS (STANDARD SETTING, MASTERY TESTING, CRITERION-REFERENCED TESTING). ARRASMITH, DEAN GORDON. 01 January 1986.
Methods for specifying cut-off scores for a criterion-referenced test usually rely on judgments about item content and/or examinees. Comparisons of cut-off score methods have found that different methods result in different cut-off scores. This dissertation focuses on understanding why and how cut-off score methods differ. The importance of this understanding is reflected in practitioners' need to choose appropriate cut-off score methods and to understand and control factors that may inappropriately influence the cut-off scores. First, a taxonomy of cut-off score methods was developed; the taxonomy identified the generic categories of approaches to setting cut-off scores. Second, the research focused on three approaches for estimating the errors associated with setting cut-off scores: generalizability theory, item response theory, and bootstrap estimation. These approaches were applied to the Angoff and Contrasting-groups cut-off score methods. For the Angoff method, the IRT index of consistency and analyses of the differences between judges' ratings and expected test item difficulties provided useful information for reviewing specific test items that judges rated inconsistently. In addition, the generalizability theory and bootstrap estimates were useful as overall estimates of the errors in judges' ratings. For the Contrasting-groups method, the decision accuracy of the classroom cut-off scores was useful for identifying classrooms in which the classification of students may need to be reviewed by teachers, and the bootstrap estimate based on the pooled sample of students provided a useful overall estimate of the error in the resulting cut-off score. Several extensions of this investigation can be made. For example, there is a need to understand the magnitude of errors in relation to the precision with which judges are able to rate test items or classify examinees; better ways of reporting and dealing with judges' inconsistencies need to be developed; and the analysis of errors needs to be extended to other cut-off score methods. Finally, these procedures can provide the operational criterion against which improvements and comparisons of cut-off score procedures can be evaluated.
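A minimal sketch of the bootstrap idea applied to an Angoff cutscore (hypothetical ratings; the dissertation's generalizability-theory and IRT analyses are not shown): judges are resampled with replacement, the cutscore is recomputed for each resample, and the spread of the resampled cutscores estimates the error attributable to the particular panel of judges.

```python
# Bootstrap standard error of an Angoff cutscore by resampling judges.
import numpy as np

def bootstrap_cutscore_se(ratings, n_boot=2000, seed=5):
    """ratings: (n_judges, n_items) array of Angoff probability estimates."""
    rng = np.random.default_rng(seed)
    n_judges = ratings.shape[0]
    cutscores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_judges, size=n_judges)      # resample judges with replacement
        cutscores.append(ratings[idx].mean(axis=0).sum())   # cutscore = sum of mean item ratings
    return np.std(cutscores, ddof=1)

rng = np.random.default_rng(6)
ratings = rng.uniform(0.4, 0.9, size=(12, 50))              # hypothetical 12-judge panel
print(bootstrap_cutscore_se(ratings))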
30. Including students with significant disabilities in school climate assessment. Macari, Krista M. 26 January 2022.
The purpose of this qualitative case study was to explore ways to include students with significant disabilities in school climate measures. Student perceptions of school climate are generally collected through self-report surveys, and these tools may be inaccessible to students with significant cognitive, language, physical, or other challenges. This study had two main research objectives: to understand how core school climate constructs apply to the experience of students with significant disabilities, and to explore the development of a meaningful tool that could capture their experience. To address the first goal, a series of in-depth interviews was conducted with staff, parents, and students from a private special education day program to learn how the key school climate constructs of Engagement, Safety, and Environment from the US DOE 2019 Safe and Supportive Schools framework applied to the experience of students with significant disabilities. Results from this initial phase of the study included themes related to understanding difference, supporting authentic relationships, meaningful participation and accessibility, independence, and qualities of the physical learning environment. These data were applied in the second phase of the study to develop a rating scale that could potentially be used to include the experience of students with significant disabilities in school climate measures. Using the 2018 Massachusetts Views of Climate and Learning survey as a model and incorporating the interview data, parent and staff versions of the School Climate Rating Scale were created. A focus group and two rounds of cognitive laboratory interviews were conducted to assess how potential respondents would interpret and respond to the rating scale prompts. Feedback was also gathered about the viability of the School Climate Rating Scale as an alternative tool for measuring the school climate experiences of students with significant disabilities.