11 |
Evaluating several multidimensional adaptive testing procedures for diagnostic assessment / Yoo, Hanwook, 01 January 2011 (has links)
This computer simulation study comprehensively investigated how formative test designs can capitalize on the dimensional structure among the proficiencies being measured, on item selection methods, and on computerized adaptive testing to improve measurement precision and classification accuracy. Four variables were manipulated to investigate the effectiveness of multidimensional adaptive testing (MAT): the number of dimensions measured by the test, the magnitude of the correlations among the dimensions, the item selection method, and the test design. Outcome measures included recovery of known proficiency scores, bias in estimation, and accuracy of proficiency classifications. Unlike previous MAT research, no significant effect of the number of dimensions was found on the outcome measures. A moderate improvement in the outcome measures was found with higher correlations (e.g., .50 or .80) among the dimensions. Four item selection methods (Bayesian, Fisher, optimal, and random) were applied to compare the measurement efficiency of adaptive and non-adaptive item selection, with random selection serving as a baseline. The Bayesian item selection method showed the best results across conditions. The Fisher item selection method showed the second-best results, but the gap among the adaptive item selection methods narrowed with longer tests and higher correlations among the dimensions. The optimal item selection method produced results comparable to the adaptive methods when the focus was on the accuracy of decision making, which in many applications of diagnostic assessment is the most important criterion. The impact of increased test length under the fixed-length test conditions was apparent on all of the outcome measures. The results suggest that the Bayesian item selection method can be quite useful when there are at least moderate correlations among the dimensions. Because these results were obtained using good prior estimates, a next step should investigate the impact of poor (i.e., inaccurate) prior information, for example priors that are too high, too low, or too tight, on the validity of the Bayesian approach. We note, too, the very good results obtained with optimal item selection when the focus was on the accuracy of proficiency classifications.
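A minimal sketch of one common MAT item selection rule may help make the comparison concrete. The abstract does not specify the exact algorithms used, so the Python example below assumes a multidimensional 2PL model and D-optimal (Fisher information) selection; the item pool and all parameter values are hypothetical.

```python
import numpy as np

def prob_m2pl(a, d, theta):
    """Correct-response probability under a multidimensional 2PL model."""
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

def item_information(a, d, theta):
    """Fisher information matrix contribution of a single item at theta."""
    p = prob_m2pl(a, d, theta)
    return p * (1.0 - p) * np.outer(a, a)

def select_next_item(A, D, theta_hat, administered, info_so_far):
    """D-optimal (Fisher) selection: choose the unused item that maximizes
    the determinant of the accumulated information matrix."""
    best_j, best_det = None, -np.inf
    for j in range(len(D)):
        if j in administered:
            continue
        det = np.linalg.det(info_so_far + item_information(A[j], D[j], theta_hat))
        if det > best_det:
            best_j, best_det = j, det
    return best_j

# Hypothetical pool: 100 items measuring 3 correlated dimensions
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, size=(100, 3))   # discrimination vectors
D = rng.normal(0.0, 1.0, size=100)         # intercepts
theta_hat = np.zeros(3)                    # provisional proficiency estimate
info = 0.1 * np.eye(3)                     # weak prior information
print(select_next_item(A, D, theta_hat, administered=set(), info_so_far=info))
```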
|
12 |
The impact of judges' consensus on the accuracy of anchor-based judgmental estimates of multiple-choice test item difficulty: The case of the NATABOC Examination / DiBartolomeo, Matthew, 01 January 2010 (has links)
Multiple factors have led testing agencies to consider more carefully the manner and frequency in which pretest item data are collected and analyzed. One potentially promising approach is the use of judges' estimates of item difficulty. Accurate estimates of item difficulty may be used to reduce pretest sample sizes, supplement insufficient pretest sample sizes, aid in test form construction, assist in test form equating, calibrate test item writers who may be asked to produce items to meet statistical specifications, inform the process of standard setting, aid in preparing randomly equivalent blocks of pretest items, and/or aid in setting item response theory prior distributions. Two groups of 11 and eight judges, respectively, provided estimates of difficulty for the same set of 33 multiple-choice items from the National Athletic Trainers' Association Board of Certification (NATABOC) Examination. Judges were faculty in Commission on Accreditation of Athletic Training Education-approved athletic training education programs and were NATABOC-approved examiners of the former hands-on practical portion of the Examination. For each item, judges provided two rounds of independent estimates of item difficulty and a third-round group-level consensus estimate. Before providing estimates in rounds two and three, the judges discussed the estimates provided in the preceding round. In general, the judges' estimates of test item difficulty did not improve across rounds as predicted. Two-way repeated measures analyses of variance comparing item set mean difficulty estimates by round with the item set mean empirical item difficulty revealed no statistically significant differences across rounds, groups, or their interaction. Moreover, the item set mean difficulty estimates drifted away from the item set mean empirical item difficulty across rounds, so mean estimation bias and effect sizes gradually increased. In short, no estimation round yielded statistically significantly better recovery of the empirical item difficulty values than the others.
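As a rough illustration of the evaluation criteria, the sketch below shows one way mean estimation bias and a standardized effect size between judged and empirical item difficulties could be computed; the study's formal analyses used two-way repeated measures ANOVA, and all values here are hypothetical.

```python
import numpy as np

def mean_bias(estimates, empirical):
    """Signed mean difference between judged and empirical difficulties."""
    return float(np.mean(np.asarray(estimates) - np.asarray(empirical)))

def cohens_d(estimates, empirical):
    """Standardized mean difference for paired item-level values."""
    diff = np.asarray(estimates) - np.asarray(empirical)
    return float(np.mean(diff) / np.std(diff, ddof=1))

# Hypothetical proportion-correct values for a handful of items
empirical_p = [0.62, 0.48, 0.71, 0.55, 0.80]
round1_est  = [0.58, 0.52, 0.65, 0.60, 0.75]
print(mean_bias(round1_est, empirical_p), cohens_d(round1_est, empirical_p))
```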
|
13 |
Impact of item parameter drift on test equating and proficiency estimates / Han, Kyung Tyek, 01 January 2008 (has links)
When test equating is implemented, item parameter drift (IPD), especially in the linking items of an anchor test design, can introduce flaws into the measurement. An important question that has not yet been examined, however, is how much IPD can be tolerated before its effects become consequential. To answer this overarching question, three Monte Carlo simulation studies were conducted. The first study, titled 'Impact of unidirectional IPD on test equating and proficiency estimates,' examined the indirect effect of IPD on proficiency estimates through its effect on test equating designs that use linking items containing IPD. The regression-line-like plots in the results provide a comprehensive illustration of the relationship between IPD and its consequences, which can serve as an important guideline for practitioners when IPD is expected. The second study, titled 'Impact of multidirectional IPD on test equating and proficiency estimates,' investigated the impact of different combinations of linking items with various multidirectional IPD on the test equating procedure for three popular scaling methods (mean-mean, mean-sigma, and TCC). It was hypothesized that multidirectional IPD would influence the amount of random error observed in the linking while the effect on systematic error would be minimal. The results confirmed this hypothesis and also showed that different combinations of multidirectional IPD result in different levels of impact even with the same total amount of IPD. The third study, titled 'Impact of IPD on pseudo-guessing parameter estimates and test equating,' examined how serious the consequences are if c-parameters are not transformed in the test equating procedure when IPD exists. Three new item calibration strategies for placing c-parameter estimates on the same scale across tests were proposed. The results indicated that the consequences of IPD under the various calibration strategies and scaling methods were not substantially different when the external linking design was used, but the choice of calibration method and scaling method could produce different outcomes when the internal linking design and/or different cut scores were used.
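For readers unfamiliar with the scaling methods named above, the sketch below illustrates mean-sigma linking, one of the three methods compared; drift in any common item distorts the estimated constants. The parameter values are hypothetical and the code is only a schematic of the procedure.

```python
import numpy as np

def mean_sigma_link(b_new, b_base):
    """Mean-sigma linking constants from common-item b-parameter estimates."""
    b_new, b_base = np.asarray(b_new, float), np.asarray(b_base, float)
    A = np.std(b_base, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

def transform_parameters(a, b, A, B):
    """Place new-form item parameters on the base scale."""
    return np.asarray(a) / A, A * np.asarray(b) + B

# Hypothetical linking items; drift in any of them distorts A and B
b_base = [-1.2, -0.4, 0.1, 0.8, 1.5]
b_new  = [-1.0, -0.2, 0.3, 1.0, 1.9]
A, B = mean_sigma_link(b_new, b_base)
print(round(A, 3), round(B, 3))
```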
|
14 |
Approaches for addressing the fit of item response theory models to educational test data / Zhao, Yue, 01 January 2008 (has links)
The study was carried out to accomplish three goals: (1) propose graphical displays of IRT model fit at the item level and suggest fit procedures at the test level that are not affected by large sample sizes, (2) examine the impact of IRT model misfit on proficiency classifications, and (3) investigate the consequences of model misfit in assessing academic growth. The main focus of the first goal was the use of more and better graphical procedures for investigating model fit and misfit through residuals and standardized residuals at the item level. In addition, some new graphical procedures and a non-parametric test statistic for investigating fit at the test score level were introduced, and examples were provided. Statistical and graphical methods were applied to a realistic dataset from a high school assessment, and the results were reported. More important than the results about the actual fit were the procedures that were developed and evaluated. In addressing the second goal, the practical consequences of IRT model misfit for performance classifications and test score precision were examined. With several of the data sets under investigation, test scores were noticeably less well recovered with the misfitting model, and there were practically significant differences in the accuracy of classifications under the model that fit the data less well. In addressing the third goal, the consequences of model misfit in assessing academic growth were examined in terms of test score precision, decision accuracy, and passing rates. The three-parameter logistic/graded response (3PL/GR) models produced more accurate estimates than the one-parameter logistic/partial credit (1PL/PC) models, and the fixed common item parameter method produced results closer to “truth” than linear equating using the mean and sigma transformation. IRT model fit studies have not received the attention they deserve among testing agencies and practitioners. At the same time, IRT models can almost never provide a perfect fit to the test data, yet the evidence is substantial that these models can provide an excellent framework for solving practical measurement problems. The importance of this study is that it provides ideas and methods for addressing model fit and, most importantly, highlights studies for addressing the consequences of model misfit when making determinations about the suitability of particular IRT models.
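The item-level residual displays described above can be illustrated with a short sketch: examinees are grouped by estimated ability, and observed proportions correct are compared with model-based expectations within each group. The example assumes a Rasch model and simulated data; the dissertation's actual procedures and models may differ.

```python
import numpy as np

def rasch_prob(theta, b):
    """Correct-response probability under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def standardized_residuals(responses, theta_hat, b, n_groups=10):
    """Group examinees by ability estimate and compare observed with
    model-expected proportion correct for one item in each group."""
    order = np.argsort(theta_hat)
    residuals = []
    for g in np.array_split(order, n_groups):
        obs = responses[g].mean()
        exp = rasch_prob(theta_hat[g], b).mean()
        se = np.sqrt(max(exp * (1 - exp) / len(g), 1e-12))
        residuals.append((obs - exp) / se)
    return np.array(residuals)

# Simulated illustration: 2000 examinees answering one Rasch item (b = 0.3)
rng = np.random.default_rng(1)
theta = rng.normal(0, 1, 2000)
y = (rng.random(2000) < rasch_prob(theta, 0.3)).astype(int)
print(np.round(standardized_residuals(y, theta, 0.3), 2))
```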
|
15 |
Item parameter drift as an indication of differential opportunity to learn: An exploration of item flagging methods & accurate classification of examinees / Sukin, Tia M, 01 January 2010 (has links)
The presence of outlying anchor items is an issue faced by many testing agencies. The decision to retain or remove an item is a difficult one, especially when item removal makes the content representation of the anchor set questionable. Additionally, the reason for the aberrance is not always clear; if an item's performance has changed because of improvements in instruction, then removing the anchor item may not be appropriate and might produce misleading conclusions about examinee proficiency. This study was conducted in two parts, a simulation and an empirical data analysis, in which the effect on examinee classification of removing or retaining aberrant anchor items was investigated. Three detection methods were explored: (1) the delta plot, (2) IRT b-parameter plots, and (3) the RPU method. In the simulation study, the degree of aberrance and the ability distribution of examinees were manipulated, and five aberrant item schemes were employed. In the empirical data analysis, archived statewide science achievement data suspected to reflect differential opportunity to learn between administrations were re-analyzed using the various item parameter drift detection methods. The results of both studies support excluding flagged items from linking when a matrix-sampling design is used and the anchor contains a large number of items. Although neither the delta plot nor the IRT b-parameter plot method produced results that overwhelmingly support its use, it is recommended that both be employed in practice until further research is conducted on alternative methods such as the RPU method, since classification accuracy increases when flagged items are removed and, in most cases, growth is not misrepresented by doing so.
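As an illustration of the first detection method, the sketch below implements a basic delta-plot check: classical p-values from two administrations are transformed to the delta metric, a principal axis is fit, and items far from that axis are flagged. The threshold and the p-values are hypothetical, and the study's exact implementation may differ.

```python
import numpy as np
from scipy.stats import norm

def to_delta(p):
    """Transform proportion-correct values to the delta metric."""
    return 13.0 + 4.0 * norm.ppf(1.0 - np.asarray(p, float))

def delta_plot_flags(p_admin1, p_admin2, threshold=1.5):
    """Flag anchor items far from the principal axis of the delta plot."""
    d1, d2 = to_delta(p_admin1), to_delta(p_admin2)
    s1, s2, cov = d1.std(ddof=1), d2.std(ddof=1), np.cov(d1, d2)[0, 1]
    # Principal-axis (major-axis) slope and intercept
    slope = (s2**2 - s1**2 + np.sqrt((s2**2 - s1**2) ** 2 + 4 * cov**2)) / (2 * cov)
    intercept = d2.mean() - slope * d1.mean()
    # Perpendicular distance of each item from the principal axis
    dist = np.abs(slope * d1 - d2 + intercept) / np.sqrt(slope**2 + 1)
    return np.where(dist > threshold)[0], dist

# Hypothetical anchor-item p-values from two administrations (item 2 drifted)
p1 = [0.72, 0.55, 0.65, 0.40, 0.81, 0.58, 0.47, 0.70, 0.62, 0.52]
p2 = [0.70, 0.56, 0.35, 0.41, 0.80, 0.57, 0.48, 0.69, 0.61, 0.53]
flags, distances = delta_plot_flags(p1, p2)
print(flags, np.round(distances, 2))
```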
|
16 |
Examination of the application of item response theory to the Angoff standard setting procedure / Clauser, Jerome Cody, 01 January 2013 (has links)
Establishing valid and reliable passing scores is a vital activity for any examination used to make classification decisions. Although there are many different approaches to setting passing scores, this thesis focuses specifically on the Angoff standard setting method. The Angoff method is a test-centered approach to estimating performance standards grounded in classical test theory. In the Angoff method, each judge estimates the proportion of minimally competent examinees who will answer each item correctly. These values are summed across items and averaged across judges to arrive at a recommended passing score. Unfortunately, research has shown that the Angoff method has a number of limitations that can undermine both the validity and the reliability of the resulting standard. Many of these limitations can be linked to its grounding in classical test theory. The purpose of this study was to determine whether the limitations of the Angoff method could be mitigated by a transition to an item response theory (IRT) framework. Item response theory is a modern measurement model relating examinees' latent ability to their observed test performance. Theoretically, the transition to an IRT-based Angoff method could result in more accurate, stable, and efficient passing scores. The work was divided into three studies designed to assess the potential advantages of an IRT-based Angoff method. Study one examined the effect of allowing judges to skip unfamiliar items during the rating process; the goal was to detect whether passing scores are artificially biased by deficits in the content experts' item-level content knowledge. Study two explored the potential benefit of setting passing scores on an adaptively selected subset of test items, attempting to leverage IRT's score invariance property to estimate passing scores more efficiently. Finally, study three compared IRT-based standards to traditional Angoff standards in a simulation study, to determine whether passing scores set using the IRT Angoff method had greater stability and accuracy than those set using the common True Score Angoff method. Together, these three studies examined the potential advantages of an IRT-based approach to setting passing scores. The results indicate that the IRT Angoff method does not produce more reliable passing scores than the common Angoff method. The transition to the IRT-based approach does, however, effectively ameliorate two sources of systematic error in the common Angoff method: error introduced by requiring that all judges rate all items, and error introduced during the transition from test-score to scaled-score passing scores. By eliminating these sources of error, the IRT-based method allows for accurate and unbiased estimation of the judges' true opinion of the ability of the minimally capable examinee. Although not all of the theoretical benefits of the IRT Angoff method could be demonstrated empirically, the results of this thesis are extremely encouraging. The IRT Angoff method was shown to eliminate two sources of systematic error, resulting in more accurate passing scores. In addition, this thesis provides a strong foundation for a variety of studies with the potential to aid in the selection, training, and evaluation of content experts.
Overall findings from this thesis suggest that the application of IRT to the Angoff standard setting method has the potential to offer significantly more valid passing scores.
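One way an IRT framework can carry Angoff-style ratings onto the ability scale is to invert the test characteristic curve at the sum of a judge's ratings, yielding a theta cut score. The sketch below shows that idea under a 3PL model with hypothetical item parameters and ratings; it is not necessarily the specific IRT Angoff procedure evaluated in the thesis.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Correct-response probability under a 3PL model."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, a, b, c):
    """Test characteristic curve: expected number-correct score at theta."""
    return p_3pl(theta, a, b, c).sum()

def theta_cut_from_ratings(ratings, a, b, c, lo=-4.0, hi=4.0):
    """Invert the TCC by bisection: find the theta whose expected score
    equals the sum of one judge's Angoff ratings."""
    target = float(np.sum(ratings))
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if tcc(mid, a, b, c) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical 3PL parameters and one judge's ratings for a 5-item form
a = np.array([1.0, 0.8, 1.3, 0.9, 1.1])
b = np.array([-0.5, 0.0, 0.4, 1.0, -1.2])
c = np.array([0.20, 0.25, 0.15, 0.20, 0.20])
ratings = np.array([0.70, 0.60, 0.45, 0.35, 0.80])
print(round(theta_cut_from_ratings(ratings, a, b, c), 3))
```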
|
17 |
Application of item response theory models to the algorithmic detection of shift errors on paper and pencil tests / Cook, Robert Joseph, 01 January 2013 (has links)
On paper-and-pencil multiple-choice tests, the potential for examinees to mark their answers in incorrect locations presents a serious threat to the validity of test score interpretations. When an examinee skips one or more items (i.e., answers out of sequence) but fails to reflect the size of that skip accurately on the answer sheet, the result can be a string of misaligned responses called shift errors. Shift errors can cause correct answers to be marked as incorrect, leading to possible underestimation of an examinee's true ability. Despite the movement toward computerized testing in recent years, paper-and-pencil multiple-choice tests are still pervasive in many high-stakes assessment settings, including K-12 testing (e.g., MCAS) and college entrance exams (e.g., SAT), leaving a continuing need to address issues that arise with this format. Techniques for detecting aberrant response patterns are well established but do little to identify the reasons for the aberrance, limiting the options for addressing misfitting patterns. Although some work has been done to detect and address specific forms of aberrant response behavior, little has been done on shift error detection, leaving considerable room for improvement in addressing this source of aberrance. The ability to accurately detect construct-irrelevant errors, and either adjust scores to reflect examinee ability more accurately or flag examinees with inaccurate scores for removal from the dataset and retesting, would improve the validity of important decisions based on test scores and could improve model fit by allowing more accurate item parameter and ability estimation. The purpose of this study is to investigate new algorithms for shift error detection that use IRT models to determine probabilistically whether misfitting patterns are likely to be shift errors. The study examines a matrix of detection algorithms, probabilistic models, and person parameter methods, testing combinations of these factors for their selectivity (i.e., true positives vs. false positives), sensitivity (i.e., true shift errors detected vs. undetected), and robustness to parameter bias, all under a carefully manipulated, multifaceted simulation environment. This investigation attempts to answer the following questions, applicable across detection methods, bias reduction procedures, shift conditions, and ability levels, but stated generally as: (1) How sensitively and selectively can an IRT-based probabilistic model detect shift errors across the full range of probabilities under specific conditions? (2) How robust is each detection method to the parameter bias introduced by shift errors? (3) How well does the detection method detect shift errors compared with other, more general indices of person fit? (4) What is the impact on bias of making the proposed corrections to detected shift errors? (5) To what extent does shift error, as detected by the method, occur within an empirical data set? Results show that the proposed methods can indeed detect shift errors at reasonably high detection rates with only a minimal number of false positives, that detection improves for longer shift errors, and that examinee ability is a major determining factor in the effectiveness of the shift error detection techniques. Although some detection ability is lost to person parameter bias, this loss is minimal for all but the shortest shift errors.
Application to empirical data also proved effective, though some discrepancies in projected total counts suggest that refinements of the technique are required. Using a person fit statistic to detect examinees with shift errors was shown to be completely ineffective, underscoring the value of shift-error-specific detection methods.
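A minimal sketch of the underlying idea: score a suspect segment of bubbled responses against its recorded items and against the answer key offset by one position, and compare the IRT likelihoods. The 2PL model, parameters, and simulated responses below are hypothetical, and the dissertation's algorithms are more elaborate.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Correct-response probability under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def loglik(scored, theta, a, b):
    p = p_2pl(theta, a, b)
    return np.sum(scored * np.log(p) + (1 - scored) * np.log(1 - p))

def shift_log_ratio(choices, key, theta, a, b, start, length, shift=1):
    """Log-likelihood ratio comparing a suspect segment scored against its
    recorded items with the same bubbled choices scored against the items
    `shift` positions later (the shift-error hypothesis)."""
    seg = choices[start:start + length]
    idx0 = np.arange(start, start + length)   # recorded alignment
    idx1 = idx0 + shift                       # shifted alignment
    scored0 = (seg == key[idx0]).astype(int)
    scored1 = (seg == key[idx1]).astype(int)
    ll0 = loglik(scored0, theta, a[idx0], b[idx0])
    ll1 = loglik(scored1, theta, a[idx1], b[idx1])
    return ll1 - ll0   # large positive values favor the shift-error hypothesis

# Hypothetical 40-item test with 4-option items; the examinee skips item 10
# in the booklet but not on the answer sheet, shifting bubbles 10-17 by one.
rng = np.random.default_rng(2)
n_items = 40
key = rng.integers(0, 4, n_items)
a = rng.uniform(0.7, 1.5, n_items)
b = rng.normal(0.0, 1.0, n_items)
choices = key.copy()                 # a high-ability examinee, for simplicity
choices[10:18] = key[11:19]          # simulated one-position shift error
print(shift_log_ratio(choices, key, theta=1.5, a=a, b=b, start=10, length=8))
```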
|
18 |
Consistency of Single Item Measures Using Individual-Centered Structural Analyses / Iaccarino, Stephanie, 12 1900 (has links)
Estimating reliability for single-item motivational measures presents challenges, particularly when constructs are expected to vary across time (e.g., effort, self-efficacy, emotions). We explored an innovative approach to estimating the reliability of single-item motivational measures by defining reliability as consistency in interpreting the meaning of items. We applied a psychometric approach to identifying meaning systems from distances between items and operationalized meaning systems as participants' ordinally ranked responses to the items. We investigated the feasibility of this approach using responses from 193 undergraduate participants in an Introduction to Biology course to five single items assessing motivational constructs, collected through thirteen weekly questionnaires. Partitioning around medoids (PAM) analysis was used to identify an optimal solution from which systems of meaning (SOMs) were identified by the investigator. Transitions from SOM to SOM were tracked across time for each individual in the sample, and consistency groupings based on the percentage of consecutively repeated SOMs were computed for each individual. Results suggested that six SOMs emerged from an optimal eight-cluster solution. While moderate numbers of SOM-to-SOM transitions occurred, a small minority of participants consecutively repeated the same SOM across time and were placed in the high consistency group; participants with moderate and low percentages were placed in lower consistency groups accordingly. These results provide preliminary evidence in support of the approach, particularly for highly consistent participants whose reliability might be misrepresented by conventional single-item reliability methods. Implications of the proposed approach and propositions for future research are included. / Educational Psychology
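A small sketch of the clustering and consistency steps may be useful. The code below implements a simple PAM-style k-medoids on distances between weekly response profiles and a per-person consistency index; the distance measure, cluster count, and data are hypothetical rather than those used in the study.

```python
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    """Minimal PAM-style k-medoids: alternately assign points to the
    nearest medoid and re-select each cluster's best medoid."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(dist[:, medoids], axis=1)

def consistency(som_sequence):
    """Proportion of week-to-week transitions that repeat the same SOM."""
    s = np.asarray(som_sequence)
    return float(np.mean(s[1:] == s[:-1]))

# Hypothetical data: rows are person-weeks of ordinally ranked responses
# to five motivational items; Manhattan distance between response profiles.
rng = np.random.default_rng(3)
profiles = rng.integers(1, 6, size=(60, 5))
dist = np.abs(profiles[:, None, :] - profiles[None, :, :]).sum(axis=2).astype(float)
medoids, labels = k_medoids(dist, k=4)
print(medoids, consistency(labels[:13]))   # e.g., one person's 13 weekly SOMs
```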
|
19 |
Predicting National Council Licensure Examination for Registered Nurses Performance / Whitehead, Charles D., 08 July 2016 (has links)
<p> The Baccalaureate Nursing program in San Antonio, Texas experienced a decrease in National Council Licensure Examination for Registered Nurses (NCLEX-RN) on the first attempt for students graduating between 2009 and 2014 without a clear explanation for the decline. The purpose of this quantitative non-experimental correlational study was to analyze retrospective data from the school of nursing in San Antonio to determine the extent to which multiple variables (age, gender, race/ethnicity, cumulative pre-nursing GPA, cumulative GPA of nursing courses, remediation, and the Assessment Technologies Institute (ATI) examination predicted NCLEX-RN performance. The research question was: Is the ATI comprehensive examination a significant predictor of the NCLEX-RN performance of graduating nursing students in the San Antonio, Texas nursing program, either (a) on its own; or (b) in combination with other independent variables. The statistical problem was directed toward identifying the significant variables that predicted the NCLEX-RN performance of graduating nursing students between 2009 and 2014 using binary logistic regression analysis. The proportion of N = 334, nurses who passed the NCLEX-RN was n = 232, 69.5%. The answer to the research question, based on odds ratios (OR) was that NCLEX-RN performance could not be predicted solely by using the ATI predictor examinations. The ATI examination score was the strongest predictor of passing the NCLEX-RN (OR = 1.59) Ethnicity (OR = 1.38) and the combined pre-nursing and nursing GPA (OR = 1.28) were also found to be predictors of NCLEX-RN performance. The proportions of NCLEX-RN failures and need for remediation were highest among the African-American students. The gender and age of the students were not significant predictors of NCLEX-RN performance. The results of this research can be utilized by the San Antonio nursing program, as well as other nursing programs, to identify and address the factors identity of those graduating nursing students who are at risk of failing the NCLEX-RN. The researcher has shown that the predictor variables of the ATI predictor examination, cumulative college and nursing GPA’s, and ethnicity have a statistically significant correlation and therefore have impact on first time NCLEX-RN test takers passing the exam. It is recommended that Bachelor Degree Programs in Nursing focus on specific strategies within their institutions that would have a direct impact on these variables.</p>
|
20 |
Comparison of Cox regression and discrete time survival models / Ye, Hong, 03 September 2016 (has links)
<p> A standard analysis of prostate cancer biochemical failure data is done by conducting two approaches in which risk factors or covariates are measured. Cox regression and discrete-time survival models were compared under different attributes: sample size, time periods, and parameters in the model. The person-period data was reconstructed when examining the same data in discrete-time survival model. Twenty-four numerical examples covering a variety of sample sizes, time periods, and number of parameters displayed the closeness of Cox regression and discrete-time survival methods in situations typical of the cancer study. </p>
|