  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Multidimensional item response theory observed score equating methods for mixed-format tests

Peterson, Jaime Leigh 01 July 2014 (has links)
The purpose of this study was to build upon the existing MIRT equating literature by introducing a full multidimensional item response theory (MIRT) observed score equating method for mixed-format exams, because no such methods currently exist. At this time, the MIRT equating literature is limited to full MIRT observed score equating methods for multiple-choice-only exams and Bifactor observed score equating methods for mixed-format exams. Given the high frequency with which mixed-format exams are used and the accumulating evidence that some tests are not purely unidimensional, it was important to present a full MIRT equating method for mixed-format tests. The performance of the full MIRT observed score method was compared with that of the traditional equipercentile method, the unidimensional IRT (UIRT) observed score method, and the Bifactor observed score method. With the Bifactor methods, group-specific factors were defined according to item format or content subdomain. With the full MIRT methods, two- and four-dimensional models were included and correlations between latent abilities were freely estimated or set to zero. All equating procedures were carried out using three end-of-course exams: Chemistry, Spanish Language, and English Language and Composition. For all subjects, two separate datasets were created using pseudo-groups in order to have two separate equating criteria. The specific equating criteria that served as baselines for comparisons with all other methods were the theoretical Identity and the traditional equipercentile procedures. Several important conclusions were made. In general, the multidimensional methods were found to perform better for datasets that evidenced more multidimensionality, whereas unidimensional methods worked better for unidimensional datasets. In addition, the scale on which scores are reported influenced the comparative conclusions made among the studied methods.
For performance classifications, which are most important to examinees, there typically were not large discrepancies among the UIRT, Bifactor, and full MIRT methods. However, this study was limited by its sole reliance on real data, which was not very multidimensional and for which the true equating relationship was not known. Therefore, plans for improvements, including the addition of a simulation study to introduce a variety of dimensional data structures, are also discussed.
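All of the observed score methods compared in this entry ultimately report an equipercentile function between number-correct score scales. A minimal sketch of that baseline equipercentile step is shown below; the function names and the two 5-point frequency distributions are illustrative, not data from the study:

```python
import numpy as np

def percentile_ranks(freqs):
    """Percentile rank at each integer score: cumulative relative frequency
    up to the score, counting half of the score's own relative frequency."""
    rel = np.asarray(freqs, dtype=float)
    rel = rel / rel.sum()
    cum = np.cumsum(rel)
    return 100 * (cum - rel / 2)

def equipercentile(freqs_x, freqs_y):
    """Equate form X scores to the form Y scale by matching percentile ranks."""
    pr_x = percentile_ranks(freqs_x)
    pr_y = percentile_ranks(freqs_y)
    scores_y = np.arange(len(freqs_y))
    # Linearly interpolate Y's inverse percentile-rank function at X's ranks
    return np.interp(pr_x, pr_y, scores_y)

# Hypothetical score frequency distributions on two 5-point forms
fx = [2, 10, 30, 40, 18]
fy = [5, 15, 35, 30, 15]
print(equipercentile(fx, fy))
```

The equated scores are nondecreasing and stay within the form Y score range; operational procedures add refinements (smoothing, handling of the scale endpoints) that are omitted here.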
22

Observed score and true score equating procedures for multidimensional item response theory

Brossman, Bradley Grant 01 May 2010 (has links)
The purpose of this research was to develop observed score and true score equating procedures to be used in conjunction with the Multidimensional Item Response Theory (MIRT) framework. Currently, MIRT scale linking procedures exist to place item parameter estimates and ability estimates on the same scale after separate calibrations are conducted. These procedures account for indeterminacies in (1) translation, (2) dilation, (3) rotation, and (4) correlation. However, no procedures currently exist to equate number correct scores after parameter estimates are placed on the same scale. This research sought to fill this void in the current psychometric literature. Three equating procedures--two observed score procedures and one true score procedure--were created and described in detail. One observed score procedure was presented as a direct extension of unidimensional IRT observed score equating, and is referred to as the "Full MIRT Observed Score Equating Procedure." The true score procedure and the second observed score procedure incorporated the statistical definition of the "direction of best measurement" in an attempt to equate exams using unidimensional IRT (UIRT) equating principles. These procedures are referred to as the "Unidimensional Approximation of MIRT True Score Equating Procedure" and the "Unidimensional Approximation of MIRT Observed Score Equating Procedure," respectively. Three exams within the Iowa Test of Educational Development (ITED) Form A and Form B batteries were used to conduct UIRT observed score and true score equating, MIRT observed score and true score equating, and equipercentile equating. The equipercentile equating procedure was conducted for the purpose of comparison since this procedure does not explicitly violate the IRT assumption of unidimensionality. 
Results indicated that the MIRT equating procedures performed more similarly to the equipercentile equating procedure than the UIRT equating procedures, presumably due to the violation of the unidimensionality assumption under the UIRT equating procedures. Future studies are expected to address how the MIRT procedures perform under varying levels of multidimensionality (weak, moderate, strong), varying frameworks of dimensionality (simple structure vs. complex structure), and number of dimensions, among other conditions.
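The unidimensional true score equating that this work generalizes can be sketched in a few lines: invert form X's test characteristic curve (TCC) at a given true score to recover theta, then evaluate form Y's TCC at that theta. The 3PL item parameters below are hypothetical and the routine is a simplified illustration under the usual caveat that true scores below the sum of the guessing parameters have no theta preimage:

```python
import numpy as np
from scipy.optimize import brentq

def tcc(theta, a, b, c):
    """Test characteristic curve: expected number-correct score, 3PL model
    (logistic with the D = 1.7 scaling constant)."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    return p.sum()

def true_score_equate(score_x, ax, bx, cx, ay, by, cy):
    """Find theta where form X's TCC equals score_x, then return form Y's
    TCC at that theta. Defined only for score_x above the sum of X's
    guessing parameters."""
    theta = brentq(lambda t: tcc(t, ax, bx, cx) - score_x, -8.0, 8.0)
    return tcc(theta, ay, by, cy)

# Hypothetical 3PL parameters for two 3-item forms on a common theta scale
ax, bx, cx = np.array([1.0, 1.2, 0.8]), np.array([-0.5, 0.0, 0.5]), np.array([0.2, 0.2, 0.2])
ay, by, cy = np.array([0.9, 1.1, 1.0]), np.array([-0.3, 0.2, 0.6]), np.array([0.2, 0.2, 0.2])
equated = true_score_equate(2.0, ax, bx, cx, ay, by, cy)
print(equated)
```

The MIRT procedures described in the abstract replace the scalar theta with a direction of best measurement in a multidimensional ability space; the inversion idea is the same.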
23

A comparison of statistics for selecting smoothing parameters for loglinear presmoothing and cubic spline postsmoothing under a random groups design

Liu, Chunyan 01 May 2011 (has links)
Smoothing techniques are designed to improve the accuracy of equating functions. The main purpose of this dissertation was to propose a new statistic (CS) and compare it to existing model selection strategies in selecting smoothing parameters for polynomial loglinear presmoothing (C) and cubic spline postsmoothing (S) for mixed-format tests under a random groups design. For polynomial loglinear presmoothing, CS was compared to seven existing model selection strategies in selecting the C parameters: likelihood ratio chi-square test (G2), Pearson chi-square test (PC), likelihood ratio chi-square difference test (G2diff), Pearson chi-square difference test (PCdiff), Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Consistent Akaike Information Criterion (CAIC). For cubic spline postsmoothing, CS was compared to the ± 1 standard error of equating (± 1 SEE) rule. In this dissertation, both the pseudo-test data, Biology long and short, and Environmental Science long and short, and the simulated data were used to evaluate the performance of the CS statistic and the existing model selection strategies. For both types of data, sample sizes of 500, 1000, 2000, and 3000 were investigated. In addition, No Equating Needed conditions and Equating Needed conditions were investigated for the simulated data. For polynomial loglinear presmoothing, mean absolute difference (MAD), average squared bias (ASB), average squared error (ASE), and mean squared errors (MSE) were computed to evaluate the performance of all model selection strategies based on three sets of criteria: cumulative relative frequency distribution (CRFD), relative frequency distribution (RFD), and the equipercentile equating relationship. For cubic spline postsmoothing, the evaluation of different model selection procedures was only based on the MAD, ASB, ASE, and MSE of equipercentile equating. 
The main findings based on the pseudo-test data and simulated data were as follows: (1) As sample sizes increased, the average C values increased and the average S values decreased for all model selection strategies. (2) For polynomial loglinear presmoothing, compared to the results without smoothing, all model selection strategies always introduced bias of RFD and significantly reduced the standard errors and mean squared errors of RFD; only AIC reduced the MSE of CRFD and MSE of equipercentile equating across all sample sizes and all test forms; the best CS procedure tended to yield an equivalent or smaller MSE of equipercentile equating than the AIC and G2diff statistics. (3) For cubic spline postsmoothing, both the ± 1 SEE rule and the CS procedure tended to perform reasonably well in reducing the ASE and MSE of equipercentile equating. (4) Among all existing model selection strategies, the ± 1 SEE rule in postsmoothing tended to perform better than any of the seven existing model selection strategies in presmoothing in terms of the reduction of random error and total error. (5) Pseudo-test data and the simulated data tended to yield similar results. The limitations of the study and possible future research are discussed in the dissertation.
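As a rough illustration of how a presmoothing degree C might be chosen with one of the information criteria studied here (AIC), the sketch below fits polynomial log-linear models of increasing degree to hypothetical score frequencies by Poisson maximum likelihood and keeps the degree with the smallest AIC. This is a simplification: the operational moment-preserving fitting algorithm is more involved, and the data are invented.

```python
import numpy as np
from scipy.optimize import minimize

def loglinear_aic(freqs, degree):
    """Fit a polynomial log-linear model of the given degree to score
    frequencies by Poisson maximum likelihood; return its AIC (up to an
    additive constant shared by all degrees, which cancels in comparisons)."""
    f = np.asarray(freqs, dtype=float)
    # Centered, scaled score values for numerical stability of the powers
    x = (np.arange(len(f)) - (len(f) - 1) / 2) / len(f)
    X = np.column_stack([x ** p for p in range(degree + 1)])

    def negll(beta):
        m = np.exp(X @ beta)              # fitted frequencies
        return np.sum(m - f * np.log(m))  # Poisson -loglik, dropping log(f!)

    beta0 = np.zeros(degree + 1)
    beta0[0] = np.log(f.mean())
    res = minimize(negll, beta0, method="BFGS")  # convex problem, global optimum
    return 2.0 * res.fun + 2.0 * (degree + 1)

# Hypothetical observed score frequencies on a short test
freqs = [3, 8, 20, 35, 42, 38, 25, 12, 5, 2]
aics = {C: loglinear_aic(freqs, C) for C in range(1, 5)}
best_C = min(aics, key=aics.get)
print(best_C)
```

The other strategies in the study (G2, PC, difference tests, BIC, CAIC, CS) plug different penalties or test statistics into the same fitted models.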
24

A perfect score: Validity arguments for college admission tests

Lyrén, Per-Erik January 2009 (has links)
College admission tests are of great importance for admissions systems in general and for candidates in particular. The SweSAT (Högskoleprovet in Swedish) has been used for college admission in Sweden for more than 30 years, and today it is, alongside the upper-secondary school GPA, the most widely used instrument for the selection of college applicants. Because of the importance that is placed on the SweSAT, it is essential that the scores are reliable and that the interpretations and uses of the scores are valid. The main purpose of this thesis was therefore to examine some assumptions that are of importance for the validity of the interpretation and use of SweSAT scores. The argument-based approach to validation was used as the framework for the evaluation of these assumptions. The thesis consists of four papers and an extensive introduction with summaries of the papers. The first three papers examine assumptions that are relevant for the use of SweSAT scores for admission decisions, while the fourth paper examines an assumption that is relevant for the use of SweSAT scores for providing diagnostic information. The first paper is a review of predictive validity studies that have been performed on the SweSAT. The general conclusion from the review is that the predictive validity of SweSAT scores varies greatly among study programs, and that there are many problematic issues related to the methodology of the predictive validity studies. The second paper focuses on an assumption underlying the current SweSAT equating design, namely that the groups taking different forms of the test have equal abilities. The results show that this assumption is highly problematic, and consequently a more appropriate equating design should be applied when equating SweSAT scores. The third paper examines the effect of textual item revisions on item statistics and preequating outcomes, using data from the SweSAT data sufficiency subtest.
The results show that most kinds of revisions have a significant effect on both p-values and point-biserial correlations, and as a consequence the preequating outcomes are affected negatively. The fourth paper examines whether there is added value in reporting subtest scores rather than just the total score to the test-takers. Using a method derived from classical test theory, the results show that all observed subscores are better predictors of the true subscores than is the observed total score, with the exception of the Swedish reading comprehension subtest. That is, the subscores contain information that the test-takers can use for remedial studies and hence there is added value in reporting the subscores. The general conclusion from the thesis as a whole is that the interpretations and use of SweSAT scores are based on several questionable assumptions, but also that the interpretations and uses are supported by a great deal of validity evidence.
25

<Original Article> Marginal maximum likelihood estimation of equating coefficients under the common-examinee design

Noguchi, Hiroyuki G. 25 December 1990 (has links)
This entry uses content digitized by the National Institute of Informatics.
26

An investigation of bootstrap methods for estimating the standard error of equating under the common-item nonequivalent groups design

Wang, Chunxin 01 July 2011 (has links)
The purpose of this study was to investigate the performance of the parametric bootstrap method and to compare the parametric and nonparametric bootstrap methods for estimating the standard error of equating (SEE) under the common-item nonequivalent groups (CINEG) design with the frequency estimation (FE) equipercentile method under a variety of simulated conditions. When the performance of the parametric bootstrap method was investigated, bivariate polynomial log-linear models were employed to fit the data. Considering different polynomial degrees and two different numbers of cross-product moments, a total of eight parametric bootstrap models were examined. Two real datasets were used as the basis to define the population distributions and the "true" SEEs. A simulation study was conducted reflecting three levels of group proficiency differences, three levels of sample size, two test lengths, and two ratios of the number of common items to the total number of items. Bias of the SEE, standard errors of the SEE, root mean square errors of the SEE, and their corresponding weighted indices were calculated and used to evaluate and compare the simulation results. The main findings from this simulation study were as follows: (1) The parametric bootstrap models with larger polynomial degrees generally produced smaller bias but larger standard errors than those with lower polynomial degrees. (2) The parametric bootstrap models with a higher-order cross-product moment (CPM) of two generally yielded more accurate estimates of the SEE than the corresponding models with a CPM of one. (3) The nonparametric bootstrap method generally produced less accurate estimates of the SEE than the parametric bootstrap method. However, as the sample size increased, the differences between the two bootstrap methods became smaller.
When the sample size was equal to or larger than 3,000, the differences between the nonparametric bootstrap method and the parametric bootstrap model that produced the smallest RMSE were very small. (4) Of all the models considered in this study, parametric bootstrap models with a polynomial degree of four performed better under most simulation conditions. (5) Aside from method effects, sample size and test length had the most impact on estimating the SEE. Group proficiency differences and the ratio of the number of common items to the total number of items had little effect on a short test, but had a slight effect on a long test.
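The nonparametric bootstrap idea evaluated here, resampling examinees with replacement, re-equating, and taking standard deviations at each score point, can be sketched as follows. This is a simplified random-groups equipercentile version with invented binomial score data, not the CINEG/FE setup of the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def equip(x_scores, y_scores, n_items):
    """Equipercentile equating of form X scores to the form Y scale,
    computed directly from raw number-correct score samples (simplified)."""
    grid = np.arange(n_items + 1)
    pr_x = np.array([np.mean(x_scores <= s) - np.mean(x_scores == s) / 2 for s in grid])
    pr_y = np.array([np.mean(y_scores <= s) - np.mean(y_scores == s) / 2 for s in grid])
    return np.interp(pr_x, pr_y, grid)

def bootstrap_see(x_scores, y_scores, n_items, n_boot=200):
    """Nonparametric bootstrap SEE: resample examinees with replacement,
    re-equate, and take the SD of the equated score at each score point."""
    reps = []
    for _ in range(n_boot):
        xb = rng.choice(x_scores, size=len(x_scores), replace=True)
        yb = rng.choice(y_scores, size=len(y_scores), replace=True)
        reps.append(equip(xb, yb, n_items))
    return np.asarray(reps).std(axis=0, ddof=1)

# Hypothetical number-correct scores for two randomly equivalent groups
x = rng.binomial(20, 0.60, size=500)
y = rng.binomial(20, 0.55, size=500)
see = bootstrap_see(x, y, 20)
print(see.round(3))
```

The parametric variant studied in this dissertation differs only in the resampling step: it draws bootstrap samples from fitted log-linear score distributions rather than from the raw data.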
27

Item Parameter Drift as an Indication of Differential Opportunity to Learn: An Exploration of Item Flagging Methods & Accurate Classification of Examinees

Sukin, Tia M. 01 September 2010 (has links)
The presence of outlying anchor items is an issue faced by many testing agencies. The decision to retain or remove an item is a difficult one, especially when the content representation of the anchor set becomes questionable through item removal decisions. Additionally, the reason for the aberrancy is not always clear, and if the performance of the item has changed due to improvements in instruction, then removing the anchor item may not be appropriate and might produce misleading conclusions about the proficiency of the examinees. This study was conducted in two parts, consisting of both a simulation and an empirical data analysis. In these studies, the effect on examinee classification was investigated when the decision was made to remove or retain aberrant anchor items. Three methods of detection were explored: (1) the delta plot, (2) IRT b-parameter plots, and (3) the RPU method. In the simulation study, the degree of aberrancy was manipulated as well as the ability distribution of examinees, and five aberrant item schemes were employed. In the empirical data analysis, archived statewide science achievement data that were suspected to reflect differential opportunity to learn between administrations were re-analyzed using the various item parameter drift detection methods. The results of both the simulation and the empirical data study support eliminating flagged items from linking when a matrix-sampling design is used and the anchor contains a large number of items. While neither the delta plot nor the IRT b-parameter plot method produced results that would overwhelmingly support its use, it is recommended that both methods be employed in practice until further research is conducted on alternative methods such as the RPU method, since classification accuracy increases when such methods are employed and flagged items are removed, and growth is most often not misrepresented by doing so.
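The delta plot method listed as flagging method (1) can be sketched briefly: convert each anchor item's p-values from the two administrations to the ETS delta scale, fit the principal axis line to the paired deltas, and flag items whose perpendicular distance from the line is large. The p-values and the flagging threshold below are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def delta_plot_distances(p1, p2):
    """Perpendicular distance of each anchor item from the principal axis
    line fitted to its (delta_1, delta_2) pair (Angoff's delta plot)."""
    d1 = 13 - 4 * norm.ppf(np.asarray(p1))   # ETS delta scale: harder -> larger
    d2 = 13 - 4 * norm.ppf(np.asarray(p2))
    s11, s22 = np.var(d1), np.var(d2)
    s12 = np.cov(d1, d2, bias=True)[0, 1]
    # Principal-axis (major axis) slope and intercept
    b = (s22 - s11 + np.sqrt((s22 - s11) ** 2 + 4 * s12 ** 2)) / (2 * s12)
    a = d2.mean() - b * d1.mean()
    return np.abs(b * d1 - d2 + a) / np.sqrt(b ** 2 + 1)

# Hypothetical anchor-item p-values; the item at index 5 has drifted easier
p_old = [0.85, 0.75, 0.68, 0.60, 0.52, 0.45, 0.38, 0.30]
p_new = [0.85, 0.75, 0.68, 0.60, 0.52, 0.70, 0.38, 0.30]
dist = delta_plot_distances(p_old, p_new)
flags = dist > 1.5   # flag items far from the principal axis line
print(dist.round(2))
```

Note the abstract's caveat applies directly: a drifted item pulls the fitted line toward itself, so distance-based flags can be conservative when the anchor is small.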
28

The Robustness of Rasch True Score Preequating to Violations of Model Assumptions Under Equivalent and Nonequivalent Populations

Gianopulos, Garron 22 October 2008 (has links)
This study examined the feasibility of using Rasch true score preequating under violated model assumptions and nonequivalent populations. Dichotomous item responses were simulated using a compensatory two-dimensional (2D) three-parameter logistic (3PL) Item Response Theory (IRT) model. The Rasch model was used to calibrate difficulty parameters using two methods: Fixed Parameter Calibration (FPC) and separate calibration with Stocking and Lord linking (SCSL). A criterion equating function was defined by equating true scores calculated with the generated 2D 3PL IRT item and ability parameters, using random groups equipercentile equating. True score preequating to FPC- and SCSL-calibrated item banks was compared to identity and Levine's linear true score equating in terms of equating bias and bootstrap standard errors of equating (SEE) (Kolen & Brennan, 2004). Results showed that preequating was robust to simulated 2D 3PL data and to nonequivalent item discriminations; however, true score equating was not robust to guessing or to the interaction of guessing and nonequivalent item discriminations. Equating bias due to guessing was most marked at the low end of the score scale. Equating an easier new form to a more difficult base form produced negative bias. Nonequivalent item discriminations interacted with guessing to magnify the bias and to extend its range toward the middle of the score distribution. Very easy forms relative to the ability of the examinees also produced substantial error at the low end of the score scale. Accumulating item parameter error in the item bank increased the SEE across five forms. Rasch true score preequating produced less equating error than Levine's true score linear equating in all simulated conditions. FPC with Bigsteps performed as well as separate calibration with the Stocking and Lord linking method.
These results support earlier findings, suggesting that Rasch true score preequating can be used in the presence of guessing if accuracy is required near the mean of the score distribution, but not if accuracy is required with very low or high scores.
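Stocking and Lord linking, used above, minimizes a test characteristic curve criterion and needs an optimizer; a simpler moment-based alternative, mean-sigma linking, conveys the core idea of placing separately calibrated difficulties on a common scale. The difficulty estimates below are hypothetical:

```python
import numpy as np

def mean_sigma_link(b_new, b_base):
    """Mean-sigma linking: constants A, B such that A * b_new + B places the
    new form's common-item difficulties on the base form's scale. (For the
    Rasch model A is often fixed at 1; the general form is shown here.)"""
    A = np.std(b_base, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_base) - A * np.mean(b_new)
    return A, B

# Hypothetical common-item Rasch difficulty estimates from separate calibrations
b_new = np.array([-1.2, -0.4, 0.1, 0.7, 1.4])
b_base = np.array([-1.0, -0.2, 0.3, 0.9, 1.6])
A, B = mean_sigma_link(b_new, b_base)
b_linked = A * b_new + B   # now expressed on the base scale
print(round(A, 6), round(B, 6))
```

In these invented data the two sets of difficulties differ only by a constant shift, so the linking reduces to A = 1 and a shift B; Stocking and Lord linking would additionally weight the agreement of the two test characteristic curves over a theta grid.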
29

A comparison of smoothing methods for the common item nonequivalent groups design

Kim, Han Yi 01 July 2014 (has links)
The purpose of this study was to compare the relative performance of various smoothing methods under the common item nonequivalent groups (CINEG) design. In light of the previous literature on smoothing under the CINEG design, this study aimed to provide general guidelines and practical insights on the selection of smoothing procedures under specific testing conditions. To investigate the smoothing procedures, 100 replications were simulated under various testing conditions by using an item response theory (IRT) framework. A total of 192 conditions (3 sample sizes × 4 group ability differences × 2 common-item proportions × 2 form difficulty differences × 1 test length × 2 common-item types × 2 common-item difficulty spreads) were investigated. Two smoothing methods, log-linear presmoothing and cubic spline postsmoothing, were considered with four equating methods: frequency estimation (FE), modified frequency estimation (MFE), chained equipercentile equating (CE), and kernel equating (KE). Bias, standard error, and root mean square error were computed to evaluate the performance of the smoothing methods. Results showed that 1) there were always one or more smoothing methods that produced smaller total error than the unsmoothed methods; 2) polynomial log-linear presmoothing tended to perform better than cubic spline postsmoothing in terms of systematic and total errors when FE or MFE was used; 3) cubic spline postsmoothing showed a strong tendency to produce the least amount of random error regardless of the equating method used; 4) KE produced more accurate equating relationships under a majority of testing conditions when paired with CE; and 5) log-linear presmoothing produced smaller total error under a majority of testing conditions than did cubic spline postsmoothing. Tables are provided to show the best-performing method for all combinations of testing conditions considered.
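The three evaluation indices used throughout these simulation studies, bias (systematic error), standard error (random error), and root mean square error (total error) across replications, can be computed as sketched below with hypothetical simulated equating results:

```python
import numpy as np

def equating_error_summary(estimates, criterion):
    """Bias, standard error (SE), and root mean square error (RMSE) of
    equated scores at each score point, taken across replications."""
    est = np.asarray(estimates, dtype=float)
    crit = np.asarray(criterion, dtype=float)
    bias = est.mean(axis=0) - crit           # systematic error
    se = est.std(axis=0)                     # random error
    rmse = np.sqrt(bias ** 2 + se ** 2)      # total error: MSE = bias^2 + SE^2
    return bias, se, rmse

# Hypothetical: 100 replications of an equating with a known criterion,
# simulated with bias 0.1 and random error 0.5 at every score point
rng = np.random.default_rng(1)
criterion = np.linspace(0, 40, 41)
estimates = criterion + rng.normal(0.1, 0.5, size=(100, 41))
bias, se, rmse = equating_error_summary(estimates, criterion)
print(bias.mean().round(2), se.mean().round(2))
```

Comparing smoothing methods then amounts to comparing these three curves, often after summing or averaging them over score points weighted by the score distribution.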
30

Teorie odpovědí na položku a její aplikace v oblasti Národních srovnávacích zkoušek / Item Response Theory and its Application in the National Comparative Exams

Fiřtová, Lenka January 2012 (has links)
Item Response Theory, a psychometric paradigm for test development and evaluation, comprises a collection of models which enable the estimation of the probability of a correct answer to a particular item in the test as a function of the item parameters and the level of a respondent's underlying ability. This paper, written in cooperation with the company Scio, is focused on the application of Item Response Theory in the context of the National Comparative Exams. Its aim is to propose a test-equating procedure which would ensure a fair comparison of respondents' scores in the Test of General Academic Prerequisites regardless of the particular test administration.
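The models referred to here give the probability of a correct answer as a function of the item parameters and the respondent's ability; the three-parameter logistic (3PL) member of that family, shown for illustration (not necessarily the operational model used for the National Comparative Exams), takes only a few lines:

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL model: probability of a correct answer given ability theta,
    discrimination a, difficulty b, and pseudo-guessing c."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta == b the probability is halfway between c and 1
print(p_correct_3pl(0.0, 1.2, 0.0, 0.25))  # → 0.625
```

Test equating under this framework exploits the fact that, once item parameters for different administrations are placed on the same theta scale, scores from different test forms become comparable.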
