11

Equating multidimensional tests under a random groups design: a comparison of various equating procedures

Lee, Eunjung 01 December 2013
The purpose of this research was to compare the equating performance of various equating procedures for multidimensional tests. To examine these procedures, simulated data sets were generated within a multidimensional item response theory (MIRT) framework. The procedures examined included both unidimensional and multidimensional IRT-based equating procedures in addition to traditional equating procedures. Specifically, the performance of the following six equating procedures under the random groups design was compared: (1) unidimensional IRT observed score equating, (2) unidimensional IRT true score equating, (3) full MIRT observed score equating, (4) unidimensionalized MIRT observed score equating, (5) unidimensionalized MIRT true score equating, and (6) equipercentile equating. Four factors (test length, sample size, form difficulty differences, and correlation between dimensions) were expected to affect equating performance, and their impact was investigated by creating two conditions per factor: long vs. short test, large vs. small sample size, some vs. no form differences, and high vs. low correlation between dimensions. The simulation, conducted over 50 replications, revealed several patterns in the equating performance of the six procedures across the simulation conditions.
The following findings are notable: (1) the full MIRT procedure provided more accurate equating results (i.e., a smaller degree of error) than the other procedures, especially when the correlation between dimensions was low; (2) the equipercentile procedure was more likely than the IRT methods to yield larger random error and overall error across all conditions; (3) equating for multidimensional tests was more accurate when form differences were small, sample size was large, and test length was long; (4) even when multidimensional tests were used (i.e., the unidimensionality assumption was violated), the unidimensional IRT procedures were still found to yield quite accurate equating results; and (5) whether an equating procedure was an observed score or a true score procedure did not seem to make a difference in equating results. Building upon these findings, theoretical and practical implications are discussed, and future research directions are suggested to strengthen the generalizability of the current findings. Given that only a handful of studies have been conducted in the MIRT equating literature, such research is expected to identify the specific conditions under which these findings are likely to hold, thereby leading to practical guidelines for various operational testing situations.
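As a concrete illustration of procedure (6) above, here is a minimal numpy sketch of equipercentile equating under the random groups design. It uses midpoint percentile ranks and linear interpolation rather than the full Kolen-Brennan continuization, applies no smoothing, and the frequency tables are invented:

```python
import numpy as np

def midpoint_percentile_ranks(freqs):
    """Percentile rank at each integer score: percent scoring below,
    plus half the percent at the score (the classical convention)."""
    freqs = np.asarray(freqs, dtype=float)
    cum = np.cumsum(freqs)
    below = cum - freqs
    return 100.0 * (below + 0.5 * freqs) / cum[-1]

def equipercentile_equate(x_freqs, y_freqs):
    """For each form-X score, return the form-Y score with the same
    percentile rank, by linear interpolation of Y's rank function."""
    scores_y = np.arange(len(y_freqs))
    pr_x = midpoint_percentile_ranks(x_freqs)
    pr_y = midpoint_percentile_ranks(y_freqs)
    return np.interp(pr_x, pr_y, scores_y)

# Toy example: the form-Y score distribution sits one point above
# the form-X distribution, so interior X scores map up by one point.
x = np.array([5, 10, 20, 30, 20, 10, 5, 0])
y = np.array([0, 5, 10, 20, 30, 20, 10, 5])
eq = equipercentile_equate(x, y)
```

With identical frequency tables the function reduces to identity equating, which is a useful sanity check for any equating implementation.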
12

IRT linking methods for the bifactor model: a special case of the two-tier item factor analysis model

Kim, Kyung Yong 01 August 2017
For unidimensional item response theory (UIRT) models, three linking methods, namely the separate, concurrent, and fixed parameter calibration methods, have been developed and widely used in applications such as vertical scaling, differential item functioning, computerized adaptive testing (CAT), and equating. By contrast, even though a few studies have compared the separate and concurrent calibration methods for full multidimensional IRT (MIRT) models or applied the concurrent calibration method to vertical scaling using the bifactor model, no study has yet provided technical descriptions of the concurrent and fixed parameter calibration methods for any MIRT model. Thus, the purpose of this dissertation was to extend the concurrent and fixed parameter calibration methods for UIRT models to the two-tier item factor analysis model. In addition, the relative performance of the separate, concurrent, and fixed parameter calibration methods was compared in terms of the recovery of item parameters and the accuracy of IRT observed score equating, using both real and simulated datasets. The separate, concurrent, and fixed parameter calibration methods recovered the item parameters well, with the concurrent calibration method performing slightly better than the other two. Despite the comparable performance of the three linking methods in recovering item parameters, some discrepancy was observed between the IRT observed score equating results obtained with them. In general, the concurrent calibration method provided equating results with the smallest equating error, whereas the separate calibration method provided the largest equating error due to the largest standard error of equating. The performance of the fixed parameter calibration method depended on the proportion of common items.
When the proportion was , the fixed parameter calibration method provided more biased equating results than the concurrent calibration method because of the underestimated specific slope parameters. However, when the proportion of common items was 40%, the fixed parameter calibration method worked as well as the concurrent calibration method.
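The linking step that follows separate calibration can be sketched with the mean/sigma method. This is a hedged illustration for the unidimensional 2PL case with invented difficulty values, not the dissertation's two-tier procedure:

```python
import numpy as np

def mean_sigma_link(b_common_old, b_common_new):
    """Estimate linking coefficients A, B from the common items'
    difficulty estimates on the old and new scales (mean/sigma
    method), so that theta_old = A * theta_new + B."""
    b_old = np.asarray(b_common_old, float)
    b_new = np.asarray(b_common_new, float)
    A = b_old.std(ddof=0) / b_new.std(ddof=0)
    B = b_old.mean() - A * b_new.mean()
    return A, B

def transform_new_form(a_new, b_new, A, B):
    """Place separately calibrated new-form 2PL parameters on the
    old scale: a -> a / A, b -> A * b + B."""
    return np.asarray(a_new) / A, A * np.asarray(b_new) + B

# Toy check: new-scale difficulties that are an exact linear
# rescaling of the old-scale ones should be recovered exactly.
b_old = np.array([-1.0, -0.2, 0.4, 1.3])
b_new = (b_old - 0.5) / 1.25          # i.e. true A = 1.25, B = 0.5
A, B = mean_sigma_link(b_old, b_new)
```

Real programs typically prefer characteristic-curve methods (Haebara, Stocking-Lord) over mean/sigma because they use the discrimination parameters as well, but the transformation applied afterward is the same.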
13

Subscore equating with the random groups design

Lim, Euijin 01 May 2016
There is an increasing demand for subscore reporting in the testing industry. Many testing programs already include subscores in their score reports or are considering plans to report them. However, relatively few studies have been conducted on subscore equating. The purpose of this dissertation is to address the necessity of subscore equating and to evaluate the performance of various equating methods for subscores. Assuming the random groups design and number-correct scoring, this dissertation analyzed two sets of real data and simulated data with four study factors: test dimensionality, subtest length, form difference in difficulty, and sample size. The equating methods considered were linear equating, equipercentile equating, equipercentile equating with log-linear presmoothing, equipercentile equating with cubic-spline postsmoothing, IRT true score equating using a three-parameter logistic model (3PL) with separate calibration (3PsepT), IRT observed score equating using 3PL with separate calibration (3PsepO), IRT true score equating using 3PL with simultaneous calibration (3PsimT), IRT observed score equating using 3PL with simultaneous calibration (3PsimO), IRT true score equating using a bifactor model (BF) with simultaneous calibration (BFT), and IRT observed score equating using BF with simultaneous calibration (BFO). They were compared to identity equating and evaluated with respect to systematic, random, and total errors of equating.
The main findings of this dissertation were as follows: (1) reporting subscores without equating would provide misleading information in terms of score profiles; (2) reporting subscores without a pre-specified test specification would raise practical issues, such as constructing alternate subtest forms of comparable difficulty, conducting equating between forms of different lengths, and deciding on an appropriate score scale to report; (3) the best performing subscore equating method overall was 3PsepO, followed by equipercentile equating with presmoothing, and the worst performing method was BFT; (4) simultaneous calibration, which involves other subtest items in the calibration process, yielded larger bias but smaller random error than separate calibration, indicating that borrowing information from other subtests increased bias but decreased random error in subscore equating; (5) BFO performed best when the test was multidimensional, the form difference was small, the subtest length was short, or the sample size was small; (6) equating results for BFT and BFO were affected by the magnitude of the factor loadings and the variability of the estimated general and specific factors; and (7) smoothing generally improved equating results.
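The systematic, random, and total error criteria used to evaluate the methods above can be computed from replicated equating results as follows. A small numpy sketch with simulated replications; the bias and noise levels are invented, not taken from the dissertation:

```python
import numpy as np

def equating_error_summary(estimates, criterion):
    """Per-score error decomposition over replications.
    `estimates` is (n_replications, n_scores); `criterion` is the
    criterion equating relationship (n_scores,).  Returns bias
    (systematic error), standard error (random error), and RMSE
    (total error), with RMSE**2 = bias**2 + SE**2."""
    est = np.asarray(estimates, float)
    crit = np.asarray(criterion, float)
    bias = est.mean(axis=0) - crit
    se = est.std(axis=0, ddof=0)
    rmse = np.sqrt(np.mean((est - crit) ** 2, axis=0))
    return bias, se, rmse

rng = np.random.default_rng(0)
crit = np.linspace(0, 30, 31)
# Simulated replications: criterion plus a constant bias of 0.3
# and noise with SD 0.5 at every score point.
reps = crit + 0.3 + rng.normal(0.0, 0.5, size=(200, 31))
bias, se, rmse = equating_error_summary(reps, crit)
```

The identity RMSE² = bias² + SE² holds exactly with this decomposition, which makes it a convenient internal consistency check in simulation code.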
14

Simple structure MIRT equating for multidimensional tests

Kim, Stella Yun 01 May 2018
Equating is a statistical process used to achieve score comparability so that scores from different test forms can be used interchangeably. One of the most widely used equating procedures is unidimensional item response theory (UIRT) equating, which requires a set of assumptions about the data structure. In particular, UIRT rests on the unidimensionality assumption, which requires that a test measure only a single ability. However, this assumption is unlikely to be fulfilled for many real datasets, such as mixed-format tests or tests composed of several content subdomains: failure to satisfy the assumption threatens the accuracy of the estimated equating relationships. The main purpose of this dissertation was to contribute to the literature on multidimensional item response theory (MIRT) equating by developing a theoretical and conceptual framework for true-score equating using a simple-structure MIRT model (SS-MIRT). SS-MIRT has several advantages over more complex MIRT models, including improved estimation efficiency and straightforward interpretability. In this dissertation, the performance of the SS-MIRT true-score equating procedure (SMT) was examined and evaluated through four studies using different data types: (1) real data, (2) simulated data, (3) pseudo-form data, and (4) intact single-form data with identity equating. Besides SMT, four competitors were included in the analyses to assess the relative benefits of SMT over the other procedures: (a) equipercentile equating with presmoothing, (b) UIRT true-score equating, (c) UIRT observed-score equating, and (d) SS-MIRT observed-score equating. In general, the proposed SMT procedure behaved similarly to the existing procedures. Also, SMT showed more accurate equating results than traditional UIRT equating.
Better performance of SMT over UIRT true-score equating was consistently observed across the three studies that employed different criterion relationships with different datasets, which strongly supports the benefit of a multidimensional approach to equating with multidimensional data.
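For reference, the unidimensional true-score equating that SMT generalizes can be sketched in a few lines: map a form-X true score to theta by inverting the test characteristic curve, then evaluate the form-Y curve at that theta. A 2PL sketch with invented item parameters, not the SS-MIRT procedure itself:

```python
import numpy as np

def tcc(theta, a, b):
    """2PL test characteristic curve: expected number-correct score."""
    return float(np.sum(1.0 / (1.0 + np.exp(-a * (theta - b)))))

def theta_for_true_score(tau, a, b, lo=-10.0, hi=10.0, iters=60):
    """Invert the TCC by bisection: find theta with tcc(theta) = tau.
    Valid because the TCC is strictly increasing when all a > 0."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tcc(mid, a, b) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def irt_true_score_equate(tau_x, a_x, b_x, a_y, b_y):
    """Map a form-X true score to theta, then to the form-Y true
    score at that theta.  Defined only for tau_x strictly between
    0 and the number of items (the 2PL has no guessing floor)."""
    theta = theta_for_true_score(tau_x, a_x, b_x)
    return tcc(theta, a_y, b_y)

# Invented item parameters; form Y uses the same items shifted
# 0.5 harder, so equated Y scores should fall below the X scores.
a_x = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b_x = np.array([-1.0, -0.3, 0.2, 0.8, 1.4])
a_y, b_y = a_x, b_x + 0.5
eq = irt_true_score_equate(2.5, a_x, b_x, a_y, b_y)
```

The SS-MIRT version replaces the scalar theta with a vector and the single curve inversion with a constrained search, but the mapping logic is the same.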
15

[Original article] Estimation of equating coefficients in the IRT normal ogive model

Noguchi, Hiroyuki (野口, 裕之) 25 December 1989
This content was digitized by the National Institute of Informatics (NII).
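A characteristic-curve approach to estimating the equating coefficients A and B under the normal ogive model can be sketched as follows. This is a Stocking-Lord-style loss minimized by a crude coarse-to-fine grid search, with invented item parameters; it illustrates the general idea only and is not the estimation method of the article above:

```python
import numpy as np
from math import erf

_phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0))))

def tcc_normal_ogive(theta, a, b):
    """Test characteristic curve under the normal ogive model:
    sum over items of Phi(a * (theta - b))."""
    t = np.asarray(theta)[:, None]
    return _phi(np.asarray(a) * (t - np.asarray(b))).sum(axis=1)

def sl_loss(A, B, a_new, b_new, a_old, b_old, thetas):
    """Stocking-Lord-type loss: squared TCC difference after moving
    new-scale parameters to the old scale (a -> a/A, b -> A*b + B)."""
    diff = tcc_normal_ogive(thetas, a_old, b_old) - \
           tcc_normal_ogive(thetas, a_new / A, A * b_new + B)
    return float(np.sum(diff ** 2))

def estimate_coefficients(a_new, b_new, a_old, b_old, thetas):
    """Coarse-to-fine grid search for (A, B); a crude stand-in for
    the derivative-based optimizers used in operational work."""
    A0, B0, spanA, spanB = 1.0, 0.0, 0.9, 2.0
    for _ in range(3):
        As = np.linspace(A0 - spanA, A0 + spanA, 41)
        Bs = np.linspace(B0 - spanB, B0 + spanB, 41)
        _, A0, B0 = min((sl_loss(A, B, a_new, b_new, a_old, b_old, thetas), A, B)
                        for A in As for B in Bs)
        spanA, spanB = spanA / 10.0, spanB / 10.0
    return A0, B0

# Invented common items: the new-scale parameters are an exact
# rescaling of the old-scale ones with A = 1.2, B = -0.4, so the
# search should recover those coefficients.
a_old = np.array([0.8, 1.0, 1.3, 1.1])
b_old = np.array([-1.2, -0.3, 0.5, 1.1])
a_new, b_new = a_old * 1.2, (b_old + 0.4) / 1.2
thetas = np.linspace(-3, 3, 13)
A, B = estimate_coefficients(a_new, b_new, a_old, b_old, thetas)
```

Because the loss is exactly zero at the true coefficients in this noiseless toy, the grid search converges to them; with estimated parameters the minimum is nonzero and the choice of theta grid matters.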
16

The impact of equating method and format representation of common items on the adequacy of mixed-format test equating using nonequivalent groups

Hagge, Sarah Lynn 01 July 2010
Mixed-format tests containing both multiple-choice and constructed-response items are widely used in educational testing. Such tests combine the broad content coverage and efficient scoring of multiple-choice items with the assessment of higher-order thinking skills thought to be provided by constructed-response items. However, the combination of both item formats on a single test complicates the use of psychometric procedures. The purpose of this dissertation was to examine how characteristics of mixed-format tests and the composition of the common-item set impact the accuracy of equating results in the common-item nonequivalent groups design. Examinee item responses were considered for two classes of data: (1) operational test forms and (2) pseudo-test forms assembled from portions of operational test forms. Analyses were conducted on three mixed-format tests from the Advanced Placement Examination program: English Language, Spanish Language, and Chemistry. For the operational test form analyses, two factors were investigated: (1) the difference in proficiency between old and new form groups of examinees and (2) the relative difficulty of multiple-choice and constructed-response items. For the pseudo-test form analyses, two additional factors were investigated: (1) the format representativeness of the common-item set and (2) the statistical representativeness of the common-item set. For each study condition, two traditional equating methods, frequency estimation and chained equipercentile equating, and two item response theory (IRT) equating methods, IRT true score and IRT observed score equating, were considered. There were five main findings from the operational and pseudo-test form analyses. (1) As the difference in proficiency between old and new form groups increased, bias also tended to increase.
(2) Relative to the criterion equating relationship for a given equating method, increases in bias were typically largest for frequency estimation and smallest for the IRT equating methods. However, it is important to note that the criterion equating relationship was different for each equating method. Additionally, only one smoothing value was analyzed for the traditional equating methods. (3) Standard errors of equating tended to be smallest for IRT observed score equating and largest for chained equipercentile equating. (4) Results for the operational and pseudo-test analyses were similar when the pseudo-tests were constructed to be similar to the operational test forms. (5) Results were mixed regarding which common-item set composition resulted in the least bias.
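The standard errors of equating compared in finding (3) are commonly estimated by the bootstrap. A minimal sketch for random-groups linear equating with invented score data; the operational analyses above used different methods and a nonequivalent groups design:

```python
import numpy as np

def linear_equate(x_scores, y_scores):
    """Random-groups linear equating: match the form-X mean and SD
    to form Y, e(x) = mu_y + (sd_y / sd_x) * (x - mu_x)."""
    mx, sx = np.mean(x_scores), np.std(x_scores)
    my, sy = np.mean(y_scores), np.std(y_scores)
    return lambda x: my + (sy / sx) * (np.asarray(x) - mx)

def bootstrap_se(x_scores, y_scores, grid, n_boot=500, seed=1):
    """Bootstrap standard error of the equated score at each grid
    point: resample each group with replacement, re-equate, and
    take the SD over bootstrap replications."""
    rng = np.random.default_rng(seed)
    boot = np.empty((n_boot, len(grid)))
    for i in range(n_boot):
        xb = rng.choice(x_scores, size=len(x_scores), replace=True)
        yb = rng.choice(y_scores, size=len(y_scores), replace=True)
        boot[i] = linear_equate(xb, yb)(grid)
    return boot.std(axis=0)

rng = np.random.default_rng(0)
x = rng.binomial(40, 0.55, size=1000)   # invented form-X scores
y = rng.binomial(40, 0.60, size=1000)   # invented form-Y scores
grid = np.arange(0, 41)
se = bootstrap_se(x, y, grid)
```

For linear equating the bootstrap SE is smallest near the form-X mean and grows toward the extremes, which is one reason comparisons of equating error are often reported across the whole score scale rather than at a single point.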
17

A comparison of calibration methods and proficiency estimators for creating IRT vertical scales

Kim, Jungnam 01 January 2007
The main purpose of this study was to construct different vertical scales based on various combinations of calibration methods and proficiency estimators, and to investigate the impact these choices may have on the following properties of the resulting vertical scales: grade-to-grade growth, grade-to-grade variability, and the separation of grade distributions. The calibration methods investigated were concurrent calibration, separate calibration, and fixing the a-, b-, and c-parameters of common items with simple prior updates (FSPU). The proficiency estimators investigated were maximum likelihood estimation (MLE) with pattern scores, expected a posteriori (EAP) estimation with pattern scores, pseudo-MLE with summed scores, pseudo-EAP with summed scores, and the quadrature distribution (QD). The study used datasets from the Iowa Tests of Basic Skills (ITBS) Vocabulary, Reading Comprehension (RC), Math Problem Solving and Data Interpretation (MPD), and Science tests for grades 3 through 8. For each of the research questions, the following conclusions were drawn. With respect to the comparison of the three calibration methods, for the RC and Science tests, concurrent calibration, compared to FSPU and separate calibration, showed less growth and more slowly decreasing growth in the lower grades, less decrease in variability over grades, and less separation in the lower grades in terms of horizontal distances. For the Vocabulary and MPD tests, differences in both grade-to-grade growth and the separation of grade distributions were trivial. With respect to the comparison of the five proficiency estimators, for all content areas, the trend pseudo-MLE ≥ MLE > QD > EAP ≥ pseudo-EAP was found in within-grade SDs, and the trend pseudo-EAP ≥ EAP > QD > MLE ≥ pseudo-MLE was found in the effect sizes. However, the degree of decrease in variability over grades was similar across proficiency estimators.
With respect to the comparisons of the four content areas, for the Vocabulary and MPD tests compared to the RC and Science tests, growth was less, but somewhat steady, and the decrease in variability over grades was less. For separation of grade distributions, it was found that the large growth suggested by larger mean differences for the RC and Science tests was reduced through the use of effect sizes to standardize the differences.
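The effect sizes used to standardize grade-to-grade mean differences can be sketched as the mean difference between adjacent grades divided by a pooled within-grade SD. The pooling convention and the simulated scores below are assumptions for illustration, not taken from the ITBS study:

```python
import numpy as np

def grade_effect_size(lower, upper):
    """Effect size for grade-to-grade growth: mean difference
    between adjacent grades divided by the pooled within-grade SD,
    one common convention for judging separation of grade
    distributions on a vertical scale."""
    lower = np.asarray(lower, float)
    upper = np.asarray(upper, float)
    pooled_sd = np.sqrt((lower.var(ddof=1) + upper.var(ddof=1)) / 2.0)
    return (upper.mean() - lower.mean()) / pooled_sd

rng = np.random.default_rng(0)
grade3 = rng.normal(200.0, 20.0, 2000)   # invented scale scores
grade4 = rng.normal(210.0, 20.0, 2000)
es = grade_effect_size(grade3, grade4)   # true value is 0.5 here
```

Dividing by the pooled SD is what shrinks the apparently large raw-score growth mentioned above: a 10-point mean difference looks small once expressed in units of a 20-point within-grade spread.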
18

Contributions to Kernel Equating

Andersson, Björn January 2014
The statistical practice of equating is needed when scores on different versions of the same standardized test are to be compared. This thesis constitutes four contributions to the observed-score equating framework kernel equating. Paper I introduces the open source R package kequate which enables the equating of observed scores using the kernel method of test equating in all common equating designs. The package is designed for ease of use and integrates well with other packages. The equating methods non-equivalent groups with covariates and item response theory observed-score kernel equating are currently not available in any other software package. In paper II an alternative bandwidth selection method for the kernel method of test equating is proposed. The new method is designed for usage with non-smooth data such as when using the observed data directly, without pre-smoothing. In previously used bandwidth selection methods, the variability from the bandwidth selection was disregarded when calculating the asymptotic standard errors. Here, the bandwidth selection is accounted for and updated asymptotic standard error derivations are provided. Item response theory observed-score kernel equating for the non-equivalent groups with anchor test design is introduced in paper III. Multivariate observed-score kernel equating functions are defined and their asymptotic covariance matrices are derived. An empirical example in the form of a standardized achievement test is used and the item response theory methods are compared to previously used log-linear methods. In paper IV, Wald tests for equating differences in item response theory observed-score kernel equating are conducted using the results from paper III. Simulations are performed to evaluate the empirical significance level and power under different settings, showing that the Wald test is more powerful than the Hommel multiple hypothesis testing method. 
Data from a psychometric licensure test and a standardized achievement test are used to exemplify the hypothesis testing procedure. The results show that the Wald test can lead to conclusions different from those reached with the Hommel procedure.
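The continuization step at the heart of kernel equating can be sketched as follows: a discrete score distribution is smoothed with a Gaussian kernel of bandwidth h, with a shrinkage factor chosen so that the continuized distribution keeps the discrete mean and variance. The binomial example distribution is invented for illustration:

```python
import numpy as np
from math import erf, comb

def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuized CDF of a discrete score
    distribution, as in kernel equating.  The shrinkage factor a_h
    = sqrt(var / (var + h^2)) makes the continuized mean and
    variance equal to the discrete ones."""
    scores = np.asarray(scores, float)
    probs = np.asarray(probs, float)
    mu = np.sum(probs * scores)
    var = np.sum(probs * (scores - mu) ** 2)
    a = np.sqrt(var / (var + h ** 2))
    z = (x - a * scores - (1.0 - a) * mu) / (a * h)
    phi = np.array([0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in z])
    return float(np.sum(probs * phi))

# Example: a symmetric binomial(10, 0.5) score distribution, so the
# continuized CDF should equal exactly 0.5 at the mean score of 5.
scores = np.arange(11)
probs = np.array([comb(10, k) for k in range(11)], float) / 2 ** 10
F = lambda v: kernel_cdf(v, scores, probs, h=0.6)
```

Once both forms' distributions are continuized this way, the equating function is the usual composition of one CDF with the other's inverse; the bandwidth selection methods discussed in paper II choose h.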
19

Impact of matched samples equating methods on equating accuracy and the adequacy of equating assumptions

Powers, Sonya Jean 01 December 2010
This dissertation investigates the interaction of population invariance, equating assumptions, and equating accuracy with group differences. In addition, matched samples equating methods are considered as a possible way to improve equating accuracy in the presence of large group differences. Data from one administration of four mixed-format Advanced Placement (AP) Exams were used to create pseudo old and new forms sharing common items. Population invariance analyses were conducted based on levels of examinee parental education using a single group equating design. Old and new form groups with common item effect sizes (ESs) ranging from 0 to 0.75 were created by sampling examinees based on their level of parental education. Equating was conducted for four common-item nonequivalent groups design equating methods: frequency estimation, chained equipercentile, IRT true score, and IRT observed score. Additionally, groups with ESs greater than zero were matched using three different matching techniques, including exact matching on parental education level and propensity score matching with several other background variables. The accuracy of equating results was evaluated by comparing each equating relationship with an ES greater than zero to the equating relationship where the ES equaled zero. Differences between comparison and criterion equating relationships were quantified using the root expected mean squared difference (REMSD) statistic, classification consistency, and standard errors of equating (SEs). The accuracy of equating results and the adequacy of equating assumptions were compared for unmatched and matched samples. As ES increased, equating results tended to become less accurate and less consistent across equating methods. However, there was relatively little population dependence of equating results, despite large subgroup performance differences.
Large differences between criterion and comparison equating relationships appeared to be caused instead by violations of equating assumptions. As group differences increased, the degree to which frequency estimation and chained equipercentile assumptions held decreased. In addition, all four AP Exams showed some evidence of multidimensionality. Because old and new form groups were selected to differ in terms of their respective levels of parental education, the matching methods that included parental education appeared to improve equating accuracy and the degree to which equating assumptions held, at least for very large ESs.
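One common formulation of the REMSD statistic weights squared subgroup-versus-overall equating differences by subgroup proportions and by the score distribution, then takes the square root; operational versions often also standardize by the score SD. A numpy sketch with invented equating functions, not the dissertation's data:

```python
import numpy as np

def remsd(subgroup_fns, subgroup_weights, score_weights, overall_fn):
    """Root expected mean squared difference between subgroup and
    overall equating functions, a population-invariance summary.
    subgroup_fns: (n_groups, n_scores) equated scores per subgroup;
    subgroup_weights: (n_groups,) proportions summing to 1;
    score_weights: (n_scores,) score distribution summing to 1."""
    sub = np.asarray(subgroup_fns, float)
    pg = np.asarray(subgroup_weights, float)
    wx = np.asarray(score_weights, float)
    diff2 = (sub - np.asarray(overall_fn, float)) ** 2
    return float(np.sqrt(np.sum(wx * np.sum(pg[:, None] * diff2, axis=0))))

overall = np.linspace(0, 20, 21)
# Invented subgroup functions: one half a point above the overall
# function and one half a point below, equally weighted, so the
# REMSD is exactly 0.5 under a uniform score distribution.
subs = np.stack([overall + 0.5, overall - 0.5])
wx = np.full(21, 1 / 21)
val = remsd(subs, np.array([0.5, 0.5]), wx, overall)
```

A REMSD of zero means the subgroup equating functions coincide with the overall function, i.e., the equating is population invariant at every score point.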
20

Evaluating equating properties for mixed-format tests

He, Yi 01 May 2011
Mixed-format tests containing both multiple-choice (MC) items and constructed-response (CR) items are used in many testing programs. The use of multiple formats presents a number of measurement challenges, one of which is how to adequately equate mixed-format tests under the common-item nonequivalent groups (CINEG) design, especially when, due to practical constraints, the common-item set contains only MC items. The purpose of this dissertation was to evaluate how equating properties were preserved for mixed-format tests under the CINEG design. Real data analyses were conducted on 22 equating linkages of 39 mixed-format tests from the Advanced Placement (AP) Examination program. Four equating methods were used: the frequency estimation (FE) method, the chained equipercentile (CE) method, item response theory (IRT) true score equating, and IRT observed score equating. In addition, cubic spline postsmoothing was used with the FE and CE methods. The factors of investigation were the correlation between MC and CR scores, the proportion of common items, the proportion of MC-item score points, and the similarity between alternate forms. Results were evaluated using three equating properties: first-order equity, second-order equity, and the same distributions property. The main findings from this dissertation were as follows: (1) Between the two IRT equating methods, true score equating better preserved first-order equity than observed score equating, and observed score equating better preserved second-order equity and the same distributions property than true score equating. (2) Between the two traditional methods, CE better preserved first-order equity than FE, but in terms of preserving second-order equity and the same distributions property, CE and FE produced similar results. (3) Smoothing helped to improve the preservation of second-order equity and the same distributions property. 
(4) A higher MC-CR correlation was associated with better preservation of first-order equity for both IRT methods. (5) A higher MC-CR correlation was associated with better preservation of second-order equity for IRT true score equating. (6) A higher MC-CR correlation was associated with better preservation of the same distributions property for IRT observed score equating. (7) The proportion of common items, the proportion of MC score points, and the similarity between forms were not found to be associated with the preservation of the equating properties. These results are interpreted in the context of research literature in this area and suggestions for future research are provided.
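First-order equity, the first of the properties evaluated above, requires that the expected equated score given theta match the form-Y true score. Checking it needs the conditional observed-score distribution, which the Lord-Wingersky recursion provides. A 2PL sketch with invented parameters; with identical forms and identity equating the gap is exactly zero:

```python
import numpy as np

def lord_wingersky(p):
    """Lord-Wingersky recursion: distribution of the number-correct
    score for independent items with success probabilities p
    (the compound binomial distribution)."""
    dist = np.array([1.0])
    for pj in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1.0 - pj)   # item answered incorrectly
        new[1:] += dist * pj            # item answered correctly
        dist = new
    return dist

def first_order_equity_gap(theta, a_x, b_x, a_y, b_y, equate_fn):
    """First-order equity gap at theta under the 2PL: the expected
    equated form-X score minus the form-Y true score (zero when
    first-order equity holds exactly at this theta)."""
    p_x = 1.0 / (1.0 + np.exp(-a_x * (theta - b_x)))
    p_y = 1.0 / (1.0 + np.exp(-a_y * (theta - b_y)))
    dist = lord_wingersky(p_x)
    scores = np.arange(len(dist))
    return float(np.sum(dist * equate_fn(scores)) - p_y.sum())

# Invented 2PL parameters; with identical forms and identity
# equating, the gap reduces to E[X | theta] - true score = 0.
a = np.array([1.0, 1.2, 0.8])
b = np.array([-0.5, 0.0, 0.7])
gap = first_order_equity_gap(0.3, a, b, a, b, lambda s: s)
```

Summaries of first-order equity like those used above typically average |gap| over a theta distribution; second-order equity repeats the comparison for the conditional SDs rather than the means.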
