1.
The Impact of Misspecifying A Higher Level Nesting Structure in Item Response Theory Models: A Monte Carlo Study. Zhou, Qiong, 16 December 2013.
The advantages of the Multilevel Item Response Theory (MLIRT) model have been studied by several researchers, and the impact of ignoring a higher level of the data structure in multilevel analysis has also been studied and discussed. However, because of the technical complexity of such models and the limited capacity of traditional IRT packages (e.g., BILOG and PARSCALE) to handle multilevel data, researchers may not be able to analyze multilevel IRT data accurately. The impact of this type of misspecification, especially for MLIRT models, has not yet been thoroughly examined. This dissertation consists of two studies: a Monte Carlo study that investigates the impact of this type of misspecification, and a study with real-world data to validate the results obtained from the simulation study.
In Study One (the simulation study), we investigate the potential impact of several factors, intra-class correlation (ICC), sample size, cluster size, and test length, on parameter estimates and the corresponding tests of significance under two situations: when the higher level nesting structure is appropriately modeled (i.e., the true model condition) versus inappropriately modeled (i.e., the misspecified model condition). Three-level, strictly hierarchical data (i.e., items nested within students, who are in turn nested within schools) were generated. Two covariates (person-related and school-related) were added at the second (person) level and the third (school) level, respectively. The simulation results showed that both parameter estimates and their standard errors are biased if the higher level nesting structure is ignored.
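For readers who want to see the data-generating mechanism concretely, the following is a minimal sketch, in Python/NumPy rather than the software used in the dissertation, of how three-level dichotomous responses with a school-level random effect and the two covariates could be simulated; the variance components, covariate weights, cluster sizes, and Rasch form of the item model are illustrative assumptions, not the study's actual conditions.

    import numpy as np

    rng = np.random.default_rng(2013)
    n_schools, n_students, n_items = 50, 20, 30          # illustrative cluster sizes and test length

    icc = 0.20                                           # assumed intraclass correlation at the school level
    school_var = icc / (1 - icc)                         # person-level residual variance fixed at 1.0
    w = rng.normal(size=n_schools)                       # school-level covariate
    x = rng.normal(size=(n_schools, n_students))         # person-level covariate

    u = rng.normal(0.0, np.sqrt(school_var), n_schools)  # school random effects
    theta = ((u + 0.5 * w)[:, None] + 0.3 * x
             + rng.normal(size=(n_schools, n_students)))  # person abilities; covariates add extra variance

    b = rng.uniform(-2, 2, n_items)                      # Rasch item difficulties
    p = 1 / (1 + np.exp(-(theta[..., None] - b)))        # correct-response probabilities
    y = rng.binomial(1, p)                               # schools x students x items array of 0/1 responses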
In Study Two, real data with a purely hierarchical structure from the Programme for International Student Assessment were analyzed by comparing parameter estimates when inappropriate versus appropriate IRT models were specified. The findings mirrored the results of the first study.
The implication of this dissertation for researchers is that it is important to model the multilevel data structure even in item response theory models. Researchers should interpret their results with caution when a higher level nesting structure is ignored in MLIRT models. Moreover, the findings may help researchers determine when an MLIRT model should be used to obtain unbiased results.
Some of the constraints of the simulation study could be relaxed in future work. For instance, although this study used only dichotomous items, the MLIRT model could also be applied to polytomous items. The test length could be longer, and more variability could be introduced into the item parameter values.
2.
Can new rules for non-audit services have consequences for auditor independence? An interview study of how Swedish auditors perceive that the EU audit package and the regulation of non-audit services may affect auditor independence. Lorentzon, Martin, January 2016.
Auditor independence is a debate that is continually revisited in new contexts. Over the past decades, debate has flared up over whether the provision of non-audit services (NAS) has a negative impact on auditor independence. The study aims to establish an impact analysis of the possible consequences that the EU's new audit package, with its stricter regulation of NAS, may have for Swedish auditors' independence. Above all, the results show a difference between the consequences for independence in fact and independence in appearance. The new regulatory framework appears to have little impact on the auditor's independence in fact, owing to the practices and routines regarding the independence question that already exist among auditors. The results show, however, that the rules may have greater consequences for independence in appearance, which may be strengthened because the new law provides increased clarity and awareness toward third parties.
3.
A comparison of equating/linking using the Stocking-Lord method and concurrent calibration with mixed-format tests in the non-equivalent groups common-item design under IRT. Tian, Feng, January 2011.
Thesis advisor: Larry Ludlow / There has been a steady increase in the use of mixed-format tests, that is, tests consisting of both multiple-choice items and constructed-response items, in both classroom and large-scale assessments. This calls for appropriate equating methods for such tests. As Item Response Theory (IRT) has rapidly become the mainstream theoretical basis for measurement, different equating methods under IRT have been developed. This study investigated the performance of two IRT equating methods using simulated data: linking following separate calibration (the Stocking-Lord method) and concurrent calibration. The findings show that concurrent calibration generally performs better in recovering item parameters and, more importantly, produces more accurate estimated scores than linking following separate calibration. Limitations and directions for future research are discussed. / Thesis (PhD) — Boston College, 2011. / Submitted to: Boston College. Lynch School of Education. / Discipline: Educational Research, Measurement, and Evaluation.
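As a rough illustration of the linking step being compared here, the sketch below shows the Stocking-Lord criterion for 2PL common items: it searches for the slope A and intercept B that make the transformed new-form test characteristic curve match the old-form curve. The function names and the unweighted quadrature grid are assumptions for illustration; operational implementations typically weight the grid by an ability distribution and also handle guessing parameters.

    import numpy as np
    from scipy.optimize import minimize

    def tcc(theta, a, b):
        # Test characteristic curve of a set of 2PL common items
        return (1 / (1 + np.exp(-a * (theta[:, None] - b)))).sum(axis=1)

    def stocking_lord(a_new, b_new, a_old, b_old, grid=np.linspace(-4, 4, 41)):
        # Minimize the squared distance between the old-form TCC and the
        # transformed new-form TCC over the linking constants (A, B)
        def loss(x):
            A, B = x
            return np.sum((tcc(grid, a_old, b_old) - tcc(grid, a_new / A, A * b_new + B)) ** 2)
        return minimize(loss, x0=[1.0, 0.0]).x   # returns [A, B]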
4.
Investigation of the impact of various factors on the validity of customized norms. Zhao, Xiaohui, 01 January 2008.
Customizing off-the-shelf tests to provide various kinds of information has become increasingly attractive because of the high demand for information in the current testing environment. Comparing examinees' achievement status on a national basis using such tests may provide a cost-effective solution to some practical problems. However, normative estimates based on customized tests may differ substantially from those based on intact tests, and the validity of customized norms may be seriously compromised. The primary purpose of this study was to investigate the impact of several factors on the validity of customized norms: the customizing strategy, the items used for estimation, test length, the correlation between the latent abilities assessed by items from the intact test and by new items, and the dimensional structure of the test.
Monte Carlo simulation techniques were used to examine the accuracy of the customized norms. Both unidimensional and multidimensional data sets were generated and calibrated using unidimensional item response theory models. The five factors cited above were manipulated in a partially crossed design, with a total of 44 combinations of conditions. The outcomes of interest included the estimated ability distributions, as well as the correlations, mean differences, and mean absolute differences between the ability and percentile estimates derived from intact tests and those derived from customized tests.
Based on the results of this study, it was concluded that: (1) customized instruments with all items from intact tests provided more accurate normative estimates than instruments having some items from intact tests removed; (2) using only items from intact tests to derive norms yielded more accurate estimates than using all items in customized tests; (3) lengthened customized tests yielded more accurate estimates than shortened tests; (4) the higher the correlation of latent abilities measured by items from intact tests and new items, the more accurate the normative estimates; (5) the impacts of the various factors were small when the unidimensionality assumption was satisfied; the differences increased when data structures became more complicated.
5.
An Empirical Examination of the Impact of Item Parameters on IRT Information Functions in Mixed Format Tests. Lam, Wai Yan Wendy, 01 February 2012.
IRT, also referred to as "modern test theory", offers many advantages over CTT-based methods in test development. Specifically, the IRT information function makes it possible to build a test with the desired precision of measurement at any point on a defined proficiency scale, provided a sufficient number of test items is available. This feature is extremely useful when the information is used for decision making, for instance, deciding whether an examinee has attained a certain mastery level. Computerized adaptive testing (CAT) is one of many examples of using IRT information functions in test construction. The purposes of this study were as follows: (1) to examine the consequences of improving test quality through the addition of more discriminating items with different item formats; (2) to examine the effect of a test whose difficulty does not align with the ability level of the intended population; (3) to investigate the change in decision consistency and decision accuracy; and (4) to understand changes in expected information when test quality is either improved or degraded, using both empirical and simulated data. The main findings were as follows: (1) increasing the discriminating power of any type of item generally increased the level of information; however, it could sometimes have an adverse effect at the extreme ends of the ability continuum; (2) it was important to have more items targeted at the population of interest; otherwise, no matter how good the items were, they were of little value in test development when they were not targeted to the distribution of candidate ability or to the cutscores; (3) decision consistency (DC), the Kappa statistic, and decision accuracy (DA) increased with better quality items; (4) DC and Kappa were negatively affected when the difficulty of the test did not match the ability of the intended population; however, the effect was less severe if the test was easier than needed; (5) tests with more better-quality items lowered the false positive (FP) and false negative (FN) rates at the cutscores; (6) when test difficulty did not match the ability of the target examinees, both FP and FN rates generally increased; (7) polytomous items tended to yield more information than dichotomously scored items, regardless of the discrimination and difficulty of the item; and (8) the more score categories an item had, the more information it could provide. Findings from this thesis should help testing agencies and practitioners better understand the effect of item parameters on item and test information functions. This understanding is crucial for improving item bank quality and, ultimately, for building better tests that provide more accurate proficiency classifications. At the same time, item writers should be mindful that the item information function is merely a statistical tool for building a good test; other criteria, such as content balancing and content validity, should also be considered.
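The mechanics behind several of these findings are visible in the information function itself. A minimal 2PL sketch (the item parameters below are made up for illustration; the 3PL and polytomous cases used in the study add a guessing term or category-level terms) shows how discrimination and targeting drive precision, since the conditional standard error is the inverse square root of the test information:

    import numpy as np

    def item_info_2pl(theta, a, b):
        # 2PL item information: a^2 * P * (1 - P); it peaks where theta is near b
        # and grows with the square of the discrimination parameter a
        p = 1 / (1 + np.exp(-a * (theta - b)))
        return a ** 2 * p * (1 - p)

    theta = np.linspace(-3, 3, 121)
    a = np.array([0.8, 1.2, 1.8])                # illustrative discriminations
    b = np.array([-1.0, 0.0, 1.0])               # illustrative difficulties
    test_info = sum(item_info_2pl(theta, ai, bi) for ai, bi in zip(a, b))
    sem = 1 / np.sqrt(test_info)                 # conditional standard error of measurement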
6.
Effect of Sample Size on IRT Equating of Uni-Dimensional Tests in Common Item Non-Equivalent Group Design: A Monte Carlo Simulation Study. Wang, Xiangrong, 03 May 2012.
Test equating is important to large-scale testing programs for two reasons: strict test security is a key concern for high-stakes tests, and the fairness of test equating is important to test takers. The question of adequate sample size often arises in test equating, but most recommendations in the existing literature are based on classical test equating; very few studies have systematically investigated the minimal sample size that leads to reasonably accurate equating results under item response theory (IRT). The main purpose of this study was to examine the minimal sample size needed for the desired IRT equating accuracy in the common-item nonequivalent groups design under various conditions. Accuracy was determined by examining the relative magnitude of six accuracy statistics. Two IRT equating methods were carried out on simulated tests with combinations of test length, test format, group ability difference, similarity of form difficulty, and parameter estimation method for 14 sample sizes, using Monte Carlo simulations with 1,000 replications per cell. Observed score equating and true score equating were compared to the criterion equating to obtain the accuracy statistics. The results suggest that sample size requirements differ across test lengths, test formats, and parameter estimation methods. In addition, the results show the following: first, the results for true score equating and observed score equating are very similar. Second, the longer test is equated less accurately than the shorter one at the same sample size, and the gap widens as the sample size decreases. Third, the concurrent parameter estimation method produced less equating error than separate estimation at the same sample size, and the difference increases as the sample size decreases. Fourth, the conditions with group ability differences have larger and less stable error than the base case and the conditions with different test difficulty, especially when separate parameter estimation is used with a sample size of less than 750. Last, the mixed-format test is equated more accurately than the single-format test at the same sample size. / Ph. D.
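The dissertation's six accuracy statistics are not listed in this abstract; purely as a placeholder illustration, the sketch below shows how two common ones, bias and RMSE of the equated scores relative to the criterion equating, could be computed across Monte Carlo replications (the array layout and the choice of statistics are assumptions).

    import numpy as np

    def equating_accuracy(equated, criterion):
        # 'equated' holds one equated-score function per replication (replications x score points);
        # 'criterion' is the criterion equating evaluated at the same raw-score points
        err = np.asarray(equated) - np.asarray(criterion)
        return {"bias": err.mean(axis=0),
                "rmse": np.sqrt((err ** 2).mean(axis=0))}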
7.
Differential Item Functioning on the Armed Services Vocational Aptitude Battery. Gibson, Shanan Gwaltney IV, 19 November 1998.
Utilizing Item Response Theory (IRT) methodologies, the Armed Services Vocational Aptitude Battery (ASVAB) was examined for differential item functioning (DIF) on the basis of crossed gender and ethnicity variables. Both the Mantel-Haenszel procedure and an IRT area-based technique were used to assess the degree of uniform and non-uniform DIF in a sample of ASVAB takers. The analysis was performed such that each subgroup of interest served as the focal group to be compared to the male reference group. This type of DIF analysis allowed for comparisons within ethnic groups, within gender groups, and across crossed ethnic/gender groups. The groups analyzed were White, Black, and Hispanic males, and White and Black females. It was hypothesized that DIF would be found, at the scale level, on several of the ASVAB subtests as a result of unintended latent-trait demands of items. In particular, tests composed of items requiring specialized jargon, visuospatial ability, or advanced English vocabulary were anticipated to show bias toward White males and/or White females.
Findings were mixed. At the item level, DIF fluctuated greatly, and numerous instances of DIF favoring the reference group as well as the focal group were found. At the scale level, inconsistencies existed across forms and versions: tests varied in their tendency to be biased against the focal group of interest and, at times, performed contrary to expectations. / Master of Science
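As background on the first of the two screens mentioned above, here is a minimal sketch of the Mantel-Haenszel common odds ratio and its ETS delta-scale transformation. The inputs are counts of correct and incorrect responses per total-score stratum for the reference and focal groups; the formula and sign convention are standard, but the function itself is illustrative rather than the analysis actually run.

    import numpy as np

    def mantel_haenszel_dif(ref_right, ref_wrong, foc_right, foc_wrong):
        # Each argument is an array with one entry per matching stratum (e.g., total test score).
        # alpha > 1 means the reference group outperforms the matched focal group on the item.
        n = ref_right + ref_wrong + foc_right + foc_wrong
        alpha = np.sum(ref_right * foc_wrong / n) / np.sum(ref_wrong * foc_right / n)
        mh_d_dif = -2.35 * np.log(alpha)   # ETS delta metric; negative values favor the reference group
        return alpha, mh_d_dif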
8.
An Item Response Theory Analysis of the Scales from the International Personality Item Pool and the NEO Personality Inventory-Revised. McBride, Nadine LeBarron, 10 August 2001.
Personality tests are widely used in the field of Industrial/Organizational Psychology; however, few studies have focused on their psychometric properties using Item Response Theory. This paper uses IRT to examine the test information functions (TIFs) of two personality measures: the NEO-PI-R and scales from the International Personality Item Pool. Results showed that most scales for both measures provided relatively consistent levels of information and measurement precision across levels of theta (θ). Although the NEO-PI-R provided overall higher levels of information and measurement precision, the IPIP scales were more efficient in that they provided more precision per item. Both measures showed a substantial decrease in precision and information when response scales were dichotomized away from the original 5-point Likert format. Implications and further avenues for research are discussed. / Master of Science
9.
Using the Score-based Testlet Method to Handle Local Item Dependence. Tao, Wei, January 2008.
Thesis advisor: Larry H. Ludlow / Item Response Theory (IRT) is a contemporary measurement technique that has been widely used to model testing data and survey data. To apply IRT models, several assumptions have to be satisfied. Local item independence is a key assumption directly related to the estimation process. Many studies have examined the impact of local item dependence (LID) on test statistics and parameter estimates in large-scale assessments. However, in the health care field, where IRT is experiencing growing popularity, few studies have examined LID specifically. LID in the health care field has some unique characteristics that deserve separate analysis. In health care surveys, it is common to see several items that are phrased in a similar structure or that have a hierarchical order of difficulties. As a result, a Guttman scaling pattern, or a deterministic response pattern, is observed among those items. The purposes of this study are to detect whether the Guttman scaling pattern among a subset of items exhibits local dependence, whether such dependence has any impact on test statistics, and whether these effects differ when different IRT models are employed. The score-based approach, in which locally dependent dichotomous items are combined into a polytomous testlet, is used to accommodate LID. Results from this dissertation suggest that the Guttman scaling pattern among a subset of items does exhibit a moderate to high degree of LID. However, the impact of this special LID is minimal on internal reliability estimates and on the unidimensional data structure. Regardless of which models are employed, the dichotomously scored person ability estimates are highly correlated with the polytomously scored person ability estimates. However, the impact of this special LID on test information differs between Rasch models and non-Rasch models. Specifically, when only Rasch models are involved, test information derived from the LID-laden data is underestimated for non-extreme scores; whereas, when non-Rasch models are used, the opposite finding is reached, that is, LID tends to overestimate test information. / Thesis (PhD) — Boston College, 2008. / Submitted to: Boston College. Lynch School of Education. / Discipline: Educational Research, Measurement, and Evaluation.
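The score-based testlet idea referred to above can be sketched very simply: the locally dependent dichotomous items are replaced by a single polytomous item whose score is the number of those items answered correctly, and that column is then calibrated with a polytomous model such as the partial credit model. The helper below is an illustrative sketch (the function name and array layout are assumptions), not the dissertation's code.

    import numpy as np

    def form_testlet(responses, testlet_items):
        # responses: persons x items array of 0/1 scores;
        # testlet_items: column indices of the locally dependent items
        responses = np.asarray(responses)
        testlet_score = responses[:, testlet_items].sum(axis=1)       # polytomous 0..k testlet score
        keep = [j for j in range(responses.shape[1]) if j not in set(testlet_items)]
        return np.column_stack([responses[:, keep], testlet_score])   # remaining items + testlet column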
10.
Assessing first- and second-order equity for the common-item nonequivalent groups design using multidimensional IRT. Andrews, Benjamin James, 01 July 2011.
The equity properties can be used to assess the quality of an equating. The degree to which expected scores conditional on ability are similar between test forms is referred to as first-order equity. Second-order equity is the degree to which conditional standard errors of measurement are similar between test forms after equating. The purpose of this dissertation was to investigate the use of a multidimensional IRT framework for assessing first- and second-order equity of mixed format tests.
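In the simplest unidimensional, dichotomous-item case the two properties reduce to comparing conditional moments across forms: the expected raw score (first-order) and the conditional standard error of measurement (second-order) at each ability level. The 2PL sketch below is an illustrative reduction of that idea, with the function name assumed for illustration; the dissertation itself works with mixed-format tests and a multidimensional framework.

    import numpy as np

    def conditional_moments(theta, a, b):
        # Expected raw score and conditional SEM for a form of 2PL items at each theta
        p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
        return p.sum(axis=1), np.sqrt((p * (1 - p)).sum(axis=1))

    # First-order equity: expected scores on the two forms (after equating) should be close at every theta;
    # second-order equity: the conditional SEMs should be close at every theta.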
Both real and simulated data were used to assess the equity properties for mixed-format tests. Using real data from three Advanced Placement (AP) exams, five equating methods were compared in their preservation of first- and second-order equity: frequency estimation, chained equipercentile, unidimensional IRT true score, unidimensional IRT observed score, and multidimensional IRT observed score equating. Both a unidimensional and a multidimensional IRT framework were used to assess the equity properties. Two simulation studies were also conducted. The first investigated the accuracy of expected scores and conditional standard errors of measurement, under both frameworks, as tests became increasingly multidimensional. In the second simulation study, the five equating methods were compared in their ability to preserve first- and second-order equity as tests became more multidimensional and as differences in group ability increased.
Results from the real data analyses indicated that the performance of the equating methods in terms of first- and second-order equity varied depending on which framework was used to assess equity and on which test was used. Some tests showed similar preservation of equity under both frameworks, while others differed greatly. Results from the first simulation study showed that estimates of expected scores had lower mean squared error under the unidimensional framework than under the multidimensional framework when the correlation between abilities was high, whereas the multidimensional framework had lower mean squared error for conditional standard errors of measurement when the correlation between abilities was less than .95. In the second simulation study, chained equating preserved first-order equity better than frequency estimation, while frequency estimation preserved second-order equity better than the chained method. As tests became more multidimensional or as group differences increased, the multidimensional IRT observed score equating method tended to perform better than the other methods.