141 |
A study of tests devised to measure art capacities (Unknown Date)
M.A. Florida State College for Women 1930
|
142 |
Investigating the Recovery of Latent Class Membership in the Mixture Rasch Modeling (Unknown Date)
Mixture IRT modeling allows the detection of latent classes and of different item parameter profile patterns across those classes. In mixture Rasch model estimation, latent classes are assumed to follow a normal distribution with means constrained to be equal across classes for model identification purposes. The literature has shown that this conventional constraint can be problematic for establishing a common scale and for comparing item profile patterns across different latent classes. In this study, a simulation was conducted to explore the degree of recovery of class membership, and the class membership recovery of the conventional constraint approach was compared with that of the class-invariant item constraint approach. The results show that the two approaches yielded similar recovery of class membership, and that the class memberships assigned under the two approaches were consistent with each other. / A Thesis submitted to the Department of Educational Psychology and Learning Systems in partial fulfillment of the requirements for the degree of Master of
Science. / Fall Semester, 2014. / October 2, 2014. / Includes bibliographical references. / Insu Paek, Professor Directing Thesis; Betsy Jane Becker, Committee Member; Dan McGee, Committee Member.
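As a rough, illustrative companion to the abstract above (not code from the thesis), the following Python sketch generates responses from a two-class mixture Rasch model with class-specific item difficulty profiles and shows one simple way to score class-membership recovery while allowing for label switching; the sample sizes, difficulty profiles, and the placeholder "estimated" memberships are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(1)
n_per_class, n_items = 500, 20

# Class-specific item difficulty profiles (the profile pattern differs by class).
b_class1 = np.linspace(-2, 2, n_items)
b_class2 = b_class1[::-1]            # reversed profile for the second class

def rasch_responses(theta, b):
    """Dichotomous responses under the Rasch model P = 1/(1 + exp(-(theta - b)))."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p.shape) < p).astype(int)

theta1 = rng.normal(0.0, 1.0, n_per_class)   # equal-means constraint across classes
theta2 = rng.normal(0.0, 1.0, n_per_class)
data = np.vstack([rasch_responses(theta1, b_class1),
                  rasch_responses(theta2, b_class2)])
true_class = np.repeat([0, 1], n_per_class)
print("simulated response matrix:", data.shape)

def membership_recovery(true_labels, est_labels):
    """Proportion of correctly recovered memberships, allowing for label switching (2 classes)."""
    agree = np.mean(true_labels == est_labels)
    return max(agree, 1.0 - agree)

# Placeholder "estimated" memberships; in practice these would come from a mixture
# Rasch estimation routine (e.g., EM or MCMC), which is not shown here.
est_class = rng.integers(0, 2, true_class.size)
print("recovery rate:", membership_recovery(true_class, est_class))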
|
143 |
Some suggestions for constructing tests and test items for the primary grades of the elementary school (Unknown Date)
"This problem was chosen because there seems to be a need for an understanding by primary teachers about what learnings should be tested and how to test those learnings. For too long we, as teachers of the first, second, and third grades of the elementary school of America, have relied on our subjective opinions as an adequate measure of the child's progress. If we, in our educational program, are going to insist that children come to school, and progress from grade to grade, we will have to base the criteria that govern advancing from one grade to another on something more scientific than teacher observations. This is the plan that was followed for arriving at some workable regulations to govern the construction of teacher-made tests for the lower elementary grades. Material in text books on measurement and primary teaching methods was explored. Periodicals were examined in hopes of finding some very recent research in the field. The study of standardized tests showed the writer what had successfully been done. In constructing test items the use of teachers' manuals, workbooks, and state bulletins should be invaluable"--Introduction. / "August, 1950." / Typescript. / "Submitted to the Graduate Council of Florida State University in partial fulfillment of the requirements for the degree of Master of Science." / Advisor: M. H. DeGraff, Professor Directing Paper. / Includes bibliographical references (leaves 29-30).
|
144 |
Evaluation of Measurement Invariance in IRT Using Limited Information Fit Statistics/Indices: A Monte Carlo Study (Unknown Date)
Measurement invariance analysis is important when test scores are used to make a group-wise comparison. Multiple-group IRT
modeling is one of the commonly used methods for measurement invariance examination. One essential step in the multiple-group modeling
method is the evaluation of overall model-data fit. A family of limited information fit statistics has been recently developed for
assessing the overall model-data fit in IRT. Previous studies evaluated the performance of limited information fit statistics using
single-group data, and found that these fit statistics performed better than the traditional full information fit statistics when data
were sparse. However, no study has investigated the performance of the limited information fit statistics within the multiple-group
modeling framework. This study aims to examine the performance of the limited information fit statistic (M₂) and of corresponding M₂-based
descriptive fit indices in conducting measurement invariance analysis within the multiple-group IRT framework. A Monte Carlo study was
conducted to examine sampling distributions of M₂ and M₂-based descriptive fit indices, and their sensitivities to lack of measurement
invariance under various conditions. The manipulated factors included sample sizes, model types, dimensionality, types and numbers of DIF
items, and latent trait distributions. Results showed that the M₂ followed an approximately chi-square distribution when the model was
correctly specified, as expected. The type I error rates of M₂ were reasonable under large sample sizes (1000/2000). When the model was
misspecified, the power of M₂ was a function of sample size and the number of DIF items. For example, the power of M₂ for rejecting the
U2PL Scalar Model increased from 29.2% to 99.9% when the number of uniform DIF items increased from one to six, given the sample sizes of
1000/2000. With six uniform DIF items (30% of the studied items), the power of M₂ increased from 42.4% to 99.9% when sample sizes changed from 250/500 to 1000/2000. When the difference in M₂ (ΔM₂) was used to compare two correctly specified nested models, the sampling distribution of ΔM₂ departed from the reference chi-square distribution at both tails, especially under small sample sizes. The type I error rates of the ΔM₂ test became closer to the nominal level when sample sizes increased. For example, both the Metric and Configural Models were correctly specified when the test included no DIF item. Given the alpha level of .05, the type I error rates of ΔM₂ for the comparison between the Metric and Configural Models were slightly inflated with n=250/500 (8.72%) and became closer to the alpha level with n=1000/2000 (5.3%). When at least one of the models was misspecified, the power of ΔM₂ increased when the number of DIF items or sample sizes became larger. For example, the Metric Model was misspecified when a nonuniform DIF item existed. Given sample sizes of 1000/2000 and an alpha level of .05, the power of ΔM₂ for the comparison between the Metric and Configural Models increased from 52.55% to 99.39% when the number of nonuniform DIF items changed from one to six. With one nonuniform DIF item in the test, the power of ΔM₂ was only 17.05% given
the alpha level of .05 and sample sizes of 250/500, but increased to 52.55% given the sample sizes of 1000/2000. The descriptive fit
indices and their differences between nested models were also affected by the number of DIF items. When there was no DIF item, all fit
indices indicated good model-data fit. The differences of the five fit indices between nested models were all very small (<.008) across
different sample sizes. When DIF items existed, the means of the descriptive fit indices and their differences between nested models increased as the number of DIF items increased. The findings from this study provide some suggestions about the implementation of the
limited information fit statistics/indices in measurement invariance analysis within the multiple-group IRT framework. / A Dissertation submitted to the Department of Educational Psychology and Learning Systems in partial
fulfillment of the requirements for the degree of Doctor of Philosophy. / Fall Semester 2016. / October 31, 2016. / Includes bibliographical references. / Yanyun Yang, Professor Co-Directing Dissertation; Insu Paek, Professor Co-Directing Dissertation;
Fred W. Huffer, University Representative; Betsy J. Becker, Committee Member; Salih Binici, Committee Member.
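The Monte Carlo logic described in the abstract, tabulating type I error rates and power by comparing a fit statistic against a chi-square critical value over many replications, can be sketched as follows in Python. This is not the dissertation's code: compute_m2() is a hypothetical placeholder that merely mimics a correctly specified model, since an actual M₂ computation would come from a dedicated IRT package.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)

def compute_m2(dataset, df):
    # Placeholder: draws from the reference chi-square distribution to mimic a
    # correctly specified model.  Replace with a real limited-information fit
    # computation applied to the simulated dataset.
    return rng.chisquare(df)

def rejection_rate(n_reps, df, alpha=0.05):
    """Proportion of replications in which the statistic exceeds the chi-square critical value."""
    crit = chi2.ppf(1 - alpha, df)
    stats = np.array([compute_m2(None, df) for _ in range(n_reps)])
    return np.mean(stats > crit)

# Under a correctly specified model the rate estimates the type I error (close to alpha);
# under a misspecified model the same tabulation estimates power.
print("empirical rejection rate:", rejection_rate(n_reps=500, df=30))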
|
145 |
Using the Rasch Analysis and Structural Equation Modeling to validate a Health-Related Quality of Life Questionnaire (Singh, Priti, January 2020)
No description available.
|
146 |
Bayesian Model Checking in Cognitive Diagnostic Models (Unknown Date)
Checking that models adequately represent the data is an essential component of applied statistical inference. Psychometricians increasingly use complex models to analyze test takers' responses. The appeal of using complex cognitive diagnostic models (CDMs) is undeniable, as psychometricians can build and fit models that represent complex cognitive processes in the test while simultaneously controlling observation errors. With a trend toward diagnosing the fine-grained skills that are responsible for test performance, both new methods and extensions of existing methods of assessing person fit in CDMs are required. The posterior predictive (PP) method is the most commonly used method for evaluating the effectiveness of person fit statistics in detecting aberrant response patterns. It has been shown to be effective in detecting aberrant responses in IRT models, but it is seldom implemented in cognitive diagnostic models. Additionally, two lesser-known Bayesian model checking methods, the prior predictive posterior simulation (PPPS) method and the pivotal discrepancy measure (PDM), were also used to investigate the effectiveness of the chosen person fit statistics. Three person fit statistics were chosen for this study: the log-likelihood statistic (l_z), the un-weighted between-set index (UB), and the response conformity index (RCI). In this study, I investigated the effectiveness of different Bayesian model checking methods in detecting aberrant response patterns with the chosen discrepancy measures. The results from this study might help researchers answer the following two questions: (1) which discrepancy measure is more effective in detecting aberrant response patterns under different model checking methods? (2) how well do the chosen discrepancy measures detect outlying response patterns? A simulation study was conducted to answer these two questions. The data generation consisted of two parts: one for aberrant response patterns and the other for normal response patterns. Normal response patterns were simulated from the DINA model with designated attribute parameters, and each of four different aberrant response patterns was simulated using a binomial distribution with different assigned probabilities. The data were simulated and analyzed in the R programming language with the Rjags package. Several interesting results can be drawn from this study: (1) increasing the test length did not improve the detection rates for any of the aberrant response patterns; (2) Q-matrix complexity did not decrease the detection rate very much; (3) generally speaking, the log-likelihood statistic was the best measure for detecting each of the different response patterns, especially cheating responses; (4) there was not much difference in the performance of the discrepancy measures under the PP and PPPS methods; (5) although the discrepancy measure RCI was developed in the context of cognitive diagnostic models (CDMs), it performed poorly in detecting each of the different aberrant responses. / A Dissertation submitted to the Department of Educational Psychology and Learning Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / 2019 / August 27, 2019. / Aberrant responses, Bayesian model checking, CDMs, Person fit analysis / Includes bibliographical references. / Russell Almond, Professor Directing Dissertation; Fred Huffer, University Representative; Betsy Becker, Committee Member; Insu Paek, Committee Member.
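As an illustrative sketch only (the study itself used R with Rjags), the Python below generates dichotomous responses from the DINA model for assumed slip and guessing parameters and a random Q-matrix, then computes a simple log-likelihood person-fit discrepancy of the kind that Bayesian checks such as PP or PPPS would compare against replicated data; all values are made up.

import numpy as np

rng = np.random.default_rng(3)
n_persons, n_items, n_attrs = 200, 15, 3

Q = rng.integers(0, 2, size=(n_items, n_attrs))
Q[Q.sum(axis=1) == 0, 0] = 1                   # every item measures at least one attribute
slip = np.full(n_items, 0.1)                   # assumed slip parameters
guess = np.full(n_items, 0.2)                  # assumed guessing parameters

alpha = rng.integers(0, 2, size=(n_persons, n_attrs))        # attribute profiles
eta = np.all(alpha[:, None, :] >= Q[None, :, :], axis=2)     # mastery of all required attributes
p = np.where(eta, 1 - slip, guess)                           # DINA success probabilities
X = (rng.random(p.shape) < p).astype(int)                    # simulated responses

def person_loglik(x, p_row):
    """Log-likelihood of one response vector given DINA success probabilities."""
    return np.sum(x * np.log(p_row) + (1 - x) * np.log(1 - p_row))

loglik = np.array([person_loglik(X[i], p[i]) for i in range(n_persons)])
# Unusually low values flag potentially aberrant response patterns; in a Bayesian
# check these would be referenced against replicated data drawn from the posterior.
print("5th percentile of person log-likelihoods:", np.percentile(loglik, 5))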
|
147 |
The impact of judges' consensus on the accuracy of anchor-based judgmental estimates of multiple-choice test item difficulty: The case of the NATABOC Examination (DiBartolomeo, Matthew, 01 January 2010)
Multiple factors have influenced testing agencies to more carefully consider the manner and frequency in which pretest item data are collected and analyzed. One potentially promising development is judges’ estimates of item difficulty. Accurate estimates of item difficulty may be used to reduce pretest sample sizes, supplement insufficient pretest sample sizes, aid in test form construction, assist in test form equating, calibrate test item writers who may be asked to produce items to meet statistical specifications, inform the process of standard setting, aid in preparing randomly equivalent blocks of pretest items, and/or aid in setting item response theory prior distributions. Two groups of 11 and eight judges, respectively, provided estimates of difficulty for the same set of 33 multiple-choice items from the National Athletic Trainers’ Association Board of Certification (NATABOC) Examination. Judges were faculty in Commission on Accreditation of Athletic Training Education-approved athletic training education programs and were NATABOC-approved examiners of the former hands-on practical portion of the Examination. For each item, judges provided two rounds of independent estimates of item difficulty and a third-round group-level consensus estimate. Prior to providing estimates of item difficulty in rounds two and three, group discussion of the estimates provided in the preceding round was conducted. In general, the judges’ estimates of test item difficulty did not improve across rounds as predicted. Two-way repeated measures analyses of variance comparing item set mean difficulty estimates by round and the item set mean empirical item difficulty revealed no statistically significant differences across rounds, groups, or the interaction of these two factors. Moreover, the item set mean difficulty estimates drifted farther from the item set mean empirical item difficulty with each round, and mean estimation bias and effect sizes therefore increased correspondingly across rounds. Therefore, the results revealed that no item difficulty estimation round yielded statistically significantly better recovery of the empirical item difficulty values compared to the other rounds.
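A minimal Python sketch (not from the dissertation, with made-up numbers) of the kind of round-by-round summary described above: judges' item difficulty estimates are averaged per item and compared with empirical difficulties via mean bias, RMSE, and a standardized effect size.

import numpy as np

empirical_p = np.array([0.62, 0.48, 0.71, 0.55, 0.80])       # empirical proportions correct
round1 = np.array([[0.60, 0.50, 0.65, 0.50, 0.75],            # one row per judge (illustrative)
                   [0.70, 0.55, 0.75, 0.60, 0.85],
                   [0.55, 0.45, 0.70, 0.52, 0.78]])

def summarize(judge_estimates, empirical):
    """Mean bias, RMSE, and standardized mean bias of judges' item-mean estimates."""
    item_means = judge_estimates.mean(axis=0)
    bias = item_means - empirical
    rmse = np.sqrt(np.mean(bias ** 2))
    effect_size = bias.mean() / bias.std(ddof=1)
    return bias.mean(), rmse, effect_size

mean_bias, rmse, d = summarize(round1, empirical_p)
print(f"mean bias={mean_bias:.3f}, RMSE={rmse:.3f}, effect size={d:.2f}")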
|
148 |
The construction and validation of a test of technical terms in general education (Underwood, Edward S., January 1954)
Thesis (Ed.D.)--Boston University
|
149 |
Impact of item parameter drift on test equating and proficiency estimates (Han, Kyung Tyek, 01 January 2008)
When test equating is implemented, item parameter drift (IPD), especially in the linking items of an anchor test design, is expected to cause flaws in the measurement. However, an important question that has not yet been examined is how much IPD can be tolerated before its effect becomes consequential. To answer this overarching question, three Monte Carlo simulation studies were conducted. In the first study, titled 'Impact of unidirectional IPD on test equating and proficiency estimates,' the indirect effect of IPD on proficiency estimates (through its effect on test equating designs that use linking items containing IPD) was examined. The results, with regression line-like plots, provided a comprehensive illustration of the relationship between IPD and its consequences, which can be used as an important guideline for practitioners when IPD is expected in testing. In the second study, titled 'Impact of multidirectional IPD on test equating and proficiency estimates,' the impact of different combinations of linking items with various multidirectional IPD on the test equating procedure was investigated for three popular scaling methods (mean-mean, mean-sigma, and the TCC method). It was hypothesized that multidirectional IPD would influence the amount of random error observed in the linking while the effect of systematic error would be minimal. The results confirmed the hypothesis, and the study also found that different combinations of multidirectional IPD resulted in different levels of impact even with the same total amount of IPD. The third study, titled 'Impact of IPD on pseudo-guessing parameter estimates and test equating,' examined how serious the consequences are if c-parameters are not transformed in the test equating procedure when IPD exists. Three new item calibration strategies for putting c-parameter estimates on the same scale across tests were proposed. The results indicated that the consequences of IPD under the various calibration strategies and scaling methods were not substantially different when the external linking design was used, but the study found that the choice of calibration method and scaling method could result in different outputs when the internal linking design and/or different cut scores were used.
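As a hedged illustration of the mean-sigma scaling method named in the abstract (not Han's code; the anchor parameter values are invented), the Python sketch below derives the linear linking coefficients from anchor-item difficulties and applies them to item and proficiency parameters; drift in any anchor would distort these coefficients and, in turn, the transformed proficiency estimates.

import numpy as np

b_old = np.array([-1.2, -0.4, 0.3, 0.9, 1.6])    # anchor difficulties on the old (base) scale
b_new = np.array([-1.0, -0.2, 0.5, 1.1, 1.9])    # same anchors calibrated on the new form
a_new = np.array([1.1, 0.8, 1.3, 0.9, 1.0])      # discriminations on the new form

# Mean-sigma coefficients: scale * x + shift maps the new-form metric onto the old one.
scale = b_old.std(ddof=1) / b_new.std(ddof=1)
shift = b_old.mean() - scale * b_new.mean()

b_transformed = scale * b_new + shift            # b* = A * b + B
a_transformed = a_new / scale                    # a* = a / A
theta_transformed = lambda theta: scale * theta + shift

print("A =", round(scale, 3), "B =", round(shift, 3))
print("transformed anchor difficulties:", np.round(b_transformed, 3))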
|
150 |
Approaches for addressing the fit of item response theory models to educational test data (Zhao, Yue, 01 January 2008)
The study was carried out to accomplish three goals: (1) propose graphical displays of IRT model fit at the item level and suggest fit procedures at the test level that are not impacted by large sample size, (2) examine the impact of IRT model misfit on proficiency classifications, and (3) investigate consequences of model misfit in assessing academic growth. The main focus of the first goal was on the use of more and better graphical procedures for investigating model fit and misfit through the use of residuals and standardized residuals at the item level. In addition, some new graphical procedures and a non-parametric test statistic for investigating fit at the test score level were introduced, and some examples were provided. Based on a realistic dataset from a high school assessment, statistical and graphical methods were applied and results were reported. More important than the results about the actual fit were the procedures that were developed and evaluated. In addressing the second goal, practical consequences of IRT model misfit on performance classifications and test score precision were examined. It was found that with several of the data sets under investigation, test scores were noticeably less well recovered with the misfitting model, and there were practically significant differences in the accuracy of classifications with the model that fit the data less well. In addressing the third goal, the consequences of model misfit in assessing academic growth in terms of test score precision, decision accuracy, and passing rate were examined. The three-parameter logistic/graded response (3PL/GR) models produced more accurate estimates than the one-parameter logistic/partial credit (1PL/PC) models, and the fixed common item parameter method produced results closer to “truth” than linear equating using the mean and sigma transformation. IRT model fit studies have not received the attention they deserve among testing agencies and practitioners. On the other hand, IRT models can almost never provide a perfect fit to the test data, but the evidence is substantial that these models can provide an excellent framework for solving practical measurement problems. The importance of this study is that it provides ideas and methods for addressing model fit, and most importantly, highlights studies for addressing the consequences of model misfit for use in making determinations about the suitability of particular IRT models.
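A small Python sketch, under assumed 2PL item parameters and simulated data, of the item-level standardized residuals the first goal refers to: examinees are grouped into ability intervals, and the observed proportion correct in each interval is compared with the model-based expectation. None of this is taken from the dissertation.

import numpy as np

rng = np.random.default_rng(11)
n_persons = 2000
a, b = 1.2, 0.3                                    # assumed 2PL parameters for one item

theta = rng.normal(0, 1, n_persons)
p_true = 1 / (1 + np.exp(-1.7 * a * (theta - b)))  # 2PL with scaling constant D = 1.7
x = (rng.random(n_persons) < p_true).astype(int)

edges = np.quantile(theta, np.linspace(0, 1, 11))  # 10 ability groups of equal size
groups = np.clip(np.digitize(theta, edges[1:-1]), 0, 9)

for g in range(10):
    idx = groups == g
    n_g = idx.sum()
    observed = x[idx].mean()
    expected = p_true[idx].mean()                  # model-predicted proportion correct
    se = np.sqrt(expected * (1 - expected) / n_g)
    z = (observed - expected) / se                 # standardized residual for this group
    print(f"group {g}: n={n_g:4d}  obs={observed:.3f}  exp={expected:.3f}  z={z:+.2f}")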
|