71

The use of item response theory in developing the Dynamic Indicators of Vocabulary Skills parallel forms

Riley, Jane M 01 January 2008 (has links)
The primary purpose of this study was to develop ten parallel forms of the DIVS measures using a modified one-parameter item response theory (IRT) model. Secondarily, the IRT-created forms were compared to the original forms. Data were collected on pre-kindergarten and kindergarten students. The item responses first underwent preliminary analyses, and poorly and negatively discriminating items were removed. Then, the item characteristics were estimated and parallel forms were created. The test information and test characteristic curves (TCC) for the newly developed forms are reported. The preexisting DIVS forms, which had originally been created by random assignment, were compared to the new IRT-created forms. Overall, the IRT forms yielded vastly improved test characteristics. Implications for future studies are discussed.
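As a rough illustration of the assembly stage of such a procedure (not the author's actual method), the sketch below assumes item difficulties have already been estimated under a one-parameter (Rasch-type) model and shows one simple way parallel forms might be built so their difficulty distributions match; the pool size, form count, and all parameter values are invented.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) under the one-parameter logistic (Rasch) model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def assemble_parallel_forms(difficulties, n_forms, form_length):
    """Greedy 'snake' assignment: sort items by difficulty and deal them
    to forms in alternating order so each form gets a similar spread."""
    order = np.argsort(difficulties)
    forms = [[] for _ in range(n_forms)]
    for block_start in range(0, n_forms * form_length, n_forms):
        block = order[block_start:block_start + n_forms]
        if (block_start // n_forms) % 2:    # reverse every other block
            block = block[::-1]
        for form, item in zip(forms, block):
            form.append(int(item))
    return forms

rng = np.random.default_rng(0)
b = rng.normal(0.0, 1.0, size=200)          # hypothetical difficulty estimates
forms = assemble_parallel_forms(b, n_forms=10, form_length=20)
for i, f in enumerate(forms):
    print(f"form {i}: mean b = {b[f].mean():+.3f}")
```

A fuller assembly would match the forms' test information functions across the ability range rather than difficulty means alone.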
72

A Comparison of Performance for 3rd Through 8th Grade Students on the 2014 NJ ASK and 2015 End-Of-Year PARCC Assessments

Kendler, Adam 01 January 2021 (has links)
State-mandated standardized testing comprises a significant component of the student outcome measures utilized by state and federal government to assess school district performance. Failure to meet Adequate Yearly Progress (AYP) on standardized assessments can result in negative consequences for districts, both systemically and financially. The current study analyzes the transition from the New Jersey Assessment of Skills and Knowledge (NJ ASK) to the Partnership for Assessment of Readiness for College and Careers (PARCC) for the 2014-2015 school year. Among the differences between the two assessments is a change in modality, with students completing their PARCC English Language Arts (ELA) and Mathematics assessments via computer rather than the traditional paper-pencil administration of the NJ ASK. Outcome data for students from Vernon, New Jersey indicate that students performed significantly better on the NJ ASK than the PARCC in both ELA and Mathematics, in terms of both score and proficiency level, for the overall sample as well as for a subset of students with disabilities. Familiarity with computer-based assessment among a cohort of students who had been provided individual laptops for the duration of the school year did not improve student performance.
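Because the two assessments report on different scales, comparisons in studies like this typically rest on proficiency classifications for matched students. The sketch below is a hedged illustration of one such analysis, McNemar's test on paired proficiency outcomes, and is not the study's actual method; all counts and cutoffs are invented.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(1)
n = 300
# Hypothetical paired proficiency outcomes (True = proficient) for the
# same students on the two assessments; values invented for illustration.
ability = rng.normal(size=n)
prof_njask = (ability + rng.normal(0, 0.5, n)) > -0.3   # easier standard
prof_parcc = (ability + rng.normal(0, 0.5, n)) > 0.3    # harder standard

# 2x2 table of agreement/disagreement between the paired classifications
table = np.array([
    [np.sum( prof_njask &  prof_parcc), np.sum( prof_njask & ~prof_parcc)],
    [np.sum(~prof_njask &  prof_parcc), np.sum(~prof_njask & ~prof_parcc)],
])
result = mcnemar(table, exact=True)
print(f"NJ ASK proficient: {prof_njask.mean():.1%}, "
      f"PARCC proficient: {prof_parcc.mean():.1%}, p = {result.pvalue:.4g}")
```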
73

Moments of determination: Detecting learning within Intelligent Tutoring Systems

Schweid, Jason A 01 January 2012 (has links)
A central criticism of the assessment-based evaluation policies now in vogue in American public education is the reduction of student learning time. Likewise, many see the current crop of year-end, summative assessments as serving only the data needs of politicians and higher-level school administrators. Stemming from these criticisms, and from a combination of technological and cognitive-psychological curiosity, the computer science community has offered a unique alternative to the traditional assessment form. Intelligent Tutoring Systems (ITS) offer the hope of just-in-time assessment with no time away from instruction. That is, ITS are purported to both test and teach at the same time. However, inherent to ITS is the inference of learning. While the inference of learning, and ITS themselves, are placed within the context of education, the analytic logic employed for justification is grounded in the data mining and artificial intelligence traditions. This dissertation seeks to bridge the analytic traditions of educational measurement and data mining. The study, carried out in three steps, applies measurement strategies to a form of Intelligent Tutoring to compare the determination of learning between the two different analytic traditions.
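One widely used approach to inferring learning inside an ITS, from the data mining tradition the abstract refers to, is Bayesian Knowledge Tracing. The source does not say this is the model studied, so the sketch below is only a generic illustration, with assumed guess, slip, and learning parameters.

```python
def bkt_update(p_known, correct, guess=0.2, slip=0.1, learn=0.15):
    """One Bayesian Knowledge Tracing step: update P(skill known) from
    an observed response, then apply the learning-transition probability."""
    if correct:
        likelihood = p_known * (1 - slip)
        evidence = likelihood + (1 - p_known) * guess
    else:
        likelihood = p_known * slip
        evidence = likelihood + (1 - p_known) * (1 - guess)
    posterior = likelihood / evidence
    return posterior + (1 - posterior) * learn

p = 0.3                                   # assumed prior P(known)
for i, response in enumerate([0, 1, 1, 1], start=1):
    p = bkt_update(p, response)
    print(f"after response {i} ({'right' if response else 'wrong'}): "
          f"P(known) = {p:.3f}")
```

The tutor "detects learning" when the tracked probability crosses a mastery threshold (often 0.95), which is the kind of determination the dissertation compares against measurement-based alternatives.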
74

Assessment literacy and efficacy: Making valid educational decisions

Chapman, Mary Lou 01 January 2008 (has links)
The purpose of the study was to gather information from practicing teachers about their knowledge and use of assessment in the classroom, referred to as assessment literacy; their confidence in their ability to effectively assess student progress and make valid educational decisions, referred to as assessment efficacy; and their beliefs about the consequences of these decisions. A two-part survey was administered to general and special education teachers in selected schools in western Massachusetts: an assessment literacy questionnaire to determine knowledge of assessment principles, and an assessment efficacy scale to determine confidence in using assessment results. The second part of the study was an interview with a sample of teachers about the consequences of educational decisions made using assessment data. The participants were mostly general education teachers at the secondary level with graduate degrees and prior training in assessment, and they were almost equally divided by years of experience. They perceived themselves to be somewhat prepared to very prepared to teach and assess student performance, yet fewer than two-thirds of the teachers responded correctly to 70% of the items on the adapted assessment literacy questionnaire. The participants generally perceived themselves to be confident in their skills to make appropriate educational decisions, thus possessing a high level of assessment efficacy. The responses from the interviews indicated that the teachers perceived the decisions they make from classroom-based and school- and district-wide assessments to have valid and meaningful outcomes when the decisions were correct and implemented appropriately. Their responses also indicated concern about the unintended consequences of decisions made from the statewide assessment.
75

An empirical examination of the impact of item parameters on IRT information functions in mixed format tests

Lam, Wai Yan Wendy 01 January 2012 (has links)
IRT, also referred to as "modern test theory", offers many advantages over CTT-based methods in test development. Specifically, an IRT information function makes it possible to build a test that has the desired precision of measurement over any defined proficiency scale, provided a sufficient number of test items are available. This feature is extremely useful when the information is used for decision making, for instance, deciding whether an examinee has attained a certain mastery level. Computerized adaptive testing (CAT) is one of many examples of using IRT information functions in test construction. The purposes of this study were as follows: (1) to examine the consequences of improving test quality through the addition of more discriminating items with different item formats; (2) to examine the effect of a test whose difficulty does not align with the ability level of the intended population; (3) to investigate the change in decision consistency and decision accuracy; and (4) to understand changes in expected information when test quality is either improved or degraded, using both empirical and simulated data. Main findings from the study were as follows: (1) increasing the discriminating power of any type of item generally increased the level of information; however, it could sometimes have an adverse effect at the extreme ends of the ability continuum; (2) it was important to have more items targeted at the population of interest; otherwise, no matter how good the quality of the items, they were of less value in test development when not targeted to the distribution of candidate ability or at the cutscores; (3) decision consistency (DC), the Kappa statistic, and decision accuracy (DA) increased with better-quality items; (4) DC and Kappa were negatively affected when the difficulty of the test did not match the ability of the intended population; however, the effect was less severe if the test was easier than needed; (5) tests with more high-quality items lowered false positive (FP) and false negative (FN) rates at the cutscores; (6) when test difficulty did not match the ability of the target examinees, both FP and FN rates generally increased; (7) polytomous items tended to yield more information than dichotomously scored items, regardless of the discrimination parameter and difficulty of the item; and (8) the more score categories an item had, the more information it could provide. Findings from this thesis should help testing agencies and practitioners gain a better understanding of the effects of item parameters on item and test information functions. This understanding is crucial for improving item bank quality and, ultimately, for building better tests that provide more accurate proficiency classifications. At the same time, item writers should be conscientious of the fact that the item information function is merely a statistical tool for building a good test; other criteria should also be considered, for example, content balancing and content validity.
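The information computations this abstract relies on are standard. For a dichotomous 2PL item, information at ability θ is I(θ) = a²P(θ)(1−P(θ)), test information is the sum over items, and the conditional standard error of measurement is 1/√I(θ). The sketch below illustrates these relations with invented item parameters and a hypothetical cutscore.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info_2pl(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 121)
items = [(0.8, 0.0), (1.6, 0.0), (1.6, 1.5)]     # (a, b) pairs, invented
test_info = sum(item_info_2pl(theta, a, b) for a, b in items)

cut = 0.5                                        # hypothetical cutscore
i_cut = np.argmin(np.abs(theta - cut))
print(f"test information at cutscore {cut}: {test_info[i_cut]:.3f}")
print(f"conditional SEM at cutscore:       {1/np.sqrt(test_info[i_cut]):.3f}")
```

Comparing the first two items shows the discrimination effect directly: doubling a quadruples peak information at b, while contributing almost nothing far from b.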
76

Evaluation of IRT anchor test designs in test translation studies

Bollwark, John Anthony 01 January 1991 (has links)
Translating measurement instruments from one language to another is a common way of adapting them for use in a population other than the one for which they were designed. This technique is particularly useful in helping to (1) understand the similarities and differences that exist between populations and (2) provide unbiased testing opportunities across different segments of a single population. To help ensure that a translated instrument is valid for these purposes, it is essential that the equivalence of the original and translated instruments be established. One focus of this thesis was to provide a review of the history, problems, and techniques associated with establishing the translation equivalence of measurement instruments. This review also provided support for the use of item response theory (IRT) in translation equivalence studies. The second and main focus of this thesis was to investigate anchor test designs for using IRT in translation equivalence studies. Simulated data were used to determine the anchor test length required to provide adequate scaling results under conditions similar to those likely to be found in a translation equivalence study. These conditions included (1) relatively small samples and (2) examinee ability distribution overlaps more representative of vertical than horizontal scaling situations. The effects of these two variables on the anchor test design required to provide adequate scaling results were also investigated. The main conclusions from this research concerning the scaling of IRT ability and item parameters are: (1) larger examinee samples with larger ability overlaps should be used whenever possible; (2) under ideal scaling conditions of larger examinee samples with larger ability overlaps, relatively good scaling results can be obtained with anchor tests consisting of as few as 5 items (although the use of such short anchor tests is not recommended); and (3) anchor test lengths of at least 10 items should provide adequate scaling results, but longer anchor tests, consisting of well-translated items, should be used if possible. Finally, suggestions for further research on establishing translation equivalence were provided.
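A common way to carry out the scaling step described here, using anchor items to place two IRT calibrations on a common scale, is the mean/sigma method. The sketch below assumes difficulty estimates for five shared anchor items are already in hand; all values are invented, and the thesis itself evaluated designs rather than prescribing this particular linking formula.

```python
import numpy as np

def mean_sigma_link(b_source, b_target):
    """Mean/sigma linking constants from anchor-item difficulties,
    chosen so that b_target ≈ A * b_source + B."""
    A = np.std(b_target, ddof=1) / np.std(b_source, ddof=1)
    B = np.mean(b_target) - A * np.mean(b_source)
    return A, B

# Hypothetical anchor-item difficulty estimates from the two calibrations
b_orig  = np.array([-1.2, -0.5, 0.1, 0.8, 1.4])   # original-language run
b_trans = np.array([-0.9, -0.2, 0.5, 1.1, 1.9])   # translated-language run

A, B = mean_sigma_link(b_orig, b_trans)
print(f"A = {A:.3f}, B = {B:.3f}")

# Rescale the source calibration onto the target scale
print("rescaled anchor difficulties:", np.round(A * b_orig + B, 3))
# Discriminations rescale as a / A, and abilities as A * theta + B.
```

The quality of these constants degrades with small samples and poor ability overlap, which is exactly the sensitivity the simulations above were designed to quantify.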
77

Obtaining norm-referenced scores from criterion-referenced tests: An analysis of estimation errors

Tucker, Charlene Gower 01 January 1991 (has links)
One customized testing model equates a criterion-referenced test (CRT) to a norm-referenced test (NRT) so that performance on the CRT can produce an estimate of performance on the NRT. The error associated with these estimated norms is not well understood. The purpose of this study was to examine the extent and nature of the error present in these normative scores. In two subject areas and at three grade levels, actual NRT scores were compared to NRT scores estimated from a CRT. The estimation error was analyzed for individual scores and for group means at different parts of the score distribution. For individuals, the mean absolute difference between the actual NRT scores and the estimated NRT scores was approximately five raw score points on a 60-item reading subtest and approximately two points on a 30-item mathematics subtest. A comparison of the standard errors of substitution showed that individual differences were similar whether a parallel form or a CRT estimate was substituted for the NRT score. The bias present in the estimation of NRT scores from a CRT for groups of examinees is shown by the mean difference between the estimated and actual NRT scores. For all subtests, mean differences were less than one score point, indicating that group data can be accurately obtained through the use of this model. To examine the accuracy of estimation at different parts of the score distribution, the data were divided into three score groups (low, middle, and high) and, subsequently, into deciles. After correcting for a regression effect, mean group differences between actual NRT scores and those estimated from a CRT were fairly consistent for groups at different parts of the distribution. Individual scores, however, were most accurate at the upper end of the score distribution, with accuracy declining as the score level decreased. In conclusion, this study offers evidence that NRT scores can be estimated from performance on a CRT with reasonable accuracy. However, the generalizability of these results to other sets of tests or other populations is unknown. It is recommended that similar research be pursued under varying conditions.
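The error analyses described above reduce to signed and absolute differences between actual and estimated scores. The sketch below reproduces that logic on fabricated data (the real study used actual reading and mathematics subtests), reporting the individual mean absolute difference, the group bias, and accuracy by score level.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Hypothetical actual NRT raw scores and CRT-based estimates of them;
# all values are invented for illustration.
actual = rng.normal(35, 8, size=n).clip(0, 60).round()
estimated = (actual + rng.normal(0, 5, size=n)).clip(0, 60).round()

abs_err = np.abs(estimated - actual)
print(f"individual MAD: {abs_err.mean():.2f} raw-score points")
print(f"group bias (mean signed error): {(estimated - actual).mean():+.2f}")

# Accuracy by score level: split sorted examinees into thirds
for name, grp in zip(("low", "middle", "high"),
                     np.array_split(np.argsort(actual), 3)):
    print(f"{name:>6}: MAD = {abs_err[grp].mean():.2f}, "
          f"bias = {(estimated - actual)[grp].mean():+.2f}")
```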
78

Factors influencing the performance of the Mantel-Haenszel procedure in identifying differential item functioning

Clauser, Brian Errol 01 January 1993 (has links)
The Mantel-Haenszel (MH) procedure has emerged as one of the methods of choice for identifying differential item functioning (DIF) in test items. Although there has been considerable research examining its performance in this context, important gaps remain in the knowledge base needed to apply the procedure effectively. This investigation attempts to fill these gaps with the results of five simulation studies. The first study examines the utility of the two-step procedure recommended by Holland and Thayer, in which the matching criterion used in the second step is refined by removing items identified in the first step. The results showed that using the two-step procedure is associated with a reduction in the Type II error rate. In the second study, the capability of the MH procedure to identify uniform DIF was examined. The statistic was used to identify simulated DIF in items with varying levels of difficulty and discrimination and with differing levels of difference in difficulty between groups. The results indicated that when the difference in difficulty was held constant, poorly discriminating items and very difficult items were less likely to be identified by the procedure. In the third study, the effects of sample size were considered. Although the MH procedure has been repeatedly recommended for use with small samples, the results of this study suggest that samples below 200 per group may be inadequate. Performance with larger samples was satisfactory and improved as samples increased. The fourth study examines the effects of score group width on the statistic. Holland and Thayer recommended that n + 1 score groups be used for matching (where n is the number of items). Since then, various authors have suggested that there may be utility in using fewer (wider) score groups. It was shown that use of this variation on the MH procedure could result in dramatically increased Type I error rates. In the final study, a simple variation on the MH statistic that may allow it to identify non-uniform DIF was examined. The MH statistic's inability to identify certain types of non-uniform DIF has been noted as a major shortcoming. Use of the variation resulted in identification of many of the simulated non-uniform DIF items with little or no increase in the Type I error rate.
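For reference, the MH statistic is computed from one 2x2 table (group by correct/incorrect) per matched score group. The sketch below implements the common odds ratio and the continuity-corrected chi-square from their standard formulas; the counts are invented.

```python
import numpy as np

def mantel_haenszel(tables):
    """MH common odds ratio and chi-square (with continuity correction)
    from a list of 2x2 tables [[A, B], [C, D]], one per matched score
    group: rows = reference/focal group, columns = correct/incorrect."""
    A = np.array([t[0][0] for t in tables], dtype=float)
    B = np.array([t[0][1] for t in tables], dtype=float)
    C = np.array([t[1][0] for t in tables], dtype=float)
    D = np.array([t[1][1] for t in tables], dtype=float)
    T = A + B + C + D
    odds_ratio = np.sum(A * D / T) / np.sum(B * C / T)
    expected = (A + B) * (A + C) / T
    variance = (A + B) * (C + D) * (A + C) * (B + D) / (T**2 * (T - 1))
    chi2 = (abs(A.sum() - expected.sum()) - 0.5) ** 2 / variance.sum()
    return odds_ratio, chi2

# Hypothetical counts for one item across three matched score groups
tables = [[[40, 20], [25, 30]],
          [[55, 15], [40, 25]],
          [[70, 10], [60, 15]]]
or_mh, chi2 = mantel_haenszel(tables)
print(f"MH odds ratio = {or_mh:.3f}, MH chi-square = {chi2:.3f}")
```

Collapsing to fewer, wider score groups changes only how the tables are formed, which is why the fourth study could isolate that choice while holding the statistic itself fixed.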
79

Optimal test designs with content balancing and variable target information functions as constraints

Lam, Tit Loong 01 January 1993 (has links)
Optimal test design involves the application of an item selection heuristic to construct a test that fits a target information function, so that the standard error of the test can be controlled at different regions of the ability continuum. This real-data simulation study assessed the efficiency of binary programming for optimal item selection by comparing how closely the obtained test information approximated different target information functions relative to a manual heuristic. The effects of imposing a content balancing constraint were studied in conventional, two-stage, and adaptive tests designed using the automated procedure. Results showed that the automated procedure improved upon the manual procedure significantly when a uniform target information function was used. However, when a peaked target information function was used, the improvement over the manual procedure was marginal. Both procedures were affected by the distribution of the item parameters in the item pool. The degree to which the examinees' empirical scores were recovered was lower when a content balancing constraint was imposed in the conventional test designs. The effect of an uneven item parameter distribution in the item pool was shown by the poorer recovery of empirical scores at the higher regions of the ability continuum. Two-stage tests were shown to limit the effects of content balancing. Content-balanced adaptive tests using optimal item selection were shown to be efficient in empirical score recovery, especially in maintaining equiprecision of measurement over a wide ability range despite the imposition of the content balancing constraint in the test design. The study has implications for implementing automated test designs in school systems supported by hardware and expertise in measurement theory, and it addresses the issue of content balancing using optimal test designs within an adaptive testing framework.
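Binary programming item selection of this kind can be posed as a small integer program. The sketch below is a simplified single-point version, maximizing information at one ability value under length and content constraints, rather than fitting a full target information function (which requires additional deviation variables). It assumes SciPy's milp solver (SciPy 1.9+) and invented item parameters.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

rng = np.random.default_rng(3)
n_items, form_len = 60, 20
a = rng.uniform(0.5, 2.0, n_items)               # invented 2PL parameters
b = rng.normal(0.0, 1.0, n_items)
content = rng.integers(0, 3, n_items)            # three content areas

theta0 = 0.0                                     # ability point to target
p = 1.0 / (1.0 + np.exp(-a * (theta0 - b)))
info = a**2 * p * (1 - p)                        # item information at theta0

# Maximize information at theta0 (milp minimizes, so negate the objective)
c = -info
constraints = [
    # exactly form_len items selected
    LinearConstraint(np.ones((1, n_items)), form_len, form_len),
    # at least 5 items from each content area
    LinearConstraint(np.vstack([(content == k).astype(float)
                                for k in range(3)]), 5, np.inf),
]
res = milp(c, constraints=constraints,
           integrality=np.ones(n_items), bounds=Bounds(0, 1))
chosen = np.flatnonzero(res.x > 0.5)
print(f"selected {len(chosen)} items, info at theta={theta0}: "
      f"{info[chosen].sum():.2f}")
print("content counts:", np.bincount(content[chosen], minlength=3))
```

Targeting several ability points, or a whole target curve, turns the objective into minimizing weighted deviations from the target, but the binary decision variables and content constraints stay the same.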
80

An inquiry into the effect of positive and negative expectancy on hypnotizability as measured on the Stanford Hypnotic Susceptibility Scale: Form A

Langdell, Sarah 01 January 1993 (has links)
The purpose of this study was to detect the influence, if any, of high or low expectancy with regard to hypnotizability on the part of the hypnotist and the subject. The result was measured by the subject's score on the SHSS:A. The time each subject took to complete the SHSS:A was also recorded. Data were analyzed using a 2 x 2 analysis of variance (ANOVA) with experimenter expectancy (high vs. low) and subject expectancy (high vs. low) as variables (as shown in table 4.1). Two measures were examined: the time taken to complete the SHSS:A and the score received. Since individual experimenters may differ in their administration of the SHSS:A even with safeguards to ensure uniformity, possible differences in experimenter performance were examined in a one-way ANOVA with experimenter as the variable (3 levels). There were no significant differences between the scores of any of the subject groups, and no interaction was found between any of them. There was a significant result for time taken: high subject-expectancy (SE) subjects completed the scale in less time (36.438 min.) than low subject-expectancy subjects (41.471 min.). The primary result does not support the contention that hypnotizability as measured on the SHSS:A is significantly affected by the expectations of either the subject or the hypnotist. The secondary result indicates a significant effect, for subjects who were told that they were highly hypnotizable, on a variable not directly measured by the SHSS:A, i.e., time. That may be the result of an interaction between those subjects and the hypnotists: the subjects may have communicated their heightened belief in their hypnotizability to the hypnotists in subtle ways, enabling the hypnotists to deliver the hypnotizability test more quickly.
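The 2 x 2 factorial analysis described above is straightforward to reproduce in outline. The sketch below fabricates cell data with a small subject-expectancy effect on completion time, mirroring the reported pattern, and fits the model with statsmodels; all values and the cell size are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(4)
n_per_cell = 12                         # hypothetical cell size
rows = []
for hyp_exp in ("high", "low"):         # hypnotist expectancy
    for subj_exp in ("high", "low"):    # subject expectancy
        # build in a modest subject-expectancy effect on time only
        base = 38.0 - (2.5 if subj_exp == "high" else 0.0)
        for _ in range(n_per_cell):
            rows.append({"hyp_exp": hyp_exp, "subj_exp": subj_exp,
                         "minutes": base + rng.normal(0, 4)})
df = pd.DataFrame(rows)

# 2 x 2 ANOVA: main effects of each expectancy factor plus interaction
model = smf.ols("minutes ~ C(hyp_exp) * C(subj_exp)", data=df).fit()
print(anova_lm(model, typ=2))
```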
