111
CT3 as an Index of Knowledge Domain Structure: Distributions for Order Analysis and Information Hierarchies. Swartz Horn, Rebecca, 12 1900.
The problem with which this study is concerned is articulating all possible CT3 and KR21 reliability measures for every case of a 5x5 binary matrix (32,996,500 possible matrices). The study has three purposes. The first purpose is to calculate CT3 for every matrix and compare the results to the proposed optimum range of .3 to .5. The second purpose is to compare the results of the KR21 and CT3 reliability calculations. The third purpose is to calculate CT3 and KR21 on every strand of a class test whose item set has been reduced using the difficulty strata identified by Order Analysis. The study was conducted by writing a computer program to articulate all possible 5x5 matrices and to calculate the CT3 and KR21 reliability measures for each matrix. The nonparametric technique of Order Analysis was applied to two sections of test items to stratify the items into difficulty levels, which were used to reduce the item set from 22 to 9 items. All possible strands, or chains, of these items were identified so that both reliability measures (CT3 and KR21) could be calculated. One major finding of this study is that .3 to .5 is a desirable range for CT3 (cumulative p=.86 to p=.98) when cumulative frequencies are measured. A second major finding is that the KR21 reliability measure produced an invalid result more than half the time. The last major finding is that CT3, rescaled to range between 0 and 1, supports DeVellis' guidelines for reliability measures. The major conclusion is that CT3 is the better measure of reliability because it considers both inter- and intra-item variances.
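A minimal sketch of the kind of exhaustive computation the study describes, assuming persons are rows and items are columns: it enumerates small binary score matrices and applies the standard KR-21 formula to each. The CT3 formula is not given in the abstract, so only KR-21 is shown, and the demo uses a 3x3 space so it runs instantly; the study's 5x5 space is far larger.

```python
# Sketch (not the author's program): enumerate binary score matrices and
# compute KR-21 for each one.
from itertools import product
import numpy as np

def kr21(X):
    """KR-21 reliability for a persons-by-items 0/1 matrix."""
    k = X.shape[1]                        # number of items
    totals = X.sum(axis=1)                # total score per person
    mean = totals.mean()
    var = totals.var(ddof=1)              # sample variance; some sources use the population variance
    if var == 0:
        return np.nan                     # undefined ("invalid") case
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var))

# Demo on the 2**9 = 512 possible 3x3 matrices; the study enumerated 5x5 matrices.
values = []
for bits in product([0, 1], repeat=9):
    X = np.array(bits).reshape(3, 3)
    values.append(kr21(X))
values = np.array(values)
print("share of undefined KR-21 values:", round(float(np.mean(np.isnan(values))), 3))
```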
112
Breaking Free from the Limitations of Classical Test Theory: Developing and Measuring Information Systems Scales Using Item Response Theory. Rusch, Thomas; Lowry, Paul Benjamin; Mair, Patrick; Treiblmaier, Horst, 03 1900.
Information systems (IS) research frequently uses survey data to measure the interplay between technological systems and human beings. Researchers have developed sophisticated procedures to build and validate multi-item scales that measure latent constructs. The vast majority of IS studies use classical test theory (CTT), but this approach suffers from three major theoretical shortcomings: (1) it assumes a linear relationship between the latent variable and observed scores, which rarely represents the empirical reality of behavioral constructs; (2) the true score either cannot be estimated directly or can be estimated only under assumptions that are difficult to meet; and (3) parameters such as reliability, discrimination, location, or factor loadings depend on the sample being used. To address these issues, we present item response theory (IRT) as a collection of viable alternatives for measuring continuous latent variables by means of categorical indicators (i.e., measurement variables). IRT offers several advantages: (1) it assumes nonlinear relationships; (2) it allows more appropriate estimation of the true score; (3) it can estimate item parameters independently of the sample being used; (4) it allows the researcher to select items that are in accordance with a desired model; and (5) it applies and generalizes concepts such as reliability and internal consistency, and thus allows researchers to derive more information about the measurement process. We use a CTT approach as well as Rasch models (a special class of IRT models) to demonstrate how a scale for measuring hedonic aspects of websites is developed under both approaches. The results illustrate how IRT can be successfully applied in IS research and provide better scale results than CTT. We conclude by explaining the circumstances in which applying IRT is most appropriate, as well as the limitations of IRT.
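A small sketch of the nonlinearity argument above, assuming the Rasch model the authors use: the probability of endorsing an item is an S-shaped, not linear, function of the latent trait, and so is the expected scale score. The item difficulties are invented for illustration.

```python
# Sketch: Rasch item response function vs. CTT's linear assumption.
import numpy as np

def rasch_prob(theta, b):
    """P(X = 1 | theta) under the Rasch model with item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta = np.linspace(-4, 4, 9)
difficulties = [-1.0, 0.0, 1.5]            # hypothetical item locations
for b in difficulties:
    print(f"b = {b:+.1f}:", np.round(rasch_prob(theta, b), 2))

# The expected scale (true) score is the sum of the item probabilities,
# an S-shaped rather than linear function of theta.
true_score = sum(rasch_prob(theta, b) for b in difficulties)
print("expected scale score:", np.round(true_score, 2))
```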
113
O Desempenho em Matemática do ENEM de 2012 em Luis Eduardo Magalhães (BA), na Teoria de Resposta ao Item [Mathematics Performance on the 2012 ENEM in Luis Eduardo Magalhães (BA), under Item Response Theory]. Oliveira, Leandro Santana, 06 July 2017.
Student performance in mathematics on the ENEM examination is the central topic of this work. With the changes made to the ENEM in 2009, Item Response Theory (TRI, Teoria de Resposta ao Item) came to be used to construct and score the test, providing more reliable results and giving students a more meaningful response than a simple count of right and wrong answers. The present work analyzes student performance on the 2012 ENEM in the city of Luis Eduardo Magalhães (BA), comparing it with the results of the same test for participants from the entire state of Bahia. Ten questions and their rates of correct and incorrect answers were analyzed, making it possible to assess student performance on the test even without the TRI item parameters. The results of the research point toward actions in basic education, delivered as educational products such as study programs and other means, committed to raising the performance of the high school students who take the ENEM. The research also indicates how its analysis could advance if the TRI parameters were obtained: they are not in the public domain and are not easy to locate or access at the Ministry of Education, and the specific software needed to work with them is likewise not readily accessible. Access to these parameters would do much to clarify this assessment throughout Brazil and, consequently, help raise student performance indices, above all in mathematics in Luis Eduardo Magalhães (BA).
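A brief sketch of the two layers mentioned above, under assumed data: the classical proportion-correct analysis that is possible without the TRI parameters, and the three-parameter logistic (3PL) item characteristic curve on which ENEM's TRI scoring rests. The response matrix and item parameters are fabricated for illustration, not ENEM values.

```python
# Sketch: classical item difficulty (proportion correct) and the 3PL curve.
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(200, 10))   # 200 examinees x 10 items (fake data)
p_values = responses.mean(axis=0)                # classical difficulty per item
print("proportion correct per item:", np.round(p_values, 2))

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: discrimination a, difficulty b, guessing c."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

print("P(correct) at theta = 0 for a hypothetical item:",
      round(icc_3pl(0.0, a=1.2, b=0.5, c=0.2), 3))
```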
114
Can a computer adaptive assessment system determine, better than traditional methods, whether students know mathematics skills? Whorton, Skyler, 19 April 2013.
Schools use commercial systems specifically for mathematics benchmarking and longitudinal assessment. However, these systems are expensive, and their results often fail to indicate a clear path for teachers to differentiate instruction based on students' individual strengths and weaknesses in specific skills. ASSISTments is a web-based Intelligent Tutoring System used by educators to drive real-time, formative assessment in their classrooms. The software is used primarily by mathematics teachers to deliver homework, classwork and exams to their students. We have developed a computer adaptive test called PLACEments as an extension of ASSISTments to allow teachers to perform individual student assessment and, by extension, school-wide benchmarking. PLACEments uses a form of graph-based knowledge representation by which the exam results identify the specific mathematics skills that each student lacks. The system additionally provides differentiated practice determined by the students' performance on the adaptive test. In this project, we describe the design and implementation of PLACEments as a skill assessment method and evaluate it in comparison with a fixed-item benchmark.
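A toy sketch of the graph-based idea described above, under an invented skill graph and a made-up traversal rule (this is not PLACEments' actual item-selection logic): when a student misses an item on a skill, the assessment walks that skill's prerequisite edges to find where the gap begins.

```python
# Sketch: follow prerequisite edges from a missed skill to candidate probes.
from collections import deque

# hypothetical prerequisite graph: skill -> list of prerequisite skills
prereqs = {
    "two-step equations": ["one-step equations"],
    "one-step equations": ["integer arithmetic"],
    "integer arithmetic": [],
}

def skills_to_probe(failed_skill):
    """Breadth-first walk over the prerequisites of a skill the student missed."""
    queue, seen = deque([failed_skill]), set()
    while queue:
        skill = queue.popleft()
        if skill in seen:
            continue
        seen.add(skill)
        queue.extend(prereqs.get(skill, []))
    return seen

# The failed skill plus every prerequisite beneath it, each of which the
# adaptive test would target with its own items.
print(skills_to_probe("two-step equations"))
```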
115
The effectiveness of automatic item generation for the development of cognitive ability tests. Loe, Bao Sheng, January 2019.
Research has shown that the increased use of computer-based testing has brought new challenges. With the ease of online test administration, a large number of items is necessary to maintain the item bank and minimise the exposure rate. However, the traditional item development process is time-consuming and costly, so alternative ways of creating items are needed to improve it. Automatic Item Generation (AIG) is an effective method for generating items rapidly and efficiently: it uses algorithms to create questions for testing purposes. However, many of these generators are closed, available only to a select few. There is a lack of open-source, publicly available generators that researchers can use to study AIG in greater depth and to generate items for their research. Furthermore, research has indicated that AIG is far from fully understood, and more research into its methodology and into the psychometric properties of the items created by the generators is needed for it to be used effectively. The studies conducted in this thesis achieved the following: 1) Five open-source item generators were created, and the items were evaluated and validated. 2) Empirical evidence showed that using a weak theory approach to develop item generators was just as credible as using a strong theory approach, even though the two are theoretically distinct. 3) The psychometric properties of the generated items were estimated using various IRT models to assess the impact of the template features used to create the items. 4) Joint modelling of responses and response times was employed to provide new insights into cognitive processes beyond those obtained from typical IRT models. This thesis suggests that AIG provides a tangible solution for improving the item development process for content generation and reducing the procedural cost of generating a large number of items, with the possibility of a unified approach to test administration (i.e. adaptive item generation). Nonetheless, this thesis focused on rule-based algorithms. The application of other forms of item generation and the potential for measuring the intelligence of artificial general intelligence (AGI) are discussed in the final chapter, which proposes that AIG techniques create new opportunities as well as challenges for researchers that will redefine the assessment of intelligence.
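A minimal sketch of rule-based (template) item generation in the spirit described above: a parent item with variable slots, instantiated by an algorithm under simple constraints. The template, constraints, and distractor rules are invented examples, not the thesis's five generators.

```python
# Sketch: a template-based generator for number-series items.
import random

def generate_number_series_item(rng):
    """Instantiate a number-series template from a simple additive rule."""
    start = rng.randint(1, 9)
    step = rng.randint(2, 6)
    series = [start + i * step for i in range(4)]
    answer = series[-1] + step
    distractors = {answer + d for d in (-step, 1, step + 1)} - {answer}
    options = sorted(distractors | {answer})
    stem = f"What number comes next: {', '.join(map(str, series))}, ... ?"
    return {"stem": stem, "options": options, "key": answer}

rng = random.Random(42)
for item in (generate_number_series_item(rng) for _ in range(3)):
    print(item)
```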
116
Rural Opioid and Other Drug Use Disorder Diagnosis: Assessing Measurement Invariance and Latent Classification of DSM-IV Abuse and Dependence Criteria. Brooks, Billy, 01 August 2015.
The rates of non-medical prescription drug use in the United States (U.S.) have increased dramatically in the last two decades, leading to a more than 300% increase in deaths from overdose, which has surpassed motor vehicle accidents as the leading cause of injury death. In rural areas, deaths from unintentional overdose have increased by more than 250% since 1999, while urban deaths have increased at a fraction of this rate. The objective of this research was to test the hypothesis that cultural, economic, and environmental factors prevalent in rural America affect the rate of substance use disorder (SUD) in that population, and that diagnosis of these disorders may not generalize across rural and urban populations because of these same effects. This study applies measurement invariance and factor analysis techniques: item response theory (IRT), multiple indicators, multiple causes (MIMIC) modeling, and latent class analysis (LCA), to the DSM-IV abuse and dependence diagnostic instrument. The sample was a population of adult past-year illicit drug users living in a rural or urban area, drawn from the 2011-2012 National Survey on Drug Use and Health data files (N = 3,369 for analyses 1 and 2; N = 12,140 for analysis 3). Results of the IRT and MIMIC analyses indicated no significant variance in DSM item function across rural and urban sub-groups; however, several socio-demographic variables, including age, race, income, and gender, were associated with bias in the instrument. Latent class structures differed across the sub-groups in quality and number, with the rural sample fitting a 3-class structure and the urban sample a 6-class model. Overall, the rural class structure exhibited less diversity and lower prevalence of SUD in multiple drug categories (e.g., cocaine, hallucinogens, and stimulants). This result suggests underlying elements affecting SUD patterns differently in the two populations. These findings inform the development of surveillance instruments, clinical services, and public health programming tailored to specific communities.
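A bare-bones sketch of the latent class analysis step named above, written as an EM algorithm for binary diagnostic criteria on simulated 0/1 data. It is a generic illustration of LCA, not the NSDUH analysis or the software the study used.

```python
# Sketch: EM for a latent class model with binary indicators.
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 7))      # 500 respondents x 7 DSM-like criteria (fake)

def fit_lca(X, n_classes, n_iter=200):
    n, j = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)          # class proportions
    p = rng.uniform(0.2, 0.8, size=(n_classes, j))    # P(criterion met | class)
    for _ in range(n_iter):
        # E-step: posterior class responsibilities for each respondent
        log_lik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(pi)
        resp = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update class proportions and conditional endorsement probabilities
        pi = resp.mean(axis=0)
        p = np.clip((resp.T @ X) / resp.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    return pi, p

pi, p = fit_lca(X, n_classes=3)
print("class proportions:", np.round(pi, 3))
```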
117
A comparison of fixed item parameter calibration methods and reporting score scales in the development of an item pool. Chen, Keyu, 01 August 2019.
The purposes of the study were to compare the relative performance of three fixed item parameter calibration (FIPC) methods in item and ability parameter estimation and to examine how the ability estimates obtained from these different methods affect interpretations based on reported scales of different lengths.
Through a simulation design, the study was divided into two stages. The first stage was the calibration stage, in which the parameters of pretest items were estimated. This stage investigated the accuracy of item parameter estimates and the recovery of the underlying ability distributions for different sample sizes, different numbers of pretest items, and different types of ability distributions under the three-parameter logistic (3PL) model. The second stage was the operational stage, in which the estimated parameters of the pretest items were placed on operational forms and used to score examinees. This stage investigated the effect that item parameter estimation had on ability estimation and reported scores for the new test forms.
It was found that the item parameters estimated with the three FIPC methods showed subtle differences, but the results of the DeMars method were closer to those of separate calibration with linking than to those of FIPC with simple-prior update and FIPC with iterative-prior update, while the latter two methods performed similarly. Regarding the experimental factors manipulated in the simulation, the study found that sample size influenced the estimation of item parameters. The effect of the number of pretest items on item parameter estimation was strong but ambiguous, likely because it was confounded by changes in both the number and the characteristics of the pretest items across item sets. The effect of the ability distributions on item parameter estimation was not as evident as the effects of the other two factors.
After the pretest items were calibrated, the parameter estimates of these items were put into operational use. The abilities of the examinees were then estimated based on their responses to the existing operational items and the new items (previously the pretest items), whose parameters had been estimated under different conditions. The study found high correlations between the ability estimates and the true abilities of the examinees when forms contained pretest items calibrated using any of the three FIPC methods. The results suggested that all three FIPC methods were similarly competent in estimating item parameters, leading to satisfactory estimation of the examinees' abilities. When considering the scale scores, because the estimated abilities were very similar, there were only small differences among the scaled scores on the same scale; the relative frequency of examinees classified into performance categories and the classification consistency index also showed that the interpretations of reported scores across scales were similar.
The study provided a comprehensive comparison of the use of FIPC methods in parameter estimation, in the hope of helping practitioners choose among the methods according to the needs of their testing programs. When ability estimates were linearly transformed into scale scores, the lengths of the scales did not affect the statistical properties of the scores; however, they may affect how the scores are subjectively perceived by stakeholders and should therefore be chosen carefully.
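A short sketch of the reporting step just described, under assumed scale ranges and cut points (they are not the study's operational values): the same ability estimates are linearly transformed onto reported scales of different lengths and then classified into performance categories.

```python
# Sketch: linear transformation of theta estimates to reported scale scores.
import numpy as np

rng = np.random.default_rng(7)
theta = rng.normal(0, 1, size=1000)                 # estimated abilities (simulated)

def to_scale(theta, lo, hi, theta_lo=-4.0, theta_hi=4.0):
    """Linear transformation of theta onto a reported score scale [lo, hi]."""
    slope = (hi - lo) / (theta_hi - theta_lo)
    return np.clip(np.round(lo + slope * (theta - theta_lo)), lo, hi)

for lo, hi in [(1, 30), (100, 500)]:                # two scale lengths
    scores = to_scale(theta, lo, hi)
    cuts = np.quantile(scores, [0.25, 0.50, 0.75])  # hypothetical category cut points
    categories = np.digitize(scores, cuts)
    proportions = np.bincount(categories, minlength=4) / len(scores)
    print(f"scale {lo}-{hi}: category proportions {np.round(proportions, 3)}")
```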
118
Effect of Violating Unidimensional Item Response Theory Vertical Scaling Assumptions on Developmental Score Scales. Topczewski, Anna Marie, 01 July 2013.
Developmental score scales represent the performance of students along a continuum: as students learn more, they move higher along that continuum. Unidimensional item response theory (UIRT) vertical scaling has become a commonly used method for creating developmental score scales. Research has shown that UIRT vertical scaling methods can be inconsistent in estimating the grade-to-grade growth, within-grade variability, and separation of grade distributions (effect size) of developmental score scales. In particular, the finding of scale shrinkage (decreasing within-grade score variability as grade level increases) has led to concerns about and criticism of IRT vertical scales. The causes of scale shrinkage have yet to be fully understood, and real test data and simulation studies have been unable to provide complete answers as to why IRT vertical scaling inconsistencies occur. Violations of assumptions are a commonly cited potential cause of the inconsistent results. For this reason, this dissertation is an extensive investigation into how violations of the three assumptions of UIRT vertical scaling - local item independence, unidimensionality, and similar reliability of grade-level tests - affect estimated developmental score scales.
Simulated tests were developed that purposefully violated a single UIRT vertical scaling assumption. Three sets of simulated tests were created to test the effect of violating each assumption. First, simulated tests were created with increasing, decreasing, low, medium, and high local item dependence. Second, multidimensional simulated tests were created by varying the correlation between dimensions. Third, simulated tests with dissimilar reliability were created by varying the item parameter characteristics of the grade-level tests. Multiple versions of twelve simulated tests were used to investigate UIRT vertical scaling assumption violations. The simulated tests were calibrated under the UIRT model even though they purposefully violated an assumption of UIRT vertical scaling. Each simulated test version was replicated for 1000 random examinee samples to assess the bias and standard error of the estimated grade-to-grade growth, within-grade variability, and separation of grade distributions (effect size) of the estimated developmental score scales.
The results suggest that when UIRT vertical scaling assumptions are violated, the resulting estimated developmental score scales contain standard error and bias. In this study, the magnitude of the standard error was similar across all simulated tests regardless of the assumption violation. Bias, however, fluctuated with the type and magnitude of the UIRT vertical scaling assumption violation. More local item dependence resulted in more bias in grade-to-grade growth and in the separation of grade distributions, and it produced developmental score scales that displayed scale expansion. Multidimensionality resulted in more bias in grade-to-grade growth and in the separation of grade distributions when the correlation between dimensions was smaller, and it also produced scale expansion. Dissimilar reliability of grade-level tests resulted in more bias in grade-to-grade growth but minimal bias in the separation of grade distributions, and it produced scale expansion or scale shrinkage depending on the item characteristics of the test. Limitations of this study and future research are discussed.
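A tiny sketch of how the evaluation criteria named above are typically computed, assuming simulated replication results: bias and standard error of an estimated scale property (here, grade-to-grade growth) over repeated examinee samples. The true value and the replication estimates are fabricated.

```python
# Sketch: bias and standard error of an estimate across replications.
import numpy as np

rng = np.random.default_rng(3)
true_growth = 0.50                                            # assumed true grade-to-grade growth
estimates = true_growth + rng.normal(0.02, 0.05, size=1000)   # 1000 replications (fake)

bias = estimates.mean() - true_growth
standard_error = estimates.std(ddof=1)
print(f"bias = {bias:.3f}, standard error = {standard_error:.3f}")
```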
119
Simple structure MIRT equating for multidimensional tests. Kim, Stella Yun, 01 May 2018.
Equating is a statistical process used to establish score comparability so that scores from different test forms can be used interchangeably. One of the most widely used equating procedures is unidimensional item response theory (UIRT) equating, which requires a set of assumptions about the data structure. In particular, the essence of UIRT rests on the unidimensionality assumption, which requires that a test measure only a single ability. However, this assumption is unlikely to be satisfied for many real datasets, such as mixed-format tests or tests composed of several content subdomains, and failure to satisfy the assumption threatens the accuracy of the estimated equating relationships.
The main purpose of this dissertation was to contribute to the literature on multidimensional item response theory (MIRT) equating by developing a theoretical and conceptual framework for true-score equating using a simple-structure MIRT model (SS-MIRT). SS-MIRT has several advantages over more complex MIRT models, including improved estimation efficiency and straightforward interpretability.
In this dissertation, the performance of the SS-MIRT true-score equating procedure (SMT) was examined and evaluated through four studies using different data types: (1) real data, (2) simulated data, (3) pseudo forms data, and (4) intact single form data with identity equating. Besides SMT, four competitors were included in the analyses in order to assess the relative benefits of SMT over the other procedures: (a) equipercentile equating with presmoothing, (b) UIRT true-score equating, (c) UIRT observed-score equating, and (d) SS-MIRT observed-score equating.
In general, the proposed SMT procedure behaved similarly to the existing procedures. Also, SMT showed more accurate equating results compared to the traditional UIRT equating. Better performance of SMT over UIRT true-score equating was consistently observed across the three studies that employed different criterion relationships with different datasets, which strongly supports the benefit of a multidimensional approach to equating with multidimensional data.
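A compact sketch of the unidimensional IRT true-score equating that SMT is compared against, assuming made-up Rasch difficulties rather than the dissertation's forms: invert Form X's test characteristic curve (TCC) to find the theta corresponding to a given true score, then evaluate Form Y's TCC at that theta.

```python
# Sketch: UIRT true-score equating via test characteristic curves.
import numpy as np
from scipy.optimize import brentq

form_x_b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])   # hypothetical Rasch difficulties
form_y_b = np.array([-1.0, -0.5, 0.2, 0.8, 1.2])

def tcc(theta, b):
    """Test characteristic curve: expected number-correct score at theta."""
    return np.sum(1.0 / (1.0 + np.exp(-(theta - b))))

def equate(score_x):
    """Map a Form X true score to the equivalent Form Y true score."""
    theta = brentq(lambda t: tcc(t, form_x_b) - score_x, -8, 8)
    return tcc(theta, form_y_b)

for s in [1.5, 2.5, 3.5]:
    print(f"Form X true score {s} -> Form Y {equate(s):.2f}")
```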
120
Improving the Transition Readiness Assessment Questionnaire (TRAQ) using Item Response Theory. Wood, David L.; Johnson, Kiana R.; McBee, Matthew, 01 January 2017.
Background:
Measuring the acquisition of self-management and health care utilization skills is part of evidence-based transition practice. The Transition Readiness Assessment Questionnaire (TRAQ) is a validated 20-question, 5-factor instrument with a 5-point Likert response set based on a Stages of Change framework.
Objective:
To improve the performance of the TRAQ and allow more precise measurement across the full range of transition readiness skills (from precontemplation to initiation to mastery).
Design/Methods:
Using data from 506 previously completed TRAQs collected from several clinical practices, we applied a graded response model (GRM) in Mplus v7.4, examining item discrimination and difficulty. New questions were written and added across all domains to increase the difficulty and discrimination of the overall scale. To evaluate the performance of the new items and the resulting factor structure of the revised scale, we fielded a new version of the TRAQ (with a total of 30 items) in an online anonymous survey of first-year college students (in process).
Results:
We eliminated the five least discriminating TRAQ items with minimal impact on the conditional test information. After item elimination (k = 15), the factor structure of the instrument was maintained with good fit, χ2(86) = 365.447, CFI = 0.977, RMSEA = 0.079, WRMR = 1.017. We also found that a majority of items could reliably discriminate only across lower levels of transition readiness (precontemplation to initiation) but could not discriminate at higher levels (action and mastery). Therefore, we wrote 15 additional items intended to have higher difficulty. For the new 30-item TRAQ, confirmatory factor analysis, internal reliability, and IRT results will be reported from a large sample of college students.
Conclusion(s):
Using IRT and factor analyses, we eliminated 5 of 20 TRAQ items that were poorly discriminating. We found that many of the items in the TRAQ could discriminate among respondents in the early stages of transition readiness but could not discriminate among those in later stages. To obtain a more robust measure of transition readiness, we added more difficult items and are evaluating the revised scale's psychometric properties.
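A small sketch of the graded response model referenced above, applied to a 5-point Likert item like the TRAQ's: cumulative boundary curves yield the probability of each ordered category at a given trait level. The discrimination and threshold values are hypothetical, not TRAQ estimates.

```python
# Sketch: category probabilities under Samejima's graded response model.
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for an ordered item under the GRM."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(thresholds))))  # P(X >= k), k = 1..K-1
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]                                          # P(X = k), k = 0..K-1

# one hypothetical 5-category item
a, thresholds = 1.8, [-2.0, -0.8, 0.3, 1.5]
for theta in (-2.0, 0.0, 2.0):
    probs = grm_category_probs(theta, a, thresholds)
    print(f"theta = {theta:+.1f}:", np.round(probs, 2))
```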