11. The Impact of Rater Variability on Relationships among Different Effect-Size Indices for Inter-Rater Agreement between Human and Automated Essay Scoring
Unknown Date.
Since researchers began investigating automated scoring systems for writing assessments, they have examined relationships between human and machine scoring and proposed evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of and relationships among indices for inter-rater agreement used to assess the relatedness of human and automated essay scoring, and to examine the impact of rater variability on inter-rater agreement. My study consists of two parts: an empirical study and a simulation study. Based on the results from the empirical study, the overall effects for inter-rater agreement were .63 and .99 for exact and adjacent proportions of agreement, .48 for kappas, and between .75 and .78 for correlations. Additionally, significant differences existed between 6-point scales and the other scales (i.e., 3-, 4-, and 5-point scales) for correlations, kappas, and proportions of agreement. Moreover, based on the results of the simulated data, the highest agreement and lowest discrepancy were achieved in the matched rater-distribution pairs. Specifically, the means of the exact and adjacent proportions of agreement, kappa and weighted kappa values, and correlations were .58, .95, .42, .78, and .78, respectively, while the average standardized mean difference was .0005 in the matched rater-distribution pairs. Acceptable values for inter-rater agreement as evaluation criteria for automated essay scoring, the impact of rater variability on inter-rater agreement, and relationships among inter-rater agreement indices are discussed.
A Dissertation submitted to the Department of Educational Psychology and Learning Systems in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Fall Semester 2017. November 10, 2017. Keywords: Automated Essay Scoring, Inter-Rater Agreement, Meta-Analysis, Rater Variability. Includes bibliographical references. Committee: Betsy Jane Becker, Professor Directing Dissertation; Fred Huffer, University Representative; Insu Paek, Committee Member; Qian Zhang, Committee Member.
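The indices named in this abstract are standard and straightforward to compute. The following Python sketch, using simulated human and machine scores on a hypothetical 6-point scale rather than the study's data, shows one way to obtain exact and adjacent proportions of agreement, Cohen's kappa, a quadratic-weighted kappa, the Pearson correlation, and the standardized mean difference.

# A minimal sketch (not the dissertation's code) of the agreement indices
# discussed above, computed for simulated human and machine scores.
import numpy as np

def agreement_indices(human, machine, categories):
    human, machine = np.asarray(human), np.asarray(machine)
    n = len(human)
    exact = np.mean(human == machine)                  # exact proportion of agreement
    adjacent = np.mean(np.abs(human - machine) <= 1)   # adjacent agreement (within one point)

    # Joint score distribution for kappa and quadratic-weighted kappa.
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    table = np.zeros((k, k))
    for h, m in zip(human, machine):
        table[idx[h], idx[m]] += 1
    table /= n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0))
    weights = np.array([[(i - j) ** 2 for j in range(k)] for i in range(k)]) / (k - 1) ** 2
    kappa = 1 - (1 - np.trace(table)) / (1 - np.trace(expected))
    weighted_kappa = 1 - np.sum(weights * table) / np.sum(weights * expected)

    r = np.corrcoef(human, machine)[0, 1]              # Pearson correlation
    pooled_sd = np.sqrt((human.var(ddof=1) + machine.var(ddof=1)) / 2)
    smd = (machine.mean() - human.mean()) / pooled_sd  # standardized mean difference
    return dict(exact=exact, adjacent=adjacent, kappa=kappa,
                weighted_kappa=weighted_kappa, correlation=r, smd=smd)

rng = np.random.default_rng(0)
human = rng.integers(1, 7, size=500)                            # human scores, 1-6
machine = np.clip(human + rng.integers(-1, 2, size=500), 1, 6)  # machine scores near human
print(agreement_indices(human, machine, categories=list(range(1, 7))))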
12. A comparison of low level readers' success in taking oral tests versus printed tests in the level one intermediate science curriculum study classroom
Kugler, Douglas Kent. January 2010.
Digitized by Kansas Correctional Industries
13. On the Use of Covariates in a Latent Class Signal Detection Model, with Applications to Constructed Response Scoring
Wang, Zijian Gerald. January 2012.
A latent class signal detection (SDT) model was recently introduced as an alternative to traditional item response theory (IRT) methods for the analysis of constructed response data. This class of models can be represented as restricted latent class models and differs from the IRT approach in how the latent construct is conceptualized. One appeal of the signal detection approach is that it provides an intuitive framework from which the psychological processes governing rater behavior can be better understood. The present study developed an extension of the latent class SDT model to include covariates and examined the performance of the resulting model. Covariates can be incorporated into the latent class SDT model in three ways: 1) to affect latent class membership, 2) to affect conditional response probabilities, and 3) to affect both latent class membership and conditional response probabilities. In each case, simulations were conducted to investigate both parameter recovery and classification accuracy of the extended model under two competing rater designs; in addition, the implications of ignoring covariate effects and of covariate misspecification were explored. The ability of information criteria, namely the AIC, the small-sample adjusted AIC, and the BIC, to recover the true model with respect to how covariates are introduced was also examined. Results indicate that parameters were generally well recovered in fully crossed designs; to obtain similar levels of estimation precision in incomplete designs, sample size requirements were comparatively higher and depended on the number of indicators used. When covariate effects were not accounted for or were misspecified, parameter estimates tended to be severely biased, which in turn reduced classification accuracy. With respect to model recovery, the BIC performed the most consistently among the information criteria considered. In light of these findings, recommendations were made regarding sample size requirements and model-building strategies when implementing the extended latent class SDT model.
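For readers unfamiliar with the latent class SDT framework, the simplified Python sketch below illustrates its core idea: each response occupies one of several latent classes, a rater perceives a noisy signal centered at the class location, and ordered criteria map that perception onto a score category. All parameter values are invented for illustration and are not estimates from this study.

# A simplified sketch in the spirit of the latent class SDT rater model.
import numpy as np
from scipy.stats import norm

class_locations = np.array([0.0, 1.5, 3.0, 4.5])   # latent class locations (d)
criteria = np.array([0.75, 2.25, 3.75])            # one rater's ordered criteria (cutpoints)
class_probs = np.array([0.2, 0.3, 0.3, 0.2])       # latent class membership probabilities

def response_probs(d, crit):
    """P(score category | latent class) under the SDT representation."""
    cuts = np.concatenate(([-np.inf], crit, [np.inf]))
    cdf = norm.cdf(cuts[:, None] - d[None, :])       # cutpoints x classes
    return np.diff(cdf, axis=0).T                    # rows: classes, columns: score categories

probs = response_probs(class_locations, criteria)
rng = np.random.default_rng(1)
true_class = rng.choice(len(class_probs), size=1000, p=class_probs)
scores = np.array([rng.choice(probs.shape[1], p=probs[c]) for c in true_class])
print(np.round(probs, 3))   # each row sums to 1
print(scores[:10])          # simulated scores from one rater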
14. Statistical Inference for Diagnostic Classification Models
Xu, Gongjun. January 2013.
Diagnostic classification models (DCMs) are an important recent development in educational and psychological testing. Instead of an overall test score, a diagnostic test provides each subject with a profile detailing the concepts and skills (often called "attributes") that he/she has mastered. Central to many DCMs is the so-called Q-matrix, an incidence matrix specifying the item-attribute relationship. It is common practice for the Q-matrix to be specified by experts when items are written, rather than through data-driven calibration. Such a non-empirical approach may lead to misspecification of the Q-matrix and substantial lack of model fit, resulting in erroneous interpretation of testing results. This motivates our study, and we consider the identifiability, estimation, and hypothesis testing of the Q-matrix. In addition, we study the identifiability of diagnostic model parameters under a known Q-matrix. The first part of this thesis is concerned with estimation of the Q-matrix. In particular, we present definitive answers to the learnability of the Q-matrix for one of the most commonly used models, the DINA model, by specifying a set of sufficient conditions under which the Q-matrix is identifiable up to an explicitly defined equivalence class. We also present the corresponding data-driven construction of the Q-matrix. The results and analysis strategies are general in the sense that they can be further extended to other diagnostic models. The second part of the thesis focuses on statistical validation of the Q-matrix. The purpose of this study is to provide a statistical procedure to help decide whether to accept the Q-matrix provided by the experts. Statistically, this problem can be formulated as a pure significance testing problem with null hypothesis H0: Q = Q0, where Q0 is the candidate Q-matrix. We propose a test statistic that measures the consistency of observed data with the proposed Q-matrix. Theoretical properties of the test statistic are studied. In addition, we conduct simulation studies to show the performance of the proposed procedure. The third part of this thesis is concerned with the identifiability of the diagnostic model parameters when the Q-matrix is correctly specified. Identifiability is a prerequisite for statistical inference, such as parameter estimation and hypothesis testing. We present sufficient and necessary conditions under which the model parameters are identifiable from the response data.
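Because the abstract centers on the Q-matrix and the DINA model, a small worked example may help fix ideas. The sketch below uses an invented Q-matrix and invented slip and guess parameters, not values from the thesis, to show how an attribute profile determines correct-response probabilities under the DINA model.

# A minimal DINA-model sketch: the Q-matrix and an attribute profile jointly
# determine each item's correct-response probability. Values are illustrative.
import numpy as np

Q = np.array([[1, 0, 0],      # item 1 requires attribute 1
              [0, 1, 0],      # item 2 requires attribute 2
              [1, 1, 0],      # item 3 requires attributes 1 and 2
              [0, 1, 1]])     # item 4 requires attributes 2 and 3
slip = np.array([0.10, 0.15, 0.20, 0.10])   # P(incorrect | all required attributes mastered)
guess = np.array([0.20, 0.20, 0.10, 0.25])  # P(correct | some required attribute missing)

def dina_probs(alpha, Q, slip, guess):
    """P(correct) for one attribute profile alpha under the DINA model."""
    eta = np.all(alpha >= Q, axis=1).astype(float)   # ideal response: all required attributes mastered
    return (1 - slip) ** eta * guess ** (1 - eta)

alpha = np.array([1, 1, 0])                 # examinee masters attributes 1 and 2 only
print(dina_probs(alpha, Q, slip, guess))    # [0.9, 0.85, 0.8, 0.25]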
15. Dealing with Sparse Rater Scoring of Constructed Responses within a Framework of a Latent Class Signal Detection Model
Kim, Sunhee. January 2013.
In many assessment situations that use a constructed-response (CR) item, an examinee's response is evaluated by only one rater, which is called a single rater design. For example, in classroom assessment practice, only one teacher grades each student's performance. While single rater designs are the most cost-effective among all rater designs, the lack of a second rater causes difficulties with respect to how the scores should be used and evaluated. For example, one cannot assess rater reliability or rater effects when there is only one rater. The present study explores possible solutions for the issues that arise in sparse rater designs within the context of a latent class version of signal detection theory (LC-SDT) that has previously been used for rater scoring. This approach provides a model for rater cognition in CR scoring (DeCarlo, 2005; 2008; 2010) and offers measures of rater reliability and various rater effects. The following potential solutions to rater sparseness were examined: 1) the use of parameter restrictions to yield an identified model, 2) the use of informative priors in a Bayesian approach, and 3) the use of back readings (e.g., partially available second-rater observations), which are available in some large-scale assessments. Simulations and analyses of real-world data are conducted to examine the performance of these approaches. Simulation results showed that using parameter constraints allows one to detect various rater effects that are of concern in practice. The Bayesian approach also gave useful results, although estimation of some of the parameters was poor and the standard deviations of the parameter posteriors were large, except when the sample size was large. Using back-reading scores gave an identified model, and simulations showed that the results were generally acceptable in terms of parameter estimation, except for small sample sizes. The paper also examines the utility of the approaches as applied to the PIRLS USA reliability data. The results show some similarities and differences between parameter estimates obtained with posterior mode estimation and with Bayesian estimation. Sensitivity analyses revealed that rater parameter estimates are sensitive to the specification of the priors, as also found in the simulation results with smaller sample sizes.
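As a concrete picture of the designs discussed above, the toy sketch below builds a sparse response-by-rater score matrix for a single rater design in which a random subset of responses also receives a back reading from a second rater. The layout and values are illustrative only and do not reproduce any design from the study.

# A tiny sketch of a single rater design with back readings: each response has
# one operational score, and a fraction of responses get a second reading,
# yielding a sparse response-by-rater matrix (NaN = not scored by that rater).
import numpy as np

rng = np.random.default_rng(5)
n_resp, n_raters, back_rate = 12, 4, 0.25
scores = np.full((n_resp, n_raters), np.nan)

primary = rng.integers(0, n_raters, n_resp)        # operational rater for each response
for i, r in enumerate(primary):
    scores[i, r] = rng.integers(1, 5)              # operational score on a 1-4 scale
    if rng.random() < back_rate:                   # some responses get a back reading
        second = rng.choice([x for x in range(n_raters) if x != r])
        scores[i, second] = rng.integers(1, 5)

print(scores)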
16. An Item Response Theory Approach to Causal Inference in the Presence of a Pre-intervention Assessment
Marini, Jessica. January 2013.
This research develops a form of causal inference based on Item Response Theory (IRT) to combat the bias that occurs when existing causal inference methods are used under certain scenarios. When a pre-test is administered prior to a treatment decision, bias can occur in causal inferences about the decision's effect on the outcome. This new IRT-based method uses item-level information, treatment placement, and the outcome to produce estimates of each subject's ability in the chosen domain. Examining a causal inference research question in an IRT framework thus provides a model-based way to match subjects on estimates of their true ability. This model-based matching allows inferences to be made about a subject's performance as if they had been in the opposite treatment group. The IRT method is developed to address downfalls of existing methods, such as reliance on conditional independence between pre-test scores and outcomes. Using simulation, the IRT method is compared to existing methods under two different model scenarios in terms of Type I and Type II errors. The method's parameter recovery is then analyzed, followed by the accuracy of treatment-effect evaluation. The IRT method is shown to outperform existing methods in an ability-based scenario. Finally, the IRT method is applied to real data assessing the impact of advanced STEM in high school on a student's choice of major and is compared to existing alternative approaches.
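The following schematic sketch illustrates the general idea described above, though not the dissertation's actual procedure: ability is estimated from item-level pre-test responses under a 2PL IRT model, and treated subjects are then matched to controls on the estimated ability. Item parameters, responses, and the selection rule are all simulated for illustration.

# Sketch: 2PL EAP ability estimation followed by nearest-neighbor matching
# of treated subjects to controls on estimated ability.
import numpy as np

rng = np.random.default_rng(2)
n_items, n_subj = 20, 400
a = rng.uniform(0.8, 2.0, n_items)            # item discriminations
b = rng.normal(0.0, 1.0, n_items)             # item difficulties
theta = rng.normal(0.0, 1.0, n_subj)          # true abilities
X = rng.binomial(1, 1 / (1 + np.exp(-a * (theta[:, None] - b))))   # pre-test responses

def eap_theta(X, a, b, grid=np.linspace(-4, 4, 161)):
    """EAP ability estimates under a 2PL model with a standard normal prior."""
    p = 1 / (1 + np.exp(-a * (grid[:, None] - b)))          # grid points x items
    loglik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T    # subjects x grid points
    post = np.exp(loglik) * np.exp(-grid ** 2 / 2)          # likelihood x prior
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid

theta_hat = eap_theta(X, a, b)
treated = rng.binomial(1, 1 / (1 + np.exp(-theta)))         # treatment choice depends on ability
t_idx, c_idx = np.where(treated == 1)[0], np.where(treated == 0)[0]
# Nearest-neighbor match each treated subject to a control on estimated ability.
matches = c_idx[np.abs(theta_hat[t_idx][:, None] - theta_hat[c_idx]).argmin(axis=1)]
print(theta_hat[t_idx[:5]], theta_hat[matches[:5]])         # matched pairs have similar ability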
17. Examining the Impact of Examinee-Selected Constructed Response Items in the Context of a Hierarchical Rater Signal Detection Model
Patterson, Brian Francis. January 2013.
Research into the relatively rarely used examinee-selected item assessment designs has revealed certain challenges. This study aims to re-examine more comprehensively the key issues around examinee-selected items under a modern model for constructed-response scoring. Specifically, data were simulated under the hierarchical rater model with signal detection theory rater components (HRM-SDT; DeCarlo, Kim, & Johnson, 2011), and a variety of examinee item-selection mechanisms were considered. These conditions varied from the hypothetical baseline condition, where examinees choose randomly and with equal frequency from a pair of item prompts, to the perhaps more realistic and certainly more troublesome condition where examinees select items based on the very subject-area proficiency that the instrument intends to measure. While good examinee, item, and rater parameter recovery was apparent in the former condition for the HRM-SDT, serious issues with item and rater parameter estimation were apparent in the latter. Additional conditions were considered, as well as competing psychometric models for the estimation of examinee proficiency. Finally, practical implications of using examinee-selected item designs are given, as well as future directions for research.
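The troublesome selection mechanism identified above can be illustrated with a small simulation, shown below. It is not the HRM-SDT itself; it simply shows that when prompt choice depends on the proficiency being measured, each prompt's observed responses come from a systematically different subpopulation. All quantities are invented for the example.

# Illustration of proficiency-dependent item selection and the resulting
# self-selected subpopulations for each prompt.
import numpy as np

rng = np.random.default_rng(4)
theta = rng.normal(size=5000)                        # proficiency
b1, b2 = -0.5, 0.5                                   # prompt difficulties
p_choose_1 = 1 / (1 + np.exp(-theta))                # higher-ability examinees prefer prompt 1
choice = rng.binomial(1, p_choose_1)                 # 1 = prompt 1, 0 = prompt 2

p_correct = lambda b: 1 / (1 + np.exp(-(theta - b)))
score1 = rng.binomial(1, p_correct(b1))
score2 = rng.binomial(1, p_correct(b2))

# Observed (self-selected) proportions correct vs. what random assignment would give.
print("prompt 1: selected sample %.3f vs full population %.3f"
      % (score1[choice == 1].mean(), score1.mean()))
print("prompt 2: selected sample %.3f vs full population %.3f"
      % (score2[choice == 0].mean(), score2.mean()))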
18. Analyzing Hierarchical Data with the DINA-HC Approach
Zhang, Jianzhou. January 2015.
Cognitive Diagnostic Models (CDMs) are a class of models developed to diagnose the cognitive attributes of examinees. They have received increasing attention in recent years because of the need for more specific attribute- and item-related information. A particular cognitive diagnostic model, namely the hierarchical deterministic, input, noisy ‘and’ gate model with convergent attribute hierarchy (DINA-HC), is proposed to handle situations in which the attributes have a convergent hierarchy. Su (2013) first introduced the model as the deterministic, input, noisy ‘and’ gate with hierarchy (DINA-H) and retrofitted the Trends in International Mathematics and Science Study (TIMSS) data utilizing this model with linear and unstructured hierarchies. Leighton, Gierl, and Hunka (1999) and Kuhn (2001) introduced four forms of hierarchical structures (linear, convergent, divergent, and unstructured) by assuming interrelated competencies among the cognitive skills. Specifically, the convergent hierarchy is one of the four hierarchies (Leighton, Gierl, & Hunka, 2004), and it is used to describe attributes that have a convergent structure. One of the features of this model is that it can incorporate the hierarchical structure of the cognitive skills in the model estimation process (Su, 2013). The advantage of the DINA-HC over the deterministic, input, noisy ‘and’ gate (DINA) model (Junker & Sijtsma, 2001) is that it reduces the number of parameters as well as the number of latent classes by imposing the particular attribute hierarchy. The model follows the specification of the DINA except that it pre-specifies the attribute profiles using the convergent attribute hierarchy; only certain attribute patterns are allowed, depending on the particular convergent hierarchy. Properties of the DINA-HC and the DINA are examined and compared through a simulation study and an empirical study. Specifically, attribute profile classification accuracy and model and item fit are compared between the DINA-HC and the DINA under different conditions in which the attributes have convergent hierarchies. This study indicates that the DINA-HC provides better model fit, less biased parameter estimates, and higher attribute profile classification accuracy than the DINA when the attributes have a convergent hierarchy. The sample size, the number of attributes, and the test length have been shown to have an effect on the parameter estimates. The DINA model has better model fit than the DINA-HC when the attributes are not dependent on each other.
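To make the reduction in latent classes concrete, the short sketch below enumerates the attribute profiles permitted by one illustrative hierarchy in which attribute 1 is prerequisite to attributes 2 and 3, and attribute 4 requires both 2 and 3. The hierarchy is hypothetical and is not taken from the thesis.

# Enumerate permissible attribute profiles under a prerequisite hierarchy:
# a profile is allowed only if every mastered attribute has all of its
# prerequisites mastered. This shrinks the 2^K latent classes.
from itertools import product

prereqs = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}}   # attribute index -> required attributes

def permissible_profiles(prereqs):
    K = len(prereqs)
    profiles = []
    for alpha in product([0, 1], repeat=K):
        mastered = {k for k, a in enumerate(alpha) if a == 1}
        if all(prereqs[k] <= mastered for k in mastered):   # prerequisites satisfied
            profiles.append(alpha)
    return profiles

profiles = permissible_profiles(prereqs)
print(len(profiles), "of", 2 ** len(prereqs), "profiles are permissible")   # 6 of 16
for p in profiles:
    print(p)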
19. Posterior Predictive Model Checks in Cognitive Diagnostic Models
Park, Jung Yeon. January 2015.
Cognitive diagnostic models (CDMs; DiBello, Roussos, & Stout, 2007) have received increasing attention in educational measurement for the purpose of diagnosing strengths and weaknesses in examinees’ latent attributes. And yet, despite the current popularity of a number of diagnostic models, research seeking to assess model-data fit has been limited. The current study applied a Bayesian model checking method, the posterior predictive model check (PPMC; Rubin, 1984), to investigate model misfit. We employed the technique to assess model-data misfit for various diagnostic models, using real data and two simulation studies. An important issue in applying PPMC is the choice of discrepancy measure. This study examines the performance of three discrepancy measures that target different aspects of model misfit, namely the observed total-score distribution, the association between item pairs, and the correlation between attribute pairs, and evaluates whether they serve as adequate measures for the diagnostic models.
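A generic sketch of the PPMC procedure described above follows: for each posterior draw, replicate a data set from the fitted model, compute a discrepancy measure on the observed and replicated data, and summarize with a posterior predictive p-value. The discrepancy here is the log odds ratio for one item pair, and the "fitted model" is a deliberately misspecified independence model used only to keep the example self-contained; neither the model nor the data come from the study.

# Generic PPMC skeleton with an item-pair log odds ratio as the discrepancy.
import numpy as np

rng = np.random.default_rng(3)

def log_odds_ratio(X, i, j):
    """Discrepancy: log odds ratio for items i and j (0.5 continuity correction)."""
    n11 = np.sum((X[:, i] == 1) & (X[:, j] == 1)) + 0.5
    n10 = np.sum((X[:, i] == 1) & (X[:, j] == 0)) + 0.5
    n01 = np.sum((X[:, i] == 0) & (X[:, j] == 1)) + 0.5
    n00 = np.sum((X[:, i] == 0) & (X[:, j] == 0)) + 0.5
    return np.log(n11 * n00 / (n10 * n01))

def ppmc(X_obs, posterior_draws, simulate, discrepancy):
    obs, rep = [], []
    for draw in posterior_draws:
        X_rep = simulate(draw, X_obs.shape)
        obs.append(discrepancy(X_obs))
        rep.append(discrepancy(X_rep))
    return np.mean(np.array(rep) >= np.array(obs))   # posterior predictive p-value

# Observed data with a positive item dependence the toy model cannot capture.
theta = rng.normal(size=500)
X_obs = (rng.normal(size=(500, 2)) < theta[:, None]).astype(int)

# "Posterior draws" of item proportions under the (misspecified) independence model.
draws = [X_obs.mean(axis=0) + rng.normal(0, 0.01, 2) for _ in range(200)]
simulate = lambda pi, shape: rng.binomial(1, pi, size=shape)
ppp = ppmc(X_obs, draws, simulate, lambda X: log_odds_ratio(X, 0, 1))
print("posterior predictive p-value:", ppp)   # near 0 flags misfit for this discrepancy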
20. A longitudinal study to determine the stanine stability of a group's test-score performance in the elementary school
Corcoran, John E. January 1958.
Thesis (Ed.D.)--Boston University