1 |
The validity of polytomous items in the Rasch model - The role of statistical evidence of the threshold order. Salzberger, Thomas. January 2015 (has links) (PDF)
Rating scales involving more than two response categories are a popular response format in measurement in education, health and business sciences. Their primary purpose lies in the increase of information and thus measurement precision. For these objectives to be met, the response scale has to provide valid scores with higher numbers reflecting more of the property to be measured. Thus, the response scale is closely linked to construct validity since any kind of malfunctioning would jeopardize measurement. While tests of fit are not necessarily sensitive to violations of the assumed order of response categories, the order of empirical threshold estimates provides insight into the functionality of the scale. The Rasch model and, specifically, the so-called Rasch-Andrich thresholds are unique in providing this kind of evidence. The conclusion whether thresholds are to be considered truly ordered or disordered can be based on empirical point estimates of thresholds. Alternatively, statistical tests can be carried out taking standard errors of threshold estimates into account. Such tests might either stress the need for evidence of ordered thresholds or the need for a
lack of evidence of disordered thresholds. Both approaches are associated with unacceptably high error rates, though. A hybrid approach that accounts for both evidence of ordered and disordered thresholds is suggested as a compromise. While the usefulness of statistical tests for a given data set is still limited, they provide some guidance in terms of a modified response scale in future applications.
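As a rough illustration of the kind of statistical test discussed in this abstract (not the author's actual procedure; the 5% level, the one-sided tests, and treating the threshold estimates as independent are all simplifying assumptions), adjacent Rasch-Andrich threshold estimates can be compared against their standard errors:

```python
import math

def threshold_order_evidence(thresholds, ses):
    """For each adjacent pair of Rasch-Andrich threshold estimates,
    classify the pair as significantly 'ordered', significantly
    'disordered', or 'inconclusive', via a one-sided z-test on the
    difference of the two estimates (independence assumed, 5% level)."""
    z_crit = 1.645  # approx. upper 5% point of the standard normal
    verdicts = []
    for k in range(len(thresholds) - 1):
        diff = thresholds[k + 1] - thresholds[k]
        se_diff = math.sqrt(ses[k] ** 2 + ses[k + 1] ** 2)
        z = diff / se_diff
        if z > z_crit:
            verdicts.append("ordered")
        elif z < -z_crit:
            verdicts.append("disordered")
        else:
            verdicts.append("inconclusive")
    return verdicts
```

Under this sketch a pair counts as "disordered" only when the estimated gap is negative and large relative to its standard error; with large standard errors most pairs come out "inconclusive", which mirrors the abstract's point that point estimates alone can mislead.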
|
2 |
Assessing Invariance of Factor Structures and Polytomous Item Response Model Parameter Estimates. Reyes, Jennifer McGee. December 2010 (links)
The purpose of the present study was to examine the
invariance of the factor structure and item response model
parameter estimates obtained from a set of 27 items
selected from the 2002 and 2003 forms of Your First College
Year (YFCY). The first major research question of the
present study was: How similar/invariant are the factor
structures obtained from two datasets (i.e., identical
items, different people)? The first research question was
addressed in two parts: (1) Exploring factor structures
using the YFCY02 dataset; and (2) Assessing factorial
invariance using the YFCY02 and YFCY03 datasets.
After using exploratory and confirmatory factor analysis for ordered data, a four-factor model using 20
items was selected based on acceptable model fit for the YFCY02 and YFCY03 datasets. The four factors (constructs)
obtained from the final model were: Overall Satisfaction,
Social Agency, Social Self Concept, and Academic Skills.
To assess factorial invariance, partial and full factorial
invariance were examined. The four-factor model fit both
datasets equally well, meeting the criteria for partial and
full measurement invariance.
The second major research question of the present
study was: How similar/invariant are person and item
parameter estimates obtained from two different datasets
(i.e., identical items, different people) for the
homogenous graded response model (Samejima, 1969) and the
partial credit model (Masters, 1982)?
To evaluate measurement invariance using IRT methods,
the item discrimination and item difficulty parameters
obtained from the GRM need to be equivalent across
datasets. The YFCY02 and YFCY03 GRM item discrimination
parameters (slope) correlation was 0.828. The YFCY02 and
YFCY03 GRM item difficulty parameters (location)
correlation was 0.716. The correlations and scatter plots
indicated that the item discrimination parameter estimates
were more invariant than the item difficulty parameter
estimates across the YFCY02 and YFCY03 datasets.
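The invariance check reported above comes down to correlating the two sets of parameter estimates. A minimal sketch (the slope values below are invented for illustration, not the YFCY estimates):

```python
def pearson_r(x, y):
    """Pearson correlation between two sets of item parameter estimates."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

# hypothetical GRM slope estimates for the same items on two forms
slopes_02 = [1.2, 0.8, 1.5, 1.0, 0.6]
slopes_03 = [1.1, 0.9, 1.4, 1.2, 0.5]
r = pearson_r(slopes_02, slopes_03)
```

A correlation near 1 suggests the two calibrations order the items almost identically; the study's observed values of 0.828 (slopes) and 0.716 (locations) sit well below that ideal, hence the conclusion of only partial invariance.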
|
3 |
Testing the Assumption of Sample Invariance of Item Difficulty Parameters in the Rasch Rating Scale Model. Curtin, Joseph A. 20 August 2007 (links) (PDF)
Rasch is a mathematical model that allows researchers to compare data that measure a unidimensional trait or ability (Bond & Fox, 2007). When data fit the Rasch model, the item difficulty estimates are mathematically guaranteed to be independent of the sample of respondents. The purpose of this study was to test the robustness of the Rasch model with regard to its ability to maintain invariant item difficulty estimates when real (imperfectly fitting), polytomously scored data are used. The data used in this study come from a university alumni questionnaire collected over a period of five years. The analysis tests for significant variation between (a) small samples taken from a larger sample, (b) a base sample and subsequent (longitudinal) samples and (c) samples over time with confounding variables. The confounding variables studied include (a) the gender of the respondent and (b) the respondent's type of major at the time of graduation. The study used three methods to assess variation: (a) the between-fit statistic, (b) confidence intervals around the mean of the estimates and (c) a general linear model. The general linear model used the person residual statistic from the Winsteps person output file as the dependent variable, with year, gender and type of major as independent variables. Results of the study support the invariant nature of the item difficulty estimates when polytomous data from the alumni questionnaire are used. The analysis found comparable results (within sampling error) for the between-fit statistics and the general linear model. The confidence interval method was limited in its usefulness due to narrow confidence bands and the limitations of the plots. The linear model offered the most valuable information in that it provides methods not only to detect the existence of variation but also to assess the relative magnitude of the variation from different sources.
Recommendations for future research include studies regarding the impact of sample size on the between-fit statistic and confidence intervals as well as the impact of large amounts of systematic missing data on the item parameter estimates.
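The confidence-interval method described above can be sketched roughly as follows (an illustration only; the study's actual bands were built around Winsteps output and the details may differ):

```python
def flag_noninvariant(estimates, ses, z=1.96):
    """Flag items whose sample-specific difficulty estimate falls outside
    a z*SE confidence band around the mean of the estimates across samples.
    estimates: one difficulty estimate per sample for the same item set,
    ses: the corresponding standard errors."""
    mean = sum(estimates) / len(estimates)
    return [abs(est - mean) > z * se for est, se in zip(estimates, ses)]
```

With narrow bands (small standard errors from large samples) almost any difference flags, which is consistent with the study's finding that this method was of limited usefulness.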
|
4 |
A Comparison of IRT and Rasch Procedures in a Mixed-Item Format Test. Kinsey, Tari L. 08 1900 (links)
This study investigated the effects of test length (10, 20 and 30 items), scoring schema (proportion of dichotomous and polytomous scoring) and item analysis model (IRT and Rasch) on the ability estimates, test information levels and optimization criteria of mixed item format tests. Polytomous item responses to 30 items for 1,000 examinees were simulated using the generalized partial-credit model and SAS software. Portions of the data were re-coded dichotomously over 11 structured proportions to create 33 sets of test responses, including mixed item format tests. MULTILOG software was used to calculate the examinee ability estimates, standard errors, item and test information, reliability and fit indices. A comparison of IRT and Rasch item analysis procedures was made using SPSS software across ability estimates and standard errors of ability estimates using a 3 x 11 x 2 fixed factorial ANOVA. Effect sizes and power were reported for each procedure. Scheffe post hoc procedures were conducted on significant factors. Test information was analyzed and compared across the range of ability levels for all 66 design combinations. The results indicated that both test length and the proportion of items scored polytomously had a significant impact on the amount of test information produced by mixed item format tests. Generally, tests with 100% of the items scored polytomously produced the highest overall information. This seemed to be especially true for examinees with lower ability estimates. Optimality comparisons were made between IRT and Rasch procedures based on standard error rates for the ability estimates, marginal reliabilities and fit indices (-2LL). The only significant differences reported involved the standard error rates for both the IRT and Rasch procedures. This result must be viewed in light of the fact that the effect size reported was negligible. Optimality was found to be highest when longer tests and higher proportions of polytomous scoring were applied.
Some indications were given that IRT procedures may produce slightly improved results in gathering available test information. Overall, significant differences were not found between the IRT and Rasch procedures when analyzing the mixed item format tests. Further research should be conducted in the areas of test difficulty, examinee test scores, and automated partial-credit scoring along with a comparison to other traditional psychometric measures and how they address challenges related to the mixed item format tests.
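The simulation's core data step, generating polytomous responses under the generalized partial-credit model and re-coding some of them dichotomously, can be sketched as follows (a minimal illustration; the parameter values and cut point are arbitrary, and the study itself used SAS rather than Python):

```python
import math

def gpcm_probs(theta, a, b):
    """Category probabilities under the generalized partial-credit model.
    b holds the step parameters b_1..b_m; categories are scored 0..m."""
    cum = [0.0]  # cumulative logits, category 0 anchored at 0
    for bj in b:
        cum.append(cum[-1] + a * (theta - bj))
    exps = [math.exp(c) for c in cum]
    total = sum(exps)
    return [e / total for e in exps]

def dichotomize(x, cut):
    """Re-code a polytomous score as 1 if at or above the cut, else 0."""
    return 1 if x >= cut else 0
```

Sampling a response then just means drawing a category from `gpcm_probs`, and the 11 scoring proportions correspond to how many of the 30 items are passed through `dichotomize`.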
|
5 |
Differential item functioning procedures for polytomous items when examinee sample sizes are small. Wood, Scott William. 01 May 2011 (links)
As part of test score validity, differential item functioning (DIF) is a quantitative characteristic used to evaluate potential item bias. In applications where a small number of examinees take a test, statistical power of DIF detection methods may be affected. Researchers have proposed modifications to DIF detection methods to account for small focal group examinee sizes for the case when items are dichotomously scored. These methods, however, have not been applied to polytomously scored items.
Simulated polytomous item response strings were used to study the Type I error rates and statistical power of three popular DIF detection methods (Mantel test/Cox's β, Liu-Agresti statistic, HW3) and three modifications proposed for contingency tables (empirical Bayesian, randomization, log-linear smoothing). The simulation considered two small sample size conditions, the case with 40 reference group and 40 focal group examinees and the case with 400 reference group and 40 focal group examinees.
In order to compare statistical power rates, it was necessary to calculate the Type I error rates for the DIF detection methods and their modifications. Under most simulation conditions, the unmodified, randomization-based, and log-linear smoothing-based Mantel and Liu-Agresti tests yielded Type I error rates around 5%. The HW3 statistic was found to yield higher Type I error rates than expected for the 40 reference group examinees case, rendering power calculations for these cases meaningless. Results from the simulation suggested that the unmodified Mantel and Liu-Agresti tests yielded the highest statistical power rates for the pervasive-constant and pervasive-convergent patterns of DIF, as compared to other DIF method alternatives. Power rates improved by several percentage points if log-linear smoothing methods were applied to the contingency tables prior to using the Mantel or Liu-Agresti tests. Power rates did not improve if Bayesian methods or randomization tests were applied to the contingency tables prior to using the Mantel or Liu-Agresti tests. ANOVA tests showed that statistical power was higher when 400 reference examinees were used versus 40 reference examinees, when impact was present among examinees versus when impact was not present, and when the studied item was excluded from the anchor test versus when the studied item was included in the anchor test. Statistical power rates were generally too low to merit practical use of these methods in isolation, at least under the conditions of this study.
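For orientation, the Mantel test at the heart of these comparisons can be sketched in a few lines (illustrative only; production implementations add continuity corrections and the small-sample modifications studied here):

```python
def mantel_chi2(strata, scores):
    """Mantel (1963) statistic for DIF with ordered item scores.
    strata: list of (ref_counts, foc_counts) pairs, one per matching
    stratum, each a list of frequencies per score category;
    scores: the numeric category scores."""
    F = EF = VF = 0.0
    for ref, foc in strata:
        m = [r + f for r, f in zip(ref, foc)]          # total per category
        n = sum(m)
        n_f = sum(foc)
        n_r = n - n_f
        if n < 2 or n_f == 0 or n_r == 0:
            continue  # stratum carries no information
        s1 = sum(mk * y for mk, y in zip(m, scores))
        s2 = sum(mk * y * y for mk, y in zip(m, scores))
        F += sum(fk * y for fk, y in zip(foc, scores))  # focal score sum
        EF += n_f * s1 / n                              # its expectation
        VF += n_r * n_f * (n * s2 - s1 * s1) / (n * n * (n - 1))
    return (F - EF) ** 2 / VF
```

The statistic is referred to a chi-square distribution with one degree of freedom, so values above roughly 3.84 would flag DIF at the 5% level.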
|
6 |
An investigation of the optimal test design for multi-stage test using the generalized partial credit model. Chen, Ling-Yin. 27 January 2011 (links)
Although the design of multistage testing (MST) has received increasing attention, previous studies have mostly focused on comparing the psychometric properties of MST with those of computerized adaptive testing (CAT) and paper-and-pencil (P&P) tests. Few studies have systematically examined the number of items in the routing test, the number of subtests in a stage, or the number of stages in a test design needed to achieve accurate measurement in MST. Given that no studies have identified an ideal MST design using polytomously scored items, the current study conducted a simulation to investigate the optimal design for MST under the generalized partial credit model (GPCM).
Eight different test designs were examined on ability estimation across two routing test lengths (short and long) and two total test lengths (short and long). The item pool and generated item responses were based on items calibrated from a national test consisting of 273 partial credit items. Across all test designs, the maximum information routing method was employed and the maximum likelihood estimation was used for ability estimation. Ten samples of 1,000 simulees were used to assess each test design. The performance of each test design was evaluated in terms of the precision of ability estimates, item exposure rate, item pool utilization, and item overlap.
The study found that all test designs produced very similar results. Although there were some variations among the eight test structures in the ability estimates, the results indicate that the overall performance of these eight test structures in achieving measurement precision did not substantially deviate from one another with regard to total test length and routing test length. However, results from the present study suggest that routing test length does have a significant effect on the number of non-convergent cases in MST. Short routing tests tended to produce more non-convergent cases, and structures with fewer stages yielded more such cases than structures with more stages. Overall, unlike previous findings, the results of the present study indicate that the MST test structure is unlikely to be a factor affecting ability estimation when polytomously scored items are used under the GPCM.
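The maximum-information routing rule used across these designs can be sketched as follows (a toy illustration with invented item parameters, not the calibrated national-test items):

```python
import math

def gpcm_info(theta, a, b):
    """Fisher information of a single GPCM item at ability theta:
    a^2 times the variance of the item score at theta."""
    cum = [0.0]
    for bj in b:
        cum.append(cum[-1] + a * (theta - bj))
    exps = [math.exp(c) for c in cum]
    tot = sum(exps)
    p = [e / tot for e in exps]
    m1 = sum(k * pk for k, pk in enumerate(p))
    m2 = sum(k * k * pk for k, pk in enumerate(p))
    return a * a * (m2 - m1 * m1)

def route(theta, modules):
    """Pick the stage module with maximum information at the
    provisional theta. modules: name -> list of (a, b) item parameters."""
    def info(items):
        return sum(gpcm_info(theta, a, b) for a, b in items)
    return max(modules, key=lambda name: info(modules[name]))
```

An examinee with a low provisional ability estimate is routed to the module whose step parameters sit near that ability, which is where its items are most informative.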
|
7 |
A Monte Carlo Study Investigating Missing Data, Differential Item Functioning, and Effect Size. Garrett, Phyllis Lorena. 12 August 2009 (links)
The use of polytomous items in assessments has increased over the years, and as a result, the validity of these assessments has been a concern. Differential item functioning (DIF) and missing data are two factors that may adversely affect assessment validity. Both factors have been studied separately, but DIF and missing data are likely to occur simultaneously in real assessment situations. This study investigated the Type I error and power of several DIF detection methods and methods of handling missing data for polytomous items generated under the partial credit model. The Type I error and power of the Mantel and ordinal logistic regression were compared using within-person mean substitution and multiple imputation when data were missing completely at random. In addition to assessing the Type I error and power of DIF detection methods and methods of handling missing data, this study also assessed the impact of missing data on the effect size measures associated with the Mantel (the standardized mean difference) and ordinal logistic regression (R-squared). Results indicated that the performance of the Mantel and ordinal logistic regression depended on the percent of missing data in the data set, the magnitude of DIF, and the sample size ratio. The Type I error for both DIF detection methods varied based on the missing data method used to impute the missing data. Power to detect DIF increased as DIF magnitude increased, but there was a relative decrease in power as the percent of missing data increased. Additional findings indicated that the percent of missing data, DIF magnitude, and sample size ratio also influenced the effect size measures associated with the Mantel and ordinal logistic regression.
The effect size values for both DIF detection methods generally increased as DIF magnitude increased, but as the percent of missing data increased, the effect size values decreased.
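Of the two missing-data methods compared, within-person mean substitution is the simpler; a sketch (rounding the mean to the nearest response category is one common convention, assumed here):

```python
def impute_person_mean(responses):
    """Replace each missing response (None) with the mean of the person's
    observed responses, rounded to the nearest category; persons with no
    observed responses are left untouched."""
    observed = [r for r in responses if r is not None]
    if not observed:
        return list(responses)
    fill = round(sum(observed) / len(observed))
    return [fill if r is None else r for r in responses]
```

Multiple imputation, the other method studied, instead draws several plausible values per gap from a model and pools the resulting analyses, at the cost of considerably more machinery.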
|
8 |
Scoring for credit risk: polytomous response variable, variable selection, dimension reduction, applications. Vital, Clément. 11 July 2016
The objective of this thesis was to explore the subject of scoring in the banking world, and more precisely to study how to control credit risk.
The diversification and globalization of the banking business in the second half of the twentieth century led to the introduction of regulations, which require banks to hold reserves to cover the risk they take. These regulations also dictate that banks should model various risk indicators, among which is the probability of default. This indicator represents the probability that a client will find himself unable to pay back his debt. In order to predict this probability, one should define a risk criterion that allows the "bad clients" to be distinguished from the "good clients". In a more formal statistical approach, this means we want to model a binary variable by a set of explanatory variables. This problem is usually treated as a scoring problem. It consists in the definition of functions, called scoring functions, which summarize the information contained in the explanatory variables into a real-valued score. The goal of such a function is to induce the same ordering on the observations as the a posteriori probability, so that observations with a high probability of being "good" receive a high score, and those with a high probability of being "bad" (and thus a high risk for the bank) receive a low score. Performance criteria such as the ROC curve and the AUC allow us to quantify the quality of the ordering given by the scoring function. The reference method for obtaining such scoring functions is logistic regression, which we present here. A major subject in credit scoring is variable selection. Banks have access to large databases, which gather information on the profiles of their clients and their past behaviour. However, these variables may not all be discriminating with respect to the risk criterion. In order to select the variables, we proposed to use the Lasso method, based on a constraint on the coefficients of the model, so that the least significant coefficients are fixed to zero.
We applied the Lasso method to linear regression and logistic regression. We also considered an extension of the Lasso method called the Group Lasso for logistic regression, which allows groups of variables to be selected rather than individual variables. Then, we considered the case in which the response variable is not binary but polytomous, that is to say, with more than two response levels. The first step in this new context was to extend the scoring problem from the binary case to the polytomous case. We then presented some models adapted to this case: an extension of binary logistic regression, semi-parametric methods, and an application of the Lasso method to polytomous logistic regression. Finally, the last chapter deals with application studies, in which some of the methods presented in this manuscript are applied to real data from the bank, to see how they meet the needs of the real world.
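The AUC criterion used throughout this work to judge scoring functions has a simple rank interpretation, sketched below (illustrative only; credit-scoring systems compute the same quantity from far larger samples):

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen 'good' client (label 1) outscores a randomly chosen
    'bad' client (label 0), with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 1 means the scoring function orders every "good" client above every "bad" one; 0.5 is no better than chance, which is why the ordering property, rather than the raw score values, is what matters.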
|
9 |
Elektriska flickor och mekaniska pojkar : Om gruppskillnader på prov - en metodutveckling och en studie av skillnader mellan flickor och pojkar på centrala prov i fysik [Electric girls and mechanical boys: on group differences in tests - a method development and a study of differences between girls and boys on national tests in physics]. Ramstedt, Kristian. January 1996
This dissertation served two purposes. The first was to develop a method of detecting differential item functioning (DIF) within tests containing both dichotomously and polytomously scored items. The second was related to gender and aimed (a) to investigate whether the items that functioned differently for girls and boys showed any characteristic properties and, if so, (b) to determine whether these properties could be used to predict which items would be flagged for DIF. The method development was based on the Mantel-Haenszel (MH) method used for dichotomously scored items. By dichotomizing the polytomously scored items, both types of item could be compared on the same statistical level, as either solved or non-solved items. It was not possible to compare the internal score structures for the two gender groups; only overall score differences were detected. By modelling the empirical item characteristic curves, it was possible to develop an MH method for identifying nonuniform DIF. Both internal and external ability criteria were used. Total test score with no purification was used as the internal criterion; purification was not done for validity reasons, as no items were judged to be biased. Teacher-set marks were used as the external criterion. The marking scale had to be transformed for either boys or girls, since a comparison of scores for boys and girls with the same marks showed that boys always obtained higher mean scores. The results of the two MH analyses based on the internal and external criteria were compared with results from P-SIBTEST. All three methods corresponded well, although P-SIBTEST flagged considerably more items in favour of the reference group (boys), which exhibited a higher overall ability.
All 200 items included in the last 15 annual national tests in physics were analysed for DIF and classified by ten criteria. The most significant result was that items in electricity were, to a significantly higher degree, flagged as DIF in favour of girls, whilst items in mechanics were flagged in favour of boys. Items in other content areas showed no significant pattern. Multiple-choice items were flagged in favour of boys. Regardless of the degree of significance by which items from different content areas were flagged on a group level, it was not possible to predict which single item would be flagged for DIF; the most probable prediction was always that an item was neutral. Some possible interpretations of DIF as an effect of multidimensionality were discussed, as were some hypotheses about why boys did better in mechanics and girls in electricity.
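The Mantel-Haenszel comparison underlying the method, applied after each item has been dichotomized into solved/non-solved within matching score strata, can be sketched as a common odds ratio (an illustration, not the dissertation's exact implementation):

```python
def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across score strata.
    Each table is (a, b, c, d): reference solved, reference non-solved,
    focal solved, focal non-solved."""
    num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den
```

A common odds ratio near 1 indicates no uniform DIF; under this table layout, values above 1 favour the reference group (here, the boys) and values below 1 favour the focal group.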
|
10 |
Modeling Diseases With Multiple Disease Characteristics: Comparison Of Models And Estimation Methods. Erdem, Munire Tugba. 01 July 2011 (links) (PDF)
Epidemiological data with disease characteristic information can be modelled in several ways. One way is to take each disease characteristic as a response and construct a binary or polytomous logistic regression model. A second way is to use a new response consisting of disease subtypes created by cross-classification of the disease characteristic levels, and then construct a polytomous logistic regression model. The former may be disadvantageous since any possible covariation between disease characteristics is neglected, whereas the latter can capture that covariation. However, cross-classifying the characteristic levels increases the number of response categories, so that a dimensionality problem in the parameter space may occur in the classical polytomous logistic regression model. A two-stage polytomous logistic regression model overcomes that dimensionality problem. This thesis proceeds in two main directions: a simulation study and a data analysis. In the simulation study, models that capture the covariation behaviour are compared in terms of the response model parameter estimators. That is, the performances of the maximum likelihood estimation (MLE) approach to classical polytomous logistic regression, the Bayesian estimation approach to classical polytomous logistic regression and the pseudo-conditional likelihood (PCL) estimation approach to two-stage polytomous logistic regression are compared in terms of the bias and variance of the estimators. Results of the simulation study revealed that for small samples and a small number of disease subtypes, PCL performs best in terms of bias and variance. For a medium number of disease subtypes, PCL performs better than MLE when the sample size is small; however, as the sample size grows, MLE performs better in terms of the standard errors of the estimates.
In addition, the sampling variance of the PCL estimators of the two-stage model converges to the asymptotic variance faster than that of the ML estimators of the classical polytomous logistic regression model. In the data analysis, etiologic heterogeneity in the breast cancer subtypes of Turkish female cancer patients is investigated, and the superiority of the two-stage polytomous logistic regression model over the classical polytomous logistic model with disease subtypes is demonstrated in terms of the interpretability of the parameters and convenience in hypothesis testing.
|