1. Improving the prediction of differential item functioning: a comparison of the use of an effect size for logistic regression DIF and Mantel-Haenszel DIF methods. Duncan, Susan Cromwell. 17 September 2007.
Psychometricians and test developers use DIF analysis to determine whether there is possible bias in a given test item. This study examines the conditions under which two predominant methods for detecting differential item functioning compare with each other in item bias detection, using an effect size statistic as the basis for comparison. The main focus of the present research was to test whether incorporating an effect size for logistic regression (LR) DIF would detect DIF more accurately, and to compare the utility of an effect size index across Mantel-Haenszel (MH) DIF and LR DIF methods. A simulation study was used to compare the accuracy of the MH DIF and LR DIF methods when using a p value alone or when supplemented with an effect size. Effect sizes were found to increase the accuracy of DIF detection and the likelihood of detecting DIF across varying ability distributions, population distributions, and sample size combinations. Varying ability distributions and sample size combinations affected the detection of DIF, while population distributions did not appear to affect it.
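Because the abstract does not give the statistical details, the following is a minimal sketch of how LR DIF supplemented with an effect size is commonly operationalized: a two-degree-of-freedom likelihood-ratio test for uniform plus nonuniform DIF, paired with a Nagelkerke pseudo-R-squared difference. The 0.035 cutoff and all variable names are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def lr_dif(item, total, group, alpha=0.05, r2_cutoff=0.035):
    """item: 0/1 responses; total: matching score; group: 0 = reference, 1 = focal."""
    item, total, group = (np.asarray(v, dtype=float) for v in (item, total, group))
    n = len(item)
    # Baseline model: item ~ total; augmented model adds group and the group-by-total interaction
    base = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
    full = sm.Logit(item, sm.add_constant(
        np.column_stack([total, group, total * group]))).fit(disp=0)
    # 2-df likelihood-ratio test covering uniform and nonuniform DIF
    g2 = 2 * (full.llf - base.llf)
    p = stats.chi2.sf(g2, df=2)

    def nagelkerke(m):
        # Cox-Snell pseudo-R^2 rescaled to a 0-1 range
        cox_snell = 1 - np.exp(2 * (m.llnull - m.llf) / n)
        return cox_snell / (1 - np.exp(2 * m.llnull / n))

    delta_r2 = nagelkerke(full) - nagelkerke(base)   # effect size attributable to DIF
    return {"p": p,
            "delta_R2": delta_r2,
            "flag_p_only": p < alpha,
            "flag_with_effect_size": (p < alpha) and (delta_r2 >= r2_cutoff)}
```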
2. Gender and Ethnicity-Based Differential Item Functioning on the Myers-Briggs Type Indicator. Gratias, Melissa B. 07 May 1997.
Item Response Theory (IRT) methodologies were employed in order to examine the Myers-Briggs Type Indicator (MBTI) for differential item functioning (DIF) on the basis of crossed gender and ethnicity variables. White males were the reference group, and the focal groups were: black females, black males, and white females. The MBTI was predicted to show DIF in all comparisons. In particular, DIF on the Thinking-Feeling scale was hypothesized especially in the comparisons between white males and black females and between white males and white females. A sample of 10,775 managers who took the MBTI at assessment centers provided the data for the present experiment. The Mantel-Haenszel procedure and an IRT-based area technique were the methods of DIF-detection.
Results showed several biased items on all scales for all comparisons. Ethnicity-based bias was seen in the white male vs. black female and white male vs. black male comparisons. Gender-based bias was seen particularly in the white male vs. white female comparisons. Contrary to predictions, the Thinking-Feeling scale showed the least DIF of all scales across comparisons, and only one of the items differentially scored by gender was found to be biased. Findings indicate that the gender-based differential scoring system is not defensible in managerial samples, and that there is a need for further research into differential item functioning with regard to ethnicity.
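The abstract names an IRT-based area technique but not its formula. The sketch below shows one common variant: the unsigned area between the reference- and focal-group item response functions, evaluated numerically over a theta grid. The 3PL parameter values are made-up illustrations, not estimates from the MBTI data.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter-logistic item response function."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def unsigned_area(params_ref, params_foc, lo=-4.0, hi=4.0, n=4001):
    """Approximate the unsigned area between two item response functions on [lo, hi]."""
    theta = np.linspace(lo, hi, n)
    gap = np.abs(icc_3pl(theta, *params_ref) - icc_3pl(theta, *params_foc))
    return float(np.sum(gap) * (theta[1] - theta[0]))

# Illustrative call: equal discrimination and guessing, item 0.5 logits harder for the focal group
print(round(unsigned_area((1.2, 0.0, 0.2), (1.2, 0.5, 0.2)), 3))
```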
3. Detection and Classification of DIF Types Using Parametric and Nonparametric Methods: A Comparison of the IRT-Likelihood Ratio Test, Crossing-SIBTEST, and Logistic Regression Procedures. Lopez, Gabriel E. 01 January 2012.
The purpose of this investigation was to compare the efficacy of three methods for detecting differential item functioning (DIF). The performance of the crossing simultaneous item bias test (CSIBTEST), the item response theory likelihood ratio test (IRT-LR), and logistic regression (LOGREG) was examined across a range of experimental conditions, including different test lengths, sample sizes, DIF and differential test functioning (DTF) magnitudes, and mean differences in the underlying trait distributions of the comparison groups, herein referred to as the reference and focal groups. In addition, each procedure was implemented using both an all-other anchor approach, in which the IRT-LR baseline model, CSIBTEST matching subtest, and LOGREG trait estimate were based on all test items except the one under study, and a constant anchor approach, in which the baseline model, matching subtest, and trait estimate were based on a predefined subset of DIF-free items. Response data for the reference and focal groups were generated using known item parameters based on the three-parameter logistic item response theory model (3-PLM). Various types of DIF were simulated by shifting the generating item parameters of selected items to achieve desired DIF and DTF magnitudes based on the area between the groups' item response functions. Power, Type I error, and Type III error rates were computed for each experimental condition based on 100 replications, and effects were analyzed via ANOVA. Results indicated that the procedures varied in efficacy, with LOGREG, when implemented using an all-other approach, providing the best balance of power and Type I error rate. However, none of the procedures were effective at identifying the type of DIF that was simulated.
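As a concrete illustration of the data-generation step described above, here is a hedged sketch of simulating 3-PLM response data in which selected items are made harder for the focal group by shifting their difficulty parameters. The sample sizes, the 0.6 shift, and the focal-group mean of -0.5 are arbitrary stand-ins, not the study's actual conditions.

```python
import numpy as np

rng = np.random.default_rng(12)
n_items, n_ref, n_foc = 40, 1000, 1000

a = rng.lognormal(mean=0.0, sigma=0.3, size=n_items)   # discrimination
b = rng.normal(0.0, 1.0, size=n_items)                  # difficulty
c = np.full(n_items, 0.2)                               # pseudo-guessing

dif_items = [0, 1, 2]        # items chosen to carry DIF (illustrative choice)
b_focal = b.copy()
b_focal[dif_items] += 0.6    # uniform DIF: these items are harder for the focal group

theta_ref = rng.normal(0.0, 1.0, n_ref)
theta_foc = rng.normal(-0.5, 1.0, n_foc)   # impact: lower mean ability in the focal group

def simulate(theta, a, b, c):
    """Draw 0/1 responses from the 3PL model for a persons-by-items matrix."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * (np.outer(theta, a) - a * b)))
    return (rng.uniform(size=p.shape) < p).astype(int)

resp_ref = simulate(theta_ref, a, b, c)
resp_foc = simulate(theta_foc, a, b_focal, c)
```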
4. An Analysis of Item Bias in the WISC-R with Kainaiwa Native Canadian Children. Pace, Deborah Faith. 01 May 1995.
The present study examined the responses of 332 Kainai students ranging in age from 6 to 16 years to the Information, Arithmetic, and Picture Completion subtests of the Wechsler Intelligence Scale for Children-Revised (WISC-R) in order to determine the validity of these subtests as a measure of their intelligence. Two indices of validity were assessed: (a) subtest unidimensionality, and (b) order of item difficulty. With regard to the assumption of unidimensionality, examination of the data indicated low item-factor loadings on the Information, Arithmetic, and Picture Completion subtests. Examination of difficulty parameters revealed a nonlinear item difficulty order on all three subtests.
These results support the conclusion of previous research that the WISC-R does not adequately assess the intelligence of Native children. Possible bases for the invalidity of the WISC-R for this population are discussed and recommendations for future research are presented.
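As a small, purely illustrative companion to the difficulty-order evidence described above, the sketch below rank-correlates each item's observed proportion correct with its intended position in the subtest; the procedure and variable names are assumptions for demonstration, not the study's analysis.

```python
import numpy as np
from scipy.stats import spearmanr

def difficulty_order_check(responses):
    """responses: persons-by-items array of 0/1 scores, items in their intended order."""
    responses = np.asarray(responses)
    p_correct = responses.mean(axis=0)            # classical item difficulty (proportion correct)
    position = np.arange(responses.shape[1])      # intended ordering: later items should be harder
    rho, pval = spearmanr(position, -p_correct)   # rho near 1.0 supports the intended ordering
    return p_correct, rho, pval
```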
5. Fighting Bias with Statistics: Detecting Gender Differences in Responses on Items on a Preschool Science Assessment. Greenberg, Ariela Caren. 06 August 2010.
Differential item functioning (DIF) and differential distractor functioning (DDF) are methods used to screen for item bias (Camilli & Shepard, 1994; Penfield, 2008). Using an applied empirical example, this mixed-methods study examined the congruency and relationship of DIF and DDF methods in screening multiple-choice items. Data for Study I were drawn from the item responses of 271 female and 236 male low-income children on a preschool science assessment. Item analyses employed a common statistical approach, the Mantel-Haenszel log-odds ratio (MH-LOR), to detect DIF in dichotomously scored items (Holland & Thayer, 1988), and extended the approach to identify DDF (Penfield, 2008). Findings demonstrated that using MH-LOR to detect DIF and DDF supported the theoretical relationship that the magnitude and form of DIF are dependent on the DDF effects, and demonstrated the advantages of studying DIF and DDF in multiple-choice items. A total of 4 items with DIF and DDF and 5 items with only DDF were detected. Study II incorporated an item content review, an important but often overlooked and under-published step in DIF and DDF studies (Camilli & Shepard, 1994). Interviews with 25 female and 22 male low-income preschool children, together with an expert review, helped to interpret the DIF and DDF results and their comparison, and determined that a content review of studied items can reveal reasons for potential item bias that are often congruent with the statistical results. Patterns emerged and are discussed in detail. The quantitative and qualitative analyses were conducted in an applied framework of examining the validity of the preschool science assessment scores for evaluating science programs serving low-income children; however, the techniques can be generalized for use with measures across various disciplines of research.
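For readers unfamiliar with the MH-LOR statistic named above, here is a hedged sketch of the Mantel-Haenszel common log-odds ratio for one dichotomously scored item, stratifying on total score. Variable names are illustrative, and sparse strata, which would typically be collapsed in practice, are ignored.

```python
import numpy as np

def mh_log_odds_ratio(item, total, group):
    """item: 0/1; total: stratifying score; group: 0 = reference, 1 = focal."""
    item, total, group = map(np.asarray, (item, total, group))
    num = den = 0.0
    for s in np.unique(total):
        k = total == s
        n_s = k.sum()
        a = np.sum((group[k] == 0) & (item[k] == 1))   # reference, correct
        b = np.sum((group[k] == 0) & (item[k] == 0))   # reference, incorrect
        c = np.sum((group[k] == 1) & (item[k] == 1))   # focal, correct
        d = np.sum((group[k] == 1) & (item[k] == 0))   # focal, incorrect
        num += a * d / n_s
        den += b * c / n_s
    return float(np.log(num / den))   # > 0 means the item favors the reference group
```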
6. Differential item functioning in the Peabody Picture Vocabulary Test - Third Edition: partial correlation versus expert judgment. Conoley, Colleen Adele. 30 September 2004.
This study had three purposes: (1) to identify differential item functioning (DIF) on the PPVT-III (Forms A & B) using a partial correlation method, (2) to find a consistent pattern in items identified as underestimating ability in each ethnic minority group, and (3) to compare findings from an expert judgment method and a partial correlation method. Hispanic, African American, and white subjects for the study were provided by American Guidance Service (AGS) from the standardization sample of the PPVT-III; English language learners (ELL) of Mexican descent were recruited from school districts in Central and South Texas. Content raters were all self-selected volunteers; each had an advanced degree, a career in education, and no special expertise with ELL students or ethnic minorities. Two groups of teachers participated as judges for this study. The "expert" group was selected because of its members' special knowledge of ELL students of Mexican descent. The control group consisted of regular education teachers with limited exposure to ELL students. Using the partial correlation method, DIF was detected within each group comparison. In all cases except for the ELL group on Form A of the PPVT-III, there were no significant differences in the numbers of items found to have significant positive correlations versus significant negative correlations. On Form A, the ELL group comparison indicated more items with negative correlations than positive correlations [χ²(1) = 5.538, p = .019]. Among the items flagged as underestimating ability of the ELL group, no consistent trend could be detected. It was also found that none of the expert judges could adequately predict the items that would underestimate ability for the ELL group, despite their expertise. Discussion includes possible consequences of item placement and recommendations regarding further research and use of the PPVT-III.
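The abstract does not spell out how the partial correlation index was computed. A minimal sketch, under assumed coding conventions, is the correlation between item score and group membership with total score partialled out, tested with the usual t approximation; the sign convention (which group is coded 1, what a "negative" correlation means) is illustrative.

```python
import numpy as np
from scipy import stats

def partial_corr_dif(item, total, group):
    """Partial r between item (0/1) and group (0/1), controlling for total score."""
    item, total, group = map(np.asarray, (item, total, group))
    r_ig = np.corrcoef(item, group)[0, 1]
    r_it = np.corrcoef(item, total)[0, 1]
    r_gt = np.corrcoef(group, total)[0, 1]
    # First-order partial correlation formula
    r_partial = (r_ig - r_it * r_gt) / np.sqrt((1 - r_it**2) * (1 - r_gt**2))
    n = len(item)
    t = r_partial * np.sqrt((n - 3) / (1 - r_partial**2))
    p = 2 * stats.t.sf(abs(t), df=n - 3)
    return r_partial, p
```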
7. Gruppskillnader i Provresultat: uppgiftsinnehållets betydelse för resultatskillnader mellan män och kvinnor på prov i ordkunskap och allmänorientering [Group differences in test results: the importance of item content for score differences between men and women on tests of vocabulary and general knowledge]. Stage, Christina. January 1985.
The present monograph deals with the problem of sex differences in test results from various angles. Initially, the aim was to investigate whether the use of test results in selection could be considered fair in spite of sex differences in test score averages. As work progressed, the aim was specified towards clarifying in what manner test item content is related to sex differences in results and whether the observed differences are consistent over different groups of men and women. After a brief review of some research results on sex differences in cognitive abilities, the Swedish Scholastic Aptitude Test (SSAT) is described; the SSAT is the measuring instrument in the following empirical studies. Chapter four surveys a number of models which aim at correcting for unfair group differences in test scores when the tests are to be used in selection, and two of these models are examined empirically. Chapter five examines models that aim to identify individual test items giving deviant results. The conclusion of these two studies is that statistical models cannot solve the problem of group differences in test scores, since what constitutes fairness is mainly a value problem; it cannot be dealt with in a strictly technical manner. Chapter six is devoted to analyses of test item content and sex differences in all subtests on vocabulary and general knowledge which were used in the SSAT between 1977 and 1983. The conclusion from these analyses is that test item content seems to determine whether men or women obtain higher test scores: some subcategories of items seem to favour men and others favour women. The extent to which the testees are able to predict which items favour one sex or the other is studied in chapter seven; the testees could make appropriate judgements only to a very limited extent. In chapter eight the significance of age and education for sex differences in test scores is studied. Furthermore, sex differences on individual items are studied for men and women having the same score at the subtest level. Sex differences in scores on individual test items could not be eliminated by equalizing age, education, or subtest achievement respectively. Finally, the results from all the studies are summarized and discussed in view of their significance for the validity of the tests.
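As an illustration of the matched comparison mentioned above (item-level sex differences for men and women with the same subtest score), the following is a sketch of a conditional proportion-correct difference. It is an assumption about how such a comparison could be coded, not the monograph's procedure, and all names are hypothetical.

```python
import numpy as np

def conditional_item_difference(item, subtest, sex):
    """item: 0/1 item score; subtest: matching score; sex: 0 = men, 1 = women."""
    item, subtest, sex = map(np.asarray, (item, subtest, sex))
    diffs, weights = [], []
    for s in np.unique(subtest):
        k = subtest == s
        if (sex[k] == 0).any() and (sex[k] == 1).any():
            # Proportion-correct difference (men minus women) within this score level
            diffs.append(item[k][sex[k] == 0].mean() - item[k][sex[k] == 1].mean())
            weights.append(k.sum())
    return float(np.average(diffs, weights=weights))  # > 0: item favours men at matched scores
```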
8. A comparability analysis of the National Nurse Aide Assessment Program. Jones, Peggy K. 01 June 2006.
When an exam is administered across dual platforms, such as paper-and-pencil and computer-based testing simultaneously, individual items may become more or less difficult in the computer-based test (CBT) version as compared to the paper-and-pencil (P&P) version, possibly resulting in a shift in the overall difficulty of the test (Mazzeo & Harvey, 1988). Using 38,955 examinees' response data across five forms of the National Nurse Aide Assessment Program (NNAAP) administered in both the CBT and P&P modes, three methods of differential item functioning (DIF) detection were used to detect item DIF across platforms. The three methods were Mantel-Haenszel (MH), logistic regression (LR), and the one-parameter logistic model (1-PL). These methods were compared to determine whether they detect DIF equally in all items on the NNAAP forms. Data were reported by agreement of methods, that is, items flagged by multiple DIF methods. A kappa statistic was calculated to provide an index of agreement between paired methods (LR, MH, and 1-PL) based on the inferential tests. Finally, to determine what impact, if any, the DIF items may have on the test as a whole, the test characteristic curves for each test form and examinee group were displayed. Results indicated that items behaved differently and that an examinee's odds of answering an item correctly were influenced by the test administration mode for several items, ranging from 23% of the items on Forms W and Z (MH) to 38% of the items on Form X (1-PL), with an average of 29%. The test characteristic curves for each test form were examined by examinee group, and it was concluded that the impact of the DIF items on the test was not consequential. Each of the three methods detected items exhibiting DIF in each test form (ranging from 14 to 23 items). The kappa statistic demonstrated a strong degree of agreement between paired methods of analysis for each test form and each DIF method pairing (good to excellent agreement in all pairings). Findings indicated that while items did exhibit DIF, there was no substantial impact at the test level.
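The agreement index described above can be illustrated with a short sketch of Cohen's kappa computed over the per-item DIF flags produced by two methods. The flag vectors below are made up for demonstration and do not correspond to any NNAAP form.

```python
import numpy as np

def cohens_kappa(flags_a, flags_b):
    """flags_a, flags_b: boolean arrays, one entry per item (True = flagged for DIF)."""
    a, b = np.asarray(flags_a, bool), np.asarray(flags_b, bool)
    p_obs = np.mean(a == b)                                        # observed agreement
    p_exp = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# Illustrative use with hypothetical flags from the MH and LR methods on a 60-item form
mh_flags = np.zeros(60, bool); mh_flags[[3, 7, 12, 20, 33]] = True
lr_flags = np.zeros(60, bool); lr_flags[[3, 7, 12, 25, 33]] = True
print(round(cohens_kappa(mh_flags, lr_flags), 2))
```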