1 |
Criterion measures : the other half of the validity equation / Fehr, Steven E. 01 January 1999 (has links)
No description available.
|
2 |
BIAS IN THE ITEMS OF THE CALIFORNIA ACHIEVEMENT TESTS FOR CHILDREN FROM THREE SOCIO-CULTURAL GROUPS. / VANTAGGI, TERRENCE B. January 1984 (has links)
This study investigated bias at the item level in six subtests of the California Achievement Tests (CAT). Variability of performance across all individual items of the CAT for fourth graders from three ethnic groups was examined. A two-factor (item scores and ethnicity) ANOVA procedure was used to examine the interaction between Anglo and Hispanic children and between Anglo and black subjects on individual test items of the subtests. Significant F-ratios for the Items x Groups interaction were further examined using Bonferroni's post-hoc test for the purpose of identifying specific items reflecting cultural bias. A total of twenty-one items was identified as culturally biased. Of these items, sixteen were biased against Hispanics, three were found to contain bias against blacks, and two reflected bias against both Hispanic and black children. Of these twenty-one items identified as biased, eighteen belonged to the four verbal subtests and three belonged to the two mathematics subtests. In addition to these items identified as being statistically biased, this study suggests that ethnocultural differences exist in overall performance levels between groups. For example, on the verbal subtests, there was a total of only three items on which Hispanic children scored higher than Anglo subjects, and only one item which reflected a better performance by black children than Anglo students. Higher performance levels by Anglo subjects were also noted on the mathematics subtests, wherein Hispanic children scored higher than their Anglo counterparts on six items, and black subjects outperformed Anglo children on only one item. These data reflected a tendency of higher performance by Anglo students across all subtests when an examination of the number of items passed or failed by members of each ethnic group was made. The examination of the verbal subtests additionally showed that Anglos passed sixty-five items, Hispanic children passed twenty-four items, and thirty-two items were passed by black subjects. This trend continued on the mathematics subtests, where thirty-one items were passed by Anglo students and seventeen and fifteen items were passed by Hispanic and black children, respectively. The findings of this study led to the conclusion that the majority of items on the CAT do not reflect evidence of cultural bias. There were, however, a limited number of items on which either Hispanic or black children out-performed their Anglo counterparts. Implications of these findings were discussed and recommendations were made for future studies to examine bias at the item level.
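For readers who want to see the shape of such an analysis, the following is a minimal sketch of an items-by-groups ANOVA followed by Bonferroni-corrected per-item comparisons. It is a generic Python illustration, not the study's actual procedure; the column names (score, item, group) and the use of statsmodels and SciPy are assumptions.

```python
# Sketch of an items-by-groups ANOVA followed by Bonferroni-corrected
# per-item comparisons, in the spirit of the procedure described above.
# Column names (score, item, group) are illustrative, not from the study.
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

def flag_biased_items(long_df, alpha=0.05):
    # Two-factor ANOVA: item score by ethnicity, where a significant
    # Items x Groups interaction means group differences vary across items.
    model = smf.ols("score ~ C(group) * C(item)", data=long_df).fit()
    print(sm.stats.anova_lm(model, typ=2))

    # Bonferroni-style follow-up: test each item separately and divide
    # alpha by the number of items tested.
    items = long_df["item"].unique()
    adjusted_alpha = alpha / len(items)
    flagged = []
    for item in items:
        sub = long_df[long_df["item"] == item]
        groups = [g["score"].values for _, g in sub.groupby("group")]
        if len(groups) == 2:
            t, p = stats.ttest_ind(*groups, equal_var=False)
            if p < adjusted_alpha:
                flagged.append(item)
    return flagged
```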
|
3 |
Comparison of different equating methods and an application to link testlet-based tests. / CUHK electronic theses & dissertations collection / Digital dissertation consortium / January 2010 (has links)
Keywords: Equating, IRT, Testlet Response Model, Rasch Testlet Model, LSC, Concurrent, FPC / Test equating allows direct comparison of test scores from alternative forms measuring the same construct by employing equating procedures to put the test scores on the same metric. Three equating procedures are commonly used in the literature: the concurrent calibration method, the linking separate calibration methods (e.g., the moment methods and the characteristic curve methods), and the FPC (Fixed Parameter Calibration) method. The first two types of methods for the traditional IRT model are well developed. The FPC method has received increasing attention recently because of its utility for constructing item banks and for computerized adaptive testing (CAT). However, few studies have examined the equating accuracy of the FPC method compared to that of the linking separate calibration method and the concurrent calibration method. / The equating methods for the traditional IRT model are not appropriate for linking testlet-based tests because the local independence assumption of the IRT model does not hold for this type of test. Some measurement models, such as the testlet response model, the bi-factor model, and the Rasch testlet model, have been advanced to calibrate testlet-based tests. Few equating methods, however, have been developed for linking testlet-based tests that take into account the additional local dependence among examinees' responses to items within testlets. / The first study compared the equating accuracy of the FPC method, the linking separate calibration method, and the concurrent calibration method based on the IRT model in equating item parameters under different conditions. The results indicated that the FPC method using BILOG-MG performed as well as the linking separate calibration method and the concurrent calibration method for linking equivalent groups. However, the FPC method produced larger equating errors than the other two methods when the ability distributions of the base and target groups were substantially nonequivalent. Differences in difficulty between the common item set and the total test did not substantially affect the equating results for the three methods, other conditions being held equal. As expected, both smaller sample sizes and fewer common items led to slightly greater equating errors. / The last study used the concurrent calibration method under the multidimensional Rasch testlet model to link testlet-based tests in which the testlets were composed of dichotomous, polytomous, and mixed-format items. The results demonstrated that the concurrent calibration method under the Rasch testlet model worked well in recovering the underlying item parameters. Again, equating errors increased substantially when the local dependence was ignored in model calibration, and smaller testlet variances for the common testlets led to more accurate equating results. / The results of the studies contribute to a better understanding of the effectiveness of the different equating methods, particularly those for linking testlet-based tests. They also help clarify the influences of other factors, such as the characteristics of the examinees and the features of the common items and common testlets, on equating results. Testing practitioners and researchers may draw useful recommendations from the findings about equating method selection. Nevertheless, generalizations of the findings from the simulation studies to practical testing programs should be made with caution.
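As a concrete illustration of the "linking separate calibrations" family mentioned in the record above, the sketch below implements mean/sigma linking, one of the moment methods, for a 2PL-style parameterization. It is a generic Python sketch, not the dissertation's procedure, and the function and variable names are illustrative.

```python
# Mean/sigma linking: a moment method for placing separately calibrated
# item parameters on a common scale via the common (anchor) items.
# A generic illustration, not the exact implementation studied above.
import numpy as np

def mean_sigma_linking(b_anchor_base, b_anchor_new, a_new, b_new):
    """Transform new-form 2PL parameters onto the base-form scale.

    b_anchor_base / b_anchor_new: difficulty estimates of the common items
    from the base and new calibrations; a_new, b_new: all new-form
    discriminations and difficulties to be rescaled.
    """
    b_anchor_base = np.asarray(b_anchor_base, dtype=float)
    b_anchor_new = np.asarray(b_anchor_new, dtype=float)

    # Scaling constants from the anchor-item difficulty moments.
    A = b_anchor_base.std(ddof=1) / b_anchor_new.std(ddof=1)
    B = b_anchor_base.mean() - A * b_anchor_new.mean()

    # theta* = A * theta + B, so b* = A * b + B and a* = a / A.
    b_transformed = A * np.asarray(b_new, dtype=float) + B
    a_transformed = np.asarray(a_new, dtype=float) / A
    return A, B, a_transformed, b_transformed
```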
/ The second study developed an item characteristic curve method and a testlet characteristic curve method for the testlet response model to transform the scale of item parameters. It then compared the effectiveness of the characteristic curve methods and the concurrent calibration method under different conditions in linking item parameters from alternate test forms composed of dichotomously scored testlet-based items. The newly developed item characteristic curve method and testlet characteristic curve method were shown to perform as well as, or even better than, the Stocking-Lord test characteristic curve method and the concurrent calibration method. Ignoring the local dependence in model calibration substantially increased equating errors, and larger testlet variances for the common testlets led to greater equating errors. / To address the need to better understand the FPC method and to develop new equating methods for testlet-based tests, these studies compared the effectiveness of the three types of equating methods under different linking situations and developed new equating methods for linking testlet-based tests. Besides the equating methods themselves, other factors that might affect equating results, including sample size, ability distribution, and the characteristics of the common items and testlets, were also considered. Three simulation studies were carried out to accomplish the research purposes. / Zhang, Zhonghua. / Adviser: Yujing Ni. / Source: Dissertation Abstracts International, Volume: 72-01, Section: A, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 156-166). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
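The characteristic curve approach used as a baseline in the second study can be illustrated, in simplified form, by the Stocking-Lord criterion for a plain 2PL model: the scaling constants are chosen to minimize the squared distance between test characteristic curves over a grid of ability values. The Python sketch below is an illustration only; the dissertation's testlet-based extensions add testlet effect terms that are omitted here, and all names are assumptions.

```python
# Stocking-Lord style linking for a plain 2PL model: find (A, B) that
# minimize the squared difference between the base-form test characteristic
# curve and the transformed new-form curve over a theta grid.
# A simplified illustration; the testlet-based methods developed in the
# dissertation additionally model testlet effects, not included here.
import numpy as np
from scipy.optimize import minimize

def tcc(theta, a, b):
    # Test characteristic curve of a 2PL test at each theta value.
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
    return p.sum(axis=1)

def stocking_lord(a_base, b_base, a_new, b_new,
                  theta=np.linspace(-4, 4, 41)):
    a_base, b_base = np.asarray(a_base), np.asarray(b_base)
    a_new, b_new = np.asarray(a_new), np.asarray(b_new)

    def criterion(params):
        A, B = params
        # Transform new-form anchor parameters onto the base scale.
        return np.sum((tcc(theta, a_base, b_base)
                       - tcc(theta, a_new / A, A * b_new + B)) ** 2)

    result = minimize(criterion, x0=[1.0, 0.0], method="Nelder-Mead")
    return result.x  # estimated (A, B)
```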
|
4 |
CLINICAL VERSUS AUTOMATED ADMINISTRATION OF A MENTAL TEST: A STUDY OF EXAMINER INFLUENCE / Campo, Robert Ettore, 1932- January 1963 (has links)
No description available.
|
6 |
Nonparametric item response modeling for identifying differential item functioning in the moderate-to-small-scale testing context / Witarsa, Petronilla Murlita 11 1900 (has links)
Differential item functioning (DIF) can occur across age, gender, ethnic, and/or
linguistic groups of examinee populations. Therefore, whenever there is more than one
group of examinees involved in a test, a possibility of DIF exists. It is important to detect
items with DIF with accurate and powerful statistical methods. While finding a proper
DIF method is essential, most of the available methods until now have been dominated
by applications to large-scale testing contexts. Since the early 1990s, Ramsay has
developed a nonparametric item response methodology and computer software, TestGraf
(Ramsay, 2000). The nonparametric item response theory (IRT) method requires fewer
examinees and items than other item response theory methods and was also designed to
detect DIF. However, nonparametric IRT's Type I error rate for DIF detection had not
been investigated.
The present study investigated the Type I error rate of the nonparametric IRT DIF
detection method, when applied to a moderate-to-small-scale testing context wherein there
were 500 or fewer examinees in a group. In addition, the Mantel-Haenszel (MH) DIF
detection method was included.
A three-parameter logistic item response model was used to generate data for the
two population groups. Each population corresponded to a test of 40 items. Item statistics
for the first 34 non-DIF items were randomly chosen from the mathematics test of the
1999 TIMSS (Third International Mathematics and Science Study) for grade eight,
whereas item statistics for the last six studied items were adopted from the DIF items
used in the study of Muniz, Hambleton, and Xing (2001). These six items were the focus
of this study. / Education, Faculty of / Educational and Counselling Psychology, and Special Education (ECPS), Department of / Graduate
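A rough Python sketch of two building blocks of the simulation design described above is given below: generating dichotomous responses from a three-parameter logistic model and screening one studied item with the Mantel-Haenszel procedure, stratified on rest score. Parameter values, sample sizes, and names are illustrative assumptions, and the nonparametric TestGraf analysis itself is not reproduced.

```python
# Sketch: simulate 3PL responses for two groups and screen one studied item
# with the Mantel-Haenszel statistic, stratifying examinees on rest score.
# Item parameters, sample sizes, and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def simulate_3pl(theta, a, b, c):
    # P(correct) under the three-parameter logistic model.
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

def mantel_haenszel_chi2(responses, group, studied_item):
    # Stratify on rest score (total score excluding the studied item).
    rest = responses.sum(axis=1) - responses[:, studied_item]
    item = responses[:, studied_item]
    obs, expected, variance = 0.0, 0.0, 0.0
    for s in np.unique(rest):
        in_s = rest == s
        ref, foc = in_s & (group == 0), in_s & (group == 1)
        n_r, n_f, n = ref.sum(), foc.sum(), in_s.sum()
        n1 = item[in_s].sum()          # correct responses in the stratum
        n0 = n - n1                    # incorrect responses in the stratum
        if n_r == 0 or n_f == 0 or n < 2:
            continue
        obs += item[ref].sum()         # reference-group correct count
        expected += n_r * n1 / n
        variance += n_r * n_f * n1 * n0 / (n ** 2 * (n - 1))
    return (abs(obs - expected) - 0.5) ** 2 / variance

# Example: 40 items, two equivalent groups of 500, no DIF simulated
# (a Type I error scenario similar in spirit to the study's design).
n_items, n_per_group = 40, 500
a = rng.uniform(0.5, 1.5, n_items)
b = rng.normal(0.0, 1.0, n_items)
c = np.full(n_items, 0.2)
theta = rng.normal(0.0, 1.0, 2 * n_per_group)
group = np.repeat([0, 1], n_per_group)
responses = simulate_3pl(theta, a, b, c)
print(mantel_haenszel_chi2(responses, group, studied_item=39))
```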
|
7 |
Determining the reliability and validity of an instrument to measure beginning teacher knowledge of reading instruction / Tarbet, Leslie. January 1984 (has links)
Call number: LD2668 .T4 1984 T37 / Master of Science
|
8 |
The Generalization of the Logistic Discriminant Function Analysis and Mantel Score Test Procedures to Detection of Differential Testlet Functioning / Kinard, Mary E. 08 1900 (has links)
Two procedures for detection of differential item functioning (DIF) for polytomous items were generalized to detection of differential testlet functioning (DTLF). The methods compared were the logistic discriminant function analysis procedure for uniform and non-uniform DTLF (LDFA-U and LDFA-N) and the Mantel score test procedure. Further analysis included comparison of the results of DTLF analysis using the Mantel procedure with DIF analysis of individual testlet items using the Mantel-Haenszel (MH) procedure. Over 600 chi-squares were analyzed and compared for rejection of null hypotheses. Samples of 500, 1,000, and 2,000 were drawn by gender subgroups from the NELS:88 data set, which contains demographic and test data from over 25,000 eighth graders. Three types of testlets (totalling 29) from the NELS:88 test were analyzed for DTLF. The first type, the common passage testlet, followed the conventional testlet definition: items grouped together by a common reading passage, figure, or graph. The other two types were based upon common content and common process, as outlined in the NELS test specification.
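One common way to operationalize the logistic discriminant idea for uniform and non-uniform testlet DIF is to model group membership from the total score and the testlet score and compare nested models with likelihood-ratio tests. The Python sketch below follows that general formulation; it may differ in detail from the study's LDFA-U, LDFA-N, and Mantel procedures, and all variable names are assumptions.

```python
# Sketch of a logistic discriminant style analysis for testlet DIF:
# group membership (coded 0/1) is modeled from total score and testlet
# score, and nested models are compared with likelihood-ratio tests.
# One common formulation; it may differ from the study's exact procedure.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

def ldfa_dtlf(total, testlet, group):
    df = pd.DataFrame({"total": total, "testlet": testlet, "group": group})
    base = smf.logit("group ~ total", data=df).fit(disp=0)
    uniform = smf.logit("group ~ total + testlet", data=df).fit(disp=0)
    nonuniform = smf.logit("group ~ total + testlet + total:testlet",
                           data=df).fit(disp=0)

    # Likelihood-ratio chi-squares for uniform and non-uniform DTLF.
    lr_u = 2 * (uniform.llf - base.llf)
    lr_n = 2 * (nonuniform.llf - uniform.llf)
    return {"uniform_p": stats.chi2.sf(lr_u, df=1),
            "nonuniform_p": stats.chi2.sf(lr_n, df=1)}
```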
|
9 |
Construct validity of the Differential Ability Scales with a mentally handicapped population : an investigation into the interpretability of cluster scores / Parker, Kathy L. January 1995 (has links)
The purpose of the present study was to investigate the construct validity of the Differential Ability Scales (DAS) with a mentally handicapped population. The DAS is an individually administered, standardized test of intelligence. The stated purposes of the DAS are to provide a composite measure of conceptual reasoning abilities for classification and placement decisions and to provide a reliable profile of relative strengths and weaknesses for diagnostic purposes. With these goals in mind, it follows that this cognitive measure would often be used with mentally handicapped students. The DAS was developed using a hierarchical model based upon exploratory and confirmatory factor analyses. The model assumes that ability measures or subtests will load on a general factor g and will form subfactors at a lower level. The model also assumes that as children get older, the number of subfactors will increase because of the development and differentiation of abilities. How mentally handicapped children would fit into this model was the subject of the current research. Using a sample of 100 mildly and moderately handicapped children ages 8 years, 0 months to 17 years, 5 months, confirmatory factor analysis was used to explore the factor structure of the DAS with this population. Three separate models were investigated: Model I, in which a one-factor solution was proposed; Model II, in which two factors, Verbal Ability and Nonverbal Ability, were proposed; and Model III, in which the three factors proposed by the test's authors, Verbal Ability, Nonverbal Reasoning Ability, and Spatial Ability, were investigated. Results of the analyses support the use of a one-factor interpretation when using the DAS with mentally handicapped students. In practice, only the broadest score, the General Conceptual Ability Score (GCA), can be interpreted with confidence. Further, case study investigation illustrates the inconsistencies encountered in scoring at the lower end of the norms, as well as in using the out-of-level procedure proposed by the test's authors. / Department of Educational Psychology
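To make the model comparison concrete, the sketch below shows how one-, two-, and three-factor confirmatory models could be specified and compared with the semopy package using lavaan-style syntax. It is an illustration under stated assumptions rather than the study's analysis: the subtest variable names are placeholders and the factor-to-subtest assignments are arbitrary.

```python
# Sketch: specifying and comparing one-, two-, and three-factor CFA models
# with semopy (lavaan-style syntax). Subtest variable names are placeholders,
# not the actual DAS subtests used in the study.
import semopy

models = {
    "one_factor": """
        g =~ sub1 + sub2 + sub3 + sub4 + sub5 + sub6
    """,
    "two_factor": """
        verbal    =~ sub1 + sub2 + sub3
        nonverbal =~ sub4 + sub5 + sub6
    """,
    "three_factor": """
        verbal    =~ sub1 + sub2
        reasoning =~ sub3 + sub4
        spatial   =~ sub5 + sub6
    """,
}

def compare_models(data):
    # data: a pandas DataFrame with one column per subtest score.
    for name, desc in models.items():
        model = semopy.Model(desc)
        model.fit(data)
        fit = semopy.calc_stats(model)   # fit indices (chi-square, CFI, RMSEA, ...)
        print(name)
        print(fit.T)
```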
|
10 |
Comparing job component validity to observed validity across jobs / Morris, David Charles 01 January 2002 (has links)
Five hundred and eighteen observed validity coefficients, based on correlations between commercially available test data and supervisory ratings of overall job performance, were collected for 89 different job titles. Using Dictionary of Occupational Titles codes, Job Component Validity (JCV) estimates based on similar job titles residing in the PAQ Service database were collected and averaged across the General Aptitude Test.
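Since the abstract above breaks off mid-description, the following is only a generic sketch of the comparison implied by the title: pairing each job title's averaged JCV estimate with its observed validity coefficient and summarizing their agreement. The data frame columns and function are hypothetical.

```python
# Generic sketch: compare job component validity (JCV) estimates with
# observed validity coefficients paired by job title. Column names and
# data are hypothetical; the abstract above is truncated, so this only
# illustrates the general comparison implied by the title.
import pandas as pd
from scipy import stats

def compare_validities(df):
    # df: one row per job title with columns "jcv_estimate" and
    # "observed_validity".
    r, p = stats.pearsonr(df["jcv_estimate"], df["observed_validity"])
    mean_diff = (df["jcv_estimate"] - df["observed_validity"]).mean()
    return {"correlation": r, "p_value": p, "mean_difference": mean_diff}
```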
|