121

Students' responses to content specific open-ended mathematics tasks: describing activities and difficulties of classroom participants

Siu, Yuet-ming. January 2006 (has links)
Thesis (M. Ed.)--University of Hong Kong, 2006.
122

Historical information and judgment in pupils of elementary schools

Van Wagenen, Marvin James, January 1919 (has links)
Thesis--Columbia University, 1918.
123

An empirical study of the variability of reliability coefficients and standard errors of measurement in certain defined and intact groups for a variety of standardized tests.

Sullivan, Arthur Francis January 1954 (has links)
Thesis (Ed.D.)--Boston University / Please note: the original document's page numbers are out of order. The digitized item made available here is true to the original.
124

Application of information theory to test item weighting.

Dolansky, Marie Plodrova January 1952 (has links)
Thesis (Ed.D.)--Boston University.
125

The construction and evaluation of a language vocabulary test for the intermediate grades

Friis, Clayton Albert January 1954 (has links)
Thesis (Ed.D.)--Boston University.
126

An Examination of the Impact of Residuals and Residual Covariance Structures on Scores for Next Generation, Mixed-Format, Online Assessments with the Existence of Potential Irrelevant Dimensions Under Various Calibration Strategies

Bukhari, Nurliyana 23 August 2017 (has links)
In general, newer educational assessments are deemed more demanding than students are currently prepared to face. Two types of factors may contribute to test scores: (1) factors or dimensions that are of primary interest to the construct or test domain; and (2) factors or dimensions that are irrelevant to the construct, causing residual covariance that may impede the assessment of psychometric characteristics and jeopardize the validity of the test scores, their interpretations, and their intended uses. To date, researchers performing item response theory (IRT)-based simulation research in educational measurement have not been able to generate data that mirrors the complexity of real testing data, owing to the difficulty of separating different types of errors from multiple sources and to comparability issues across different psychometric models, estimators, and scaling choices.

Using the context of the next-generation K-12 assessments, I employed a computer simulation to generate test data under six test configurations. Specifically, I generated tests that varied in the sample size of examinees, the degree of correlation between four primary dimensions, the number of items per dimension, and the discrimination levels of the primary dimensions. I also explicitly modeled potential nuisance dimensions in addition to the four primary dimensions of interest, using varying degrees of correlation when two nuisance dimensions were modeled. I used this approach for two purposes. First, I aimed to explore the effects that two calibration strategies have on the structure of residuals of such complex assessments when the nuisance dimensions are not explicitly modeled during calibration and when tests differ in configuration. The two calibration models were a unidimensional IRT (UIRT) model and a multidimensional IRT (MIRT) model; both considered only the four primary dimensions of interest. Second, I wanted to examine the residual covariance structures as the six test configurations varied; residual covariance here would indicate statistical dependencies due to unintended dimensionality.

I employed Luecht and Ackerman's (2017) expected response function (ERF)-based residuals approach to evaluate the performance of the two calibration models and to prune the bias-induced residuals from the other measurement errors. Their approach provides four types of residuals that are comparable across different psychometric models and estimation methods, and hence are "metric-neutral": (1) e0, the total residuals or total errors; (2) e1, the bias-induced residuals; (3) e2, the parameter-estimation residuals; and (4) e3, the estimated model-data-fit residuals.

With regard to my first purpose, I found that the MIRT model tends to produce less estimation error than the UIRT model on average (e2 for MIRT is less than e2 for UIRT) and tends to fit the data better on average (e3 for MIRT is less than e3 for UIRT). With regard to my second purpose, my analyses of the correlations of the bias-induced residuals show the large impact of the presence of a nuisance dimension, regardless of how many there are. On average, the residual correlations increase when at least one nuisance dimension is present but tend to decrease with high item discriminations.

My findings highlight the need to consider the choice of calibration model, especially when there are both intended and unintended indications of multidimensionality in the assessment. Essentially, I applied a cutting-edge technique, the ERF-based residuals approach (Luecht & Ackerman, 2017), that permits measurement errors (systematic or random) to be cleanly partitioned, understood, examined, and interpreted, in context and relative to difference-that-matters criteria, regardless of the choice of scaling, calibration model, and estimation method. I conducted this work in the context of the complex reality of the next-generation K-12 assessments, maintaining adherence to established educational measurement standards (American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME), 1999, 2014; International Test Commission (ITC), 2005a, 2005b, 2013a, 2013b, 2014, 2015).
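As a rough illustration of the data-generation step the abstract describes, the following sketch draws compensatory multidimensional 2PL responses with four correlated primary dimensions and one nuisance dimension; all parameter ranges, correlations, and sizes are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_examinees, n_items = 1000, 40

# Four correlated primary traits plus one independent nuisance trait.
cov = np.zeros((5, 5))
cov[:4, :4] = 0.6                          # assumed inter-dimension correlation
np.fill_diagonal(cov, 1.0)
theta = rng.multivariate_normal(np.zeros(5), cov, size=n_examinees)

# Each item loads mainly on one primary dimension, weakly on the nuisance.
a = np.zeros((n_items, 5))
a[np.arange(n_items), np.arange(n_items) % 4] = rng.uniform(0.8, 2.0, n_items)
a[:, 4] = rng.uniform(0.1, 0.5, n_items)   # nuisance loadings
d = rng.normal(0.0, 1.0, n_items)          # item intercepts

# Compensatory multidimensional 2PL: P(correct) = logistic(a.theta + d).
p = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))
responses = (rng.random((n_examinees, n_items)) < p).astype(int)
```

Calibrating these responses with a UIRT or a four-dimensional MIRT model (neither of which "sees" the fifth column of loadings) would then leave the nuisance dimension's influence in the residuals, which is the effect the study examines.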
127

An investigation of the relationship between the relevance category of achievement test items and their indices of discrimination

McKie, Thomas Douglas Muir January 1962 (has links)
It was hypothesized that on an achievement test, items measuring complex cognitive objectives would exhibit a higher mean discrimination index — based on the whole test as criterion — than would an equal number of items measuring less complex cognitive objectives; and that the mean discrimination index of these items would in turn be higher than that of the same number of still less complex items. The proviso was made that the difficulty indices of the items be similarly distributed within the several categories of items, hereafter called "relevance categories," since discriminating power is related to difficulty. The categories selected were, from simplest to most complex, the Knowledge, Comprehension, and Application categories of Bloom's "Taxonomy of Educational Objectives." An achievement test was constructed, consisting of items in all three categories and covering the content of two units of the British Columbia university-programme grade nine science course. A try-out of this test, on 200 students in two schools, permitted negatively discriminating items to be rejected and, in addition, provided difficulty indices for the remaining items. It was possible to match forty Knowledge items and forty Comprehension items very closely for difficulty; however, the mean difficulty of the Application items was so high that they could not be used in a test of the hypothesis without reducing numbers too drastically in all categories. Two "equivalent forms," matched for content, relevance category, and difficulty, were constructed from these eighty items and administered to 530 students in three schools. The reliability coefficient of the total test, estimated by correlating the sub-test scores and applying the Spearman-Brown formula, was .84; those of the Knowledge and Comprehension categories were similarly found to be .69 and .77, respectively. Revised difficulty indices, based on the new and larger sample, were calculated. Their distributions within the two relevance categories were found to be very similar, though not as closely matched as on the basis of the try-out test. For each item, the point-biserial coefficient of correlation between item and total score was computed — this being the selected index of discrimination — and Fisher's z-transformation was applied to produce measures with a more nearly equal-unit scale, in the hope that the parametric t-test could be used. However, the shapes of the resulting distributions were such that they could not be claimed to be samples from a normal population or populations. Accordingly, the t-test was rejected in favour of the non-parametric Mann-Whitney test of "no difference in median discrimination indices." The respective medians were .27 and .30 in terms of Fisher's z-values, but the difference proved to be non-significant at the pre-selected 1% level of significance. It was concluded that this experiment provided no grounds for accepting the hypothesis of the study. However, the actual probability of obtaining, in random sampling from a single population, a difference as large as that observed was only about .10; in addition, the results consistently favoured the Comprehension items, whose discrimination indices exceeded those of the Knowledge items at the extremes as well as at the mean. It was therefore suggested that if adequate testing time could be obtained, the use of larger numbers of items in all categories might increase test reliability and possibly produce a significant result. Suggestions were advanced, based upon observations from the data, for refining the experiment and for further research. / Education, Faculty of / Graduate
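The discrimination analysis described here (point-biserial item-total correlations, Fisher's z-transformation, then a Mann-Whitney test between relevance categories) can be sketched as follows; the simulated responses are placeholders rather than the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_students, n_items = 530, 80                 # the study's final administration
responses = rng.integers(0, 2, size=(n_students, n_items))  # 0/1 item scores
category = np.array([0] * 40 + [1] * 40)      # 0 = Knowledge, 1 = Comprehension

total = responses.sum(axis=1)                 # whole test as criterion

# Point-biserial item-total correlation as the discrimination index.
r_pb = np.array([stats.pointbiserialr(responses[:, j], total)[0]
                 for j in range(n_items)])

z = np.arctanh(r_pb)                          # Fisher's z-transformation

# Non-parametric test of "no difference in median discrimination".
u, p = stats.mannwhitneyu(z[category == 0], z[category == 1])
print(f"median z: {np.median(z[category == 0]):.2f} (Knowledge), "
      f"{np.median(z[category == 1]):.2f} (Comprehension), p = {p:.3f}")
```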
128

A comparison of Goodenough and Stanford-Binet scores of children referred to a mental hygiene clinic

Fox, Jack Frank January 1960 (has links)
The purpose of the study was to compare the Goodenough and Stanford-Binet scores of children referred to a Mental Hygiene Clinic. The sample was composed of 150 children between the ages of six and twelve. The relationship between the two sets of scores was examined, comparing the IQ's, MA's, and MA's with CA held constant. Differences associated with age and sex, individual differences between the two scores, and the range of variations were noted. The inter-scorer reliability of the Goodenough was also investigated. The Goodenough inter-scorer reliability was found to be very high. The correlation between the Goodenough and Stanford-Binet scores was slight for the IQ's, higher for the MA's, and still higher for the MA's when the CA was held constant. A large majority of the children scored higher on their Stanford-Binet than on their Goodenough tests, and there were many large differences between a child's score on the Binet and his score on the Goodenough. There were no differences found between the two instruments that were associated with age, but there were marked differences associated with sex. The boys' drawings were much better indicators of their intellectual abilities, as measured by the Stanford-Binet, than the girls'. It was concluded that the Goodenough Draw-a-Man Test could not be safely used as an independent test of an individual's intelligence in a Mental Hygiene Clinic. / Arts, Faculty of / Psychology, Department of / Graduate
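The comparison of MA's with CA held constant amounts to a first-order partial correlation; a minimal sketch, with hypothetical variable names:

```python
import numpy as np

def partial_r(x, y, z):
    """Pearson correlation of x and y with z (here, CA) partialled out."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# e.g. partial_r(goodenough_ma, binet_ma, chronological_age)
```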
129

Effects of pre-testing commercial pesticide applicators prior to engaging in a short adult education activity

Hlatky, Robert M. January 1973 (has links)
The purposes of this study were to determine the relationships of participant socio-economic characteristics to the post-test, to investigate the effects of pre-testing in a short-term adult education programme, and to assess the influence of pre-course utilization of the handbooks on pre-test and post-test scores. The study was carried out on a group of 324 commercial pesticide applicators who attended 16 individual short courses conducted in 1972 by the British Columbia Department of Agriculture as a means of upgrading the participants' knowledge of the safe and proper uses of pesticides. The design used was a modification of the pre-test/post-test control group type, with 135 individuals assigned to the treatment condition and 189 assigned to the control. Three hypotheses were tested in the study. The hypothesis of primary concern was whether pre-testing the participants significantly improved their post-test scores. A second hypothesis was tested to determine whether a relationship existed between the socio-economic variables and the post-test scores. A final hypothesis was tested to determine whether the intensity of pre-course handbook utilization significantly influenced pre-test and post-test mean scores. No differential effect due to significant treatment-control differences was observed for the variables: area of origin of participants, proportion of salary earned from pesticide application, previous attendance at BCDA-sponsored short courses, previous attendance at related non-BCDA short courses, and number of pesticide application certificates held. The control group was of significantly higher age, had a longer period of residence in Canada, and had more experience as pesticide applicators than the treatment group. The effect of each of these characteristics upon the post-test was negligible because of its low individual correlation with the post-test scores. The three variables (previous attendance at BCDA-sponsored short courses, previous attendance at related non-BCDA short courses, and number of pesticide application certificates held) exhibited a significantly high degree of mutual inter-correlation, indicating that they were measuring a common factor such as a need to participate. Both educational level and pre-test scores significantly influenced the post-test mean score, although the influence of the latter was definitely more pronounced. The intensity of handbook utilization positively influenced only the post-test mean score of those participants who received no pre-test, indicating that the pre-test was a better means of improving the post-test mean score than pre-course distribution of the handbooks. / Education, Faculty of / Graduate
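The primary hypothesis, whether pre-testing improved post-test scores, reduces to comparing post-test means between the pre-tested and control groups; a minimal sketch using the study's group sizes but hypothetical scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical post-test scores; the study's groups were n = 135 and n = 189.
post_pretested = rng.normal(70, 10, 135)   # treatment: received a pre-test
post_control = rng.normal(68, 10, 189)     # control: no pre-test

t, p = stats.ttest_ind(post_pretested, post_control)
print(f"t = {t:.2f}, p = {p:.3f}")
```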
130

The construction of a criterion-referenced physical education knowledge test

Wilson, Gail E. January 1980 (has links)
Throughout the last two decades, physical educators have worked to develop a specific body of knowledge. Associated with the formation of this body of knowledge has been a trend among most physical educators to include a cognitive objective as one of the stated aims in their physical education curricula. As a result, the need for adequate knowledge assessment instruments has become apparent. Although some assessment of knowledge in physical and health education has occurred since the late 1920's, the majority of tests developed to date are directed towards the evaluation of knowledge in specific sports or activities. Relatively few tests are available that assess general knowledge concepts in physical education. As well, all of the knowledge tests that have been produced are norm-referenced instruments; that is, they have been constructed for the purpose of ranking individuals and comparing differences among them. The purpose of this study was to design a criterion-referenced test which would assess the physical education knowledge of grade eleven high school students in British Columbia and which could function as a measurement instrument for the evaluation of groups or classes. As a criterion-referenced assessment tool, the knowledge test assesses the performance of individuals against objectives previously formulated by the Learning Assessment Branch of the Ministry of Education in British Columbia. In order to prepare a table of specifications for the design of the test, the specific objectives to be measured were grouped into six sub-test areas. Multiple-choice items were then constructed according to the requirements of the table of specifications. For the initial pilot administration, two test forms of 48 items each were developed, each including three of the six sub-test areas. One half of the 288 students to whom the first pilot was administered answered Form A, while the remaining students answered Form B. Following the administration of pilot test 1, the results were analysed by the Laboratory of Educational Research Test Analysis Package (LERTAP) and were subjectively reviewed by an advisory panel. As a result of these procedures, 70 items were retained for use on the second pilot test. This test was administered to 133 students and the results were again analysed subjectively and psychometrically. Thirty-eight items from pilot test 2 were considered acceptable for use on the final pilot test. In order to maintain adherence to the table of specifications, nine new items were developed and, after approval by the advisory panel, were included on the third test form. This form was given to 800 grade eleven students and the responses of 250 randomly selected students were analysed by the LERTAP procedure. The analysis indicated that all items were psychometrically sound, and the reliability of this form was estimated at .71. Thus, the items utilized during the third pilot administration constituted the final form of the knowledge test. The test is suitable for evaluating groups, and the six sub-tests, as well as the total test, can be used to identify strengths and weaknesses within programs. / Education, Faculty of / Kinesiology, School of / Graduate
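For dichotomously scored multiple-choice items, an internal-consistency reliability like the .71 reported here is conventionally estimated with KR-20 (equivalent to Cronbach's alpha on 0/1 data); a minimal sketch, with the caveat that LERTAP's actual procedure is not detailed in the abstract, so KR-20 is an assumption:

```python
import numpy as np

def kr20(scores):
    """KR-20 reliability for an (examinees x items) array of 0/1 scores."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                     # item difficulty indices
    var_total = scores.sum(axis=1).var(ddof=1)  # total-score variance
    return (k / (k - 1)) * (1 - (p * (1 - p)).sum() / var_total)
```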
