51 |
Generalizability and Reproducibility of Search Engine Online User Studies
Xu, Zijian 11 June 2020
Research in interactive information retrieval (IR) usually relies on lab or online user studies. A key concern with these studies is the generalizability and reproducibility of their results, especially when only a limited number of participants are involved. The interactive IR community, however, has no commonly agreed guideline on how many participants should be recruited. We study this fundamental research protocol issue by using simulation-based approaches to examine how the generalizability and reproducibility of results vary with the number of participants. Specifically, we collect observations from a relatively large number of participants for a representative interactive IR experiment setting, using crowdsourced online user studies. We then sample smaller numbers of participants' results from the collected observations to simulate online user studies at a smaller scale. We empirically analyze the patterns of generalizability and reproducibility for different dependent variables and draw conclusions about the optimal number of participants. Our study contributes to interactive information retrieval research by 1) establishing a methodology for evaluating the generalizability and reproducibility of results, and 2) providing guidelines regarding the optimal number of participants for search engine user studies. / Master of Science / In the domain of information retrieval, researchers usually ask human participants to interact with, test, and evaluate a novel system in what is called a user study. However, researchers often perform these studies with small samples, some recruiting fewer than 20 participants, which casts doubt on the generalizability and reproducibility of the results. Generalizability refers to how reliably results obtained from a relatively small sample in an experimental setting can be generalized to a larger population. Reproducibility refers to whether the results from two groups of the same sample size are consistent with each other. To examine the generalizability and reproducibility of online user studies of interactive information retrieval systems, we conducted an online user study with a large sample. We reproduced a well-recognized lab user study by Kelly et al. (2015) in an online environment. We established a simulation-based methodology for evaluating the generalizability and reproducibility of the results and then provided guidelines regarding the optimal number of participants for search engine user studies.
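The subsampling idea in this abstract can be illustrated with a minimal sketch. It assumes a single numeric dependent variable per participant and uses mean absolute differences as the generalizability and reproducibility measures; the function name `simulate_subsamples`, the synthetic pool, and both measures are illustrative assumptions, not the thesis's actual procedure.

```python
import random
import statistics

def simulate_subsamples(observations, sample_size, n_trials=1000, seed=0):
    """Draw pairs of disjoint participant subsamples and compare their means.

    Returns (mean generalizability error, mean reproducibility gap).
    Both measures are illustrative choices, not taken from the thesis.
    """
    rng = random.Random(seed)
    full_mean = statistics.fmean(observations)
    gen_errors = []  # |subsample mean - full-pool mean|
    rep_gaps = []    # |mean of group A - mean of group B|
    for _ in range(n_trials):
        drawn = rng.sample(observations, 2 * sample_size)
        group_a, group_b = drawn[:sample_size], drawn[sample_size:]
        mean_a = statistics.fmean(group_a)
        mean_b = statistics.fmean(group_b)
        gen_errors.append(abs(mean_a - full_mean))
        rep_gaps.append(abs(mean_a - mean_b))
    return statistics.fmean(gen_errors), statistics.fmean(rep_gaps)

# Hypothetical pool of 200 participants' scores (synthetic data).
pool_rng = random.Random(42)
pool = [pool_rng.gauss(3.5, 1.0) for _ in range(200)]

for n in (10, 20, 50):
    gen_err, rep_gap = simulate_subsamples(pool, n)
    print(n, round(gen_err, 3), round(rep_gap, 3))
```

Both quantities should shrink as the simulated sample size grows; a guideline on participant numbers would then depend on what error a study can tolerate.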
|
52 |
Evaluation of Continuous Friction Measuring Equipment (CFME) for Supporting Pavement Friction Management Programs
Najafi, Shahriar 28 December 2010
It is the responsibility of pavement engineers to design pavements that provide safe and smooth riding surfaces over their entire life cycle. Each year many people around the world lose their lives in vehicle crashes, which are one of the leading causes of death in the United States (US). One of the contributing factors in many of these crashes is inadequate friction between tires and the pavement. To minimize the impact of this factor, state Departments of Transportation (DOTs) must monitor the friction of their pavement networks systematically and regularly. Several devices are used around the world for measuring friction. Locked-wheel skid trailers are the predominant technology for roadways in the US. However, Continuous Friction Measuring Equipment (CFME) is emerging as a practical alternative, especially for network-level monitoring. This type of technology has been used for monitoring runway friction for many years and is also starting to be used for measuring roadway friction.
This thesis evaluates the different operational characteristics of CFME to provide guidelines for highway agencies interested in using this technology for supporting their friction management programs. It follows a manuscript format and is composed of two papers. The first part of the thesis presents a methodology to objectively synchronize and compare CFME measurements using cross-correlation. This methodology allows for comparing the “shape” of the friction profiles, instead of only the average friction values. The methodology is used for synchronizing friction measurements and assessing the repeatability and reproducibility of the CFME using friction measurements taken on a wide range of surfaces at the Virginia Smart Road. The proposed approach provides highway agencies with a rigorous method to process CFME measurements.
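As a rough illustration of synchronizing two friction profiles by cross-correlation, the sketch below searches for the sample offset that maximizes the Pearson correlation between two runs. The helper names (`pearson`, `best_lag`), the synthetic profile, and the fixed search window are assumptions made for this example; the thesis's own procedure may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return num / den if den else 0.0

def best_lag(reference, other, max_lag):
    """Return the lag maximizing correlation when reference[i] pairs with other[i + lag]."""
    def corr_at(lag):
        # Keep only the indices where both profiles have a sample.
        idx = [i for i in range(len(reference)) if 0 <= i + lag < len(other)]
        return pearson([reference[i] for i in idx], [other[i + lag] for i in idx])
    return max(range(-max_lag, max_lag + 1), key=corr_at)

# Synthetic friction profile; run_b starts 3 samples earlier along the road than run_a.
profile = [0.6 + 0.1 * math.sin(0.3 * i) + 0.05 * math.sin(1.1 * i) for i in range(120)]
run_a = profile[5:105]
run_b = profile[2:102]

lag = best_lag(run_a, run_b, max_lag=10)
print("estimated lag in samples:", lag)
```

Once the lag is known, the two runs can be shifted into alignment and compared point by point, which is what makes it possible to compare profile shape and not only average friction.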
The second part of the thesis evaluates the impact of several operational characteristics on the CFME measurements using a field experiment. The results of the experiment confirmed that the measurements are significantly affected by (1) the direction of testing while testing on sections of road with a significant grade, (2) water film thickness, and (3) testing speed. The experiment showed that measurements taken downhill on a 6% grade were significantly higher than those taken uphill. The analysis also verified that, consistent with previous studies, the measured friction decreases with higher water depth and testing speeds. It also showed that the change of friction with speed is approximately linear over the range of speeds used in the experiment.
In general, the thesis results suggest that CFME can provide repeatable and reproducible friction profiles that can be used to support friction management programs and other asset management business functions. However, care should be taken with regard to the operational conditions during testing, since the measurements are affected by several factors. Further research is needed to (1) quantify the effect of these, and potentially other, operational factors; and (2) establish standard testing conditions and approaches for correcting measurements taken under other conditions. / Master of Science
|
53 |
Investigating the Reproducibility of NPM packages
Goswami, Pronnoy 19 May 2020
The meteoric rise in the popularity of JavaScript and its large developer community have led to the emergence of a large ecosystem of third-party packages available via the Node Package Manager (NPM) repository, which contains over one million published packages and sees a billion daily downloads. Most developers download these pre-compiled published packages from the NPM repository instead of building them from the available source code. Unfortunately, recent articles have revealed repackaging attacks on NPM packages. To carry out such attacks, the attackers primarily follow three steps: (1) download the source code of a highly depended-upon NPM package, (2) inject malicious code, and (3) publish the modified package either as a misnamed package (i.e., a typo-squatting attack) or as the official package on the NPM repository using compromised maintainer credentials. These attacks highlight the need to verify the reproducibility of NPM packages. Reproducible Builds is a concept that allows the verification of build artifacts for pre-compiled packages by rebuilding the packages using the same build environment configuration documented by the package maintainers. This motivates us to conduct an empirical study (1) to examine the reproducibility of NPM packages, (2) to assess the influence of any non-reproducible packages, and (3) to explore the reasons for non-reproducibility. First, we downloaded all versions/releases of the 226 most depended-upon NPM packages and built each version from the available source code on GitHub. Second, we applied diffoscope, a differencing tool, to compare the versions we built against the versions downloaded from the NPM repository. Finally, we systematically investigated the reported differences. At least one version of 65 packages was found to be non-reproducible. Moreover, these non-reproducible packages are downloaded millions of times per week, so they could affect a large number of users. 
Based on our manual inspection and static analysis, most reported differences were semantically equivalent but syntactically different. Such differences arise from non-deterministic factors in the build process. We also infer that semantic differences are introduced by shortcomings in JavaScript uglifiers. Our research reveals the challenges of verifying the reproducibility of NPM packages with existing tools, reveals the points of failure using case studies, and sheds light on future directions for developing better verification tools. / Master of Science / Software packages are distributed as pre-compiled binaries to facilitate software development. There are various package repositories for various programming languages, such as NPM (JavaScript), pip (Python), and Maven (Java). Developers install these pre-compiled packages in their projects to implement certain functionality. Additionally, these package repositories allow developers to publish new packages and help the developer community reduce delivery time and enhance the quality of the software product. Unfortunately, recent articles have revealed an increasing number of attacks on package repositories. Moreover, developers trust the pre-compiled binaries, which often contain malicious code. To address this challenge, we conduct an empirical investigation to analyze the reproducibility of NPM packages for the JavaScript ecosystem. Reproducible Builds is a concept that allows any individual to verify build artifacts by replicating the build process of software packages. For instance, if developers could verify that the build artifacts of the pre-compiled software packages available in the NPM repository are identical to the ones generated when they build that specific package themselves, they could become aware of and mitigate vulnerabilities in the software packages. The build process is usually described in configuration files such as package.json and DOCKERFILE. 
We chose the NPM registry for our study for three primary reasons: (1) it is the largest package repository, (2) JavaScript is the most widely used programming language, and (3) no prior dataset or investigation by researchers exists. We took a two-step approach in our study: (1) dataset collection, and (2) source-code differencing for each pair of software package versions. For the dataset collection phase, we downloaded all available releases/versions of 226 popularly used NPM packages, and for the code-differencing phase, we used an off-the-shelf tool called diffoscope. We revealed some interesting findings. First, at least one version of 65 packages was found to be non-reproducible, and these packages have millions of downloads per week. Second, we found 50 package-versions to have divergent program semantics, which highlights potential vulnerabilities in the source code and improper build practices. Third, we found that the uglification of JavaScript code introduces non-determinism in the build process. Our research sheds light on the challenges of verifying the reproducibility of NPM packages with the current state-of-the-art tools and the need to develop better verification tools in the future. To conclude, we believe that our work is a step towards realizing the reproducibility of NPM packages and making the community aware of the implications of non-reproducible build artifacts.
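The core check behind a reproducibility comparison can be sketched as a byte-level comparison of the published artifact against a local rebuild. The sketch below is a simplified stand-in, not diffoscope and not the study's pipeline: it hashes in-memory file contents, and the file names and contents are hypothetical.

```python
import hashlib

def file_digests(files):
    """Map each relative path to the SHA-256 hex digest of its bytes."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def compare_artifacts(published, rebuilt):
    """Report files that are missing on either side or whose bytes differ."""
    pub, reb = file_digests(published), file_digests(rebuilt)
    return {
        "only_published": sorted(set(pub) - set(reb)),
        "only_rebuilt": sorted(set(reb) - set(pub)),
        "changed": sorted(p for p in set(pub) & set(reb) if pub[p] != reb[p]),
    }

# Hypothetical package contents: the minified file differs only in a local
# variable name, i.e. syntactically different but semantically equivalent.
published = {
    "package.json": b'{"name": "example", "version": "1.0.0"}',
    "index.min.js": b"function a(b){return b+1}",
}
rebuilt = {
    "package.json": b'{"name": "example", "version": "1.0.0"}',
    "index.min.js": b"function a(c){return c+1}",
}

report = compare_artifacts(published, rebuilt)
reproducible = not any(report.values())
print(report, reproducible)
```

A hash mismatch only says that the bytes differ. Deciding whether a difference is semantic, as the abstract describes, still requires inspecting the changed files.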
|
54 |
The applicability of a validated team-based learning student assessment instrument to assess United Kingdom pharmacy students’ attitude toward team-based learning
Nation, L.M., Tweddell, Simon, Rutter, P. 2016 August 1929
Yes / Purpose:
This study aimed to test the applicability of a validated team-based learning student assessment instrument (TBL-SAI) for assessing United Kingdom (UK) pharmacy students’ attitude toward team-based learning.
Methods:
The TBL-SAI, consisting of 33 items, was administered to undergraduate pharmacy students from two schools of pharmacy, at the University of Wolverhampton and the University of Bradford, that used TBL as a primary instructional method across credit-bearing modules. Validity and reliability tests were conducted on the data, along with comparisons between the two schools.
Results:
The student response rate for completing the instrument was 80.0% (138/173). Overall, the instrument demonstrated validity and reliability when used with pharmacy students. Sub-analysis between schools of pharmacy did, however, show that four items from the Wolverhampton data had factor loadings of less than 0.40. No item in the Bradford data had a factor loading of less than 0.40. Cronbach’s alpha indicated reliability at 0.897 for the total instrument (Wolverhampton, 0.793; Bradford, 0.902). Students showed a preference for TBL, with Bradford’s scores being statistically higher (P < 0.005).
Conclusion:
This validated instrument has demonstrated reliability and validity when used with pharmacy students. Furthermore, students at both schools preferred TBL to traditional teaching.
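For readers unfamiliar with the reliability statistic reported above, the following sketch computes Cronbach's alpha from a respondents-by-items score matrix using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The response data are hypothetical and are not taken from the study.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-respondent score lists (one score per item)."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # transpose to per-item columns
    item_var_sum = sum(statistics.variance(item) for item in items)
    total_var = statistics.variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)

# Hypothetical 5-point Likert responses: 6 respondents, 4 items.
responses = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
]
print(round(cronbach_alpha(responses), 3))
```

The function assumes every respondent answered every item and that the total scores are not all identical, since a zero total variance would make the ratio undefined.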
|
55 |
Repeatability and reproducibility of Macular Thickness Measurements Using Fourier Domain Optical Coherence Tomography
Bruce, Alison, Pacey, Ian E., Dharni, Poonam, Scally, Andy J., Barrett, Brendan T. January 2009
No / To evaluate the repeatability and reproducibility of macular thickness measurements in visually normal eyes using the Topcon 3D OCT-1000.
Methods: Phase 1 investigated scan repeatability and the effects of age and pupil dilation. Two groups (6 younger and 6 older participants) had one eye scanned 5 times pre- and post-dilation by 1 operator. Phase 2 investigated between-operator, within-visit and between-visit reproducibility. 10 participants had 1 undilated eye scanned 3 times on 2 separate visits by 2 operators.
Results: Phase 1: No significant difference existed between repeat scans (p=0.75), and no significant difference was found pre- and post-dilation (p=0.54). In the younger group, variation was low (95% limits ± 3.62 µm) and comparable across all retinal regions. The older group demonstrated greater variation (95% limits ± 7.6 µm).
Phase 2: For a given retinal location, the 95% confidence limits for within-operator, within-visit reproducibility were 5.16 µm. This value increased to 5.56 µm for the same operator over two visits and to 6.18 µm for two operators over two visits.
Conclusion: A high level of repeatability, close to 6 µm, of macular thickness measurement is possible using the 3D OCT-1000. Measured differences in macular thickness between successive visits that exceed 6 µm in pre-presbyopic individuals are therefore likely to reflect actual structural change. OCT measures are more variable in older individuals, and it is advisable to take a series of scans so that outliers can be more easily identified.
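One common way to turn repeated scans into a repeatability limit is the within-subject standard deviation (Sw) and the coefficient 1.96 × √2 × Sw. The abstract does not state how its 95% limits were computed, so the sketch below is an assumption about a typical approach, with hypothetical thickness values in µm and an equal number of scans per eye.

```python
import math
import statistics

def within_subject_sd(scans_by_subject):
    """Square root of the mean per-subject variance (assumes equal scans per subject)."""
    variances = [statistics.variance(scans) for scans in scans_by_subject]
    return math.sqrt(statistics.fmean(variances))

def repeatability_limit(scans_by_subject):
    """95% limit for the difference between two repeat measurements on one subject."""
    return 1.96 * math.sqrt(2) * within_subject_sd(scans_by_subject)

# Hypothetical macular thickness readings (µm): 3 eyes, 5 repeat scans each.
scans = [
    [251.0, 253.0, 252.0, 250.0, 252.0],
    [238.0, 239.0, 237.0, 238.0, 240.0],
    [262.0, 261.0, 263.0, 262.0, 260.0],
]
print(round(within_subject_sd(scans), 2), round(repeatability_limit(scans), 2))
```

Under this convention, a change between two scans of the same eye that exceeds the limit is unlikely to be measurement noise alone.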
|
56 |
Assessment of active commuting behaviour : walking and bicycling in Greater Stockholm
Stigell, Erik January 2011
Walking and bicycling to work, active commuting, can contribute to sustainable mobility and provide regular health-enhancing physical activity for individuals. Our knowledge of active commuting behaviours in general, and in different mode and gender groups in particular, is limited. Moreover, the validity and reproducibility of the methods used to measure the key variables of the behaviours are uncertain. The aims of this thesis are to explore gender and mode choice differences in commuting behaviours, in terms of distance, duration, velocity and trip frequency, in a group of adult commuters in Greater Stockholm, Sweden, and furthermore to develop a criterion method for distance measurements and to assess the validity of four other distance measurement methods. We used one sample of active commuters recruited by advertisements, n = 1872, and one street-recruited sample, n = 140. Participants received a questionnaire and a map to draw their commuting route on. The main findings of the thesis were, firstly, that the map-based method could function as a criterion method for active commuting distance measurements and, secondly, that four assessed distance measurement methods (straight-line distance, GIS, GPS and self-report) differed significantly from the criterion method. Therefore, we recommend the use of correction factors to compensate for the systematic over- and underestimations. We also found three distinctly different modality groups in both men and women, with different behaviours in commuting distance, duration and trip frequency. These groups were commuters who exclusively walk or bicycle the whole way to work, and dual mode commuters who switch between walking and cycling. These mode groups accrued different amounts of activity time for commuting. 
Through active commuting per se, the median pedestrian and dual mode commuters met or were close to the recommended physical activity level of 150 minutes per week during most months of the year, whereas the single mode cyclists did so only during the summer half of the year. / FAAP
|
57 |
Development and evaluation of a food frequency questionnaire to assess daily total flavonoid intake using a rooibos intervention study model
Venter, Irma 03 1900
Thesis (PhD)--Stellenbosch University, 2013. / ENGLISH ABSTRACT: A comprehensive food frequency questionnaire (FFQ) was developed to assess the daily total flavonoid intake over the past fortnight within a 14-week intervention that consisted of four periods and determined the effect of rooibos consumption on oxidative stress in adults (n=40) at intermediate to high coronary heart disease (CHD) risk. Within the intervention, the validity (against six estimated dietary records and biomarkers), reproducibility (on administrations six weeks apart in the washout and control periods, as these periods had similar flavonoid intake restrictions) and responsiveness (across the four intervention periods of changed dietary conditions) of the comprehensive FFQ were evaluated. The dietary sources in the baseline period dietary records and FFQ that contributed most to the participants’ daily total flavonoid intake (based on the percentage contribution) and to the between-person variation in intake (based on the stepwise multiple regression analysis) formed the food list of the resultant abbreviated FFQ. The validity, reproducibility and responsiveness of the abbreviated FFQ were also evaluated within the intervention, and its validity (against dietary records) and reproducibility (on re-administration two weeks apart) were evaluated in an additional group (n=90) at low and intermediate CHD risk to assess its external strength.
The validity and reproducibility evaluations of the comprehensive and abbreviated FFQs in the intervention, and of the abbreviated FFQ in the additional group, comprised paired difference tests (to establish the ability to estimate group intakes), correlation coefficients (to establish the ability to rank individual participants), category agreement and gross misclassification alongside the weighted kappa statistic (to establish the ability to classify the participants into tertiles and quintiles of intake) and Bland-Altman plots (to represent the limits of agreement between the two dietary assessment methods). Correlation coefficients were also used for the biomarker validity evaluations in the baseline period. Repeated measures analysis of variance (ANOVA) (Bonferroni correction) was used for the responsiveness evaluations of the comprehensive and abbreviated FFQs across the intervention periods, alongside that of the biomarkers as evidence of the changed dietary conditions.
The study demonstrated that the comprehensive FFQ could be modified to a format with a brief food list, as few items contributed appreciably to the total flavonoid intake and most of these also contributed to the between-person intake variability. In the validity evaluations, the comprehensive FFQ, and even more so the abbreviated FFQ, provided sufficiently accurate daily total flavonoid intake estimates. They could determine the intake at group level in correspondence with that of the dietary records. The participant intakes could additionally be categorized, and in particular ranked, very similarly to the dietary record intakes. The Bland-Altman plots revealed proportional bias in the form of overestimation at the higher intake level. The reproducibility also appeared to be highly satisfactory, although seasonal fruit exclusions from the abbreviated FFQ food list may hamper its repeated administration. Both FFQs also confirmed the changed total flavonoid intakes across the intervention periods, in line with changes in the expected direction in the plasma total polyphenol, conjugated diene and thiobarbituric acid reactive substance concentrations. / AFRIKAANS SUMMARY: A comprehensive food frequency questionnaire (FFQ) was developed to estimate the daily total flavonoid intake over two consecutive weeks within a 14-week intervention. The intervention consisted of four periods that determined the effect of rooibos intake on oxidative stress in adults (n=40) with an intermediate to high coronary heart disease (CHD) risk. Within the intervention, the validity (against six estimated dietary records and biochemical markers), repeatability (on administration six weeks apart in the washout and control intervention periods, which had the same flavonoid intake stipulations) and responsiveness (across four intervention periods of changed dietary stipulations) of the comprehensive FFQ were evaluated. The dietary sources in the baseline period dietary records and questionnaires that contributed most to the participants’ daily total flavonoid intake (based on the percentage contribution) and to the between-person variation in intake (based on the stepwise multiple regression analysis) formed the food list of the resulting abbreviated FFQ. The validity, repeatability and responsiveness of this FFQ were evaluated within the intervention, and its validity (against dietary records) and repeatability (re-administration two weeks later) in a further group (n=90) with low and intermediate CHD risk, as an evaluation of the external strength of the FFQ.
The validity and repeatability evaluations of the comprehensive and abbreviated FFQs in the intervention, and of the abbreviated FFQ in the further group, consisted of paired difference tests (to determine the ability to estimate group intake), correlation coefficients (to determine the ability to rank individual participants), category agreement and gross misclassification alongside the weighted kappa statistic (to determine the ability to classify the participant intakes into tertiles and quintiles) and Bland-Altman plots (to represent the limits of agreement between the two dietary intake methods). Correlation coefficients were also used for the biochemical marker validity evaluations in the baseline period. Repeated measures analysis of variance (ANOVA) (Bonferroni correction) was used to evaluate the responsiveness of the comprehensive and abbreviated FFQs across the intervention periods, alongside that of the biochemical markers, as evidence of the changed dietary stipulations.
The study indicated that the comprehensive FFQ could be modified to a format with an abbreviated food list, because only a number of items contributed appreciably to the total flavonoid intake, and most of these also to the between-person variation in intake. In their validity evaluations, the comprehensive and the abbreviated FFQ delivered sufficiently accurate daily total flavonoid intake estimates, because group intakes could be determined in agreement with those obtained from the dietary records, and the participant intakes could additionally be categorized and, in particular, ranked largely the same as with their dietary record intakes. A proportional overestimation at the higher intake levels was, however, shown for both in the Bland-Altman plots. The repeatability was also largely acceptable, although seasonal fruit exclusions in the abbreviated FFQ food list may complicate re-administration. Both questionnaires could also confirm the changes in the daily total flavonoid intake across the intervention periods, in agreement with changes in the expected direction in the plasma total polyphenol, conjugated diene and thiobarbituric acid reactive substance concentrations.
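The Bland-Altman limits of agreement mentioned in this abstract can be sketched as the mean difference between two methods plus or minus 1.96 standard deviations of the differences. The intake values below are hypothetical (mg/day), and the function name `bland_altman` is an illustrative choice; the thesis's own analysis may include further steps, such as checking for proportional bias.

```python
import statistics

def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) of agreement for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical daily total flavonoid intakes (mg/day) for 6 participants.
ffq_intake = [310.0, 420.0, 150.0, 505.0, 275.0, 390.0]
record_intake = [295.0, 400.0, 160.0, 450.0, 270.0, 365.0]

bias, lower, upper = bland_altman(ffq_intake, record_intake)
print(round(bias, 1), round(lower, 1), round(upper, 1))
```

A positive bias here would mean the questionnaire estimates higher intakes than the dietary records on average. Proportional bias would show up as differences that grow with the mean of the two methods.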
|
58 |
Exploration, quantification, and mitigation of systematic error in high-throughput approaches to gene-expression profiling : implications for data reproducibility
Kitchen, Robert Raymond January 2011
Technological and methodological advances in the fields of medical and life-sciences have, over the last 25 years, revolutionised the way in which cellular activity is measured at the molecular level. Three such advances have provided a means of accurately and rapidly quantifying mRNA, from the development of quantitative Polymerase Chain Reaction (qPCR), to DNA microarrays, and second-generation RNA-sequencing (RNA-seq). Despite consistent improvements in measurement precision and sample throughput, the data generated continue to be affected by high levels of variability due to the use of biologically distinct experimental subjects, practical restrictions necessitating the use of small sample sizes, and technical noise introduced during frequently complex sample preparation and analysis procedures. A series of experiments was performed during this project to profile sources of technical noise in each of these three techniques, with the aim of using the information to produce more accurate and more reliable results. The mechanisms for the introduction of confounding noise in these experiments are highly unpredictable. The variance structure of a qPCR experiment, for example, depends on the particular tissue-type and gene under assessment, while expression data obtained by microarray can be greatly influenced by the day on which each array was processed and scanned. RNA-seq, on the other hand, produces data that appear very consistent in terms of differences between technical replicates; however, there exist large differences when results are compared against those reported by microarray, which require careful interpretation. It is demonstrated in this thesis that by quantifying some of the major sources of noise in an experiment and utilising compensation mechanisms, either pre- or post-hoc, researchers are better equipped to perform experiments that are more robust, more accurate, and more consistent.
|
59 |
Dichotic Listening Test Performance In Children
Kelley, Kairn Stetler 01 January 2017
Dichotic tests evaluate binaural integration through simultaneous presentation of different stimuli to each ear of a listener who has normal hearing sensitivity in both ears. Dichotic listening deficits may lead to problems with language, communication, reading, or academic performance. If accurately identified, dichotic deficits may be treatable with listening training or managed with accommodation. However, it is not clear which of several commercially-available dichotic test recordings are best for audiologists to use when assessing binaural integration in children. A literature review revealed limited evidence of reliability, accuracy, usefulness, or value for dichotic tests applied to children. Of 11 dichotic tests identified, five reported some evidence of test-retest reliability. Correlation between results on repeated administration was moderate to good (r=0.59 to 0.92). Evidence of accuracy was identified for 5 tests but was not generalizable due to significant limitations in study design. No evidence was found to either support or dispute claims of usefulness or value. Since reliability is a necessary prerequisite for good test performance, we sought to directly compare test-retest reliability for three dichotic measures: SCAN-3 Competing Words (CW), Musiek's Double Dichotic Digits (DD-M), and Bergen Dichotic Listening Test with Consonant-Vowel Syllables (CV-B). Sixty English-speaking children, 7-14 years old with normal hearing, had a single study visit during which each test was administered twice. Changes on retest were compared to binomial model predictions, summarized by within-subject standard deviation (Sw), and compared among tests. Correlates of variance were explored. All 3 tests had reliability within the bounds predicted by the binomial model. Forty-item scores were more reliable (Sw=5%) than those based on 20-30 items (Sw=6-8%). No associations between participant characteristics and reliability were found. 
CW and DD-M were evaluated for evidence of agreement and decision consistency. Although participants were rank ordered similarly by right ear (ρ = 0.58), left ear (ρ = 0.51) and total (ρ = 0.73) scores, the tests did not agree on ranking by inter-aural asymmetry (ρ = 0.18). CW and DD-M did not agree on direction of ear advantage (κ = 0.01, p = 0.93) and had poor agreement on which children displayed dichotic deficits (κ = 0.22, p < 0.01). DD-M identified significantly more participants with deficits (n=18) than CW (n=3) (p < 0.001). Although dichotic procedures show moderate reliability, their precision is limited. Assessment of their accuracy is limited by the absence of a widely-accepted gold standard reference test, but two commonly used tests failed to agree on which children had deficits. The data do not yet support routine clinical use of dichotic tests of binaural integration with children. Additional research is needed to determine if there are any conditions under which dichotic procedures demonstrate usefulness or value.
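The κ values quoted above measure agreement between two tests beyond what chance alone would produce. A minimal sketch of unweighted Cohen's kappa follows; the classifications are hypothetical and are not the study's data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two tests' marginal proportions per category.
    expected = sum(
        count_a[c] * count_b[c] for c in set(count_a) | set(count_b)
    ) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical deficit classifications for 10 children from two tests.
test_1 = ["deficit", "normal", "normal", "normal", "deficit",
          "normal", "normal", "normal", "normal", "normal"]
test_2 = ["deficit", "deficit", "normal", "normal", "normal",
          "deficit", "normal", "normal", "deficit", "normal"]
print(round(cohen_kappa(test_1, test_2), 3))
```

The function is undefined when chance agreement equals 1, for example when both tests assign every child to the same single category, so a production version would need to guard that case.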
|
60 |
Validating multiple structural change models. A case study.
Zeileis, Achim, Kleiber, Christian January 2004
In a recent article, Bai and Perron (2003, Journal of Applied Econometrics) present a comprehensive discussion of computational aspects of multiple structural change models along with several empirical examples. Here, we report on the results of a replication study using the R statistical software package. We are able to verify most of their findings; however, some confidence intervals associated with breakpoints cannot be reproduced. These confidence intervals require computation of the quantiles of a nonstandard distribution, the distribution of the argmax functional of a certain stochastic process. Interestingly, the difficulties appear to be due to numerical problems in GAUSS, the software package used by Bai and Perron. / Series: Research Report Series / Department of Statistics and Mathematics
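To give a sense of what estimating a breakpoint involves, the sketch below finds a single break in the mean of a series by minimizing the total sum of squared residuals over all admissible split points. This is a deliberately simplified illustration: it is neither Bai and Perron's multiple-break dynamic programming algorithm nor the R or GAUSS code discussed in the report, and the function names and data are hypothetical.

```python
def sse(segment):
    """Sum of squared deviations of a segment from its own mean."""
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)

def best_single_break(series, min_size=2):
    """Index k that minimizes SSE(series[:k]) + SSE(series[k:]).

    Each segment must contain at least `min_size` observations.
    """
    candidates = range(min_size, len(series) - min_size + 1)
    return min(candidates, key=lambda k: sse(series[:k]) + sse(series[k:]))

# Hypothetical series with a shift in the mean after the fourth observation.
series = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0]
k = best_single_break(series)
print("break before index", k)
```

Confidence intervals for the break date, the part of the original study that could not be reproduced, require the quantiles of a nonstandard limiting distribution and are outside the scope of this sketch.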
|