71

Investigating data quality in question and answer reports

Mohamed Zaki Ali, Mona January 2016 (has links)
Data Quality (DQ) has been a long-standing concern for a number of stakeholders in a variety of domains. It has become a critically important factor for the effectiveness of organisations and individuals. Previous work on DQ methodologies has mainly focused on either the analysis of structured data or the business-process level rather than analysing the data itself. Question and Answer Reports (QAR) are gaining momentum as a way to collect responses that can be used by data analysts, for instance, in business, education or healthcare. Various stakeholders, such as data brokers and data providers, benefit from QAR, and in order to effectively analyse and identify the common DQ problems in these reports, the various stakeholders' perspectives should be taken into account, which adds further complexity to the analysis. This thesis investigates DQ in QAR through an in-depth DQ analysis and provides solutions that can highlight potential sources and causes of problems that result in "low-quality" collected data. The thesis proposes a DQ methodology that is appropriate for the context of QAR. The methodology consists of three modules: question analysis, medium analysis and answer analysis. In addition, a Question Design Support (QuDeS) framework is introduced to operationalise the proposed methodology through the automatic identification of DQ problems. The framework includes three components: question domain-independent profiling, question domain-dependent profiling and answers profiling. The proposed framework has been instantiated to address one example of DQ issues, namely the Multi-Focal Question (MFQ). We introduce the MFQ as a question with multiple requirements; it asks for multiple answers. QuDeS-MFQ (the implemented instance of the QuDeS framework) implements two of these components for MFQ identification: question domain-independent profiling and question domain-dependent profiling. The proposed methodology and the framework are designed, implemented and evaluated in the context of the Carbon Disclosure Project (CDP) case study. The experiments show that we can identify MFQs with 90% accuracy. This thesis also highlights the remaining challenges, including the lack of domain resources for domain knowledge representation (such as a domain ontology), the complexity and variability of the structure of QAR, the variability and ambiguity of terminology and language expressions, and the difficulty of understanding stakeholders' or users' needs.
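As a rough illustration of what a domain-independent question-profiling step of this kind might look like, the sketch below flags questions that appear to bundle several information requests using simple surface cues (multiple question marks, coordinated wh-words). The cues, function name and example are assumptions made for illustration only; the thesis's actual feature set is not reproduced here, and the reported 90% accuracy refers to QuDeS-MFQ, not to this sketch.

```python
import re

# Illustrative surface cues; the actual QuDeS-MFQ features are not described here.
COORDINATORS = re.compile(r"\b(and|as well as|in addition to)\b", re.IGNORECASE)
WH_WORDS = re.compile(r"\b(what|how|why|when|where|which|who)\b", re.IGNORECASE)

def looks_multi_focal(question: str) -> bool:
    """Flag a question that appears to bundle several information requests."""
    multiple_marks = question.count("?") > 1
    multiple_wh = len(WH_WORDS.findall(question)) > 1
    coordinated = bool(COORDINATORS.search(question)) and multiple_wh
    return multiple_marks or coordinated

# Hypothetical CDP-style question used only to exercise the sketch.
print(looks_multi_focal(
    "What is your total Scope 1 emission and how do you verify it?"))  # True
```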
72

Representing Data Quality in Sensor Data Streaming Environments

Lehner, Wolfgang, Klein, Anja 20 May 2022 (has links)
Sensors in smart-item environments capture data about product conditions and usage to support business decisions as well as production automation processes. A challenging issue in this application area is the restricted quality of sensor data due to limited sensor precision and sensor failures. Moreover, data stream processing to meet resource constraints in streaming environments introduces additional noise and decreases the data quality. In order to avoid wrong business decisions due to dirty data, quality characteristics have to be captured, processed, and provided to the respective business task. However, the issue of how to efficiently provide applications with information about data quality is still an open research problem. In this article, we address this problem by presenting a flexible model for the propagation and processing of data quality. The comprehensive analysis of common data stream processing operators and their impact on data quality supports sound data evaluation and reduces the risk of incorrect business decisions. Further, we propose a data quality model control that adapts the granularity of the data quality information to the interestingness of the data stream.
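As a sketch of how a quality attribute can travel through a stream operator, the example below attaches a measurement uncertainty to each reading and propagates it through a windowed average using standard error propagation under an independence assumption. The data structure, operator and propagation rule are illustrative assumptions, not the article's actual model.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List

@dataclass
class QualifiedValue:
    value: float   # sensor reading
    sigma: float   # estimated measurement uncertainty (standard deviation)

def windowed_mean(window: List[QualifiedValue]) -> QualifiedValue:
    """Average a window of readings and propagate the uncertainty.

    Assumes independent errors, so the standard deviation of the mean is
    sqrt(sum of variances) / n (plain error propagation); the article's
    per-operator propagation rules may differ.
    """
    n = len(window)
    mean = sum(q.value for q in window) / n
    sigma = sqrt(sum(q.sigma ** 2 for q in window)) / n
    return QualifiedValue(mean, sigma)

window = [QualifiedValue(20.1, 0.5), QualifiedValue(19.8, 0.5), QualifiedValue(20.4, 0.5)]
print(windowed_mean(window))  # uncertainty shrinks roughly as sigma / sqrt(n)
```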
73

Data quality in marine biotoxins’ risk assessment: Perceptions from data production to consumption

Katikou, Panagiota January 2022 (has links)
Marine biotoxins constitute one of the major hazards associated with seafood consumption. Risk assessments are essential for the effective management of problems arising from marine biotoxin occurrence, as they are a prerequisite for the establishment or periodic re-evaluation of marine biotoxin regulatory limits and for the adoption of appropriate risk management plans. Risk assessments are science-based, data-intensive processes, and their successful outcomes are largely dependent on the quality of data used when they are carried out. In fact, data-related challenges are the most frequently reported issues rendering most marine biotoxin risk assessments conducted to date inconclusive. Notably, data quality perceptions among the stakeholders involved in risk assessments may vary significantly, which may be a human factor influencing data quality. As such, the problem addressed in this thesis is the shortage of empirical information on how data quality is perceived by the different stakeholder roles involved in risk assessments relevant to marine biotoxin hazards. The focus of this thesis is thus to investigate the perceptions of diverse stakeholders within the information chain, namely data producers, collectors and consumers/users, regarding the quality of data used in risk assessments of marine biotoxin hazards, and thereby to provide a contribution directed towards data quality improvement. This was done by means of a survey, gathering data through interviews with a number of recognized marine biotoxin experts with documented experience in risk assessments. The research question of this study is: "What are the perceptions of data quality among diverse stakeholders along the information chain relevant to marine biotoxin risk assessments?" To answer the research question, the concept of data quality for marine biotoxin data destined for risk assessments was dissected into seven individual subtopics on which the perceptions of expert participants of all three roles were captured. The subtopics explored included: data quality challenges; changes in marine biotoxin data quality during the last decade; awareness of data quality legislation and standardization; importance of data quality dimensions, objectives and key performance indicators; importance of data quality-related feedback exchange between stakeholders of the relevant information chain; factors for successful adoption of harmonized, standardized formats for marine biotoxin data collection; and suggestions for data quality improvement. The perceptions gathered per subtopic were analyzed using inductive thematic analysis, yielding a total of twelve themes, namely communication, compound, data/quality control, Information Technology or Data Collection Framework, legislation, method, organization, people, policy, risk assessment procedure, society/environment and toxicological aspects, with each subtopic containing items categorized within several of these themes. Certain differences were observed in the perceptions of participants in different data roles, in the sense that data producers, and to a lesser extent data users, mostly focused on themes relevant to analytical methodology, compound particularities, data and quality control, toxicological aspects and policies. Data collectors' views, on the other hand, were more concentrated on items relevant to Information Technology or Data Collection Framework and organization.
It should be noted, however, that interpretation of these trends needs to account for the fact that many of the study participants held several overlapping roles, so the results should be generalized with caution. Nevertheless, they could constitute a basis for further research to generate deeper knowledge in the field of data quality in risk assessments relevant to marine biotoxins and to gain further insights into the differences in perceptions among data roles.
74

Data cleaning techniques for software engineering data sets

Liebchen, Gernot Armin January 2010 (has links)
Data quality is an important issue which has been addressed and recognised in research communities such as data warehousing, data mining and information systems. It has been agreed that poor data quality will impact the quality of results of analyses and that it will therefore impact on decisions made on the basis of these results. Empirical software engineering has neglected the issue of data quality to some extent. This fact poses the question of how researchers in empirical software engineering can trust their results without addressing the quality of the analysed data. One widely accepted definition for data quality describes it as 'fitness for purpose', and the issue of poor data quality can be addressed either by introducing preventative measures or by applying means to cope with data quality issues. The research presented in this thesis addresses the latter, with a special focus on noise handling. Three noise handling techniques, which utilise decision trees, are proposed for application to software engineering data sets. Each technique represents a noise handling approach: robust filtering, where training and test sets are the same; predictive filtering, where training and test sets are different; and filtering and polish, where noisy instances are corrected. The techniques were first evaluated in two different investigations by applying them to a large real-world software engineering data set. In the first investigation the techniques' ability to improve predictive accuracy under differing noise levels was tested. All three techniques improved predictive accuracy in comparison to the do-nothing approach. Filtering and polish was the most successful technique at improving predictive accuracy. The second investigation, utilising the same large real-world software engineering data set, tested the techniques' ability to identify instances with implausible values. These instances were flagged for the purpose of evaluation before applying the three techniques. Robust filtering and predictive filtering decreased the number of instances with implausible values, but substantially decreased the size of the data set too. The filtering and polish technique actually increased the number of implausible values, but it did not reduce the size of the data set. Since the data set contained historical software project data, it was not possible to know the real extent of noise detected. This led to the production of simulated software engineering data sets, which were modelled on the real data set used in the previous evaluations to ensure domain-specific characteristics. These simulated versions of the data set were then injected with noise, such that the real extent of the noise was known. After the noise injection the three noise handling techniques were applied to allow evaluation. This procedure of simulating software engineering data sets combined the incorporation of domain-specific characteristics of the real world with control over the simulated data. This is seen as a special strength of this evaluation approach. The results of the evaluation of the simulation showed that none of the techniques performed well. Robust filtering and filtering and polish performed very poorly, and based on the results of this evaluation they would not be recommended for the task of noise reduction. The predictive filtering technique was the best-performing technique in this evaluation, but it did not perform particularly well either.
An exhaustive systematic literature review has been carried out investigating to what extent the empirical software engineering community has considered data quality. The findings showed that the issue of data quality has been largely neglected by the empirical software engineering community. The work in this thesis highlights an important gap in empirical software engineering. It provides clarification of, and a distinction between, the terms noise and outliers. Noise and outliers overlap, but they are fundamentally different. Since noise and outliers are often treated the same by noise handling techniques, a clarification of the two terms was necessary. To investigate the capabilities of noise handling techniques, a single investigation was deemed insufficient. The reasons for this are that the distinction between noise and outliers is not trivial, and that the investigated noise cleaning techniques are derived from traditional noise handling techniques in which noise and outliers are combined. Therefore three investigations were undertaken to assess the effectiveness of the three presented noise handling techniques. Each investigation should be seen as part of a multi-pronged approach. This thesis also highlights possible shortcomings of current automated noise handling techniques. The poor performance of the three techniques led to the conclusion that noise handling should be integrated into a data cleaning process in which the input of domain knowledge and the replicability of the data cleaning process are ensured.
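A minimal sketch of the predictive-filtering idea described above (training and test sets differ) is given below: a decision tree is trained on data other than the instance under test, and instances it misclassifies are flagged as potentially noisy. The use of scikit-learn, the fold count and the tree parameters are assumptions made for illustration; the thesis's own implementation and settings are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def predictive_filter(X: np.ndarray, y: np.ndarray, n_splits: int = 10) -> np.ndarray:
    """Return a boolean mask of instances flagged as potentially noisy.

    Each instance is judged by a decision tree trained on the other folds,
    so training and test sets are always different. Illustrative only.
    """
    noisy = np.zeros(len(y), dtype=bool)
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in splitter.split(X):
        tree = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        noisy[test_idx] = tree.predict(X[test_idx]) != y[test_idx]
    return noisy
```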
75

Software defect prediction using static code metrics: formulating a methodology

Gray, David Philip Harry January 2013 (has links)
Software defect prediction is motivated by the huge costs incurred as a result of software failures. In an effort to reduce these costs, researchers have been utilising software metrics to try and build predictive models capable of locating the most defect-prone parts of a system. These areas can then be subject to some form of further analysis, such as a manual code review. It is hoped that such defect predictors will enable software to be produced more cost effectively, and/or be of higher quality. In this dissertation I identify many data quality and methodological issues in previous defect prediction studies. The main data source is the NASA Metrics Data Program Repository. The issues discovered with these well-utilised data sets include many examples of seemingly impossible values, and much redundant data. The redundant, or repeated data points are shown to be the cause of potentially serious data mining problems. Other methodological issues discovered include the violation of basic data mining principles, and the misleading reporting of classifier predictive performance. The issues discovered lead to a new proposed methodology for software defect prediction. The methodology is focused around data analysis, as this appears to have been overlooked in many prior studies. The aim of the methodology is to be able to obtain a realistic estimate of potential real-world predictive performance, and also to have simple performance baselines with which to compare against the actual performance achieved. This is important as quantifying predictive performance appropriately is a difficult task. The findings of this dissertation raise questions about the current defect prediction body of knowledge. So many data-related and/or methodological errors have previously occurred that it may now be time to revisit the fundamental aspects of this research area, to determine what we really know, and how we should proceed.
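One of the issues raised above, repeated data points, lets the same instance appear in both training and test sets and thereby inflate apparent predictive performance. The sketch below shows a deduplication step before splitting; the file name and label column are hypothetical, since the NASA MDP data sets use varying schemas.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names, for illustration only.
data = pd.read_csv("kc1_metrics.csv")

# Identical metric vectors with identical labels are collapsed so that the
# same point cannot end up on both sides of the train/test split.
deduplicated = data.drop_duplicates()

X = deduplicated.drop(columns=["defective"])
y = deduplicated["defective"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```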
76

Track quality monitoring for the Compact Muon Solenoid silicon strip tracker

Goitom, Israel January 2009 (has links)
The CMS Tracker is an all-silicon detector and the biggest of its kind to be built. The system consists of over 15,000 individual detector modules, read out through almost 10⁷ channels. The data generated by the Tracker system is close to 650 MB at 40 MHz. This has created a challenge for the CMS collaboration in terms of data storage for analysis. To store only the interesting physics data, the readout rate has to be reduced to 100 Hz, and the data has to be filtered through a monitoring system for quality checks. The Tracker being the closest part of the CMS detector to the interaction point creates yet another challenge that calls for a data quality monitoring system: because it operates in a very hostile environment, the silicon detectors used to detect the particles degrade over time. It is very important to monitor changes in sensor behaviour with time so that the sensors can be calibrated to compensate for erroneous readings. This thesis discusses the development of a monitoring system that enables checking of the data generated by the Tracker to address the issues discussed above. The system has two parts: one dealing with the data used to monitor the Tracker, and a second dealing with the statistical methods used to check the quality of the data.
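As a rough illustration of the kind of statistical check such a monitoring system could apply, the sketch below flags detector modules whose mean response drifts from a calibration reference by more than a chosen number of standard errors. The data layout, threshold and test are assumptions; the thesis's actual statistical methods are not specified in the abstract above.

```python
import numpy as np

def flag_drifting_modules(readings: np.ndarray, reference_mean: np.ndarray,
                          reference_std: np.ndarray, k: float = 5.0) -> np.ndarray:
    """Flag modules whose current mean response drifts from a calibration reference.

    `readings` has shape (n_modules, n_events). A module is flagged when its
    observed mean deviates from the reference mean by more than k standard
    errors. Purely illustrative, not the thesis's monitoring algorithm.
    """
    current_mean = readings.mean(axis=1)
    stderr = reference_std / np.sqrt(readings.shape[1])
    return np.abs(current_mean - reference_mean) > k * stderr
```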
77

Options for providing quality axle load data for pavement design

Wood, Steven 30 March 2017 (has links)
This research evaluates four options to produce quality axle load data for pavement design: piezoelectric weigh-in-motion (WIM) sites (corrected and uncorrected data), static weigh scales, and a piezo-quartz WIM site. The evaluation applies four data quality principles: data validity, spatial coverage, temporal coverage, and data availability. While all principles are considered, the research contributes to the development and application of an integrated and sequential approach to assessing the data validity of the options by performing analyses to determine the precision and accuracy of axle load measurements. Within the context of Manitoba, the evaluation reveals that data produced by piezo-quartz and static weigh scales have superior validity, with piezo-quartz data offering better temporal coverage, data availability, and future geographic coverage. Ultimately, the selection of the best option for providing quality axle load data depends on the relative importance of the data quality principles for producing data that supports sound pavement designs and infrastructure management decisions.
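One illustrative way to quantify the precision and accuracy of WIM measurements against a static-scale reference is sketched below: accuracy is summarized as the mean relative error (bias) and precision as the spread of the relative errors. The estimators and sample values are assumptions for illustration, not the thesis's exact analysis.

```python
import numpy as np

def wim_accuracy_precision(wim_loads: np.ndarray, static_loads: np.ndarray):
    """Compare WIM axle loads against co-located static weigh scale references.

    Returns (bias, spread): mean relative error and sample standard deviation
    of the relative errors. Illustrative choice of estimators only.
    """
    relative_error = (wim_loads - static_loads) / static_loads
    return relative_error.mean(), relative_error.std(ddof=1)

# Hypothetical axle loads in tonnes.
bias, spread = wim_accuracy_precision(
    np.array([8.2, 9.1, 7.8]), np.array([8.0, 9.0, 8.0]))
print(bias, spread)
```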
78

Beyond the Turk: Alternative platforms for crowdsourcing behavioral research

Peer, Eyal, Brandimarte, Laura, Samat, Sonam, Acquisti, Alessandro 05 1900 (has links)
The success of Amazon Mechanical Turk (MTurk) as an online research platform has come at a price: MTurk has suffered from slowing rates of population replenishment, and growing participant non-naivety. Recently, a number of alternative platforms have emerged, offering capabilities similar to MTurk but providing access to new and more naïve populations. After surveying several options, we empirically examined two such platforms, CrowdFlower (CF) and Prolific Academic (ProA). In two studies, we found that participants on both platforms were more naïve and less dishonest compared to MTurk participants. Across the three platforms, CF provided the best response rate, but CF participants failed more attention-check questions and did not reproduce known effects replicated on ProA and MTurk. Moreover, ProA participants produced data quality that was higher than CF's and comparable to MTurk's. ProA and CF participants were also much more diverse than participants from MTurk.
79

'Active ageing' and health: an exploration of longitudinal data for four European countries

Di Gessa, Giorgio January 2011 (has links)
'Active Ageing' has been promoted by the World Health Organisation (WHO) as a strategy for promoting the health and well-being of older people. Keeping active and involved in a range of activities not restricted to those associated with labour market participation may, it has been suggested, be beneficial for older people. In this research three domains of 'engagement' were considered: paid work, formal involvement (i.e. activities such as voluntary work, attendance at training courses and participation in political organisations) and informal involvement (i.e. activities such as providing care and help to family, and looking after grandchildren). Using the first two waves of the Survey of Health, Ageing and Retirement in Europe (SHARE) and the English Longitudinal Study of Ageing (ELSA), this thesis investigated both the cross-sectional association between socio-economic, demographic and health-related variables and engagement at baseline, and the longitudinal association between engagement at baseline and self-rated health (SRH) and depressive symptoms at follow-up (controlling for baseline measures of health). The analysis was based on sample members aged 50-69 at baseline in Denmark, France, Italy and England, countries selected to represent different welfare regimes. Cross-sectional findings showed that levels of engagement in paid work and formal activities varied across countries, whereas socio-economic, demographic and health-related characteristics were similarly associated with engagement in all countries under study. This suggested that country-specific factors, such as retirement policies, might play an important role in determining older people's level of engagement in paid work. Cross-sectional results also suggested that work and formal engagement were associated with good health, whereas, among certain subpopulations, informal activities were associated with bad health. Longitudinal analyses showed that, in all countries, respondents in paid work at baseline were more likely to improve their SRH and less likely to become depressed than those who were 'inactive'. Formal and informal engagement were not significantly associated with health at follow-up. Longitudinal results and associations found, however, might have been biased by the high rates of attrition, as multiple imputation techniques and sensitivity analyses suggested. The current research study confirms that engagement in work is an important pathway to health in late life. More attention, however, should be paid to people's working lives, the quality of work and work conditions as these may influence participation in, and withdrawal from, the labour market.
80

Probabilistic real-time urban flood forecasting based on data of varying degree of quality and quantity

René, Jeanne-Rose Christelle January 2014 (has links)
This thesis provides a basic framework for probabilistic real-time urban flood forecasting based on data of varying degree of quality and quantity. The framework was developed based on precipitation data from two case study areas: Aarhus, Denmark and Castries, St. Lucia. Many practitioners have acknowledged that a combination of structural and non-structural measures is required to reduce the effects of flooding on urban environments, but the general dearth of the desired data and models makes the development of a flood forecasting system seem unattainable. Needless to say, high-resolution data and models are not always achievable, and it may be necessary to forgo some accuracy in order to reduce flood risk in urban areas, focusing instead on estimating and communicating the uncertainty in the available resources. Thus, in order to develop a pertinent framework, both primary and secondary data sources were used to discover current practices and to identify relevant data sources. Results from an online survey revealed that we currently have the resources to make a flood forecast and also pointed to potential open-source quantitative precipitation forecasts (QPF), the single most important component needed to make a flood forecast. The design of a flood forecasting system entails the consideration of several factors; thus the framework provides an overview of these considerations and a description of the proposed methods that apply specifically to each component. In particular, this thesis focuses extensively on the verification of QPF from NWP models and quantitative precipitation estimates (QPE) from weather radar, and highlights a method for estimating the uncertainty in the QPF from NWP models based on a retrospective comparison of observed and forecasted rainfall in the form of probability distributions. The results from the application of the uncertainty model suggest that the rainfall forecast makes a large contribution to the uncertainty in the flood forecast, and applying a method which bias-corrects and estimates confidence levels in the forecast looks promising for real-time flood forecasting. This work also describes a method used to generate rainfall ensembles based on a catalogue of observed rain events at suitable temporal scales. Results from model calibration and validation highlight the invaluable potential of using images extracted from social network sites for model calibration and validation. This framework provides innovative possibilities for real-time urban flood forecasting.
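A minimal sketch of the retrospective-comparison idea is shown below: quantiles of the observed-to-forecast rainfall ratio are estimated from an archive of past events and then used to dress a new deterministic QPF with bias-corrected confidence bounds. The multiplicative error model, quantile levels and sample values are assumptions for illustration, not the thesis's uncertainty model.

```python
import numpy as np

def empirical_error_quantiles(forecast_hist: np.ndarray, observed_hist: np.ndarray,
                              levels=(0.05, 0.5, 0.95)) -> np.ndarray:
    """Quantiles of the multiplicative forecast error from a retrospective archive."""
    ratios = observed_hist / np.clip(forecast_hist, 1e-6, None)
    return np.quantile(ratios, levels)

def probabilistic_qpf(new_forecast_mm: float, error_quantiles: np.ndarray) -> np.ndarray:
    """Dress a deterministic rainfall forecast with retrospective error quantiles."""
    return new_forecast_mm * error_quantiles

# Hypothetical archive of forecast/observed event totals (mm), illustration only.
q = empirical_error_quantiles(np.array([10.0, 22.0, 35.0, 8.0, 50.0]),
                              np.array([12.0, 18.0, 40.0, 9.0, 44.0]))
print(probabilistic_qpf(30.0, q))  # approx. 5th, 50th and 95th percentile rainfall
```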
