  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Spatial data quality management

He, Ying, Surveying & Spatial Information Systems, Faculty of Engineering, UNSW January 2008 (has links)
The applications of geographic information systems (GIS) in various areas have highlighted the importance of data quality. Data quality has been a research priority for GIS academics for three decades. However, the outcomes of data quality research have not been sufficiently translated into practical applications, and users still need a GIS capable of storing, managing and manipulating data quality information. To fill this gap, this research investigates how to develop a tool that effectively and efficiently manages data quality information to help data users better understand and assess the quality of their GIS outputs. Specifically, this thesis aims: 1. To develop a framework for establishing a systematic linkage between data quality indicators and appropriate uncertainty models; 2. To propose an object-oriented data quality model for organising and documenting data quality information; 3. To create data quality schemas for defining and storing the contents of metadata databases; 4. To develop a new conceptual model of data quality management; 5. To develop and implement a prototype system for enhancing the capability of data quality management in commercial GIS. Based on reviews of error and uncertainty modelling in the literature, a conceptual framework has been developed to establish the systematic linkage between data quality elements and appropriate error and uncertainty models. To overcome the limitations identified in the review and satisfy a series of requirements for representing data quality, a new object-oriented data quality model has been proposed. It enables data quality information to be documented and stored in a multi-level structure and to be integrally linked with spatial data to allow access, processing and graphic visualisation. A conceptual model for data quality management is then proposed, with a data quality storage model, uncertainty models and visualisation methods as its three basic components. This model establishes the processes involved in managing data quality, emphasising the integration of uncertainty modelling and visualisation techniques. These studies lay the theoretical foundations for the development of a prototype system able to manage data quality. An object-oriented approach, database technology and programming techniques have been integrated to design and implement the prototype system within the ESRI ArcGIS software. The object-oriented approach allows the prototype to be developed in a flexible and easily maintained manner. The prototype allows users to browse and access data quality information at different levels, and a set of error and uncertainty models is embedded within the system. With the prototype, data quality elements can be extracted from the database and automatically linked with the appropriate error and uncertainty models, as well as with their implications in the form of simple maps. This function presents users with a set of uncertainty models to choose from when assessing how uncertainty inherent in the data can affect their specific application, significantly increasing users' confidence in using the data for a particular situation. To demonstrate the enhanced capability of the prototype, the system has been tested against real data. The implementation has shown that the prototype can efficiently assist data users, especially non-expert users, to better understand data quality and utilise it in a more practical way.
The methodologies and approaches for managing quality information presented in this thesis should serve as an impetus for supporting further research.
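A minimal sketch of the kind of object-oriented data quality model this abstract describes: quality elements stored at several levels (dataset, feature class, feature) and linked by identifier to the spatial objects they describe. This is not the thesis's ArcGIS prototype; the class, attribute and level names are assumptions made for the example.

```python
# Illustrative sketch (assumed names, not the thesis's implementation) of a
# multi-level, object-oriented data quality store linked to spatial objects.
from dataclasses import dataclass, field


@dataclass
class QualityElement:
    """One data quality indicator, e.g. positional accuracy."""
    name: str                        # e.g. "positional_accuracy"
    value: float                     # measured or estimated value
    unit: str                        # e.g. "metres"
    uncertainty_model: str = "none"  # model used to interpret the value


@dataclass
class QualityRecord:
    """Quality information attached to one level of the data hierarchy."""
    level: str       # "dataset" | "feature_class" | "feature"
    target_id: str   # identifier of the spatial object described
    elements: dict = field(default_factory=dict)

    def add(self, element: QualityElement) -> None:
        self.elements[element.name] = element


class QualityStore:
    """Multi-level store that lets users browse quality info by level."""
    def __init__(self) -> None:
        self._records = {}

    def attach(self, record: QualityRecord) -> None:
        self._records[(record.level, record.target_id)] = record

    def lookup(self, level: str, target_id: str):
        # Return the record attached at this level for this object, if any.
        return self._records.get((level, target_id))


# Usage: record positional accuracy for a "roads" feature class, then read it back.
store = QualityStore()
rec = QualityRecord(level="feature_class", target_id="roads")
rec.add(QualityElement("positional_accuracy", 2.5, "metres", "circular_normal"))
store.attach(rec)
print(store.lookup("feature_class", "roads"))
```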
2

Data Quality Through Active Constraint Discovery and Maintenance

Chiang, Fei Yen 10 December 2012 (has links)
Although integrity constraints are the primary means for enforcing data integrity, there are cases in which they are not defined or are not strictly enforced. This leads to inconsistencies in the data, causing poor data quality. In this thesis, we leverage the power of constraints to improve data quality. To ensure that the data conforms to the intended application domain semantics, we develop two algorithms focusing on constraint discovery. The first algorithm discovers a class of conditional constraints that hold over a subset of the relation under specific conditional values. The second algorithm discovers attribute domain constraints, which bind specific values to the attributes of a relation for a given domain. Both types of constraints have been shown to be useful for data cleaning. In practice, weak enforcement of constraints often occurs for performance reasons, which leads to inconsistencies between the data and the set of defined constraints. To resolve such an inconsistency, we must determine whether it is the constraints or the data that are incorrect, and then make the necessary corrections. We develop a repair model that considers repairs to the data and repairs to the constraints on an equal footing, and we present repair algorithms that find the repairs needed to bring the data and the constraints back to a consistent state. Finally, we study the efficiency and quality of our techniques. We show that our constraint discovery algorithms find meaningful constraints with good precision and recall, and that our repair algorithms resolve many inconsistencies with high-quality repairs, proposing repairs that previous algorithms did not consider.
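To make the idea of conditional constraints concrete, the sketch below checks a candidate conditional dependency of the form "when country = 'UK', zip determines city" over a small relation and returns the violating tuples. It is not the discovery algorithm from the thesis; the attribute names and sample data are invented for the example.

```python
# Illustrative check of one conditional functional dependency (assumed
# attributes and data), returning the tuples that violate it.
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Return rows violating lhs -> rhs on the subset matching `condition`.

    rows:      list of dicts (one per tuple)
    condition: dict of attribute -> required value (the pattern)
    lhs, rhs:  attribute names
    """
    groups = defaultdict(set)
    for row in rows:
        if all(row.get(a) == v for a, v in condition.items()):
            groups[row[lhs]].add(row[rhs])
    bad_keys = {k for k, vals in groups.items() if len(vals) > 1}
    return [r for r in rows
            if all(r.get(a) == v for a, v in condition.items())
            and r[lhs] in bad_keys]


data = [
    {"country": "UK", "zip": "EH8 9AB", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH8 9AB", "city": "Glasgow"},   # inconsistent
    {"country": "NL", "zip": "EH8 9AB", "city": "Delft"},     # outside the condition
]
print(cfd_violations(data, {"country": "UK"}, "zip", "city"))
```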
3

Definition and analysis of population-based data completeness measurement

Emran, Nurul Akmar Binti January 2011 (has links)
Poor quality data, such as data with errors or missing values, causes negative consequences in many application domains. An important aspect of data quality is completeness, and one problem in data completeness is that of missing individuals in data sets. Within a data set, the individuals are the real-world entities whose information is recorded. So far, however, completeness studies have paid little attention to how missing individuals are assessed. In this thesis, we propose the notion of population-based completeness (PBC) to deal with the missing-individuals problem, with the aim of investigating what is required to measure PBC and what is needed to support PBC measurement in practice. To achieve these aims, we analyse the elements of PBC and the requirements for PBC measurement, resulting in a definition of the PBC elements and a PBC measurement formula. We propose an architecture for PBC measurement systems and determine the technical requirements of PBC systems in terms of software and hardware components. An analysis of the technical issues that arise in implementing PBC contributes to an understanding of whether PBC measurement can feasibly provide accurate results. Further exploration of a particular issue uncovered in the analysis showed that, when measuring PBC across multiple databases, data from those databases need to be integrated and materialised. Unfortunately, this requirement may lead to a large internal store for the PBC system that is impractical to maintain. We propose an approach to test the hypothesis that the available storage space can be optimised by materialising only partial information from the contributing databases, while retaining the accuracy of the PBC measurements. Our approach substitutes some of the attributes from the contributing databases with smaller alternatives, exploiting the approximate functional dependencies (AFDs) that can be discovered within each local database. An analysis of the space-accuracy trade-offs of the approach leads to an algorithm for assessing candidate alternative attributes in terms of space saving and accuracy of PBC measurement. The results of several case studies conducted for proxy assessment contribute to an understanding of the space-accuracy trade-offs offered by the proxies. The proposal and investigation of PBC thus provide a better understanding of the completeness problem, in terms of what is required to measure and support PBC in practice.
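One plausible reading of population-based completeness, sketched below, is the fraction of individuals in a reference population that appear in the integrated data. The sketch does not reproduce the thesis's own formula or architecture; the identifier sets and function name are assumptions for the example.

```python
# Illustrative PBC measure (assumed interpretation and data): the share of a
# reference population whose individuals appear in the integrated data set.
def population_based_completeness(recorded_ids, reference_population):
    """PBC = |recorded individuals ∩ reference population| / |reference population|."""
    if not reference_population:
        raise ValueError("reference population must be non-empty")
    present = recorded_ids & reference_population
    return len(present) / len(reference_population)


# Individuals recorded in two contributing databases, integrated by union.
db_a = {"p01", "p02", "p05"}
db_b = {"p02", "p03"}
reference = {"p01", "p02", "p03", "p04", "p05"}

print(population_based_completeness(db_a | db_b, reference))  # 0.8
```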
4

Towards a Theory of Spreadsheet Accuracy: An Empirical Study

Kruck, Susan E. Jr. 21 August 1998 (has links)
Electronic spreadsheets have made a major contribution to financial analysis and problem solving. Although professionals base many decisions on the analysis of a spreadsheet model, the literature documents the data quality problems that often occur: underlying formulas and the resulting numbers are frequently wrong. A growing body of evidence, gathered from students in academia as well as working professionals in business settings, indicates that these spreadsheet errors are a pervasive problem. Numerous published articles describe techniques to increase spreadsheet accuracy, but there has been no aggregation of these topics and no model explaining the phenomenon. The research described here develops a theory and model of spreadsheet accuracy and then attempts to verify its propositions in a laboratory experiment. Numerous practitioner articles suggest techniques to move spreadsheets into a more structured development process, which implies an increase in spreadsheet accuracy; however, advances in our understanding of spreadsheet accuracy have been limited by the lack of theory explaining the phenomenon. This study tests various propositions of the proposed theory. Four constructs were developed from the theory to test it: planning and design organization, formula complexity, testing and debugging assessment, and spreadsheet accuracy. From these constructs, three aids were designed to test the relationships among them, each aimed at increasing spreadsheet accuracy by addressing a single proposition in the model. The laboratory experiment required participants to create a reusable spreadsheet model. The model and theory developed here appear to capture the spreadsheet accuracy phenomenon. The three aids did increase spreadsheet data quality as measured by the number of errors in the spreadsheets. In addition, participants given the formula complexity aid created spreadsheets that contained significantly fewer constants in formulas, and participants given the testing and debugging aid corrected a significant number of errors after using it. / Ph. D.
5

A Framework for Data Quality for Synthetic Information

Gupta, Ragini 24 July 2014 (has links)
Data quality has been an area of increasing interest for researchers in recent years due to the rapid emergence of 'big data' processes and applications. In this work, the data quality problem is viewed from the standpoint of synthetic information. Given the structure and complexity of synthetic data, a data quality framework specific to it is needed. This thesis presents such a framework, along with implementation details and results from applying the developed testing framework to a large synthetic dataset. A formal conceptual framework was designed for assessing the data quality of synthetic information, comprising analytical methods and software for that purpose. It includes data quality dimensions that check the inherent properties of the data as well as evaluate the data in the context of its use. The framework is realised as software designed with scalability, generality, integrability and modularity in mind. A data abstraction layer has been introduced between the synthetic data and the tests. This abstraction layer has multiple benefits over direct access to the data by the tests: it decouples the tests from the data so that the details of storage and implementation are kept hidden from the user. We have implemented data quality measures for several quality dimensions: accuracy and precision, reliability, completeness, consistency, and validity. The tests and quality measures implemented span a range from low-level syntactic checks to high-level semantic quality measures. In each case, in addition to the results of the quality measure itself, we also present results on the computational performance (scalability) of the measure. / Master of Science
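The sketch below illustrates the idea of a data abstraction layer with pluggable quality tests: the measures are written only against an abstract data source, so storage details stay hidden. It is not the framework built in the thesis; the class names, method names and the two example measures are assumptions.

```python
# Illustrative abstraction layer (assumed names) separating quality tests
# from the storage of the data they measure.
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Abstraction layer: tests never touch storage details directly."""
    @abstractmethod
    def rows(self):
        """Yield records as dicts, regardless of the underlying storage."""


class InMemorySource(DataSource):
    def __init__(self, records):
        self._records = records

    def rows(self):
        yield from self._records


def completeness(source, attribute):
    """Share of records with a non-missing value for `attribute`."""
    rows = list(source.rows())
    filled = sum(1 for r in rows if r.get(attribute) not in (None, ""))
    return filled / len(rows) if rows else 0.0


def validity(source, attribute, predicate):
    """Share of records whose value for `attribute` passes `predicate`."""
    rows = list(source.rows())
    ok = sum(1 for r in rows if predicate(r.get(attribute)))
    return ok / len(rows) if rows else 0.0


src = InMemorySource([{"age": 34}, {"age": -2}, {"age": None}])
print(completeness(src, "age"))                                          # ~0.67
print(validity(src, "age", lambda a: a is not None and 0 <= a <= 120))   # ~0.33
```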
6

Data Quality of Motion Research Utilizing fNIRS

Soares, Shayna 01 January 2019 (has links)
This study assessed whether data collected via fNIRS (functional near-infrared spectroscopy) while a participant was intentionally moving is of the same quality as data collected from a motionless participant. The study used a within-subjects design with three head-movement conditions (no, low, and high head movement). Data were recorded via the fNIRS system as well as an app called VibSensor, which recorded head movement on the X, Y, and Z planes. Results for the behavioral data indicated significance only on the Y plane between the no-movement and high-movement conditions, and only one channel of the fNIRS data was significant.
7

Data quality in the business information database environment

Cabalka, Martin January 2015 (has links)
This master's thesis is concerned with the choice of suitable data quality dimensions for a particular business information database and proposes and implements metrics for assessing them. The aim of the thesis is to define the term data quality in the context of a business information database and to identify possible ways to measure it. Based on the dimensions deemed suitable to observe, a list of metrics was created and subsequently implemented in the SQL query language, or alternatively in its procedural extension Transact-SQL. These metrics were tested on real data and the results are accompanied by commentary. The main contribution of this work is its comprehensive treatment of the data quality topic, from the theoretical definition of the term to the concrete implementation of individual metrics. Finally, the study offers a variety of theoretical and practical directions for further research on this issue.
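As an illustration of expressing a data quality metric as a SQL query, the sketch below computes a completeness ratio for one column of an assumed company table, run through sqlite3. The thesis's actual metrics target a specific business information database and may rely on Transact-SQL features; the table, columns and data here are invented.

```python
# Illustrative SQL-based completeness metric (assumed schema and data),
# executed against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, reg_number TEXT);
    INSERT INTO company (name, reg_number) VALUES
        ('Alfa s.r.o.', '12345678'),
        ('Beta a.s.',   NULL),
        ('Gamma spol.', '');
""")

completeness_sql = """
    SELECT 1.0 * SUM(CASE WHEN reg_number IS NOT NULL AND reg_number <> ''
                          THEN 1 ELSE 0 END) / COUNT(*)
    FROM company;
"""
ratio, = conn.execute(completeness_sql).fetchone()
print(f"reg_number completeness: {ratio:.2f}")   # 0.33
```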
8

Extending dependencies for improving data quality

Ma, Shuai January 2011 (has links)
This doctoral thesis presents the results of my work on extending dependencies for improving data quality, both in a centralized environment with a single database and in a data exchange and integration environment with multiple databases. The first part of the thesis proposes five classes of data dependencies, referred to as CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly found in practice in a centralized environment. For each class of dependencies we investigate two central problems: the satisfiability problem and the implication problem. The satisfiability problem is to determine, given a set Σ of dependencies defined on a database schema R, whether or not there exists a nonempty database D of R that satisfies Σ. The implication problem is to determine whether or not a set Σ of dependencies defined on a database schema R entails another dependency φ on R; that is, whether every database D of R that satisfies Σ also satisfies φ. These problems are important for the validation and optimization of data-cleaning processes. We establish complexity results for the satisfiability problem and the implication problem for all five classes of dependencies, both in the absence of finite-domain attributes and in the general setting with finite-domain attributes. Moreover, SQL-based techniques are developed to detect data inconsistencies for each class of the proposed dependencies, which can easily be implemented on top of current database management systems. The second part of the thesis studies three important topics for data cleaning in a data exchange and integration environment with multiple databases. The first is the dependency propagation problem: to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view. We investigate dependency propagation for views defined in various fragments of relational algebra, with conditional functional dependencies (CFDs) [FGJK08] as view dependencies and source dependencies given as either CFDs or traditional functional dependencies (FDs). We establish matching lower and upper bounds, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work on FD propagation by giving new complexity bounds in the setting with finite domains. We also provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views; the algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. The second topic is matching records from unreliable data sources. A class of matching dependencies (MDs) is introduced for specifying the semantics of unreliable data. As opposed to static constraints for schema design such as FDs, MDs are developed for record matching and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations.
We also propose a mechanism for inferring MDs, with a sound and complete system that departs from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. We further provide a quadratic-time algorithm for inferring MDs and an effective algorithm for deducing quality RCKs from a given set of MDs. The third topic is finding certain fixes for data monitoring [CGGM03, SMO07], that is, finding and correcting errors in a tuple when it is created, whether entered manually or generated by some process. We want to ensure that a tuple t is clean before it is used, to prevent errors introduced by adding t; as noted in [SMO07], it is far less costly to correct a tuple at the point of entry than to fix it afterward. Data repairing based on integrity constraints may not find certain fixes that are absolutely correct and, worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they can fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct.
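The sketch below gives one possible, simplified reading of editing rules with master data and a certain region: attributes the user certifies as correct are used to match a tuple to master data, which then dictates the fix for the remaining attributes named in the rule. It is not the formalism of the thesis; the attribute names, rules and master records are assumptions for the example.

```python
# Illustrative "certain fix" with master data and editing rules (assumed
# attributes, rules and records), not the thesis's formal machinery.
MASTER = [
    {"phone": "0131 650 1000", "city": "Edinburgh", "zip": "EH8 9AB"},
    {"phone": "020 7946 0000", "city": "London",    "zip": "SW1A 1AA"},
]

# Editing rule: if the phone number (a certified attribute) matches a master
# record, then city and zip must be taken from that master record.
RULES = [{"match_on": ["phone"], "fix": ["city", "zip"]}]


def apply_editing_rules(tuple_in, certain_region, master=MASTER, rules=RULES):
    fixed = dict(tuple_in)
    for rule in rules:
        # Only fire the rule when every matching attribute is user-certified.
        if not all(a in certain_region for a in rule["match_on"]):
            continue
        for m in master:
            if all(fixed.get(a) == m.get(a) for a in rule["match_on"]):
                for a in rule["fix"]:
                    fixed[a] = m[a]
                break
    return fixed


dirty = {"phone": "0131 650 1000", "city": "Edimburgh", "zip": "EH8 9XX"}
print(apply_editing_rules(dirty, certain_region={"phone"}))
# {'phone': '0131 650 1000', 'city': 'Edinburgh', 'zip': 'EH8 9AB'}
```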
