31. Datová kvalita v prostředí otevřených a propojitelných dat / Data quality in the context of open and linked data. Tomčová, Lucie. January 2014.
The master thesis deals with data quality in the context of open and linked data. One of its goals is to define the specifics of data quality in this context. These specifics are examined mainly through data quality dimensions (i.e. the data characteristics studied in data quality) and the possibilities for measuring them. The thesis also defines the effect that transforming data to linked data has on data quality; this effect is described in terms of the possible risks and benefits that can influence data quality. For the data quality dimensions considered relevant in the context of open and linked data, a list of metrics is composed and verified on real data (open linked data published by a government institution). The thesis points to the need to recognize the differences specific to this context when assessing and managing data quality. At the same time, it outlines possibilities for further study of this question and presents subsequent directions for both theoretical and practical development of the topic.
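To give a concrete flavour of the kind of metric such a list can contain, the following minimal sketch (an editorial illustration, not code from the thesis) computes a simple completeness dimension over RDF-style triples; the expected property list and sample data are hypothetical.

```python
# Illustrative sketch: a simple completeness metric over RDF-style triples.
# The property list and data below are hypothetical examples.

EXPECTED_PROPERTIES = {"name", "address", "founded"}  # properties each entity should have

triples = [
    ("org:1", "name", "Ministry of Finance"),
    ("org:1", "address", "Letenska 15, Prague"),
    ("org:2", "name", "Czech Statistical Office"),
    # org:2 is missing "address" and "founded"; org:1 is missing "founded"
]

def completeness(triples, expected):
    """Fraction of expected (entity, property) pairs actually present."""
    entities = {s for s, _, _ in triples}
    present = {(s, p) for s, p, _ in triples if p in expected}
    total = len(entities) * len(expected)
    return len(present) / total if total else 1.0

print(f"completeness = {completeness(triples, EXPECTED_PROPERTIES):.2f}")  # 0.50
```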
32. Improving data quality: data consistency, deduplication, currency and accuracy. Yu, Wenyuan. January 2013.
Data quality is one of the key problems in data management. An unprecedented amount of data has been accumulated and has become a valuable asset of an organization. The value of the data relies greatly on its quality. However, data is often dirty in real life. It may be inconsistent, duplicated, stale, inaccurate or incomplete, which can reduce its usability and increase the cost of businesses. Consequently the need for improving data quality arises, which comprises five central issues: data consistency, data deduplication, data currency, data accuracy and information completeness. This thesis presents the results of our work on the first four issues, namely data consistency, deduplication, currency and accuracy.

The first part of the thesis investigates incremental verification of data consistency in distributed data. Given a distributed database D, a set S of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. Although the problems are intractable, we show that they are bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. Such incremental algorithms are provided for both vertically and horizontally partitioned data, and we show that the algorithms are optimal.

The second part of the thesis studies the interaction between record matching and data repairing. Record matching, the main technique underlying data deduplication, aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data using constraints. These are treated as separate processes in most data cleaning systems, based on heuristic solutions. However, our studies show that repairing can effectively help us identify matches, and vice versa. To capture the interaction, a uniform framework that seamlessly unifies repairing and matching operations is proposed to clean a database based on integrity constraints, matching rules and master data.

The third part of the thesis presents our study of finding certain fixes that are absolutely correct for data repairing. Data repairing methods based on integrity constraints are normally heuristic, and they may not find certain fixes. Worse still, they may even introduce new errors when attempting to repair the data, which may not work well when repairing critical data such as medical records, in which a seemingly minor error often has disastrous consequences. We propose a framework and an algorithm to find certain fixes, based on master data, a class of editing rules and user interactions. A prototype system is also developed.

The fourth part of the thesis introduces inferring data currency and consistency for conflict resolution, where data currency aims to identify the current values of entities, and conflict resolution is to combine tuples that pertain to the same real-world entity into a single tuple and resolve conflicts, which is also an important issue for data deduplication. We show that data currency and consistency help each other in resolving conflicts. We study a number of associated fundamental problems and develop an approach for conflict resolution by inferring data currency and consistency.

The last part of the thesis reports our study of data accuracy, addressing the longstanding relative accuracy problem: given tuples t1 and t2 that refer to the same entity e, determine whether t1[A] is more accurate than t2[A], i.e., whether t1[A] is closer to the true value of the A attribute of e than t2[A]. We introduce a class of accuracy rules and an inference system with a chase procedure to deduce relative accuracy, and the related fundamental problems are studied. We also propose a framework and algorithms for inferring accurate values with user interaction.
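As an editorial illustration of the conditional functional dependencies the consistency part builds on (not code from the thesis; the CFD, attribute names and data are hypothetical), a CFD such as [country = "UK", zip] → [city] states that among UK tuples, zip determines city; a violation is a pair of tuples that agree on zip but disagree on city:

```python
# Illustrative sketch of detecting violations of one conditional functional
# dependency (CFD). The CFD, attributes and rows below are hypothetical examples.

# CFD: for tuples with country = "UK", zip determines city.
rows = [
    {"id": 1, "country": "UK", "zip": "EH8 9AB", "city": "Edinburgh"},
    {"id": 2, "country": "UK", "zip": "EH8 9AB", "city": "London"},     # conflicts with id 1
    {"id": 3, "country": "NL", "zip": "EH8 9AB", "city": "Amsterdam"},  # pattern does not apply
]

def cfd_violations(rows):
    """Return pairs of row ids that jointly violate the CFD."""
    violations = []
    uk = [r for r in rows if r["country"] == "UK"]
    for i, r in enumerate(uk):
        for s in uk[i + 1:]:
            if r["zip"] == s["zip"] and r["city"] != s["city"]:
                violations.append((r["id"], s["id"]))
    return violations

print(cfd_violations(rows))  # [(1, 2)]
```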
33. The inner and inter construct associations of the quality of data warehouse customer relationship data for problem enactment. Abril, Raul Mario. January 2005.
The literature identifies perceptions of data quality as a key factor influencing a wide range of attitudes and behaviors related to data in organizational settings (e.g. decision confidence). In particular, there is an overwhelming consensus that effective customer relationship management (CRM) depends on the quality of customer data. Data warehouses, if properly implemented, enable data integration, which is a key attribute of data quality. The literature highlights the relevance of formulating problem statements because this determines the course of action. CRM managers formulate problem statements through a cognitive process known as enactment. The literature on data quality is very fragmented. It posits that this construct is of a high order nature (it is dimensional), that it is contextual and situational, and that it is closely linked to a utilitarian value. This study addresses all these disparate views of the nature of data quality from a holistic perspective. Social cognitive theory (SCT) is the backbone for studying data quality in terms of information search behavior and enhancements in formulating problem statements. The main objective of this study is to explore the nature of a data warehouse's customer relationship data quality in situations where there is a need to understand a customer relationship problem. The research question is: what are the inner and inter construct associations of the quality of data warehouse customer relationship data for problem enactment? To reach this objective, a positivistic approach was adopted, complemented with qualitative interventions along the research process. Observations were gathered with a survey. Scales were adjusted using a construct-based approach. The research findings confirm that data quality is a high order construct with a contextual dimension and a situational dimension. Problem sense making enhancements is a dependent variable of data quality in a confirmed positive association between both constructs. Problem sense making enhancements is also a high order construct with a mastering experience dimension and a self-efficacy dimension. Behavioral patterns for information search mode (scanning mode orientation vs. focus mode orientation) and for information search heuristic (template heuristic orientation vs. trial-and-error heuristic orientation) have been identified. Focus is the predominant information search mode orientation and template is the predominant information search heuristic orientation. Overall, the research findings support the associations advocated by SCT. The self-efficacy dimension in problem sense making enhancements is a discriminant for information search mode orientation (focus mode orientation vs. scanning mode orientation). The contextual dimension in data quality (i.e. data task utility) is a discriminant for information search heuristic (template heuristic orientation vs. trial-and-error heuristic orientation). A data quality cognitive metamodel and a data quality for problem enactment model are suggested for research in the areas of data quality, information search behavior, and cognitive enhancements.
34. Deduplikační metody v databázích / Deduplication methods in databases. Vávra, Petr. January 2010.
In the present work we study the record deduplication problem as an issue of data quality. We define duplicates as records that have different syntax but the same semantics, i.e. records representing the same real-world entity. The main goal of this work is to provide an overview of existing deduplication methods with regard to their requirements, results and usability. We focus on comparing two groups of record deduplication methods: those that use domain knowledge and those that do not. The second part of this work is therefore dedicated to implementing our own method, which does not utilize any domain knowledge, and to comparing its results with those of a commercial tool that relies heavily on domain knowledge.
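For illustration, a domain-independent method of the kind implemented in the thesis can rely on nothing more than generic string similarity between field values. The sketch below is an editorial example with hypothetical records and threshold, not the thesis's implementation; it flags record pairs whose average field similarity exceeds a threshold.

```python
# Illustrative sketch of domain-independent duplicate detection: records are
# compared purely by generic string similarity, with no domain knowledge.

from difflib import SequenceMatcher

records = [
    ("Jan Novak", "Hlavni 12, Praha"),
    ("Jan Novák", "Hlavní 12, Praha"),   # same person, different diacritics
    ("Petr Svoboda", "Dlouha 3, Brno"),
]

def similarity(r1, r2):
    """Average character-level similarity over aligned fields."""
    scores = [SequenceMatcher(None, a.lower(), b.lower()).ratio() for a, b in zip(r1, r2)]
    return sum(scores) / len(scores)

THRESHOLD = 0.85  # hypothetical cut-off
duplicates = [
    (i, j)
    for i in range(len(records))
    for j in range(i + 1, len(records))
    if similarity(records[i], records[j]) >= THRESHOLD
]
print(duplicates)  # expected: [(0, 1)]
```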
35. Data Governance - koncept projektu zavedení procesu / Data Governance - The implementation project concept. Kmoch, Václav. January 2010.
Companies today face the underlying question of how to manage the growing volume of corporate data needed for decision-making processes and how to control the credibility and relevance of the information and knowledge derived from it. Further questions concern responsibility and data security, which represent a potential risk of information leakage. Data Governance concepts provide a comprehensive answer to these questions. However, the decision to implement a Data Governance program usually triggers many other problems, such as setting up environments, determining the project scope, allocating the capacity of data experts and finding one's way among the non-uniform Data Governance concepts offered by various IT vendors. The aim of this thesis is to describe a unified, universal implementation process that helps with setting up Data Governance projects and gives a clear idea of how to run them step by step. The first and second parts of the thesis describe the principles, components and tools of Data Governance, as well as methods of measuring data quality levels. The third part offers a concrete approach to successfully implementing the Data Governance concept in a corporate data environment.
36. Vizualizace kvality dat v Business Intelligence / Visualization of Data Quality in Business Intelligence. Pohořelý, Radovan. January 2009.
This thesis deals with the area of Business Intelligence, and especially with the part concerning data quality. The goal is to provide an overview of the data quality issue and of the possible ways data can be presented so that it has better and more engaging informative value. Another goal was to propose a solution for visualizing the state of the system, particularly its data quality section, at a concrete enterprise. The output of this thesis should provide a guideline for implementing the proposed solution.
37. A Systems Approach to Rule-Based Data Cleaning. Ebaid, Amr H. 10 May 2019.
High quality data is a vital asset for several businesses and applications. With flawed data costing billions of dollars every year, the need for data cleaning is unprecedented. Many data-cleaning approaches have been proposed in both academia and industry. However, there are no end-to-end frameworks for detecting and repairing errors with respect to a set of heterogeneous data-quality rules.

Several important challenges exist when envisioning an end-to-end data-cleaning system: (1) It should deal with heterogeneous types of data-quality rules and interleave their corresponding repairs. (2) It can be extended by various data-repair algorithms to meet users' needs for effectiveness and efficiency. (3) It must support continuous data cleaning and adapt to inevitable data changes. (4) It has to provide user-friendly interpretable explanations for the detected errors and the chosen repairs.

This dissertation presents a systems approach to rule-based data cleaning that is generalized, extensible, continuous and explaining. The proposed system distinguishes between a programming interface and a core to address the above challenges. The programming interface allows the user to specify various types of data-quality rules that uniformly define and explain what is wrong with the data, and how to fix it. Handling all the rules as black boxes, the core encapsulates various algorithms to holistically and continuously detect errors and repair data. The proposed system offers a simple interface to define data-quality rules, summarizes the data, highlights violations and fixes, and provides relevant auditing information to explain the errors and the repairs.
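The dissertation's actual programming interface is not reproduced in this abstract; as a hedged editorial sketch only, a black-box rule abstraction of the kind described, where every rule can both detect and repair violations regardless of its type, might look like the following (all class and method names are hypothetical, not the system's real API).

```python
# Illustrative sketch of a black-box data-quality rule interface: each rule can
# detect violating cells and repair them, so the core can treat all rules uniformly.

from abc import ABC, abstractmethod

class DataQualityRule(ABC):
    """A heterogeneous rule handled uniformly: it can detect and repair errors."""

    @abstractmethod
    def detect(self, table):
        """Return the cells (row index, attribute) that violate this rule."""

    @abstractmethod
    def repair(self, table, violations):
        """Return a repaired copy of the table for the given violations."""

class NotNullRule(DataQualityRule):
    """Example rule: a required attribute must not be missing."""

    def __init__(self, attribute, default):
        self.attribute, self.default = attribute, default

    def detect(self, table):
        return [(i, self.attribute) for i, row in enumerate(table)
                if row.get(self.attribute) in (None, "")]

    def repair(self, table, violations):
        repaired = [dict(row) for row in table]
        for i, attr in violations:
            repaired[i][attr] = self.default
        return repaired

table = [{"name": "Acme", "country": ""}, {"name": "Globex", "country": "US"}]
rule = NotNullRule("country", default="UNKNOWN")
print(rule.repair(table, rule.detect(table)))
```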
38. Řízení kvality klientských dat / Client data quality management. Vacek, Martin. January 2011.
A series of competitive battles is emerging as companies recover from the last economic crisis, and these battles are fought over customers. Take the financial market, for example: it is quite saturated. Most people have held some financial product since birth. Each of us has insurance, and most of us have at least a standard bank account. For us to be able to use these products, insurance companies, banks and similar firms must hold the necessary information about us. As time passes we change the settings of these products, change the products themselves, buy new ones, adjust their portfolios, move to the competition, and even the employees and financial advisors who take care of us change over time. All of the above means new data, or at least changes to existing data. Every such action leaves a digital footprint in the information systems of financial services providers, who then try to process these data and use them to raise profit by various methods. From an individual company's point of view, a customer (here, a person who has had at least one product) is unfortunately often tracked multiple times as a result of the changes above, so one person actually appears as several different persons. There are many reasons for this and they are well known in practice (many of them are named in the theoretical part). One of the main reasons is that data quality was not a priority in the past. This is no longer the case, and one of the success factors in cultivating a client base portfolio is the quality of the information that companies track. Several methodologies for data quality governance are being created and defined nowadays, although knowledge of how to implement them is still scarce (and not just in the local Czech market). Such experience is highly prized, but most internal IT departments face a lack of knowledge and capacity. This is a great opportunity for companies that accumulate know-how from projects that are not frequent within individual firms. One such company is KPMG Czech Republic, LLC, thanks to which this work was created. So what is the purpose, and what field of knowledge is covered in the following pages? The purpose is to describe one such project concerning the analysis and implementation of selected data quality tools and methodologies in a real company. The main output is a supporting framework, as well as a tool that will help managers reduce administration and difficulties when managing data quality projects.
39. Perception of Key Barriers in Using and Publishing Open Data. Polleres, Axel; Umbrich, Jürgen; Figl, Kathrin; Beno, Martin. January 2017.
There is a growing body of literature recognizing the benefits of Open Data. However, many potential data providers are unwilling to publish their data, and at the same time data users often face difficulties when attempting to use Open Data in practice. Although various barriers to using and publishing Open Data are still present, studies that systematically collect and assess these barriers are rare. Based on this observation, we present a review of prior literature on barriers and the results of an empirical study aimed at assessing both the users' and publishers' views on obstacles to Open Data adoption. We collected data with an online survey in Austria and internationally. Using a sample of 183 participants, we draw conclusions about the relative importance of the barriers reported in the literature. In comparison to a previous conference paper presented at the Conference for E-Democracy and Open Government, this article includes new additional data from participants outside Austria, reports new analyses, and substantially extends the discussion of the results and of possible strategies for mitigating Open Data barriers.
40. Privacy preservation in data mining through noise addition. Islam, Md Zahidul. January 2008.
Research Doctorate - Doctor of Philosophy (PhD). Due to advances in information processing technology and storage capacity, a huge amount of data is nowadays collected for various data analyses. Data mining techniques, such as classification, are often applied to these data to extract hidden information. During the whole process of data mining the data are exposed to several parties, and such exposure can potentially lead to breaches of individual privacy. This thesis presents a comprehensive noise addition technique for protecting individual privacy in a data set used for classification, while maintaining data quality. We add noise to all attributes, both numerical and categorical, and both class and non-class, in such a way that the original patterns are preserved in the perturbed data set. Our technique is also capable of incorporating previously proposed noise addition techniques that maintain the statistical parameters of the data set, including correlations among attributes. Thus the perturbed data set may be used not only for classification but also for statistical analysis. Our proposal has two main advantages. Firstly, as suggested by our experimental results, the perturbed data set maintains the same or very similar patterns as the original data set, as well as the correlations among attributes. While there are some noise addition techniques that maintain the statistical parameters of the data set, to the best of our knowledge this is the first comprehensive technique that preserves the patterns and thus removes the so-called Data Mining Bias from the perturbed data set. Secondly, re-identification of the original records depends directly on the amount of noise added and can in general be made arbitrarily hard, while still preserving the original patterns in the data set. The only exception is the case when an intruder knows enough about a record to learn the confidential class value by applying the classifier. However, this is always possible, even when the original record has not been used in the training data set. In other words, provided that enough noise is added, our technique makes the records from the training set as safe as any other previously unseen records of the same kind. In addition to the above contribution, this thesis also explores the suitability of prediction accuracy as a sole indicator of data quality, and proposes a technique for clustering both categorical values and records containing such values.
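As a purely editorial illustration of the general idea of additive noise (the thesis's own technique is pattern-preserving and also handles categorical and class attributes, which this toy sketch does not attempt to reproduce), random noise scaled to an attribute's value range might be added like this:

```python
# Illustrative sketch of additive noise for a numerical attribute; the noise
# fraction and sample values are hypothetical.

import random

def perturb_numeric(values, noise_fraction=0.05, seed=42):
    """Add zero-mean Gaussian noise scaled to the attribute's value range."""
    rng = random.Random(seed)
    spread = (max(values) - min(values)) * noise_fraction
    return [v + rng.gauss(0.0, spread) for v in values]

ages = [23, 35, 35, 41, 58, 62]
print(perturb_numeric(ages))  # perturbed ages, close to the originals
```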