101

Adaptive windows for duplicate detection

Draisbach, Uwe, Naumann, Felix, Szott, Sascha, Wonneberg, Oliver January 2012 (has links)
Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might have a high volume, making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement. The general intuition of such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaptation strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
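For illustration, a minimal Python sketch of the windowed-comparison idea follows; the growth rule (extend the window while the record at the boundary stays similar to the record at the window start), the string-similarity measure, and the threshold are assumptions of this sketch, not the adaptation strategies evaluated in the thesis.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; stands in for a real record-matching measure."""
    return SequenceMatcher(None, a, b).ratio()

def adaptive_snm(records, key=lambda r: r, threshold=0.8, min_window=2):
    """Sorted Neighborhood with a window that grows while neighbouring records stay similar.

    Returns candidate duplicate pairs as index pairs into `records`. The growth rule and
    threshold are illustrative only.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for start in range(len(order)):
        # Begin with the minimum window, then extend it while the record at the current
        # window boundary is still similar to the record at the window start.
        end = min(start + min_window, len(order))
        while end < len(order) and similarity(key(records[order[start]]),
                                              key(records[order[end - 1]])) >= threshold:
            end += 1
        # Compare all record pairs inside the (possibly enlarged) window.
        for a in range(start, end):
            for b in range(a + 1, end):
                if similarity(key(records[order[a]]), key(records[order[b]])) >= threshold:
                    pairs.add(tuple(sorted((order[a], order[b]))))
    return sorted(pairs)

names = ["Jon Smith", "John Smith", "Johanna Smythe", "Zoe Adams", "Zoey Adams"]
print(adaptive_snm(names, threshold=0.85))   # -> [(0, 1), (3, 4)]
```

The point of the adaptive rule is that sorted regions containing many near-duplicates get a wider window (more comparisons where they pay off), while dissimilar regions keep the window small.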
102

Improving Data Quality: Development and Evaluation of Error Detection Methods

Lee, Nien-Chiu 25 July 2002 (has links)
High-quality data are essential to decision support in organizations. However, estimates have shown that 15-20% of the data within an organization's databases can be erroneous. Some databases contain a large number of errors, posing a serious problem if they are used for managerial decision-making. To improve data quality, data cleaning efforts are needed and have been initiated by many organizations. Broadly, data quality problems can be classified into three categories: incompleteness, inconsistency, and incorrectness. Among these, data incorrectness is the major source of low-quality data. This research therefore focuses on error detection for improving data quality. In this study, we developed a set of error detection methods based on the semantic constraint framework. Specifically, we proposed uniqueness detection, domain detection, attribute value dependency detection, attribute domain inclusion detection, and entity participation detection. Empirical evaluation results showed that some of the proposed error detection techniques (e.g., uniqueness detection) achieved low miss rates and low false alarm rates. Overall, our error detection methods together could identify around 50% of the errors introduced by subjects during the experiments.
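As a flavor of two of the listed checks, the sketch below implements uniqueness detection and domain detection over an in-memory table; the column names and example rows are hypothetical and the code is not taken from the thesis.

```python
def detect_uniqueness_violations(rows, key_column):
    """Uniqueness detection: flag rows whose value in key_column appears more than once."""
    seen, duplicates = {}, []
    for i, row in enumerate(rows):
        value = row[key_column]
        if value in seen:
            duplicates.append((seen[value], i, value))
        else:
            seen[value] = i
    return duplicates

def detect_domain_violations(rows, column, allowed):
    """Domain detection: flag rows whose value in `column` falls outside the allowed domain."""
    return [(i, row[column]) for i, row in enumerate(rows) if row[column] not in allowed]

# Hypothetical example data; 'emp_id' and 'dept' are illustrative column names.
employees = [
    {"emp_id": "E01", "dept": "Sales"},
    {"emp_id": "E02", "dept": "R&D"},
    {"emp_id": "E01", "dept": "Marketing"},   # duplicate key
    {"emp_id": "E03", "dept": "Saless"},      # misspelled department
]
print(detect_uniqueness_violations(employees, "emp_id"))
print(detect_domain_violations(employees, "dept", {"Sales", "R&D", "Marketing"}))
```

The remaining constraint types (attribute value dependency, attribute domain inclusion, entity participation) would follow the same pattern of checking records against declared semantic constraints.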
103

Data Quality in Data Warehouses: a Case Study

Bringle, Per January 1999 (has links)
Companies today experience problems with poor data quality in their systems. Because of the enormous amount of data held by companies, that data must be of good quality if companies want to take advantage of it. Since the purpose of a data warehouse is to gather information from several databases for decision support, it is vital that its data are of good quality. Several ways of determining or classifying data quality in databases exist. In this work, data quality management in a large Swedish company's data warehouse is examined through a case study, using a framework specialized for data warehouses. The quality of data is examined from syntactic, semantic, and pragmatic points of view. The results of the examination are then compared with a similar, previously conducted case study in order to identify differences and similarities.
104

Die Wirkung von Incentives auf die Antwortqualität in Umfragen / The effect of incentives on response quality in surveys

Dingelstedt, André 24 November 2015 (has links)
The standardized survey is a well-established and frequently used data collection method in social science research for gaining insight into the attitudes of population groups. In recent decades, however, a marked decline in the willingness to participate in surveys has been observed. To increase participation, the use of monetary incentives is usually recommended; these can be handed out at the beginning or at the end of the survey. It is unclear, however, whether and to what extent an incentive also influences response quality during the survey. Most studies conducted to date lack a clear definition of response quality and therefore select indicators for testing relationships without a derived theoretical basis. Moreover, the field lacks empirically validated theories explaining the effect of incentives on data quality in surveys. A theoretical foundation appears all the more important because recent studies report negative findings on response quality attributable to incentives (cf. Barge & Gehlbach (2012)). The aim of this thesis is therefore to clarify, on the basis of theoretical concepts and using an incentive experiment, whether and to what extent incentives systematically affect response quality. To this end, a definition of response quality was first derived from the concept of the Total Survey Error (cf. Biemer & Lyberg (2003); Weisberg (2005)), the satisficing approach of Krosnick (1991), and the German Microcensus Act (Mikrozensusgesetz, 2005). Four facets of response quality were identified, which served as the basis for the subsequent analyses. Cognitive Evaluation Theory (Deci & Ryan (1985)) was then drawn upon as a motivational-psychological approach, and the reciprocity hypothesis (Gouldner (1960)) was introduced. From these theoretical approaches, hypotheses were derived, all of which postulated a positive effect of incentives on response quality. In the next step, the survey design was described (three experimental groups with different incentives: 0 euros, 5 euros, 20 euros; students at the Universität Göttingen served as participants) and the self-developed questionnaire required for testing the hypotheses was presented. The central conclusion drawn from the results is that incentives have heterogeneous effects on the four facets of response quality. The amount of the incentive influences not only the strength of the effects but also their direction. An incentive of 5 euros tended to show positive effects on response quality, whereas an incentive of 20 euros tended to show negative effects. Negative effects on the facets of response quality were also observed in the experimental group without an incentive. These negative relationships are explained via the definition of the situation: it is assumed that respondents want to support researchers in their studies but, owing to misinterpretations of the researchers' goals and expectations, tend toward undesirable response behavior. From this explanation, the conjecture is formulated that as intrinsic motivation or reciprocity increases, it is not response quality itself that rises, but at most the respondents' willingness to provide better responses.
105

A new method of data quality control in production data using the capacitance-resistance model

Cao, Fei, active 21st century 02 November 2011 (has links)
Production data are the most abundant data in the field. However, they can often be of poor quality because of undocumented operational problems, changes in operating conditions, or even recording mistakes (Nobakht et al. 2009). If this poor quality or inconsistency is not recognized as such, it can be misinterpreted as a reservoir issue rather than the data quality problem it is. Quality control of production data is therefore a crucial and necessary step that must precede any further interpretation of the data. To restore production data, we propose to use the capacitance-resistance model (CRM) for data reconciliation. CRM is a simple reservoir simulation model that characterizes the connectivity between injectors and producers using only production and injection rate data. Because the CRM is based on the continuity equation, it can be used to analyze the production response to the injection signal in the reservoir. The problematic production data are fed directly into the CRM, and the resulting CRM parameters are used to evaluate what the correct production response would be under the current injection scheme. We also perform sensitivity analyses on synthetic fields, which are heterogeneous ideal reservoir models with imposed geology and well features built in Eclipse. The aim is to show how bad data can be misleading and to identify the best way to restore the production data. Using the CRM itself to control data quality is a novel method for obtaining clean production data. The cleaned production data can then be used in reservoir simulators or any other process where production data quality matters. This data quality control process helps to better understand the reservoir, analyze its behavior with more confidence, and make more reliable decisions.
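For orientation, the sketch below shows a common tank-type form of the CRM rate update for a single producer, neglecting bottomhole-pressure changes; the connectivity weights, time constant, and rates used here are illustrative assumptions, not values or code from the study.

```python
import math

def crm_predicted_rates(q0, injection_rates, gains, tau, dt=1.0):
    """Predict producer liquid rates with a simple capacitance-resistance model (CRM).

    q0              -- initial production rate of the producer
    injection_rates -- list of per-timestep lists, one injection rate per injector
    gains           -- connectivity weights between each injector and this producer
    tau             -- time constant of the producer's drainage volume
    Bottomhole-pressure variation is neglected in this sketch.
    """
    decay = math.exp(-dt / tau)
    rates, q = [], q0
    for inj in injection_rates:
        weighted_injection = sum(f * i for f, i in zip(gains, inj))
        # Exponential decay of the previous rate plus the supported injection signal.
        q = q * decay + (1.0 - decay) * weighted_injection
        rates.append(q)
    return rates

# Two injectors supporting one producer (all numbers are illustrative).
predicted = crm_predicted_rates(
    q0=500.0,
    injection_rates=[[300.0, 200.0], [320.0, 180.0], [310.0, 400.0]],
    gains=[0.6, 0.3],
    tau=5.0,
)
print(predicted)
```

Large, persistent gaps between recorded rates and such model-predicted rates are the kind of signal that can flag suspect production data for reconciliation.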
106

Multi-Sensor Vegetation Index and Land Surface Phenology Earth Science Data Records in Support of Global Change Studies: Data Quality Challenges and Data Explorer System

Barreto-Munoz, Armando January 2013 (has links)
Synoptic global remote sensing provides a multitude of land surface state variables. The continuous collection, for more than 30 years, of global observations has contributed to the creation of a unique and long term satellite imagery archive from different sensors. These records have become an invaluable source of data for many environmental and global change related studies. The problem, however, is that they are not readily available for use in research and application environment and require multiple preprocessing. Here, we looked at the daily global data records from the Advanced Very High Resolution Radiometer (AVHRR) and the Moderate Resolution Imaging Spectroradiometer (MODIS), two of the most widely available and used datasets, with the objective of assessing their quality and suitability to support studies dealing with global trends and changes at the land surface. Findings show that clouds are the major data quality inhibitors, and that the MODIS cloud masking algorithm performs better than the AVHRR. Results show that areas of high ecological importance, like the Amazon, are most prone to lack of data due to cloud cover and aerosols leading to extended periods of time with no useful data, sometimes months. While the standard approach to these challenges has been compositing of daily images to generate a representative map over a preset time periods, our results indicate that preset compositing is not the optimal solution and a hybrid location dependent method that preserves the high frequency of these observations over the areas where clouds are not as prevalent works better. Using this data quality information the Vegetation Index and Phenology (VIP) Laboratory at The University of Arizona produced over 30 years of seamless sensor independent record of vegetation indices and land surface phenology metrics. These data records consist of 0.05-degree resolution global images for daily, 7-days, 15-days and monthly temporal frequency. These sort of remote sensing based products are normally made available through the internet by large data centers, like the Land Processes Distributed Active Archive Center (LP DAAC), however, in this project an online tool, the VIP Data Explorer, was developed to support the visualization, exploration, and distribution of these Earth Science Data Records (ESDRs) keeping it closer to the data generation center which provides a more active data support and distribution model. This web application has made it possible for users to explore and evaluate the products suite before download and use.
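To make the compositing trade-off concrete, the sketch below contrasts a classic window-wide maximum-value composite with a per-pixel check of how many clear observations are available; the array shapes, cloud fraction, and threshold are assumptions, and this is not the hybrid method developed in the dissertation.

```python
import numpy as np

def hybrid_composite(ndvi, cloud_mask, min_clear=3):
    """Compute a window-wide max-value composite and flag pixels with enough clear
    observations to retain the full daily series instead.

    ndvi       -- array of shape (days, rows, cols)
    cloud_mask -- boolean array of the same shape, True where the pixel is cloudy
    """
    clear = np.where(cloud_mask, np.nan, ndvi)     # drop cloudy observations
    clear_counts = np.sum(~cloud_mask, axis=0)     # clear days per pixel
    window_max = np.nanmax(clear, axis=0)          # classic max-value composite
    # Pixels that are all-cloudy over the window would yield NaN in window_max.
    keep_daily = clear_counts >= min_clear         # where daily frequency can be preserved
    return window_max, keep_daily

days, rows, cols = 15, 2, 2
rng = np.random.default_rng(0)
ndvi = rng.uniform(0.1, 0.9, size=(days, rows, cols))
clouds = rng.uniform(size=(days, rows, cols)) < 0.4   # ~40% cloudy observations
composite, keep_daily = hybrid_composite(ndvi, clouds)
print(composite.round(2))
print(keep_daily)
```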
107

Linked Data Quality Assessment and its Application to Societal Progress Measurement

Zaveri, Amrapali 19 May 2015 (has links) (PDF)
In recent years, the Linked Data (LD) paradigm has emerged as a simple mechanism for employing the Web as a medium for data and knowledge integration, where both documents and data are linked. Moreover, the semantics and structure of the underlying data are kept intact, making this the Semantic Web. LD essentially entails a set of best practices for publishing and connecting structured data on the Web, which allows publishing and exchanging information in an interoperable and reusable fashion. Many different communities on the Internet, such as geographic, media, life sciences and government, have already adopted these LD principles. This is confirmed by the dramatically growing Linked Data Web, where currently more than 50 billion facts are represented. With the emergence of the Web of Linked Data, there are several use cases which are possible due to the rich and disparate data integrated into one global information space. Linked Data, in these cases, not only assists in building mashups by interlinking heterogeneous and dispersed data from multiple sources but also empowers the uncovering of meaningful and impactful relationships. These discoveries have paved the way for scientists to explore the existing data and uncover meaningful outcomes that they might not have been aware of previously. In all these use cases utilizing LD, one crippling problem is the underlying data quality. Incomplete, inconsistent or inaccurate data affects the end results gravely, thus making them unreliable. Data quality is commonly conceived as fitness for use, be it for a certain application or use case. Datasets that contain quality problems may still be useful for certain applications, depending on the use case at hand. Thus, LD consumption has to deal with the problem of getting the data into a state in which it can be exploited for real use cases. Insufficient data quality can be caused by the LD publication process or can be intrinsic to the data source itself. A key challenge is to assess the quality of datasets published on the Web and make this quality information explicit. Assessing data quality is particularly challenging in LD, as the underlying data stems from a set of multiple, autonomous and evolving data sources. Moreover, the dynamic nature of LD makes quality assessment crucial for measuring how accurately the real world is represented. On the document Web, data quality can only be indirectly or vaguely defined, but there is a requirement for more concrete and measurable data quality metrics for LD. Such data quality metrics include correctness of facts with respect to the real world, adequacy of semantic representation, quality of interlinks, interoperability, timeliness, and consistency with regard to implicit information. Even though data quality is an important concept in LD, few methodologies have been proposed to assess the quality of these datasets. Thus, in this thesis, we first unify 18 data quality dimensions and provide a total of 69 metrics for the assessment of LD. The first methodology involves the employment of LD experts for the assessment. This assessment is performed with the help of the TripleCheckMate tool, which was developed specifically to assist LD experts in assessing the quality of a dataset, in this case DBpedia. The second methodology is a semi-automatic process, in which the first phase involves the detection of common quality problems through the automatic creation of an extended schema for DBpedia. The second phase involves the manual verification of the generated schema axioms. Thereafter, we employ the wisdom of the crowd, i.e., workers on online crowdsourcing platforms such as Amazon Mechanical Turk (MTurk), to assess the quality of DBpedia. We then compare the two approaches (the previous assessment by LD experts and the assessment by MTurk workers in this study) in order to measure the feasibility of each type of user-driven data quality assessment methodology. Additionally, we evaluate another semi-automated methodology for LD quality assessment, which also involves human judgement. In this semi-automated methodology, selected metrics are formally defined and implemented as part of a tool, namely R2RLint. The user is provided not only with the results of the assessment but also with the specific entities that cause the errors, which helps users understand and fix the quality issues. Finally, we take into account a domain-specific use case that consumes LD and depends on data quality. In particular, we identify four LD sources, assess their quality using the R2RLint tool and then utilize them in building the Health Economic Research (HER) Observatory. We show the advantages of this semi-automated assessment over the other types of quality assessment methodologies discussed earlier. The Observatory aims at evaluating the impact of research development on the economic and healthcare performance of each country per year. We illustrate the usefulness of LD in this use case and the importance of quality assessment for any data analysis.
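As a flavor of what a concrete, measurable LD quality metric can look like, the sketch below computes a simple completeness-style metric (the fraction of subjects carrying an rdfs:label) over a toy RDF graph; it assumes the rdflib library is available, and the metric and example data are illustrative rather than drawn from the 69 metrics defined in the thesis.

```python
from rdflib import Graph, RDFS

def label_completeness(graph: Graph) -> float:
    """Fraction of distinct subjects that carry an rdfs:label (a simple completeness metric)."""
    subjects = set(graph.subjects())
    if not subjects:
        return 1.0
    labelled = sum(1 for s in subjects if graph.value(s, RDFS.label) is not None)
    return labelled / len(subjects)

# Toy dataset: one resource has a label, the other does not.
turtle = """
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Berlin  rdfs:label "Berlin" ; ex:population 3600000 .
ex:Potsdam ex:population 180000 .
"""
g = Graph()
g.parse(data=turtle, format="turtle")
print(f"label completeness: {label_completeness(g):.2f}")  # 0.50 for this toy graph
```

Making such metrics explicit per dataset is what allows consumers to judge fitness for use before building an application on top of the data.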
108

SemDQ: A Semantic Framework for Data Quality Assessment

Zhu, Lingkai January 2014 (has links)
Objective: Access to, and reliance upon, high-quality data is an enabling cornerstone of modern health delivery systems. Sadly, health systems are often awash with poor-quality data, which contributes to adverse outcomes and can compromise the search for new knowledge. Traditional approaches to purging poor data from health information systems often require manual, laborious and time-consuming procedures at the collection, sanitizing and processing stages of the information life cycle, with results that often remain sub-optimal. A promising solution may lie with semantic technologies - a family of computational standards and algorithms capable of expressing and deriving the meaning of data elements. Semantic approaches purport to offer the ability to represent clinical knowledge in ways that can support complex searching and reasoning tasks. It is argued that this ability offers exciting promise as a novel approach to assessing and improving data quality. This study examines the effectiveness of semantic web technologies as a mechanism by which high-quality data can be collected and assessed in health settings. To make this assessment, key study objectives include determining whether a valid semantic data model can be constructed that sufficiently expresses the complexity present in the data, as well as developing a comprehensive set of validation rules that can be applied semantically to test the effectiveness of the proposed semantic framework. Methods: The Semantic Framework for Data Quality Assessment (SemDQ) was designed. A core component of the framework is an ontology representing data elements and their relationships in a given domain. In this study, the ontology was developed using openEHR standards, with extensions to capture data elements used for patient care and research purposes in a large organ transplant program. Data quality dimensions were defined, and corresponding criteria for assessing data quality were developed for each dimension. These criteria were then applied, using semantic technology, to an anonymized research dataset containing medical data on transplant patients. Results were validated by clinical researchers. Another test was performed on a simulated dataset with the same attributes as the research dataset to confirm the computational accuracy and effectiveness of the framework. Results: A prototype of SemDQ was successfully implemented, consisting of an ontological model integrating the openEHR reference model, a vocabulary of transplant variables and a set of data quality dimensions. Thirteen criteria in three data quality dimensions were transformed into computational constructs using semantic web standards. Reasoning and logical inconsistency checking were first performed on the simulated dataset, which contains carefully constructed test cases to ensure the correctness and completeness of the logical computation. The same quality-checking algorithms were then applied to an established research database. Data quality defects were successfully identified in this dataset, which had been manually cleansed and validated periodically. Among the 103,505 data entries, application of two criteria did not return any error, while eleven of the criteria detected erroneous or missing data, with error rates ranging from 0.05% to 79.9%. Multiple review sessions were held with clinical researchers to verify the results, and the SemDQ framework was refined to reflect the intricate clinical knowledge. Data corrections were implemented in the source dataset as well as in the clinical system used in the transplant program, resulting in improved data quality for both clinical and research purposes. Implications: This study demonstrates the feasibility and benefits of using semantic technologies in data quality assessment processes. SemDQ is based on semantic web standards, which allows easy reuse of rules and leverages generic reasoning engines for computation. This mechanism avoids the shortcomings of proprietary rule engines, which often make rulesets and knowledge developed for one dataset difficult to reuse on different datasets, even within a similar clinical domain. SemDQ can implement rules that have been shown to have a greater capacity to detect complex cross-reference logical inconsistencies. In addition, the framework allows easy extension of the knowledge base to incorporate more data types and validation criteria. It has the potential to be incorporated into current workflows in clinical care settings to reduce data errors during the process of data capture.
109

Record Linkage for Web Data

Hassanzadeh, Oktie 15 August 2013 (has links)
Record linkage refers to the task of finding and linking records (in a single database or in a set of data sources) that refer to the same entity. Automating the record linkage process is a challenging problem, and has been the topic of extensive research for many years. However, the changing nature of the linkage process and the growing size of data sources create new challenges for this task. This thesis studies the record linkage problem for Web data sources. Our hypothesis is that a generic and extensible set of linkage algorithms combined within an easy-to-use framework that integrates and allows tailoring and combining of these algorithms can be used to effectively link large collections of Web data from different domains. To this end, we first present a framework for record linkage over relational data, motivated by the fact that many Web data sources are powered by relational database engines. This framework is based on declarative specification of the linkage requirements by the user and allows linking records in many real-world scenarios. We present algorithms for translation of these requirements to queries that can run over a relational data source, potentially using a semantic knowledge base to enhance the accuracy of link discovery. Effective specification of requirements for linking records across multiple data sources requires understanding the schema of each source, identifying attributes that can be used for linkage, and their corresponding attributes in other sources. Schema or attribute matching is often done with the goal of aligning schemas, so attributes are matched if they play semantically related roles in their schemas. In contrast, we seek to find attributes that can be used to link records between data sources, which we refer to as linkage points. In this thesis, we define the notion of linkage points and present the first linkage point discovery algorithms. We then address the novel problem of how to publish Web data in a way that facilitates record linkage. We hypothesize that careful use of existing, curated Web sources (their data and structure) can guide the creation of conceptual models for semi-structured Web data that in turn facilitate record linkage with these curated sources. Our solution is an end-to-end framework for data transformation and publication, which includes novel algorithms for identification of entity types and their relationships out of semi-structured Web data. A highlight of this thesis is showcasing the application of the proposed algorithms and frameworks in real applications and publishing the results as high-quality data sources on the Web.
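To make the notion of linkage points concrete, the sketch below scores attribute pairs across two toy sources by value overlap and keeps those above a threshold; the Jaccard heuristic, threshold, and example records are assumptions for illustration and do not reproduce the discovery algorithms presented in the thesis.

```python
from itertools import product

def candidate_linkage_points(source_a, source_b, min_overlap=0.5):
    """Score attribute pairs across two sources by value overlap (Jaccard) and
    return those above a threshold as candidate linkage points.

    source_a, source_b -- lists of dicts (records); attribute names are arbitrary.
    """
    def values(records, attr):
        # Normalize values crudely so minor formatting differences do not hide overlap.
        return {str(r[attr]).strip().lower() for r in records if r.get(attr) is not None}

    attrs_a = {a for r in source_a for a in r}
    attrs_b = {b for r in source_b for b in r}
    candidates = []
    for a, b in product(attrs_a, attrs_b):
        va, vb = values(source_a, a), values(source_b, b)
        if not va or not vb:
            continue
        jaccard = len(va & vb) / len(va | vb)
        if jaccard >= min_overlap:
            candidates.append((a, b, round(jaccard, 2)))
    return sorted(candidates, key=lambda t: -t[2])

# Hypothetical sources with differently named but overlapping attributes.
movies_db = [{"title": "Heat", "year": 1995}, {"title": "Alien", "year": 1979}]
movies_web = [{"name": "heat", "released": "1995"}, {"name": "alien", "released": "1979"}]
print(candidate_linkage_points(movies_db, movies_web))
```

Attribute pairs surfaced this way (here, title/name and year/released) are exactly the kind of candidates a declarative linkage specification would then use to compare records across the two sources.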