About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Query Answering over Functional Dependency Repairs

Galiullin, Artur 11 September 2013
Inconsistency often arises in real-world databases and, as a result, critical queries over dirty data may lead users to make ill-informed decisions. Functional dependencies (FDs) can be used to specify intended semantics of the underlying data and aid with the cleaning task. Enumerating and evaluating all the possible repairs to FD violations is infeasible, while approaches that produce a single repair or attempt to isolate the dirty portion of data are often too destructive or constraining. In this thesis, we leverage a recent advance in data cleaning that allows sampling from a well-defined space of reasonable repairs, and provide the user with a data management tool that gives uncertain query answers over this space. We propose a framework to compute probabilistic query answers as though each repair sample were a possible world. We show experimentally that queries over many possible repairs produce results that are more useful than other approaches and that our system can scale to large datasets.
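To illustrate the possible-worlds semantics described in this abstract, here is a minimal sketch, assuming a hypothetical relation, query, and set of repair samples (not the thesis's actual system): each sampled repair is treated as one equally likely world, and an answer tuple's probability is the fraction of worlds in which the query returns it.

```python
from collections import Counter

# Three hypothetical repair samples of a relation Emp(name, dept),
# drawn from some sampler over reasonable repairs of the FD name -> dept.
repair_samples = [
    [("alice", "sales"), ("bob", "hr")],
    [("alice", "sales"), ("bob", "it")],
    [("alice", "marketing"), ("bob", "hr")],
]

def query(world):
    """Example query: names of employees working in 'sales'."""
    return {name for name, dept in world if dept == "sales"}

# Treat each sampled repair as an equally likely possible world and report,
# for every answer tuple, the fraction of worlds in which it is returned.
counts = Counter()
for world in repair_samples:
    counts.update(query(world))

answer_probabilities = {t: n / len(repair_samples) for t, n in counts.items()}
print(answer_probabilities)  # {'alice': 0.666...}
```

With more repair samples the estimated probabilities stabilize, without ever enumerating the full space of repairs.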
2

Open City Data Pipeline

Bischof, Stefan, Kämpgen, Benedikt, Harth, Andreas, Polleres, Axel, Schneider, Patrik 02 1900
Statistical data about cities, regions, and countries is collected for various purposes and from various institutions. Yet, while access to high-quality and recent such data is crucial both for decision makers and for the public, all too often such collections of data remain isolated and not re-usable, let alone properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and republish this data in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and ontological reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV, and linking to e.g. DBpedia. Lastly, in an exhaustive evaluation of our approach, we compare our enrichment and cleansing techniques to a preliminary version of the Open City Data Pipeline presented at ISWC2015: firstly, we demonstrate that the combination of equational knowledge and standard machine learning techniques significantly helps to improve the quality of our missing value imputations; secondly, we arguably show that the more data we integrate, the more reliable our predictions become. Hence, over time, the Open City Data Pipeline shall provide a sustainable effort to serve Linked Data about cities in increasing quality. / Series: Working Papers on Information Systems, Information Business and Operations
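The two enrichment steps named in the abstract can be pictured with a small sketch; the city figures, indicator names, and the plain linear regression below are illustrative assumptions, not the pipeline's actual data or models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical city indicators; np.nan marks a missing value.
# Columns: population, area_km2, pop_density, unemployment_rate
data = np.array([
    [1_800_000, 415.0, np.nan, 7.1],
    [3_600_000, 891.0, 4040.0, 9.3],
    [np.nan,    310.0, 5200.0, np.nan],
    [2_900_000, 607.0, 4777.0, 8.0],
])
POP, AREA, DENS, UNEMP = range(4)

# Step 1: equational knowledge, e.g. pop_density = population / area_km2
# (and its rearrangements), fills values exactly where possible.
for row in data:
    if np.isnan(row[DENS]) and not np.isnan(row[POP]) and not np.isnan(row[AREA]):
        row[DENS] = row[POP] / row[AREA]
    if np.isnan(row[POP]) and not np.isnan(row[DENS]) and not np.isnan(row[AREA]):
        row[POP] = row[DENS] * row[AREA]

# Step 2: statistical imputation for what remains: a regression predicting
# unemployment_rate from the other indicators, trained on complete rows
# (a simple stand-in for the pipeline's Machine Learning models).
complete = ~np.isnan(data).any(axis=1)
model = LinearRegression().fit(data[complete][:, [POP, AREA, DENS]],
                               data[complete][:, UNEMP])
missing = np.isnan(data[:, UNEMP])
data[missing, UNEMP] = model.predict(data[missing][:, [POP, AREA, DENS]])
print(data)
```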
3

Relational Data Curation by Deduplication, Anonymization, and Diversification

Huang, Yu January 2020
Enterprises acquire large amounts of data from a variety of sources with the goal of extracting valuable insights and enabling informed analysis. Unfortunately, organizations continue to be hindered by poor data quality as they wrangle with their data to extract value, since most real datasets are rarely error-free. Poor data quality is a pervasive problem that spans all industries, causing unreliable data analysis and costing billions of dollars. The sheer number of datasets, the pace of data acquisition, and the heterogeneity of data sources pose challenges to achieving high-quality data. These challenges are further exacerbated by data privacy and data diversity requirements. In this thesis, we study and propose solutions to address data duplication, to manage the trade-off between data cleaning and data privacy, and to compute diverse data instances. In the first part of this thesis, we address the data duplication problem. We propose a duplication detection framework, which combines word embeddings with constraints among attributes to improve the accuracy of deduplication. We propose a set of constraint-based statistical features to capture the semantic relationships among attributes. We show that our techniques achieve comparable accuracy on real datasets. In the second part of this thesis, we study the problem of data privacy and data cleaning, and we present a Privacy-Aware data Cleaning-As-a-Service (PACAS) framework to protect privacy during the cleaning process. Our evaluation shows that PACAS safeguards semantically related sensitive values and provides lower repair errors compared to existing privacy-aware cleaning techniques. In the third part of this thesis, we study the problem of finding a diverse anonymized data instance, where diversity is measured via a set of diversity constraints, and we propose an algorithm that seeks a k-anonymous relation with value suppression while satisfying the given diversity constraints. We conduct extensive experiments using real and synthetic data, showing the effectiveness of our techniques and their improvement over existing baselines. / Thesis / Doctor of Philosophy (PhD)
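As a rough illustration of the third part only, the sketch below suppresses quasi-identifier values until every group of identical quasi-identifiers has at least k rows and a minimum number of distinct sensitive values. The records, columns, and greedy suppression order are hypothetical and are not the thesis's algorithm.

```python
from collections import defaultdict

# Hypothetical records: (zip_code, age_band, diagnosis); the first two
# are quasi-identifiers, diagnosis is the sensitive attribute.
records = [
    ("90210", "20-29", "flu"),
    ("90210", "20-29", "asthma"),
    ("90211", "30-39", "flu"),
    ("90213", "30-39", "diabetes"),
]

def suppress(records, cols):
    """Suppress the given quasi-identifier columns (replace with '*')."""
    out = []
    for r in records:
        r = list(r)
        for c in cols:
            r[c] = "*"
        out.append(tuple(r))
    return out

def satisfies(records, k, min_distinct):
    """k-anonymity plus a simple diversity constraint: every group of
    identical quasi-identifiers has >= k rows and >= min_distinct
    distinct sensitive values."""
    groups = defaultdict(list)
    for zip_code, age, diagnosis in records:
        groups[(zip_code, age)].append(diagnosis)
    return all(len(g) >= k and len(set(g)) >= min_distinct
               for g in groups.values())

# Greedily suppress more quasi-identifier columns until the constraints hold.
for cols in ([], [0], [0, 1]):
    candidate = suppress(records, cols)
    if satisfies(candidate, k=2, min_distinct=2):
        break
print(candidate)  # zip_code suppressed is already enough here
```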
4

Novel Online Data Cleaning Protocols for Data Streams in Trajectory, Wireless Sensor Networks

Pumpichet, Sitthapon 12 November 2013
The promise of Wireless Sensor Networks (WSNs) is the autonomous collaboration of a collection of sensors to accomplish goals that a single sensor cannot. Basically, sensor networking serves a range of applications by providing raw data as the foundation for further analysis and action. Imprecision in the collected data can tremendously mislead the decision-making process of sensor-based applications, resulting in ineffectiveness or failure to meet the application objectives. Because inherent WSN characteristics routinely corrupt the raw sensor readings, many research efforts attempt to improve the accuracy of the corrupted or “dirty” sensor data; the dirty data need to be cleaned or corrected. However, existing data cleaning solutions restrict themselves to static WSNs, where deployed sensors rarely move during operation. Nowadays, many emerging applications relying on WSNs need sensor mobility to enhance application efficiency and usage flexibility: the locations of deployed sensors are dynamic, and each sensor functions independently and contributes its own resources. Vehicles equipped with sensors for monitoring traffic conditions are one prospective example. Sensor mobility causes transient changes in network topology and in the correlations among sensor streams. Because they rely on static relationships among sensors, existing methods for cleaning sensor data in static WSNs are invalid in such mobile scenarios. Therefore, a data cleaning solution that accounts for sensor movement is actively needed. This dissertation aims to improve the quality of sensor data by considering the consequences of the various trajectory relationships of autonomous mobile sensors in the system. First, we address the dynamic network topology due to sensor mobility. The concept of a virtual sensor is presented and used for spatio-temporal selection of neighboring sensors to help clean sensor data streams; this is one of the first methods to clean data in mobile sensor environments. We also study the mobility pattern of moving sensors relative to the boundaries of sub-areas of interest, and develop a belief-based analysis to determine reliable sets of neighboring sensors to improve cleaning performance, especially when node density is relatively low. Finally, we design a novel sketch-based technique to clean data from internal sensors where spatio-temporal relationships among sensors cannot lead to data correlations among sensor streams.
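The general idea of spatio-temporal neighbor selection for mobile sensors can be sketched as follows. The readings, thresholds, and the simple median check are hypothetical stand-ins; they are not the dissertation's virtual-sensor protocol or its belief-based analysis.

```python
import math
from statistics import median

# Hypothetical readings from mobile sensors: (sensor_id, t, x, y, value).
readings = [
    ("s1", 10.0, 0.0, 0.0, 21.4),
    ("s2", 10.5, 0.3, 0.1, 21.9),
    ("s3", 10.2, 0.2, 0.4, 22.1),
    ("s4", 11.0, 5.0, 7.0, 35.0),   # far away, so not a neighbor
]

def neighbors(target, readings, max_dist=1.0, max_dt=2.0):
    """Spatio-temporal neighbor selection: values from other sensors
    that are close to the target reading in both space and time."""
    _, t, x, y, _ = target
    vals = []
    for sid, rt, rx, ry, rv in readings:
        if abs(rt - t) <= max_dt and math.hypot(rx - x, ry - y) <= max_dist:
            vals.append(rv)
    return vals

def clean(target, readings, tolerance=3.0):
    """Replace the target value with the neighbors' median if it deviates
    too much from them (a crude substitute for a learned reliability model)."""
    vals = neighbors(target, readings)
    if vals and abs(target[4] - median(vals)) > tolerance:
        return target[:4] + (median(vals),)
    return target

dirty = ("s5", 10.3, 0.1, 0.2, 55.0)   # corrupted reading
print(clean(dirty, readings))           # value replaced by ~21.9
```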
5

A Systems Approach to Rule-Based Data Cleaning

Amr H Ebaid 10 May 2019
High quality data is a vital asset for several businesses and applications. With flawed data costing billions of dollars every year, the need for data cleaning is unprecedented. Many data-cleaning approaches have been proposed in both academia and industry. However, there are no end-to-end frameworks for detecting and repairing errors with respect to a set of heterogeneous data-quality rules.

Several important challenges exist when envisioning an end-to-end data-cleaning system: (1) It should deal with heterogeneous types of data-quality rules and interleave their corresponding repairs. (2) It can be extended by various data-repair algorithms to meet users' needs for effectiveness and efficiency. (3) It must support continuous data cleaning and adapt to inevitable data changes. (4) It has to provide user-friendly interpretable explanations for the detected errors and the chosen repairs.

This dissertation presents a systems approach to rule-based data cleaning that is generalized, extensible, continuous and explaining. The proposed system distinguishes between a programming interface and a core to address the above challenges. The programming interface allows the user to specify various types of data-quality rules that uniformly define and explain what is wrong with the data, and how to fix it. Handling all the rules as black boxes, the core encapsulates various algorithms to holistically and continuously detect errors and repair data. The proposed system offers a simple interface to define data-quality rules, summarizes the data, highlights violations and fixes, and provides relevant auditing information to explain the errors and the repairs.
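The split between a programming interface for black-box rules and a core that applies them might look roughly like the sketch below. The rule, record shape, and loop are hypothetical assumptions, not the dissertation's actual API.

```python
from typing import Protocol

Row = dict  # a record as a column -> value mapping

class Rule(Protocol):
    """Black-box data-quality rule: it reports what is violated and
    proposes a fix, without the core knowing its internals."""
    def detect(self, table: list[Row]) -> list[int]: ...
    def repair(self, table: list[Row], violations: list[int]) -> None: ...

class NonNegativeSalary:
    """Example rule: salaries must be non-negative."""
    def detect(self, table):
        return [i for i, r in enumerate(table) if r["salary"] < 0]
    def repair(self, table, violations):
        for i in violations:
            table[i]["salary"] = abs(table[i]["salary"])

def clean(table: list[Row], rules: list[Rule], max_passes: int = 5) -> None:
    """Core loop: keep applying every rule until no rule reports a
    violation, since repairs from one rule may trigger another."""
    for _ in range(max_passes):
        dirty = False
        for rule in rules:
            violations = rule.detect(table)
            if violations:
                dirty = True
                rule.repair(table, violations)
        if not dirty:
            return

employees = [{"name": "bob", "salary": -50_000}]
clean(employees, [NonNegativeSalary()])
print(employees)  # [{'name': 'bob', 'salary': 50000}]
```

Keeping the core ignorant of each rule's internals is what lets heterogeneous rule types and repair algorithms be plugged in and interleaved.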
6

Integration of heterogeneous data types using self organizing maps

Bourennani, Farid 01 July 2009
With the growth of computer networks and the advancement of hardware technologies, unprecedented volumes of data have become accessible in a distributed fashion, forming heterogeneous data sources. Understanding and combining these data into data warehouses, or merging remote public data into existing databases, can significantly enrich the information provided by these data. This problem is called data integration: combining data residing at different sources and providing the user with a unified view of these data. There are two issues with making use of remote data sources: (1) discovery of relevant data sources, and (2) performing the proper joins between the local data source and the relevant remote databases. Both can be solved if one can effectively identify semantically related attributes between the local data sources and the available remote data sources. However, performing these tasks manually is time-consuming because of the large data sizes and the unavailability of schema documentation; therefore, an automated tool would definitely be more suitable. Automatically detecting similar entities based on content is challenging due to three factors. First, because the amount of records is voluminous, it is difficult to perceive or discover information structures or relationships. Second, the schemas of the databases are unfamiliar; therefore, detecting relevant data is difficult. Third, the database entity types are heterogeneous, and there is no existing solution for extracting a richer classification result from the processing of two different data types, or at least from textual and numerical data. We propose to utilize self-organizing maps (SOM) to aid the visual exploration of the large data volumes. The unsupervised classification property of SOM facilitates the integration of completely unfamiliar relational database tables and attributes based on their contents. In order to accommodate the heterogeneous data types found in relational databases, we extended the term frequency – inverse document frequency (TF-IDF) measure to handle numerical and textual attribute types through unified vectorization processing. The resulting map allows the user to browse the heterogeneously typed database attributes and discover clusters of documents (attributes) having similar content. The discovered clusters can significantly aid in manual or automated construction of data integrity constraints in data cleaning or schema mappings for data integration.
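The unified vectorization idea, treating textual values as words and numerical values as bin tokens so that a single TF-IDF space covers both, can be sketched as follows. The column names, values, and binning scheme are illustrative assumptions; a SOM (or any other clustering) could then be trained on the resulting vectors, which is simplified here to a plain cosine-similarity check.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical database columns from two sources, given as raw values.
columns = {
    "db1.customer.name":   ["alice smith", "bob jones", "carol diaz"],
    "db2.client.fullname": ["alice smith", "dan brown", "erin hall"],
    "db1.order.amount":    [19.99, 250.0, 7.5],
    "db2.invoice.total":   [20.0, 249.5, 8.0],
}

def to_document(values, bins=(0, 10, 50, 100, 500)):
    """Turn a column into one text document: textual values are kept as
    words, numeric values are replaced by a bin token, so TF-IDF can
    handle both attribute types in a single vector space."""
    tokens = []
    for v in values:
        if isinstance(v, (int, float)):
            tokens.append(f"numbin_{sum(v >= edge for edge in bins)}")
        else:
            tokens.extend(str(v).lower().split())
    return " ".join(tokens)

names = list(columns)
docs = [to_document(columns[n]) for n in names]
tfidf = TfidfVectorizer().fit_transform(docs)

# Content-based similarity between columns; clustering these vectors
# would group semantically related attributes across the databases.
sim = cosine_similarity(tfidf)
print(names[2], "~", names[3], ":", round(sim[2, 3], 2))
```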
7

Enriching integrated statistical open city data by combining equational knowledge and missing value imputation

Bischof, Stefan, Harth, Andreas, Kämpgen, Benedikt, Polleres, Axel, Schneider, Patrik 19 October 2017
Several institutions collect statistical data about cities, regions, and countries for various purposes. Yet, while access to high-quality and recent such data is both crucial for decision makers and a means for achieving transparency to the public, all too often such collections of data remain isolated and not re-usable, let alone comparable or properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and re-publish the resulting dataset in a re-usable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data, including links to external data sources, such as DBpedia, available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV. Apart from providing a contribution to the growing collection of data available as Linked Data, our enrichment process for missing values also contributes a novel methodology for combining rule-based inference about equational knowledge with inferences obtained from statistical Machine Learning approaches. While most existing works about inference in Linked Data have focused on ontological reasoning in RDFS and OWL, we believe that these complementary methods, and particularly their combination, could be fruitfully applied in many other domains for integrating Statistical Linked Data, independently of our concrete use case of integrating city data.
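For a sense of what republishing one enriched observation as Linked Data with the QB and PROV vocabularies could look like, here is a sketch using rdflib; the base URI, property names, figures, and the imputation activity are illustrative assumptions, not the pipeline's actual output.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

QB = Namespace("http://purl.org/linked-data/cube#")
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/citydata/")   # hypothetical base URI

g = Graph()
g.bind("qb", QB)
g.bind("prov", PROV)

# One observation of a city indicator, modelled as a QB observation.
obs = EX["obs/vienna/population/2015"]
g.add((obs, RDF.type, QB.Observation))
g.add((obs, QB.dataSet, EX["dataset/population"]))
g.add((obs, EX.city, URIRef("http://dbpedia.org/resource/Vienna")))
g.add((obs, EX.year, Literal(2015, datatype=XSD.gYear)))
g.add((obs, EX.population, Literal(1_797_337, datatype=XSD.integer)))  # illustrative figure
# Provenance: mark the value as produced by a (hypothetical) imputation run.
g.add((obs, PROV.wasGeneratedBy, EX["activity/imputation-run-42"]))

print(g.serialize(format="turtle"))
```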
8

Mitigating Inconsistencies by Coupling Data Cleaning, Filtering, and Contextual Data Validation in Wireless Sensor Networks

Bakhtiar, Qutub A 26 March 2009
With the advent of peer-to-peer networks and, more importantly, sensor networks, the desire to extract useful information from continuous and unbounded streams of data has become more prominent. For example, in tele-health applications, sensor-based data streaming systems are used to continuously and accurately monitor Alzheimer's patients and their surrounding environment. Typically, the requirements of such applications necessitate the cleaning and filtering of continuous, corrupted, and incomplete data streams gathered wirelessly in dynamically varying conditions. Yet, existing data stream cleaning and filtering schemes are incapable of capturing the dynamics of the environment while simultaneously suppressing the losses and corruption introduced by uncertain environmental, hardware, and network conditions. Consequently, existing data cleaning and filtering paradigms are being challenged. This dissertation develops novel schemes for cleaning data streams received from a wireless sensor network operating under non-linear and dynamically varying conditions. The study establishes a paradigm for validating spatio-temporal associations among data sources to enhance data cleaning. To simplify the complexity of the validation process, the developed solution maps the requirements of the application onto a geometric space and identifies the potential sensor nodes of interest. Additionally, this dissertation models a wireless sensor network data reduction system and establishes that segregating the data adaptation and prediction processes augments data reduction rates. The schemes presented in this study are evaluated using simulation and information theory concepts. The results demonstrate that the dynamic conditions of the environment are better managed when validation is used for data cleaning. They also show that when a fast-converging adaptation process is deployed, data reduction rates are significantly improved. Targeted applications of the developed methodology include machine health monitoring, tele-health, environment and habitat monitoring, intermodal transportation, and homeland security.
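The idea of mapping the application's requirements onto a geometric space and then validating a reading against the sensors inside the region of interest can be sketched as follows; the positions, values, region, and tolerance are all hypothetical, not the dissertation's scheme.

```python
# Hypothetical sensors with their last known positions and readings.
sensors = {
    "a": {"pos": (2.0, 3.0), "value": 18.2},
    "b": {"pos": (2.5, 2.5), "value": 18.9},
    "c": {"pos": (9.0, 9.0), "value": 31.0},
}

# Region of interest mapped onto a simple geometric space:
# an axis-aligned rectangle (x_min, y_min, x_max, y_max).
REGION = (0.0, 0.0, 5.0, 5.0)

def in_region(pos, region):
    x, y = pos
    x0, y0, x1, y1 = region
    return x0 <= x <= x1 and y0 <= y <= y1

def validate(sensor_id, new_value, tolerance=5.0):
    """Contextual validation: compare a new reading against the other
    sensors currently inside the region of interest; flag it if it
    deviates from their mean by more than the tolerance."""
    peers = [s["value"] for sid, s in sensors.items()
             if sid != sensor_id and in_region(s["pos"], REGION)]
    if not peers:
        return True  # nothing to validate against
    mean = sum(peers) / len(peers)
    return abs(new_value - mean) <= tolerance

print(validate("a", 18.5))   # True: consistent with sensor b
print(validate("a", 40.0))   # False: inconsistent, likely corrupted
```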
9

Data Cleaning with Minimal Information Disclosure

Gairola, Dhruv 11 1900
Businesses analyze large datasets in order to extract valuable insights from the data. Unfortunately, most real datasets contain errors that need to be corrected before any analysis. Businesses can utilize various data cleaning systems and algorithms to automate the correction of data errors. Many systems correct the data errors by using information present within the dirty dataset itself. Some also incorporate user feedback in order to validate the quality of the suggested data corrections. However, users are not always available for feedback. Hence, some systems rely on clean data sources to help with the data cleaning process. This involves comparing records between the dirty dataset and the clean dataset in order to detect high quality fixes for the erroneous data. Every record in the dirty dataset is compared with every record in the clean dataset in order to find similar records. The values of the records in the clean dataset can be used to correct the values of the erroneous records in the dirty dataset. Realistically, comparing records across two datasets may not be possible due to privacy reasons. For example, there are laws to restrict the free movement of personal data. Additionally, different records within a dataset may have different privacy requirements. Existing data cleaning systems do not factor in these privacy requirements on the respective datasets. This motivates the need for privacy aware data cleaning systems. In this thesis, we examine the role of privacy in the data cleaning process. We present a novel data cleaning framework that supports the cooperation between the clean and the dirty datasets such that the clean dataset discloses a minimal amount of information and the dirty dataset uses this information to (maximally) clean its data. We investigate the tradeoff between information disclosure and data cleaning utility, modelling this tradeoff as a multi-objective optimization problem within our framework. We propose four optimization functions to solve our optimization problem. Finally, we perform extensive experiments on datasets containing up to 3 million records by varying parameters such as the error rate of the dataset, the size of the dataset, the number of constraints on the dataset, etc and measure the impact on accuracy and performance for those parameters. Our results demonstrate that disclosing a larger amount of information within the clean dataset helps in cleaning the dirty dataset to a larger extent. We find that with 80% information disclosure (relative to the weighted optimization function), we are able to achieve a precision of 91% and a recall of 85%. We also compare our algorithms against each other to discover which ones produce better data repairs and which ones take longer to find repairs. We incorporate ideas from Barone et al. into our framework and show that our approach is 30% faster, but 7% worse for precision. We conclude that our data cleaning framework can be applied to real-world scenarios where controlling the amount of information disclosed is important. / Thesis / Master of Computer Science (MCS)
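The disclosure/utility trade-off can be pictured with a toy weighted objective. The scoring function, weights, budget, and candidate records below are illustrative assumptions; they are not the thesis's four optimization functions.

```python
# Hypothetical candidates: each clean record that could be disclosed,
# with the number of dirty records it would help fix and a sensitivity
# score capturing how much information its disclosure leaks.
candidates = [
    {"id": "c1", "fixes": 40, "sensitivity": 0.2},
    {"id": "c2", "fixes": 35, "sensitivity": 0.9},
    {"id": "c3", "fixes": 12, "sensitivity": 0.1},
    {"id": "c4", "fixes": 5,  "sensitivity": 0.8},
]

def score(c, alpha=0.5):
    """Weighted objective: cleaning utility minus a disclosure penalty;
    alpha shifts the balance between the two competing goals."""
    return alpha * c["fixes"] - (1 - alpha) * 100 * c["sensitivity"]

def select(candidates, budget=1.0, alpha=0.5):
    """Greedily disclose the best-scoring records while the total
    sensitivity stays within the disclosure budget."""
    chosen, spent = [], 0.0
    for c in sorted(candidates, key=lambda c: score(c, alpha), reverse=True):
        if score(c, alpha) > 0 and spent + c["sensitivity"] <= budget:
            chosen.append(c["id"])
            spent += c["sensitivity"]
    return chosen

print(select(candidates))   # ['c1', 'c3']
```

Raising alpha (or the budget) discloses more and cleans more; lowering it protects more and cleans less, which is the trade-off the abstract describes.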
