11 |
A More Decentralized Vision for Linked Data. Polleres, Axel; Kamdar, Maulik R.; Fernandez Garcia, Javier David; Tudorache, Tania; Musen, Mark A. January 2018
We claim that ten years into Linked Data there are still many unresolved challenges towards arriving at a truly machine-readable and decentralized Web of data. With a focus on the biomedical domain, currently one of the most promising adopters of Linked Data, we highlight and exemplify key technical and non-technical challenges to the success of Linked Data, and we outline potential solution strategies.
|
12 |
Open City Data Pipeline. Bischof, Stefan; Kämpgen, Benedikt; Harth, Andreas; Polleres, Axel; Schneider, Patrik. 02 1900
Statistical data about cities, regions, and countries is collected for various purposes and by various institutions. Yet, while access to high-quality and recent data of this kind is crucial both for decision makers and for the public, all too often such collections of data remain isolated and not re-usable, let alone properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at the city level worldwide, and to republish this data in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular, extensible, and always up-to-date fashion; (ii) we use both machine learning techniques and ontological reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator; and (iv) we make the integrated and enriched data available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV, and linking to, e.g., DBpedia. Lastly, in an exhaustive evaluation of our approach, we compare our enrichment and cleansing techniques to a preliminary version of the Open City Data Pipeline presented at ISWC 2015: firstly, we demonstrate that the combination of equational knowledge and standard machine learning techniques significantly improves the quality of our missing-value imputations; secondly, we show that the more data we integrate, the more reliable our predictions become. Hence, over time, the Open City Data Pipeline shall provide a sustainable effort to serve Linked Data about cities in increasing quality. / Series: Working Papers on Information Systems, Information Business and Operations
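As a rough illustration of the imputation idea in this abstract, the sketch below combines equational background knowledge (here, the assumed relation population density = population / area) with a standard regression model as a fallback for indicators no equation can derive. The indicator names, figures, and the choice of a random forest are placeholders, not details taken from the paper.

```python
# Illustrative sketch (not the authors' implementation): impute a missing city
# indicator first from equational background knowledge, then fall back to a
# regression model trained on cities where the indicator is known.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical city records; None marks a missing indicator value.
cities = [
    {"population": 1_800_000, "area_km2": 415.0, "pop_density": None},
    {"population": 3_600_000, "area_km2": 891.0, "pop_density": 4040.0},
    {"population": 1_700_000, "area_km2": 755.0, "pop_density": 2252.0},
    {"population": 2_100_000, "area_km2": 525.0, "pop_density": None},
]

def impute_equational(city):
    """Apply the equation pop_density = population / area_km2 when its inputs exist."""
    if city["pop_density"] is None and city["population"] and city["area_km2"]:
        return city["population"] / city["area_km2"]
    return city["pop_density"]

# Step 1: equational imputation.
for c in cities:
    c["pop_density"] = impute_equational(c)

# Step 2: machine-learning fallback for values no equation could fill
# (with the density equation above, nothing is left here; shown for completeness).
known = [c for c in cities if c["pop_density"] is not None]
missing = [c for c in cities if c["pop_density"] is None]
if missing:
    X = np.array([[c["population"], c["area_km2"]] for c in known])
    y = np.array([c["pop_density"] for c in known])
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    for c in missing:
        c["pop_density"] = float(model.predict([[c["population"], c["area_km2"]]])[0])
```

In the same spirit, the per-indicator accuracy assessment mentioned in the abstract could be estimated by hiding known values, re-imputing them, and measuring the error.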
|
13 |
The Gourmet Guide to Statistics: For an Instructional Strategy That Makes Teaching and Learning Statistics a Piece of Cake. Edirisooriya, Gunapala. 01 January 2003
This article draws analogies between the activities of statisticians and of chefs. It suggests how these analogies can be used in teaching, both to help understanding of what statistics is about and to increase motivation to learn the subject.
|
14 |
Relational Data Curation by Deduplication, Anonymization, and Diversification. Huang, Yu. January 2020
Enterprises acquire large amounts of data from a variety of sources with the goal of extracting valuable insights and enabling informed analysis. Unfortunately, organizations continue to be hindered by poor data quality as they wrangle with their data to extract value, since real datasets are rarely error-free. Poor data quality is a pervasive problem that spans all industries, causing unreliable data analysis and costing billions of dollars. The large volume of datasets, the pace of data acquisition, and the heterogeneity of data sources pose challenges to achieving high-quality data. These challenges are further exacerbated by data privacy and data diversity requirements. In this thesis, we study and propose solutions to address data duplication, managing the trade-off between data cleaning and data privacy, and computing diverse data instances.
In the first part of this thesis, we address the data duplication problem. We propose a duplicate detection framework that combines word embeddings with constraints among attributes to improve the accuracy of deduplication, together with a set of constraint-based statistical features that capture the semantic relationship among attributes, and we show that our techniques achieve comparable accuracy on real datasets. In the second part of this thesis, we study the problem of data privacy and data cleaning, and we present a Privacy-Aware data Cleaning-As-a-Service (PACAS) framework to protect privacy during the cleaning process. Our evaluation shows that PACAS safeguards semantically related sensitive values and provides lower repair errors compared to existing privacy-aware cleaning techniques. In the third part of this thesis, we study the problem of finding a diverse anonymized data instance where diversity is measured via a set of diversity constraints, and we propose an algorithm that seeks a k-anonymous relation via value suppression while satisfying the given diversity constraints. We conduct extensive experiments using real and synthetic data, showing the effectiveness of our techniques and improvements over existing baselines. / Thesis / Doctor of Philosophy (PhD)
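The following is a minimal sketch of the kind of duplicate scoring the first part describes, combining an embedding-based attribute similarity with a constraint-derived feature. The toy word vectors, the example rule ("equal zip implies equal city"), and the weights are illustrative assumptions, not the thesis's actual model.

```python
# Illustrative sketch only: score a candidate record pair for deduplication by
# combining embedding similarity over an attribute with a constraint feature.
import numpy as np

# Placeholder word embeddings; in practice these would come from a pretrained model.
EMB = {
    "intl": np.array([0.9, 0.1, 0.0]), "international": np.array([0.88, 0.12, 0.02]),
    "business": np.array([0.1, 0.9, 0.1]), "machines": np.array([0.05, 0.2, 0.9]),
    "corp": np.array([0.4, 0.5, 0.3]), "corporation": np.array([0.42, 0.48, 0.31]),
}

def attr_embedding(text):
    """Average the word vectors of an attribute value (zero vector if no word is known)."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def constraint_feature(r1, r2):
    """1.0 if the pair is consistent with the hypothetical rule 'equal zip implies equal city'."""
    if r1["zip"] == r2["zip"]:
        return 1.0 if r1["city"].lower() == r2["city"].lower() else 0.0
    return 0.5  # the rule does not apply to this pair

def duplicate_score(r1, r2, w_emb=0.7, w_con=0.3):
    emb_sim = cosine(attr_embedding(r1["name"]), attr_embedding(r2["name"]))
    return w_emb * emb_sim + w_con * constraint_feature(r1, r2)

r1 = {"name": "Intl Business Machines Corp", "zip": "10504", "city": "Armonk"}
r2 = {"name": "International Business Machines Corporation", "zip": "10504", "city": "Armonk"}
print(duplicate_score(r1, r2))  # a high score suggests a likely duplicate pair
```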
|
15 |
An Integrated Approach to Improve Data Quality. Al-janabi, Samir. 06 1900
A huge quantity of data is created and saved every day in databases from different types of data sources, including financial data, web log data, sensor data, and human input. Information technology enables organizations to collect and store large amounts of data in databases, and organizations worldwide use this data to support their activities through various applications. Issues in data quality such as duplicate records, inaccurate data, violations of integrity constraints, and outdated data are common, so data in databases are often unclean. Such quality issues may cost billions of dollars annually and may have severe consequences for critical tasks such as analysis, decision making, and planning. Data cleaning processes are required to detect and correct errors in the unclean data. Despite the fact that there are multiple quality issues, current data cleaning techniques generally deal with only one or two aspects of quality. These techniques assume either the availability of master data, or training data, or the involvement of users in data cleaning. For instance, users might manually assign confidence scores that represent the correctness of data values, or they may be consulted about the repairs. In addition, the techniques may depend on high-quality master data or pre-labeled training data to fix errors. However, relying on human effort to correct errors is expensive, and master data or training data are not always available. These factors make it challenging to discover which values have issues, and thereby difficult to fix the data (e.g., merging several duplicate records into a single representative record). To address these problems, we propose algorithms that integrate multiple data quality issues in the cleaning process. In this thesis, we apply this approach in a setting where errors are introduced from multiple causes: duplicate records, violations of integrity constraints, inaccurate data, and outdated data. We fix these issues holistically, without a need for manual human interaction, master data, or training data. We propose an algorithm to tackle the problem of data cleaning, concentrating on duplicate records, violations of integrity constraints, and inaccurate data. We utilize the density information embedded in the data to eliminate duplicates, where tuples that are close to each other are packed together. Density information enables us to reduce manual user interaction in the deduplication process and the dependency on master data or training data. To resolve inconsistency in duplicate records, we present a weight model that automatically assigns confidence scores based on the density of the data. We consider inconsistent data in terms of violations with respect to a set of functional dependencies (FDs), and we present a cost model for data repair that is based on the weight model. To resolve inaccurate data in duplicate records, we measure the relatedness of the words of the attributes in the duplicate records based on hierarchical clustering. To integrate the fix of outdated and inaccurate data into duplicate elimination, we propose an algorithm for data cleaning based on corroboration, i.e., taking into consideration the trustworthiness of the attribute values. The algorithm integrates data deduplication with data currency and accuracy.
We utilize the density information embedded inside the tuples to guide the cleaning process in fixing multiple data quality issues. By using density information in corroboration, we reduce the reliance on manual user interaction and the dependency on master data or training data. / Thesis / Doctor of Philosophy (PhD)
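A minimal sketch of one possible reading of the density-based weighting idea above: within a cluster of duplicate tuples, each candidate attribute value gets a confidence weight proportional to how strongly it is supported by similar values, and the heaviest value survives in the merged record. The similarity measure and the sample records are illustrative, not taken from the thesis.

```python
# Illustrative sketch, not the thesis's algorithm: merge a cluster of duplicate
# tuples by keeping, per attribute, the value with the highest density-style weight.
from difflib import SequenceMatcher

cluster = [  # hypothetical duplicates of the same real-world entity
    {"name": "Jon Smith",  "city": "Hamilton", "phone": "905-555-0101"},
    {"name": "John Smith", "city": "Hamilton", "phone": "905-555-0101"},
    {"name": "John Smith", "city": "Hamiltn",  "phone": "905-555-0199"},
]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def density_weight(value, others):
    """Confidence of a value = average similarity to all candidate values in the cluster."""
    return sum(similarity(value, o) for o in others) / len(others)

def merge(cluster):
    merged = {}
    for attr in cluster[0]:
        candidates = [t[attr] for t in cluster]
        # The densest value (most supported by near-identical candidates) wins.
        merged[attr] = max(candidates, key=lambda v: density_weight(v, candidates))
    return merged

print(merge(cluster))
# e.g. {'name': 'John Smith', 'city': 'Hamilton', 'phone': '905-555-0101'}
```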
|
16 |
The arrival of a new era in data processing – can ‘big data’ really deliver value to its users: A managerial forecast. Hussain, Zahid I.; Asad, M. 04 1900
No description available.
|
17 |
Pattern discovery from spatiotemporal data. Cao, Huiping (曹會萍). January 2006
Computer Science / Doctoral / Doctor of Philosophy
|
18 |
Medical image compression techniques for archiving and teleconsultation applications. Vlahakis, Vassilios. January 1999
No description available.
|
19 |
A Novel Method to Intelligently Mine Social Media to Assess Consumer Sentiment of Pharmaceutical Drugs. Akay, Altug. January 2017
This thesis focuses on the development of novel data mining techniques that convert user interactions in social media networks into readable data that benefits users, companies, and governments. The readable data can either warn of dangerous side effects of pharmaceutical drugs or improve intervention strategies. A weighted model enabled us to represent user activity in the network, which allowed us to reflect user sentiment towards a pharmaceutical drug and/or service. The result is an accurate representation of user sentiment. This approach, when adapted to specific diseases, drugs, and services, can enable rapid user feedback that industry and government can convert into rapid responses, withdrawing possibly dangerous drugs and services from the market or improving them. Our approach monitors social media networks in real time, enabling government and industry to respond rapidly to consumer sentiment about pharmaceutical drugs and services.
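As a rough illustration of the weighted-model idea (the thesis's actual weighting scheme is not reproduced here), the sketch below aggregates per-post sentiment about a drug into a single network-level score, weighting each post by the interaction it generated; the field names and weights are assumptions.

```python
# Illustrative sketch only: weight each post's sentiment by its network activity
# and aggregate into one score for the drug being monitored.
posts = [  # hypothetical posts mentioning the same drug
    {"sentiment": -0.8, "replies": 24, "shares": 10},   # side-effect complaint, widely discussed
    {"sentiment":  0.4, "replies": 2,  "shares": 0},    # mildly positive, little engagement
    {"sentiment": -0.5, "replies": 7,  "shares": 3},
]

def interaction_weight(post, reply_w=1.0, share_w=2.0):
    """Weight a post by the activity it generated; shares count more than replies here."""
    return 1.0 + reply_w * post["replies"] + share_w * post["shares"]

def network_sentiment(posts):
    total_w = sum(interaction_weight(p) for p in posts)
    return sum(p["sentiment"] * interaction_weight(p) for p in posts) / total_w

print(round(network_sentiment(posts), 3))  # about -0.67: a strongly negative, possible safety signal
```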
|
20 |
Forensic Reconstruction of Fragmented Variable Bitrate MP3 files. Sajja, Abhilash. 17 December 2010
File carving is a technique used to recover data from a digital device without the help of file system metadata. Current file carvers use techniques such as matching lists of header and footer values and keyword searching to retrieve information specific to a file type. These techniques tend to fail when the files to be recovered are fragmented, and recovering fragmented files is one of the primary challenges in file carving. In this research we focus on Variable Bit Rate (VBR) MP3 files; MP3 is one of the most widely used file formats for storing audio data. We develop a technique that uses MP3 file structure information to improve the performance of file carvers in reconstructing fragmented MP3 data. The technique performs statistical analysis on the bitrates of a large number of MP3 files.
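As a sketch of how MP3 structure information can support carving (not the thesis's implementation), the code below decodes MPEG-1 Layer III frame headers and follows the chain of frames implied by each header's computed length. A break in the chain suggests a fragment boundary, and the per-frame bitrates collected along the way could feed the kind of statistical analysis the abstract mentions.

```python
# Minimal sketch, assuming MPEG-1 Layer III frames: walk a byte buffer, decode
# frame headers, and check that each computed frame length leads to the next
# valid header. A break in the chain hints at a fragment boundary.
BITRATES_KBPS = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0]
SAMPLE_RATES = [44100, 48000, 32000, 0]

def parse_header(buf, i):
    """Return (frame_length, bitrate_kbps) if a valid MPEG-1 Layer III header starts at offset i."""
    if i + 4 > len(buf) or buf[i] != 0xFF or (buf[i + 1] & 0xE0) != 0xE0:
        return None
    version = (buf[i + 1] >> 3) & 0x03      # 3 = MPEG-1
    layer = (buf[i + 1] >> 1) & 0x03        # 1 = Layer III
    bitrate_idx = (buf[i + 2] >> 4) & 0x0F
    samplerate_idx = (buf[i + 2] >> 2) & 0x03
    padding = (buf[i + 2] >> 1) & 0x01
    if version != 3 or layer != 1 or bitrate_idx in (0, 15) or samplerate_idx == 3:
        return None
    bitrate = BITRATES_KBPS[bitrate_idx] * 1000
    length = 144 * bitrate // SAMPLE_RATES[samplerate_idx] + padding
    return length, BITRATES_KBPS[bitrate_idx]

def chain_frames(buf, start):
    """Follow consecutive frames from `start`; return the per-frame bitrates and where the chain breaks."""
    bitrates, i = [], start
    while True:
        hdr = parse_header(buf, i)
        if hdr is None:
            return bitrates, i   # chain broken: possible fragment boundary
        length, kbps = hdr
        bitrates.append(kbps)
        i += length

# Tiny synthetic demo: three back-to-back 128 kbps frames followed by junk bytes.
frame = bytes([0xFF, 0xFB, 0x90, 0x00]) + bytes(413)   # 417-byte frame, 44.1 kHz, no padding
data = frame * 3 + b"not an mp3 frame"
print(chain_frames(data, 0))   # ([128, 128, 128], 1251): chain breaks after the third frame
```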
|