Global ETD Search

1	Towards a Data Quality Framework for Heterogeneous Data Micic, Natasha, Neagu, Daniel, Campean, Felician, Habib Zadeh, Esmaeil 22 April 2017 / Every industry produces significant data output as a product of its working processes, and with the recent advent of big data mining and integrated data warehousing there is a case for a robust methodology for assessing data quality to support sustainable and consistent processing. In this paper a review is conducted on Data Quality (DQ) in multiple domains in order to propose connections between their methodologies. This critical review suggests that, within the process of DQ assessment, heterogeneous data sets are rarely treated as separate types of data in need of an alternative data quality assessment framework. We discuss the need for such a directed DQ framework and the opportunities foreseen in this research area, and propose to address it through degrees of heterogeneity.
2	Exploring Methods for Comparing Similarity of Dimensionally Inconsistent Multivariate Numerical Data Micic, Natasha, Neagu, Daniel, Torgunov, Denis, Campean, Felician 28 June 2018 / When developing multivariate data classification and clustering methodologies for data mining, most contributions in the literature consider only data that consistently contain the same attributes. There are, however, many cases in current big data analytics applications where data sets on the same topic, and even from the same source, measure differing attributes, for a multitude of reasons (whether the specific design of an experiment or poor data quality and consistency). We define this class of data as dimensionally inconsistent multivariate data, a topic that can be considered a subclass of Big Data Variety research. This paper explores some methodologies commonly used in multivariate classification and clustering tasks and considers how these traditional methodologies could be adapted to compare dimensionally inconsistent data sets. The study focuses on adapting two similarity measures, the Robinson-Foulds tree distance metric and Variation of Information, to compare the clusterings produced by hierarchical clustering algorithms (the clusters are derived from the raw multivariate data). The results from experiments on engineering data highlight that adapting pairwise measures to exclude non-common attributes from the traditional distance metrics may not be the best method of classification. We suggest that more specialised metrics of similarity are required to address the challenges presented by dimensionally inconsistent multivariate data, with specific applications for big engineering data analytics. / Jaguar Land-Rover / Big data; Clustering; Heterogeneous data sets; Classification methodologies; Inconsistent multivariate data
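
As an illustration of one of the two measures named in the second abstract, the sketch below computes the standard Variation of Information between two flat clusterings of the same items, VI = H(A|B) + H(B|A). It is a minimal textbook version, not the authors' adapted measure: the function name and the assumption that both clusterings are given as equal-length label lists are hypothetical choices made for this example.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Standard Variation of Information (natural log) between two clusterings.

    labels_a[i] and labels_b[i] are the cluster labels assigned to item i
    by the two clusterings (an assumed input format for this sketch).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)               # cluster sizes in clustering A
    count_b = Counter(labels_b)               # cluster sizes in clustering B
    joint = Counter(zip(labels_a, labels_b))  # sizes of pairwise overlaps

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Contribution of this overlap to H(A|B) + H(B|A).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi


if __name__ == "__main__":
    # Same partition under different label names versus a crossed partition.
    print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))
    print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The measure is zero when the two clusterings induce the same partition, regardless of label names, and grows as the partitions disagree. Applying it to dimensionally inconsistent data would still require deciding how the two clusterings are obtained from differing attribute sets, which is the question the abstract raises.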
