
Classification of heterogeneous data based on data type impact of similarity

Real-world datasets are increasingly heterogeneous, containing a mixture of numerical, categorical and other feature types. The main challenge in mining heterogeneous datasets is how to deal with the heterogeneity present in the dataset records. Although some existing classifiers (such as decision trees) can handle heterogeneous data in specific circumstances, the performance of such models may still be improved, because heterogeneity requires specific adjustments to similarity measurements and calculations. Moreover, heterogeneous data is still treated inconsistently and in an ad hoc manner. In this paper, we study the problem of heterogeneous data classification: our purpose is to use heterogeneity as a positive feature of the data classification effort by consistently using the similarity between data objects. We address the heterogeneity issue by studying the impact of mixing data types on the calculation of data objects' similarity. To reach our goal, we propose an algorithm that divides the initial data records, based on pairwise similarity, into classification subtasks, with the aim of increasing the quality of the data subsets and applying specialized classifier models to them. The performance of the proposed approach is evaluated on 10 publicly available heterogeneous datasets. The results show that the models achieve better performance on heterogeneous datasets when using the proposed similarity process.
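
The abstract does not specify the similarity measure used. As an illustration only of how similarity between mixed-type records can be computed, the sketch below uses a Gower-style similarity, a standard technique swapped in here and not confirmed to be the authors' method. The names `gower_similarity`, `pairwise_matrix` and `numeric_ranges` are hypothetical, and the sketch assumes that every feature is either numerical (with a known value range) or categorical.

```python
from typing import Dict, List, Sequence, Union

Value = Union[float, str]


def gower_similarity(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Dict[int, float],
) -> float:
    """Gower-style similarity in [0, 1] between two mixed-type records.

    numeric_ranges maps the index of each numerical feature to its range
    (max - min over the dataset). Every other index is treated as
    categorical. This is an assumption of the sketch, not the paper's method.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            value_range = numeric_ranges[i]
            if value_range > 0:
                # Numerical feature: 1 minus the range-normalised distance.
                total += 1.0 - abs(x - y) / value_range
            else:
                # Constant feature: all records agree on it.
                total += 1.0
        else:
            # Categorical feature: simple match / mismatch.
            total += 1.0 if x == y else 0.0
    return total / len(a)


def pairwise_matrix(
    records: Sequence[Sequence[Value]],
    numeric_ranges: Dict[int, float],
) -> List[List[float]]:
    """Similarity of every record to every other record."""
    return [
        [gower_similarity(r, s, numeric_ranges) for s in records]
        for r in records
    ]


# Toy records: (numerical feature, categorical feature).
records = [(1.0, "a"), (3.0, "a"), (5.0, "b")]
numeric_ranges = {0: 4.0}  # feature 0 spans 1.0 .. 5.0
matrix = pairwise_matrix(records, numeric_ranges)
```

A matrix like this could then feed whatever rule is used to split records into subsets for separate classifiers; that splitting step is not sketched here because the abstract gives no details about it.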

Identifier: oai:union.ndltd.org:BRADFORD/oai:bradscholars.brad.ac.uk:10454/16760
Date: 11 August 2018
Creators: Ali, N., Neagu, Daniel, Trundle, Paul R.
Source Sets: Bradford Scholars
Language: English
Detected Language: English
Type: Conference paper, Accepted Manuscript
Rights: © Springer Nature Switzerland AG 2019. Reproduced in accordance with the publisher's self-archiving policy. The final publication is available at Springer via https://doi.org/10.1007/978-3-319-97982-3_21.
