Contributions for Handling Big Data Heterogeneity. Using Intuitionistic Fuzzy Set Theory and Similarity Measures for Classifying Heterogeneous Data

A huge amount of data is generated daily by digital technologies such as
social media, web logs, traffic sensors, on-line transactions, tracking data,
videos, and so on. This has led to the archiving and storage of larger and
larger datasets, many of which are multi-modal or contain different types
of data, contributing to the problem that is now known as “Big Data”.
In the area of Big Data, volume, variety and velocity problems remain difficult
to solve. The work presented in this thesis focuses on the variety
aspect of Big Data. For example, data can come in various and mixed formats
for the same feature (attribute) or for different features, and can mainly be
described by one of the following data types: real-valued, crisp and
linguistic values. The increasing variety and ambiguity of such data make
them particularly challenging to process and to use for building accurate machine learning
models. Therefore, data heterogeneity requires new methods of analysis
and modelling techniques to enable useful information extraction and the
modelling of achievable tasks. In this thesis, new approaches are proposed
for handling heterogeneous Big Data. These include two techniques for filtering
heterogeneous data objects: Two-Dimensional Similarity Space (2DSS), for data described by numeric
and categorical features, and Three-Dimensional Similarity Space (3DSS),
for real-valued, crisp and linguistic data. Both filtering techniques are used in this research to reduce the noise
in the initial dataset and make the dataset more homogeneous. Furthermore,
a new similarity measure based on intuitionistic fuzzy set theory
is proposed. The proposed measure is used to handle the heterogeneity
and ambiguity within crisp and linguistic data. In addition, new combined
similarity models are proposed which allow for a comparison between the
heterogeneous data objects represented by a combination of crisp and linguistic
values. Diverse examples are used to illustrate and discuss the efficiency
of the proposed similarity models. The thesis also presents a modification
of the k-Nearest Neighbour classifier, called k-Nearest Neighbour
Weighted Average (k-NNWA), to classify heterogeneous datasets described
by real-valued, crisp and linguistic data. Finally, the thesis
introduces a novel classification model, called FCCM (Filter Combined
Classification Model), for heterogeneous data classification. The proposed
model combines the advantages of 3DSS and the k-NNWA classifier and
outperforms the latter algorithm. All the proposed models and techniques
have been applied to weather datasets and evaluated using accuracy, F-score
and ROC area measures. The experiments revealed that the proposed
filtering techniques are an efficient approach for removing noise from heterogeneous
data and improving the performance of classification models.
Moreover, the experiments showed that the proposed similarity measure
for intuitionistic fuzzy data is capable of handling the fuzziness of heterogeneous
data, and that intuitionistic fuzzy set theory offers some promise
for solving some Big Data problems by handling the uncertainty and
heterogeneity of the data.
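
The abstract does not state the formula of the proposed similarity measure, so the sketch below is not the thesis's measure. It only illustrates the general idea of comparing objects described by intuitionistic fuzzy values, using a well-known stand-in: one minus the normalised Hamming distance over membership, non-membership and hesitation degrees. The names `IFValue` and `ifs_similarity`, and the example degrees assigned to linguistic terms, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IFValue:
    """One intuitionistic fuzzy value (hypothetical helper type)."""
    mu: float  # degree of membership
    nu: float  # degree of non-membership

    def __post_init__(self) -> None:
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0):
            raise ValueError("mu and nu must lie in [0, 1]")
        if self.mu + self.nu > 1.0 + 1e-9:
            raise ValueError("mu + nu must not exceed 1")

    @property
    def pi(self) -> float:
        """Hesitation degree: what is left after mu and nu."""
        return 1.0 - self.mu - self.nu


def ifs_similarity(a: Sequence[IFValue], b: Sequence[IFValue]) -> float:
    """Similarity in [0, 1] as 1 minus the normalised Hamming distance.

    Stand-in measure chosen for illustration, not the measure
    proposed in the thesis.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        abs(x.mu - y.mu) + abs(x.nu - y.nu) + abs(x.pi - y.pi)
        for x, y in zip(a, b)
    )
    return 1.0 - total / (2 * len(a))


# A crisp value maps to full membership or full non-membership;
# the degrees given to linguistic terms here are made-up examples.
crisp_yes = IFValue(1.0, 0.0)
crisp_no = IFValue(0.0, 1.0)
quite_warm = IFValue(0.6, 0.3)
fairly_warm = IFValue(0.5, 0.3)

same = ifs_similarity([crisp_yes, quite_warm], [crisp_yes, quite_warm])
opposite = ifs_similarity([crisp_yes], [crisp_no])
close = ifs_similarity([quite_warm], [fairly_warm])
```

Under this stand-in, identical descriptions score 1, opposite crisp values score 0, and the two linguistic terms above score about 0.9, because they differ only slightly in membership and hesitation.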

Identifier: oai:union.ndltd.org:BRADFORD/oai:bradscholars.brad.ac.uk:10454/19418
Date: January 2019
Creators: Ali, Najat
Contributors: Neagu, Daniel; Trundle, Paul R.
Publisher: University of Bradford, Department of Computer Science
Source Sets: Bradford Scholars
Language: English
Detected Language: English
Type: Thesis, doctoral, PhD
Rights: The University of Bradford theses are licenced under a Creative Commons Licence (http://creativecommons.org/licenses/by-nc-nd/3.0/).
