1 |
Neural Decoding of Categorical Features in Naturalistic Social Interactions
Kim, Eunbin, 19 December 2018
No description available.
|
2 |
Contributions for Handling Big Data Heterogeneity. Using Intuitionistic Fuzzy Set Theory and Similarity Measures for Classifying Heterogeneous Data
Ali, Najat, January 2019
A huge amount of data is generated daily by digital technologies such as
social media, web logs, traffic sensors, on-line transactions, tracking data,
videos, and so on. This has led to the archiving and storage of larger and
larger datasets, many of which are multi-modal, or contain different types
of data which contribute to the problem that is now known as “Big Data”.
In the area of Big Data, volume, variety and velocity problems remain difficult
to solve. The work presented in this thesis focuses on the variety
aspect of Big Data. For example, data can come in various and mixed formats
for the same feature (attribute) or for different features, and can be identified
mainly by one of the following data types: real-valued, crisp and
linguistic values. The increasing variety and ambiguity of such data make it
particularly challenging to process them and to build accurate machine learning
models. Therefore, data heterogeneity requires new methods of analysis
and modelling techniques to enable useful information extraction and the
modelling of achievable tasks. In this thesis, new approaches are proposed
for handling heterogeneous Big Data. These include two techniques for filtering
heterogeneous data objects: Two-Dimensional Similarity Space (2DSS), for data
described by numeric and categorical features, and Three-Dimensional Similarity
Space (3DSS), for real-valued, crisp and linguistic data. Both filtering
techniques are used in this research to reduce the noise
from the initial dataset and make the dataset more homogeneous. Furthermore,
a new similarity measure based on intuitionistic fuzzy set theory
is proposed. The proposed measure is used to handle the heterogeneity
and ambiguity within crisp and linguistic data. In addition, new combined
similarity models are proposed which allow for a comparison between
heterogeneous data objects represented by a combination of crisp and linguistic
values. Diverse examples are used to illustrate and discuss the efficiency
of the proposed similarity models. The thesis also presents a modification
of the k-Nearest Neighbour classifier, called k-Nearest Neighbour
Weighted Average (k-NNWA), to classify heterogeneous datasets described
by real-valued, crisp and linguistic data. Finally, the thesis
introduces a novel classification model, called FCCM (Filter Combined
Classification Model), for heterogeneous data classification. The proposed
model combines the advantages of the 3DSS and k-NNWA classifier and
outperforms the latter algorithm. All the proposed models and techniques
have been applied to weather datasets and evaluated using accuracy, F-score
and ROC area measures. The experiments revealed that the proposed
filtering techniques are an efficient approach for removing noise from heterogeneous
data and improving the performance of classification models.
Moreover, the experiments showed that the proposed similarity measure
for intuitionistic fuzzy data is capable of handling the fuzziness of heterogeneous
data, and that intuitionistic fuzzy set theory offers some promise
in solving some Big Data problems by handling the uncertainty and
heterogeneity of the data.
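To make the idea of a similarity measure on intuitionistic fuzzy data concrete, here is a minimal sketch. It represents each element as a (membership, non-membership) pair, derives the hesitation degree as 1 - membership - non-membership, and scores similarity as one minus the normalized Hamming distance. This is a standard textbook measure assumed here purely for illustration; it is not the measure proposed in the thesis, and the function name and sample values are hypothetical.

```python
from typing import Sequence, Tuple

# One element of an intuitionistic fuzzy set: (membership mu, non-membership nu).
IFSElement = Tuple[float, float]


def ifs_similarity(a: Sequence[IFSElement], b: Sequence[IFSElement]) -> float:
    """Return a similarity in [0, 1]: 1 minus the normalized Hamming distance.

    Illustrative assumption only; not the measure proposed in the thesis.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both sets must be non-empty and of equal length")

    total = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(a, b):
        for mu, nu in ((mu_a, nu_a), (mu_b, nu_b)):
            # Each pair must satisfy mu >= 0, nu >= 0 and mu + nu <= 1.
            if mu < 0 or nu < 0 or mu + nu > 1 + 1e-12:
                raise ValueError("need mu >= 0, nu >= 0 and mu + nu <= 1")
        # Hesitation degree: the part left undecided by mu and nu.
        pi_a = 1.0 - mu_a - nu_a
        pi_b = 1.0 - mu_b - nu_b
        total += abs(mu_a - mu_b) + abs(nu_a - nu_b) + abs(pi_a - pi_b)

    # Each element contributes at most 2, so dividing by 2n keeps the
    # distance in [0, 1].
    return 1.0 - total / (2 * len(a))
```

Identical sets score 1, and a fully affirmed element compared with a fully rejected one, `(1.0, 0.0)` against `(0.0, 1.0)`, scores 0.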
|
3 |
An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing / En undersökning av kodningstekniker för diskreta variabler inom maskininlärning: binär mot one-hot och feature hashing
Seger, Cedric, January 2018
Machine learning methods can be used for solving important binary classification tasks in domains such as display advertising and recommender systems. In many of these domains categorical features are common and often of high cardinality. Using one-hot encoding in such circumstances leads to very high-dimensional vector representations, causing memory and computability concerns for machine learning models. This thesis investigated the viability of a binary encoding scheme in which categorical values were mapped to integers that were then encoded in a binary format. This binary scheme allowed for representing categorical features using log2(d)-dimensional vectors, where d is the dimension associated with a one-hot encoding. To evaluate the performance of the binary encoding, it was compared against one-hot and feature-hashed representations using linear logistic regression and neural-network-based models. These models were trained and evaluated using data from two publicly available datasets: Criteo and Census. The results showed that a one-hot encoding with a linear logistic regression model gave the best performance according to the PR-AUC metric. This was, however, at the expense of using 118- and 65,953-dimensional vector representations for Census and Criteo respectively. A binary encoding led to lower performance but used only 35 and 316 dimensions respectively. For Criteo, binary encoding suffered significantly in performance and feature hashing was perceived as a more viable alternative. It was also found that employing a neural network helped mitigate the loss in performance associated with using binary and feature-hashed representations.
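The three encodings compared in this abstract can be sketched for a single categorical feature as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, the feature-hashing variant is the simplest unsigned form, and `zlib.crc32` is chosen here only because it is a deterministic hash from the standard library.

```python
import zlib
from typing import List


def one_hot(index: int, d: int) -> List[int]:
    """d-dimensional vector with a single 1 at the category's index."""
    if not 0 <= index < d:
        raise ValueError("index must be in [0, d)")
    vec = [0] * d
    vec[index] = 1
    return vec


def binary_encode(index: int, d: int) -> List[int]:
    """Write the category's integer index in base 2, most significant bit first.

    The width (d - 1).bit_length() equals ceil(log2(d)) for d >= 2.
    """
    if not 0 <= index < d:
        raise ValueError("index must be in [0, d)")
    width = max(1, (d - 1).bit_length())
    return [(index >> bit) & 1 for bit in reversed(range(width))]


def feature_hash(value: str, n_buckets: int) -> List[int]:
    """Hash the raw category string into one of n_buckets positions.

    No vocabulary is needed, but distinct categories may collide.
    """
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    vec = [0] * n_buckets
    vec[zlib.crc32(value.encode("utf-8")) % n_buckets] = 1
    return vec
```

For a feature with five categories, `one_hot(4, 5)` uses five dimensions while `binary_encode(4, 5)` uses three. With `feature_hash` the dimension is a free parameter, traded off against hash collisions.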
|