Data Dimensionality Reduction Techniques: What Works with Machine Learning Models

High-dimensional data arises in many research fields, including education, health, and social media. However, high dimensionality can pose serious problems for data analysis. This study focuses on commonly used dimensionality reduction techniques (DRTs) for machine learning models, which play an essential role in data preprocessing and statistical analysis. For machine learning tasks, the main issues raised by high dimensionality are the accuracy of data classification and the quality of visualization. Therefore, in this study, machine learning algorithms are used to predict and classify simulated datasets, and the accuracy, precision, recall, and F1 score of the results are evaluated and compared in terms of mean, variance, confidence interval, and coverage. Focusing on data mining issues, the study compares and discusses different DRTs across datasets with different features. Eight DRTs are compared and evaluated on simulated datasets: Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), Singular Value Decomposition (SVD), Non-negative Matrix Factorization (NMF), Independent Component Analysis (ICA), Multidimensional Scaling (MDS), Isomap, and the autoencoder. Specifically, the study evaluates and compares the performance of these techniques, exploring their features and characteristics through Monte Carlo simulation studies with four machine learning classification models: logistic regression, linear support vector machine, nonlinear support vector machine, and k-nearest neighbors. The results indicated that the DRTs decreased accuracy, precision, recall, and F1 scores relative to the results obtained without dimensionality reduction. Overall, MDS performed markedly better than the other DRTs. SVD, PCA, and ICA produced similar results because all three are linear DRTs. Although NMF is also a linear DRT, it performed as poorly as KPCA, a nonlinear DRT. The other two nonlinear DRTs, Isomap and the autoencoder, had the worst performance in this study. These results provide recommendations for empirical researchers using machine learning models with high-dimensional data under specific conditions.
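The comparison design described in the abstract can be sketched with scikit-learn. The following is a minimal illustration under stated assumptions, not the author's simulation code: the dataset sizes, the number of retained components, and the single train/test split are placeholders, and MDS and the autoencoder are omitted because scikit-learn's MDS cannot project held-out samples and an autoencoder requires a separate deep learning library.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, KernelPCA, TruncatedSVD, NMF, FastICA
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# One simulated high-dimensional binary classification dataset
# (a single Monte Carlo replicate; all sizes are illustrative assumptions).
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)
X = X - X.min()  # shift features to be non-negative so NMF is applicable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reducers = {
    "none": None,  # baseline without dimensionality reduction
    "PCA": PCA(n_components=10),
    "KPCA": KernelPCA(n_components=10, kernel="rbf"),
    "SVD": TruncatedSVD(n_components=10),
    "NMF": NMF(n_components=10, max_iter=1000),
    "ICA": FastICA(n_components=10, random_state=0),
    "Isomap": Isomap(n_components=10),
}
classifiers = {
    "logistic": LogisticRegression(max_iter=1000),
    "linear SVM": SVC(kernel="linear"),
    "nonlinear SVM": SVC(kernel="rbf"),
    "KNN": KNeighborsClassifier(),
}

for r_name, reducer in reducers.items():
    if reducer is None:
        Z_tr, Z_te = X_tr, X_te
    else:
        # Fit the DRT on the training data only, then project the test data.
        Z_tr = reducer.fit_transform(X_tr)
        Z_te = reducer.transform(X_te)
    for c_name, clf in classifiers.items():
        y_hat = clf.fit(Z_tr, y_tr).predict(Z_te)
        print(f"{r_name:>6} + {c_name:<13}"
              f" acc={accuracy_score(y_te, y_hat):.3f}"
              f" prec={precision_score(y_te, y_hat):.3f}"
              f" rec={recall_score(y_te, y_hat):.3f}"
              f" F1={f1_score(y_te, y_hat):.3f}")
```

A full Monte Carlo study in the spirit of the abstract would repeat this loop over many simulated replicates and summarize each metric's mean, variance, confidence interval, and coverage.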

Identifier: oai:union.ndltd.org:ucf.edu/oai:stars.library.ucf.edu:etd2020-2728
Date: 15 December 2022
Creators: Chen, Yuting
Publisher: STARS
Source Sets: University of Central Florida
Language: English
Detected Language: English
Type: text
Format: application/pdf
Source: Electronic Theses and Dissertations, 2020-
