Dimension reduction is an important tool used to describe the structure of complex data (explicitly or implicitly) through a small but sufficient number of variables, and thereby make data analysis more efficient. It is also useful for visualization purposes. Dimension reduction helps statisticians to overcome the ‘curse of dimensionality’. However, most dimension reduction techniques require the intrinsic dimension of the low-dimensional subspace to be fixed in advance. The availability of reliable intrinsic dimension (ID) estimation techniques is of major importance. The main goal of this thesis is to develop algorithms for determining the intrinsic dimensions of recorded data sets in a nonlinear context. Whilst this is a well-researched topic for linear planes, based mainly on principal components analysis, relatively little attention has been paid to ways of estimating this number for non–linear variable interrelationships. The proposed algorithms here are based on existing concepts that can be categorized into local methods, relying on randomly selected subsets of a recorded variable set, and global methods, utilizing the entire data set. This thesis provides an overview of ID estimation techniques, with special consideration given to recent developments in non–linear techniques, such as charting manifold and fractal–based methods. Despite their nominal existence, the practical implementation of these techniques is far from straightforward. The intrinsic dimension is estimated via Brand’s algorithm by examining the growth point process, which counts the number of points in hyper-spheres. The estimation needs to determine the starting point for each hyper-sphere. In this thesis we provide settings for selecting starting points which work well for most data sets. Additionally we propose approaches for estimating dimensionality via Brand’s algorithm, the Dip method and the Regression method. Other approaches are proposed for estimating the intrinsic dimension by fractal dimension estimation methods, which exploit the intrinsic geometry of a data set. The most popular concept from this family of methods is the correlation dimension, which requires the estimation of the correlation integral for a ball of radius tending to 0. In this thesis we propose new approaches to approximate the correlation integral in this limit. The new approaches are the Intercept method, the Slop method and the Polynomial method. In addition we propose a new approach, a localized global method, which could be defined as a local version of global ID methods. The objective of the localized global approach is to improve the algorithm based on a local ID method, which could significantly reduce the negative bias. Experimental results on real world and simulated data are used to demonstrate the algorithms and compare them to other methodology. A simulation study which verifies the effectiveness of the proposed methods is also provided. Finally, these algorithms are contrasted using a recorded data set from an industrial melter process.
Identifer | oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:590603 |
Date | January 2014 |
Creators | Kalantan, Zakiah Ibrahim |
Publisher | Durham University |
Source Sets | Ethos UK |
Detected Language | English |
Type | Electronic Thesis or Dissertation |
Source | http://etheses.dur.ac.uk/9500/ |
Page generated in 0.0016 seconds