
Assessing the Robustness of Clustering Methods for use in Vegetation Classification

<p>Numerical clustering encompasses a group of exploratory multivariate statistical methods devoted to finding groups in data, based either on the responses of individual variables or on dissimilarity measures calculated from the variables. Despite the popularity of these methods, few controlled comparisons have used data of known clustering structure and compared more than a few methods. This study uses simulated plant community data to assess which data properties affect the performance of numerical clustering methods used in vegetation classification, including properties that can be controlled during data collection and measured before statistical analysis. This was done with simulation experiments that varied either properties of the species assemblages themselves (α-diversity, β-diversity, and the level and type of noise in the data) or the clustering structure of sampling units (SUs) in environmental space (the number of SUs per group, the equality of the number of SUs or of cluster dispersion among groups, the proximity of adjacent clusters, and the number of clusters).
Cluster recovery was measured using the Adjusted Rand Index (ARI), a chance-corrected measure of the proportion of elements classified similarly in two clustering results. ARI approximates the proportion of sites correctly classified, so scores near 1.0 indicate accurate cluster recovery, while scores near 0.0 indicate poor cluster recovery. A method is considered robust if its mean ARI score stays near 1.0 despite variation in data properties. Methods tested include Flexible Beta clustering, TWINSPAN, average, complete, and single linkage, K-means, Partitioning Around Medoids (PAM), ISOPAM, OPTPART, OPTSIL, Noise Clustering, model-based EM clustering (Mclust), Fuzzy Analysis (FANNY), and Information Analysis. Where applicable, methods were tested using four combinations of standardization and dissimilarity, yielding 59 unique combinations of method, standardization, and dissimilarity.
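As a rough illustration of the chance correction described above, the following is a minimal Python sketch of the standard pair-counting ARI formula. It is not the code used in the thesis; the function name `adjusted_rand_index` and the toy label vectors are assumptions made for this example only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two partitions.

    Illustrative sketch only: the name and interface are hypothetical,
    not taken from the thesis.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two units: partitions trivially agree

    # Contingency counts: cells (true, pred), row totals, column totals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of pairs placed together in both, in the first, in the second.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy example: the same grouping under different label names scores 1.0.
known = [0, 0, 1, 1]
recovered = [1, 1, 0, 0]
score = adjusted_rand_index(known, recovered)
```

Because the index compares pairs of units rather than label names, a clustering that recovers the known groups under different labels still scores 1.0, while a clustering that agrees with the known groups less often than chance would predict can score below 0.0.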
Across all experiments, several general trends emerge. No method is robust when either α-diversity or β-diversity is very low. When β-diversity is lowered by including a second set of generalist species along with a set of specialists, mean ARI scores are considerably higher than when β-diversity is lowered by increasing the range of all species in the data set. Most methods are less robust when implemented with Euclidean distances; the exceptions are Ward's method, PAM, FANNY, and Information Analysis (which uses only the information statistic calculated from presence data as its dissimilarity measure). Nonhierarchical methods fail when the number of SUs is highly unequal among clusters, except for OPTSIL initiated from Flexible Beta clustering results. Hierarchical methods are more sensitive to intermediate and outlier sites, though Ward's method, Flexible Beta, Information Analysis, and TWINSPAN all perform better than UPGMA, complete linkage, and single linkage. Sources of random error are unimportant individually but may be more important when paired with other factors.
The optimal choice of clustering method is a product of trade-offs between near-optimal performance in most experiments and robustness where other methods fail. For this reason, I recommend using Flexible Beta clustering, with possible refinement by OPTSIL, as a standard clustering method for vegetation classification. Flexible Beta clustering achieved mean ARI scores that are among the highest in all experiments, while remaining robust to the factors that undermine nonhierarchical methods (unequal numbers of SUs among groups) and other hierarchical methods (intermediate and outlier SUs). OPTSIL did not always drastically improve Flexible Beta results, but it also never made them worse. Nevertheless, in models with low β-diversity and when adjacent clusters are close together, OPTSIL does improve Flexible Beta results.
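For readers unfamiliar with the recommended method, the following is a minimal Python sketch of agglomerative clustering with the Lance-Williams flexible-beta update, d(k, i∪j) = α·d(k,i) + α·d(k,j) + β·d(i,j) with α = (1 − β)/2. It is not the implementation used in the thesis. The function name `flexible_beta`, the default β = −0.25 (a commonly used value, assumed here), and the toy data are all assumptions for illustration.

```python
import numpy as np


def flexible_beta(dissim, n_clusters, beta=-0.25):
    """Agglomerative clustering with the Lance-Williams flexible-beta update.

    Illustrative sketch only: name, interface, and the default beta are
    assumptions, not taken from the thesis.
    """
    D = np.array(dissim, dtype=float)  # work on a copy
    n = D.shape[0]
    np.fill_diagonal(D, np.inf)  # never merge a cluster with itself
    alpha = (1.0 - beta) / 2.0
    members = {i: [i] for i in range(n)}  # active cluster id -> unit indices

    while len(members) > n_clusters:
        active = sorted(members)
        sub = D[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]  # closest pair of active clusters
        d_ij = D[i, j]
        # Update dissimilarities from every other cluster to the merged one,
        # which keeps the id of cluster i.
        for k in active:
            if k != i and k != j:
                new = alpha * D[k, i] + alpha * D[k, j] + beta * d_ij
                D[k, i] = D[i, k] = new
        members[i].extend(members.pop(j))

    labels = np.empty(n, dtype=int)
    for label, idx in enumerate(members.values()):
        labels[idx] = label
    return labels


# Toy example: six units on a line forming two well-separated groups.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_dissim = np.abs(x[:, None] - x[None, :])
toy_labels = flexible_beta(toy_dissim, n_clusters=2)
```

In this formulation a negative β dilates the dissimilarity space, which tends to discourage chaining. The sketch rescans the full matrix at every merge, so it is meant to show the update rule rather than to be efficient.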

Identifier: oai:union.ndltd.org:PROQUEST/oai:pqdtoai.proquest.com:10275943
Date: 04 August 2017
Creators: Dell, Noah David
Publisher: Southern Illinois University at Edwardsville
Source Sets: ProQuest.com
Language: English
Detected Language: English
Type: thesis
