The silhouette score is a widely used measure for evaluating the quality of a clustering result. A known weakness of the silhouette score is its sensitivity to outliers, which can lead to misleading interpretations. This sensitivity arises because the silhouette score uses the arithmetic mean to compute the average intra-cluster and inter-cluster distances.
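To make the role of the arithmetic mean concrete, the following is a minimal sketch of the standard per-point silhouette value, s(i) = (b(i) - a(i)) / max(a(i), b(i)), using NumPy and Euclidean distance. The function name `silhouette_values` and the choice of Euclidean distance are assumptions made for illustration; they are not taken from the thesis. The sketch assumes at least two clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette values s(i) = (b - a) / max(a, b).

    a: arithmetic mean distance from point i to the other points in its cluster.
    b: smallest arithmetic mean distance from point i to the points of any other cluster.
    Assumes at least two distinct cluster labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances (illustrative choice of metric).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # exclude the point itself from its own cluster
        if not own.any():
            continue  # singleton cluster: s(i) = 0 by convention
        a = dist[i, own].mean()  # arithmetic mean, intra-cluster
        b = min(dist[i, labels == c].mean()  # arithmetic mean, inter-cluster
                for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Because both `a` and `b` are plain arithmetic means, a single distant point assigned to a cluster inflates `a` for every other member of that cluster and lowers their silhouette values.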
To address this issue, three modified silhouette scores are presented: GenSil, TrimSil, and extended TrimSil, which replace the arithmetic mean with the generalized mean, the trimmed mean, and a modified trimmed mean, respectively. Experiments on both simulated and real-world datasets show that GenSil is the most effective of the three: it significantly reduces the impact of outliers and achieves high silhouette scores when its parameter takes negative values. TrimSil also improves silhouette scores but performs worse than GenSil, while extended TrimSil outperforms TrimSil but remains less effective than GenSil. To further aid in selecting the optimal number of clusters with these modified silhouette scores, a more straightforward visualization technique, the silhouette-parameter plot, is also introduced.
Thesis / Master of Science (MSc)
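The two replacement averages named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the thesis's definitions: `generalized_mean` is the textbook power mean with exponent `p`, and `trimmed_mean` trims symmetrically from both ends by a hypothetical `proportion` parameter. The exact parameterization used by GenSil and TrimSil, and the modification used by extended TrimSil, are not given in the abstract.

```python
import numpy as np

def generalized_mean(d, p):
    """Power mean (mean(d**p))**(1/p) of strictly positive distances d.

    p = 1 is the arithmetic mean, p = -1 the harmonic mean;
    p = 0 is handled as the limiting case, the geometric mean.
    """
    d = np.asarray(d, dtype=float)
    if p == 0:
        return float(np.exp(np.mean(np.log(d))))
    return float(np.mean(d ** p) ** (1.0 / p))

def trimmed_mean(d, proportion):
    """Arithmetic mean after dropping the smallest and largest
    int(proportion * n) values (symmetric trimming; assumes proportion < 0.5)."""
    d = np.sort(np.asarray(d, dtype=float))
    k = int(proportion * len(d))
    return float(d[k:len(d) - k].mean())

# Distances from one point to its cluster mates, with one outlier (25.0).
distances = [1.0, 1.2, 0.9, 1.1, 25.0]
arithmetic = generalized_mean(distances, 1)   # dominated by the outlier
harmonic = generalized_mean(distances, -1)    # negative p down-weights large distances
trimmed = trimmed_mean(distances, 0.2)        # drops 0.9 and 25.0
```

In this toy example a negative exponent keeps the average close to the typical distances, which is consistent with the abstract's remark that GenSil performs well with negative parameter values.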
Identifier | oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/28888 |
Date | January 2023 |
Creators | Zhang, Yiran |
Contributors | McNicholas, Paul, Mathematics and Statistics |
Source Sets | McMaster University |
Language | English |
Detected Language | English |
Type | Thesis |