In cancer research, class discovery is the first process for investigating a new dataset for which hidden groups there are by similar attributes. However datasets from gene expressions, RNA microarray or RNA-sequence, are high-dimensional. Which makes it hard to perform clusteranalysis and to get clusters that are well separated. Well separated clusters are wanted because that tells that objects are most likely not placed in wrong clusters. This report investigate in an experiment whether using K-Means and hierarchical are suitable for clustering gene expressions in RNA-sequence data from various tumors. Dimensionality reduction methods are also applied to see whether that helps create well-separated clusters. The results tell that well separated clusters are only achieved by using PCA as dimensionality reduction and K-Means on correlation. The main contribution of this paper is determining that using K-Means or hierarchical clustering on the full natural dimensionality of RNA-sequence data returns unwanted silhouette average width, under 0,4.
Identifer | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:his-17492 |
Date | January 2019 |
Creators | Henriksson, William |
Publisher | Högskolan i Skövde, Institutionen för informationsteknologi |
Source Sets | DiVA Archive at Upsalla University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |
Page generated in 0.0015 seconds