

POPULATION STRUCTURE INFERENCE USING PCA AND CLUSTERING ALGORITHMS

Genotype data, consisting of large numbers of markers, are used in demographic and association studies to determine genes related to specific traits or diseases. Processing these datasets for population structure inference usually takes a significant amount of time. Therefore, we propose applying principal component analysis (PCA) to genotype data and then using clustering algorithms to assign individuals to their particular subpopulations. We used both real and simulated datasets in this study. We applied PCA and selected significant features, then applied five different clustering techniques to obtain better results. Furthermore, we studied three different methods for predicting the optimal number of subpopulations in a dataset. The results on four different simulated datasets and two real human genotype datasets show that our approach performs well in the inference of population structure. NbClust was the more effective method for inferring the number of subpopulations in a population. We also showed that centroid-based clustering algorithms, such as k-means and PAM, perform better than model-based, spectral, and hierarchical clustering algorithms. This approach has the further benefit of being fast and flexible in the inference of population structure.
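
The pipeline described in the abstract (PCA on genotype data, followed by clustering on the leading components) can be illustrated with a minimal Python sketch. This is not the thesis code: the toy simulator, the function names (`simulate_genotypes`, `pca_scores`, `kmeans`, `cluster_purity`), the choice of two components, and the fixed number of clusters are all assumptions made for illustration, and only k-means is shown out of the five clustering techniques the abstract mentions.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_markers=200, n_pops=3, seed=0):
    """Hypothetical toy data: each subpopulation has its own allele frequencies."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.05, 0.95, size=(n_pops, n_markers))
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    # Each genotype is an allele count (0, 1, or 2) drawn per marker.
    genotypes = rng.binomial(2, freqs[labels])
    return genotypes.astype(float), labels


def pca_scores(X, n_components):
    """Project individuals onto the leading principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each marker
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def kmeans(X, k, n_iter=100, n_init=10, seed=0):
    """Plain Lloyd's k-means with random restarts; returns the best labeling."""
    rng = np.random.default_rng(seed)
    best_labels, best_inertia = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Keep the old center if a cluster becomes empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        inertia = dists.min(axis=1).sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def cluster_purity(pred, truth):
    """Fraction of individuals that fall in their cluster's majority population."""
    hits = sum(np.bincount(truth[pred == c]).max() for c in np.unique(pred))
    return hits / len(truth)


if __name__ == "__main__":
    X, truth = simulate_genotypes()
    scores = pca_scores(X, n_components=2)  # assumed: 2 components for 3 populations
    pred = kmeans(scores, k=3)              # assumed: number of clusters known in advance
    print("purity:", cluster_purity(pred, truth))
```

In this sketch the number of clusters is supplied by hand; the abstract instead reports comparing three methods for predicting it, including NbClust, which are not reproduced here.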

Identifier	oai:union.ndltd.org:siu.edu/oai:opensiuc.lib.siu.edu:theses-3874
Date	01 September 2021
Creators	Rimal, Suraj
Publisher	OpenSIUC
Source Sets	Southern Illinois University Carbondale
Detected Language	English
Type	text
Format	application/pdf
Source	Theses

