The Norway spruce is of great importance from both an ecological and an economic standpoint. Information about which genes cause certain phenotypic traits in the species is therefore highly valuable. The purpose of this project was to apply machine learning to find such genotype-phenotype correlations. A further purpose was to compare the results from different machine learning algorithms with a more traditional linear mixed model genome-wide association study (GWAS), in which the correlation to the phenotype is estimated for each SNP one by one, to determine which method is better suited for GWAS. The machine learning algorithms tested were decision tree, support vector machine and support vector regression. The phenotypes analyzed were wood density and initiation frequency of zygotic embryogenesis (ZE). The latter is related to a new method for cloning. The genetic data consisted of single-nucleotide polymorphisms (SNPs). Because of the large genome size of Norway spruce and limitations in the R packages used, two different approaches were taken to reduce the size of the data set. The first approach used Kendall’s rank correlation coefficient to remove redundant SNPs, and the second applied the machine learning model iteratively. The iterative approach performed best, and support vector machine/regression outperformed decision tree for both phenotypes. Support vector regression with the iterative approach resulted in a squared correlation coefficient of 0.83 for density and 0.94 for ZE initiation frequency. Note that these very high values should be interpreted with caution, as some of the significant correlations may be due to random chance alone. Even a small probability of a random correlation per SNP will produce findings when the number of SNPs is this large (1,908,552 SNPs). The significant SNPs identified by the machine learning models were compared to the SNPs identified by the linear mixed model GWAS.
This comparison showed some overlap in significant SNPs, which increases the credibility of my results. However, further investigation of the identified significant SNPs is needed to determine their functional mode of action. My conclusion is that machine learning can be a good choice for predicting phenotypic traits from SNP data. However, the model might not use all correlated SNPs, only enough to obtain a good prediction. Therefore, for the purpose of finding significant SNPs, the linear mixed model approach might be better. In other words, the choice of method should be determined by the purpose of the study.
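The caution about chance correlations can be made concrete with a little arithmetic. The sketch below is illustrative only: the SNP count is the one stated above, but the significance level `ALPHA` and the helper function names are assumptions chosen for the example, not values or code from the thesis. It shows how many SNPs would be expected to pass a per-test threshold if no SNP were truly associated with the phenotype, and what a Bonferroni-corrected threshold would look like for the same number of tests.

```python
# Illustrative arithmetic only. ALPHA and both helper functions are
# assumptions made for this example; they are not taken from the thesis.

N_SNPS = 1_908_552  # number of SNPs stated in the abstract
ALPHA = 0.05        # assumed per-test significance level (hypothetical)


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests passing `alpha` if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below `alpha`."""
    return alpha / n_tests


false_hits = expected_false_positives(N_SNPS, ALPHA)
threshold = bonferroni_threshold(N_SNPS, ALPHA)

print(f"Expected chance findings at alpha={ALPHA}: {false_hits:.1f}")
print(f"Bonferroni-corrected per-SNP threshold: {threshold:.2e}")
```

Under these assumptions, tens of thousands of SNPs would look significant by chance alone at an uncorrected threshold, which is why overlap with an independent method is a useful sanity check.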
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:umu-209837 |
Date | January 2023 |
Creators | Sandberg, Matilda |
Publisher | Umeå universitet, Institutionen för fysik |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |