Identifying biosynthetic gene clusters (BGCs) from genomic data is challenging, and many in silico tools suffer from a high rediscovery rate because they depend on rule-based algorithms. Next-generation sequencing has provided an abundance of genomic information, and it has been hypothesized that many undiscovered biosynthetic gene clusters remain within these data. Here, we aim to develop a machine learning tool, ML-Miner, that infers the patterns that describe a biosynthetic gene cluster in an unbiased manner and, as such, enables the identification of new biosynthetic gene clusters from genomic data. To approach this challenging problem, we first define a simpler one: predicting the class of a known BGC. Specifically, ML-Miner receives as input the concatenation of sequences that are known or believed to be part of a biosynthetic gene cluster. Its task is to identify the class to which the cluster belongs, i.e. NRPS, PKS, terpene, or RiPPs.
ML-Miner is a machine learning tool that uses natural language processing, dimensionality reduction, and supervised learning to identify novel biosynthetic gene clusters. BioVec is a biological word embedding that we use to transform protein sequences from the highly curated MIBiG database of characterized biosynthetic gene clusters into their respective continuous distributed vector representations. Because the resulting protein vectors are high-dimensional, a supervised Uniform Manifold Approximation and Projection (UMAP) algorithm was employed to transform them into a robust lower-dimensional representation, as evaluated by silhouette analysis, the Hopkins statistic, and trustworthiness analysis. Clustering with the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm showed that the clusters identified in the low-dimensional data mapped to biosynthetic gene cluster types, which are defined with high accuracy in the MIBiG database. A random forest classifier was then trained and evaluated on the low-dimensional vectors. It classified the biosynthetic gene clusters from the MIBiG database with excellent performance metrics. Finally, the model's ability to generalize was evaluated using biosynthetic gene clusters from the antiSMASH dataset, an uncurated database containing uncharacterized biosynthetic gene clusters. The performance metrics were high, with a balanced accuracy of ~85%. After a hyperparameter search, the balanced accuracy rose to ~90%. This suggests that ML-Miner is a robust machine learning pipeline that can be used to identify novel biosynthetic gene clusters. Future development of a confidence score for classification and a workflow for processing bacterial genomes into gene clusters will significantly improve the utility of this tool.
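Two steps of the pipeline described above can be illustrated without external libraries: the tokenization that typically precedes a BioVec-style embedding, and the balanced accuracy metric used to report generalization. The sketch below is a hedged illustration, not ML-Miner's code. It assumes the tokenization follows the published ProtVec scheme (a sequence is split into three lists of non-overlapping 3-mers, one per reading offset); the abstract does not state the k-mer size, and the function names `shifted_kmers` and `balanced_accuracy` are hypothetical. Balanced accuracy is computed as the mean of per-class recall, which keeps a large class such as NRPS from dominating the score.

```python
from collections import defaultdict


def shifted_kmers(sequence, k=3):
    """Split a protein sequence into k lists of non-overlapping k-mers.

    Assumption: this mirrors ProtVec-style tokenization (k=3), where each
    list starts at a different offset so that every reading frame is covered.
    """
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # number of correctly predicted examples per class
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example with made-up labels (not data from the thesis).
tokens = shifted_kmers("MKTAYIAK")
labels_true = ["NRPS", "NRPS", "NRPS", "NRPS", "PKS", "PKS"]
labels_pred = ["NRPS", "NRPS", "NRPS", "NRPS", "PKS", "NRPS"]
score = balanced_accuracy(labels_true, labels_pred)
```

In the toy example, five of six predictions are correct, yet the balanced accuracy is 0.75 because half of the minority class is misclassified. This is the reason a balanced metric is a reasonable choice when BGC classes are unevenly represented.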
Identifier | oai:union.ndltd.org:uottawa.ca/oai:ruor.uottawa.ca:10393/43432 |
Date | 04 April 2022 |
Creators | Wambo, Paul A. |
Contributors | Boddy, Christopher |
Publisher | Université d'Ottawa / University of Ottawa |
Source Sets | Université d’Ottawa |
Language | English |
Detected Language | English |
Type | Thesis |
Format | application/pdf |