1 |
ML-Miner: A Machine Learning Tool Used for Identification of Novel Biosynthetic Gene Clusters
Wambo, Paul A. 04 April 2022
Identifying biosynthetic gene clusters (BGCs) from genomic data is challenging, and many in-silico tools suffer from a high rediscovery rate because they depend on rule-based algorithms. Next-generation sequencing has provided an abundance of genomic information, and it has been hypothesized that many undiscovered biosynthetic gene clusters remain within these data. Here, we aim to develop a machine learning tool, ML-Miner, that infers the patterns describing a biosynthetic gene cluster in an unbiased manner and thereby enables the identification of new biosynthetic gene clusters from genomic data. To make this challenging problem tractable, we first define a simpler one: predicting the class of a known BGC. Specifically, ML-Miner receives as input the concatenation of sequences that are known or believed to be part of a biosynthetic gene cluster, and its task is to identify the class to which the cluster belongs: NRPS, PKS, terpene, or RiPP.
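Before a concatenated cluster sequence can be passed to a word-embedding model, it is usually split into short k-mer "words". The sketch below is illustrative only: it assumes k = 3 and the shifted, non-overlapping split commonly used by protein embeddings, neither of which is stated in the abstract. The function names and the toy sequences are hypothetical.

```python
def concatenate_cluster(protein_sequences):
    """Join the protein sequences of one candidate BGC into a single string.

    Assumption: plain concatenation with no separator token.
    """
    return "".join(seq.strip().upper() for seq in protein_sequences)


def shifted_kmers(sequence, k=3):
    """Split a sequence into k lists of non-overlapping k-mers.

    List number `offset` starts reading at position `offset`, so every
    reading frame is covered. Trailing residues that do not fill a
    complete k-mer are dropped.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


# Toy input: two short, made-up protein fragments standing in for one cluster.
cluster = concatenate_cluster(["MKVLA", "AGIT"])
frames = shifted_kmers(cluster, k=3)
for offset, words in enumerate(frames):
    print(offset, words)
```

Each list of k-mers can then be treated as a "sentence" for the embedding step. Note that plain concatenation lets some k-mers span the boundary between two proteins, which may or may not be desirable.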
ML-Miner is a machine learning tool that uses natural language processing, dimensionality reduction, and supervised learning to identify novel biosynthetic gene clusters. BioVec is a biological word embedding that we use to transform protein sequences from MIBiG, a highly curated database of characterized biosynthetic gene clusters, into continuous distributed vector representations. Because the resulting protein vectors are high-dimensional, a supervised Uniform Manifold Approximation and Projection (UMAP) algorithm was employed to transform them into a robust lower-dimensional representation, as evaluated by silhouette analysis, the Hopkins statistic, and trustworthiness analysis. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm showed that the clusters identified in the low-dimensional data mapped with high accuracy to the biosynthetic gene cluster types defined in the MIBiG database. A random forest classifier was then trained and evaluated on the low-dimensional vectors and was shown to classify each biosynthetic gene cluster from the MIBiG database with excellent performance metrics. Finally, the model's ability to generalize was evaluated using biosynthetic gene clusters from the antiSMASH dataset, an uncurated database containing uncharacterized biosynthetic gene clusters. Performance remained high, with a balanced accuracy of ~85% that rose to ~90% after a hyperparameter search. This suggests that ML-Miner is a robust machine learning pipeline that can be used to identify novel biosynthetic gene clusters. Future development of a confidence score for classification and of a workflow for processing bacterial genomes into gene clusters will significantly improve the utility of this tool.
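Balanced accuracy, the metric reported above, is the unweighted mean of per-class recall, so a rare class counts as much as a common one. The following is a minimal sketch of that definition in plain Python. The labels and predictions are invented for illustration and are not results from the thesis. In practice a library routine such as scikit-learn's `balanced_accuracy_score` would normally be used instead.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled example is required")

    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # correctly predicted examples per class
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels using the four classes named above.
y_true = ["NRPS"] * 4 + ["PKS"] * 2 + ["terpene", "RiPP"]
y_pred = ["NRPS"] * 4 + ["PKS", "NRPS"] + ["terpene", "NRPS"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"plain accuracy:    {plain:.3f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
```

In this toy case the plain accuracy is flattered by the dominant NRPS class, while the balanced figure is pulled down by the missed RiPP example. This is why the balanced variant suits class-imbalanced collections of gene clusters.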
|
2 |
Computational Analysis of the Evolution of Non-Coding Genomic Sequences
Saha Mandal, Arnab 21 August 2013
No description available.
|
3 |
Dynamique des blooms phytoplanctoniques dans le gyre subpolaire de l'Atlantique Nord / Phytoplankton blooms dynamics in the North Atlantic Subpolar Gyre
Lacour, Léo 08 December 2016
The North Atlantic Subpolar Gyre hosts the largest phytoplankton bloom in the global ocean. This major biological event plays a crucial role in the functioning of marine ecosystems and in the global carbon cycle. The aim of this thesis is to better understand the bio-physical processes that drive the dynamics of the phytoplankton bloom and carbon export at various spatiotemporal scales. In a first study, based on climatological satellite data, the subpolar gyre was bioregionalized according to its distinct annual cycles of phytoplankton biomass. Mixing conditions, coupled with surface light intensity, control the initiation of the spring bloom in the different bioregions. In a second study, the new generation of BGC-Argo floats made it possible to explore processes at finer scales, especially during the winter period, which had so far received very little attention. In winter, intermittent and local restratifications of the mixed layer, linked to sub-mesoscale processes, initiate transient blooms that influence the dynamics of the spring bloom. Finally, a third study showed that the high-frequency variability of the mixed layer depth during the winter-spring transition also plays a crucial role in carbon export.
|