Multi-task regression QSAR/QSPR prediction utilizing text-based Transformer Neural Network and single-task using feature-based models

With recent advances in machine learning for cheminformatics, the drug discovery process has been accelerated, with a high impact on medicine and public health. Molecular property and activity prediction are key elements in the early stages of drug discovery, helping to prioritize experiments and reduce experimental work. In this thesis, a novel approach for multi-task regression using a text-based Transformer model is introduced and thoroughly explored for training on a number of properties or activities simultaneously. This Transformer-based multi-task regression is inspired by the field of Natural Language Processing (NLP), where prefix tokens are used to distinguish between tasks. To investigate the architecture, two data categories are used: 133 biological activities from the ExCAPE database and three physical chemistry properties from the MoleculeNet benchmark datasets. The Transformer model consists of an embedding layer with positional encoding, a number of encoder layers, and a Feedforward Neural Network (FNN) that turns the encoder output into a regression prediction. The molecules are represented as strings of characters using the Simplified Molecular-Input Line-Entry System (SMILES), a 'chemistry language' with its own syntax. In addition, the effect of Transfer Learning is explored by experimenting with two pretrained Transformer models, one pretrained on 1.5 million molecules and one on 100 million molecules. The text-based Transformer models are compared with a feature-based Support Vector Regression (SVR) with the Tanimoto kernel, where the input molecules are encoded as Extended Connectivity Fingerprints (ECFP), which are calculated features. The results show that Transfer Learning is crucial for improving performance on both property and activity prediction.
On bioactivity tasks, the larger Transformer, pretrained on 100 million molecules, achieved performance comparable to the feature-based SVR model; however, SVR performed better on the majority of the bioactivity tasks. On the physicochemical property tasks, by contrast, the larger pretrained Transformer outperformed SVR on all three tasks. In conclusion, the multi-task regression architecture with the prefix token performed comparably to the traditional feature-based approach in predicting different molecular properties or activities. Lastly, using larger models pretrained on a wide chemical space can play a key role in improving the performance of Transformer models on these tasks.
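
The two input representations compared in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The `<task>` token format, the task names, the character-level tokenization, and the representation of a fingerprint as a set of on-bit indices are all assumptions made here for illustration. The first function prepends a prefix token so that one text model can tell tasks apart. The second computes the Tanimoto similarity that a kernel over binary fingerprints such as ECFP would use.

```python
from typing import List, Set


def encode_with_task_prefix(smiles: str, task: str) -> List[str]:
    """Prepend a task token to a character-level SMILES tokenization.

    The "<task>" format is hypothetical. Splitting per character is a
    simplification: two-letter atoms such as "Cl" become two tokens.
    """
    return [f"<{task}>"] + list(smiles)


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity |a & b| / |a | b| for fingerprints stored as
    sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Convention assumed here: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union


# The same molecule (ethanol) prepared for two hypothetical tasks.
tokens_a = encode_with_task_prefix("CCO", "task_A")
tokens_b = encode_with_task_prefix("CCO", "task_B")

# Two made-up fingerprints: 2 shared bits out of 5 distinct bits.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7}
similarity = tanimoto(fp_a, fp_b)
```

In the multi-task setting, only the first token differs between `tokens_a` and `tokens_b`, so a single model can be trained on all tasks at once. For the feature-based baseline, a matrix of pairwise `tanimoto` values could be passed to an SVR that accepts a precomputed kernel.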

Identifier: oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:liu-177186
Date: January 2021
Creators: Dimitriadis, Spyridon
Publisher: Linköpings universitet, Statistik och maskininlärning
Source Sets: DiVA Archive at Upsalla University
Language: English
Detected Language: English
Type: Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format: application/pdf
Rights: info:eu-repo/semantics/openAccess
