1 |
Towards algorithmic use of chemical data
Jacob, Philipp-Maximilian January 2018
The growth of chemical knowledge available via online databases opens opportunities for new types of chemical research. In particular, by converting the data into a network, graph-theoretical approaches can be used to study chemical reactions. In this thesis, several research questions from data science and graph theory are reformulated for chemistry-specific data. First, the structure of chemical reaction data was studied using graph theory. The network of reactions obtained from the Reaxys data was found to be scale-free, with any two species separated by six reactions on average, and showed evidence of a hierarchy of nodes: most clearly, the hubs that concentrate a large share of connections also carry a large proportion of routes across the network. The hierarchy was also evident in the clustering and degree correlations of nodes. Next, it was investigated whether Reaxys could be mined to construct a network of reactions and whether that network could be used to plan and evaluate synthesis routes in two case studies. A number of heuristics were developed to find synthesis routes in the network while taking chemical structures into account. These routes were fed into a multi-criteria decision-making framework that scored them on environmental sustainability considerations. The approach was successful in discovering and scoring synthesis route candidates. It was found that Reaxys lacked process data in many instances. To address this, a proposal for an extension of the RInChI reaction data format was developed. The final question addressed was whether the network could be used to predict future reactions using stochastic block models. Block-model-based link prediction achieved a classification accuracy of close to 95% during time-split validation on historic data, differentiating future reaction discoveries from random data.
A set of transformation suggestions was then evaluated, and a framework for analysing these results was presented. Overall, the thesis furthers the understanding of the network’s topology and presents a framework for mining Reaxys to plan synthesis routes and to target R&D efforts in a specific area to discover new reactions.
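The network view described in this abstract can be illustrated with a minimal sketch. It is not the thesis's method or data: the reactions below are hypothetical placeholders (not taken from Reaxys), and the function names `build_network`, `out_degree`, `shortest_path_length` and `mean_separation` are illustrative assumptions. Species are nodes, each reactant-to-product pair is a directed edge, and the separation between two species is the number of reaction steps on the shortest route.

```python
from collections import defaultdict, deque

# Hypothetical toy reactions as (reactants, products); NOT Reaxys data.
REACTIONS = [
    (("A", "B"), ("C",)),
    (("C",), ("D", "E")),
    (("E", "F"), ("G",)),
    (("A",), ("F",)),
]


def build_network(reactions):
    """Directed species graph: one edge reactant -> product per reaction."""
    graph = defaultdict(set)
    for reactants, products in reactions:
        for product in products:
            graph.setdefault(product, set())  # keep pure products as nodes
            for reactant in reactants:
                graph[reactant].add(product)
    return dict(graph)


def out_degree(graph):
    """Number of distinct species each node leads to directly."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def shortest_path_length(graph, source, target):
    """Reaction steps from source to target (breadth-first search), or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def mean_separation(graph):
    """Average shortest-path length over all connected ordered pairs."""
    lengths = [
        d
        for s in graph
        for t in graph
        if s != t and (d := shortest_path_length(graph, s, t)) is not None
    ]
    return sum(lengths) / len(lengths) if lengths else None
```

Calling `mean_separation(build_network(REACTIONS))` gives the toy analogue of the "separated by six reactions on average" statistic; on a database-scale network, a dedicated graph library and sampling of node pairs would likely be needed instead of the all-pairs loop used here.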
|
2 |
MACHINE LEARNING STRATEGIES FOR AUTOMATICALLY IDENTIFYING AND GENERATING MOLECULAR STRUCTURES
Tianfan Jin (20946329) 27 March 2025
Chemistry has been a major beneficiary of machine learning (ML) methods. In chemistry, ML approaches can be roughly categorized into two types: forward prediction models and inverse prediction models. Forward prediction models take a molecular graph as input and use it to predict structure-related targets such as characterization results or molecular properties. Inverse prediction models, on the other hand, take in information that is relevant to the molecular structure and aim to reconstruct the molecule from it. My PhD work mainly focuses on the latter problem of inverse prediction, and the goal is to build ML architectures capable of (1) automating compound identification given spectral data, and (2) generating molecular structures that satisfy required properties. Manual chemical structure identification based on spectral data sources remains a time-consuming process in traditional chemical workflows. Although this problem seems amenable to ML, limited training data and the absence of model architectures suitable for ingesting spectral data from multiple sources have led to limited progress. My PhD work tackled this problem by developing transformer-based models that use self- and cross-attention mechanisms to compress and integrate the information from 1H-NMR, IR, and EI-MS spectra to predict the chemical structure of unknown analytes. The spectra-to-structure (StS) models were trained and tested on newly generated spectra for 957,856 distinct organic species containing C, H, O, N, S, Se, P, F, Cl, Br, I, Si, and B, drawn from the synthetic literature and the PubChem database. Top-1 and top-10 accuracies of 51.2% and 71.1%, respectively, were obtained for structure prediction on the test data. The transferability of the StS models was also tested by providing incomplete or contradictory information and by testing on structures with experimental rather than simulated reference spectra.
Near-identical performance was achieved in these scenarios, illustrating useful domain transferability for this problem. Although the StS models above displayed satisfactory overall accuracy, they are inherently limited by insufficient information, even with the combination of three spectral sources. Additional information about the reactants was introduced when extending the StS models to automatically analyze reaction outcomes: a new deductive framework was built to predict the reaction target using both the reactants and the characterization results of the products. Compared to traditional reaction prediction models, whose only input is the reactant information, the resulting reaction deduction models could distinguish between intended and unintended reaction outcomes and identify starting material based on a mixture of spectral sources. The deduction models also performed well on tasks that they were not directly trained on, such as predicting minor products from named organic chemistry reactions, identifying reagents and isomers as plausible impurities, and handling missing or conflicting information. Apart from compound identification, inverse molecular design, or automatic molecular generation, is also expected to be valuable in real scenarios. Generative models for the inverse design of molecules with particular properties have been heavily hyped but have yet to demonstrate significant gains over machine-learning-augmented expert intuition. A major challenge of such models is their limited accuracy in predicting molecules with targeted properties in the data-scarce regime, which is the regime typical of the prized outliers that inverse models are hoped to discover. For example, activity data for a drug target or stability data for a material may only number in the tens to hundreds of samples, which is insufficient to learn an accurate and reasonably general property-to-structure inverse mapping from scratch.
My thesis hypothesizes that the property-to-structure mapping becomes unique when a sufficient number of properties are supplied to the models during training. If true, this hypothesis has several important corollaries. It would imply that data-scarce properties can be completely determined by a set of more accessible molecular properties. It would also imply that a generative model trained on multiple properties would exhibit an accuracy phase transition after reaching a sufficient size, a process analogous to what has been observed in the context of large language models. To interrogate these behaviors, I have built the first transformers trained on the property-to-molecular-graph task, which this work dubs “large property models” (LPMs). A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data. The proof-of-concept study on LPMs is based on ~1M molecules sampled from the PubChem database; in over 40% of test cases, the generated molecules reproduce all input properties to within a 10% error range.
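The two success metrics quoted in this abstract (top-1/top-10 accuracy for structure prediction, and reproduction of all input properties within 10%) can be sketched as follows. This is a sketch under stated assumptions, not the thesis's evaluation code: structures are compared as exact strings (for example canonical SMILES), "within 10%" is read as relative error against the target value, and the function names `top_k_accuracy` and `reproduces_all_properties` are hypothetical.

```python
def top_k_accuracy(ranked_predictions, references, k):
    """Fraction of cases whose reference is among the first k ranked candidates.

    Assumption: candidates and references are directly comparable strings,
    e.g. canonical SMILES produced by the same toolkit.
    """
    if len(ranked_predictions) != len(references):
        raise ValueError("one ranked candidate list is required per reference")
    if not references:
        raise ValueError("at least one test case is required")
    hits = sum(
        ref in candidates[:k]
        for candidates, ref in zip(ranked_predictions, references)
    )
    return hits / len(references)


def reproduces_all_properties(target, generated, rel_tol=0.10):
    """True if every target property is matched within rel_tol relative error.

    Assumption: relative error is measured against the target value; a target
    of exactly zero must be matched exactly.
    """
    for name, wanted in target.items():
        if name not in generated:
            return False
        got = generated[name]
        if wanted == 0:
            if got != 0:
                return False
        elif abs(got - wanted) > rel_tol * abs(wanted):
            return False
    return True
```

For instance, `top_k_accuracy(preds, refs, 10)` corresponds to the top-10 figure, and averaging `reproduces_all_properties(...)` over a test set gives the share of cases in which a generated molecule meets every requested property.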
|
3 |
Pre-training Molecular Transformers Through Reaction Prediction / Förträning av molekylär transformer genom reaktionsprediktion
Broberg, Johan January 2022
Molecular property prediction can improve many processes in the molecular chemistry industry. One important application is drug development, where molecular property prediction can decrease both the cost and the time of finding new drugs. The current trend is to use graph neural networks or transformers, which tend to need moderate and large amounts of data, respectively, to perform well. Because molecular property data is scarce, it is of great interest to find an effective method for transferring learning from other, more data-abundant problems. In this thesis I present an approach to pre-train transformer encoders on reaction prediction in order to improve performance on downstream molecular property prediction tasks. I have built a model based on the full transformer architecture and modified it for the purpose of pre-training the encoder. Model performance, and specifically the effect of pre-training, is tested by predicting lipophilicity, HIV inhibition and hERG channel blocking using both pre-trained models and models without any pre-training. The results show a tendency towards improved performance on all molecular property prediction tasks with the suggested pre-training, but this improvement is not statistically significant. The main obstacle to a conclusive evaluation is the limited number of simulations, a consequence of computational constraints.
|