Classification and interpretation in quantitative structure-activity relationships

A good QSAR model comprises several components. Predictive accuracy is paramount, but it is not the only important aspect. One should also apply robust and appropriate statistical tests to assess the significance of the models and of any apparent improvements. The real impact of a QSAR, however, perhaps lies in its chemical insight and interpretation, an aspect which is often overlooked.

This thesis covers three main topics: a comparison of contemporary classifiers, the interpretability of random forests and the use of interpretable descriptors. The choice of data mining technique and descriptors entirely determines the available interpretation. We demonstrate the success of interpretable approaches on a variety of data sets.

Using robust multiple comparison statistics with eight data sets, we demonstrate that a random forest has predictive accuracy comparable to that of the de facto standard, the support vector machine. A random forest is inherently more interpretable than a support vector machine, due to the underlying tree construction. We can extract some chemical insight from the random forest; however, with additional tools further insight would be available.

A decision tree is easier to interpret than a random forest. Therefore, to obtain useful interpretation from a random forest we have employed a selection of tools. These include alternative representations of the trees using SMILES and SMARTS. Using existing methods we can compare and cluster the trees in this representation. Descriptor analysis and importance can be measured at the tree and forest level. Pathways in the trees can be compared and frequently occurring subgraphs identified. These tools have been built around the Weka machine learning workbench and are designed to allow further additions of new functionality.

The interpretability of a model depends on both the model and the descriptors, which must describe something meaningful. To this end we have used the TMACC descriptors in the Solubility Challenge and literature data sets. We report how our retrospective analysis confirms existing knowledge and how we identify novel C-domain inhibition of ACE. To test our hypotheses we extended and developed existing software, forming two applications. The Nottingham Cheminformatics Workbench (NCW) generates TMACC descriptors and allows the user to build and analyse models, including visualising the chemical interpretation. Forest Based Interpretation (FBI) provides various tools for interpreting a random forest model. Both applications are written in Java with full documentation, and simple installation wizards are available for Windows, Linux and Mac.

http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.537681

541

Identifier	oai:union.ndltd.org:bl.uk/oai:ethos.bl.uk:537681
Date	January 2010
Creators	Bruce, Craig L.
Publisher	University of Nottingham
Source Sets	Ethos UK
Detected Language	English
Type	Electronic Thesis or Dissertation
Source	http://eprints.nottingham.ac.uk/11666/
