This thesis is concerned with developing methodologies that enable existing
models to be effectively reused. Results of this thesis are presented in
the framework of Quantitative Structure-Activity Relationship (QSAR)
models, but their application is much more general. QSAR models relate
chemical structures to their biological, chemical or environmental
activity. Many applications offer an environment for building and storing
predictive models. Unfortunately, they lack advanced functionality for
efficient model selection and for interpreting model predictions for new
data. This thesis aims to address these
issues and proposes methodologies for dealing with three research problems:
model governance (management), model identification (selection),
and interpretation of model predictions. The combination of these methodologies
can be employed to build more efficient systems for model reuse
in QSAR modelling and other areas.
The first part of this study investigates toxicity data and model formats
and reviews some of the existing toxicity systems in the context of model
development and reuse. Based on the findings of this review and the principles
of data governance, a novel concept of model governance is defined.
Model governance comprises model representation and model governance
processes. These processes are designed and presented in the context of
model management. As an application, minimum information requirements
and an XML representation for QSAR models are proposed.
Once a collection of validated, accepted and well-annotated models is
available within a model governance framework, these models can be applied to
new data. More than one model may be available for the same endpoint,
which raises the question of which one to choose. The second part of this thesis
proposes a theoretical framework and algorithms that enable automated
identification of the most reliable model for new data from the collection
of existing models. The main idea is based on partitioning of the search
space into groups and assigning a single model to each group. The construction
of this partitioning is difficult because it is a bi-criteria problem.
The main contribution of this part is the use of Pareto points to
partition the search space. The proposed methodology is applied to three
endpoints in chemoinformatics and predictive toxicology.
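As an illustration of the Pareto idea only, the following minimal sketch selects the non-dominated points from a set of candidates scored on two criteria. The abstract does not state which two criteria the thesis uses, so the candidate values, the convention that both criteria are minimised, and the names `dominates` and `pareto_front` are all assumptions made for this example; it is not the partitioning algorithm proposed in the thesis.

```python
from typing import List, Tuple

# A candidate scored on two criteria; both are assumed to be minimised.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b on both criteria and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates (input order is preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates: (criterion 1, criterion 2), made up for illustration.
candidates = [(0.10, 8.0), (0.12, 5.0), (0.20, 3.0), (0.15, 6.0), (0.25, 3.0)]
front = pareto_front(candidates)
print(front)
```

Here (0.15, 6.0) is dominated by (0.12, 5.0) and (0.25, 3.0) by (0.20, 3.0), so three candidates remain. None of them can be improved on one criterion without getting worse on the other, which is what makes a bi-criteria choice hard to reduce to a single best answer.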
After having identified a model for the new data, we would like to know
how the model obtained its prediction and how trustworthy it is. Interpreting
model predictions is straightforward for linear models thanks
to the availability of model parameters and their statistical significance.
For non-linear models this information can be hidden inside the model
structure. This thesis proposes an approach for interpreting a random
forest classification model. This approach determines
the influence (called feature contribution) of each variable on the model
prediction for an individual data instance. Three methods are proposed in this part
for analysing feature contributions. Such analysis might
lead to the discovery of new patterns that represent a standard behaviour
of the model and allow additional assessment of the model reliability for
new data. The application of these methods to two standard benchmark
datasets from the UCI machine learning repository shows the great potential
of this methodology. The algorithm for calculating feature contributions
has been implemented and is available as an R package called rfFC. / BBSRC and Syngenta (International Research Centre at Jealott’s Hill, Bracknell, UK).
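To make the notion of a per-instance feature contribution concrete, the sketch below uses one common path-based scheme: as an instance travels from the root of a tree to a leaf, the change in the positive-class fraction at each split is credited to the feature used in that split, and the per-tree results are averaged over the forest. This is an assumption-laden illustration, not the rfFC implementation: the `Node` class, the two hand-built trees, and the function names are hypothetical, and the abstract does not spell out the exact computation used in the thesis.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    value: float                    # fraction of positive-class training instances at this node
    feature: Optional[str] = None   # split feature; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when x[feature] <= threshold
    right: Optional["Node"] = None  # branch taken otherwise


def tree_contributions(root: Node, x: Dict[str, float]) -> Dict[str, float]:
    """Credit each split feature with the change in node value along x's path."""
    contrib: Dict[str, float] = {}
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        contrib[node.feature] = contrib.get(node.feature, 0.0) + (child.value - node.value)
        node = child
    return contrib


def forest_contributions(trees: List[Node], x: Dict[str, float]) -> Dict[str, float]:
    """Average the per-tree contributions over all trees in the forest."""
    total: Dict[str, float] = {}
    for tree in trees:
        for feature, c in tree_contributions(tree, x).items():
            total[feature] = total.get(feature, 0.0) + c
    return {feature: c / len(trees) for feature, c in total.items()}


# Two tiny hand-built trees (hypothetical values, for illustration only).
tree1 = Node(0.5, "a", 1.0,
             left=Node(0.25),
             right=Node(0.75, "b", 2.0, left=Node(0.5), right=Node(1.0)))
tree2 = Node(0.5, "b", 2.0, left=Node(0.25), right=Node(0.75))

x = {"a": 2.0, "b": 3.0}
fc = forest_contributions([tree1, tree2], x)
baseline = (tree1.value + tree2.value) / 2   # mean root value over the forest
prediction = baseline + sum(fc.values())     # equals the mean leaf value reached by x
print(fc, prediction)
```

Under this scheme the contributions sum, together with the mean root value, to the forest's predicted positive-class fraction for the instance, so each prediction can be read as a baseline plus one signed term per variable.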
Identifier | oai:union.ndltd.org:BRADFORD/oai:bradscholars.brad.ac.uk:10454/7349 |
Date | January 2014 |
Creators | Palczewska, Anna Maria |
Contributors | Neagu, Daniel, Ridley, Mick J., Travis, Kim |
Publisher | University of Bradford, School of Electrical Engineering and Computer Science |
Source Sets | Bradford Scholars |
Language | English |
Type | Thesis, doctoral, PhD |
Rights | The University of Bradford theses are licenced under a Creative Commons Licence (http://creativecommons.org/licenses/by-nc-nd/3.0/). |