
Contributions to Ensembles of Models for Predictive Toxicology Applications. On the Representation, Comparison and Combination of Models in Ensembles.

The increasing variety of data mining tools offers a large palette
of types and representation formats for predictive models. Managing
these models therefore becomes a major challenge, as do reusing the
models and keeping model and data repositories consistent.
Sustainable access to these models, and assessment of their quality,
become limited to researchers. The Data and Model Governance (DMG)
approach makes it easier to process and support complex solutions.
In this thesis, contributions are proposed towards ensembles
of models with a focus on model representation, comparison and
usage.
Predictive Toxicology was chosen as the application field to demonstrate
the proposed approach for representing predictive models linked
to data for DMG. Further analysis methods, such as the comparison and
combination of predictive models, were studied for reusing models
from a collection of models. In this thesis, an original structure
for the pool of models, called Predictive Toxicology Markup Language
(PTML), was therefore proposed to represent predictive toxicology
models. PTML offers a representation scheme for predictive toxicology
data and models generated by data mining tools.
In this research, the proposed representation makes it possible
to compare models and to select the relevant models based on different
performance measures, using the proposed similarity measuring
techniques. The relevant models were selected using a proposed cost
function, which is a composite of performance measures such as
Accuracy (Acc), False Negative Rate (FNR) and False Positive Rate
(FPR). The cost function ensures that only quality models are
selected as candidate models for an ensemble.
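
As a minimal sketch of how such a composite cost might look, the Python
below computes Acc, FNR and FPR for a binary classifier and combines
them into a single score. The abstract does not state the actual form of
the cost function, so the weighted sum, the weights (w_acc, w_fnr, w_fpr),
the threshold and the function names are all assumptions made for
illustration only.

```python
def rates(y_true, y_pred):
    """Return (Acc, FNR, FPR) for binary labels, where 1 is the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # share of positives that were missed
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of negatives that were flagged
    return acc, fnr, fpr


def cost(acc, fnr, fpr, w_acc=1.0, w_fnr=1.0, w_fpr=1.0):
    """Hypothetical composite cost: lower is better.

    ASSUMPTION: a weighted sum of the error rate (1 - Acc), FNR and FPR.
    The thesis may define its cost function differently.
    """
    return w_acc * (1.0 - acc) + w_fnr * fnr + w_fpr * fpr


def select_candidates(scored_models, threshold):
    """Keep the names of models whose cost does not exceed a chosen threshold.

    scored_models maps a model name to its (Acc, FNR, FPR) tuple.
    """
    return [name for name, (acc, fnr, fpr) in scored_models.items()
            if cost(acc, fnr, fpr) <= threshold]
```

With a threshold chosen by the user, select_candidates returns only the
models cheap enough under this illustrative cost to enter the pool of
ensemble candidates.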
The proposed algorithm for the optimisation and combination of Acc,
FNR and FPR of ensemble models, using the double fault measure as
the diversity measure, improves Acc by between 0.01 and 0.30 for all
toxicology data sets compared with other ensemble methods such as
Bagging, Stacking, Bayes and Boosting. The highest improvements in
Acc were for the data sets Bee (0.30), Oral Quail (0.13) and Daphnia
(0.10). A small improvement (of about 0.01) in Acc was achieved
for Dietary Quail and Trout. Combining all three performance
measures also produced important results in reducing the
distance between FNR and FPR, by about 0.17 to 0.28, for the Bee,
Daphnia, Oral Quail and Trout data sets. For the Dietary Quail data
set the improvement was only about 0.01, but this data set is well
known as a difficult learning exercise. For the five UCI data sets
tested, similar results were achieved, with an Acc improvement of
between 0.10 and 0.11 and a further narrowing of the gap between
FNR and FPR.
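
For readers unfamiliar with the double fault measure, the sketch below
shows its standard pairwise definition: the proportion of instances that
two classifiers both misclassify, where a lower value indicates a more
diverse pair. How the thesis uses this measure inside its optimisation
algorithm is not described in the abstract; the function names and the
averaging over all pairs are illustrative assumptions.

```python
from itertools import combinations


def double_fault(y_true, pred_a, pred_b):
    """Fraction of instances that both classifiers get wrong (lower = more diverse)."""
    if not (len(y_true) == len(pred_a) == len(pred_b)) or not y_true:
        raise ValueError("inputs must be non-empty and equal in length")
    both_wrong = sum(1 for t, a, b in zip(y_true, pred_a, pred_b)
                     if a != t and b != t)
    return both_wrong / len(y_true)


def mean_pairwise_double_fault(y_true, predictions):
    """Average the double fault over every pair of classifiers in an ensemble.

    ASSUMPTION: averaging over pairs is one common way to turn the pairwise
    measure into an ensemble-level score; the thesis may aggregate differently.
    """
    pairs = list(combinations(predictions, 2))
    if not pairs:
        raise ValueError("at least two classifiers are required")
    return sum(double_fault(y_true, a, b) for a, b in pairs) / len(pairs)
```

Two classifiers that never fail on the same instance score 0, so an
ensemble built from such pairs can, in principle, correct each member's
errors by voting.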
In conclusion, the results show that by combining the performance
measures (Acc, FNR and FPR), as proposed in this thesis,
Acc increased and the distance between FNR and FPR decreased.

Identifier: oai:union.ndltd.org:BRADFORD/oai:bradscholars.brad.ac.uk:10454/5478
Date: January 2012
Creators: Makhtar, Mokhairi
Contributors: Neagu, Daniel; Ridley, Mick J.
Publisher: University of Bradford, School of Computing, Informatics and Media
Source Sets: Bradford Scholars
Language: English
Detected Language: English
Type: Thesis, doctoral, PhD
Rights: The University of Bradford theses are licensed under a Creative Commons Licence (http://creativecommons.org/licenses/by-nc-nd/3.0/).
