Toxic compounds, such as pesticides, are routinely tested against a range of aquatic,
avian and mammalian species as part of the registration process. The need to reduce
dependence on animal testing has led to increasing interest in alternative
methods such as in silico modelling. QSAR (Quantitative Structure-Activity
Relationship)-based models are already in use for predicting physicochemical
properties, environmental fate, eco-toxicological effects, and specific biological
endpoints for a wide range of chemicals. Data plays an important role both in building
QSAR models and in analysing the results of toxicity tests. This research addresses a
number of issues in predictive toxicology. One issue is data quality. Although a large
amount of toxicity data is available from online sources, this data may contain
unreliable samples and may therefore be of low quality. Its presentation may also be
inconsistent across sources, which makes the information difficult to access,
interpret and compare. To address this
issue we started with a detailed investigation and experimental work on the DEMETRA
datasets, which were produced by the EC-funded DEMETRA project. Based on the
investigation, the experiments and the results obtained, the author identified a
number of data quality criteria to support data evaluation in the toxicology domain.
An algorithm has also been proposed to assess
data quality before modelling. Another issue considered in the thesis was missing
values in toxicology datasets. The Least Squares Method for a paired dataset and
Serial Correlation for a single-version dataset provided solutions for these two
situations. A procedural algorithm using these two methods has been proposed to
overcome the problem of missing values. A further issue addressed in this thesis was
the modelling of multi-class datasets in which the class distribution is severely
imbalanced. Imbalanced data affects the
performance of classifiers during the classification process. We have shown that,
provided we understand how class members are arranged in dimensional space within
each cluster, we can reshape the distribution and provide the classifier with more
domain knowledge.
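
As an illustration of the least squares idea mentioned above, the following is a minimal sketch and not the thesis's actual procedure. It assumes that a "paired dataset" means two versions of the same measurements for the same compounds, and that a missing value in one version can be estimated from a straight line fitted to the complete pairs. The function names `fit_line` and `impute_paired` and the sample values are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_paired(xs, ys):
    """Fill NaN entries of ys from the paired xs (assumed pairing, see lead-in)."""
    complete = [(x, y) for x, y in zip(xs, ys)
                if not math.isnan(x) and not math.isnan(y)]
    if len(complete) < 2:
        raise ValueError("need at least two complete pairs to fit a line")
    slope, intercept = fit_line([p[0] for p in complete],
                                [p[1] for p in complete])
    # Only entries missing in ys but present in xs can be estimated.
    return [slope * x + intercept if math.isnan(y) and not math.isnan(x) else y
            for x, y in zip(xs, ys)]


# Hypothetical toxicity values for the same four compounds in two versions.
version_a = [1.0, 2.0, 3.0, 4.0]
version_b = [3.0, 5.0, float("nan"), 9.0]
filled_b = impute_paired(version_a, version_b)
```

A straight-line fit is the simplest choice here; whether it is adequate depends on how closely the two versions track each other, which would need to be checked on the real data.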
Identifier | oai:union.ndltd.org:BRADFORD/oai:bradscholars.brad.ac.uk:10454/4262 |
Date | January 2008 |
Creators | Malazizi, Ladan |
Contributors | Neagu, Daniel, Graves-Morris, Peter R. |
Publisher | University of Bradford, School of Informatics |
Source Sets | Bradford Scholars |
Language | English |
Detected Language | English |
Type | Thesis, doctoral, PhD |
Rights | The University of Bradford theses are licenced under a Creative Commons Licence (http://creativecommons.org/licenses/by-nc-nd/3.0/). |