
Statistical Learning in Drug Discovery via Clustering and Mixtures

In drug discovery, thousands of compounds are assayed to detect activity against a
biological target, with the goal of identifying compounds that are active against it
(e.g. that inhibit a virus). Statistical learning in drug discovery seeks to build a
model that uses descriptors characterizing molecular structure to predict biological
activity. However, several characteristics of drug discovery data make the relationship
between molecular descriptors and biological activity difficult to model: the rarity
of active compounds, the large volume of compounds tested by high-throughput
screening, and the complexity of molecular structure and its relationship to activity.

This thesis focuses on the design of statistical learning algorithms/models and
their applications to drug discovery. The thesis has two main parts: an
algorithm-based statistical method and a more formal model-based approach. Both
approaches can facilitate and accelerate the process of developing new drugs. A
unifying theme is the use of unsupervised methods as components of supervised
learning algorithms/models.

In the first part of the thesis, we explore a sequential screening approach, Cluster
Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates
high-throughput screening with mathematical modeling to sequentially select the
best compounds. CSARA is a cluster-based, algorithm-driven method. To
gain further insight into this method, we use three carefully designed experiments
to compare its predictive accuracy with that of Recursive Partitioning, a popular
structure-activity relationship analysis method. The experiments show that CSARA outperforms
Recursive Partitioning. Comparisons include problems with many descriptor
sets and situations in which many descriptors are not important for activity.

In the second part of the thesis, we propose and develop constrained mixture
discriminant analysis (CMDA), a model-based method. The main idea of CMDA
is to model the distribution of the observations given the class label (e.g. active
or inactive class) as a constrained mixture distribution, and then use Bayes’ rule
to predict the probability of being active for each observation in the testing set.
Constraints are used to deal with the otherwise explosive growth of the number
of parameters with increasing dimensionality. CMDA is designed to solve several
challenges in modeling drug data sets, such as multiple mechanisms, the rare target
problem (i.e. imbalanced classes), and the identification of relevant subspaces of
descriptors (i.e. variable selection).
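
The classification step described above can be written out as a short formula. The notation here is an assumption made for illustration and is not taken from the thesis: $y \in \{0, 1\}$ is the class label (1 = active), $\pi_k$ is the prior probability of class $k$, $f_k$ is the class-conditional density, and $w_{km}$ and $g_{km}$ are the weight and density of the $m$-th of $M_k$ mixture components in class $k$. The specific constraints that CMDA places on the components are not stated in this abstract, so they are left out.

```latex
% Class-conditional density modeled as a finite mixture (assumed notation)
f_k(x) = \sum_{m=1}^{M_k} w_{km}\, g_{km}(x),
\qquad w_{km} \ge 0, \quad \sum_{m=1}^{M_k} w_{km} = 1 .

% Bayes' rule gives the predicted probability that observation x is active
P(y = 1 \mid x) = \frac{\pi_1 f_1(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)} .
```

With rare actives, $\pi_1$ is small, so the posterior is large only where the active-class mixture places much more density on $x$ than the inactive-class mixture does.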

We focus on the CMDA1 model, in which univariate densities form the building
blocks of the mixture components. Because the CMDA1 log-likelihood function is
unbounded, the EM algorithm can easily converge to degenerate solutions.
A special multi-step EM algorithm is therefore developed and explored via
several experimental comparisons. Using the multi-step EM algorithm, the CMDA1
model is compared to model-based clustering discriminant analysis (MclustDA).
The CMDA1 model is either superior to or competitive with the MclustDA model,
depending on which model generates the data. The CMDA1 model performs better
than the MclustDA model when the data are high-dimensional and
unbalanced, both essential features of the drug discovery problem.
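
The unboundedness mentioned above is a general property of normal mixtures with free variances, and a small numerical sketch can make it concrete. The example below is not the CMDA1 model and not the thesis's multi-step EM algorithm. It assumes a toy two-component univariate normal mixture, a deterministic grid of toy data points, and a hypothetical helper named `mixture_loglik`. One component is centered exactly on a data point and its standard deviation is shrunk toward zero.

```python
import numpy as np

def mixture_loglik(x, weights, means, sigmas):
    """Log-likelihood of 1-D data under a univariate normal mixture."""
    x = np.asarray(x, dtype=float)[:, None]         # shape (n, 1)
    w = np.asarray(weights, dtype=float)[None, :]   # shape (1, k)
    mu = np.asarray(means, dtype=float)[None, :]
    sd = np.asarray(sigmas, dtype=float)[None, :]
    # log of (weight * normal density) for every point and component
    log_terms = (np.log(w) - 0.5 * np.log(2 * np.pi) - np.log(sd)
                 - 0.5 * ((x - mu) / sd) ** 2)
    # numerically stable log-sum-exp over components
    peak = log_terms.max(axis=1, keepdims=True)
    per_point = peak[:, 0] + np.log(np.exp(log_terms - peak).sum(axis=1))
    return float(per_point.sum())

# Toy data (an assumption for illustration): an evenly spaced grid
x = np.linspace(-2.0, 2.0, 21)

# Component 1 sits on the first data point; its sigma shrinks toward zero.
# Component 2 is a fixed standard normal that covers the remaining points.
shrinking_sigmas = [1e-2, 1e-3, 1e-4, 1e-6]
logliks = [mixture_loglik(x, [0.5, 0.5], [x[0], 0.0], [s, 1.0])
           for s in shrinking_sigmas]

for s, ll in zip(shrinking_sigmas, logliks):
    print(f"sigma = {s:g}  log-likelihood = {ll:.3f}")
```

The log-likelihood keeps growing as the first component collapses onto a single observation, even though such a fit says nothing useful about the data. An unconstrained EM run can drift toward solutions of this kind.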

An alternative approach to the problem of degeneracy is penalized estimation. By
introducing a group of simple penalty functions, we consider penalized maximum
likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves
the convergence of the conventional EM algorithm and helps avoid degenerate
solutions. Extending techniques from Chen et al. (2007), we prove that the penalized
maximum likelihood estimators (PMLEs) of the two-dimensional CMDA1 model can be
asymptotically consistent.
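
To illustrate how a variance penalty can remove degenerate maxima, here is a minimal sketch on a toy univariate normal mixture. The penalty functions used in the thesis are not given in this abstract, so the form below is an assumption chosen for illustration: each component contributes `-a * (s2 / sigma**2 + log(sigma**2))`, where `s2` is the sample variance and `a` defaults to `1/n`. The helper names `mixture_loglik` and `penalized_loglik` and the toy data are also hypothetical.

```python
import numpy as np

def mixture_loglik(x, weights, means, sigmas):
    """Log-likelihood of 1-D data under a univariate normal mixture."""
    x = np.asarray(x, dtype=float)[:, None]
    w = np.asarray(weights, dtype=float)[None, :]
    mu = np.asarray(means, dtype=float)[None, :]
    sd = np.asarray(sigmas, dtype=float)[None, :]
    log_terms = (np.log(w) - 0.5 * np.log(2 * np.pi) - np.log(sd)
                 - 0.5 * ((x - mu) / sd) ** 2)
    peak = log_terms.max(axis=1, keepdims=True)   # stable log-sum-exp
    per_point = peak[:, 0] + np.log(np.exp(log_terms - peak).sum(axis=1))
    return float(per_point.sum())

def penalized_loglik(x, weights, means, sigmas, a=None):
    """Log-likelihood plus an illustrative (assumed) variance penalty."""
    x = np.asarray(x, dtype=float)
    a = 1.0 / x.size if a is None else a          # assumed penalty weight
    s2 = x.var()                                  # sample variance of the data
    var = np.asarray(sigmas, dtype=float) ** 2
    # The s2 / var term dominates as var -> 0 and drives the penalty to -inf
    penalty = -a * np.sum(s2 / var + np.log(var))
    return mixture_loglik(x, weights, means, sigmas) + penalty

# Toy data (an assumption for illustration): an evenly spaced grid
x = np.linspace(-2.0, 2.0, 21)

# Component 1 is centered on the first data point with a shrinking sigma
sigmas_to_try = [1.0, 1e-2, 1e-3, 1e-4, 1e-6]
plain = [mixture_loglik(x, [0.5, 0.5], [x[0], 0.0], [s, 1.0])
         for s in sigmas_to_try]
penalized = [penalized_loglik(x, [0.5, 0.5], [x[0], 0.0], [s, 1.0])
             for s in sigmas_to_try]

for s, p, q in zip(sigmas_to_try, plain, penalized):
    print(f"sigma = {s:g}  plain = {p:.3f}  penalized = {q:.3f}")
```

In this sketch the unpenalized log-likelihood rises as the component collapses, while the penalized version falls steeply, so a maximizer of the penalized criterion is pushed away from near-zero variances.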

Identifier: oai:union.ndltd.org:WATERLOO/oai:uwspace.uwaterloo.ca:10012/3263
Date: January 2007
Creator: Wang, Xu
Source Set: University of Waterloo Electronic Theses Repository
Language: English
Detected Language: English
Type: Thesis or Dissertation
