Cancer can develop through a series of genetic events in combination with
external factors that alter the progression of the disease. Gene expression
studies are designed to provide an enhanced understanding of the progression of cancer
and to develop clinically relevant biomarkers of disease, prognosis and response to
treatment. One of the main aims of microarray gene expression analyses is to develop
signatures that are highly predictive of specific biological states, such as the molecular
stage of cancer. This dissertation analyzes the classification complexity inherent in gene
expression studies, proposing both techniques for measuring complexity and algorithms
for reducing this complexity.
Classifier algorithms that generate predictive signatures of cancer models must
generalize to independent datasets for successful translation to clinical practice. The
predictive performance of classifier models is shown to be dependent on the inherent
complexity of the gene expression data. Three specific quantitative measures of
classification complexity are proposed and one measure (f) is shown to correlate highly
(R² = 0.82) with classifier accuracy in experimental data.
Three quantization methods are proposed to enhance contrast in gene expression
data and reduce classification complexity. The accuracy for cancer prognosis prediction
is shown to improve using quantization in two datasets studied: from 67% to 90% in lung
cancer and from 56% to 68% in colorectal cancer. A corresponding reduction in
classification complexity is also observed.
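The abstract does not spell out the three quantization methods, so the following is only a minimal sketch of the general idea, assuming a per-gene, quantile-based scheme. The function name `quantize_gene`, the choice of three levels, and the sample values are illustrative assumptions, not the dissertation's method or data.

```python
from statistics import quantiles


def quantize_gene(values, n_levels=3):
    """Map one gene's expression values to integer levels 0..n_levels-1.

    Assumption: cut points are the gene's own quantiles, so each level
    holds roughly the same number of samples. Collapsing continuous
    values into a few levels is one way to enhance contrast.
    """
    cuts = quantiles(values, n=n_levels)  # returns n_levels - 1 cut points
    # A value's level is the number of cut points it exceeds.
    return [sum(v > c for c in cuts) for v in values]


# Made-up expression values for a single gene across nine samples.
expression = [2.1, 2.3, 2.2, 7.9, 8.4, 8.1, 5.0, 4.8, 5.2]
levels = quantize_gene(expression)
print(levels)
```

A classifier would then be trained on the quantized levels instead of the raw intensities; whether this helps on a given dataset has to be checked empirically, as the reported accuracy figures were.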
A random subspace-based multivariable feature selection approach using cost-sensitive
analysis is proposed to model the underlying heterogeneous cancer biology and
address complexity due to multiple molecular pathways and unbalanced distribution of
samples into classes. The technique is shown to be more accurate than the univariate t-test
method. The classifier accuracy improves from 56% to 68% for colorectal cancer
prognosis prediction.
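As a rough illustration of random-subspace feature selection with a cost-sensitive score, the sketch below repeatedly draws small random gene subsets, evaluates each with a leave-one-out nearest-centroid classifier, and ranks genes by the mean score of the subsets they appeared in. The nearest-centroid classifier, the use of balanced accuracy as the cost-sensitive criterion, every function name, and the toy data are assumptions made for this example; the abstract does not specify these details.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, so an error in a small class costs more
    than an error in a large class (a simple cost-sensitive score)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def centroid_predict(X, y, genes):
    """Leave-one-out nearest-centroid predictions using only `genes`."""
    preds = []
    for i, row in enumerate(X):
        best, best_d = None, float("inf")
        for c in sorted(set(y)):
            members = [X[j] for j in range(len(X)) if y[j] == c and j != i]
            centroid = [sum(m[g] for m in members) / len(members) for g in genes]
            d = sum((row[g] - mu) ** 2 for g, mu in zip(genes, centroid))
            if d < best_d:
                best, best_d = c, d
        preds.append(best)
    return preds


def random_subspace_ranking(X, y, subspace_size=2, n_rounds=200, seed=0):
    """Rank genes by the mean score of the random subsets containing them."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    total = [0.0] * n_genes
    count = [0] * n_genes
    for _ in range(n_rounds):
        genes = rng.sample(range(n_genes), subspace_size)
        score = balanced_accuracy(y, centroid_predict(X, y, genes))
        for g in genes:
            total[g] += score
            count[g] += 1
    mean = [t / c if c else 0.0 for t, c in zip(total, count)]
    return sorted(range(n_genes), key=lambda g: mean[g], reverse=True)


# Toy data: gene 0 separates the classes, genes 1-3 are noise.
# Classes are unbalanced (6 vs 3) to mimic skewed prognosis groups.
X = [
    [1.0, 0.3, 0.7, 0.5], [1.2, 0.9, 0.1, 0.4], [0.9, 0.5, 0.6, 0.8],
    [1.1, 0.2, 0.9, 0.3], [1.0, 0.7, 0.4, 0.6], [0.8, 0.4, 0.2, 0.9],
    [9.0, 0.6, 0.5, 0.2], [9.5, 0.1, 0.8, 0.7], [8.8, 0.8, 0.3, 0.5],
]
y = [0] * 6 + [1] * 3
ranking = random_subspace_ranking(X, y)
print(ranking)
```

Scoring genes in groups rather than one at a time is what distinguishes this from a univariate filter such as the t-test: a gene is credited for how well it works together with others, which is the motivation the abstract gives for modeling multiple molecular pathways.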
A published gene expression signature to predict radiosensitivity of tumor cells is
augmented with clinical indicators to enhance modeling of the data and represent the
underlying biology more closely. Statistical tests and experiments indicate that the
improvement in the model fit is a result of modeling the underlying biology rather than
statistical over-fitting of the data, thereby accommodating classification complexity
through the use of additional variables.
Identifier | oai:union.ndltd.org:USF/oai:scholarcommons.usf.edu:etd-4796 |
Date | 29 July 2010 |
Creators | Kamath, Vidya P. |
Publisher | Scholar Commons |
Source Sets | University of South Florida |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Graduate Theses and Dissertations |
Rights | default |