
Statistical Analysis of High-Dimensional Gene Expression Data

The use of diagnostic rules based on microarray gene expression data has received wide attention in bioinformatics research. To form diagnostic rules, statistical techniques are needed to construct classifiers with estimates of their associated error rates, and to correct for any selection biases in those estimates. There is also the associated problem of identifying the genes most useful for making these predictions. Traditional statistical techniques require the number of samples to be much larger than the number of features, whereas gene expression datasets usually have a small number of samples and a very large number of features. In this thesis, some new techniques are developed, and traditional techniques are modified appropriately and applied in novel ways, to analyse gene expression data.

Classification: We first consider classifying tissue samples on the basis of the gene expression data. We employ external cross-validation with recursive feature elimination to provide classification error rates for rules built with different numbers of genes. The techniques are implemented in the R package BCC (Bias-Corrected Classification) and applied to a number of real-world datasets. The results demonstrate that the error rate varies with the number of genes used; for each dataset, there is usually an optimal number of genes that returns the lowest cross-validation error rate.

Detecting Differentially Expressed Genes: We then consider the detection of genes that are differentially expressed across a given number of classes. As this problem concerns the selection of significant genes from a large pool of candidate genes, it needs to be carried out within the framework of multiple hypothesis testing. The focus is on the use of mixture models to handle the multiplicity issue. The mixture model approach provides a framework for estimating the prior probability that a gene is not differentially expressed, and it yields estimates of various error rates, including the FDR (False Discovery Rate) and the FNR (False Negative Rate). We also develop a method for selecting biomarker genes for classification, based on their repeatability among the highly differentially expressed genes across cross-validation trials. This method incorporates both gene selection and classification.

Selection Bias: When forming a prediction rule on the basis of a small number of classified tissue samples, some form of feature (gene) selection is usually adopted; this is a necessary step when the number of features is large. As the subset of genes used in the final form of the rule has not been randomly selected, but rather chosen according to criteria designed to reflect the predictive power of the rule, there is a selection bias inherent in estimates of the rule's error rates if care is not taken. Various situations are presented where selection biases arise in the formation of a prediction rule and where there is a consequent need to correct for them. Three types of selection bias are analysed: the bias from not using external cross-validation, the bias from not working with the full set of genes, and the bias from optimising the classification error rate over a number of gene subsets obtained by a selection method. Here we mostly employ the support vector machine with recursive feature elimination. The thesis describes cross-validation schemes that are able to correct for these selection biases. Furthermore, we examine the bias incurred when the predicted rather than the true outcomes are used to define the class labels in forming and evaluating the performance of the discriminant rule.
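To make the external cross-validation scheme concrete, the following sketch shows one way to combine recursive feature elimination with a linear support vector machine so that gene selection is repeated inside every fold. It is an illustration only, written against the e1071 package rather than the thesis's BCC package; the function name external_cv_rfe and the parameter choices (halving the gene set at each step, ten folds, binary classes) are illustrative assumptions, not the thesis's implementation.

library(e1071)

## External cross-validation with recursive feature elimination (RFE).
## Gene selection is redone within each training fold, so the averaged
## error rate is free of the selection bias discussed above.
## Binary classification is assumed in this sketch.
external_cv_rfe <- function(X, y, n_genes = 50, n_folds = 10) {
  y <- factor(y)
  folds <- sample(rep(seq_len(n_folds), length.out = nrow(X)))
  errors <- numeric(n_folds)
  for (k in seq_len(n_folds)) {
    train <- folds != k
    genes <- seq_len(ncol(X))
    ## RFE on the training fold only: rank genes by the magnitude of the
    ## linear SVM weights and discard the lower half until n_genes remain.
    while (length(genes) > n_genes) {
      fit <- svm(X[train, genes, drop = FALSE], y[train], kernel = "linear")
      w <- t(fit$coefs) %*% fit$SV            # weight vector of the linear SVM
      keep_n <- max(n_genes, ceiling(length(genes) / 2))
      genes <- genes[order(abs(w), decreasing = TRUE)[seq_len(keep_n)]]
    }
    ## Refit on the selected genes and assess on the held-out fold.
    fit <- svm(X[train, genes, drop = FALSE], y[train], kernel = "linear")
    pred <- predict(fit, X[!train, genes, drop = FALSE])
    errors[k] <- mean(pred != y[!train])
  }
  mean(errors)
}

Calling external_cv_rfe(X, y, n_genes = g) over a range of values of g traces out the error-rate curve referred to above, from which the number of genes giving the lowest cross-validation error rate can be read off.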
Case Study: We present a case study using the breast cancer datasets. In the study, we compare the 70 highly differentially expressed genes proposed by van 't Veer and colleagues with the set of genes selected using our repeatability method. The results demonstrate that there is more than one set of biomarker genes. We also examine the selection biases that may exist when analysing this dataset; these biases are demonstrated to be substantial.
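The first of the biases listed above is easy to reproduce on synthetic data. The sketch below is again only an illustration, not code from the thesis or its case study: for brevity it ranks genes by a simple t-statistic rather than SVM-RFE, and it compares an internal estimate, where genes are chosen once from the full dataset before cross-validation, with the external scheme in which selection is repeated within each fold. On pure noise the internal estimate is typically well below the 50% error one should expect, while the external estimate stays close to it.

library(e1071)

set.seed(1)
X <- matrix(rnorm(40 * 2000), nrow = 40)      # 40 samples, 2000 "genes", no real signal
y <- factor(rep(c("A", "B"), each = 20))

## Rank genes by the absolute two-sample t-statistic and keep the top n_genes.
rank_genes <- function(X, y, n_genes) {
  t_stat <- apply(X, 2, function(g) abs(t.test(g ~ y)$statistic))
  order(t_stat, decreasing = TRUE)[seq_len(n_genes)]
}

cv_error <- function(X, y, n_genes, external, n_folds = 5) {
  folds <- sample(rep(seq_len(n_folds), length.out = nrow(X)))
  if (!external) genes <- rank_genes(X, y, n_genes)     # selection sees every sample
  errs <- sapply(seq_len(n_folds), function(k) {
    train <- folds != k
    if (external) genes <- rank_genes(X[train, ], y[train], n_genes)
    fit <- svm(X[train, genes, drop = FALSE], y[train], kernel = "linear")
    mean(predict(fit, X[!train, genes, drop = FALSE]) != y[!train])
  })
  mean(errs)
}

cv_error(X, y, n_genes = 20, external = FALSE)   # optimistically low: selection bias
cv_error(X, y, n_genes = 20, external = TRUE)    # near 0.5, as expected for noise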

Identifier: oai:union.ndltd.org:ADTP/279128
Creators: Justin Zhu
Source Sets: Australasian Digital Theses Program
Detected Language: English
