With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes (features or genomic biomarkers) simultaneously in a single experiment. Robust and accurate gene selection methods are required to identify differentially expressed genes across different samples for disease diagnosis or prognosis. The problem of identifying significantly differentially expressed genes can be stated as follows: given gene expression measurements from an experiment with two (or more) conditions, find the subset of genes whose expression levels differ significantly across these conditions.
Analysis of genomic data is challenging because of the high dimensionality of the data and the small sample sizes. Several mathematical and statistical methods currently exist to identify significantly differentially expressed genes. These methods typically focus on gene-by-gene analysis within a parametric hypothesis testing framework. In this study, we propose three flexible procedures for analyzing microarray data.
The first is a parametric method based on a flexible distribution, the Generalized Logistic Distribution of Type II (GLDII), for which we develop an approximate likelihood ratio test (ALRT). Although the method analyzes the data gene by gene, the ALRT under the GLDII distributional assumption appears to provide a favourable fit to microarray data.
In the second method, we propose a test statistic for testing whether the area under the receiver operating characteristic curve (AUC) for each gene is greater than 0.5, allowing different variances for each gene. This method is computationally less intensive and can identify genes that are reasonably stable with satisfactory prediction performance. The third method compares the AUCs of a pair of genes and is designed for selecting highly correlated genes in microarray datasets. It is a nonparametric procedure for selecting genes whose expression levels are correlated with that of a "seed" gene in microarray experiments.
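To make the idea of testing AUC > 0.5 gene by gene concrete, the following is a minimal sketch and not the test statistic developed in the thesis. It assumes a Wald-type z statistic built from the empirical AUC and a variance estimate that treats the two groups separately (so the group variances need not be equal), with a normal approximation for the p-value. The function name `auc_z_test` and the argument names are hypothetical.

```python
import math
import numpy as np

def auc_z_test(x, y):
    """One-sided test of H0: AUC = 0.5 against H1: AUC > 0.5 for one gene.

    x: expression values in the reference group (e.g. controls)
    y: expression values in the comparison group (e.g. cases)
    Returns (auc, z, p). Illustrative only; not the thesis's statistic.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # psi[i, j] = 1 if y_j > x_i, 0.5 if tied, 0 otherwise
    diff = y[None, :] - x[:, None]
    psi = (diff > 0) + 0.5 * (diff == 0)
    auc = psi.mean()
    # Per-sample components, computed separately for each group
    v_x = psi.mean(axis=1)  # one value per sample in x
    v_y = psi.mean(axis=0)  # one value per sample in y
    var = v_x.var(ddof=1) / x.size + v_y.var(ddof=1) / y.size
    if var <= 0:
        # Perfectly separated groups give a zero variance estimate
        return auc, float("nan"), float("nan")
    z = (auc - 0.5) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return auc, z, p

# Toy data for a single gene (made up for illustration)
auc, z, p = auc_z_test([1, 2, 3, 4], [3, 5, 6, 7])
```

Applying such a function to every row of an expression matrix and sorting by `z` would give one possible ranking of genes; with thousands of genes, the resulting p-values would still need a multiple-testing adjustment.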
The test proposed by DeLong et al. (1988) is the conventional nonparametric procedure for comparing correlated AUCs. It uses a consistent variance estimator and relies on the asymptotic normality of the AUC estimator. Our proposed method incorporates DeLong's variance estimation technique when comparing pairs of genes and can identify genes with biologically sound implications.
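As a rough illustration of the DeLong et al. (1988) comparison of two correlated AUCs, the sketch below applies the standard procedure to two genes measured on the same samples. It is not the seed-gene selection procedure proposed in the thesis; the function names `delong_compare` and `_components`, the 0/1 label coding, and the toy data are assumptions made for the example.

```python
import math
import numpy as np

def _components(neg, pos):
    """Empirical AUC and per-sample components for one gene."""
    diff = pos[None, :] - neg[:, None]
    psi = (diff > 0) + 0.5 * (diff == 0)
    # AUC, one component per positive sample, one per negative sample
    return psi.mean(), psi.mean(axis=0), psi.mean(axis=1)

def delong_compare(gene_a, gene_b, labels):
    """Two-sided z test of equal AUCs for two genes on the same samples.

    labels: 1 for the positive class, 0 for the negative class.
    Returns (auc_a, auc_b, z, p).
    """
    labels = np.asarray(labels).astype(bool)
    aucs, v_pos, v_neg = [], [], []
    for gene in (gene_a, gene_b):
        g = np.asarray(gene, dtype=float)
        auc, vp, vn = _components(g[~labels], g[labels])
        aucs.append(auc)
        v_pos.append(vp)
        v_neg.append(vn)
    # 2x2 covariance matrices of the components across the two genes
    s_pos = np.cov(np.vstack(v_pos))
    s_neg = np.cov(np.vstack(v_neg))
    s = s_pos / labels.sum() + s_neg / (~labels).sum()
    # Variance of the AUC difference, including the covariance term
    var_diff = s[0, 0] + s[1, 1] - 2.0 * s[0, 1]
    if var_diff <= 0:
        return aucs[0], aucs[1], float("nan"), float("nan")
    z = (aucs[0] - aucs[1]) / math.sqrt(var_diff)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return aucs[0], aucs[1], z, p

# Toy data: 4 negative and 4 positive samples (made up for illustration)
labels = [0, 0, 0, 0, 1, 1, 1, 1]
gene_a = [1, 2, 3, 4, 3, 5, 6, 7]
gene_b = [5, 1, 4, 2, 3, 6, 2, 7]
auc_a, auc_b, z, p = delong_compare(gene_a, gene_b, labels)
```

The covariance term is what distinguishes this from comparing two independent AUCs: because both genes are measured on the same samples, ignoring it would misstate the variance of the difference.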
In this thesis, we focus on the primary step in the gene selection process, namely, the ranking of genes with respect to a statistical measure of differential expression. We assess the proposed approaches through extensive simulation studies and demonstrate the methods on real datasets. The simulation study indicates that the parametric method performs favorably across all settings of variance, sample size and treatment effect. Importantly, the method is found to be less sensitive to contamination by noise. The proposed nonparametric methods do not involve complicated formulas and do not require advanced programming skills. Both nonparametric methods can also identify a large fraction of truly differentially expressed (DE) genes, especially when the sample size is large or outliers are present. We conclude that the proposed methods are good choices of analytical tools for identifying DE genes for further biological and clinical analysis.
Identifier | oai:union.ndltd.org:TORONTO/oai:tspace.library.utoronto.ca:1807/29749 |
Date | 30 August 2011 |
Creators | Hossain, Ahmed |
Contributors | Beyene, Joseph, Willan, Andrew R. |
Source Sets | University of Toronto |
Language | en_ca |
Detected Language | English |
Type | Thesis |