
A new normalized EM algorithm for clustering gene expression data

Microarray data clustering is a basic exploratory tool for finding groups of genes that exhibit similar expression patterns or for detecting relevant classes of molecular subtypes. Among the wide range of clustering approaches proposed and applied in the gene expression community to analyze microarray data, mixture model-based clustering has received much attention owing to its sound statistical framework and its flexibility in data modeling. However, clustering algorithms that follow the model-based framework suffer from two serious drawbacks. First, their performance depends critically on the starting values of their iterative clustering procedures. Second, they cannot work directly with very high dimensional data sets in the sample clustering problem, where the data dimension can reach hundreds or thousands. The thesis addresses these two challenges and makes the following contributions. First, the thesis introduces the statistical model of the proposed normalized Expectation Maximization (EM) algorithm, followed by an analysis of its clustering performance on a number of real microarray data sets. The normalized EM is stable even with random initializations of its iterative procedure; this stability is demonstrated through a performance comparison with other related clustering algorithms. Furthermore, the normalized EM is the first mixture model-based clustering approach capable of working directly with very high dimensional microarray data sets in the sample clustering problem, where the number of genes is much larger than the number of samples. This advantage is illustrated through a comparison with the unnormalized EM (the conventional EM algorithm for Gaussian mixture model-based clustering).
In addition, for experimental microarray data sets whose data points have class labels available, the thesis uncovers an interesting property of the convergence speed of the normalized EM with respect to the radius of the hypersphere in its statistical model. Second, to support the performance comparison of different clusterings, a new internal index is derived using fundamental concepts from information theory. This index allows the comparison of clustering approaches in which the closeness between data points is evaluated by cosine similarity. The method used to derive this index can also be used to design new indexes for comparing clustering approaches that share a common similarity measure.
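
The abstract refers to a hypersphere radius and to cosine similarity without giving formulas. The sketch below is not the thesis's normalized EM; it only illustrates, under stated assumptions, the standard geometric fact that links the two ideas: once vectors are scaled onto a hypersphere of radius r, their squared Euclidean distance equals 2r^2(1 - cos), so closeness on the sphere is governed entirely by cosine similarity. The function names, the sample vectors, and the radius value are hypothetical and chosen for illustration only.

```python
import math


def normalize_to_sphere(x, radius=1.0):
    """Scale vector x so that its Euclidean norm equals `radius`."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [radius * v / norm for v in x]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def squared_distance(a, b):
    """Squared Euclidean distance between vectors a and b."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Hypothetical expression profiles (illustrative values, not real data).
sample_a = [2.0, 4.0, 6.0]
sample_b = [1.0, 2.0, 3.0]    # same direction as sample_a, half the magnitude
sample_c = [3.0, -1.0, 0.5]   # a different direction

radius = 2.0  # assumed radius, chosen arbitrarily for the illustration
a_r = normalize_to_sphere(sample_a, radius)
b_r = normalize_to_sphere(sample_b, radius)
c_r = normalize_to_sphere(sample_c, radius)

# On the sphere, squared distance is determined by cosine similarity alone.
cos_ac = cosine_similarity(sample_a, sample_c)
lhs = squared_distance(a_r, c_r)
rhs = 2.0 * radius ** 2 * (1.0 - cos_ac)
print(abs(lhs - rhs) < 1e-9)
```

Profiles that differ only in magnitude, such as sample_a and sample_b above, map to the same point on the sphere, which is why a cosine-based view of closeness ignores overall expression scale.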

Identifier: oai:union.ndltd.org:ADTP/258600
Date: January 2008
Creators: Nguyen, Phuong Minh, Electrical Engineering & Telecommunications, Faculty of Engineering, UNSW
Publisher: University of New South Wales. Electrical Engineering & Telecommunications
Source Sets: Australasian Digital Theses Program
Language: English
Detected Language: English
Rights: http://unsworks.unsw.edu.au/copyright
