Global ETD Search

331	Computational Complexity Of Bi-clustering Wulff, Sharon Jay January 2008 In this work we formalize a new natural objective (or cost) function for bi-clustering: monochromatic bi-clustering. Our objective function is suitable for detecting meaningful homogeneous clusters based on categorical-valued input matrices. Such problems have arisen recently in systems biology, where researchers have inferred functional classifications of biological agents based on their pairwise interactions. We analyze the computational complexity of the resulting optimization problems. We show that finding optimal solutions is NP-hard and complement this result by introducing a polynomial time approximation algorithm for this bi-clustering task. This is the first positive approximation guarantee for bi-clustering algorithms. We also show that bi-clustering with our objective function can be viewed as a generalization of correlation clustering. Bi-Clustering NP-hardness correlation clustering Computer Science
332	Voting-Based Consensus of Data Partitions Ayad, Hanan 08 1900 Over the past few years, there has been a renewed interest in the consensus problem for ensembles of partitions. Recent work is primarily motivated by the developments in the area of combining multiple supervised learners. Unlike the consensus of supervised classifications, the consensus of data partitions is a challenging problem due to the lack of globally defined cluster labels and to the inherent difficulty of data clustering as an unsupervised learning problem. Moreover, the true number of clusters may be unknown. A fundamental goal of consensus methods for partitions is to obtain an optimal summary of an ensemble and to discover a cluster structure with accuracy and robustness exceeding those of the individual ensemble partitions. The quality of the consensus partitions depends strongly on the ensemble generation mechanism and on the suitability of the consensus method for combining the generated ensemble. Typically, consensus methods derive an ensemble representation that is used as the basis for extracting the consensus partition. Most ensemble representations circumvent the labeling problem. On the other hand, voting-based methods establish direct parallels with consensus methods for supervised classifications, by seeking an optimal relabeling of the ensemble partitions and deriving an ensemble representation consisting of a central aggregated partition. An important element of the voting-based aggregation problem is the pairwise relabeling of an ensemble partition with respect to a representative partition of the ensemble, which is referred to here as the voting problem. The voting problem is commonly formulated as a weighted bipartite matching problem. In this dissertation, a general theoretical framework for the voting problem as a multi-response regression problem is proposed.
The problem is formulated as seeking to estimate the uncertainties associated with the assignments of the objects to the representative clusters, given their assignments to the clusters of an ensemble partition. A new voting scheme, referred to as cumulative voting, is derived as a special instance of the proposed regression formulation corresponding to fitting a linear model by least squares estimation. The proposed formulation reveals the close relationships between the underlying loss functions of the cumulative voting and bipartite matching schemes. A useful feature of the proposed framework is that it can be applied to model substantial variability between partitions, such as a variable number of clusters. A general aggregation algorithm with variants corresponding to cumulative voting and bipartite matching is applied and a simulation-based analysis is presented to compare the suitability of each scheme to different ensemble generation mechanisms. The bipartite matching is found to be more suitable than cumulative voting for a particular generation model, whereby each ensemble partition is generated as a noisy permutation of an underlying labeling, according to a probability of error. For ensembles with a variable number of clusters, it is proposed that the aggregated partition be viewed as an estimated distributional representation of the ensemble, on the basis of which a criterion may be defined to seek an optimally compressed consensus partition. The properties and features of the proposed cumulative voting scheme are studied. In particular, the relationship between cumulative voting and the well-known co-association matrix is highlighted. Furthermore, an adaptive aggregation algorithm that is suited for the cumulative voting scheme is proposed. The algorithm aims at selecting the initial reference partition and the aggregation sequence of the ensemble partitions such that the loss of mutual information associated with the aggregated partition is minimized.
In order to subsequently extract the final consensus partition, an efficient agglomerative algorithm is developed. The algorithm merges the aggregated clusters such that the maximum amount of information is preserved. Furthermore, it allows the optimal number of consensus clusters to be estimated. An empirical study using several artificial and real-world datasets demonstrates that the proposed cumulative voting scheme leads to discovering substantially more accurate consensus partitions compared to bipartite matching, in the case of ensembles with a relatively large or a variable number of clusters. Compared to other recent consensus methods, the proposed method is found to be comparable with or better than the best performing methods. Moreover, accurate estimates of the true number of clusters are often achieved using cumulative voting, whereas consistently poor estimates are achieved based on bipartite matching. The empirical evidence demonstrates that the bipartite matching scheme is not suitable for these types of ensembles. Electrical and Computer Engineering
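As a rough illustration of the voting problem described in entry 332, the sketch below relabels each ensemble partition against a reference partition by maximum-weight bipartite matching on the label contingency table, then takes a plurality vote per object. It covers only the bipartite matching variant, not the dissertation's cumulative voting scheme. The function names, the toy ensemble, and the brute-force search over label permutations (standing in for a proper assignment solver, and assuming every partition uses the same small number of clusters k) are all assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def contingency(reference, partition, k):
    """counts[j][i] = number of objects with label j in `partition` and label i in `reference`."""
    counts = [[0] * k for _ in range(k)]
    for ref_label, part_label in zip(reference, partition):
        counts[part_label][ref_label] += 1
    return counts


def relabel_by_matching(reference, partition, k):
    """Relabel `partition` with the one-to-one label mapping that maximizes agreement
    with `reference`. Brute force over all k! mappings; only sensible for small k."""
    counts = contingency(reference, partition, k)
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(counts[j][perm[j]] for j in range(k)),
    )
    return [best[label] for label in partition]


def plurality_consensus(partitions):
    """Per-object plurality vote over partitions that already share a label space."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*partitions)]


# Toy ensemble (hypothetical data): both partitions permute the reference labels,
# and the second one also misassigns the last object.
reference = [0, 0, 1, 1, 2, 2]
ensemble = [[1, 1, 2, 2, 0, 0], [2, 2, 0, 0, 1, 0]]

aligned = [relabel_by_matching(reference, p, k=3) for p in ensemble]
consensus = plurality_consensus([reference] + aligned)
print(aligned)    # [[0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 1]]
print(consensus)  # [0, 0, 1, 1, 2, 2]
```

Because the matching is one-to-one, this variant presupposes an equal number of clusters in every partition, which is the limitation the abstract attributes to bipartite matching for ensembles with a variable number of clusters.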
333	Fuzzy Clustering with Principal Component Analysis Rau, Min-Zong 14 August 2010 We propose a clustering algorithm which incorporates a similarity-based fuzzy clustering and principal component analysis. The proposed algorithm is capable of discovering clusters with hyper-spherical, hyper-ellipsoidal, or oblique hyper-ellipsoidal shapes. In addition, the number of clusters need not be specified in advance by the user. For a given dataset, the orientation, locations, and number of the clusters obtained faithfully reflect the characteristics of the dataset. Experimental results on synthetically generated datasets show that our method performs better than other methods. incremental clustering principal component analysis fuzzy clustering oblique hyper-ellipsoidal cluster
334	Feature Reduction and Multi-label Classification Approaches for Document Data Jiang, Jung-Yi 08 August 2011 This thesis proposes some novel approaches for feature reduction and multi-label classification for text datasets. In text processing, the bag-of-words model is commonly used, with each document modeled as a vector in a high dimensional space. This model is often called the vector-space model. Usually, the dimensionality of the document vector is huge. Such high dimensionality can be a severe obstacle for text processing algorithms. To improve the performance of text processing algorithms, we propose a feature clustering approach to reduce the dimensionality of document vectors. We also propose an efficient algorithm for text classification. Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. We propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with and describe properly the real distribution of the training data. In addition, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. Experimental results show that our method can run faster and obtain better extracted features than other methods.
We also propose a fuzzy similarity clustering scheme for multi-label text categorization in which a document can belong to one or more categories. First, feature transformation is performed: an input document is transformed to a fuzzy-similarity vector. Next, the relevance degrees of the input document to a collection of clusters are calculated, which are then combined to obtain the relevance degree of the input document to each participating category. Finally, the input document is assigned to a category if the associated relevance degree exceeds a threshold. In text categorization, the number of the involved terms is usually huge, so an automatic classification system may suffer from large memory requirements and poor efficiency. Our scheme avoids these difficulties. In addition, we allow the region a category covers to be a combination of several sub-regions that are not necessarily connected. The effectiveness of our proposed scheme is demonstrated by the results of several experiments. multi-label document classification self-constructing clustering text classification dimension reduction feature clustering
335	Clustering Multilingual Documents: A Latent Semantic Indexing Based Approach Lin, Chia-min 09 February 2006 Document clustering automatically organizes a document collection into distinct groups of similar documents on the basis of their contents. Most existing document clustering techniques deal with monolingual documents (i.e., documents written in one language). However, with the trend of globalization and advances in Internet technology, an organization or individual often generates/acquires and subsequently archives documents in different languages, thus creating the need for multilingual document clustering (MLDC). Motivated by its significance and need, this study designs a Latent Semantic Indexing (LSI) based MLDC technique. Our empirical evaluation results show that the proposed LSI-based multilingual document clustering technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision. Document clustering Latent semantic analysis Latent semantic indexing Text mining Document management Multilingual document clustering
336	Development of Personalized Document Clustering Technique for Accommodating Hierarchical Categorization Preferences Lee, Kuan-yi 27 July 2006 With the advances in information and networking technologies and the proliferation of e-commerce and knowledge management applications, individuals and organizations generate and acquire a tremendous amount of online information that is typically available as textual documents. To manage the ever-increasing volume of documents, an individual or organization frequently organizes his/her documents into a set or hierarchy of categories in order to facilitate document management and subsequent information access and browsing. Furthermore, document clustering is an intentional act that reflects individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document-clustering must consider individual preferences for supporting personalization in document categorization and should be capable of organizing documents into a category hierarchy. However, document-clustering research traditionally has been anchored in analyses of document content. As a consequence, most existing document-clustering techniques are not tailored to individuals' preferences and therefore are unable to facilitate personalization. On the other hand, existing document-clustering techniques generally are designed to generate from a document collection a set of document clusters rather than a hierarchy of document clusters. In response, we develop in this study a hierarchical personalized document-clustering (HPEC) technique that takes into account an individual's folder hierarchy representing the individual's categorization preferences and produces document-clusters in a hierarchical structure for the target individual.
Our empirical evaluation results suggest that the proposed HPEC technique outperformed its benchmark technique (i.e., HAC+P) in cluster recall while maintaining the same level of cluster precision and location discrepancy as its benchmark technique did. Hierarchical document management Personalized document clustering Text mining Personalization Document clustering
337	Improving Search Result Clustering By Integrating Semantic Information From Wikipedia Calli, Cagatay 01 September 2010 Suffix Tree Clustering (STC) is a search result clustering (SRC) algorithm focused on generating overlapping clusters with meaningful labels in linear time. It showed the feasibility of SRC, but subsequent studies introduced description-first algorithms that generate better labels and achieve higher precision. Still, STC remained the fastest SRC algorithm, and further studies addressed different problems of STC. In this thesis, semantic relations between cluster labels and documents are exploited to filter out noisy labels and improve the merging phase of STC. Wikipedia is used to identify these relations, and methods for integrating semantic information into STC are suggested. Semantic features are shown to be effective for the SRC task when used together with term frequency vectors. Furthermore, there have been no SRC studies on Turkish until now. In this thesis, a dataset for Turkish is introduced and a number of methods are tested on Turkish. QA Computer Software 76.75-76.765
338	A Self-Constructing Fuzzy Feature Clustering for Text Categorization Liu, Ren-jia 26 August 2009 Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with and describe properly the real distribution of the training data. In addition, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can then be avoided. The 20 Newsgroups data set and the Cade 12 web directory are used as our experimental data. We adopt the support vector machine to classify the documents. Experimental results show that our method can run faster and obtain better extracted features than other methods. text classification feature reduction feature clustering feature extraction fuzzy clustering fuzzy similarity
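The self-constructing procedure summarized in entry 338 can be sketched roughly as follows: patterns are fed in one at a time, each cluster keeps a per-dimension mean and deviation, and a pattern either joins the cluster with the highest membership or starts a new cluster when it fails the similarity test against all of them. The abstract gives no formulas, so the Gaussian-style membership function, the threshold `rho`, the initial deviation `sigma0`, and the toy word patterns below are assumptions for illustration rather than the author's exact method.

```python
import math


class Cluster:
    """A cluster summarized by the per-dimension mean and deviation of its members."""

    def __init__(self, pattern, sigma0):
        self.members = [pattern]
        self.sigma0 = sigma0  # assumed floor that keeps deviations nonzero

    def mean(self):
        n = len(self.members)
        return [sum(col) / n for col in zip(*self.members)]

    def deviation(self):
        n = len(self.members)
        return [
            math.sqrt(sum((v - m) ** 2 for v in col) / n) + self.sigma0
            for col, m in zip(zip(*self.members), self.mean())
        ]

    def membership(self, pattern):
        # Assumed membership function: product of per-dimension Gaussian-like terms.
        return math.prod(
            math.exp(-(((x - m) / s) ** 2))
            for x, m, s in zip(pattern, self.mean(), self.deviation())
        )


def self_constructing_clusters(patterns, rho=0.5, sigma0=0.25):
    """Feed patterns one by one; join the best cluster if it passes the similarity
    test (membership >= rho), otherwise open a new cluster."""
    clusters = []
    for pattern in patterns:
        scores = [c.membership(pattern) for c in clusters]
        if scores and max(scores) >= rho:
            clusters[scores.index(max(scores))].members.append(pattern)
        else:
            clusters.append(Cluster(pattern, sigma0))
    return clusters


# Hypothetical word patterns: each tuple describes how one word is distributed over two classes.
word_patterns = [(0.9, 0.1), (0.85, 0.15), (0.1, 0.9), (0.12, 0.88), (0.5, 0.5)]
clusters = self_constructing_clusters(word_patterns)
print([len(c.members) for c in clusters])  # [2, 2, 1]
```

The number of clusters is never passed in; it falls out of the threshold test, which mirrors the abstract's point that the user need not specify the number of extracted features in advance.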
339	Robust clustering algorithms Gupta, Pramod 05 April 2011 One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because they are simple and their output is easy to interpret. However, many of these algorithms lack any performance guarantees when the data is noisy, incomplete or has outliers, which is the case for most real world data. It is well known that standard linkage algorithms perform extremely poorly in the presence of noise. In this work we propose two new robust algorithms for bottom-up agglomerative clustering and give formal theoretical guarantees for their robustness. We show that our algorithms can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also extend our algorithms to an inductive setting with similar guarantees, in which we randomly choose a small subset of points from a much larger instance space, generate a hierarchy over this sample, and then insert the rest of the points into it to generate a hierarchy over the entire instance space. We then do a systematic experimental analysis of various linkage algorithms, compare their performance on a variety of real world data sets, and show that our algorithms handle various forms of noise much better than other hierarchical algorithms. Robust algorithms Hierarchical clustering Unsupervised learning Clustering Machine learning Cluster analysis Cluster analysis Computer programs Algorithms
340	Functional Characterization of the NSF1 (YPL230W) Gene using Correlation Clustering and Genetic Analysis in Saccharomyces Cerevisiae Bessonov, Kyrylo 09 January 2012 High throughput technologies such as microarrays and modern genome sequencers produce enormous amounts of data that require novel data processing. This thesis proposes a method called Interdependent Correlation Cluster (ICC) to analyze the relations between genes represented by microarray data that are conditioned on a specific target gene. Based on Correlation Clustering, the proposed method analyzes a large set of correlation values related to the gene expression profiles extracted from given microarray datasets. The proposed method works on microarray datasets of any size and could be applied to any target gene. In this study the selected target gene, NSF1/USV1/YPL230W, encodes a poorly characterized C2H2 zinc finger transcription factor (TF) involved in stress responses in yeast. The method is successful in the identification of novel NSF1 functional roles during fermentation stress conditions in the M2 industrial yeast strain. The newly identified functions include regulation of energy and sulfur metabolism, protein synthesis, ribosomal assembly and protein trafficking as well as other processes. NSF1 involvement in sulfur metabolism was experimentally confirmed using biological laboratory techniques. Importantly, the involvement of NSF1 in sulfur metabolism regulation is highly relevant to the wine and beer production industries, which are concerned with the production of compounds having sulfur-like off odour (SLO) and toxic properties. The correlation clustering also provides a means of understanding complex interactions existing between genes. / The pdf file contains numerous hyperlinks and bookmarks to facilitate navigation.
This thesis will be of interest to those working with topics such as data mining of microarray data, novel gene function discovery and prediction, and genome-wide responses to fermentation stresses. / Ministry of Training, Colleges and Universities of Ontario (Ontario Graduate Scholarship and Ontario Graduate Scholarships in Science and Technology); The Natural Sciences and Engineering Research Council of Canada (NSERC) microarray analysis NSF1 (YPL230W) correlation clustering fermentation stress novel gene functions functional networks clustering
