11

Optimal Clustering: Genetic Constrained K-Means and Linear Programming Algorithms

Zhao, Jianmin 01 January 2006 (has links)
Methods for determining clusters of data under specified constraints have recently gained popularity. Although general constraints may be used, we focus on clustering methods with the constraint of a minimal cluster size. In this dissertation, we propose two constrained k-means algorithms: the Linear Programming Algorithm (LPA) and the Genetic Constrained K-means Algorithm (GCKA). LPA recasts the k-means algorithm as a linear programming problem with constraints requiring that each cluster have m or more subjects. To achieve an acceptable clustering solution, we run the algorithm with a large number of random sets of initial seeds and choose the solution with minimal Root Mean Squared Error (RMSE) as the final solution for a given data set. We evaluate LPA on both generic and simulated data, and the results indicate that it can obtain a reasonable clustering solution. GCKA hybridizes the genetic algorithm with a constrained k-means algorithm; we define a selection operator, a mutation operator and a constrained k-means operator. Using finite Markov chain theory, we prove that GCKA converges in probability to the global optimum. We test the algorithm with several datasets, and the analysis shows that a good clustering solution can be achieved by carefully choosing parameters such as population size, mutation probability and number of generations. We also propose a Bi-Nelder algorithm to search for an appropriate cluster number with minimal RMSE.
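The restart-and-select loop described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's LP formulation: a plain Lloyd's k-means stands in for the constrained assignment step, the minimum-size constraint is checked on each restart, and the feasible solution with minimal RMSE is kept.

```python
import math
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means from one random set of initial seeds."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

def rmse(centers, clusters):
    n = sum(len(cl) for cl in clusters)
    sse = sum(dist2(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)
    return math.sqrt(sse / n)

def constrained_kmeans(points, k, min_size, restarts=50):
    """Run many random restarts; keep the minimal-RMSE solution in which
    every cluster has at least min_size members."""
    best = None
    for _ in range(restarts):
        centers, clusters = kmeans(points, k)
        if all(len(cl) >= min_size for cl in clusters):
            score = rmse(centers, clusters)
            if best is None or score < best[0]:
                best = (score, centers, clusters)
    return best  # None if no restart satisfied the size constraint
```

Rejecting infeasible restarts is cruder than solving the constrained assignment directly, but it shows how the size constraint and the minimal-RMSE selection interact.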
12

SIMD Algorithms for Single Link and Complete Link Pattern Clustering

Arumugavelu, Shankar 08 March 2007 (has links)
Clustering techniques play an important role in exploratory pattern analysis, unsupervised pattern recognition and image segmentation applications, but clustering algorithms are computationally intensive. This thesis proposes new parallel algorithms for single-link and complete-link hierarchical clustering. The parallel algorithms have been mapped onto a SIMD machine model with a linear interconnection network. The model consists of a linear array of N processing elements (PEs), where N is the number of patterns to be clustered, interfaced to a host machine; the interconnection network provides inter-PE and PE-to-host/host-to-PE communication. For single-link clustering, each PE maintains a sorted list of its first log N nearest neighbors and the host maintains a heap of the root elements of all the PEs; the smallest entry in the distance matrix is determined, and the matrix updated, in O(log N) time. For complete-link clustering, each PE maintains a heap of the inter-pattern distances. This significantly reduces the time to determine the smallest entry in the distance matrix during each iteration, from O(N^2) to O(N), since the root element in each PE gives its nearest neighbor. The proposed algorithms are faster and simpler than previously known algorithms for hierarchical clustering. For clustering a data set with N patterns using N PEs, the computation time is shown to be O(N log N) for the single-link algorithm and O(N^2) for the complete-link algorithm. The parallel algorithms have been verified through simulations on the Intel iPSC/2 parallel machine.
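The complete-link idea can be illustrated with a sequential sketch: each pattern keeps a heap of its inter-pattern distances (playing the role of one PE), so the smallest matrix entry is found by scanning only the heap roots, with stale entries discarded lazily. This is an illustrative serial emulation, not the SIMD mapping itself.

```python
import heapq

def complete_link(points, dist):
    """Complete-link hierarchical clustering. heaps[i] holds distances from
    pattern i to higher-indexed patterns; the global minimum is found by
    scanning heap roots only, instead of the full distance matrix."""
    n = len(points)
    active = set(range(n))
    d = {}
    heaps = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            v = dist(points[i], points[j])
            d[(i, j)] = v
            heapq.heappush(heaps[i], (v, j))

    def cur(i, j):
        return d[(min(i, j), max(i, j))]

    merges = []
    while len(active) > 1:
        best = None
        for i in active:
            h = heaps[i]
            # lazy deletion: drop partners already merged away and
            # entries whose distance has since been updated
            while h and (h[0][1] not in active or h[0][0] != cur(i, h[0][1])):
                heapq.heappop(h)
            if h and (best is None or h[0][0] < best[0]):
                best = (h[0][0], i, h[0][1])
        v, i, j = best
        merges.append((i, j, v))
        active.discard(j)
        # complete link: distance to the merged cluster is the maximum
        for k in active - {i}:
            a, b = min(i, k), max(i, k)
            d[(a, b)] = max(cur(i, k), cur(j, k))
            heapq.heappush(heaps[a], (d[(a, b)], b))
    return merges
```

On the SIMD model the root scan runs across PEs in parallel; here it is a serial loop, but the data layout (one heap per pattern) is the same.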
13

Filtering Social Tags for Songs based on Lyrics using Clustering Methods

Chawla, Rahul 21 July 2011 (has links)
In the field of music data mining, mood and topic information is considered high-level metadata; it is difficult to extract but regarded as very valuable. The immense growth of Web 2.0 made social tags a direct channel of interaction with users, and their feedback through tags can help in the classification and retrieval of music. One of the major shortcomings of the approaches employed so far is improper filtering of social tags. This thesis delves into information extraction from songs' tags and lyrics, with the main focus on removing erroneous and unwanted tags with the help of other features. Hierarchical clustering is applied to create clusters of tags, based on the semantic information any given pair of tags shares. The lyrics features are utilized by employing the CLOPE clustering method to form lyrics clusters, and the Naïve Bayes method to compute probability values that aid the classification process. The classification outputs are finally used to estimate the accuracy of a tag belonging to the song. The results obtained from the experiments all point towards the success of the proposed method, which can be utilized by other research projects in similar fields.
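The Naïve Bayes step can be sketched as follows, assuming bag-of-words lyric features and per-song tag labels; the feature set and Laplace smoothing choice are illustrative assumptions, not the thesis's exact setup.

```python
import math
from collections import Counter

def train_tag_classifier(docs, tags):
    """docs: list of word lists (lyrics); tags: parallel list of tag labels.
    Returns a function mapping a word list to P(tag | words)."""
    classes = sorted(set(tags))
    prior = {c: tags.count(c) / len(tags) for c in classes}
    counts = {c: Counter() for c in classes}
    vocab = set()
    for words, c in zip(docs, tags):
        counts[c].update(words)
        vocab.update(words)
    total = {c: sum(counts[c].values()) for c in classes}

    def predict_proba(words):
        log_scores = {}
        for c in classes:
            s = math.log(prior[c])
            for w in words:
                # add-one (Laplace) smoothing over the training vocabulary
                s += math.log((counts[c][w] + 1) / (total[c] + len(vocab)))
            log_scores[c] = s
        # normalize in log space to get probabilities
        m = max(log_scores.values())
        exps = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(exps.values())
        return {c: e / z for c, e in exps.items()}

    return predict_proba
```

The returned probabilities are the kind of per-tag scores that the filtering stage can then threshold against the song's social tags.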
14

Social Cohesion Analysis of Networks: A Novel Method for Identifying Cohesive Subgroups in Social Hypertext

Chin, Alvin Yung Chian 23 September 2009 (has links)
Finding subgroups within social networks is important for understanding and possibly influencing the formation and evolution of online communities. This thesis addresses the problem of finding cohesive subgroups within social networks inferred from online interactions. The dissertation begins with a review of relevant literature and existing methods for finding cohesive subgroups. It then introduces the SCAN (Social Cohesion Analysis of Networks) methodology, which involves three steps: selecting the possible members (Select), collecting those members into possible subgroups (Collect) and choosing the cohesive subgroups over time (Choose). Social network analysis, clustering and partitioning, and similarity measurement are used to implement each of the steps. Two case studies, one involving the TorCamp Google group and the other involving YouTube vaccination videos, demonstrate how the methodology works in practice. Behavioural measures of Sense of Community and the Social Network Questionnaire are correlated with the SCAN results to demonstrate that the approach finds meaningful subgroups. Additional empirical findings are reported: betweenness centrality appears to be a useful filter for screening potential subgroup members, and members of cohesive subgroups have stronger community membership and influence than others. Subgroups identified using weighted-average hierarchical clustering are consistent with those identified using the more computationally expensive k-plex analysis. The value of similarity measurement in assessing subgroup cohesion over time is demonstrated, and possible problems with the use of Q modularity to identify cohesive subgroups are noted. Applications of this research to marketing, expertise location, and information search are also discussed.
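The betweenness-centrality filter used in the Select step can be illustrated with Brandes' algorithm on an unweighted interaction graph; how the graph is built from online interactions, and the screening threshold, are assumptions left open here.

```python
from collections import deque

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted, undirected graph.
    adj: dict mapping node -> set of neighbour nodes."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # single-source shortest paths (BFS), counting path multiplicities
        stack = []
        pred = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:
            v = q.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # back-propagate pair dependencies
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # undirected graph: every pair was counted twice
    return {v: c / 2 for v, c in bc.items()}
```

Nodes with high betweenness bridge parts of the network, which is one plausible reason they make a useful screen for potential subgroup members.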
16

Determination Of Weak Transmission Links By Cluster Analysis

Ertugrul, Hamza Oguz 01 November 2009 (has links) (PDF)
Due to faults and switching, transmission lines encounter power oscillations referred to as power swings. Although in most cases they do not lead to eventual instability, severe changes in power flows on the lines may cause impedance relays to operate incorrectly, leading to cascaded tripping of other lines. The out-of-step tripping function is employed in modern distance relays to distinguish such unstable swings, but setting its parameters and deciding which lines to trip require detailed dynamic power system modelling and analysis. The proposed method aims to determine possible out-of-step (OOS) locations on a power system without performing detailed dynamic simulations. The method is based on grouping the buses by statistical clustering analysis of the network impedance matrix. Inter-cluster lines are shown to be more likely to give rise to OOS conditions, as confirmed by dynamic simulations on the IEEE 39-bus test system.
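The idea can be sketched as follows: treat each bus's row of the impedance matrix as a feature vector, group buses by agglomerative clustering on those rows, and flag lines that cross cluster boundaries. The single-link linkage and the toy 4-bus system are illustrative assumptions, not the thesis's exact procedure.

```python
def weak_links(z, lines, k=2):
    """z: n x n impedance matrix (list of rows); lines: (bus_i, bus_j) pairs.
    Groups buses into k clusters by single-link agglomeration on row
    distances and returns the inter-cluster (candidate out-of-step) lines."""
    n = len(z)

    def row_dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(z[i], z[j])) ** 0.5

    clusters = [{i} for i in range(n)]
    while len(clusters) > k:
        # merge the two clusters with the smallest single-link distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(row_dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]
    label = {i: c for c, members in enumerate(clusters) for i in members}
    return [(i, j) for (i, j) in lines if label[i] != label[j]]
```

Buses that are electrically close have similar impedance-matrix rows, so the lines left between clusters are the electrically weak ties.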
17

An Efficient Hilbert Curve-based Clustering Strategy for Large Spatial Databases

Lu, Yun-Tai 25 July 2003 (has links)
Recently, with millions of databases in use, we need techniques that can automatically transform stored data into useful information and knowledge. Data mining is the technique of analyzing data to discover previously unknown information, and spatial data mining is the branch of data mining that deals with spatial data. In spatial data mining, clustering is one of the useful techniques for discovering interesting patterns in the underlying data objects. The clustering problem is: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than to data points in different clusters. Cluster analysis has been widely applied to many areas such as medicine, social studies, bioinformatics, map regions and GIS. In recent years, many researchers have focused on finding efficient methods for the clustering problem. In general, clustering algorithms can be classified into four approaches: partitioning, hierarchical, density-based, and grid-based. The k-means algorithm, based on the partitioning approach, is probably the most widely applied clustering method, but it has major drawbacks: it is difficult to determine the parameter k to represent "natural" clusters, it is only suitable for convex spherical clusters, and its high computational complexity makes it unable to handle large databases. Therefore, in this thesis, we present an efficient clustering algorithm for large spatial databases that combines the hierarchical approach with the grid-based approach. We apply the grid-based approach because it is efficient for large spatial databases, and the hierarchical approach to find the genuine clusters by repeatedly combining grid blocks. Basically, we make use of the Hilbert curve to provide a way to linearly order the points of a grid.
Note that the Hilbert curve is a kind of space-filling curve, i.e., a continuous path which passes through every point in a space once, forming a one-to-one correspondence between the coordinates of the points and their one-dimensional sequence numbers on the curve. The goal of using a space-filling curve is to preserve locality: points which are close in 2-D space should be stored close together in the linear order. This kind of mapping also minimizes disk access effort and provides high speed for clustering. The new algorithm requires only one input parameter and supports the user in determining an appropriate value for it. Our simulations show that the proposed algorithm has shorter execution time than other algorithms for large databases, and as the number of data points increases, its execution time grows slowly. Moreover, our algorithm can deal with clusters of arbitrary shapes that the k-means algorithm cannot discover.
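The coordinate-to-curve mapping can be sketched with the classic bitwise conversion; sorting grid cells by this index gives the locality-preserving linear order described above (the grid resolution is an assumption of the sketch).

```python
def xy2d(n, x, y):
    """Index of grid cell (x, y) along the Hilbert curve of an n x n grid
    (n must be a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the recursion lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_order(cells, n):
    """Linearly order grid cells so that nearby cells tend to stay adjacent."""
    return sorted(cells, key=lambda c: xy2d(n, c[0], c[1]))
```

Consecutive cells along the resulting order are always grid neighbours, which is exactly the property that keeps close points in nearby disk blocks.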
18

Robust clustering algorithms

Gupta, Pramod 05 April 2011 (has links)
One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, from computational biology to the social sciences to computer vision, in part because they are simple and their output is easy to interpret. However, many of these algorithms lack performance guarantees when the data is noisy, incomplete or has outliers, which is the case for most real-world data; it is well known that standard linkage algorithms perform extremely poorly in the presence of noise. In this work we propose two new robust algorithms for bottom-up agglomerative clustering and give formal theoretical guarantees for their robustness. We show that our algorithms cluster accurately in cases where the data satisfies a number of natural properties and where traditional agglomerative algorithms fail. We also extend our algorithms, with similar guarantees, to an inductive setting in which we randomly choose a small subset of points from a much larger instance space, generate a hierarchy over this sample, and then insert the remaining points into it to obtain a hierarchy over the entire instance space. Finally, a systematic experimental analysis of various linkage algorithms on a variety of real-world data sets shows that our algorithms handle various forms of noise much better than other hierarchical algorithms.
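The inductive extension can be sketched independently of the particular robust linkage: cluster a small random sample, then insert each remaining point into the sample cluster it is closest to on average. The average-distance insertion rule and the 1-D data are illustrative assumptions, not the algorithms proposed here.

```python
import random

def inductive_clusters(points, cluster_fn, sample_size, seed=0):
    """Cluster only a random sample with cluster_fn (any flat clustering of
    a point list), then attach every remaining point to the sample cluster
    with the smallest average distance to it."""
    rng = random.Random(seed)
    sample = set(rng.sample(points, sample_size))
    base = cluster_fn(sorted(sample))  # list of lists of sample points
    out = [list(cl) for cl in base]
    for p in (q for q in points if q not in sample):
        j = min(range(len(base)),
                key=lambda i: sum(abs(p - q) for q in base[i]) / len(base[i]))
        out[j].append(p)
    return out
```

Only the sample incurs the full clustering cost; insertion of the remaining points is linear in the sample size per point, which is what makes the inductive setting attractive for large instance spaces.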
19

Statistical Analysis of Operational Data for Manufacturing System Performance Improvement

Wang, Zhenrui January 2013 (has links)
The performance of a manufacturing system relies on four types of elements: operators, machines, the computer system and the material handling system. To ensure the performance of these elements, operational data containing various aspects of information are collected for monitoring and analysis. This dissertation focuses on operator performance evaluation and machine failure prediction. The proposed research is motivated by the following challenges in analyzing operational data: (i) the complex relationships between the variables, (ii) the implicit information important to failure prediction, and (iii) data with outliers and missing or erroneous measurements. To overcome these challenges, the following research has been conducted. To compare operator performance, a methodology combining regression modeling and multiple-comparison techniques is proposed. The regression model quantifies and removes the complex effects of other impacting factors on operator performance. A robust zero-inflated Poisson (ZIP) model is developed to reduce the impact of excessive zeros and outliers in the performance metric, i.e. the number of defects (NoD), on the regression analysis. The model residuals are plotted in non-parametric statistical charts for performance comparison, and the estimated model coefficients are used to identify under-performing machines. To detect temporal patterns in operational data sequences, an algorithm is proposed for detecting interval-based asynchronous periodic patterns (APP); it detects patterns effectively and efficiently through a modified clustering step and a convolution-based template matching method. To predict machine failures from covariates with erroneous measurements, a new method is proposed for statistical inference of the proportional hazards model under a mixture of classical and Berkson errors.
The method estimates the model coefficients with an expectation-maximization (EM) algorithm whose expectation step is carried out by Monte Carlo simulation, improving the accuracy of inference on machine failure probability. The research presented in this dissertation provides a package of solutions for improving manufacturing system performance. The effectiveness and efficiency of the proposed methodologies have been demonstrated and justified with both numerical simulations and real-world case studies.
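EM appears twice in this work (ZIP estimation and the measurement-error hazards model). As a compact illustration of the E/M alternation, here is EM for a plain zero-inflated Poisson, where the expectation step has a closed form; the thesis's hazards model is the harder case that needs the Monte Carlo version. This sketch is not the robust ZIP regression itself.

```python
import math

def fit_zip(counts, iters=200):
    """EM for a zero-inflated Poisson: returns (pi, lam), where pi is the
    probability of the structural-zero state and lam the Poisson mean."""
    pi = 0.5
    lam = max(sum(counts) / len(counts), 1e-6)
    for _ in range(iters):
        # E-step: probability that each observed zero is a structural zero
        z = [pi / (pi + (1 - pi) * math.exp(-lam)) if y == 0 else 0.0
             for y in counts]
        # M-step: reweighted closed-form updates
        pi = sum(z) / len(counts)
        lam = sum(counts) / (len(counts) - sum(z))
    return pi, lam
```

Each iteration fills in the latent zero-state memberships (E-step) and then maximizes the complete-data likelihood given those memberships (M-step), exactly the alternation the Monte Carlo EM performs when the expectation has no closed form.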
20

Stability Selection of the Number of Clusters

Reizer, Gabriella v 18 April 2011 (has links)
Selecting the number of clusters is one of the greatest challenges in cluster analysis. In this thesis, we propose a variety of stability selection criteria based on cross-validation for determining the number of clusters. Clustering stability measures the agreement between clusterings obtained by applying the same clustering algorithm to multiple independent and identically distributed samples; we propose to measure it by the correlation between two clustering functions. These criteria are motivated by the concept of clustering instability proposed by Wang (2010), which is based on a form of clustering distance. The effectiveness and robustness of the proposed methods are numerically demonstrated on a variety of simulated and real-world samples.
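A cross-validation flavour of this can be sketched in one dimension: fit the same k-means procedure on two independent half-samples, extend each clustering to the full data, and measure agreement on pairwise co-memberships. The 1-D setting and the co-membership agreement measure are simplifications of the correlation criteria proposed here.

```python
import random

def kmeans1d(xs, k, iters=25):
    """Tiny 1-D Lloyd's k-means; returns the cluster centers."""
    centers = random.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

def assign(xs, centers):
    return [min(range(len(centers)), key=lambda i: abs(x - centers[i])) for x in xs]

def stability(xs, k, reps=10):
    """Mean pairwise co-membership agreement between clusterings fitted on
    two independent half-samples and evaluated on the full data."""
    pairs = [(i, j) for i in range(len(xs)) for j in range(i + 1, len(xs))]
    total = 0.0
    for _ in range(reps):
        la = assign(xs, kmeans1d(random.sample(xs, len(xs) // 2), k))
        lb = assign(xs, kmeans1d(random.sample(xs, len(xs) // 2), k))
        agree = sum((la[i] == la[j]) == (lb[i] == lb[j]) for i, j in pairs)
        total += agree / len(pairs)
    return total / reps
```

When k matches the true structure, independent subsamples induce nearly identical partitions and the score approaches one; an over- or under-specified k splits or merges groups inconsistently across subsamples and scores lower.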
