Global ETD Search

1	Overlapping clustering Krumpelman, Chase Serhur 13 December 2010 (has links) Analysis of large collections of data has become inescapable in many areas of scientific and commercial endeavor. As the size and dimensionality of these collections exceed the pattern recognition capability of the human mind, computational analysis tools become a necessity for interpretation. Clustering algorithms, which aim to find interesting groupings within collections of data, are one such tool. Each algorithm incorporates into its design an inherent definition of “interesting” intended to capture nonrandom data groupings likely to have some interpretation to human users. Most existing algorithms include as part of their definition of “interesting” an assumption that each data point can belong at most to one grouping. While this assumption allows for algorithmic convenience and ease of analysis, it is often an artificial imposition on true underlying data structure. The idea of allowing points to belong to multiple groupings - known as “overlapping” or “multiple membership” clustering - has emerged in several domains in ad hoc solutions lacking conceptual unity in approach, interpretation, and analysis. This dissertation proposes general, domain-independent elucidations and practical techniques which address each of these. We begin by positing overlapping clustering’s role specifically, and clustering’s role in general, as assistive technologic tools allowing human minds to represent and interpret structures in data beyond the capability of our innate senses. With this guiding purpose clarified, we provide a catalog of existing techniques. We then address the issue of objectively comparing the results of different algorithms, specifically examining the previously defined Omega index, as well as multiple membership generalizations of normalized mutual information. Following that comparison, we propose a novel approach to comparing clusterings called cluster alignment. 
By combining a sorting algorithm with a greedy matching algorithm, we produce comparably organized membership matrices and a means for both numerically and visually comparing multiple-membership assignments. With overlapping clustering’s purpose defined, and the means to analyze results, we move on to presenting algorithms for efficiently discovering overlapping clusters in data. First, we present a generalization of one of the common themes in the ad hoc approaches: additive clustering. Starting with a previously developed structural model of additive clustering, we generalize it to be applicable to any regular exponential family distribution thereby extending its utility into several domains, notably high-dimensional sparse domains including text and recommender systems. Finally, we address overlapping clustering by examining the properties of data in similarity spaces. We develop a probabilistic generative model of overlapping data in similarity spaces, and then develop two conceptual approaches to discovering overlapping clustering in similarity spaces. The first of these is the conceptual multiple-membership generalization of hierarchical agglomerative clustering, and the second is an iterative density hill-climbing algorithm. / text Clustering algorithms
2	Voting in clustering and finding the number of clusters Dimitriadou, Evgenia, Weingessel, Andreas, Hornik, Kurt January 1999 (has links) (PDF) In this paper we present an unsupervised algorithm which performs clustering given a data set and which can also find the number of clusters existing in it. This algorithm consists of two techniques. The first, the voting technique, allows us to combine several runs of clustering algorithms, with the number of clusters predefined, resulting in a common partition. We introduce the idea that there are cases where an input point has a structure with a certain degree of confidence and may belong to more than one cluster with a certain degree of "belongingness". The second part consists of an index measure which receives the results of every voting process for different numbers of clusters and makes the decision in favor of one. This algorithm is a complete clustering scheme which can be applied to any clustering method and to any type of data set. Moreover, it helps us to overcome instabilities of the clustering algorithms and to improve the ability of a clustering algorithm to find structures in a data set. / Series: Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
3	Apriori approach to graph-based clustering of text documents Hossain, Mahmud Shahriar. January 2008 (has links) (PDF) Thesis (MS)--Montana State University--Bozeman, 2008. / Typescript. Chairperson, Graduate Committee: Rafal A. Angryk. Includes bibliographical references (leaves 59-65).
4	Parallel Computing in Statistical-Validation of Clustering Algorithm for the Analysis of High throughput Data Atlas, Mourad 12 May 2005 (has links) Currently, clustering applications use classical methods to partition a set of data (or objects) into a set of meaningful sub-classes, called clusters. A cluster is therefore a collection of objects which are “similar” to one another, and thus can be treated collectively as one group, and are “dissimilar” to the objects belonging to other clusters. However, there are a number of problems with clustering. Among them, as mentioned in [Datta03], dealing with a large number of dimensions and a large number of data items can be problematic because of computational time. In this thesis, we investigate all clustering algorithms used in [Datta03] and we present a parallel solution to minimize the computational time. We apply parallel programming techniques to the statistical algorithms as a natural extension to sequential programming technique using R. The proposed parallel model has been tested on a high throughput dataset. It is microarray data on the transcriptional profile during sporulation in budding yeast. It contains more than 6,000 genes. Our evaluation includes clustering algorithm scalability pertaining to datasets with varying dimensions, the speedup factor, and the efficiency of the parallel model over the sequential implementation. Our experiments show that the gene expression data follow the pattern predicted in [Datta03], namely that Diana appears to be a solid performer; the group means for each cluster also coincide with those in [Datta03]. We show that our parallel model is applicable to the clustering algorithms and more useful in applications that deal with high throughput data, such as gene expression data. Statistical Validation Clustering Algorithms High Throughput Data Parallel computing Mathematics
5	Automatic clustering with application to time dependent fault detection in chemical processes Labuschagne, Petrus Jacobus 06 July 2009 (has links) Fault detection and diagnosis presents a big challenge within the petrochemical industry. The annual economic impact of unexpected shutdowns is estimated to be $20 billion. Assistive technologies will help with the effective detection and classification of the faults causing these shutdowns. Clustering analysis presents a form of unsupervised learning which identifies data with similar properties. Various algorithms were used and included hard-partitioning algorithms (K-means and K-medoid) and fuzzy algorithms (Fuzzy C-means, Gustafson-Kessel and Gath-Geva). A novel approach to the clustering problem of time-series data is proposed. It exploits the time dependency of variables (time delays) within a process engineering environment. Before clustering, process lags are identified via signal cross-correlations. From this, a least-squares optimal signal time shift is calculated. Dimensional reduction techniques are used to visualise the data. Various nonlinear dimensional reduction techniques have been proposed in recent years. These techniques have been shown to outperform their linear counterparts on various artificial data sets including the Swiss roll and helix data sets but have not been widely implemented in a process engineering environment. The algorithms that were used included linear PCA and standard Sammon and fuzzy Sammon mappings. Time shifting resulted in better clustering accuracy on a synthetic data set than traditional clustering techniques, based on quantitative criteria (including Partition Coefficient, Classification Entropy, Partition Index, Separation Index, Dunn’s Index and Alternative Dunn Index). However, the time shifted clustering results of the Tennessee Eastman process were not as good as the non-shifted data. Copyright / Dissertation (MEng)--University of Pretoria, 2009. 
/ Chemical Engineering / unrestricted Time delay estimation Dimensional reduction Clustering algorithms Fault detection UCTD
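The lag-identification step mentioned in entry 5 (finding process time delays from signal cross-correlations) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `best_lag`, the `max_lag` search window, and the equal-length assumption are all hypothetical choices made for illustration, and the least-squares refinement the abstract describes is not reproduced here.

```python
def best_lag(x, y, max_lag):
    """Return the shift k (in samples) that maximises the cross-correlation
    sum(x[t] * y[t + k]) of the mean-centred signals.

    Assumes x and y are equally long sequences of numbers (an assumption of
    this sketch). A positive result means y lags behind x by k samples.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    xc = [a - mean_x for a in x]
    yc = [b - mean_y for b in y]

    best_k, best_score = 0, float("-inf")
    for k in range(-max_lag, max_lag + 1):
        # Only sum over indices where both samples exist.
        score = sum(xc[t] * yc[t + k] for t in range(n) if 0 <= t + k < n)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Hypothetical example: y is x delayed by three samples.
x = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 0, 0]
lag = best_lag(x, y, max_lag=5)
```

Once a lag is estimated for each variable, the signals could be shifted into alignment before clustering, which is the general idea the abstract describes.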
6	Using density-based clustering to improve skeleton embedding in the Pinocchio automatic rigging system Wang, Haolei January 1900 (has links) Master of Science / Department of Computing and Information Sciences / William H. Hsu / Automatic rigging is a targeting approach that takes a 3-D character mesh and an adapted skeleton and automatically embeds it into the mesh. Automating the embedding step provides a savings over traditional character rigging approaches, which require manual guidance, at the cost of occasional errors in recognizing parts of the mesh and aligning bones of the skeleton with it. In this thesis, I examine the problem of reducing such errors in an auto-rigging system and apply a density-based clustering algorithm to correct errors in a particular system, Pinocchio (Baran & Popovic, 2007). I show how the density-based clustering algorithm DBSCAN (Ester et al., 1996) is able to filter out some impossible vertices to correct errors at character extremities (hair, hands, and feet) and those resulting from clothing that hides extremities such as legs. Computer graphics Automatic rigging Skeleton embedding Character modeling Clustering algorithms Computer Science (0984)
7	A Genetic Algorithm that Exchanges Neighboring Centers for Fuzzy c-Means Clustering Chahine, Firas Safwan 01 January 2012 (has links) Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithms, but has a major shortcoming: it is extremely sensitive to the choice of initial centers used to seed the algorithm. Unless k-means is carefully initialized, it converges to an inferior local optimum and results in poor quality partitions. Developing improved methods for selecting initial centers for k-means is an active area of research. Genetic algorithms (GAs) have been successfully used to evolve a good set of initial centers. Among the most promising GA-based methods are those that exchange neighboring centers between candidate partitions in their crossover operations. K-means is best suited to work when datasets have well-separated non-overlapping clusters. Fuzzy c-means (FCM) is a popular variant of k-means that is designed for applications when clusters are less well-defined. Rather than assigning each point to a unique cluster, FCM determines the degree to which each point belongs to a cluster. Like k-means, FCM is also extremely sensitive to the choice of initial centers. Building on GA-based methods for initial center selection for k-means, this dissertation developed an evolutionary program for center selection in FCM called FCMGA. The proposed algorithm utilized region-based crossover and other mechanisms to improve the GA. To evaluate the effectiveness of FCMGA, three independent experiments were conducted using real and simulated datasets. 
The results from the experiments demonstrate the effectiveness and consistency of the proposed algorithm in identifying better quality solutions than extant methods. Moreover, the results confirmed the effectiveness of region-based crossover in enhancing the search process for the GA and the convergence speed of FCM. Taken together, findings in these experiments illustrate that FCMGA was successful in solving the problem of initial center selection in partitional clustering algorithms. Clustering Algorithms Fuzzy c-Means Genetic Algorithms k-Means Pattern Recognition Computer Sciences
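Entry 7 notes that FCM assigns each point a degree of membership in every cluster rather than a single label. A minimal sketch of the standard fuzzy c-means membership update is given below; it does not represent FCMGA or its genetic operators. The function name `fcm_memberships` and the default fuzzifier `m=2.0` are assumptions made for illustration.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update for fixed centers.

    Returns one row per point; each row holds the point's degree of
    membership in every cluster, and each row sums to 1. The fuzzifier m
    must be greater than 1 (m=2.0 is an assumed, commonly used default).
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: share full
            # membership equally among those centers to avoid dividing by 0.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        row = [
            1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
            for d_j in dists
        ]
        memberships.append(row)
    return memberships


# Hypothetical example: two centers on a line, three points between them.
centers = [(0.0, 0.0), (4.0, 0.0)]
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
u = fcm_memberships(points, centers)
```

In the example, the point at distance 1 from the first center and 3 from the second receives memberships of 0.9 and 0.1, while the midpoint is split evenly. Because these memberships depend directly on where the centers sit, a poor initial choice of centers propagates into the partition, which is the sensitivity the abstract sets out to address.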
8	On the re-creation of site-specific directional wave conditions Draycott, Samuel Thomas January 2017 (has links) Wave tank tests facilitate the understanding of how complex sea conditions influence the dynamics of man-made structures. If a potential deployment location is known, site data can be used to improve the relevance and realism of the test conditions, thus helping de-risk device development. Generally this data is difficult to obtain and even if available is used simplistically due to established practices and limitations of test facilities. In this work four years of buoy data from the European Marine Energy Centre is characterised and simulated at the FloWave Ocean Energy Research Facility; a circular combined wave-current test tank. Particular emphasis is placed on the characterisation and validation processes, aiming to preserve spectral and directional complexity of the site, whilst proving that the defined representative conditions can be effectively created. When creating representative site-specific sea states, particular focus is given to the application of clustering algorithms, which enable the entire spectral (frequency or directional) form to be considered in the characterisation process. This enables the true complex nature of the site to be considered in the data reduction process. Prior to generating and measuring the resulting sea states, issues with scaling are explored, the facility itself is characterised, and emphasis is placed on developing measurement strategies for the validation of directional spectra. Wave gauge arrays are designed and used to characterise various elements of the FloWave tank, including reflections, spatio-temporal variability and wave shape. A new method for directional spectrum reconstruction (SPAIR) is also developed, enabling more effective measurement and validation of the resulting directional sea states. 
Through comparison with other characterisation methods, inherent method-induced trade-offs are understood, and it is found that there is no absolute favourable approach, necessitating an application specific procedure. Despite this, a useful set of 'generic' sea states are created for the simulation of both production and extreme conditions. For sea state measurement, the SPAIR method is proven to be significantly more effective than current approaches, reducing errors and introducing additional capability. This method is used in combination with a directional wave gauge array to effectively measure, correct, and validate the resulting directional wave conditions. It is also demonstrated that site-specific wave-current scenarios can be effectively re-created, thus demonstrating that truly complex ocean conditions can be simulated at FloWave. This ability, along with the considered characterisation approach used, means that representative site-specific sea states can be simulated with confidence, increasing the realism of the test environment and helping de-risk device development.
9	Characterizing Popularity Dynamics of User-generated Videos: A Category-based Study of YouTube 2013 August 1900 (has links) Understanding the growth pattern of content popularity has become a subject of immense interest to Internet service providers, content makers and on-line advertisers. This understanding is also important for the sustainable development of content distribution systems. As an approach to comprehend the characteristics of this growth pattern, a significant amount of research has been done in analyzing the popularity growth patterns of YouTube videos. Unfortunately, no work has been done that intensively investigates the popularity patterns of YouTube videos based on video object category. In this thesis, an in-depth analysis of the popularity pattern of YouTube videos is performed, considering the categories of videos. Metadata and request patterns were collected by employing category-specific YouTube crawlers. The request patterns were observed for a period of five months. Results confirm that the time varying popularity of different YouTube categories is conspicuously different, in spite of having sets of categories with very similar viewing patterns. In particular, News and Sports exhibit similar growth curves, as do Music and Film. While for some categories views at early ages can be used to predict future popularity, for some others predicting future popularity is a challenging task and requires more sophisticated techniques, e.g., time-series clustering. The outcomes of these analyses are instrumental towards designing a reliable workload generator, which can be further used to evaluate different caching policies for YouTube and similar sites. In this thesis, workload generators for four of the YouTube categories are developed. Performance of these workload generators suggests that a complete category-specific workload generator can be developed using time-series clustering. 
Patterns of users' interaction with YouTube videos are also analyzed from a dataset collected in a local network. This shows the possible ways of improving the performance of Peer-to-Peer video distribution technique along with a new video recommendation method. YouTube categories growth patterns of on-line content clustering algorithms K-SC algorithm workload generation.
10	Service Discovery Oriented Clustering For Mobile And Adhoc Networks Bulut, Gulsah 01 May 2010 (has links) (PDF) Adhoc networks do not depend on any fixed infrastructure. The most outstanding features of adhoc networks are non-centralized structure and dynamic topology change due to high mobility. Since these dynamics of mobile adhoc networks complicate reaching the resources in the network, service discovery is an important part of constructing stand-alone and self-configurable mobile adhoc networks. The heterogeneity of the devices and limited resources such as battery also add difficulty to service discovery. Due to the volatile nature of adhoc networks, service discovery algorithms proposed for mobile and adhoc networks suffer from some problems. Scalability becomes a problem when the service discovery is based on flooding messages over the network. Furthermore, the high traffic which occurs due to the message exchange between network nodes makes the communication almost impossible. Partitioning a network into sub-networks is an efficient way of handling the scalability problem. In this thesis, a mobility based service discovery algorithm for clustered MANET is presented. The algorithm has two main parts. The first partitions the MANET into sub-networks, named “clustering”. The second part performs efficient discovery of services on the overall network. The clustering algorithm used in this study is an enhanced version of DMAC (Distributed Mobility Adaptive Clustering, which is one of the golden algorithms of the wireless network clustering area). To be fast and flexible in the service discovery layer, a simple and fast-responding algorithm is implemented. Integration of the two algorithms enables devices to be mobile in the network
