41 |
Development of Real-Time Predictive Analytics Tools for Small Water Distribution System
Woo, Hyoungmin (January 2017)
No description available.
|
42 |
Learning Statistical and Geometric Models from Microarray Gene Expression Data
Zhu, Yitan (01 October 2009)
In this dissertation, we propose and develop innovative data modeling and analysis methods for extracting meaningful and specific information about disease mechanisms from microarray gene expression data.
To provide a high-level overview of gene expression data for easy and insightful understanding of data structure, we propose a novel statistical data clustering and visualization algorithm that is comprehensively effective for multiple clustering tasks and that overcomes some major limitations of existing clustering methods. The proposed clustering and visualization algorithm performs progressive, divisive hierarchical clustering and visualization, supported by hierarchical statistical modeling, supervised/unsupervised informative gene/feature selection, supervised/unsupervised data visualization, and user/prior knowledge guidance through human-data interactions, to discover cluster structure within complex, high-dimensional gene expression data.
For the purpose of selecting suitable clustering algorithm(s) for gene expression data analysis, we design an objective and reliable clustering evaluation scheme to assess the performance of clustering algorithms by comparing their sample clustering outcome to phenotype categories. Using the proposed evaluation scheme, we compared the performance of our newly developed clustering algorithm with those of several benchmark clustering methods, and demonstrated the superior and stable performance of the proposed clustering algorithm.
To identify the underlying active biological processes that jointly form the observed biological event, we propose a latent linear mixture model that quantitatively describes how the observed gene expressions are generated by a process of mixing the latent active biological processes. We prove a series of theorems to show the identifiability of the noise-free model. Based on relevant geometric concepts, convex analysis and optimization, gene clustering, and model stability analysis, we develop a robust blind source separation method that fits the model to the gene expression data and subsequently identifies the underlying biological processes and their activity levels under different biological conditions.
Based on the experimental results obtained on cancer, muscle regeneration, and muscular dystrophy gene expression data, we believe that the research presented in this dissertation not only contributes to the engineering research areas of machine learning and pattern recognition, but also provides novel and effective solutions that can potentially solve many biomedical research problems and improve the understanding of disease mechanisms. / Ph. D.
|
43 |
Statistical Modeling and Analysis of Bivariate Spatial-Temporal Data with the Application to Stream Temperature Study
Li, Han (04 November 2014)
Water temperature is a critical factor for the quality and biological condition of streams. Among the various factors affecting stream water temperature, air temperature is one of the most important. To appropriately quantify the relationship between water and air temperatures over a large geographic region, it is important to accommodate the spatial and temporal information of the stream temperature. In this dissertation, I develop several statistical modeling techniques for analyzing bivariate spatial-temporal data in a stream temperature study.
In the first part, I focus the analysis on the individual stream. A time varying coefficient model (VCM) is used to study the relationship between air temperature and water temperature for each stream. The time varying coefficient model enables dynamic modeling of the relationship, and therefore can be used to enhance the understanding of water and air temperature relationships. The proposed model is applied to 10 streams in Maryland, West Virginia, Virginia, North Carolina and Georgia using daily maximum temperatures. The VCM approach increases the prediction accuracy by more than 50% compared to the simple linear regression model and the nonlinear logistic model.
The VCM that describes the relationship between water and air temperatures for each stream is represented by slope and intercept curves from the fitted model. In the second part, I consider water and air temperatures for different streams that are spatially correlated. I focus on clustering multiple streams by using intercept and slope curves estimated from the VCM. Spatial information is incorporated to make clustering results geographically meaningful. I further propose a weighted distance as a dissimilarity measure for streams, which provides a flexible framework to interpret the clustering results under different weights. Real data analysis shows that streams in the same cluster share similar geographic features such as solar radiation, percent forest and elevation.
In the third part, I develop a spatial-temporal VCM (STVCM) to deal with missing data. The STVCM takes both spatial and temporal variation of water temperature into account. I develop a novel estimation method that emphasizes the time effect and treats the space effect as a varying coefficient for the time effect. A simulation study shows that the performance of the STVCM on missing data imputation is better than several existing methods such as the neural network and the Gaussian process. The STVCM is also applied to all 156 streams in this study to obtain a complete data record. / Ph. D.
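The time varying coefficient idea in the first part of the abstract above can be illustrated with a minimal sketch. The abstract does not state how the coefficients are estimated, so the kernel-weighted least squares used here is an assumption, and the function name `varying_coefficients`, the Gaussian kernel, and the `bandwidth` parameter are all hypothetical. The sketch estimates an intercept and a slope for water temperature as a function of air temperature at one target time, giving more weight to observations close to that time.

```python
from math import exp


def varying_coefficients(times, air, water, t0, bandwidth):
    """Estimate intercept a(t0) and slope b(t0) in water = a(t) + b(t) * air.

    Assumption: a Gaussian kernel centred at t0 weights each observation,
    and a weighted simple linear regression is solved in closed form.
    """
    weights = [exp(-0.5 * ((t - t0) / bandwidth) ** 2) for t in times]
    total = sum(weights)
    # Weighted means of predictor (air) and response (water).
    mean_air = sum(w * x for w, x in zip(weights, air)) / total
    mean_water = sum(w * y for w, y in zip(weights, water)) / total
    # Weighted variance and covariance give the local slope.
    sxx = sum(w * (x - mean_air) ** 2 for w, x in zip(weights, air))
    sxy = sum(w * (x - mean_air) * (y - mean_water)
              for w, x, y in zip(weights, air, water))
    slope = sxy / sxx
    intercept = mean_water - slope * mean_air
    return intercept, slope
```

Evaluating the function on a grid of `t0` values would trace out intercept and slope curves of the kind the second part of the abstract uses for clustering streams.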
|
44 |
AN EMPIRICAL STUDY OF AN INNOVATIVE CLUSTERING APPROACH TOWARDS EFFICIENT BIG DATA ANALYSIS
Bowers, Jacob Robert (01 May 2024)
The dramatic growth of big data presents formidable challenges for traditional clustering methodologies, which often prove unwieldy and computationally expensive when processing vast quantities of data. This study explores a novel clustering approach exemplified by Sow & Grow, a density-based clustering algorithm akin to DBSCAN that was developed to address the issues inherent to big data by enabling end users to strategically allocate computational resources toward regions of noted interest. By seeding points and subsequently fostering their growth into coherent clusters, this method significantly reduces computational waste by ignoring insignificant segments of the dataset and provides information relevant to the end user. The implementation of this algorithm developed as part of this research shows promising results in various experimental settings, exhibiting notable speedup over conventional clustering methods. Additionally, the incorporation of dynamic load balancing further enhances the algorithm's performance, ensuring optimal resource utilization across parallel processing threads when handling superclusters or unbalanced data distributions. Through a detailed study of the theoretical underpinnings of this innovative clustering approach and the limitations of traditional clustering techniques, this research demonstrates the practical utility of the Sow & Grow algorithm in expediting the clustering process while providing results pertinent to end users.
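The seed-then-grow idea described above can be illustrated with a rough sketch. This is not the Sow & Grow implementation from the thesis; it is a generic DBSCAN-style region growing restricted to user-chosen seed points, and the function name `grow_from_seeds` and the parameters `eps` and `min_pts` are assumptions borrowed from DBSCAN conventions. Points that no seed reaches are never expanded, which is where the computational saving would come from.

```python
from collections import deque
from math import dist


def grow_from_seeds(points, seeds, eps, min_pts):
    """Grow one cluster per seed index; unreached points keep the label None."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force range query; a spatial index would replace this at scale.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for cluster_id, seed in enumerate(seeds):
        if labels[seed] is not None:
            continue  # this seed was already absorbed by an earlier cluster
        labels[seed] = cluster_id
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            nearby = neighbors(i)
            if len(nearby) < min_pts:
                continue  # not a core point, so it does not expand further
            for j in nearby:
                if labels[j] is None:
                    labels[j] = cluster_id
                    queue.append(j)
    return labels
```

With a single seed placed in one dense region, only that region is labeled and every other point is left untouched, so the cost scales with the size of the regions of interest rather than with the whole dataset.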
|
45 |
Constructing topic-based Twitter lists
De Villiers, Francois (03 1900)
Thesis (MSc)--Stellenbosch University, 2013. / ENGLISH ABSTRACT: The amount of information that users of social networks consume on a daily
basis is steadily increasing. The resulting information overload is usually
associated with a loss of control over the management of information sources,
leaving users feeling overwhelmed.
To address this problem, social networks have introduced tools with which
users can organise the people in their networks. However, these tools do not
integrate any automated processing. Twitter has lists that can be used to
organise people in the network into topic-based groups. This feature is a
powerful organisation tool that has two main obstacles to widespread user
adoption: the initial setup time and continual curation.
In this thesis, we investigate the problem of constructing topic-based Twitter
lists. We identify two subproblems, an unsupervised and a supervised task,
that need to be considered when tackling this problem. These subproblems
correspond to a clustering and classification approach that we evaluate on
Twitter data sets.
The clustering approach is evaluated using multiple representation techniques,
similarity measures and clustering algorithms. We show that it is possible to incorporate a Twitter user’s social graph data into the clustering approach
to find topic-based clusters. The classification approach is implemented,
from a statistical relational learning perspective, with kLog. We show that
kLog can use a user’s tweet content and social graph data to perform accurate
topic-based classification. We conclude that it is feasible to construct useful
topic-based Twitter lists with either approach. / AFRIKAANSE OPSOMMING: Die stroom van inligting wat sosiale-netwerk gebruikers op ’n daaglikse basis
verwerk, is aan die groei. Vir baie gebruikers, skep hierdie oordosis inligting ’n
gevoel dat hulle beheer oor hul inligtingsbronne verloor.
As ’n oplossing, het sosiale-netwerke meganismes geïmplementeer waarmee
gebruikers die inligting in hul netwerk kan bestuur. Hierdie meganismes is
nie selfwerkend nie, maar kort toevoer van die gebruiker. Twitter het lyste
geïmplementeer waarmee gebruikers ander mense in hul sosiale-netwerk kan
groepeer. Lyste is ’n kragtige organiserings meganisme, maar tog vind grootskaal
gebruik daarvan nie plaas nie. Gebruikers voel dat die opstelling te veel
tyd in beslag neem en die onderhoud daarvan te veel moeite is.
Hierdie tesis ondersoek die probleem om onderwerp-gerigte Twitter lyste te
skep. Ons identisifeer twee subprobleme wat aangepak word deur ’n nie-toesig
en ’n toesighoudende metode. Hierdie twee metodes hou verband met trosvorming
en klassifikasie onderskeidelik. Ons evalueer beide die trosvorming en
klassifikasie op twee Twitter datastelle. Die trosvorming metode word geëvalueer
deur te kyk na verskillende voorstellingstegnieke, eendersheid maatstawwe
en trosvorming algoritmes. Ons wys dat dit moontlik is om ’n gebruiker se Twitter netwerkdata in te
sluit om onderwerp-gerigte groeperinge te vind. Die klassifikasie benadering
word geïmplementeer met kLog, vanuit ’n statistiese relasionele leertoerie
perspektief. Ons wys dat akkurate onderwerp-gerigte klassifikasie resultate
verkry kan word met behulp van gebruikers se tweet-inhoud en sosiale-netwerk
data. In beide gevalle wys ons dat dit moontlik is om onderwerp-gerigte Twitter
lyste, met goeie resultate, te bou.
|
46 |
Přístupy k shlukování funkčních dat / Approaches to Functional Data Clustering
Pešout, Pavel (January 2007)
Classification is a very common task in information processing and an important problem in many sectors of science and industry. When data are measured as a function of a variable such as time, the most commonly used algorithms may not capture the individual shapes properly, because they consider only the selected measurements. For this reason, this work focuses on techniques that directly address the problems of clustering curves and classifying new individuals. The main goals are to develop alternative methodologies by extending various statistical approaches, to consolidate already established algorithms, to present modified forms of them adapted to the demands of clustering, and to compare several efficient curve clustering methods through extensive experiments on simulated data. Finally, on the basis of these experiments, the practical utility of the methods is compared comprehensively. The proposed clustering algorithms are based on two principles. First, it is assumed that the set of trajectories can be modelled probabilistically as sequences of points generated from a finite mixture model with regression components; hence density-based clustering methods using maximum likelihood estimation are investigated to find the most homogeneous partitioning. Attention is paid both to the maximum likelihood approach, which treats the cluster memberships as model parameters, and to the probabilistic model with the iterative Expectation-Maximization algorithm, which treats them as random variables. To deal with the hidden data problem, both Gaussian and less conventional gamma mixtures are considered and adapted for use in two dimensions. To cope with data with high variability within each subpopulation, a two-level random effects regression mixture is introduced, which allows an individual to vary from the template for its group.
Second, the well-known K-Means algorithm is applied to the estimated regression coefficients. Attention is devoted to the task of optimal data fitting, because K-Means is not invariant to linear transformations. To overcome this problem, integrating the clustering with Markov Chain Monte Carlo approaches is suggested. Furthermore, this work deals with functional discriminant analysis, including linear and quadratic scores and their modified probabilistic forms using random mixtures. As with K-Means, it is shown how to apply Fisher's method of canonical scores to the regression coefficients. Experiments on simulated datasets demonstrate the performance of all the methods mentioned and make it possible to choose those that are most efficient in terms of results and time. A considerable benefit is the creation of new recommended application procedures. The implementation is done in Mathematica 4.0. Finally, the possibilities offered by the development of curve clustering algorithms in broad research areas of modern science, such as neurology, genome studies, and speech and image recognition systems, are examined, and future investigation incorporating ubiquitous computing is not ruled out. Utility in economics is illustrated by an application to claims analysis of some life insurance products. The goals of the thesis have been achieved.
|
47 |
Um modelo dinâmico de clusterização de dados aplicado na detecção de intrusão
Furukawa, Rogério Akiyoshi (25 April 2003)
Atualmente, a segurança computacional vem se tornando cada vez mais necessária devido ao grande crescimento das estatísticas que relatam os crimes computacionais. Uma das ferramentas utilizadas para aumentar o nível de segurança é conhecida como Sistemas de Detecção de Intrusão (SDI). A flexibilidade e usabilidade destes sistemas têm contribuído, consideravelmente, para o aumento da proteção dos ambientes computacionais. Como grande parte das intrusões seguem padrões bem definidos de comportamento em uma rede de computadores, as técnicas de classificação e clusterização de dados tendem a ser muito apropriadas para a obtenção de uma forma eficaz de resolver este tipo de problema. Neste trabalho será apresentado um modelo dinâmico de clusterização baseado em um mecanismo de movimentação dos dados. Apesar de ser uma técnica de clusterização de dados aplicável a qualquer tipo de dados, neste trabalho, este modelo será utilizado para a detecção de intrusão. A técnica apresentada neste trabalho obteve resultados de clusterização comparáveis com técnicas tradicionais. Além disso, a técnica proposta possui algumas vantagens sobre as técnicas tradicionais investigadas, como realização de clusterizações multi-escala e não necessidade de determinação do número inicial de clusters / Nowadays, computer security is becoming more and more necessary due to the large growth in the statistics that report computer crimes. One of the tools used to increase the level of security is known as an Intrusion Detection System (IDS). The flexibility and usability of these systems have contributed considerably to increasing the protection of computing environments. As a large part of intrusions follow well-defined behavior patterns in a computer network, data classification and clustering techniques tend to be very appropriate for obtaining an effective solution to this problem. In this work, a dynamic clustering model based on a data movement mechanism is presented.
Although the clustering technique is applicable to any type of data, in this work the model is applied to intrusion detection. The technique presented in this work obtained clustering results comparable to those obtained by traditional techniques. In addition, the proposed technique has some advantages over the traditional techniques investigated, such as multi-resolution clustering and no need to know the number of clusters in advance.
|
48 |
Fuzzy C-Means Clustering Approach to Design a Warehouse Layout
Naik, Vaibhav C (08 July 2004)
Products are allocated in a warehouse according to various storage policies, which are broadly classified into three main categories: dedicated storage, randomized storage, and class-based storage. Under a dedicated storage policy, a product is assigned a designated slot, while under a randomized storage policy an incoming product is randomly assigned a storage location close to the input/output point. Finally, class-based storage is a mixed policy in which products are randomly assigned within their fixed class. The dedicated storage policy is the most commonly used in practice. When designing a large warehouse layout, product information in terms of throughput and storage level is either uncertain or not available to the warehouse designer. Hence it is not possible to locate products on the basis of the throughput-to-storage ratio method used in the above-mentioned storage location policies. To handle this uncertainty in product data, we propose a fuzzy C-means clustering (FCM) approach.
This research is mainly directed at improving efficiency (distance or time traveled) by designing a fuzzy-logic-based warehouse with a large number of products. The proposed approach looks for similarity in the product data to form clusters. The obtained clusters can be directly utilized to develop the warehouse layout. Further, we investigate whether the FCM approach can take into account other factors such as product size, similarity and/or characteristics to generate layouts that are not only efficient in terms of reducing the distance traveled to store/retrieve products but also effective in terms of retrieval time, space utilization and/or better material control.
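For readers unfamiliar with fuzzy C-means, the clustering step mentioned above can be sketched as follows. This is the textbook FCM iteration, not the thesis's warehouse-specific procedure; the function name `fuzzy_c_means`, the fuzzifier default `m=2.0`, the fixed iteration count, and the idea of describing each product by a (throughput, storage) pair are assumptions made for illustration. Each product receives a degree of membership in every cluster instead of a single hard assignment.

```python
import random
from math import dist


def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Return (centers, memberships) after a fixed number of FCM iterations."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Random initial memberships, normalised so each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(iters):
        # Centers are means of the points weighted by membership ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][j] for i in range(n)) / total
                for j in range(d)))
        # Memberships are proportional to distance ** (-2 / (m - 1)).
        for i in range(n):
            inv = [max(dist(points[i], ck), 1e-12) ** (-2.0 / (m - 1.0))
                   for ck in centers]
            s = sum(inv)
            u[i] = [v / s for v in inv]
    return centers, u
```

A product whose membership row is close to uniform sits between clusters, which is one way uncertain product data could be made visible to a layout designer before slots are assigned.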
|
49 |
Greedy Representative Selection for Unsupervised Data Analysis
Helwa, Ahmed Khairy Farahat (January 2012)
In recent years, the advance of information and communication technologies has allowed the storage and transfer of massive amounts of data. The availability of this overwhelming amount of data stimulates a growing need to develop fast and accurate algorithms to discover useful information hidden in the data. This need is even more acute for unsupervised data, which lacks information about the categories of different instances.
This dissertation addresses a crucial problem in unsupervised data analysis, which is the selection of representative instances and/or features from the data. This problem can be generally defined as the selection of the most representative columns of a data matrix, which is formally known as the Column Subset Selection (CSS) problem. Algorithms for column subset selection can be directly used for data analysis or as a pre-processing step to enhance other data mining algorithms, such as clustering. The contributions of this dissertation can be summarized as outlined below.
First, a fast and accurate algorithm is proposed to greedily select a subset of columns of a data matrix such that the reconstruction error of the matrix based on the subset of selected columns is minimized. The algorithm is based on a novel recursive formula for calculating the reconstruction error, which allows the development of time and memory-efficient algorithms for greedy column subset selection. Experiments on real data sets demonstrate the effectiveness and efficiency of the proposed algorithms in comparison to the state-of-the-art methods for column subset selection.
Second, a kernel-based algorithm is presented for column subset selection. The algorithm greedily selects representative columns using information about their pairwise similarities. The algorithm can also calculate a Nyström approximation for a large kernel matrix based on the subset of selected columns. In comparison to different Nyström methods, the greedy Nyström method has been empirically shown to achieve significant improvements in approximating kernel matrices, with minimum overhead in run time.
Third, two algorithms are proposed for fast approximate k-means and spectral clustering. These algorithms employ the greedy column subset selection method to embed all data points in the subspace of a few representative points, where the clustering is performed. The approximate algorithms run much faster than their exact counterparts while achieving comparable clustering performance.
Fourth, a fast and accurate greedy algorithm for unsupervised feature selection is proposed. The algorithm is an application of the greedy column subset selection method presented in this dissertation. Similarly, the features are greedily selected such that the reconstruction error of the data matrix is minimized. Experiments on benchmark data sets show that the greedy algorithm outperforms state-of-the-art methods for unsupervised feature selection in the clustering task.
Finally, the dissertation studies the connection between the column subset selection problem and other related problems in statistical data analysis, and it presents a unified framework which allows the use of the greedy algorithms presented in this dissertation to solve different related problems.
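The greedy column selection described in the first contribution above can be illustrated with a simple sketch. It is not the dissertation's recursive, memory-efficient algorithm; it is a plain deflation-based version that keeps the full residual matrix, and the function name `greedy_column_subset` and the list-of-rows matrix layout are assumptions. At each step it picks the column whose selection reduces the Frobenius reconstruction error the most, then removes that column's direction from the residual.

```python
def greedy_column_subset(A, k):
    """Greedily pick up to k column indices of A (a list of rows)."""
    m, n = len(A), len(A[0])
    E = [list(row) for row in A]  # residual not yet explained by the selection
    selected = []
    for _ in range(k):
        best_j, best_gain = None, 0.0
        for j in range(n):
            if j in selected:
                continue
            col = [E[i][j] for i in range(m)]
            norm2 = sum(v * v for v in col)
            if norm2 < 1e-12:
                continue  # column already reconstructed by earlier picks
            # Error reduction from adding column j: ||E^T e_j||^2 / ||e_j||^2.
            proj = [sum(E[i][l] * col[i] for i in range(m)) for l in range(n)]
            gain = sum(p * p for p in proj) / norm2
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break  # residual is numerically zero
        selected.append(best_j)
        # Deflate: remove the chosen residual direction from every column.
        col = [E[i][best_j] for i in range(m)]
        norm2 = sum(v * v for v in col)
        coef = [sum(E[i][l] * col[i] for i in range(m)) / norm2
                for l in range(n)]
        for i in range(m):
            for l in range(n):
                E[i][l] -= col[i] * coef[l]
    return selected
```

Because the residual is updated after every pick, a column that is a multiple of an already selected one has zero residual norm and is never chosen twice in effect, which matches the intent of selecting representative rather than redundant columns.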
|
50 |
Εξόρυξη γνώσης από δεδομένα
Οικονομάκης, Εμμανουήλ Κ. (20 October 2009)
Στη συγκεκριμένη διπλωματική εργασία αναλύεται το πρόβλημα του εντοπισμού ομάδων σε σύνολα δεδομένων (ομαδοποίηση δεδομένων).
Δίνεται μια σύντομη ανασκόπηση των μεθόδων που χρησιμοποιούνται σήμερα στην ομαδοποίηση δεδομένων και ιδιαίτερα στην ολοένα και αυξανόμενη χρήση Εξελικτικών Αλγόριθμων (ΕΑ) στην ομαδοποίηση. Οι ΕΑ έχουν αποδειχθεί ιδιαίτερα αποτελεσματικοί σε μια πληθώρα προβλημάτων βελτιστοποίησης. Η χρήση ΕΑ είναι αναμενόμενη, καθώς η ομαδοποίηση δεδομένων μπορεί να εκφραστεί και ως πρόβλημα
βελτιστοποίησης. Επιπρόσθετα, παρουσιάζεται μια μέθοδος αντιμετώπισης της (συνήθως) μεγάλης διάστασης των προβλημάτων ομαδοποίησης, κάτι που επιβαρύνει ιδιαίτερα τους ΕΑ.
Αναλυτικότερα, το πρώτο μέρος της διπλωματικής εργασίας παρέχει μια σφαιρική εικόνα του προβλήματος της ομαδοποίησης καθώς και των κατηγοριών των αλγορίθμων, που έχουν προταθεί για τον εντοπισμό ομάδων. Επιπλέον, παρουσιάζονται δομές δεδομένων που χρησιμοποιούνται από αλγόριθμους ομαδοποίησης για την επιτάχυνσή τους, όπως είναι τα Range Trees και τα BBD Trees.
Εν συνεχεία, παρουσιάζονται αναλυτικά οι ΕΑ και ο τρόπος εφαρμογής τους σε προβλήματα ομαδοποίησης δεδομένων, αναλύοντας τρόπους αναπαράστασης του προβλήματος ομαδοποίησης, έτσι ώστε να είναι δυνατή η χρήση ΕΑ καθώς επίσης και οι μορφές των αντικειμενικών συναρτήσεων. Εισάγεται μια νέα προσέγγιση της εφαρμογής των ΕΑ σε προβλήματα ομαδοποίησης με σκοπό την πλήρη αποδέσμευση της διαδικασίας από εκτιμήσεις του πλήθους των ομάδων. Η διπλωματική εργασία κλείνει με τη σύγκριση υπάρχοντων αλγορίθμων ομαδοποίησης, που εφαρμόζουν την καθιερωμένη προσέγγιση της εφαρμογής των ΕΑ σε προβλήματα ομαδοποίησης, ένα νέο τρόπο εφαρμογής των ΕΑ, καθώς και κλασικούς αλγόριθμους όπως ο k-means και ο DBSCAN. Η σύγκριση γίνεται σε τεχνητά σύνολα δεδομένων, το κάθε ένα με διαφορετικές ιδιαιτερότητες. / In this master's thesis, the problem of finding groups in data sets (data clustering) is analyzed. Data clustering methods in general and, more specifically, methods based on Evolutionary Algorithms (EAs) are briefly reviewed. EAs have proven to be effective in a large number of optimization problems. Since data clustering can be formulated as an optimization problem, EAs can be utilized. Additionally, a method of reducing the (usually) large dimensionality of clustering problems is presented, since high dimensionality hinders the performance and stability of EAs.
The first part of this thesis provides an introduction to clustering as well as to existing clustering algorithms. Additionally, data structures used by clustering algorithms, such as Range trees and BBD trees, are described. After that, EAs are described thoroughly, along with approaches for applying them to clustering problems, by analyzing ways of representing a clustering problem so that an EA can be used, as well as possible objective functions. A new approach to applying EAs to clustering problems is introduced, in an attempt to automatically determine the number of clusters present in a data set. Finally, an existing EA-based method and well-known clustering algorithms such as k-means and DBSCAN are compared to the proposed approach. This comparison is made on artificial data sets, each one with its own characteristics.
|