61.
Enhancements to the Microbial Source Tracking Process Through the Utilization of Clustering and K-nearest Clusters Algorithm. Lai, Tram B, 01 March 2018
Bacterial contamination in water sources is a serious health risk, and the sources of the bacterial strains must be identified to keep people safe. This project is the result of a collaborative effort at Cal Poly to develop a new library-dependent Microbial Source Tracking (MST) method for determining sources of fecal contamination in the environment. The library used in this study is the Cal Poly Library of Pyroprints (CPLOP). Building CPLOP requires students to collect fecal samples from a multitude of sources in the San Luis Obispo area. A novel method developed by the biologists at Cal Poly, called pyroprinting, is then applied to two intergenic regions of the E. coli isolates from these samples to obtain their fingerprints, which are stored in the CPLOP database. In our study, we consider any E. coli samples whose fingerprints match above a certain threshold to belong to the same bacterial strain. However, no MST method has yet produced an acceptable level of accuracy. In this thesis, we propose a two-step MST classifier that combines two previous works developed specifically for CPLOP: pyro-DBSCAN and k-RAP. We call our classifier HAP, the Hybrid Algorithm for Pyroprints. The classifier works as follows. Given an unknown isolate, the first step performs clustering on the known isolates in the library and compares the unknown isolate against the resulting clusters. If the isolate falls into a cluster, its classification is the dominant species of that cluster. Otherwise, we apply the k-Nearest Clusters Algorithm to the isolate to determine its final classification. Ultimately, HAP provides a set of 16 decision strategies that identify the host species of an unknown sample with high accuracy.
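The two-step flow described above can be sketched as follows. This is only an illustrative sketch: the similarity measure (Pearson correlation over pyroprint vectors), the greedy threshold grouping standing in for pyro-DBSCAN, and all data are assumptions, not the thesis's actual implementation.

```python
from collections import Counter

def pearson(a, b):
    # Pearson correlation between two pyroprint vectors (an assumed similarity)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def cluster_by_threshold(isolates, tau):
    """Greedy single-link grouping: isolates whose pyroprint similarity
    exceeds tau share a cluster (a toy stand-in for pyro-DBSCAN)."""
    clusters = []
    for iso in isolates:
        for c in clusters:
            if any(pearson(iso["pyroprint"], o["pyroprint"]) >= tau for o in c):
                c.append(iso)
                break
        else:
            clusters.append([iso])
    return clusters

def classify_hap(unknown, clusters, tau, k=2):
    """Step 1: if the unknown falls inside a cluster, return that cluster's
    dominant species. Step 2: otherwise vote among the k nearest clusters."""
    best = [(max(pearson(unknown, o["pyroprint"]) for o in c), c) for c in clusters]
    sim, cluster = max(best, key=lambda t: t[0])
    if sim >= tau:
        return Counter(o["species"] for o in cluster).most_common(1)[0][0]
    top = sorted(best, key=lambda t: -t[0])[:k]
    votes = Counter(o["species"] for _, c in top for o in c)
    return votes.most_common(1)[0][0]
```

Varying the similarity threshold and the fallback vote yields the kind of decision-strategy family the thesis enumerates.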
62.
Clustering Gaussian Processes: A Modified EM Algorithm for Functional Data Analysis with Application to British Columbia Coastal Rainfall Patterns. Paton, Forrest, January 2018
Functional data analysis is a statistical framework where data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. In this thesis Gaussian processes, a generalization of the multivariate normal distribution to function space, are used. When multiple processes are observed on a comparable interval, clustering them into sub-populations can provide significant insights. A modified EM algorithm is developed for clustering processes. The model presented clusters processes based on how similar their underlying covariance kernels are; in other words, cluster formation arises from modelling correlation between inputs (as opposed to magnitude between process values). The method is applied to both simulated data and British Columbia coastal rainfall patterns. Results show that clustering yearly processes can accurately classify extreme weather patterns. / Thesis / Master of Science (MSc)
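A minimal sketch of the idea of clustering curves by their covariance kernel follows. It is not the thesis's modified EM: the M-step here is restricted to updating mixing weights over a fixed, hypothetical grid of candidate RBF length-scales, which keeps the example short while preserving the "cluster by kernel similarity, not by magnitude" behaviour.

```python
import numpy as np

def rbf_kernel(x, ell, var=1.0, noise=1e-2):
    # squared-exponential kernel over inputs x, plus a noise jitter
    d = x[:, None] - x[None, :]
    return var * np.exp(-0.5 * (d / ell) ** 2) + noise * np.eye(len(x))

def log_mvn(y, K):
    # zero-mean Gaussian-process log likelihood of one observed curve y
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(y) * np.log(2 * np.pi))

def em_gp_clusters(x, Y, ells, n_iter=20):
    """Toy EM: assign each curve (row of Y) to the candidate length-scale in
    `ells` whose kernel explains it best; only mixing weights are re-estimated."""
    pi = np.full(len(ells), 1.0 / len(ells))
    Ks = [rbf_kernel(x, ell) for ell in ells]
    for _ in range(n_iter):
        # E-step: responsibility of each kernel for each curve
        ll = np.array([[log_mvn(y, K) for K in Ks] for y in Y])
        R = np.exp(ll - ll.max(axis=1, keepdims=True)) * pi
        R /= R.sum(axis=1, keepdims=True)
        # M-step (restricted): update mixing weights only
        pi = R.mean(axis=0)
    return R.argmax(axis=1)
```

Because the likelihood depends only on the kernel, two curves with very different magnitudes but the same smoothness land in the same cluster.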
63.
Finding Succinct Representations For Clusters. Gupta, Aparna, 09 July 2019
Improving the explainability of results from machine learning methods has become an important research goal. In this thesis, we study the problem of making clusters more interpretable using recent approaches by Davidson et al. and Sambaturu et al. based on succinct representations of clusters. Given a set of objects S, a partition of S (into clusters), and a universe T of descriptors such that each element in S is associated with a subset of descriptors, the goal is to find a representative set of descriptors for each cluster such that those sets are pairwise disjoint and the total size of all the representatives is at most a given budget. Since this problem is NP-hard in general, Sambaturu et al. have developed a suite of approximation algorithms for it. We also show applications to explaining clusters of genomic sequences that represent different threat levels. / Master of Science / Improving the explainability of results from machine learning methods has become an important research goal. Clustering is a commonly used machine learning technique which is performed on a variety of datasets. In this thesis, we study the problem of making clusters more interpretable, and try to answer whether it is possible to explain clusters using a set of attributes that were not used while generating those clusters.
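The problem statement above can be made concrete with a small greedy sketch. To be clear, this is not the approximation algorithm of Sambaturu et al.: it simply picks, at each step, the unused descriptor that covers the most uncovered objects of some cluster, keeping the representative sets pairwise disjoint and respecting the budget.

```python
def succinct_representatives(clusters, tags, budget):
    """Greedy sketch of the succinct-representation problem.
    clusters: cluster name -> list of objects; tags: object -> set of descriptors.
    Returns cluster name -> chosen descriptor set (pairwise disjoint)."""
    used = set()                           # descriptors already assigned
    reps = {c: set() for c in clusters}
    covered = {c: set() for c in clusters}
    spent = 0
    while spent < budget:
        best = None
        for c, objs in clusters.items():
            for t in set().union(*(tags[o] for o in objs)) - used:
                gain = sum(1 for o in objs if t in tags[o] and o not in covered[c])
                if gain and (best is None or gain > best[0]):
                    best = (gain, c, t)
        if best is None:                   # nothing left that helps
            break
        _, c, t = best
        used.add(t)
        reps[c].add(t)
        spent += 1
        covered[c] |= {o for o in clusters[c] if t in tags[o]}
    return reps
```

Since the exact problem is NP-hard, a greedy pass like this gives intuition for why budgeted, disjoint descriptor selection is hard: a descriptor spent on one cluster becomes unavailable to every other.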
64.
Design and implementation of scalable hierarchical density-based clustering. Dhandapani, Sankari, 09 November 2010
Clustering is a useful technique that divides data points into groups, also known as clusters, such that the data points of the same cluster exhibit similar properties. Typical clustering algorithms assign each data point to at least one cluster. However, in practical datasets such as microarray gene datasets, only a subset of the genes is highly correlated, and the dataset is often polluted with a huge volume of irrelevant genes. In such cases, it is important to ignore the poorly correlated genes and cluster only the highly correlated ones.
Automated Hierarchical Density Shaving (Auto-HDS) is a non-parametric density-based technique that partitions only the relevant subset of the dataset into multiple clusters while pruning the rest. Auto-HDS performs a hierarchical clustering that identifies dense clusters of different densities and finds a compact hierarchy of the clusters identified. Key features of Auto-HDS include selection and ranking of clusters using a custom stability criterion and a topologically meaningful 2D projection and visualization of the clusters discovered in the higher-dimensional original space. However, a key limitation of Auto-HDS is that it requires O(n^2) storage and O(n^2 log n) computation, so it scales only to a few tens of thousands of points. In this thesis, two extensions to Auto-HDS are presented for lower-dimensional datasets that generate clusterings identical to Auto-HDS but scale to much larger datasets. We first introduce Partitioned Auto-HDS, which provides a significant reduction in time and space complexity and makes it possible to generate the Auto-HDS cluster hierarchy on much larger datasets with hundreds of millions of data points. Then, we describe Parallel Auto-HDS, which takes advantage of the inherent parallelism in Partitioned Auto-HDS to scale to even larger datasets without a corresponding increase in actual run time when a group of processors is available for parallel execution. Partitioned Auto-HDS is implemented on top of GeneDIVER, a previously existing Java-based streaming implementation of Auto-HDS, and thus retains all the key features of Auto-HDS, including ranking, automatic selection of clusters, and 2D visualization of the discovered cluster topology.
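The core "shaving" and partitioning ideas can be illustrated with a deliberately tiny 1-D sketch; the real Auto-HDS operates hierarchically in high dimensions, so everything below (1-D points, a single shaving pass, cell width equal to eps) is a simplifying assumption.

```python
from collections import defaultdict

def density_shave(points, eps, min_nbrs):
    """One shaving pass: keep only dense points (at least min_nbrs neighbours
    within eps); sparse points are pruned rather than forced into a cluster."""
    return [p for p in points
            if sum(1 for q in points if abs(p - q) <= eps) - 1 >= min_nbrs]

def density_shave_partitioned(points, eps, min_nbrs):
    """Same result, but points are binned into cells of width eps first, so a
    point is compared only with its own and adjacent cells instead of the
    O(n^2) all-pairs scan; this is the idea behind Partitioned Auto-HDS for
    low-dimensional data."""
    cells = defaultdict(list)
    for p in points:
        cells[int(p // eps)].append(p)
    dense = []
    for c in list(cells):
        cand = cells[c] + cells.get(c - 1, []) + cells.get(c + 1, [])
        dense += [p for p in cells[c]
                  if sum(1 for q in cand if abs(p - q) <= eps) - 1 >= min_nbrs]
    return dense
```

Because any eps-neighbour lies at most one cell away, the partitioned version returns exactly the same dense set while touching far fewer pairs, which is the scaling argument the thesis develops.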
65.
Characterization of multi-subject networks in fMRI: contribution of clustering based on functional connectivity. Emeriau, Samuel, 16 December 2011
Our understanding of brain function has been in constant evolution since the rise of the neurosciences. New imaging modalities have made it possible to reveal an architecture of the brain organized in complex networks. The purpose of this work is to develop a method that identifies the most representative networks of a group of subjects in functional MRI. In a first step, I developed a data-reduction method based on clustering, introducing a new characterization of functional information: the connectivity profile. This profile reduces the bias induced by the noise present in functional MRI data and, unlike classical inferential methods, requires no a priori information about the data. In a second step, I developed a method that identifies networks common to a group of subjects while taking into account inter-subject spatial and functional variability. The resulting networks can be characterized both by their spatial distribution and by the connectivity links within them. The method also allows the networks of different groups of subjects to be compared, highlighting the involvement of different networks under different stimulations or pathological states.
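The connectivity-profile idea (characterize each voxel by its vector of correlations with a set of reference time series, rather than by its raw signal) can be sketched briefly. The seed set and data here are hypothetical; the thesis's actual profile construction is more involved.

```python
import numpy as np

def connectivity_profiles(voxel_ts, seed_ts):
    """Each row of voxel_ts is one voxel's time series; the returned row i is
    voxel i's vector of Pearson correlations with every seed time series,
    i.e. a simple connectivity profile."""
    def z(a):
        # center and unit-normalize each row, so dot products are correlations
        a = a - a.mean(axis=1, keepdims=True)
        return a / np.linalg.norm(a, axis=1, keepdims=True)
    return z(voxel_ts) @ z(seed_ts).T
```

Clustering these profile rows, instead of the raw signals, groups voxels by *where* they connect, which is less sensitive to independent measurement noise in each voxel's own time series.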
66.
HIV mutation classification framework. Ozahata, Mina Cintho, 15 May 2014
Drugs used in HIV treatment aim to inhibit the action of the reverse transcriptase and protease proteins. Mutations in the sequences of these proteins can be related to drug resistance and can reduce treatment efficacy. Studying the virus genotype may help in choosing treatments specific to each patient, increasing the probability of success. As genotyping tests have become more accessible, a great number of virus sequences, containing a large volume of information, are now available. Patterns of mutation occurrence are one example of the information contained in these sequences, and they are important because they are related to drug resistance. One way to understand these mutation patterns is to apply clustering and biclustering techniques, which search for clusters or biclusters of data with properties in common and are suited to cases where there is little prior information and few hypotheses about the data. It may thus be possible to find mutation patterns in protease and reverse transcriptase sequences and relate them to drug resistance. A few systems attempt to predict drug resistance or susceptibility from the sequence of the virus; however, because of the great complexity of this relationship, the link between combinations of mutations and levels of phenotypic resistance still needs to be elucidated. Accordingly, the main contribution of this work is the development of a framework based on applying the K-means and Bimax algorithms to binary-encoded reverse transcriptase and protease sequences from HIV-infected patients. This work also introduces a visual representation of the clusters and biclusters based on microarray-style displays, suitable for large data volumes, to facilitate the visualization of the extracted information and the characterization of the clusters and biclusters in the disease domain.
67.
Study and development of fuzzy clustering algorithms in centralized and distributed scenarios. Vendramin, Lucas, 05 July 2012
Data clustering is a fundamental problem in data mining, which consists basically of partitioning the data into groups of objects that are more similar (or related) to one another than to objects in other groups. Traditional approaches, however, assume that each object belongs exclusively to a single cluster. This assumption is not realistic in many practical applications, in which groups of objects follow statistical distributions with some degree of overlap. Fuzzy clustering algorithms can naturally deal with problems of this nature. The literature on fuzzy clustering is extensive: many algorithms exist, each more (or less) appropriate for certain scenarios, for example, finding clusters of different shapes or operating on data described by attributes of different types. Additionally, there are scenarios in which the data may be distributed across different sites. In these scenarios, the goal of a clustering algorithm is to find a structure that represents the data across the sites without the need for centralized transmission, storage, or processing of those data. Such algorithms are known as distributed clustering algorithms. This work studies and improves centralized and distributed fuzzy clustering algorithms from the literature, identifying the main characteristics, advantages, disadvantages, and most appropriate scenarios for each, including time, space, and communication complexity analyses for the distributed algorithms.
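The best-known fuzzy clustering algorithm, fuzzy c-means, makes the "degrees of membership" idea concrete: each point belongs to every cluster with a membership in [0, 1], and memberships per point sum to 1. A minimal sketch (standard FCM update rules, with an assumed fuzzifier m = 2):

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=50, seed=0):
    """Standard fuzzy c-means: alternate membership-weighted center updates
    with the closed-form membership update u_ij proportional to d_ij^(-2/(m-1))."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # random initial memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)      # rows sum to one
    return U, centers
```

Points near a center get memberships close to 1 in that cluster, while points between centers keep split memberships, which is exactly what hard k-means cannot express.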
68.
A clustering scheme for large high-dimensional document datasets. Chen, Jing-wen, 09 August 2007
Document clustering methods have received growing attention. Because of the high dimensionality and large size of the data, clustering methods usually take a long time to compute. We propose a scheme that makes a clustering algorithm much faster than the original. We partition the whole dataset into several parts. First, we cluster one of these parts. Then, according to the labels obtained from clustering, we reduce the number of features by a certain ratio. We add another part of the data, convert it to the lower-dimensional space, and cluster again, repeating until all partitions are used. According to the experimental results, this scheme can run about twice as fast as the original clustering method.
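The partition-then-shrink loop can be sketched as follows. The feature score used here (total term frequency in the clustered data so far) is a guessed stand-in, since the abstract does not specify the reduction criterion, and `cluster_fn` is a caller-supplied placeholder for the underlying clustering algorithm.

```python
import numpy as np

def incremental_cluster(partitions, k, keep_ratio, cluster_fn):
    """Sketch of the scheme: cluster the first chunk, cut the feature set by
    keep_ratio, project the next chunk onto the surviving features, re-cluster,
    and repeat until all partitions are consumed."""
    X = partitions[0]
    keep = np.arange(X.shape[1])          # surviving feature ids (original indices)
    for i, part in enumerate(partitions):
        if i > 0:
            X = np.vstack([X, part[:, keep]])     # project new chunk, then append
        labels = cluster_fn(X, k)
        scores = X.sum(axis=0)                    # toy feature score (assumed)
        n_keep = max(1, int(len(scores) * keep_ratio))
        sel = np.sort(np.argsort(scores)[::-1][:n_keep])
        keep, X = keep[sel], X[:, sel]
    return labels, keep
```

Each round both shrinks the dimensionality seen by `cluster_fn` and grows the number of documents, which is where the reported speedup would come from.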
69.
Clustering Methods and Their Applications to Adolescent Healthcare Data. Mayer-Jochimsen, Morgan, 01 April 2013
Clustering is a mathematical method of data analysis that identifies trends by efficiently separating data into a specified number of clusters, making it widely applicable to questions about the interrelatedness of data. Two methods of clustering are considered here. K-means clustering defines clusters in relation to the centroid, or center, of a cluster. Spectral clustering establishes connections between all of the data points to be clustered, then eliminates those connections that link dissimilar points; this is represented as an eigenvector problem whose solution is given by the eigenvectors of the normalized graph Laplacian. Spectral clustering forms groups so that the similarity between points of the same cluster is stronger than the similarity between points of different clusters. K-means and spectral clustering are used to analyze adolescent data from the 2009 California Health Interview Survey. Differences were observed between the results of the two methods on 3294 individuals and 22 health-related attributes: K-means clustered the adolescents by exercise, poverty, and variables related to psychological health, while the spectral clustering groups were informed by smoking, alcohol use, low exercise, psychological distress, low parental involvement, and poverty. We offer some conjectures about this difference, observe characteristics of the clustering methods, and comment on the viability of spectral clustering for healthcare data.
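The normalized-Laplacian eigenvector step described above can be sketched directly. This is a generic spectral clustering sketch, not the thesis's analysis pipeline; for simplicity the final grouping for k = 2 uses the sign of the Fiedler-vector entries rather than a k-means pass on the embedding.

```python
import numpy as np

def spectral_cluster(W, k):
    """Embed points with the bottom eigenvectors of the normalized graph
    Laplacian L = I - D^{-1/2} W D^{-1/2}; for k = 2, split by the sign of
    the second eigenvector's entries."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    emb = vecs[:, 1:k]                   # skip the trivial first eigenvector
    return (emb[:, 0] > 0).astype(int) if k == 2 else emb
```

On a similarity graph with two densely connected groups joined by weak cross-links, the second eigenvector is nearly constant within each group with opposite signs, so the sign split recovers the groups.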
70.
A Theoretical Study of Clusterability and Clustering Quality. Ackerman, Margareta, January 2007
Clustering is a widely used technique, with applications ranging from data mining, bioinformatics, and image analysis to marketing, psychology, and city planning. Despite the practical importance of clustering, there is very limited theoretical analysis of the topic. We make a step towards building theoretical foundations for clustering by carrying out an abstract analysis of two central concepts in clustering: clusterability and clustering quality. We compare a number of notions of clusterability found in the literature. While all these notions attempt to measure the same property, and all appear to be reasonable, we show that they are pairwise inconsistent. In addition, we give the first computational complexity analysis of a few notions of clusterability.
In the second part of the thesis, we discuss how the quality of a given clustering can be defined and measured. Users often need to compare the quality of clusterings obtained by different methods. Perhaps more importantly, users need to determine whether a given clustering is good enough to be used in further data mining analysis. We analyze what a measure of clustering quality should look like by introducing a set of requirements ('axioms') for clustering quality measures, and we propose a number of clustering quality measures that satisfy these requirements.