  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
61

Clustering Gaussian Processes: A Modified EM Algorithm for Functional Data Analysis with Application to British Columbia Coastal Rainfall Patterns

Paton, Forrest January 2018 (has links)
Functional data analysis is a statistical framework in which data are assumed to follow some functional form. This method of analysis is commonly applied to time series data, where time, measured continuously or in discrete intervals, serves as the location for a function's value. In this thesis Gaussian processes, a generalization of the multivariate normal distribution to function space, are used. When multiple processes are observed on a comparable interval, clustering them into sub-populations can provide significant insights. A modified EM algorithm is developed for clustering processes. The model presented clusters processes based on how similar their underlying covariance kernels are; in other words, cluster formation arises from modelling correlation between inputs (as opposed to magnitude between process values). The method is applied to both simulated data and British Columbia coastal rainfall patterns. Results show that clustering yearly processes can accurately classify extreme weather patterns. / Thesis / Master of Science (MSc)
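The idea of grouping processes by covariance structure can be illustrated with a small sketch. Here a plain Frobenius-distance threshold between Gram matrices stands in for the thesis's likelihood-based modified-EM assignments, and the length-scales are invented for the example:

```python
import numpy as np

def rbf_kernel(x, length_scale):
    # Squared-exponential covariance on input locations x
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x = np.linspace(0.0, 1.0, 25)
# Gram matrices for four processes: two "rough" (short length-scale)
# and two "smooth" (long length-scale).
kernels = [rbf_kernel(x, ls) for ls in (0.05, 0.06, 0.5, 0.55)]

clusters = []
for K in kernels:
    for group in clusters:
        if np.linalg.norm(K - group[0]) < 2.0:  # similar covariance kernels
            group.append(K)
            break
    else:
        clusters.append([K])
```

With these length-scales the rough and smooth kernels fall into two separate groups; the modified EM algorithm replaces the hard threshold with soft, likelihood-weighted cluster responsibilities.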
62

Finding Succinct Representations For Clusters

Gupta, Aparna 09 July 2019 (has links)
Improving the explainability of results from machine learning methods has become an important research goal. In this thesis, we have studied the problem of making clusters more interpretable using a recent approach by Davidson et al. and Sambaturu et al. based on succinct representations of clusters. Given a set of objects S, a partition of S (into clusters), and a universe T of descriptors such that each element in S is associated with a subset of descriptors, the goal is to find a representative set of descriptors for each cluster such that those sets are pairwise disjoint and the total size of all the representatives is at most a given budget. Since this problem is NP-hard in general, Sambaturu et al. have developed a suite of approximation algorithms for it. We also show applications to explaining clusters of genomic sequences that represent different threat levels. / Master of Science / Improving the explainability of results from machine learning methods has become an important research goal. Clustering is a commonly used machine learning technique performed on a variety of datasets. In this thesis, we have studied the problem of making clusters more interpretable, and have tried to answer whether it is possible to explain clusters using a set of attributes that were not used while generating these clusters.
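A simple greedy baseline conveys the flavour of the problem (this is not the approximation algorithm of Sambaturu et al., and the cluster and descriptor names below are invented): repeatedly pick the unused descriptor covering the most still-uncovered objects of some cluster, keeping the per-cluster descriptor sets pairwise disjoint and the total number of descriptors within budget.

```python
def describe_clusters(clusters, descriptors, budget):
    # clusters: {cluster_id: set of objects}
    # descriptors: {tag: set of objects carrying that tag}
    used = set()
    reps = {c: [] for c in clusters}
    for _ in range(budget):           # each descriptor costs 1 unit of budget
        best = None                   # (gain, cluster, descriptor)
        for c, objects in clusters.items():
            covered = set().union(*(descriptors[t] for t in reps[c])) if reps[c] else set()
            for t, tagged in descriptors.items():
                if t in used:         # disjointness: a tag serves one cluster
                    continue
                gain = len((tagged & objects) - covered)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, c, t)
        if best is None:
            break
        _, c, t = best
        reps[c].append(t)
        used.add(t)
    return reps

clusters = {"c1": {1, 2, 3}, "c2": {4, 5}}
descriptors = {"sports": {1, 2, 3}, "finance": {4, 5}, "misc": {1, 4}}
reps = describe_clusters(clusters, descriptors, budget=2)
```

The greedy choice has no approximation guarantee here; the suite of algorithms studied in the thesis is what provides provable bounds.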
63

Design and implementation of scalable hierarchical density based clustering

Dhandapani, Sankari 09 November 2010 (has links)
Clustering is a useful technique that divides data points into groups, also known as clusters, such that the data points of the same cluster exhibit similar properties. Typical clustering algorithms assign each data point to at least one cluster. However, in practical datasets like microarray gene datasets, only a subset of the genes are highly correlated and the dataset is often polluted with a huge volume of genes that are irrelevant. In such cases, it is important to ignore the poorly correlated genes and cluster only the highly correlated ones. Automated Hierarchical Density Shaving (Auto-HDS) is a non-parametric density-based technique that partitions only the relevant subset of the dataset into multiple clusters while pruning the rest. Auto-HDS performs a hierarchical clustering that identifies dense clusters of different densities and finds a compact hierarchy of the clusters identified. Some of the key features of Auto-HDS include selection and ranking of clusters using a custom stability criterion and a topologically meaningful 2D projection and visualization of the clusters discovered in the higher-dimensional original space. However, a key limitation of Auto-HDS is that it requires O(n^2) storage and O(n^2 log n) computation, so it scales up to only a few tens of thousands of points. In this thesis, two extensions to Auto-HDS are presented for lower-dimensional datasets that can generate clustering identical to Auto-HDS but can scale to much larger datasets. We first introduce Partitioned Auto-HDS, which provides a significant reduction in time and space complexity and makes it possible to generate the Auto-HDS cluster hierarchy on much larger datasets with hundreds of millions of data points. 
Then, we describe Parallel Auto-HDS, which takes advantage of the inherent parallelism available in Partitioned Auto-HDS to scale to even larger datasets without a corresponding increase in actual run time when a group of processors is available for parallel execution. Partitioned Auto-HDS is implemented on top of GeneDIVER, a previously existing Java-based streaming implementation of Auto-HDS, and thus it retains all the key features of Auto-HDS including ranking, automatic selection of clusters, and 2D visualization of the discovered cluster topology. / text
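The core "density shaving" step can be sketched in a few lines (a naive O(n^2) toy in the spirit of HDS, nothing like the scalable implementations described above): discard points with too few neighbours within a radius eps, then report connected components of the surviving, dense points.

```python
from math import dist

def shave_and_cluster(points, eps, min_neighbors):
    # Shave: keep only locally dense points (subtract 1 to exclude self).
    dense = [p for p in points
             if sum(dist(p, q) <= eps for q in points) - 1 >= min_neighbors]
    # Cluster survivors: connected components of the eps-neighbour graph.
    clusters, seen = [], set()
    for i in range(len(dense)):
        if i in seen:
            continue
        stack, component = [i], []
        while stack:
            j = stack.pop()
            if j in seen:
                continue
            seen.add(j)
            component.append(dense[j])
            stack.extend(k for k in range(len(dense))
                         if k not in seen and dist(dense[j], dense[k]) <= eps)
        clusters.append(component)
    return clusters

# Two tight blobs survive the shaving; the isolated point is pruned.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (2.5, 9.0)]
clusters = shave_and_cluster(points, eps=0.5, min_neighbors=2)
```

Auto-HDS additionally varies the density level to build a hierarchy and ranks the resulting clusters by stability, which this flat sketch omits.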
64

Caractérisation des réseaux multi-sujets en IRMf : apport du clustering basé sur la connectivité fonctionnelle / Characterization of multi-subject networks in fMRI : contribution of clustering based on functional connectivity.

Emeriau, Samuel 16 December 2011 (has links)
La compréhension du fonctionnement cérébral est en constante évolution depuis l'essor des neurosciences. Les nouvelles modalités d'imagerie ont permis de mettre en évidence une architecture de notre cerveau en réseaux complexes. Mon travail a pour but de développer une méthode mettant en évidence les réseaux les plus représentatifs d'un groupe de sujets en IRM fonctionnelle. Dans un premier temps, j'ai développé une méthode de réduction des données basée sur le clustering. J'ai introduit une nouvelle caractérisation de l'information fonctionnelle par le profil de connectivité. Celui-ci permet de réduire le biais induit par le bruit présent au sein des données d'IRM fonctionnelle. De plus, ce profil ne nécessite pas d'a priori sur les données, contrairement aux méthodes inférentielles classiques. Dans un deuxième temps, j'ai développé une méthode qui permet l'identification de réseaux communs sur un groupe de sujets tout en prenant en compte les variabilités spatiales et fonctionnelles inter-sujets. Les réseaux obtenus peuvent ensuite être caractérisés par leur distribution spatiale mais également par les liens de connectivité se manifestant en leur sein. Cette méthode permet également la comparaison des réseaux de différents groupes de sujets et la mise en évidence de l'implication de réseaux différents en fonction de stimulations différentes ou d'un état pathologique. / The understanding of cerebral function has been in constant evolution since the rise of the neurosciences. New imaging modalities have made it possible to reveal an architecture of our brain as complex networks. The purpose of my work is to develop a method that finds the most representative networks of a group of subjects in functional MRI. In the first step, I developed a clustering-based method to reduce the size of fMRI data, and introduced a new characterization of functional information, the connectivity profile. This profile reduces the effect of the noise present in fMRI data and, unlike traditional inferential methods, requires no a priori information about the data. In the second step, I developed a method to identify networks common to a group of subjects while taking inter-subject spatial and functional variability into account. The networks obtained can then be characterized by their spatial organization and also by their inner connectivity links. This method also allows the comparison of networks across different groups of subjects, making it possible to highlight the involvement of different networks under different stimulations or pathological states.
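The connectivity-profile idea can be sketched as follows (my rendering of the concept, not the thesis code): each voxel is summarised not by its raw time course but by its vector of correlations with all other voxels, which is less sensitive to independent measurement noise than the raw signal.

```python
import numpy as np

def connectivity_profiles(time_courses):
    # time_courses: (n_voxels, n_timepoints) array; row i of the result
    # is the connectivity profile of voxel i.
    return np.corrcoef(time_courses)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 200)
signal = np.sin(t)
# Voxels 0 and 1 share a driving signal plus independent noise;
# voxel 2 is pure noise.
ts = np.stack([signal + 0.3 * rng.standard_normal(200),
               signal + 0.3 * rng.standard_normal(200),
               rng.standard_normal(200)])
profiles = connectivity_profiles(ts)
```

Voxels driven by the same underlying signal end up with similar profiles even though their raw time courses differ noise-wise, which is what makes the profile a more robust clustering feature.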
65

Framework para classificação das mutações de vírus HIV / HIV mutation classification framework

Ozahata, Mina Cintho 15 May 2014 (has links)
Um grande número de medicamentos utilizados no tratamento contra o HIV agem procurando inibir a ação das proteínas transcriptase reversa e protease. Mutações existentes nas sequências dessas proteínas podem estar relacionadas à resistência aos medicamentos e podem prejudicar o desempenho de um tratamento. O estudo do genótipo dos vírus pode ajudar na tomada de escolhas específicas em tratamentos para cada indivíduo, tornando maior a chance de sucesso. Com a maior acessibilidade a exames de genotipagem, uma grande quantidade de sequências do vírus está disponível, contendo um grande volume de informação. Padrões de ocorrência de mutações são exemplos de informações contidas nessas sequências e são importantes por estarem relacionados à resistência aos medicamentos. Um dos caminhos que pode nos levar ao entendimento desses padrões de mutações é a aplicação de técnicas de agrupamento e biclustering. Essas técnicas visam a geração de grupos ou biclusters que possuam dados com propriedades em comum. São empregadas em casos em que não há grande quantidade de informação prévia e existem poucas hipóteses sobre os dados. Assim, pode-se encontrar os padrões de mutações que ocorrem nessas sequências e tentar relacioná-los com a resistência aos medicamentos, utilizando métodos de agrupamento e bicluster em sequências de protease e transcriptase reversa. Existem alguns sistemas que tentam predizer a resistência ou susceptibilidade das sequências, porém, devido à grande complexidade dessa relação, ainda é necessário esclarecer o vínculo entre combinações de mutações e níveis de resistência fenotípica. Desta forma, a principal contribuição deste trabalho é o desenvolvimento de um framework baseado na aplicação dos algoritmos KMédias e Bimax às sequências de transcriptase reversa e protease de pacientes infectados com HIV, em uma codificação binária. 
O presente trabalho também introduz uma representação visual dos grupos e biclusters baseada em dados de microarranjos para casos em que se têm grandes volumes de dados, de forma a facilitar a visualização da informação extraída e a caracterização dos grupos e biclusters no domínio da doença. / Drugs used in HIV treatment aim to inhibit protease and reverse transcriptase. Mutations in the sequences of these proteins can be related to drug resistance and can reduce treatment efficacy. Studying the virus genotype may help in choosing specific treatments for each patient, increasing the probability of success. As genotyping tests become more accessible, a large number of virus sequences, carrying a great deal of information, are available. Patterns of mutation are examples of information comprised in the sequences and are important since they are related to drug resistance. One way that can lead to the understanding of these mutation patterns is the use of clustering and biclustering techniques. These techniques search for clusters or biclusters comprising data with similar attributes. They are used when there is little previous information and there are few hypotheses about the data. Therefore, it may be possible to find patterns of mutations in the sequences and to relate them to drug resistance using clustering and biclustering techniques on protease and reverse transcriptase sequences. There are a few systems that predict drug resistance according to the sequence of the virus; however, due to the complexity of this relationship, it is still necessary to elucidate the connection between mutation combinations and the level of phenotypic resistance. Accordingly, the main contribution of this work is the development of a framework based on the K-means and Bimax algorithms applied to protease and reverse transcriptase sequences from HIV patients in a binary encoding. 
This work also presents a visual representation of the clusters and biclusters based on microarray data suitable for large data volumes, helping the visualization of information extracted from data and cluster and bicluster characterization in the disease domain.
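The binary encoding that feeds K-means and Bimax can be sketched as follows (the reference sequence and mutations below are invented for illustration): each virus sequence becomes a binary vector with one bit per (position, mutated residue) pair observed anywhere in the data.

```python
def encode(sequences, reference):
    # One bit per (position, mutated residue) seen in any sequence.
    mutations = sorted({(i, aa)
                        for seq in sequences
                        for i, aa in enumerate(seq)
                        if aa != reference[i]})
    vectors = [[1 if seq[i] == aa else 0 for (i, aa) in mutations]
               for seq in sequences]
    return mutations, vectors

reference = "MKVL"                       # hypothetical wild-type fragment
sequences = ["MKVL", "MRVL", "MRVI"]     # hypothetical patient sequences
mutations, vectors = encode(sequences, reference)
```

On this toy input the discovered mutation alphabet is [(1, 'R'), (3, 'I')], and Bimax would then look for all-ones submatrices of the resulting binary matrix, i.e., groups of sequences sharing a combination of mutations.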
66

Estudo e desenvolvimento de algoritmos para agrupamento fuzzy de dados em cenários centralizados e distribuídos / Study and development of fuzzy clustering algorithms in centralized and distributed scenarios

Vendramin, Lucas 05 July 2012 (has links)
Agrupamento de dados é um dos problemas centrais na área de mineração de dados, o qual consiste basicamente em particionar os dados em grupos de objetos mais similares (ou relacionados) entre si do que aos objetos dos demais grupos. Entretanto, as abordagens tradicionais pressupõem que cada objeto pertence exclusivamente a um único grupo. Essa hipótese não é realista em várias aplicações práticas, em que grupos de objetos apresentam distribuições estatísticas que possuem algum grau de sobreposição. Algoritmos de agrupamento fuzzy podem lidar naturalmente com problemas dessa natureza. A literatura sobre agrupamento fuzzy de dados é extensa; muitos algoritmos existem atualmente e são mais (ou menos) apropriados para determinados cenários, por exemplo, na procura por grupos que apresentam diferentes formatos ou ao operar sobre dados descritos por conjuntos de atributos de tipos diferentes. Adicionalmente, existem cenários em que os dados podem estar distribuídos em diferentes locais (sítios de dados). Nesses cenários, o objetivo de um algoritmo de agrupamento de dados consiste em encontrar uma estrutura que represente os dados existentes nos diferentes sítios sem a necessidade de transmissão e armazenamento/processamento centralizado desses dados. Tais algoritmos são denominados algoritmos de agrupamento distribuído de dados. O presente trabalho visa o estudo e aperfeiçoamento de algoritmos de agrupamento fuzzy centralizados e distribuídos existentes na literatura, buscando identificar as principais características, vantagens, desvantagens e cenários mais apropriados para a aplicação de cada um deles, incluindo análises de complexidade de tempo, espaço e de comunicação para os algoritmos distribuídos. / Data clustering is a fundamental conceptual problem in data mining, in which one aims at determining a finite set of categories to describe a data set according to similarities among its objects. 
Traditional algorithms assume that each object belongs exclusively to a single cluster. This may not be realistic in many applications, in which groups of objects present statistical distributions with some overlap. Fuzzy clustering algorithms can naturally deal with these problems. The literature on fuzzy clustering is extensive; several fuzzy clustering algorithms with different characteristics and purposes have been proposed and investigated, each more (or less) suitable for specific scenarios, e.g., finding clusters with different shapes or working with data sets described by different types of attributes. Additionally, there are scenarios in which the data are (or can be) distributed among different sites. In these scenarios, the goal of a clustering algorithm consists in finding a structure that describes the distributed data without the need for data and processing centralization. Such algorithms are known as distributed clustering algorithms. The present document aims at the study and improvement of centralized and distributed fuzzy clustering algorithms, identifying the main characteristics, advantages, disadvantages, and appropriate scenarios for each application, including complexity analysis of time, space, and communication for the distributed algorithms.
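The classic algorithm underlying this line of work is fuzzy c-means, which can be sketched in a few lines (data, parameters, and initialisation here are illustrative only; the thesis studies many variants beyond this baseline):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(1, keepdims=True)             # memberships: rows sum to 1
    for _ in range(iters):
        Um = U ** m                          # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)             # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(1, keepdims=True)  # standard FCM membership update
    return centers, U

# Two well-separated blobs: each point should belong almost entirely
# to one of the two fuzzy clusters.
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
centers, U = fuzzy_c_means(X, c=2)
```

Unlike hard k-means, each row of U gives graded memberships, which is exactly what lets overlapping statistical distributions be modelled.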
67

A clustering scheme for large high-dimensional document datasets

Chen, Jing-wen 09 August 2007 (has links)
Document clustering methods have attracted growing attention. Because of the high dimensionality and large size of document datasets, clustering methods usually take a long time to run. We propose a scheme that makes a clustering algorithm considerably faster than the original. We partition the whole dataset into several parts and first cluster one of them. Then, according to the labels produced by that clustering, we reduce the number of features by a certain ratio. We add another part of the data, convert it to the lower-dimensional space, and cluster again, repeating until all partitions are used. According to our experimental results, this scheme can run about twice as fast as the original clustering method.
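The feature-reduction step can be sketched as follows (my own simplification, not the thesis's ranking): after clustering one partition, keep only the features that best separate the resulting labels, so later partitions are clustered in a cheaper, lower-dimensional space.

```python
from collections import Counter

def select_features(docs, labels, keep_ratio):
    # docs: list of token lists; labels: one cluster id per document
    per_cluster = {}
    for doc, label in zip(docs, labels):
        per_cluster.setdefault(label, Counter()).update(set(doc))
    total = Counter()
    for counts in per_cluster.values():
        total.update(counts)
    # Score = largest share of a feature's documents in a single cluster;
    # features concentrated in one cluster are the most discriminative.
    def score(feature):
        return max(c[feature] for c in per_cluster.values()) / total[feature]
    ranked = sorted(total, key=lambda f: (-score(f), f))
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

docs = [["the", "ball", "goal"], ["the", "ball", "match"],
        ["the", "stock", "bond"], ["the", "stock", "price"]]
labels = [0, 0, 1, 1]
selected = select_features(docs, labels, keep_ratio=0.5)
```

Features spread evenly across clusters (like "the" here) score lowest and are dropped, shrinking the dimension before the next partition is folded in.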
68

Clustering Methods and Their Applications to Adolescent Healthcare Data

Mayer-Jochimsen, Morgan 01 April 2013 (has links)
Clustering is a mathematical method of data analysis that identifies trends by efficiently separating data into a specified number of clusters, which makes it widely applicable to questions about the interrelatedness of data. Two methods of clustering are considered here. K-means clustering defines clusters in relation to the centroid, or center, of a cluster. Spectral clustering establishes connections between all of the data points to be clustered, then eliminates connections that link dissimilar points. This is represented as an eigenvector problem whose solution is given by the eigenvectors of the normalized graph Laplacian. Spectral clustering forms groups so that the similarity between points in the same cluster is stronger than the similarity between points in different clusters. K-means and spectral clustering are used to analyze adolescent data from the 2009 California Health Interview Survey. Differences were observed between the results of the two clustering methods on 3294 individuals and 22 health-related attributes. K-means clustered the adolescents by exercise, poverty, and variables related to psychological health, while the spectral clustering groups were informed by smoking, alcohol use, low exercise, psychological distress, low parental involvement, and poverty. We offer some conjectures about the source of this difference, observe characteristics of the clustering methods, and comment on the viability of spectral clustering for healthcare data.
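A minimal two-cluster spectral partition illustrates the recipe described above (a sketch of the general method, not the survey-analysis pipeline): build a similarity graph, form the normalized graph Laplacian, and split on the sign of the eigenvector for the second-smallest eigenvalue (the Fiedler vector).

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian similarity graph
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))
    # Normalized graph Laplacian: L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)        # eigenvalues ascending
    return (vecs[:, 1] > 0).astype(int)       # sign split of Fiedler vector

X = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
labels = spectral_bipartition(X)
```

For k > 2 clusters one instead runs k-means on the first k eigenvector coordinates, which is how the two methods in the thesis connect.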
69

A Theoretical Study of Clusterability and Clustering Quality

Ackerman, Margareta January 2007 (has links)
Clustering is a widely used technique, with applications ranging from data mining, bioinformatics, and image analysis to marketing, psychology, and city planning. Despite the practical importance of clustering, there is very limited theoretical analysis of the topic. We take a step towards building theoretical foundations for clustering by carrying out an abstract analysis of two central concepts in clustering: clusterability and clustering quality. We compare a number of notions of clusterability found in the literature. While all these notions attempt to measure the same property, and all appear to be reasonable, we show that they are pairwise inconsistent. In addition, we give the first computational complexity analysis of a few notions of clusterability. In the second part of the thesis, we discuss how the quality of a given clustering can be defined (and measured). Users often need to compare the quality of clusterings obtained by different methods. Perhaps more importantly, users need to determine whether a given clustering is good enough to be used in further data mining analysis. We analyze what a measure of clustering quality should look like by introducing a set of requirements (`axioms') for clustering quality measures. We propose a number of clustering quality measures that satisfy these requirements.
70

Cooperative Clustering Model and Its Applications

Kashef, Rasha January 2008 (has links)
Data clustering plays an important role in many disciplines, including data mining, machine learning, bioinformatics, pattern recognition, and other fields, where there is a need to learn the inherent grouping structure of data in an unsupervised manner. There are many clustering approaches proposed in the literature with different quality/complexity tradeoffs. Each clustering algorithm works on its own domain space, and no single approach is optimal for all datasets of different properties, sizes, structures, and distributions. Challenges in data clustering include identifying the proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations. This thesis addresses some of these challenges through cooperation between multiple clustering approaches. We introduce a Cooperative Clustering (CC) model that involves multiple clustering techniques; the goal of the cooperative model is to increase the homogeneity of objects within clusters through cooperation by developing two data structures, a cooperative contingency graph and a histogram representation of pair-wise similarities. The two data structures are designed to find the matching sub-clusters between different clusterings and to obtain the final set of cooperative clusters through a merging process. Obtaining the co-occurred objects from the different clusterings enables the cooperative model to group objects based on a multiple agreement between the invoked clustering techniques. In addition, merging this set of sub-clusters using histograms offers a new way of grouping objects into more homogeneous clusters. The cooperative model is consistent, reusable, and scalable in terms of the number of the adopted clustering approaches. 
In order to deal with noisy data, a novel Cooperative Clustering Outliers Detection (CCOD) algorithm is implemented by applying the cooperation methodology for better detection of outliers in data. The new detection approach is designed in four phases: (1) Global Non-cooperative Clustering, (2) Cooperative Clustering, (3) Possible Outliers Detection, and finally (4) Candidate Outliers Detection. The detection of outliers is established in a bottom-up scenario. The thesis also addresses cooperative clustering in distributed Peer-to-Peer (P2P) networks. Mining large and inherently distributed datasets poses many challenges, one of which is the extraction of a global model as a global summary of the clustering solutions generated from all nodes, for the purpose of interpreting the clustering quality of the distributed dataset as if it were located at one node. We developed a distributed cooperative model and architecture that work on a two-tier super-peer P2P network. The model is called Distributed Cooperative Clustering in Super-peer P2P Networks (DCCP2P). This model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the complexity of distributed clustering, by modeling the distributed clustering problem as two layers of peer neighborhoods and super-peers. Summarization of the global distributed clusters is achieved through a distributed version of the cooperative clustering model. Three clustering algorithms, k-means (KM), Bisecting k-means (BKM), and Partitioning Around Medoids (PAM), are invoked in the cooperative model. 
Results on various gene expression and text document datasets with different properties, configurations, and degrees of outliers reveal that: (i) the cooperative clustering model achieves significant improvement in the quality of the clustering solutions compared to that of the non-cooperative individual approaches; (ii) the cooperative detection algorithm discovers the nonconforming objects in data with better accuracy than contemporary approaches; and (iii) the distributed cooperative model attains the same or even better quality than the centralized approach and achieves a decent speedup as the number of nodes increases. The distributed model offers a high degree of flexibility, scalability, and interpretability for large distributed repositories. Achieving the same results using current methodologies requires pooling the data first at one central location, which is sometimes not feasible.
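The co-occurrence agreement at the heart of the cooperative model can be sketched as follows (a simplification, not the contingency-graph implementation): two clusterings "agree" on a pair of objects when both place the pair in the same cluster, and the connected components of the agreement graph form the candidate sub-clusters that cooperation then merges.

```python
from itertools import combinations

def agreed_subclusters(labelings):
    # labelings: one label list per clustering technique, same object order
    n = len(labelings[0])
    agree = [(i, j) for i, j in combinations(range(n), 2)
             if all(lab[i] == lab[j] for lab in labelings)]
    # Connected components of the agreement graph via union-find.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in agree:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical labelings from two of the invoked algorithms:
kmeans_labels = [0, 0, 0, 1, 1, 1]
pam_labels    = [0, 0, 1, 1, 1, 0]
subclusters = agreed_subclusters([kmeans_labels, pam_labels])
```

Objects on which the techniques disagree (2 and 5 here) end up as singleton sub-clusters; the histogram-based merging step described above then decides which sub-clusters to combine into the final cooperative clusters.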
