Global ETD Search

1	Scalable algorithms for correlation clustering on large graphs Cordner, Nathan 01 October 2024 (has links) Correlation clustering (CC) is a widely-used clustering paradigm, where objects are represented as graph nodes and clustering is performed based on relationships between objects (positive or negative edges between pairs of nodes). The CC objective is to obtain a graph clustering that minimizes the number of incorrectly assigned edges (negative edges within clusters, and positive edges between clusters). Many of the state-of-the-art algorithms for solving correlation clustering rely on subroutines that cause significant memory and run time bottlenecks when applied to larger graphs. Several algorithms with the best theoretical guarantees for clustering quality need to first solve a relatively large linear program; others perform brute-force searches over sizeable sets, or store large amounts of unnecessary information. Because of these issues, algorithms that run quicker (e.g. in linear time) but have lower quality approximation guarantees have still remained popular. In this thesis we examine three such popular linear time CC algorithms: Pivot, Vote, and LocalSearch. For the general CC problem we show that these algorithms perform well against slower state-of-the-art algorithms; we also develop a lightweight InnerLocalSearch method that runs much faster and delivers nearly the same quality of results as the full LocalSearch. We adapt Pivot, Vote, and LocalSearch for two constrained CC variants (limited cluster sizes, and limited total number of clusters), and show their practicality when compared against slower algorithms with better approximation guarantees. Finally, we give two practical run time improvements for applying CC algorithms to the related consensus clustering problem. Computer science Algorithms Correlation clustering Data mining
2	Equ?librio estrutural aplicado ? detec??o de casos de corrup??o / Structural equation applied to the detection of cases of corruption Ponciano, Vitor dos Santos 21 February 2017 (has links) Submitted by Celso Magalhaes (celsomagalhaes@ufrrj.br) on 2018-08-28T12:21:13Z No. of bitstreams: 1 2017 - Victor dos Santos Ponciano.pdf: 551672 bytes, checksum: c01424f760ef08254a63d236b62f72c5 (MD5) / Made available in DSpace on 2018-08-28T12:21:15Z (GMT). No. of bitstreams: 1 2017 - Victor dos Santos Ponciano.pdf: 551672 bytes, checksum: c01424f760ef08254a63d236b62f72c5 (MD5) Previous issue date: 2017-02-21 / Conselho Nacional de Desenvolvimento Cient?fico e Tecnol?gico - CNPq / In 1946, Heider developed Signal Graph Theory with the purpose of describing the emotional relationships between people pertaining to the same social group. In this work, we study graph partitioning problems associated with structural balance. These problems are known in the computer science literature as partition correlation problems: correlation clustering (CC) and a relaxed version (RCC). The solution of CC and RCC problems has been previously used in the literature as a tool for the evaluation of structural balance in a social network. The aim of this work is to apply the solution of these problems in the detection of corruption in public contracts. We describe integer linear programming formulations from the literature for these problems. We also discuss a probabilistic model for Structural balance and the solution of these problems applied to the detection of cases of corruption existing in public bids. / Em 1946, Heider desenvolveu a Teoria de Grafos de Sinais a fim de descrever as rela??es emocionais entre as pessoas pertencentes ao mesmo grupo social. Neste trabalho, estudamos problemas de particionamento de grafo associados com equil?brio estrutural, que na literatura de Ci?ncia da Computa??o s?o conhecidos como problemas de correla??o de parti??es ou, em ingl?s, correlation clustering (CC), al?m de uma vers?o relaxada (RCC). As solu??es dos problemas CC e RCC foram anteriormente utilizadas na literatura como ferramentas para a avalia??o de equil?brio estrutural numa rede social. O objetivo deste trabalho ? aplicar as solu??es destes problemas na detec??o de corrup??o em contratos p?blicos. Para esse fim, s?o utilizadas formula??es de programa??o linear inteira existentes na literatura para estes problemas. Al?m disso, ? discutido tamb?m um modelo probabil?stico para o Equil?brio Estrutural e as solu??es destes problemas s?o aplicadas ? detec??o de casos de corrup??es existentes em licita??es p?blicas. Equil?brio estrutural Grafos de sinais Correlation Clustering Detec??o de Corrup??o Structural balance Signal Graphs Correlation Clustering Corruption Detection Ci?ncias Exatas e da Terra
3	Computational Complexity Of Bi-clustering Wulff, Sharon Jay January 2008 (has links) In this work we formalize a new natural objective (or cost) function for bi-clustering - Monochromatic bi-clustering. Our objective function is suitable for detecting meaningful homogenous clusters based on categorical valued input matrices. Such problems have arisen recently in systems biology where researchers have inferred functional classifications of biological agents based on their pairwise interactions. We analyze the computational complexity of the resulting optimization problems. We show that finding optimal solutions is NP-hard and complement this result by introducing a polynomial time approximation algorithm for this bi-clustering task. This is the first positive approximation guarantee for bi-clustering algorithms. We also show that bi-clustering with our objective function can be viewed as a generalization of correlation clustering. Bi-Clustering NP-hardness correlation clustering Computer Science
4	Computational Complexity Of Bi-clustering Wulff, Sharon Jay January 2008 (has links) In this work we formalize a new natural objective (or cost) function for bi-clustering - Monochromatic bi-clustering. Our objective function is suitable for detecting meaningful homogenous clusters based on categorical valued input matrices. Such problems have arisen recently in systems biology where researchers have inferred functional classifications of biological agents based on their pairwise interactions. We analyze the computational complexity of the resulting optimization problems. We show that finding optimal solutions is NP-hard and complement this result by introducing a polynomial time approximation algorithm for this bi-clustering task. This is the first positive approximation guarantee for bi-clustering algorithms. We also show that bi-clustering with our objective function can be viewed as a generalization of correlation clustering. Bi-Clustering NP-hardness correlation clustering Computer Science
5	Functional Characterization of the NSF1 (YPL230W) Gene using Correlation Clustering and Genetic Analysis in Saccharomyces Cerevisiae Bessonov, Kyrylo 09 January 2012 (has links) High throughput technologies such as microarrays and modern genome sequencers produce enormous amounts of data that require novel data processing. This thesis proposes a method called Interdependent Correlation Cluster (ICC) to analyze the relations between genes represented by microarray data that are conditioned on a specific target gene. Based on Correlation Clustering, the proposed method analyzes a large set of correlation values related to the gene expression profiles extracted from given microarray datasets. The proposed method works on any size microarray datasets and could be applied to any target gene. In this study the selected target gene, NSF1 /USV1 / YPL230W, encodes a poorly characterized C2H2 zinc finger transcription factor (TF) involved in stress responses in yeast. The method is successful in the identification of novel NSF1 functional roles during fermentation stress conditions in the M2 industrial yeast strain. The new identified functions include regulation of energy and sulfur metabolism, protein synthesis, ribosomal assembly and protein trafficking as well as other processes. NSF1 involvement in sulfur metabolism was experimentally confirmed using biological laboratory techniques. Importantly, implication of NSF1 in sulfur metabolism regulation has highly relevant implications to wine and beer production industries concerned with production of compounds having sulfur-like off odour (SLO) and toxic properties. The correlation clustering also provides a means of understanding complex interactions existing between genes. / The pdf file contains numerous hyperlinks and bookmarks to facilitate navigation. This thesis will be of interest to those working with topics such as data mining of microarray data, novel gene function discovery and prediction, and genome-wide responses to fermentation stresses. / Ministry of Training, Colleges and Universities of Ontario (Ontario Graduate Scholarship and Ontario Graduate Scholarships in Science and Technology); The Natural Sciences and Engineering Research Council of Canada (NSERC) microarray analysis NSF1 ( YPL230W ) correlation clustering fermentation stress novel gene functions functional networks clustering
6	Adding Value to Food Safety Systems through Secondary Analysis of Regulatory Microbiological Testing Data Beczkiewicz, Aaron Thomas Edward January 2021 (has links) No description available. Food Science
7	Data mining in large sets of complex data / Mineração de dados em grande conjuntos de dados complexos Cordeiro, Robson Leonardo Ferreira 29 August 2011 (has links) Due to the increasing amount and complexity of the data stored in the enterprises\' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated to handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, especial data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated to other data mining tasks that are performed together. Specifically, this Doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real time applications that, potentially, could not be developed without this Ph.D. work, like a software to aid on the fly the diagnosis process in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time / O crescimento em quantidade e complexidade dos dados armazenados nas organizações torna a extração de conhecimento utilizando técnicas de mineração uma tarefa ao mesmo tempo fundamental para aproveitar bem esses dados na tomada de decisões estratégicas e de alto custo computacional. O custo vem da necessidade de se explorar uma grande quantidade de casos de estudo, em diferentes combinações, para se obter o conhecimento desejado. Tradicionalmente, os dados a explorar são representados como atributos numéricos ou categóricos em uma tabela, que descreve em cada tupla um caso de teste do conjunto sob análise. Embora as mesmas tarefas desenvolvidas para dados tradicionais sejam também necessárias para dados mais complexos, como imagens, grafos, áudio e textos longos, a complexidade das análises e o custo computacional envolvidos aumentam significativamente, inviabilizando a maioria das técnicas de análise atuais quando aplicadas a grandes quantidades desses dados complexos. Assim, técnicas de mineração especiais devem ser desenvolvidas. Este Trabalho de Doutorado visa a criação de novas técnicas de mineração para grandes bases de dados complexos. Especificamente, foram desenvolvidas duas novas técnicas de agrupamento e uma nova técnica de rotulação e sumarização que são rápidas, escaláveis e bem adequadas à análise de grandes bases de dados complexos. As técnicas propostas foram avaliadas para a análise de bases de dados reais, em escala de Terabytes de dados, contendo até bilhões de objetos complexos, e elas sempre apresentaram resultados de alta qualidade, sendo em quase todos os casos pelo menos uma ordem de magnitude mais rápidas do que os trabalhos relacionados mais eficientes. Os dados reais utilizados vêm das seguintes aplicações: diagnóstico automático de câncer de mama, análise de imagens de satélites, e mineração de grafos aplicada a um grande grafo da web coletado pelo Yahoo! e também a um grafo com todos os usuários da rede social Twitter e suas conexões. Tais resultados indicam que nossos algoritmos permitem a criação de aplicações em tempo real que, potencialmente, não poderiam ser desenvolvidas sem a existência deste Trabalho de Doutorado, como por exemplo, um sistema em escala global para o auxílio ao diagnóstico médico em tempo real, ou um sistema para a busca por áreas de desmatamento na Floresta Amazônica em tempo real Agrupamento de correlação Correlation clustering Dados de média à alta dimensionalidade Labeling and summarization MapReduce MapReduce Moderante-to-high dimensionality data Rotulação e sumarização Terabyte-scale data mining
8	Data mining in large sets of complex data / Mineração de dados em grande conjuntos de dados complexos Robson Leonardo Ferreira Cordeiro 29 August 2011 (has links) Due to the increasing amount and complexity of the data stored in the enterprises\' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated to handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, especial data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, tightly associated to other data mining tasks that are performed together. Specifically, this Doctoral dissertation presents three novel, fast and scalable data mining algorithms well-suited to analyze large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases. The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. Such results indicate that our algorithms allow the development of real time applications that, potentially, could not be developed without this Ph.D. work, like a software to aid on the fly the diagnosis process in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time / O crescimento em quantidade e complexidade dos dados armazenados nas organizações torna a extração de conhecimento utilizando técnicas de mineração uma tarefa ao mesmo tempo fundamental para aproveitar bem esses dados na tomada de decisões estratégicas e de alto custo computacional. O custo vem da necessidade de se explorar uma grande quantidade de casos de estudo, em diferentes combinações, para se obter o conhecimento desejado. Tradicionalmente, os dados a explorar são representados como atributos numéricos ou categóricos em uma tabela, que descreve em cada tupla um caso de teste do conjunto sob análise. Embora as mesmas tarefas desenvolvidas para dados tradicionais sejam também necessárias para dados mais complexos, como imagens, grafos, áudio e textos longos, a complexidade das análises e o custo computacional envolvidos aumentam significativamente, inviabilizando a maioria das técnicas de análise atuais quando aplicadas a grandes quantidades desses dados complexos. Assim, técnicas de mineração especiais devem ser desenvolvidas. Este Trabalho de Doutorado visa a criação de novas técnicas de mineração para grandes bases de dados complexos. Especificamente, foram desenvolvidas duas novas técnicas de agrupamento e uma nova técnica de rotulação e sumarização que são rápidas, escaláveis e bem adequadas à análise de grandes bases de dados complexos. As técnicas propostas foram avaliadas para a análise de bases de dados reais, em escala de Terabytes de dados, contendo até bilhões de objetos complexos, e elas sempre apresentaram resultados de alta qualidade, sendo em quase todos os casos pelo menos uma ordem de magnitude mais rápidas do que os trabalhos relacionados mais eficientes. Os dados reais utilizados vêm das seguintes aplicações: diagnóstico automático de câncer de mama, análise de imagens de satélites, e mineração de grafos aplicada a um grande grafo da web coletado pelo Yahoo! e também a um grafo com todos os usuários da rede social Twitter e suas conexões. Tais resultados indicam que nossos algoritmos permitem a criação de aplicações em tempo real que, potencialmente, não poderiam ser desenvolvidas sem a existência deste Trabalho de Doutorado, como por exemplo, um sistema em escala global para o auxílio ao diagnóstico médico em tempo real, ou um sistema para a busca por áreas de desmatamento na Floresta Amazônica em tempo real Agrupamento de correlação Dados de média à alta dimensionalidade MapReduce Rotulação e sumarização Correlation clustering Labeling and summarization MapReduce Moderante-to-high dimensionality data Terabyte-scale data mining
9	Canonical Correlation and Clustering for High Dimensional Data Ouyang, Qing January 2019 (has links) Multi-view datasets arise naturally in statistical genetics when the genetic and trait profile of an individual is portrayed by two feature vectors. A motivating problem concerning the Skin Intrinsic Fluorescence (SIF) study on the Diabetes Control and Complications Trial (DCCT) subjects is presented. A widely applied quantitative method to explore the correlation structure between two domains of a multi-view dataset is the Canonical Correlation Analysis (CCA), which seeks the canonical loading vectors such that the transformed canonical covariates are maximally correlated. In the high dimensional case, regularization of the dataset is required before CCA can be applied. Furthermore, the nature of genetic research suggests that sparse output is more desirable. In this thesis, two regularized CCA (rCCA) methods and a sparse CCA (sCCA) method are presented. When correlation sub-structure exists, stand-alone CCA method will not perform well. To tackle this limitation, a mixture of local CCA models can be employed. In this thesis, I review a correlation clustering algorithm proposed by Fern, Brodley and Friedl (2005), which seeks to group subjects into clusters such that features are identically correlated within each cluster. An evaluation study is performed to assess the effectiveness of CCA and correlation clustering algorithms using artificial multi-view datasets. Both sCCA and sCCA-based correlation clustering exhibited superior performance compare to the rCCA and rCCA-based correlation clustering. The sCCA and the sCCA-clustering are applied to the multi-view dataset consisted of PrediXcan imputed gene expression and SIF measurements of DCCT subjects. The stand-alone sparse CCA method identified 193 among 11538 genes being correlated with SIF#7. Further investigation of these 193 genes with simple linear regression and t-test revealed that only two genes, ENSG00000100281.9 and ENSG00000112787.8, were significance in association with SIF#7. No plausible clustering scheme was detected by the sCCA based correlation clustering method. / Thesis / Master of Science (MSc) Machine Learning Correlation Clustering Sparse Canonical Correlation Analysis Skin Intrinsic Fluorescence Multi-view dataset Lasso Dimensionality reduction PrediXcan High dimensional data

Search results