Spelling suggestions: "subject:"graph clustering"" "subject:"raph clustering""
11 |
Enriching the Web of Data with topics and linksBöhm, Christoph January 2013 (has links)
This thesis presents novel ideas and research findings for the Web of Data – a global data space spanning many so-called Linked Open Data sources. Linked Open Data adheres to a set of simple principles to allow easy access and reuse for data published on the Web. Linked Open Data is by now an established concept and many (mostly academic) publishers adopted the principles building a powerful web of structured knowledge available to everybody. However, so far, Linked Open Data does not yet play a significant role among common web technologies that currently facilitate a high-standard Web experience.
In this work, we thoroughly discuss the state-of-the-art for Linked Open Data and highlight several shortcomings – some of them we tackle in the main part of this work.
First, we propose a novel type of data source meta-information, namely the topics of a dataset. This information could be published with dataset descriptions and support a variety of use cases, such as data source exploration and selection. For the topic retrieval, we present an approach coined Annotated Pattern Percolation (APP), which we evaluate with respect to topics extracted from Wikipedia portals.
Second, we contribute to entity linking research by presenting an optimization model for joint entity linking, showing its hardness, and proposing three heuristics implemented in the LINked Data Alignment (LINDA) system. Our first solution can exploit multi-core machines, whereas the second and third approach are designed to run in a distributed shared-nothing environment. We discuss and evaluate the properties of our approaches leading to recommendations which algorithm to use in a specific scenario. The distributed algorithms are among the first of their kind, i.e., approaches for joint entity linking in a distributed fashion. Also, we illustrate that we can tackle the entity linking problem on the very large scale with data comprising more than 100 millions of entity representations from very many sources.
Finally, we approach a sub-problem of entity linking, namely the alignment of concepts. We again target a method that looks at the data in its entirety and does not neglect existing relations. Also, this concept alignment method shall execute very fast to serve as a preprocessing for further computations. Our approach, called Holistic Concept Matching (HCM), achieves the required speed through grouping the input by comparing so-called knowledge representations. Within the groups, we perform complex similarity computations, relation conclusions, and detect semantic contradictions. The quality of our result is again evaluated on a large and heterogeneous dataset from the real Web.
In summary, this work contributes a set of techniques for enhancing the current state of the Web of Data. All approaches have been tested on large and heterogeneous real-world input. / Die vorliegende Arbeit stellt neue Ideen sowie Forschungsergebnisse für das Web of Data vor. Hierbei handelt es sich um ein globales Netz aus sogenannten Linked Open Data (LOD) Quellen. Diese Datenquellen genügen gewissen Prinzipien, um Nutzern einen leichten Zugriff über das Internet und deren Verwendung zu ermöglichen. LOD ist bereits weit verbreitet und es existiert eine Vielzahl von Daten-Veröffentlichungen entsprechend der LOD Prinzipien. Trotz dessen ist LOD bisher kein fester Baustein des Webs des 21. Jahrhunderts.
Die folgende Arbeit erläutert den aktuellen Stand der Forschung und Technik für Linked Open Data und identifiziert dessen Schwächen. Einigen Schwachstellen von LOD widmen wir uns in dem darauf folgenden Hauptteil.
Zu Beginn stellen wir neuartige Metadaten für Datenquellen vor – die Themen von Datenquellen (engl. Topics). Solche Themen könnten mit Beschreibungen von Datenquellen veröffentlicht werden und eine Reihe von Anwendungsfällen, wie das Auffinden und Explorieren relevanter Daten, unterstützen. Wir diskutieren unseren Ansatz für die Extraktion dieser Metainformationen – die Annotated Pattern Percolation (APP). Experimentelle Ergebnisse werden mit Themen aus Wikipedia Portalen verglichen.
Des Weiteren ergänzen wir den Stand der Forschung für das Auffinden verschiedener Repräsentationen eines Reale-Welt-Objektes (engl. Entity Linking). Für jenes Auffinden werden nicht nur lokale Entscheidungen getroffen, sondern es wird die Gesamtheit der Objektbeziehungen genutzt. Wir diskutieren unser Optimierungsmodel, beweisen dessen Schwere und präsentieren drei Ansätze zur Berechnung einer Lösung. Alle Ansätze wurden im LINked Data Alignment (LINDA) System implementiert. Die erste Methode arbeitet auf einer Maschine, kann jedoch Mehrkern-Prozessoren ausnutzen. Die weiteren Ansätze wurden für Rechnercluster ohne gemeinsamen Speicher entwickelt. Wir evaluieren unsere Ergebnisse auf mehr als 100 Millionen Entitäten und erläutern Vor- sowie Nachteile der jeweiligen Ansätze.
Im verbleibenden Teil der Arbeit behandeln wir das Linking von Konzepten – ein Teilproblem des Entity Linking. Unser Ansatz, Holistic Concept Matching (HCM), betrachtet abermals die Gesamtheit der Daten. Wir gruppieren die Eingabe um eine geringe Laufzeit bei der Verarbeitung von mehreren Hunderttausenden Konzepten zu erreichen. Innerhalb der Gruppen berechnen wir komplexe Ähnlichkeiten, und spüren semantische Schlussfolgerungen und Widersprüche auf. Die Qualität des Ergebnisses evaluieren wir ebenfalls auf realen Datenmengen.
Zusammenfassend trägt diese Arbeit zum aktuellen Stand der Forschung für das Web of Data bei. Alle diskutierten Techniken wurden mit realen, heterogenen und großen Datenmengen getestet.
|
12 |
Machine Learning Based Drug-Disease Relationship Prediction and CharacterizationYaddanapudi, Suryanarayana 01 October 2019 (has links)
No description available.
|
13 |
Large-Scale Graph Visual AnalyticsZhang, Fangyan 08 December 2017 (has links)
Large-scale graph analysis and visualization is becoming a more challenging task, due to the increasing amount of graph data. This dissertation focuses on methods to ease the task of exploring large-scale graphs through graph sampling and visualization. Graph sampling aims to reduce the complexity of graph drawing, while preserving properties of the original graph, allowing analysis of the smaller sample which yields the characteristics similar to those of the original graph. Graph visualization is an effective and intuitive approach to observing structures within graph data. For large-scale graphs, graph sampling and visualization are straightforward strategies to gain insights into common issues that are often encountered. This dissertation evaluates commonly used graph sampling methods through a combined visual and statistical comparison of graphs sampled at various rates based on random graphs, small-world graphs, scaleree graphs, and real-world graphs. This benchmark study can be used as a guideline in choosing the appropriate method for a particular graph sampling task. In addition, this thesis proposes three types of distributed sampling algorithms and develops a sampling package on Spark. Compared with traditional/non-distributed graph sampling approaches, the scalable distributed sampling approaches are as reliable as the traditional/non-distributed graph sampling techniques, and they bring much needed improvement to sampling efficiency, especially with regards to topology-based sampling. This benchmark study in traditional/non-distributed graph sampling is also applicable to distributed graph sampling as well. A contribution to the area of graph visualization is also made through the presentation of a scalable graph visualization system-BGS (Big Graph Surfer) that creates hierarchical structure from an original graph and provides interactive navigation along the hierarchy by expanding or collapsing clusters when visualizing large-scale graphs. A distributed computing framework-Spark provides the backend for BGS on clustering and visualization. This architecture makes it capable of visualizing a graph up to 1 billion nodes or edges in real-time. In addition, BGS provides a series of hierarchy and graph exploration methods, such as hierarchy view, hierarchy navigation, hierarchy search, graph view, graph navigation, graph search, and other useful interactions. These functionalities facilitate the exploration of very large-scale graphs. Evaluation of BGS is performed through application to several representative of large-scale graph datasets and comparison with other existing graph visualization tools in scalability, usability, and flexibility. The dissertation concludes with a summarization of the contributions and their improvement on large-scale graph analysis and visualization, and a discussion about possible future work on this research field.
|
14 |
Enterprise network topology discovery based on end-to-end metrics : Logical site discovery in enterprise networks based on application level measurements in peer- to-peer systemsBodvill, Jonatan January 2017 (has links)
In data intensive applications deployed in enterprise networks, especially applications utilizing peer-to-peer technology, locality is of high importance. Peers should aim to maximize data exchange with other peers where the connectivity is the best. In order to achieve this, locality information must be present which peers can base their decisions on. This information is not trivial to find as there is no readily available global knowledge of which nodes have good connectivity. Having each peer try other peers randomly until it finds good enough partners is costly and lowers the locality of the application until it converges. In this thesis a solution is presented which creates a logical topology of a peer-to-peer network, grouping peers into clusters based on their connectivity metrics. This can then be used to aid the peer-to-peer partner selection algorithm to allow for intelligent partner selection. A graph model of the system is created, where peers in the system are modelled as vertices and connections between peers are modelled as edges, with a weight in relation to the quality of the connection. The problem is then modelled as a weighted graph clustering problem which is a well-researched problem with a lot of published work tied to it. State-of-the-art graph community detection algorithms are researched, selected depending on factors such as performance and scalability, optimized for the current purpose and implemented. The results of running the algorithms on the streaming data are evaluated against known information. The results show that unsupervised graph community detection algorithm creates useful insights into networks connectivity structure and can be used in peer-to-peer contexts to find the best partners to exchange data with. / I dataintensiva applikationer i företagsnätverk, speciellt applikationer som använder sig av peer-to-peer teknologi, är lokalitet viktigt. Klienter bör försöka maximera datautbyte med andra klienter där nätverkskopplingen är som bäst. För att klienterna ska kunna göra sådana val måste information om vilka klienter som befinner sig vara vara tillgänglig som klienterna kan basera sina val på. Denna information är inte trivial att framställa då det inte finns någon färdig global information om vilka klienter som har bra uppkoppling med andra klienter och att låta varje klient prova sig fram blint tills de hittar de bästa partnerna är kostsamt och sänker applikationens lokalitet innan den konvergerar. I denna rapport presenteras en lösning som skapar en logisk vy över ett peer-to-peer nätverk, vilken grupperar klienter i kluster baserat på deras uppkopplingskvalitet. Denna vy kan sedan användas för att förbättra lokaliteten i peerto-peer applikationen. En grafmodell av systemet skapas, där klienter modelleras som hörn och kopplingar mellan klienter modelleras som kanter med en vikt i relation till uppkopplingskvaliteten. Problemet formuleras sedan som ett riktat grafklusterproblem vilket är ett väldokumenterat forskningsområde med mycket arbete publicerat kring. De mest framstående grafklusteralgoritmerna är sedan studerade, utvalda baserat på kravspecifikationer, optimerade för det aktuella problemet och implementerade. Resultaten som produceras av att algoritmerna körs på strömdata är evaluerade mot känd information. Resultaten visar att oövervakade grafklusteralgoritmer skapar användbar information kring nätverkens uppkopplingsstruktur och kan användas i peer-to-peerapplikationssammanhang för att hitta de bästa partnerna att utbyta data med.
|
15 |
Community Detection applied to Cross-Device Identity Graphs / Gemenskapsdetektering applicerades på gränsöverskridande identitetsgraferGeffrier, Valentin January 2017 (has links)
The personalization of online advertising has now become a necessity for marketing agencies. The tracking technologies such as third-party cookies gives advertisers the ability to recognize internet users across different websites, to understand their behavior and to assess their needs and their tastes. The amount of created data and interactions leads to the creation of a large cross-device identity graph that links different identifiers such as emails to different devices used on different networks. Over time, strongly connected components appear in this graph, too large to represent only the identifiers or devices of only one person or household. The aims of this project is to partition these components according to the structure of the graph and the features associated to the edges without separating identifiers used by a same person. Subsequent to this, the size reduction of these components leads to the isolation of individuals and the identifiers associated to them. This thesis presents the design of a bipartite graph from the available data, the implementation of different community detection graphs adapted to this specific case and different validation methods designed to assess the quality of our partition. Different graph metrics are then used to compare the outputs of the algorithms and we will observe how the adaptation of the algorithm to the bipartite case can lead to better results. / Anpassningen av onlineannonsering har nu blivit en nödvändighet för marknadsföringsbyråer. Spårningstekniken som cookies från tredje part ger annonsörer möjlighet att känna igen internetanvändare på olika webbplatser, för att förstå deras beteende och för att bedöma deras behov och deras smak. Mängden skapade data och interaktioner leder till skapandet av en stor identitetsgrafik för flera enheter som länkar olika identifierare, t.ex. e-postmeddelanden till olika enheter som används i olika nätverk. Över tiden visas starkt anslutna komponenter i det här diagrammet, för stora för att endast representera identifierare eller enheter av endast en person eller hushåll. Syftet med detta projekt är att partitionera dessa komponenter enligt grafens struktur och de egenskaper som är knutna till kanterna utan att separera identifierare som används av samma person. Efter detta leder storleksreduktionen av dessa komponenter till isoleringen av individer och de identifierare som är associerade med dem. Denna avhandling presenterar utformningen av en bifogad graf från tillgängliga data, genomförandet av olika samhällsdetekteringskurvor anpassade till detta specifika fall och olika valideringsmetoder som är utformade för att bedöma kvaliteten på vår partition. Olika grafvärden används då för att jämföra algoritmens utgångar och vi kommer att observera hur anpassningen av algoritmen till tvåpartsfallet kan leda till bättre resultat.
|
16 |
Algorithms For Discovering Communities In Complex NetworksBalakrishnan, Hemant 01 January 2006 (has links)
It has been observed that real-world random networks like the WWW, Internet, social networks, citation networks, etc., organize themselves into closely-knit groups that are locally dense and globally sparse. These closely-knit groups are termed communities. Nodes within a community are similar in some aspect. For example in a WWW network, communities might consist of web pages that share similar contents. Mining these communities facilitates better understanding of their evolution and topology, and is of great theoretical and commercial significance. Community related research has focused on two main problems: community discovery and community identification. Community discovery is the problem of extracting all the communities in a given network, whereas community identification is the problem of identifying the community, to which, a given set of nodes belong. We make a comparative study of various existing community-discovery algorithms. We then propose a new algorithm based on bibliographic metrics, which addresses the drawbacks in existing approaches. Bibliographic metrics are used to study similarities between publications in a citation network. Our algorithm classifies nodes in the network based on the similarity of their neighborhoods. One of the drawbacks of the current community-discovery algorithms is their computational complexity. These algorithms do not scale up to the enormous size of the real-world networks. We propose a hash-table-based technique that helps us compute the bibliometric similarity between nodes in O(m ?) time. Here m is the number of edges in the graph and ?, the largest degree. Next, we investigate different centrality metrics. Centrality metrics are used to portray the importance of a node in the network. We propose an algorithm that utilizes centrality metrics of the nodes to compute the importance of the edges in the network. Removal of the edges in ascending order of their importance breaks the network into components, each of which represent a community. We compare the performance of the algorithm on synthetic networks with a known community structure using several centrality metrics. Performance was measured as the percentage of nodes that were correctly classified. As an illustration, we model the ucf.edu domain as a web graph and analyze the changes in its properties like densification power law, edge density, degree distribution, diameter, etc., over a five-year period. Our results show super-linear growth in the number of edges with time. We observe (and explain) that despite the increase in average degree of the nodes, the edge density decreases with time.
|
17 |
A Framework of Transforming Vertex Deletion Algorithm to Edge Deletion AlgorithmWang, Nan January 2017 (has links)
No description available.
|
18 |
Avaliação de algoritmos de agrupamento em grafos para segmentação de imagens / Evaluation of graph clustering algorithms for images segmentationBelizario, Ivar Vargas 12 November 2012 (has links)
A segmentação de imagens e, em visão computacional, uma tarefa de grande importância, para a qual existem várias abordagem. A complexidade de tais abordagens está relacionada à natureza da imagem e também ao grau de precisão da segmentação, que e um conceito bastante subjetivo, normalmente associado a semelhança que apresenta a segmentaçã produzida pela visão humana. Na segmentação de imagens baseada em algoritmos de agrupamento em grafos, geralmente os pixels da imagem compôem os nós do grafo e as arestas representam a similaridade entre estes nós. Assim, a segmentação pode ser obtida por meio do agrupamento dos nós do grafo. É importante salientar, no entanto, que as técnicas de agrupamento em grafos surgiram no contexto de reconhecimento de padrões, cujo objetivo primario era o tratamento de dados diversos que não envolviam imagens. O uso de tais tecnicas para a segmentação de imagens e relativamente recente e revela alguns problemas desaadores. O primeiro deles é a deficiente escalabilidade de alguns métodos, o que impede o seu uso efetivo em imagens de altas dimensões. Outra questão é a falta de estudos que avaliam as medidas de similaridade na montagem do grafo e critérios que aferem a qualidade do agrupamento para a área específica de segmentação de imagens. Em outras palavras, faltam na literatura análises mais específicas que indiquem quais algoritmos de agrupamento em grafos são mais efetivos para a segmentação de imagens e que procurem associar (ou correlacionar) as várias medidas de similaridade e métricas de qualidade de agrupamento que produzam segmentações mais precisas. Neste trabalho é apresentada a avaliação de 6 algoritmos de agrupamento em grafos formulados em base a 3 categorias identificadas (agrupamento espectral, algoritmos de particionamento multinível e algoritmos para detectar comunidades) e aplicadas na segmentação automática de imagens de cenas naturais com grandes dimensões. Esta avaliação objetiva aferir, sobretudo, a qualidade da segmentação, a escalabilidade, o desempenho de 7 funções de similaridade formuladas, e também visa corroborar a existência da correlação entre a qualidade do agrupamento e a qualidade da segmentação. Para reduzir o esforço computacional e contribuir com a escalabilidade dos algoritmos formulados é utilizado um algoritmo de pré-processamento (SLIC) que agrupa váarios pixels da imagem em uma unica região (superpixels), o que contribui para reduzir o tamanho do grafo e, consequentemente, reduzindo o custo computacional do agrupamento. Os resultados demostram que os algoritmos formulados LP (Label Propagation) e FG (Fast Greedy) apresentam boa escalabilidade e boa qualidade de segmentação. Seis das sete funções de similaridade avaliadas apresentam um bom desempenho, independentemente do algoritmo de agrupamento empregado. É mostrado também que exites correlação entre a medida de qualidade de agrupamento conhecido como índice de silhueta e a qualidade de segmentação, ou seja, quanto maior o valor de silhueta, melhor a segmentação. A qualidade de segmentação foi avaliada quantitativamente, utilizando-se um conjunto de imagens segmentadas manualmente / Image segmentation is an important task within computer vision for which many approaches are available. The complexity of such approaches are intrinsically related with the nature of the image and also the desired accuracy aimed at. Image segmentation accuracy, however, is a subjective concept and is normally associated with how much it resembles segmentation produced by the human vision system. In graphbased clustering image segmentation algorithms, pixels are normally represented as nodes and edges convey the similarity between such nodes. Hence, segmentation may be attained by means of grouping node of a graph. It is important, though, to point out that graph-based clustering techniques rst appeared in the context of pattern recognition and its primary data source were not images. The usage of such techniques for image segmentation is a recent trend and poses some challenge issues. The first is the poor scalability that many methods exhibit, impairing its application in images of larger dimensions. Another issues is that lack of studies that assess the goodness of similarity measures employed in graph computation and also clustering quality criteria assessments for the specic area of image processing. In other words, there is no evidences in the literature on how effective graph-based clustering algorithms are for image segmentation and no studies that associate similarity functions and clustering quality metrics with image processing quality. This work presents an evaluation of six graph-based clustering algorithms according to three major categories found in the literature (spectral clustering, multi layer partitioning algorithms and community detectors) applied to automatic segmentation of image of larger dimensions. This evaluation takes into account segmentation quality, scalability, the performance of seven similarity functions and, nally, bring some insights on the correlation between clustering and segmentation quality. To reduce computational costs and enhance scalability of the methods we employ a pre processing algorithm (SLIC) which combines many pixels into one single region (superpixel). This contributes to reduce the graph size and, consequently, the cost of clustering. Results have shown that the LP (Label Propagation) and FG (Fast Greedy) algorithms exhibit good scalability and good segmentation. Six out of the seven similarity functions presented good performance, regardless the clustering algorithm. We also have shown that there is correlation between clustering quality and image segmentation quality, when the Silhouette measure is employed. That is, the higher the Silhouette values, the better the segmentation. Image segmentation quality has been quantitatively assessed by means of ground-truth data or user segmented images
|
19 |
Metaheurísticas para o problema de agrupamento de dados em grafo / Metaheuristics for the graph clustering problemNascimento, Mariá Cristina Vasconcelos 26 February 2010 (has links)
O problema de agrupamento de dados em grafos consiste em encontrar clusters de nós em um dado grafo, ou seja, encontrar subgrafos com alta conectividade. Esse problema pode receber outras nomenclaturas, algumas delas são: problema de particionamento de grafos e problema de detecção de comunidades. Para modelar esse problema, existem diversas formulações matemáticas, cada qual com suas vantagens e desvantagens. A maioria dessas formulações tem como desvantagem a necessidade da definição prévia do número de grupos que se deseja obter. Entretanto, esse tipo de informação não está contida em dados para agrupamento, ou seja, em dados não rotulados. Esse foi um dos motivos da popularização nas últimas décadas da medida conhecida como modularidade, que tem sido maximizada para encontrar partições em grafos. Essa formulação, além de não exigir a definição prévia do número de clusters, se destaca pela qualidade das partições que ela fornece. Nesta Tese, metaheurísticas Greedy Randomized Search Procedures para dois modelos existentes para agrupamento em grafos foram propostas: uma para o problema de maximização da modularidade e a outra para o problema de maximização da similaridade intra-cluster. Os resultados obtidos por essas metaheurísticas foram melhores quando comparadas àqueles de outras heurísticas encontradas na literatura. Entretanto, o custo computacional foi alto, principalmente o da metaheurística para o modelo de maximização da modularidade. Com o passar dos anos, estudos revelaram que a formulação que maximiza a modularidade das partições possui algumas limitações. A fim de promover uma alternativa à altura do modelo de maximização da modularidade, esta Tese propõe novas formulações matemáticas de agrupamento em grafos com e sem pesos que visam encontrar partições cujos clusters apresentem alta conectividade. Além disso, as formulações propostas são capazes de prover partições sem a necessidade de definição prévia do número de clusters. Testes com centenas de grafos com pesos comprovaram a eficiência dos modelos propostos. Comparando as partições provenientes de todos os modelos estudados nesta Tese, foram observados melhores resultados em uma das novas formulações propostas, que encontrou partições bastante satisfatórias, superiores às outras existentes, até mesmo para a de maximização de modularidade. Os resultados apresentaram alta correlação com a classificação real dos dados simulados e reais, sendo esses últimos, em sua maioria, de origem biológica / Graph clustering aims at identifying highly connected groups or clusters of nodes of a graph. This problem can assume others nomenclatures, such as: graph partitioning problem and community detection problem. There are many mathematical formulations to model this problem, each one with advantages and disadvantages. Most of these formulations have the disadvantage of requiring the definition of the number of clusters in the final partition. Nevertheless, this type of information is not found in graphs for clustering, i.e., whose data are unlabeled. This is one of the reasons for the popularization in the last decades of the measure known as modularity, which is being maximized to find graph partitions. This formulation does not require the definition of the number of clusters of the partitions to be produced, and produces high quality partitions. In this Thesis, Greedy Randomized Search Procedures metaheuristics for two existing graph clustering mathematical formulations are proposed: one for the maximization of the partition modularity and the other for the maximization of the intra-cluster similarity. The results obtained by these proposed metaheuristics outperformed the results from other heuristics found in the literature. However, their computational cost was high, mainly for the metaheuristic for the maximization of modularity model. Along the years, researches revealed that the formulation that maximizes the modularity of the partitions has some limitations. In order to promote a good alternative for the maximization of the partition modularity model, this Thesis proposed new mathematical formulations for graph clustering for weighted and unweighted graphs, aiming at finding partitions with high connectivity clusters. Furthermore, the proposed formulations are able to provide partitions without a previous definition of the true number of clusters. Computational tests with hundreds of weighted graphs confirmed the efficiency of the proposed models. Comparing the partitions from all studied formulations in this Thesis, it was possible to observe that the proposed formulations presented better results, even better than the maximization of partition modularity. These results are characterized by satisfactory partitions with high correlation with the true classification for the simulated and real data (mostly biological)
|
20 |
Refinamento multinível em redes complexas baseado em similaridade de vizinhança / Multilevel refinement in complex networks based on neighborhood similarityValejo, Alan Demetrius Baria 11 November 2014 (has links)
No contexto de Redes Complexas, particularmente das redes sociais, grupos de objetos densamente conectados entre si, esparsamente conectados a outros grupos, são denominados de comunidades. Detecção dessas comunidades tornou-se um campo de crescente interesse científico e possui inúmeras aplicações práticas. Nesse contexto, surgiram várias pesquisas sobre estratégias multinível para particionar redes com elevada quantidade de vértices e arestas. O objetivo dessas estratégias é diminuir o custo do algoritmo de particionamento aplicando-o sobre uma versão reduzida da rede original. Uma possibilidade dessa estratégia, ainda pouco explorada, é utilizar heurísticas de refinamento local para melhorar a solução final. A maioria das abordagens de refinamento exploram propriedades gerais de redes complexas, tais como corte mínimo ou modularidade, porém, não exploram propriedades inerentes de domínios específicos. Por exemplo, redes sociais são caracterizadas por elevado coeficiente de agrupamento e assortatividade significativa, consequentemente, maximizar tais características pode conduzir a uma boa solução e uma estrutura de comunidades bem definida. Motivado por essa lacuna, neste trabalho é proposto um novo algoritmo de refinamento, denominado RSim, que explora características de alto grau de transitividade e assortatividade presente em algumas redes reais, em particular em redes sociais. Para isso, adotou-se medidas de similaridade híbridas entre pares de vértices, que utilizam os conceitos de vizinhança e informações de comunidades para interpretar a semelhança entre pares de vértices. Uma análise comparativa e sistemática demonstrou que o RSim supera os algoritmos de refinamento habituais em redes com alto coeficiente de agrupamento e assortatividade. Além disso, avaliou-se o RSim em uma aplicação real. Nesse cenário, o RSim supera todos os métodos avaliado quanto a eficiência e eficácia, considerando todos os conjuntos de dados selecionados. / In the context of complex networks, particularly social networks, groups of densely interconnected objects, sparsely linked to other groups are called communities. Detection of these communities has become a field of increasing scientific interest and has numerous practical applications. In this context, several studies have emerged on multilevel strategies for partitioning networks with high amount of vertices and edges. The goal of these strategies is to reduce the cost of partitioning algorithm by applying it on a reduced version of the original network. The possibility for this strategy, yet little explored, is to apply local refinement heuristics to improve the final solution. Most refinement approaches explore general properties of complex networks, such as minimum cut or modularity, however, do not exploit inherent properties of specific domains. For example, social networks are characterized by high clustering coefficient and significant assortativity, hence maximize such characteristics may lead to a good solution and a well-defined community structure. Motivated by this gap, in this thesis, we propose a new refinement algorithm, called RSim, which exploits characteristics of high degree of transitivity and assortativity present in some real networks, particularly social networks. For this, we adopted hybrid similarity measures between pairs of vertices, using the concepts of neighborhood and community information to interpret the similarity between pairs of vertices. A systematic and comparative analysis showed that the RSim statistically outperforms usual refinement algorithms in networks with high clustering coefficient and assortativity. In addition, we assessed the RSim in a real application. In this scenario, the RSim surpasses all evaluated methods in efficiency and effectiveness, considering all the selected data sets.
|
Page generated in 0.1045 seconds