This dissertation investigates the community structure of web-like networks (i.e., large, random, real-life networks such as the World Wide Web and the Internet). Recently, it has been shown that many such networks have a locally dense and globally sparse structure with certain small, dense subgraphs occurring much more frequently than they do in the classical Erdös-Rényi random graphs. This peculiarity--which is commonly referred to as community structure--has been observed in seemingly unrelated networks such as the Web, email networks, citation networks, biological networks, etc. The pervasiveness of this phenomenon has led many researchers to believe that such cohesive groups of nodes might represent meaningful entities. For example, in the Web such tightly-knit groups of nodes might represent pages with a common topic, geographical location, etc., while in the neural networks they might represent evolved computational units. The notion of community has emerged in an effort to formalize the empirical observation of the locally dense globally sparse structure of web-like networks. In the broadest sense, a community in a web-like network is defined as a group of nodes that induces a dense subgraph which is sparsely linked with the rest of the network. Due to a wide array of envisioned applications, ranging from crawlers and search engines to network security and network compression, there has recently been a widespread interest in finding efficient community-mining algorithms. In this dissertation, the community structure of web-like networks is investigated by a combination of analytical and computational techniques: First, we consider the problem of modeling the web-like networks. In the recent years, many new random graph models have been proposed to account for some recently discovered properties of web-like networks that distinguish them from the classical random graphs. The vast majority of these random graph models take into account only the addition of new nodes and edges. Yet, several empirical observations indicate that deletion of nodes and edges occurs frequently in web-like networks. Inspired by such observations, we propose and analyze two dynamic random graph models that combine node and edge addition with a uniform and a preferential deletion of nodes, respectively. In both cases, we find that the random graphs generated by such models follow power-law degree distributions (in agreement with the degree distribution of many web-like networks). Second, we analyze the expected density of certain small subgraphs--such as defensive alliances on three and four nodes--in various random graphs models. Our findings show that while in the binomial random graph the expected density of such subgraphs is very close to zero, in some dynamic random graph models it is much larger. These findings converge with our results obtained by computing the number of communities in some Web crawls. Next, we investigate the computational complexity of the community-mining problem under various definitions of community. Assuming the definition of community as a global defensive alliance, or a global offensive alliance we prove--using transformations from the dominating set problem--that finding optimal communities is an NP-complete problem. These and other similar complexity results coupled with the fact that many web-like networks are huge, indicate that it is unlikely that fast, exact sequential algorithms for mining communities may be found. To handle this difficulty we adopt an algorithmic definition of community and a simpler version of the community-mining problem, namely: find the largest community to which a given set of seed nodes belong. We propose several greedy algorithms for this problem: The first proposed algorithm starts out with a set of seed nodes--the initial community--and then repeatedly selects some nodes from community's neighborhood and pulls them in the community. In each step, the algorithm uses clustering coefficient--a parameter that measures the fraction of the neighbors of a node that are neighbors themselves--to decide which nodes from the neighborhood should be pulled in the community. This algorithm has time complexity of order , where denotes the number of nodes visited by the algorithm and is the maximum degree encountered. Thus, assuming a power-law degree distribution this algorithm is expected to run in near-linear time. The proposed algorithm achieved good accuracy when tested on some real and computer-generated networks: The fraction of community nodes classified correctly is generally above 80% and often above 90% . A second algorithm based on a generalized clustering coefficient, where not only the first neighborhood is taken into account but also the second, the third, etc., is also proposed. This algorithm achieves a better accuracy than the first one but also runs slower. Finally, a randomized version of the second algorithm which improves the time complexity without affecting the accuracy significantly, is proposed. The main target application of the proposed algorithms is focused crawling--the selective search for web pages that are relevant to a pre-defined topic.
Identifer | oai:union.ndltd.org:ucf.edu/oai:stars.library.ucf.edu:etd-1535 |
Date | 01 January 2005 |
Creators | Cami, Aurel |
Publisher | STARS |
Source Sets | University of Central Florida |
Language | English |
Detected Language | English |
Type | text |
Format | application/pdf |
Source | Electronic Theses and Dissertations |
Page generated in 0.0028 seconds