1.
Analyzing The Community Structure Of Web-like Networks: Models And Algorithms
Cami, Aurel, 01 January 2005
This dissertation investigates the community structure of web-like networks, i.e., large, random, real-life networks such as the World Wide Web and the Internet. It has recently been shown that many such networks have a locally dense, globally sparse structure, with certain small, dense subgraphs occurring much more frequently than they do in classical Erdős-Rényi random graphs. This peculiarity--commonly referred to as community structure--has been observed in seemingly unrelated networks such as the Web, email networks, citation networks, and biological networks. The pervasiveness of the phenomenon has led many researchers to believe that such cohesive groups of nodes might represent meaningful entities: in the Web, tightly-knit groups of nodes might represent pages with a common topic or geographical location, while in neural networks they might represent evolved computational units.

The notion of community has emerged in an effort to formalize the empirically observed locally dense, globally sparse structure of web-like networks. In the broadest sense, a community in a web-like network is a group of nodes that induces a dense subgraph which is sparsely linked with the rest of the network. Owing to a wide array of envisioned applications, ranging from crawlers and search engines to network security and network compression, there has recently been widespread interest in efficient community-mining algorithms. In this dissertation, the community structure of web-like networks is investigated with a combination of analytical and computational techniques.

First, we consider the problem of modeling web-like networks. In recent years, many new random graph models have been proposed to account for recently discovered properties of web-like networks that distinguish them from classical random graphs. The vast majority of these models allow only the addition of new nodes and edges, yet several empirical observations indicate that deletion of nodes and edges also occurs frequently in web-like networks. Motivated by such observations, we propose and analyze two dynamic random graph models that combine node and edge addition with uniform and preferential deletion of nodes, respectively. In both cases, we find that the generated random graphs follow power-law degree distributions, in agreement with the degree distributions observed in many web-like networks.

Second, we analyze the expected density of certain small subgraphs, such as defensive alliances on three and four nodes, in various random graph models. Our findings show that while in the binomial random graph the expected density of such subgraphs is very close to zero, in some dynamic random graph models it is much larger. These findings agree with our counts of communities in several Web crawls.

Next, we investigate the computational complexity of the community-mining problem under various definitions of community. Taking a community to be a global defensive alliance or a global offensive alliance, we prove, via transformations from the dominating set problem, that finding optimal communities is NP-complete. These and similar complexity results, coupled with the sheer size of many web-like networks, make it unlikely that fast, exact sequential algorithms for mining communities can be found.
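Although optimizing over these alliance-based definitions is NP-complete, verifying them for a given node set is straightforward. The following minimal Python checker makes the global defensive alliance notion concrete; it assumes the standard alliance definitions and graphs stored as dictionaries mapping each node to its set of neighbors (an illustrative representation, not the dissertation's):

```python
def is_global_defensive_alliance(adj, S):
    """Check whether node set S is a global defensive alliance:
    (1) S dominates the graph: every node outside S has a neighbor in S;
    (2) every v in S has, counting itself, at least as many neighbors
        inside S as outside S (each member can 'defend' itself).
    `adj` maps each node to its set of neighbors."""
    S = set(S)
    for v in adj:
        if v not in S and not (adj[v] & S):
            return False                      # domination fails
    for v in S:
        inside = len(adj[v] & S) + 1          # +1 counts v itself
        outside = len(adj[v] - S)
        if inside < outside:
            return False                      # defense fails at v
    return True
```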
To cope with this difficulty, we adopt an algorithmic definition of community and a simpler version of the community-mining problem: find the largest community to which a given set of seed nodes belongs. We propose several greedy algorithms for this problem.

The first algorithm starts out with a set of seed nodes--the initial community--and repeatedly selects nodes from the community's neighborhood and pulls them into the community. In each step, the algorithm uses the clustering coefficient--the fraction of a node's neighbors that are themselves neighbors of one another--to decide which nodes from the neighborhood should be pulled in. The algorithm has time complexity of order O(n·Δ), where n denotes the number of nodes visited by the algorithm and Δ is the maximum degree encountered; thus, assuming a power-law degree distribution, it is expected to run in near-linear time. The algorithm achieved good accuracy when tested on real and computer-generated networks: the fraction of community nodes classified correctly is generally above 80% and often above 90%.

A second algorithm is based on a generalized clustering coefficient that takes into account not only the first neighborhood of a node but also the second, the third, and so on. This algorithm achieves better accuracy than the first but runs slower. Finally, we propose a randomized version of the second algorithm that improves the running time without significantly affecting accuracy. The main target application of the proposed algorithms is focused crawling, the selective search for web pages relevant to a pre-defined topic.
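A minimal Python sketch of the greedy seed-expansion idea, again assuming adjacency sets; the fixed clustering-coefficient threshold and the size cap are illustrative stand-ins for the dissertation's actual selection and stopping rules:

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves adjacent."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    return links / (k * (k - 1) / 2)

def grow_community(adj, seeds, max_size=100, threshold=0.5):
    """Greedy seed expansion: repeatedly pull into the community those
    neighborhood (frontier) nodes whose clustering coefficient exceeds
    a threshold; stop when no candidate qualifies or the cap is hit."""
    community = set(seeds)
    while len(community) < max_size:
        frontier = {u for v in community for u in adj[v]} - community
        new = [u for u in frontier
               if clustering_coefficient(adj, u) >= threshold]
        if not new:
            break
        community.update(new)
    return community
```

For example, grow_community(adj, {0, 1}) expands the two seeds by absorbing frontier nodes that sit in locally dense neighborhoods; because only the current frontier is ever inspected, the cost is governed by the number of nodes visited and their degrees.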
2.
Topics in living cell multiphoton laser scanning microscopy (MPLSM) image analysis
Zhang, Weimin, 30 October 2006
Multiphoton laser scanning microscopy (MPLSM) is an advanced fluorescence imaging technology that produces less noisy microscope images and minimizes damage to living tissue. The MPLSM images in this research show dehydroergosterol (DHE, a fluorescent sterol whose behavior closely mimics that of cholesterol in lipoproteins and membranes) in the plasma membrane region of living cells. The objective is to use statistical image analysis to describe how cholesterol is distributed on a living cell's membrane. The statistical methods applied in this research include image segmentation/classification and spatial analysis. For image segmentation, we design a supervised learning method that combines a smoothing technique with rank statistics; this approach is especially useful when only very limited information is available about the classes to be segmented. We also apply unsupervised learning methods to the image data.
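A minimal sketch of the supervised smoothed-rank idea, assuming a handful of labeled foreground and background pixels are available; the window size and the midpoint threshold are illustrative choices rather than the dissertation's exact procedure:

```python
import numpy as np
from scipy.ndimage import uniform_filter
from scipy.stats import rankdata

def segment_by_smoothed_ranks(image, fg_pixels, bg_pixels, window=5):
    """Replace intensities by their ranks (robust to the intensity
    scale), smooth the rank image over a local window, then threshold
    midway between the mean smoothed ranks of the labeled foreground
    and background pixels. Returns a boolean foreground mask."""
    ranks = rankdata(image.ravel()).reshape(image.shape)
    smooth = uniform_filter(ranks, size=window)
    fg = np.mean([smooth[r, c] for r, c in fg_pixels])
    bg = np.mean([smooth[r, c] for r, c in bg_pixels])
    cutoff = 0.5 * (fg + bg)
    return smooth >= cutoff if fg > bg else smooth < cutoff
```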
For the spatial analysis of the image data, we explore the spatial correlation of the segmented data with a Monte Carlo test. This analysis shows that the distribution of DHE exhibits a spatially aggregated pattern.
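One common form of such a Monte Carlo test compares the observed mean nearest-neighbor distance against repeated simulations of complete spatial randomness; aggregated patterns yield unusually small distances. The sketch below assumes point coordinates rescaled to the unit square; the test statistic and observation window are illustrative choices, not necessarily the dissertation's:

```python
import numpy as np

def mean_nn_distance(points):
    """Mean distance from each point to its nearest neighbor."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

def csr_test(points, n_sim=999, rng=None):
    """One-sided Monte Carlo test against complete spatial randomness
    (CSR) on the unit square; small p-values indicate aggregation."""
    rng = np.random.default_rng(rng)
    obs = mean_nn_distance(points)
    sims = [mean_nn_distance(rng.random(points.shape))
            for _ in range(n_sim)]
    # Proportion of simulations at least as aggregated as the data.
    return (1 + sum(s <= obs for s in sims)) / (n_sim + 1)
```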
We fit two aggregated point pattern models to the data: an area-interaction process model and a Poisson cluster process model. For the area-interaction process model, we design algorithms for the maximum pseudo-likelihood estimator and the Monte Carlo maximum likelihood estimator in a lattice data setting. For the Poisson cluster process, parameters are estimated with the method for implicit statistical models. A group of simulation studies shows that the Monte Carlo maximum likelihood estimation method produces consistent parameter estimates, and goodness-of-fit tests show that neither model can be rejected. We propose to use the area-interaction process model in further research.
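Estimation for the Poisson cluster model is simulation-based because its likelihood is intractable. The sketch below simulates a Thomas process, one standard Poisson cluster process used here as an illustrative stand-in for the fitted model, on the unit square; this is the forward model an implicit-statistical-model estimator would call repeatedly while matching simulated summaries to the data:

```python
import numpy as np

def simulate_thomas(kappa, mu, sigma, rng=None):
    """Simulate a Thomas process on the unit square: Poisson(kappa)
    parent points, each with a Poisson(mu) number of offspring scattered
    Normal(0, sigma^2) around it; only the offspring are returned."""
    rng = np.random.default_rng(rng)
    parents = rng.random((rng.poisson(kappa), 2))
    clusters = [p + rng.normal(0.0, sigma, (rng.poisson(mu), 2))
                for p in parents]
    pts = np.concatenate(clusters) if clusters else np.empty((0, 2))
    inside = ((pts >= 0.0) & (pts <= 1.0)).all(axis=1)  # clip to window
    return pts[inside]

# One synthetic aggregated pattern: ~20 clusters of ~10 tight points.
pattern = simulate_thomas(kappa=20, mu=10, sigma=0.02, rng=1)
```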