Online Social Networks (OSNs) such as Facebook, Twitter, and YouTube are among the most popular sites on the Internet. Billions of users are connected through these sites, building strong and effective communities to share views and ideas and to make recommendations. Analyzing the structure and key characteristics of these large social graphs, in order to improve current systems and design new applications, therefore requires choosing an appropriate user base from billions of people. For this reason, node sampling techniques play an important role in the study of large-scale social networks. As a basic requirement, the sampled nodes and their links should possess statistical features similar to those of the original network; otherwise, conclusions drawn from the sampled network may not be appropriate for the entire population. Hence, good sampling strategies are key to many online social network applications. For instance, before a new product, or a new feature of an existing product, is introduced to the online social network community, it has to be exposed to only a small set of users who are carefully chosen to represent the complete set of users. To this end, different random walk-based sampling techniques have been introduced to produce samples of nodes that are not only internally well-connected but also capture the statistical features of the whole network. Traditionally, walk-based techniques place no restriction on the number of times a node can be revisited while sampling. This may lead to an inefficient sampling method, because the walk may get "stuck" at a small number of high-degree nodes without being able to reach the rest of the nodes. Even after a large number of hops, such a random walk may fail to produce a sampled network that captures the statistical features of the entire network.
In this thesis, we propose two walk-based sampling techniques to address the above problem: K-Avoiding Random Walk (KARW) and Neighborhood-Avoiding Random Walk (NARW). With KARW, the number of times that a node can be revisited is limited to a given number K. With NARW, the random walk proceeds in a "jump" fashion, since each step of the walk starts from a randomly chosen node outside the N-hop neighborhood of the current node. By avoiding the level-N neighborhood of the current node, NARW is expected to reach the other nodes in the network quickly. We apply these techniques to construct multiple independent subgraphs from a social graph consisting of 63K users with around a million connections between them, collected from a Facebook dataset. By simulating our proposed strategies, we collect performance metrics and compare the results with current state-of-the-art sampling techniques (Uniform Random Sampling, Random Walk, and Metropolis-Hastings Random Walk). We also calculate some key statistical features (namely degree distribution, betweenness centrality, closeness centrality, modularity, and clustering coefficient) of the sampled graphs to assess how well their network structures represent the original social graph. / Graduate / 0984 / shahed.anwar@gmail.com
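The two walks described above can be illustrated with a minimal sketch. It is not the thesis implementation. The function names (`karw`, `narw`, `n_hop_neighborhood`), the adjacency-dictionary graph representation, and the stopping rule when no eligible node remains are all assumptions made for illustration. The sketch also assumes that K bounds the total number of visits to a node, and that NARW picks its next node uniformly at random from outside the N-hop neighborhood.

```python
import random
from collections import deque


def karw(adj, start, k, steps, rng):
    """K-avoiding walk sketch: each node is visited at most k times (assumed reading of K)."""
    visits = {start: 1}
    path = [start]
    current = start
    for _ in range(steps):
        # Only neighbors that still have visit budget left are eligible.
        candidates = [v for v in adj[current] if visits.get(v, 0) < k]
        if not candidates:
            break  # assumption: stop once every neighbor has used its budget
        current = rng.choice(candidates)
        visits[current] = visits.get(current, 0) + 1
        path.append(current)
    return path


def n_hop_neighborhood(adj, source, n):
    """Return the nodes within n hops of source (source included), found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == n:
            continue  # do not expand beyond n hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def narw(adj, start, n, steps, rng):
    """Neighborhood-avoiding walk sketch: jump to a random node outside the n-hop neighborhood."""
    nodes = sorted(adj)
    path = [start]
    current = start
    for _ in range(steps):
        near = n_hop_neighborhood(adj, current, n)
        candidates = [v for v in nodes if v not in near]
        if not candidates:
            break  # assumption: stop if the neighborhood covers the whole graph
        current = rng.choice(candidates)
        path.append(current)
    return path
```

Both functions take a seeded `random.Random` instance so that a run can be reproduced. In this sketch NARW recomputes the neighborhood with a breadth-first search at every step, which is simple but would be costly on a graph with tens of thousands of nodes; caching or bounding the search would be a natural refinement.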
Identifier | oai:union.ndltd.org:uvic.ca/oai:dspace.library.uvic.ca:1828/6785 |
Date | 05 November 2015 |
Creators | Anwar, Shahed |
Contributors | Wu, Kui; Pan, Jianping |
Source Sets | University of Victoria |
Language | English |
Detected Language | English |
Type | Thesis |
Rights | Available to the World Wide Web |