This dissertation builds on two lines of research that are related to the task of community detection on large-scale network data.
Our first contribution is a novel generator for large-scale networks with overlapping communities. Synthetic generators are essential for algorithm testing and simulation studies for networks, as these data are scarce and constantly evolving. We propose a generator based on a flexible random graph model that allows for the control of two complementary measures of centrality -- the degree centrality and the eigencentrality. For an arbitrary centrality target and community structure, we study the problem of recovering the model parameters that enforce such targets in expectation. We find that this problem always admits a solution in the parameter space, which is also unique for large graphs. We propose to recover this solution via a properly initialized multivariate-Newton Raphson algorithm. The resulting benchmark generator is able to simulate networks with a billion edges and hundreds of millions of nodes in 30 seconds, while reproducing a wide spectrum of network topologies -- including assortative mixing and power-law centrality distributions.
Our second contribution involves variance reduction techniques for stochastic variational inference (SVI). SVI scales approximate inference to large-scale data -- including massive networks -- via stochastic optimization. SVI is efficient because, at each iteration, it only uses a random minibatch of the data to produce a noisy estimate of the gradient. However, such estimates can suffer from high variance, which slows down convergence. One strategy to reduce the variance of the gradient is to use importance sampling, biasing the distribution of data for each minibatch towards the data points that are most influential to the inference at hand. Here, we develop an importance sampling strategy for SVI. Our adaptive stochastic variational inference algorithm (AdaSVI) reweights the sampling distribution to minimize the variance of the stochastic natural gradient. We couple the importance sampling strategy with an adaptive learning rate providing a parameter-free stochastic optimization algorithm where the only user input required is the minibatch size. We study AdaSVI on a matrix factorization model and find that it significantly improves SVI, leading to faster convergence on synthetic data.
Identifer | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/6t26-m277 |
Date | January 2022 |
Creators | Grande, Alessandro Antonio |
Source Sets | Columbia University |
Language | English |
Detected Language | English |
Type | Theses |
Page generated in 0.0021 seconds