Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction being performed on diverse data sets, including protein-protein interaction networks, social networks and citation networks. A frequent approach to approaching these tasks begins by learning an Euclidean embedding of the network, to which machine learning algorithms developed for vector-valued data are applied.
For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. This distinguishes it from the setting of traditional i.i.d data where there is essentially only one way of subsampling the data - selecting the data points uniformly and without replacement. Despite the strong empirical performance when using embeddings produced in such a manner, they are not well understood theoretically, particularly with regards to the role of the sampling scheme.
Here, we develop a unifying framework which encapsulates representation learning methods for networks which are trained via performing gradient updates obtained by subsampling the network, including random-walk based approaches such as node2vec. In particular, we prove, under the assumption that the network has an exchangeable law, that the distribution of the learned embedding vectors asymptotically decouples. We characterize the asymptotic distribution of the learned embedding vectors, and give the corresponding rates of convergence, which depend on factors such as the sampling scheme, the choice of loss function, and the choice of embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks; in particular, we apply our results to argue that the embedding vectors produced by node2vec can be used to perform weakly consistent community detection.
Identifer | oai:union.ndltd.org:columbia.edu/oai:academiccommons.columbia.edu:10.7916/cv4k-2t44 |
Date | January 2022 |
Creators | Davison, Andrew |
Source Sets | Columbia University |
Language | English |
Detected Language | English |
Type | Theses |
Page generated in 0.0026 seconds