Global ETD Search

Return to search

Distributed Bootstrap for Massive Data

Modern massive data, with enormous sample size and tremendous dimensionality, are usually stored and processed using a cluster of nodes in a master-worker architecture. A shortcoming of this architecture is that inter-node communication can be over a thousand times slower than intra-node computation, which makes communication efficiency a desirable feature when developing distributed learning algorithms. In this dissertation, we tackle this challenge and propose communication-efficient bootstrap methods for simultaneous inference in the distributed computational framework.
 
First, we propose two generic distributed bootstrap methods, \texttt{k-grad} and \texttt{n+k-1-grad}, which apply multiplier bootstrap at the master node on the gradients communicated across nodes. Based on them, we develop a communication-efficient method of producing an $\ell_\infty$-norm confidence region using distributed data with dimensionality not exceeding the local sample size. Our theory establishes the communication efficiency by providing a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency and showing that $\tau_{\min}$ only increases logarithmically with the number of workers and the dimensionality. Our simulation studies validate our theory.
 
Then, we extend \texttt{k-grad} and \texttt{n+k-1-grad} to the high-dimensional regime and propose a distributed bootstrap method for simultaneous inference on high-dimensional distributed data. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset.

10.25394/pgs.19664184.v1

Statistics

Distributed Learning

High-dimensional Inference

Multiplier Bootstrap

De-biased LASSO

Simultaneous inference

Identifer	oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/19664184
Date	27 April 2022
Creators	Yang Yu (12466911)
Source Sets	Purdue University
Detected Language	English
Type	Text, Thesis
Rights	CC BY 4.0
Relation	https://figshare.com/articles/thesis/Distributed_Bootstrap_for_Massive_Data/19664184

Page generated in 0.0025 seconds

Distributed Bootstrap for Massive Data

Description

Links & Downloads

Tags

Additional Fields