
Regression analysis of big count data via a-optimal subsampling

Indiana University-Purdue University Indianapolis (IUPUI) / There are two computational bottlenecks in Big Data analysis: (1) the data are too large
for a desktop to store, and (2) the computing task takes too long to finish.
While the Divide-and-Conquer approach easily breaks the first bottleneck, the Subsampling
approach breaks both of them simultaneously.
Uniform sampling and nonuniform sampling (in particular, Leverage Scores sampling)
are frequently used in recent developments of fast randomized algorithms. However,
both approaches, as Peng and Tan (2018) have demonstrated, are not effective in extracting
important information from data.
In this thesis, we conduct regression analysis for big count data via A-optimal subsampling.
We derive A-optimal sampling distributions by minimizing the trace of certain dispersion matrices
in general estimating equations (GEE). We point out that computing the A-optimal distributions takes the
same running time as computing the full data M-estimator. To compute the distributions quickly,
we propose the A-optimal Scoring Algorithm, which is implementable by parallel computing,
sequentially updatable for stream data, and faster than the
full data M-estimator. We present asymptotic normality for the estimates in GEEs and
in generalized count regression.
A data truncation method is introduced.
We conduct extensive simulations to evaluate the numerical performance of the proposed sampling
distributions. We apply the proposed A-optimal subsampling method to analyze
two real count data sets, the Bike Sharing data and the Blog Feedback data.
Our results in both simulations and real data sets indicate that
the A-optimal distributions substantially outperform the uniform distribution
and have faster running times than the full data M-estimators.
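
The subsampling idea summarized above can be illustrated with a minimal sketch. This is not the thesis's A-optimal Scoring Algorithm or its truncation method. It assumes a Poisson log-linear model, a uniform pilot subsample, and sampling probabilities proportional to |y_i - mu_i| * ||M^{-1} x_i||, a form commonly used for A-optimal subsampling in generalized linear models. The function names, the pilot size `r0`, and the mixing of the probabilities with the uniform distribution (to keep the inverse-probability weights bounded) are all illustrative assumptions.

```python
import numpy as np

def fit_poisson(X, y, w=None, n_iter=50, tol=1e-10):
    """Weighted Poisson regression (log link) fitted by Newton-Raphson."""
    n, p = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Clip the linear predictor to avoid overflow in exp.
        mu = np.exp(np.clip(X @ beta, -30, 30))
        grad = X.T @ (w * (y - mu))            # weighted score
        hess = X.T @ (X * (w * mu)[:, None])   # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def a_optimal_probs(X, y, beta_pilot, mix=0.1):
    """Sampling probabilities proportional to |y_i - mu_i| * ||M^{-1} x_i||.

    `mix` blends in the uniform distribution; this stabilization is an
    assumption of this sketch, not something stated in the abstract.
    """
    n = len(y)
    mu = np.exp(np.clip(X @ beta_pilot, -30, 30))
    M = X.T @ (X * mu[:, None]) / n
    norms = np.linalg.norm(np.linalg.solve(M, X.T), axis=0)
    scores = np.abs(y - mu) * norms
    pi = scores / scores.sum()
    return (1.0 - mix) * pi + mix / n

def subsample_estimate(X, y, r, r0=500, seed=0):
    """Two-step estimate: uniform pilot, then a weighted nonuniform subsample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pilot = rng.choice(n, size=r0, replace=True)
    beta_pilot = fit_poisson(X[pilot], y[pilot])
    pi = a_optimal_probs(X, y, beta_pilot)
    idx = rng.choice(n, size=r, replace=True, p=pi)
    # Inverse-probability weights correct for the nonuniform sampling.
    return fit_poisson(X[idx], y[idx], w=1.0 / (n * pi[idx]))

# Synthetic count data, for illustration only.
rng = np.random.default_rng(1)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))

beta_full = fit_poisson(X, y)
beta_sub = subsample_estimate(X, y, r=2000)
print("full data:", beta_full)
print("subsample:", beta_sub)
```

Note that this sketch still computes the probabilities over the full data, so it does not show the running-time gains the abstract reports for the proposed algorithm; it only shows how a nonuniform sampling distribution and inverse-probability weights fit together.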

Identifier: oai:union.ndltd.org:IUPUI/oai:scholarworks.iupui.edu:1805/16870
Date: 19 July 2018
Creators: Zhao, Xiaofeng
Contributors: Tan, Fei; Peng, Hanxiang
Source Sets: Indiana University-Purdue University Indianapolis
Language: en_US
Detected Language: English
Type: Thesis
