
Multiple Imputation Methods for Large Multi-Scale Data Sets with Missing or Suppressed Values

Without proper treatment, direct analysis of data sets with missing or suppressed values can lead to biased results. Among missing-data handling methods, multiple imputation (MI) methods are regarded as the state of the art. The multiply imputed data sets can, on the one hand, generate unbiased estimates and, on the other hand, provide a reliable way to adjust standard errors for missing-data uncertainty. Despite these advantages, existing MI methods perform poorly on complicated multi-scale data, especially when the data set is large. The large data set of interest to us is the Quarterly Census of Employment and Wages (QCEW), which records the employment and wages of every establishment in the US. These detailed data are aggregated up through three scales: industry structure, geographic levels, and time. The QCEW data are as large as 210 × 2217 × 3193 ≈ 1.5 billion observations. For privacy reasons the data are heavily suppressed, and this missingness can appear anywhere in this complicated structure. The existing methods are either accurate or fast, but not both, in handling the QCEW data. Our goal is to develop an MI method that can handle the missing-value problem of large multi-scale data sets both accurately and efficiently. This research addresses this goal in three directions. First, I improve the accuracy of the fastest MI method, the Bootstrapping based Expectation Maximization (EMB) algorithm, by equipping it with a Multi-Scale Updating step. This updating step uses the information from the singular covariance matrix to take the multi-scale structure into account and to simulate more accurate imputations. Second, I improve the MI method by using a quasi-Monte Carlo technique to accelerate its convergence. Finally, I develop a Sequential Parallel Imputation method that can detect the structure and missing pattern of large data sets and automatically partition them into small data sets. The resulting Parallel Sequential Multi-Scale Bootstrapping Expectation Maximization Multiple Imputation (PSI-MBEMMI) method is accurate, very fast, and can be applied to very large data sets.

A Dissertation submitted to the Department of Economics in partial fulfillment of the requirements for the degree of Doctor of Philosophy. / Summer Semester 2018. / June 27, 2018. / Bayesian Inference, Bootstrapping, Expectation Maximization, Large Data Analysis, Multiple Imputation, Quasi-Monte Carlo / Includes bibliographical references. / Paul Beaumont, Professor Directing Dissertation; Dennis Duke, University Representative; Stefan Norrbin, Committee Member; Giray Ökten, Committee Member; Javier Cano-Urbina, Committee Member.
Contributors: Cao, Jian (author), Beaumont, Paul M. (professor directing dissertation), Duke, D. W. (university representative), Norrbin, Stefan C. (committee member), Ökten, Giray (committee member), Cano-Urbina, Javier (committee member), Florida State University (degree granting institution), College of Social Sciences and Public Policy (degree granting college), Department of Economics (degree granting department)
Publisher: Florida State University
Source Sets: Florida State University
Language: English
Detected Language: English
Type: Text, doctoral thesis
Format: 1 online resource (138 pages), computer, application/pdf
