Releasing synthetic data in place of observed values is a method of statistical disclosure control for the public dissemination of survey data collected by national statistical agencies. The overall goal is to limit the risk of disclosure of survey respondents' identities or sensitive attributes, but simultaneously retain enough detail in the synthetic data to preserve the inferential conclusions drawn on the target population, in potential future legitimate statistical analyses. This thesis presents three new research contributions in the analysis and application of synthetic data. Firstly, to understand differences in types of input between the imputer, typically an agency, and the analyst, we present a definition of congeniality in the context of multiple imputation for synthetic data. Our definition is motivated by common examples of uncongeniality, specifically ignorance of the original survey design in analysis of fully synthetic data, and situations when the imputation model and analysis procedure condition upon different sets of records. We conclude that our definition provides a framework to assist the imputer to identify the source of a discrepancy between observed and synthetic data analytic results. Motivated by our definition, we derive an alternative approach to synthetic data inference, to recover the observed data set sampling distribution of sufficient statistics given the synthetic data. Secondly, we address the problem of negative method-of-moments variance estimates given fully synthetic data, which may be produced with the current inferential methods. We apply the adjustment for density maximization (ADM) method to variance estimation, and demonstrate using ADM as an alternative approach to produce positive variance estimates. Thirdly, we present a new application of synthetic data techniques to confidentialize survey data from a large-scale healthcare study. To date, application of synthetic data techniques to healthcare survey data is rare. We discuss identification of variables for synthesis, specification of imputation models, and working measures of disclosure risk assessment. Following comparison of observed and synthetic data analytic results based on published studies, we conclude that use of synthetic data for our healthcare survey is best suited for exploratory data analytic purposes. / Statistics
Identifer | oai:union.ndltd.org:harvard.edu/oai:dash.harvard.edu:1/9527319 |
Date | 07 September 2012 |
Creators | Loong, Bronwyn |
Contributors | Rubin, Donald B. |
Publisher | Harvard University |
Source Sets | Harvard University |
Language | en_US |
Detected Language | English |
Type | Thesis or Dissertation |
Rights | open |
Page generated in 0.0016 seconds