Return to search

Database alignment: fundamental limits and multiple databases setting

In modern data analysis, privacy is a critical concern when dealing with user-related databases. Ensuring user anonymity while extracting meaningful correlations from the data poses a significant challenge, especially when side information can potentially enable de-anonymization. This dissertation explores the standard information-theoretic problems in the correlated databases model. We define a "database" as a simple probabilistic model that contains a random feature vector for each user, with user labels shuffled to ensure anonymity.

We first investigate correlation detection between two databases, formulating it as a composite binary hypothesis testing problem. Under the alternate hypothesis, there exists an unknown permutation that aligns users in the first database with those in the second, thereby matching correlated entries. The null hypothesis assumes that the databases are independent, with no such alignment. For the special case of Gaussian feature vectors, we derive both upper and lower bounds on the correlation required to achieve or fail to achieve this statistical problem. Our results are tight up to a constant factor when the feature length exceeds the number of users. Regarding our achievability boundary, we draw connections to the user labeling recovery problem, highlighting significant parallels and insights. Additionally, for the two databases model, we initially examine the potential gaps in the statistical analysis conducted thus far for the large number of users regime by drawing parallels with similar problems in the literature. Motivated by these comparisons, we propose a novel approach to address the detection problem, focusing on the hidden permutation structure and intricate dependencies characterizing these relationships.

Building on our research, we present a comprehensive model for handling multiple correlated databases. In this multiple-databases setting, we address another fundamental information-theoretic problem: user label recovery. We evaluate the performance of the typicality matching estimator in relation to the asymptotic behavior of feature length, demonstrating an impossibility result that holds up to a multiplicative constant factor. This exploration into multiple databases not only broadens the scope of our study but also underscores the complexity and richness of correlation detection in a more generalized framework.

In conclusion, we summarize the statistical gaps identified in our findings, exploring their possible origins. We also discuss the limitations of our simple probabilistic model and propose strategies to address them. Finally, we outline potential future research directions, including the information-theoretic problem of change detection, which remains an open area of significant interest.

Identiferoai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/49260
Date13 September 2024
CreatorsK, Zeynep
ContributorsNazer, Bobak
Source SetsBoston University
Languageen_US
Detected LanguageEnglish
TypeThesis/Dissertation
RightsAttribution 4.0 International, http://creativecommons.org/licenses/by/4.0/

Page generated in 0.0024 seconds