Spelling suggestions: "subject:"sanitizing data"" "subject:"sanitization data""
1 |
Privacy-preserving computation for data miningBrickell, Justin Lee 01 June 2010 (has links)
As data mining matures as a field and develops more powerful algorithms for
discovering and exploiting patterns in data, the amount of data about individuals
that is collected and stored continues to rapidly increase. This increase
in data heightens concerns that data mining violates individual privacy. The
goal of data mining is to derive aggregate conclusions, which should not reveal
sensitive information. However, the data-mining algorithms run on databases
containing information about individuals which may be sensitive. The goal
of privacy-preserving data mining is to provide high-quality aggregate conclusions
while protecting the privacy of the constituent individuals.
The field of "privacy-preserving data mining" encompasses a wide variety
of different techniques and approaches, and considers many different threat
and trust models. Some techniques use perturbation, where noise is added (either
directly to the database that is the input to the algorithm or to the output
of queries) to obscure values of sensitive attributes; some use generalization, where identifying attributes are given less specific values; and some use cryp-
tography, where joint computations between multiple parties are performed
on encrypted data to hide inputs. Because these approaches are applied to
different scenarios with different threat models, their overall e ectiveness and
privacy properties are incomparable.
In this thesis I take a pragmatic approach to privacy-preserving data
mining and attempt to determine which techniques are suitable to real-world
problems that a data miner might wish to solve, such as evaluating and learning
decision-tree classifiers. I show that popular techniques for sanitizing
databases prior to publication either fail to provide any meaningful privacy
guarantees, or else degrade the data to the point of having only negligible
data-mining utility.
Cryptographic techniques for secure multi-party computation are a natural
alternative to sanitized data publication, and guarantee the privacy of
inputs by performing computations on encrypted data. Because of its heavy
reliance on public-key cryptography, it is conventionally thought to be too
slow to apply to real-world problems. I show that tailor-made protocols for
specific data-mining problems can be made fast enough to run on real-world
problems, and I strengthen this claim with empirical runtime analysis using
prototype implementations. I also expand the use of secure computation beyond
its traditional scope of applying a known algorithm to private inputs by
showing how it can be used to e ciently apply a private algorithm, chosen
from a specific class of algorithms, to a private input. / text
|
Page generated in 0.0791 seconds