Out of an anonymous set of credit card data from millions of people, how easily can you find one person?
Very easily, it turns out. It takes three pieces of outside information to correctly identify a person in an anonymous data set, even when the data omits easy identifiers like credit card numbers, names or addresses, said Yves-Alexandre de Montjoye, a graduate student in media arts and sciences at the Massachusetts Institute of Technology.
According to a study in Thursday’s issue of Science, a crafty user has a 94 percent chance of tracking all of your purchases with three pieces of extra information, such as a receipt, an Instagram photo of your lunch, a tweet about the new shoes you bought or a Facebook post tagging the bar where you went for happy hour.
“We are showing that the privacy we are told that we have isn’t real,” study co-author Alex “Sandy” Pentland, a data scientist also from MIT, told the Associated Press.
De Montjoye, lead author of the study, and his colleagues analyzed an anonymous set of credit card data covering 1.1 million users and 10,000 different shops in an OECD country over a period of three months. Names, account numbers and anything else that would be considered an easy identifier had been removed from the data.
Removing those easy identifiers, called personally identifiable information or PII, is required by the U.S. Privacy Act and the E.U. Privacy Directive. Governments and organizations use these metadata sets to inform policy or develop new technology.
This data can be used for good, de Montjoye pointed out, but reidentifying users is still a risk.
“Sandy and I do really believe that this data has great potential and should be used,” de Montjoye said in a press release. “We, however, need to be aware and account for the risks of re-identification.”
Even without PII, the data lists dates of transactions, shop names and the price of each transaction. With that information and a bit of knowledge about an individual, it’s easy to pull one person out of the data. The study explains:
“Let’s say that we are searching for Scott in a simply anonymized credit card data set. We know two points about Scott: he went to the bakery on 23 September and to the restaurant on 24 September. Searching through the data set reveals that there is one and only one person in the entire data set who went to these two places on these two days…Scott is reidentified, and we now know all of his other transactions, such as the fact that he went shopping for shoes and groceries on 23 September, and how much he spent.”
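The search described in that passage can be sketched in a few lines of Python. This is only an illustration, not the study's code: the records, the user IDs and the function name `matching_users` are all invented for the example, and a real data set would hold millions of rows rather than eight.

```python
from collections import defaultdict

# Toy "anonymized" records: (user_id, shop, date, price).
# Every value here is invented for illustration.
transactions = [
    ("u1", "bakery", "09-23", 5.10),
    ("u1", "restaurant", "09-24", 32.00),
    ("u1", "shoe store", "09-23", 79.99),
    ("u1", "grocery", "09-23", 41.25),
    ("u2", "bakery", "09-23", 3.40),
    ("u2", "cinema", "09-24", 12.00),
    ("u3", "restaurant", "09-24", 27.50),
    ("u3", "grocery", "09-22", 18.75),
]


def matching_users(records, known_points):
    """Return the IDs of users whose records contain every known (shop, date) pair."""
    visits = defaultdict(set)
    for user, shop, date, _price in records:
        visits[user].add((shop, date))
    required = set(known_points)
    return sorted(user for user, seen in visits.items() if required <= seen)


# What an outsider knows about "Scott": two places on two days.
known = [("bakery", "09-23"), ("restaurant", "09-24")]
candidates = matching_users(transactions, known)

# If exactly one user matches, the rest of that user's history is exposed.
history = []
if len(candidates) == 1:
    history = [t for t in transactions if t[0] == candidates[0]]
```

In this toy set, two other users each match one of the two known points, but only one user matches both, so that user's shoe and grocery purchases come along with the match.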
The study also found that women and people with higher incomes were more easily identified. And even after coarsening the data by listing an approximate price or a date range, it was still possible to find an individual with 10 data points.
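To see why coarsening helps only up to a point, here is a rough Python sketch, again on invented data. The function name `unique_fraction`, the bucket sizes and the way known points are chosen (the first few in sorted order, rather than a random sample) are assumptions made for the example and are simpler than whatever procedure the study used.

```python
from collections import defaultdict

# Toy records: (user_id, shop, day_number, price_in_whole_units). All invented.
records = [
    ("a", "bakery", 1, 4),
    ("a", "cafe", 3, 9),
    ("b", "bakery", 2, 6),
    ("b", "cafe", 3, 11),
    ("c", "bakery", 9, 5),
    ("c", "cafe", 12, 30),
]


def unique_fraction(rows, k, day_bucket=1, price_bucket=1):
    """Fraction of users singled out by k of their own (coarsened) points.

    Coarsening is modeled with integer division: days are grouped into
    ranges of day_bucket days, prices into ranges of price_bucket units.
    """
    traces = defaultdict(set)
    for user, shop, day, price in rows:
        traces[user].add((shop, day // day_bucket, price // price_bucket))

    unique = 0
    for user, points in traces.items():
        known = set(sorted(points)[:k])  # simplification: not a random sample
        matches = [u for u, p in traces.items() if known <= p]
        if len(matches) == 1:
            unique += 1
    return unique / len(traces)


exact_one_point = unique_fraction(records, k=1)
coarse_one_point = unique_fraction(records, k=1, day_bucket=7, price_bucket=10)
coarse_two_points = unique_fraction(records, k=2, day_bucket=7, price_bucket=10)
```

With exact days and prices, a single point singles out every user in this toy set. After grouping days into weeks and prices into ranges of ten, one point is no longer enough for two of the three users, but two points identify everyone again, which is the same pattern the study reports at a much larger scale.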
Ultimately, this means that removing PII doesn’t do enough to guarantee anonymity for metadata sets, the study concluded. De Montjoye and Pentland have done similar studies on cell phone data with the same results.
“If we show it with a couple of data sets, then it’s more likely to be true in general,” de Montjoye said in a press release.