Khaled El Emam: Is it safe to anonymize data?

An article recently published in Science claims that it is easy to re-identify credit card transaction data that has been anonymized. While this is not health data, the authors generalize their conclusions to all other human behavior data. Credit card transactions can, however, leak health information if, for example, the store is a pharmacy. In media interviews, the authors then make broader claims about all personal information, stating, for example, that “the open sharing of raw data sets is not the future.”

This is alarming as there are significant efforts to share raw health data, for example, in the context of clinical trials. Furthermore, the authors argue that the foundation of privacy laws in the US and EU, the concept of personally identifiable information, is “inadequate,” which is quite a sweeping statement. Given that we are in the midst of changes to the EU’s data protection directive, and there are efforts to change the privacy provisions under HIPAA in the US, it is timely and important to have accurate and evidence-based debates about the effectiveness of various approaches to protect data when it is being shared.

It is therefore also necessary to understand the nontrivial limitations of this study, and to caution against drawing such broad and all-encompassing conclusions.

What the study did

Using credit card data from 1.1 million individuals shopping in 10,000 stores over three months from a single bank in an unnamed OECD country, the authors analyzed these individuals’ transaction traces. If a person is the only one in that data set with a particular transaction trace of a certain size then that individual is considered unique. For example, if individual number 7, Sally, shopped at a shoe store on 26 December and bought something for $20, and then shopped at a grocery store on 27 December and purchased items worth $35, and this is the only individual in the data set who has that exact trace, then that individual is unique. In this particular example, that individual had a trace of 2 transactions.

The authors then compute that 90% of individuals in their specific data set are unique with a trace of 4 transactions, with each transaction containing the exact location of the store and date of the transaction. If the price of the purchase is included in the analysis then all individuals are unique. If the granularity of the information is reduced, for example, by grouping adjacent stores or combining days, then the percentage of unique individuals would also go down, but the uniqueness numbers were still considered high.
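
To make the notion of uniqueness concrete, here is a minimal Python sketch of the calculation as described above. It is not the authors' code: the three shoppers, their purchases, the `area_of` grouping, and the function names are all invented for illustration, and a person is counted as unique when their trace matches no purchase history other than their own.

```python
from typing import Dict, Set, Tuple

Transaction = Tuple[str, str]  # (store, date); a hypothetical representation


def fraction_unique(histories: Dict[str, Set[Transaction]],
                    traces: Dict[str, Set[Transaction]]) -> float:
    """Share of people whose trace matches exactly one purchase history."""
    unique = 0
    for person, trace in traces.items():
        # A history "matches" when it contains every transaction in the trace.
        matching = sum(1 for history in histories.values() if trace <= history)
        if matching == 1:
            unique += 1
    return unique / len(traces)


def coarsen(points: Set[Transaction], area_of: Dict[str, str]) -> Set[Transaction]:
    """Group stores into areas and combine all days into a single period."""
    return {(area_of[store], "late December") for store, _date in points}


# Invented toy data: full purchase histories, and a 2-transaction trace each.
histories = {
    "sally": {("shoe store", "26 Dec"), ("grocery", "27 Dec"), ("cafe", "28 Dec")},
    "bob": {("shoe store", "26 Dec"), ("cafe", "28 Dec")},
    "ann": {("shoe store", "26 Dec"), ("cafe", "28 Dec"), ("bookshop", "29 Dec")},
}
traces = {
    "sally": {("shoe store", "26 Dec"), ("grocery", "27 Dec")},
    "bob": {("shoe store", "26 Dec"), ("cafe", "28 Dec")},
    "ann": {("cafe", "28 Dec"), ("bookshop", "29 Dec")},
}

exact = fraction_unique(histories, traces)

# Hypothetical grouping of adjacent stores into two areas.
area_of = {"shoe store": "north", "grocery": "north",
           "cafe": "north", "bookshop": "south"}
coarse_histories = {p: coarsen(h, area_of) for p, h in histories.items()}
coarse_traces = {p: coarsen(t, area_of) for p, t in traces.items()}
coarse = fraction_unique(coarse_histories, coarse_traces)

print(f"unique with exact stores and dates: {exact:.2f}")
print(f"unique after coarsening:            {coarse:.2f}")
```

In this toy example two of the three shoppers are unique on exact stores and dates, and only one remains unique once stores are grouped and days combined, which is the direction of the effect the authors report.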

What is wrong with this study?

There are some fundamental challenges in the assumptions and methods used in this study. My commentary can only be based on what was in the published article and associated supplementary materials – it is not possible to speculate about unstated assumptions or techniques.

1. The 1.1 million people are a sample of the population in that country. Suppose the country has 5.5 million adult citizens who shop with credit cards: then there is only a 1 in 5 chance that a given individual is in the data set at all. It also means that the computations of uniqueness are potentially severely exaggerated because, in the context of measuring re-identification risk, uniqueness needs to be computed (or estimated) for the population. Since, based on the information in the article, that was not done here, all of these numbers are likely inflated: sample uniqueness does not equal population uniqueness. The authors do not acknowledge this in their article, and that is a big problem. In Sally’s case, there may be another individual who uses a different bank in that country and has exactly the same trace as hers. Sally may not really be unique.

2. The underlying assumption was that this data set is going to be shared openly with no controls whatsoever. I would be very surprised if a data set like this were converted into a public use file. In practice, such data would be disclosed with contractual, security, and privacy controls in place. From a risk measurement perspective, these controls reduce the overall probability of re-identification considerably. The impact of controls was not considered in the published analysis. Stating that such a data set can be re-identified assumes a somewhat unrealistic data sharing use case.

3. Uniqueness does not equal re-identification. Uniqueness is one condition for re-identification, but it is not even a necessary one. It is easy for a data set to have a high probability of re-identification even if uniqueness is not high. The media narrative from this study was that “the researchers identified 90 percent of the individuals in the data set. When they added the exact prices of transactions to the mix, they increased their ability to re-identify anonymous records by 22 percent.” In fact, the authors do not actually re-identify anyone. Yet in their article they repeatedly describe the results as a successful re-identification, which is simply not true.

4. As the authors note, the original anonymization performed on that data set was somewhat simple. More sophisticated anonymization techniques, had they been applied to this data, would have produced a data set with an acceptable level of re-identification risk.
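
Points 1 and 3 above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration rather than something taken from the Science article: the population is scaled down from the hypothetical 5.5 million to 5,500 people, traces are drawn uniformly at random from an arbitrary number of possibilities, and the probability of correctly matching a record is taken to be one over the number of records sharing its values, a common convention in disclosure risk measurement.

```python
import random
from collections import Counter

# --- Point 1: sample uniqueness versus population uniqueness ---
rng = random.Random(0)   # fixed seed so the sketch is repeatable
POPULATION = 5500        # scaled-down stand-in for the hypothetical 5.5 million
SAMPLE = 1100            # scaled-down stand-in for the 1.1 million in the data
N_TRACES = 5000          # arbitrary number of distinct possible traces

# Each person gets one trace, represented here by an integer label.
population = [rng.randrange(N_TRACES) for _ in range(POPULATION)]
sample_ids = rng.sample(range(POPULATION), SAMPLE)
sampling_fraction = SAMPLE / POPULATION   # the "1 in 5" chance of being included

population_counts = Counter(population)
sample_counts = Counter(population[i] for i in sample_ids)

# Share of sampled people who look unique within the sample alone...
sample_unique = sum(
    1 for i in sample_ids if sample_counts[population[i]] == 1) / SAMPLE
# ...and share of the same people who are unique in the whole population.
population_unique = sum(
    1 for i in sample_ids if population_counts[population[i]] == 1) / SAMPLE


# --- Point 3: re-identification risk without any uniqueness ---
def risk_summary(values):
    """Return (uniqueness, average risk, maximum risk) for a list of records."""
    sizes = Counter(values)
    n = len(values)
    uniqueness = sum(1 for v in values if sizes[v] == 1) / n
    average_risk = sum(1 / sizes[v] for v in values) / n   # mean of 1/class size
    maximum_risk = 1 / min(sizes.values())                 # smallest class
    return uniqueness, average_risk, maximum_risk


# Six records forming three pairs: nobody is unique.
uniqueness, average_risk, maximum_risk = risk_summary(["A", "A", "B", "B", "C", "C"])

print(f"sampling fraction:          {sampling_fraction:.2f}")
print(f"unique within the sample:   {sample_unique:.2f}")
print(f"unique in the population:   {population_unique:.2f}")
print(f"paired records: uniqueness {uniqueness:.2f}, average risk {average_risk:.2f}")
```

Anyone who is unique in the population is necessarily unique in any sample that contains them, so the sample figure can never be lower than the population figure; the simulation only shows how large the gap can become. The paired records show the converse of the usual intuition: uniqueness is zero, yet every record could be matched correctly half of the time.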

What have we learned?

There are key things that we can learn from this study.

Data with the individual names and addresses removed will have a high risk of re-identification. This is called pseudonymized data. We already know that pseudonymized data should be treated as personal information. The Science study reinforces that point. This is an important lesson because in the context of changes to the EU Data Protection Regulation, there are proposals to allow the use of pseudonymized data for secondary purposes without consent: “where further processing takes place by using measures of pseudonymisation, it should not be considered as incompatible with the purpose for which the data have been initially collected as long as the data subject is not identified or identifiable.”

The authors of the Science article emphasize the need for quantitative measures of the likelihood of re-identification. This is good general advice – but the measurement needs to be based on sound metrics. A recent report from the Institute of Medicine on clinical trial data sharing provides somewhat detailed guidance on such measurement.

Khaled El Emam is the Canada Research Chair in Electronic Health Information at the University of Ottawa, an Associate Professor in the Department of Pediatrics, and is cross-appointed to the School of Electrical Engineering and Computer Science.

Competing Interests: I have read and understood the BMJ Group policy on declaration of interests and declare the following interests: I have financial interests in Privacy Analytics Inc, a University of Ottawa and Children’s Hospital of Eastern Ontario spin-off company which develops anonymization software for the health sector.
