Khaled E Emam: Pseudonymous data is not anonymous data

Recently, efforts have been made to make health data more generally available for secondary purposes, including research. These include the recent policy announcements from the European Medicines Agency (EMA) on making clinical trials data available, industry efforts to do the same, and care.data in the UK.

All of these are premised on being able to anonymize the data properly before it is shared, and in a manner that will meet multiple requirements: (a) ensure that the probability of re-identifying individual patients is small, (b) meet the regulatory and legal thresholds for what is an anonymized dataset, and (c) ensure that the anonymized data quality is sufficiently high to allow meaningful analysis.

It is important to clarify what practices will not meet the first two of these requirements, and to make some terminology clarifications so that we can have an informed discussion on data sharing practices. A health dataset will have two types of variables that we care about from a privacy perspective: (a) direct identifiers, and (b) indirect identifiers.

Direct identifiers are details such as names, addresses, social insurance numbers, telephone numbers, email addresses, and any other unique identifiers. These direct identifiers are typically removed from the dataset when it is shared for secondary purposes. Unique identifiers that are needed for analysis, such as a medical record number, are converted to pseudonyms so that they can still be used to link all of the records that belong to the same patient. When direct identifiers have been removed or replaced in this way, the dataset is considered pseudonymized.
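To make the idea concrete, here is a minimal sketch of pseudonymization in Python. It is an illustration only, not a description of any particular custodian's process: the field names (`name`, `address`, `phone`, `email`, `mrn`), the `pseudonymize` function, and the choice of a keyed hash (HMAC-SHA256) to derive the pseudonym are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical field names for the direct identifiers that are simply dropped.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def pseudonymize(records, secret_key, id_field="mrn"):
    """Drop direct identifiers and replace the record number with a pseudonym.

    The pseudonym is a keyed hash of the original identifier, so the same
    patient always maps to the same pseudonym and records stay linkable.
    """
    output = []
    for record in records:
        # Keep everything except the direct identifiers and the raw record number.
        kept = {
            field: value
            for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS and field != id_field
        }
        digest = hmac.new(
            secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        kept["pseudonym"] = digest[:16]  # truncated only for readability
        output.append(kept)
    return output
```

Note what the sketch does not do: any other field in the record, such as a year of birth or a postal code, passes through unchanged.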

The EU Data Protection Directive still considers pseudonymized data as personally identifying information, and the UK Information Commissioner’s Office considers pseudonymized data to be personal information. There are good reasons for that.

Pseudonymized data leave all of the indirect identifiers intact, and individuals can still be identified using the indirect identifiers. All known re-identification attacks on clinical, administrative, and survey data were done using the indirect identifiers. And recently, an academic researcher gathered information from newspaper articles about vehicle accidents to re-identify individuals in a hospital discharge database, using information such as the year of birth, date of accident, the hospital the individual went to, and where they lived.

This is not surprising. There is evidence that basic demographics, such as the date of birth and the postal code, can uniquely identify almost all of the population. For example, these two pieces of information are unique identifiers for almost all of the population in Canada and the Netherlands, and for a high percentage of the population in the United States. These basic demographics are easy to get from public sources and can be used to re-identify individuals.
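One simple way to see this in a given dataset is to count how many records have a combination of indirect identifiers that no other record shares. The sketch below is a rough illustration under stated assumptions: the function name `uniqueness`, the field names `date_of_birth` and `postal_code`, and the sample records are invented for the example, and a real risk assessment would also need to consider the population the data were drawn from, not just the dataset itself.

```python
from collections import Counter


def uniqueness(records, quasi_identifiers):
    """Return the fraction of records whose combination of indirect
    identifiers appears exactly once in the dataset."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Invented sample data: pseudonymized records with indirect identifiers intact.
sample = [
    {"pseudonym": "a1", "date_of_birth": "1970-01-01", "postal_code": "K1A"},
    {"pseudonym": "b2", "date_of_birth": "1970-01-01", "postal_code": "K1A"},
    {"pseudonym": "c3", "date_of_birth": "1982-05-17", "postal_code": "K1A"},
    {"pseudonym": "d4", "date_of_birth": "1970-01-01", "postal_code": "M5V"},
]

fraction_unique = uniqueness(sample, ["date_of_birth", "postal_code"])
print(fraction_unique)  # 0.5: two of the four records are unique on these fields
```

Records that are unique on these fields are the ones most exposed to matching against an outside source that carries the same demographics alongside a name.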

The risks are not theoretical: there have been re-identification attacks on health data. There is also the issue of public trust. How can the public trust that data custodians are protecting their data properly when they are sharing it without consent, if even the basics of disclosure control are not being adhered to?

There is a need for anonymization standards that go beyond pseudonymization to cover the indirect identifiers. As data sharing initiatives take flight, there is an urgency to address the standards gap before data go out of the door.

Read Khaled E Emam’s previous blogs in this series:

Towards standards for anonymizing clinical trials data

What are the privacy concerns when sharing clinical trials data?

Khaled E Emam is the Canada research chair in electronic health information at the University of Ottawa, and an associate professor in the department of pediatrics, and is cross-appointed to the school of electrical engineering and computer science.

Competing Interests: I have read and understood BMJ policy on declaration of interests and declare the following interests: I have financial interests in Privacy Analytics Inc., a University of Ottawa and Children's Hospital of Eastern Ontario spin-off company, which develops anonymization software for the health sector.

  • susanne stevens

    Big Brother Watch has recently published a very detailed report, ‘NHS Breaches of Data Protection Law between 2011 and 2014’, published 14th November 2014. It is too long to summarise here, but thanks to Big Brother Watch for putting the information together to highlight the situation so clearly, and for permission to refer to their report.

  • susanne stevens

    The organisation ‘Big Brother Watch’ published, on 14th November 2014, a detailed account of breaches of data law in the NHS from 2011 to 2014.

  • steve black

    If you are going to do a technical article on the topic of pseudonymization, it would be good to include the details of what the UK actually does with its major health datasets. The article doesn’t, thereby creating a false fear that UK datasets are easy to re-identify.

    The author says, correctly, that “There is evidence that basic demographics, such as date of birth or postcode, can uniquely re-identify almost all the population” but ignores the fact that the routinely released English HES data only contains age and super output area (a much broader geography than postcode). It looks as though we have already done the extra work to protect people’s identity.

    While re-identification is still possible in any pseudonymized dataset, the detailed rules and processes in England make it hard enough not to be worth anyone’s while to attempt it. Especially since many providers will carelessly leak fully identifiable patient records.

    The central bodies such as the HSCIC have an excellent track record at protecting people’s identities and supporting the safe use of data. We shouldn’t stoke fear about them because we don’t pay attention to the detail.

  • steve black

    I’m not sure this is obvious, but essentially all the leaks reported by Big Brother Watch were from local systems and most were not pseudonymised at all. Despite slightly weak processes and an uncertain policy environment the HSCIC has not ‘leaked’ data in a harmful way and the implied link by referencing the Big Brother Watch report is spurious and misleading.

    In reality, centralised records have a very good track record of data protection (unlike some local NHS bodies) and restricting central collections will not help protect data at all as that isn’t where the leaks happen.

  • Anonymous

    This is why ethics standards are very important in data sharing for both evaluation and research. Most reputable organizations go above and beyond for data protection.

    For those looking for more information on best practices, the “Safe Harbor” method is a good start. See this guidance document for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) privacy rule:

    http://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#standard