Khaled El Emam: Anonymisation and creepy analytics

When health data is shared for secondary purposes, such as for research, there is always a concern about patient privacy. Data custodians want to at least meet the requirements of their relevant laws and regulations. One option for sharing data is to anonymise it beforehand. But anonymisation does not protect against stigmatising analytics, which are often seen as a form of privacy violation.

Stigmatising analytics are inferences drawn from data that may have a negative impact on a data subject or, more often, a group of data subjects. The harm arises from decisions made on the basis of those inferences, and may be social, financial, reputational, or psychological. For example, an inference that individuals living close to an industrial site have a higher than expected incidence of cancer may make those individuals less employable (because they would raise their employer's group insurance premiums), and publication of such a finding could sharply reduce their property values.

Sometimes inferences from data are described as “creepy”. This is essentially the same idea, except that the impact on the data subjects is that they feel violated. For example, if a supermarket can determine from changes in your purchasing behaviour that you are pregnant before your family knows, or a website can infer your sexual orientation from the sites you visit or whom you friend on a social network (and starts serving you advertisements accordingly) – that would be creepy.

Stigmatising analytics can affect even individuals who are not in the dataset. For example, consider an anonymised dataset that was analysed to build a regression model that allows one to make inferences about a group of individuals. The model can then be used to make predictions about any individual, whether or not their data was in the original dataset. As long as the relevant values are available as inputs to the regression model, a prediction can be made.

This regression model can be built on properly anonymised data. Anonymisation, as defined in contemporary laws and regulations, does not protect against such inferences: it is concerned only with ensuring that the identity of the data subjects cannot be determined. As long as that identity has a very small probability of being determined from the data, the anonymisation requirements are met.
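The point can be illustrated with a minimal sketch. All of the numbers and variable names below are made up for illustration: a simple linear model is fitted (by ordinary least squares, computed by hand) on an anonymised training sample relating distance from a hypothetical industrial site to an illustrative risk score, and the fitted model can then score anyone whose distance is known, whether or not their record was in the training data.

```python
import random

# Hypothetical anonymised training sample: 200 records of
# (distance from site in km, illustrative risk score).
random.seed(0)
train_distance = [random.uniform(0.5, 10.0) for _ in range(200)]
train_risk = [5.0 - 0.4 * d + random.gauss(0, 0.5) for d in train_distance]

# Ordinary least squares for a single predictor.
n = len(train_distance)
mean_x = sum(train_distance) / n
mean_y = sum(train_risk) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(train_distance, train_risk))
    / sum((x - mean_x) ** 2 for x in train_distance)
)
intercept = mean_y - slope * mean_x

def predict(distance):
    # Scores ANY individual with a known distance value, regardless of
    # whether their record was in the anonymised training data.
    return intercept + slope * distance

print(predict(1.0))  # someone living close to the site
print(predict(9.0))  # someone living far away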

There are two general ways to address the risks of stigmatising analytics. One is to modify the data to make it difficult to draw inferences, for example by adding noise or by suppressing records or values. This, however, often results in a dataset that is not very useful analytically, because the ability to draw inferences would have to be, by definition, curtailed.
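The trade-off can be sketched with entirely hypothetical data: the more noise added to a field, the weaker the statistical relationship between the perturbed values and the originals, and hence the less analytic signal survives.

```python
import random

random.seed(1)
# Hypothetical numeric field from a health dataset.
values = [random.gauss(100, 10) for _ in range(1000)]

def add_noise(data, scale):
    # Perturb each value with zero-mean Gaussian noise.
    return [v + random.gauss(0, scale) for v in data]

def correlation(a, b):
    # Pearson correlation between two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

light = add_noise(values, 5)   # mild perturbation
heavy = add_noise(values, 50)  # strong perturbation

# Heavier noise means the perturbed field tracks the original less closely,
# so inferences built on it are weaker.
print(correlation(values, light))
print(correlation(values, heavy))
```

This is only a sketch of the utility cost; real disclosure-control methods choose the perturbation level against a formal privacy criterion rather than by eye.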

The second approach is to use proper governance mechanisms. In a research context, an ethics committee would normally review protocols to evaluate the risk of group harm and stigma that may affect study participants. Outside the research context, however, such review committees would need to be created. They would evaluate analysis protocols to determine what types of models would be developed on the data and how those models would be used (for example, what kinds of decisions would be made with them). The objective of such a committee is to ensure that model development and use are consistent with prevailing cultural and social norms, and to assess the potential negative impacts on individuals, whether or not they are in the dataset. That is the only practical way to manage the risks of stigmatising analytics.

Khaled El Emam is the Canada Research Chair in electronic health information at the University of Ottawa, an associate professor in the department of pediatrics, and cross-appointed to the school of electrical engineering and computer science.

Competing Interests: I have read and understood the BMJ Group policy on declaration of interests and declare the following interests: I have financial interests in Privacy Analytics Inc., a University of Ottawa and Children's Hospital of Eastern Ontario spin-off company which develops anonymisation software for the health sector.

  • Niall Wallace

    Khaled. Great insights as usual. Certainly a challenge that needs to be addressed as bigger and more complex data sets are being gathered and used to address quality, safety and operational efficiencies.

    In my experience, governance is addressed as part of a project or deployment of healthcare IT.

    Part of the challenge is to ensure that we have the ability to support new academic learnings and to properly qualify results while still maintaining privacy – while avoiding or limiting the creep-back that is an ever-present risk from retrospective analysis.

  • Ranveig S Berg

    The Nuffield Council on Bioethics is currently holding a consultation to gather views and evidence on the ethical issues raised by the linking and use of biological and health data. Readers of this post may be interested: http://nuffieldbioethics.org/biological-and-health-data/biological-and-health-data-linking-and-use-biological-and-health-data-ope