Data re-identification

From Clinfowiki

Data re-identification is the process of analysing data, or combining it with other data, with the result that individuals become identifiable. It is sometimes termed ‘de-anonymization’.[1] The risk of re-identification is always a matter of probability and depends on assumptions about the attacker’s knowledge and the accuracy of the data, among other factors.

The threat that medical record re-identification poses to patient privacy was first raised in 1997 with the case of then Massachusetts governor William Weld. Weld lost consciousness during a public speech and was treated at a nearby hospital. His medical records were later identified by cross-referencing claims data against voter registration data, both of which were publicly available. Researcher Latanya Sweeney used this case to demonstrate that the combination of ZIP code, sex, and date of birth could be used to uniquely identify 87% of the US population.[2] Later analysis indicated that this case was an exceptional situation made possible by the governor’s highly public status.[3] Nonetheless, the event was pivotal in revisions to the HIPAA Privacy Rule that restrict disclosure of full birth dates and ZIP codes under the Safe Harbor standard.
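The kind of linkage described above can be illustrated with a minimal Python sketch. Every record, field name, and the `link_records` helper here is hypothetical and invented for illustration; none of it comes from the Weld case or any real dataset. The sketch assumes that both datasets share ZIP code, sex, and date of birth, and it counts a claim as re-identified only when exactly one voter has the same values.

```python
from collections import defaultdict

# Fields assumed (for this sketch) to appear in both datasets.
QUASI_IDENTIFIERS = ("zip", "sex", "dob")

# Synthetic "de-identified" claims: no names, but demographics remain.
claims = [
    {"zip": "11111", "sex": "M", "dob": "1950-01-15", "diagnosis": "condition X"},
    {"zip": "22222", "sex": "F", "dob": "1962-03-14", "diagnosis": "condition Y"},
    {"zip": "44444", "sex": "M", "dob": "1975-06-30", "diagnosis": "condition Z"},
]

# Synthetic voter roll: names plus the same demographic fields.
voters = [
    {"name": "Voter A", "zip": "11111", "sex": "M", "dob": "1950-01-15"},
    {"name": "Voter B", "zip": "22222", "sex": "F", "dob": "1962-03-14"},
    {"name": "Voter C", "zip": "22222", "sex": "F", "dob": "1962-03-14"},
    {"name": "Voter D", "zip": "33333", "sex": "F", "dob": "1980-11-02"},
]


def link_records(claims, voters, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for claims matching exactly one voter."""
    # Index voters by their quasi-identifier values.
    index = defaultdict(list)
    for voter in voters:
        index[tuple(voter[k] for k in keys)].append(voter["name"])

    linked = []
    for claim in claims:
        candidates = index.get(tuple(claim[k] for k in keys), [])
        # Zero candidates: no link. Several candidates: ambiguous, so skipped.
        if len(candidates) == 1:
            linked.append((candidates[0], claim["diagnosis"]))
    return linked


print(link_records(claims, voters))
```

In this toy data only the first claim is linked. The second matches two voters and stays ambiguous, and the third matches none, which mirrors the point that re-identification depends on how distinctive a person's demographics are and on how complete the attacker's auxiliary data is.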

Documented cases of PHI re-identification since the Weld incident appear to be rare. In one case, researchers reported that they were able to derive patient addresses from data published in public health mapping studies.[4] Benitez et al. conducted a statistical analysis of re-identification risk, with a focus on demographic data in publicly available voter registration datasets. They found that the risk varied by state, depending on the accuracy of the data, and was low overall, with 0.01% to 0.25% of the population at risk from this attack method.[5] A 2011 review noted that very few cases actually involved PHI: “The current evidence shows a high re-identification rate but is dominated by small-scale studies on data that was not de-identified according to existing HIPAA standards.”[6]
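One common way to reason about this kind of risk is to count how many records are unique on a set of demographic fields. The Python sketch below is a simplified illustration of that idea, not the method used by Benitez et al. The data is synthetic, and the `unique_fraction`, `full_key`, and `generalized_key` names are hypothetical. The generalized key keeps only the year of birth and a three-digit ZIP prefix, which is a loose simplification in the spirit of Safe Harbor rather than the full rule.

```python
from collections import Counter


def unique_fraction(records, key_fn):
    """Fraction of records whose key value is shared with no other record."""
    counts = Counter(key_fn(r) for r in records)
    unique = sum(1 for r in records if counts[key_fn(r)] == 1)
    return unique / len(records)


def full_key(record):
    # Full ZIP code, sex, and exact date of birth.
    return (record["zip"], record["sex"], record["dob"])


def generalized_key(record):
    # Coarsened fields: 3-digit ZIP prefix, sex, and year of birth only.
    return (record["zip"][:3], record["sex"], record["dob"][:4])


# Synthetic population, invented for illustration.
population = [
    {"zip": "11111", "sex": "F", "dob": "1970-02-01"},
    {"zip": "11112", "sex": "F", "dob": "1970-09-23"},
    {"zip": "11111", "sex": "M", "dob": "1970-02-01"},
    {"zip": "11199", "sex": "M", "dob": "1970-05-17"},
    {"zip": "22222", "sex": "F", "dob": "1985-12-30"},
]

print(unique_fraction(population, full_key))         # every record is unique
print(unique_fraction(population, generalized_key))  # only one record is unique
```

In this toy population every record is unique on the full fields, while only one of five remains unique after coarsening. Real estimates also depend on how complete and accurate the attacker's auxiliary data is, which this sketch ignores.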

Nonetheless, the threat of re-identification has been a frequent topic of consideration by academics and policymakers, especially in the context of modern genomic data. Mark Rothstein critiques the ethics of the current de-identification regime. He raises the lack of regulation once de-identified data is released, as well as uncertainty about actual compliance by covered entities. He notes that de-identification is not a standard feature of current EHR systems, so even Safe Harbor compliance is handled on a case-by-case basis. He recommends expanding the scope of individual consent requirements and giving greater consideration to harms to groups that result from downstream uses of data.[7] In one example of such group harm, the Havasupai Indian tribe in Arizona consented to a genetic study of diabetes and sued when researchers published additional results that the tribe considered offensive.[8] McGraw recommends several revisions to HIPAA policy to increase public trust in health research, including a legal prohibition on re-identification, enhanced technical security standards, and abolishing the Safe Harbor standard in favor of statistical analysis of re-identification risk.[9]

Re-identification of genomic data

Genomic datasets pose a special problem for patient privacy. The nature of this data makes it difficult or impossible to render anonymous without reducing its usefulness for research. One of the most serious cases of re-identification used genetic fingerprints (STR markers) to link de-identified published genomic sequence data to online genealogy data. The authors estimate a 12% re-identification rate for US Caucasian males and note that a “data release, even of a few markers, from one person can spread through deep genealogical ties and lead to the identification of another person who might have no acquaintance with the person who released his genetic data.”[10] Other research re-identified “personal genome” data by linking demographic information to public records,[11] and by linking patterns of ICD-9 codes associated with genomic data to insurance claims data.[12] Philibert et al. were able to create individual genotypes from epigenetic data not typically used for this purpose. While they were able to associate these genotypes with tobacco and alcohol use status, they were not able to fully re-identify individuals without access to PHI.[13] In 2012, the Presidential Commission for the Study of Bioethical Issues addressed this topic, noting in particular the risk of unauthorized access to and disclosure of genomic data. HHS has not specified whether genetic data itself is to be treated as PHI. The Commission's report recommends that organizations storing genomic databases be required to adopt strict data access and usage policies, but concludes that individuals presently bear the risk to their genomic privacy: “...entities must make individuals aware that incidental findings are likely to be discovered in the course of whole genome sequencing.”[14]


  1. Information Commissioner's Office. Anonymisation: managing data protection risk code of practice. 2012.[1]
  2. Sweeney, Latanya. (2013) Harvard Data Privacy Lab. [2]
  3. Barth-Jones, D. C. (2012). The “Re-Identification” of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now. SSRN Electronic Journal. doi:10.2139/ssrn.2076397 [3]
  4. Brownstein, J. S., Cassa, C. A., & Mandl, K. D. (2006). No Place to Hide — Reverse Identification of Patients from Published Maps. NEJM. [4]
  5. Benitez, K., & Malin, B. (2010). Evaluating re-identification risks with respect to the HIPAA privacy rule. Journal of the American Medical Informatics Association : JAMIA, 17(2), 169–77. doi:10.1136/jamia.2009.000026 [5]
  6. El Emam, K., Jonker, E., Arbuckle, L., & Malin, B. (2011). A systematic review of re-identification attacks on health data. PLoS ONE, 6(12), e28071. doi:10.1371/journal.pone.0028071 [6]
  7. Rothstein, M. A. (2010). Is deidentification sufficient to protect health privacy in research? The American Journal of Bioethics : AJOB, 10(9), 3–11. doi:10.1080/15265161.2010.494215 [7]
  8. Harmon, Amy. Indian Tribe Wins Fight to Limit Research of Its DNA. New York Times, 2010. [8]
  9. McGraw, D. (2013). Building public trust in uses of Health Insurance Portability and Accountability Act de-identified data. Journal of the American Medical Informatics Association : JAMIA, 20(1), 29–34. doi:10.1136/amiajnl-2012-000936 [9]
  10. Gymrek, M., McGuire, A. L., Golan, D., Halperin, E., & Erlich, Y. (2013). Identifying Personal Genomes by Surname Inference. Science, 339(6117), 321–324. doi:10.1126/science.1229566 [10]
  11. Sweeney, Latanya. Identifying Participants in the Personal Genome Project by Name. (n.d.). Retrieved April 28, 2015, from
  12. Loukides, G., Denny, J. C., & Malin, B. (2010). The disclosure of diagnosis codes can breach research participants’ privacy. Journal of the American Medical Informatics Association : JAMIA, 17(3), 322–7. doi:10.1136/jamia.2009.002725 [11]
  13. Philibert, R. A., Terry, N., Erwin, C., Philibert, W. J., Beach, S. R., & Brody, G. H. (2014). Methylation array data can simultaneously identify individuals and convey protected health information: an unrecognized ethical concern. Clinical Epigenetics, 6(1), 28. doi:10.1186/1868-7083-6-28 [12]
  14. Presidential Commission for the Study of Bioethical Issues. 2012. Privacy and Progress in Whole Genome Sequencing. [13]

See Also

El Emam, K. (2011). Methods for the de-identification of electronic health records for genomic research. Genome Medicine, 3(4), 25. doi:10.1186/gm239 [14]

Submitted by Jacob Schwartzman