Identifiable Health Data

From Clinfowiki

Identifiable Health Data (or Personally Identifiable Health Data) refers to any health information that can be used, either alone or in combination with other information, to uniquely identify, contact, or locate a single person.


Data are considered “individually identifiable” if they include any of the 18 types of identifiers specified by the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule:

  • Name
  • Address (all geographic subdivisions smaller than state, including street address, city, county, ZIP code)
  • All elements (except years) of dates related to an individual (including birth date, admission date, discharge date, date of death and exact age if over 89)
  • Telephone numbers
  • Fax numbers
  • E-mail address
  • Social Security number
  • Medical record number
  • Health plan beneficiary number
  • Account number
  • Certificate/license number
  • Vehicle identifiers and serial numbers, including license plate numbers
  • Device identifiers or serial numbers
  • Web URL
  • Internet Protocol (IP) address numbers
  • Biometric identifiers, including finger and voice prints
  • Full-face photographic images and any comparable images
  • Any other unique identifying number, characteristic, or code
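As a rough illustration, the check below reports which of the listed identifier categories appear in a record. The field names (name, ssn, mrn, and so on) are hypothetical, since real schemas vary, and only a subset of the 18 categories is mapped; it is a sketch, not a compliance tool.

```python
# Hypothetical mapping from record field names to HIPAA identifier categories.
# Only a subset of the 18 categories is shown; the field names are assumptions.
IDENTIFIER_FIELDS = {
    "name": "Name",
    "street_address": "Address",
    "zip_code": "Address",
    "birth_date": "Dates",
    "phone": "Telephone number",
    "email": "E-mail address",
    "ssn": "Social Security number",
    "mrn": "Medical record number",
    "ip_address": "IP address",
}


def identifier_categories(record):
    """Return the sorted identifier categories that have a non-empty value."""
    found = {
        category
        for field, category in IDENTIFIER_FIELDS.items()
        if record.get(field) not in (None, "")
    }
    return sorted(found)


record = {"mrn": "A-1001", "zip_code": "37203", "diagnosis": "asthma", "email": ""}
print(identifier_categories(record))  # ['Address', 'Medical record number']
```

A record for which the function returns an empty list may still be identifiable through fields the mapping does not cover, so the result is only a first screen.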


De-identified patient data

De-identified patient data is patient data from which identifying details such as name, birth date, and address have been removed. Such data is often used for research. The Health Insurance Portability and Accountability Act (HIPAA) allows the use or disclosure of de-identified data without restriction and without individuals’ consent. The Privacy Rule acknowledges the risk of re-identification and is meant to balance this risk against the utility of sharing data for research. It allows two methods to satisfy the HIPAA de-identification standard: the “Safe Harbor” method, in which the 18 identifiers listed above are removed, and the “Expert Determination” method. The latter requires the use of mitigation strategies and a statistical determination that the risk of identification is very small, though not necessarily zero.[1] The Safe Harbor method is far more commonly used.[2]
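A minimal sketch of Safe Harbor style scrubbing for a single record might look like the following. The field names are hypothetical, only a few of the 18 identifier categories are handled, and the ZIP code is dropped entirely for simplicity, even though the rule permits retaining part of it under some conditions.

```python
from datetime import date

# Hypothetical field names for identifiers that are removed outright.
# zip_code is dropped entirely here as a simplification.
DROP_FIELDS = {"name", "street_address", "zip_code", "phone", "email", "ssn", "mrn"}
# Hypothetical date fields that are reduced to the year only.
DATE_FIELDS = {"birth_date", "admission_date", "discharge_date"}


def safe_harbor_sketch(record):
    """Return a copy of record with the listed identifiers removed or coarsened."""
    over_89 = record.get("age", 0) > 89
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # remove direct identifiers entirely
        if field in DATE_FIELDS:
            if field == "birth_date" and over_89:
                continue  # a birth year would reveal an age over 89
            out[field.replace("_date", "_year")] = value.year  # keep the year only
        elif field == "age" and over_89:
            out[field] = "90+"  # ages over 89 are grouped into one category
        else:
            out[field] = value
    return out


record = {
    "name": "Jane Doe",
    "birth_date": date(1930, 5, 1),
    "age": 93,
    "admission_date": date(2023, 3, 14),
    "diagnosis": "asthma",
}
print(safe_harbor_sketch(record))
# {'age': '90+', 'admission_year': 2023, 'diagnosis': 'asthma'}
```

Free-text fields such as clinical notes can also contain identifiers and would need separate handling that this sketch does not attempt.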

Under the Expert Determination method, an analysis is conducted to assess three factors that may contribute to identification risk:

  • Replicability: the degree to which data will serve as a unique marker that is stable over time.
  • Data source availability: other sources that can be used to cross-reference patient records (e.g., public records).
  • Distinguishability: identifiers that allow records to be uniquely identified, alone or in combination (such as birth date, ZIP code, and gender).
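Distinguishability can be roughly gauged by counting how many records are unique on a given combination of attributes. The sketch below assumes hypothetical field names and toy data; an actual expert determination involves considerably more than this single measure.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Toy records with hypothetical field names.
records = [
    {"birth_date": "1980-01-02", "zip_code": "37203", "gender": "F"},
    {"birth_date": "1980-01-02", "zip_code": "37203", "gender": "F"},
    {"birth_date": "1975-07-09", "zip_code": "37212", "gender": "M"},
    {"birth_date": "1975-07-09", "zip_code": "37212", "gender": "F"},
]
print(unique_fraction(records, ["birth_date", "zip_code", "gender"]))  # 0.5
print(unique_fraction(records, ["gender"]))  # 0.25
```

In this toy data, half of the records are unique on the full combination of attributes, while only one in four is unique on gender alone.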

Apart from the removal of Safe Harbor identifiers, de-identification methods must be adapted to the context. Methods may include suppression of identifiers, in whole or in part; generalization to reduce the specificity of identifiers (e.g., truncating ZIP codes); and perturbation to introduce noise into the data.[1] Efforts are being made to automate the anonymization of health information by developing de-identification models that can successfully remove personal health information.[3][4] One popular concept is k-anonymity, under which each record in a de-identified dataset is indistinguishable from at least k − 1 other records with respect to its potentially identifying attributes. The data are transformed until every record meets this threshold, while preserving as much of the dataset’s information content as possible.[2][5]

Another example of risk mitigation involved the Heritage Health Prize dataset. This insurance claims data was released to researchers competing to predict the rate of future hospitalizations. An algorithm was developed to reduce identification risk; it included the use of pseudonyms for providers and patients, suppression of records deemed high risk, and generalization of certain claims. The competition also required participants to agree to an access restriction policy.[6]
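The generalization and suppression steps described above, together with a k-anonymity check, can be sketched as follows. This is a naive illustration with hypothetical field names and fixed generalization rules (three-digit ZIP prefix, ten-year age bands); it is not the globally optimal method of reference 5, nor the algorithm used for the Heritage Health Prize data, and it does not cover perturbation.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: 3-digit ZIP prefix and 10-year age band."""
    low = (record["age"] // 10) * 10
    return {
        **record,
        "zip_code": record["zip_code"][:3] + "**",
        "age": f"{low}-{low + 9}",
    }


def k_anonymize(records, quasi_identifiers, k):
    """Generalize every record, then suppress those in groups smaller than k."""

    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    generalized = [generalize(r) for r in records]
    counts = Counter(key(r) for r in generalized)
    return [r for r in generalized if counts[key(r)] >= k]


# Toy records with hypothetical field names.
records = [
    {"zip_code": "37203", "age": 34, "diagnosis": "asthma"},
    {"zip_code": "37212", "age": 38, "diagnosis": "diabetes"},
    {"zip_code": "37203", "age": 71, "diagnosis": "flu"},
]
for row in k_anonymize(records, ["zip_code", "age"], k=2):
    print(row)
# {'zip_code': '372**', 'age': '30-39', 'diagnosis': 'asthma'}
# {'zip_code': '372**', 'age': '30-39', 'diagnosis': 'diabetes'}
```

The third record is suppressed because no other record shares its generalized ZIP prefix and age band. Practical methods try to choose the generalization that loses the least information rather than applying fixed rules.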


References

  1. U.S. Department of Health and Human Services, Office for Civil Rights. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
  2. Malin, B., Benitez, K., & Masys, D. (2011). Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA Privacy Rule. Journal of the American Medical Informatics Association, 18(1), 3–10. doi:10.1136/jamia.2010.004622
  3. Szarvas, G., Farkas, R., & Busa-Fekete, R. (2007). State-of-the-art anonymization of medical records using an iterative machine learning framework. Journal of the American Medical Informatics Association, 14(5), 574–580.
  4. Friedlin, F. J., & McDonald, C. J. (2008). A software tool for removing patient identifying information from clinical documents. Journal of the American Medical Informatics Association, 15(5), 601–610. PMCID: PMC2528047
  5. El Emam, K., Dankar, F. K., Issa, R., Jonker, E., Amyot, D., Cogo, E., … Bottomley, J. (2009). A globally optimal k-anonymity method for the de-identification of health data. Journal of the American Medical Informatics Association, 16(5), 670–682. doi:10.1197/jamia.M3144
  6. El Emam, K., Arbuckle, L., Koru, G., Eze, B., Gaudette, L., Neri, E., … Gluck, J. (2012). De-identification methods for open health data: the case of the Heritage Health Prize claims dataset. Journal of Medical Internet Research, 14(1), e33. doi:10.2196/jmir.2001

See also

El Emam, Khaled. Guide to the De-Identification of Personal Health Information. CRC Press, 2013.

Submitted by (Jacob Schwartzman)