Patient Confidentiality in the Research Use of Clinical Medical Databases

From Clinfowiki

The following article is adapted from Krishna R, Kelleher K, and Stahlberg E's "Patient Confidentiality in the Research Use of Clinical Medical Databases".[1]

Modern computing power has provided quantitative researchers with new techniques for exploring and identifying correlations in large data warehouses.[2] Common to such efforts is the need for access to large quantities of potentially sensitive protected health information (PHI). The interest in maintaining, and the legal sanctions for violating, patient confidentiality are of particular concern for researchers who use medical data. Balancing the conflicting interests of ensuring patient confidentiality and providing access to sufficiently detailed information for adequate research is a serious challenge for healthcare organizations (HCOs), data providers, and their respective Institutional Review Boards (IRBs). Although existing legal restrictions in the United States attempt to strike a balance, no computing system is entirely secure, and there is understandable concern about unintended or inappropriate releases of information.

Applying data security and statistical disclosure techniques allows trade-offs between data usability and data security, giving researchers access to relevant data while minimizing the potential damage of a breach in data security. The authors' discussion is intended as an introduction for researchers and their human-participant oversight structures to choosing the best security solution for a given situation.


In the United States, current regulations on the use of PHI under HIPAA divide medical record sets into 3 categories: identified data, deidentified data, and limited data.

  • Identified data include any data that a recipient could use to uniquely identify the person behind an individual patient record. Access to such records requires explicit consent by study participants or a waiver of the consent requirement by an IRB, and is further subject to numerous restrictions, including the tracking of PHI disclosures.
  • Deidentified data are data with all 18 HIPAA-specified data elements removed; such data may be used freely.
  • Limited data sets are available only for research, public health, and HCO purposes; researchers may access certain data elements without the restrictions that apply to fully identified data.

Considerable research in privacy-preserving data mining,[3][4] disclosure risk assessment,[5][6] and data deidentification, obfuscation, and protection,[7][8] found in the computing and database management literature, is often directly applicable to medical privacy issues. Researchers, however, rarely exploit the flexibility of disclosure limitation techniques; instead, they depend on deidentified data or on the physical security of the data infrastructure.

This tendency gives decision makers the impression that deidentified data are safe for public use and that data security restrictions on the use of identified data sets will ensure confidentiality. Of greatest concern, little effort is applied to documenting data security measures when the results of an analysis are published.


Data exclusion

Exclusion of specific data elements is the basis of most general restrictions on data use. In this realm, carefully constructed aggregate data or removal of entire records provides the highest level of confidentiality; second to this is individual record deidentification. The goal is to verify that the deidentification process retains the data a particular researcher needs while ensuring sufficient commonality between records for anonymity. The concept of k-anonymity[9] and the use of systems such as Datafly[10] ensure that at least k records in any data set are indistinguishable along any parameter of interest. Field masking can maintain specific aspects of the data set that are of research interest. Concept Match[11] provides a system for deidentifying free-text fields by removing words that do not match a predetermined set of interest words for a domain.
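The k-anonymity property described above can be checked mechanically. The following is a minimal sketch (the function name, field names, and toy records are illustrative, not taken from the paper): every combination of quasi-identifier values must occur in at least k records.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    appears in at least k records of the data set."""
    groups = Counter(
        tuple(rec[qi] for qi in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical toy data set; zip code and birth year are quasi-identifiers.
records = [
    {"zip": "432**", "birth_year": 1970, "diagnosis": "A"},
    {"zip": "432**", "birth_year": 1970, "diagnosis": "B"},
    {"zip": "432**", "birth_year": 1980, "diagnosis": "C"},
]
print(is_k_anonymous(records, ["zip", "birth_year"], 2))  # False: the 1980 group has only one record
```

Systems such as Datafly go further by generalizing values (e.g., truncating zip codes) until a check like this passes.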

Data transformation

Data transformation techniques provide a statistical guarantee of confidentiality. The common theme in these techniques is to make an irreversible modification to the data that destroys the original values or correlations while preserving the relationships of interest. As with data exclusion, techniques exist to modify data globally or at the level of individual elements.

Data perturbation is an example of global data transformation. The idea is to preserve aggregate trends in the original data while removing or altering the actual data.
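One way to sketch perturbation (the function and values below are illustrative, not from the paper) is to add zero-mean random noise to each entry: individual values change, but aggregate statistics such as the mean are approximately preserved.

```python
import random

def perturb(values, scale=1.0, seed=None):
    """Add zero-mean Gaussian noise to each value; individual entries
    are altered, but aggregate trends are approximately preserved."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]

ages = [34, 45, 52, 61, 29, 48]
noisy = perturb(ages, scale=2.0, seed=42)
# Individual ages differ, but the mean stays close to the original.
print(sum(ages) / len(ages), sum(noisy) / len(noisy))
```

The noise scale controls the trade-off: larger noise gives stronger protection of individual values but degrades fine-grained analyses.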

Hashing of individual data elements involves a lossy, one-way transformation or mapping of data. A simple hash of 20 unique zip codes (protected under HIPAA) may replace each unique zip code with a value between 1 and 1000 at each entry in the data set. This transformation probabilistically maintains the uniqueness of zip code values and thus preserves much of the research value. Many standard hashing algorithms exist, including the Message-Digest Algorithm (MD5)[12] developed at MIT, and the Secure Hash Algorithm 1 (SHA-1) developed by NIST.[13]
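The zip-code example above can be sketched with a standard hash from Python's standard library (the bucket count and function name are illustrative): equal inputs always map to the same bucket, while the original value cannot be recovered from the output.

```python
import hashlib

def mask_zip(zip_code, buckets=1000):
    """Map a zip code to a stable pseudonymous bucket via a one-way
    SHA-1 hash; the original cannot be recovered from the output."""
    digest = hashlib.sha1(zip_code.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets + 1  # value between 1 and `buckets`

print(mask_zip("43210") == mask_zip("43210"))  # True: same input, same bucket
```

Note that because the space of valid zip codes is small, an unsalted hash like this is vulnerable to a dictionary attack; mixing in a secret salt strengthens it considerably.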

Data encryption

A further step from absolute confidentiality leads to reversible data transformations, such as data element encryption. The idea of encryption is to take input data (plain text) and output new data (cipher text) from which the original cannot be recovered without the use of a key. A simple example would be to create a 1-to-1 mapping for each letter in the alphabet, or a code.
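The 1-to-1 letter mapping mentioned above is a classic substitution cipher. A minimal sketch (names and the seed are illustrative) builds a shuffled alphabet as the key and its inverse for recovery; real systems would of course use standard, vetted algorithms rather than this trivially breakable scheme.

```python
import random
import string

def make_cipher(seed=None):
    """Build a random 1-to-1 letter mapping (a substitution cipher)
    and its inverse; the shuffled alphabet acts as the key."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    encrypt = str.maketrans(string.ascii_lowercase, "".join(shuffled))
    decrypt = str.maketrans("".join(shuffled), string.ascii_lowercase)
    return encrypt, decrypt

enc, dec = make_cipher(seed=1)
cipher_text = "rajani".translate(enc)
print(cipher_text.translate(dec))  # recovers "rajani" with the key
```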

A good cryptographic technique will hide all relationships between the original text and the cipher text. This, however, creates a problem for researchers, particularly with semi-free text fields. If a name is entered as free text, small variations in the entry (e.g., Rajanis vs Rajani's) could lead to substantial variations in the cipher text. Fixing these variations in the letters used (syntax) for words with the same meaning is a process called normalization.
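A normalization step can be sketched as follows (the function name and rules are illustrative, not the paper's): by canonicalizing case, punctuation, and accents before encryption, minor entry variations yield the same plain text and therefore the same cipher text.

```python
import re
import unicodedata

def normalize_name(raw):
    """Normalize a free-text name before encryption so that minor
    variations (case, punctuation, accents) yield identical plain text."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z]", "", text.lower())  # keep lowercase letters only

print(normalize_name("Rajani's") == normalize_name("Rajanis"))  # True
```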

Good encryption is not a substitute for good data access security; it can at best be an added safeguard. Given time, nearly every cryptographic technique can be compromised.

Another important lesson from cryptography is the value of variability in data. Although constructing one uniform deidentified data set for all researchers is the easiest and sometimes the only solution, it also increases the risk of exposure. This risk can be reduced by a number of simple steps. Ideally, individual data sets should be constructed for each research effort. Furthermore, each data set should be encoded independently, and the ordering of individual records should be randomized whenever possible. This ensures that data from one research project cannot be compared against data from another, reduces the potential of a security breach, and limits the damage should a breach occur.
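The per-study encoding and shuffling steps above can be sketched as follows (the study keys, field names, and keyed-hash construction are illustrative assumptions; a proper HMAC would be preferable in practice). Because each study uses its own key and record order, two studies' data sets cannot be linked by patient code.

```python
import hashlib
import random

def build_study_data_set(records, study_key, seed=None):
    """Encode patient identifiers with a study-specific key and randomize
    record order, so data sets from different studies cannot be linked."""
    rng = random.Random(seed)
    masked = [
        {**rec, "patient_id": hashlib.sha256(
            (study_key + rec["patient_id"]).encode()).hexdigest()[:16]}
        for rec in records
    ]
    rng.shuffle(masked)  # randomize record ordering
    return masked

records = [{"patient_id": "p1", "dx": "A"}, {"patient_id": "p2", "dx": "B"}]
set_a = build_study_data_set(records, study_key="study-alpha", seed=1)
set_b = build_study_data_set(records, study_key="study-beta", seed=2)
# The same patient receives different codes in each study's data set.
```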

Data obfuscation

Data obfuscation is an approach to masking data that is weaker than cryptography and is employed primarily to preserve relationships within a data set that would be destroyed by more rigorous masking techniques. Although it complicates recovery of the original information and reduces the pool of would-be intruders, it does not provide the level of structural security that encryption or hashing systems do.
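One common obfuscation technique (offered here as an illustration, not necessarily the authors' example) is date shifting: all of a patient's dates are moved by a single random per-patient offset, masking the actual dates while preserving the intervals between events that many studies depend on.

```python
import datetime
import random

def shift_dates(visits, seed=None):
    """Shift all of a patient's dates by one random per-patient offset.
    Actual dates are masked; intervals between events are preserved."""
    rng = random.Random(seed)
    offset = datetime.timedelta(days=rng.randint(-365, 365))
    return [d + offset for d in visits]

visits = [datetime.date(2006, 1, 10), datetime.date(2006, 3, 1)]
shifted = shift_dates(visits, seed=7)
# The 50-day interval between the two visits survives the shift.
print((shifted[1] - shifted[0]).days)  # 50
```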


What data is needed?

The obvious first step in any data protection effort is careful specification of the data requirement. This is standard practice in most research efforts, and consideration should be given to which records are necessary. Furthermore, as research progresses, access to any subsets of data deemed unnecessary upon inspection should be removed.

What data can be encrypted?

Any relevant relationships discovered in transformed or obfuscated data would be useless without the ability to recover the original values. In the general case, this requires that at least one field be masked in a recoverable fashion. Should an intruder recover the encrypted data, exploiting the information would still require breaking the security of the main database to gain full access to the records.
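One simple way to make a field recoverable (a sketch with illustrative names; the paper does not prescribe this mechanism) is to replace each distinct value with a random token and hold the token-to-value mapping as a separate key, stored apart from the research data set.

```python
import secrets

def mask_recoverably(values):
    """Replace each distinct value with a random token; keep the
    token-to-value mapping as a separate key stored apart from the data."""
    key = {}
    masked = []
    for v in values:
        if v not in key:
            key[v] = secrets.token_hex(8)
        masked.append(key[v])
    recovery_key = {tok: v for v, tok in key.items()}  # token -> original
    return masked, recovery_key

mrns = ["MRN-001", "MRN-002", "MRN-001"]
masked, recovery_key = mask_recoverably(mrns)
print([recovery_key[t] for t in masked])  # ['MRN-001', 'MRN-002', 'MRN-001']
```

Repeated values receive the same token, so relationships within the data set survive masking, yet only the holder of the recovery key can map tokens back to patients.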

What data should be transformed or obfuscated?

Researchers should then determine the confidentiality and the level of acceptable data loss for each field in the desired records. Fields that require only aggregate properties or probabilistic uniqueness should be masked by lossy transformation techniques. However, some effort should be dedicated to considering how such data elements may be obfuscated without destroying relevant relationships.

Establishing the confidentiality of remaining data

It is clearly impractical, and often detrimental, to mask or obfuscate every field in a data set. As a final step in the construction of a research data set, it may be valuable to assess any remaining unique records. Application of techniques such as k-anonymity can ensure that no remaining record is uniquely identifiable.

Physical data security and auditing

The best defense is good physical data security. Standard data security practices should be used to ensure that the data remain in a secure, access-restricted storage area, and separate unique credentials should be given to each authorized user. A short training session on basic data security may be warranted for study staff.


Data security is of particular relevance given the proliferation of electronic medical and administrative records and the ease with which data can be exported outside of the secure institutional infrastructure. The authors describe an approach that researchers, information services departments, and IRB committees can use. Indeed, coordination among these groups, and the incorporation of security considerations into IRB and journal approval procedures, are the keys to ensuring continued patient data protection in an increasingly digital and interconnected world.


Maintaining patient confidentiality is of paramount importance to researchers. Balancing it against access to sufficiently detailed information for adequate research is a significant challenge for HCOs and their respective IRBs, and there is a lack of a common vocabulary and infrastructure for describing data security measures. I agree with the authors, who should be commended for the crucial work of introducing a framework that researchers and IRBs can use in applying data security techniques.

Related Reads

Professionalism in Medical Informatics


  1. Krishna R, Kelleher K, and Stahlberg E. Patient Confidentiality in the Research Use of Clinical Medical Databases. Am J Public Health. 2007 April; 97(4): 654–658
  2. Castellani B, Castellani J. Data mining: qualitative analysis with health informatics data. Qual Health Res. 2003; 13:1005–1018
  3. Agrawal R, Srikant R. Privacy-preserving data mining. In: Proceedings of 2000 ACM SIGMOD Conference on Management of Data; May 16–18, 2000; Dallas, Tex
  4. Verykios V, Bertino E, Fovino I, Provenza L, Saygin Y, Theodoridis Y. State-of-the-art in privacy preserving data mining. ACM SIGMOD Record. 2004;33:50–57
  5. Steel P. Disclosure risk assessment for microdata. Accessed June 2005
  6. Domingo-Ferrer J, Torra V. Disclosure risk assessment in statistical data protection. J Computational Appl Math. 2004;164:285–293
  7. Sweeney L. Computational Disclosure Control: A Primer on Data Privacy Protection [PhD thesis]. Cambridge, Mass: Massachusetts Institute of Technology; 2001
  8. Bakken DE, Rarameswaran R, Blough DM, Franz AA, Palmer TJ. Data obfuscation: anonymity and desensitization of usable data sets. IEEE Secur Privacy. 2004;2:34–41
  9. Sweeney L. K-anonymity: a model for protecting privacy. Int J Uncertainty, Fuzziness Knowledge-Based Syst. 2002; 10:557–570
  10. Sweeney L. Datafly: a system for providing anonymity in medical data. In: Lin TY, Qian S, eds. Database Security XI: Status and Prospects. New York, NY: Chapman & Hall; 1998:356–381
  11. Berman J. Concept–match medical data scrubbing: how pathology text can be used in research. Arch Pathol Lab Med. 2003;127:680–686
  12. Rivest R. The MD5 message digest algorithm. Accessed December 22, 2006
  13. Schneier B. Applied Cryptography: Protocols, Algorithms, and Source Code in C. 2nd ed. New York, NY: John Wiley & Sons; 1995