Hazelhurst B, Sittig DF, Stevens VJ, Smith KS, Hollis JF, Vogt TM, Winickoff JP, et al. Natural language processing in the electronic medical record: assessing clinician adherence to tobacco treatment guidelines. Am J Prev Med. 2005 Dec; 29(5): 434-9

From Clinfowiki
Revision as of 17:19, 11 November 2006 by Hbc001 (Talk | contribs)

Dr Hazelhurst, Dr Sittig, et al. published a study that evaluated the accuracy of using MediClass, a Natural Language Processing (NLP) technology, within an electronic medical record system to assess clinician adherence to tobacco treatment guidelines. The electronic medical records of known smokers at 4 HMOs were analyzed for the presence or absence of each of the Five A’s of smoking cessation.

Table 1. The ‘Five A’s’ as recommended by the current U.S. Public Health Service Clinical Practice Guideline (CPG) for tobacco treatment and prevention, along with how they are most commonly documented in the electronic medical record (EMR).

| 5A step | Operational definition | Data most commonly in EMR as |
|---------|------------------------|------------------------------|
| Ask | Identify tobacco user status at every visit | Codified |
| Advise | Advise all tobacco users to quit | Free text |
| Assess | Determine patient’s willingness to make a quit attempt | Free text |
| Assist | Aid the patient in quitting | Codified |
| Arrange | Schedule follow-up contact, in person or via telephone | Free text |

As noted in Table 1, only some of the 5A steps (e.g. identification of smoking status and prescriptions for smoking cessation) are codified data in the electronic medical record. Other steps (e.g. assessment of readiness to change, provision of behavior change counseling) are entered within the free-text clinician notes, which makes NLP technology essential for assessing clinician adherence to the tobacco CPG for treatment and prevention.

Initially, the research team developed a controlled set of clinical concepts for MediClass based on common phrases found in the free-text sections of the EMR and codes found in the structured sections of the EMR. Human coders also went through an extensive training program that concluded with a test. Once this initial design, training and development was completed, the researchers first had the human coders and MediClass code 125 randomly selected electronic medical records from each of the four sites.
This allowed the researchers to further ‘train’ MediClass by assessing the disagreements between the two, enhancing MediClass and then rerunning the coding to check accuracy. Rerunning also ensured that the changes did not introduce new discrepancies. These medical records were then removed from the study data. The remaining 500 medical records were then evaluated by MediClass and the human coders; these evaluations are the final study results published in the article.

The researchers found that for the ‘Ask’ and ‘Assist’ steps, where the data is most commonly codified, the mean agreement among the human coders was indistinguishable from the mean agreement between the humans and MediClass (Student’s t-test, p>0.05). For the ‘Assess’ and ‘Advise’ steps, where the data is most commonly found in free text, the humans agreed more often with each other than they did with MediClass (Student’s t-test, p<0.01). ‘Arrange’ was coded too infrequently by the human abstractors to be compared and was dropped from the analysis.

The researchers then further analyzed the data by creating a ‘gold standard’ from the majority opinion of the humans and computing the accuracy of MediClass against it. In this analysis, MediClass agreed with the gold standard 91% of the time. The details of this analysis are shown in Table 2.

Table 2. MediClass performance against the gold standard created from human raters. Values in parentheses after sensitivity and specificity are the ranges reported with each estimate.

| 5A step | Frequency in gold standard (n=500) | Sensitivity | Specificity |
|---------|------------------------------------|-------------|-------------|
| Ask | 417 (83%) | 0.97 (0.95-0.99) | 0.95 (0.88-0.98) |
| Advise | 161 (32%) | 0.68 (0.60-0.75) | 1.0 (0.99-1.0) |
| Assess | 55 (11%) | 0.64 (0.50-0.76) | 0.96 (0.94-0.98) |
| Assist | 71 (14%) | 1.0 (0.94-1.0) | 0.82 (0.78-0.85) |
| Arrange | 1 (0.2%) | NA | NA |

The researchers then did a thorough job of analyzing, discussing and accounting for the lower specificity on ‘Assist’ and lower sensitivity on ‘Advise’, which were the result of 12 specific medical records.
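The gold-standard comparison described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the choice of three raters per record, and the six toy records are all hypothetical, chosen only to show how a majority-vote gold standard and the sensitivity and specificity figures of the kind reported in Table 2 could be computed for a single 5A step.

```python
def majority_vote(ratings):
    """Return True if most raters marked the step as present.

    `ratings` is a sequence of booleans, one per human rater.
    A tie raises an error; an odd number of raters avoids ties.
    """
    yes = sum(1 for r in ratings if r)
    no = len(ratings) - yes
    if yes == no:
        raise ValueError("tie among raters")
    return yes > no


def sensitivity_specificity(gold, predicted):
    """Compare system output to the gold standard, record by record.

    Returns (sensitivity, specificity). Either value is None when its
    denominator is zero, e.g. a step that never occurs in the gold standard.
    """
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if g and p:
            tp += 1
        elif g and not p:
            fn += 1
        elif not g and p:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else None
    spec = tn / (tn + fp) if (tn + fp) else None
    return sens, spec


# Hypothetical toy data: three human raters per record, one 5A step.
human_ratings = [
    (True, True, False),
    (True, True, True),
    (False, False, True),
    (False, False, False),
    (True, False, True),
    (False, True, False),
]
# Hypothetical system output for the same six records.
system = [True, False, False, True, True, False]

gold = [majority_vote(r) for r in human_ratings]
sens, spec = sensitivity_specificity(gold, system)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Returning None when a denominator is zero is one possible way to handle a step with too few gold-standard cases to evaluate, comparable to the NA entries for ‘Arrange’ in Table 2.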
The authors went on to conclude that using a natural language processing application to automate the processing of the entire EMR to evaluate quality measures is both practical and has the potential to provide high-quality measuring capabilities.

Comments: This is a very well designed study that acknowledges the important need for both codified and free-text data within currently available electronic medical record systems. The authors chose a relatively simple but well-defined clinical practice guideline as the basis for their study, one that included both codified and free-text data. They also used an innovative and thoughtful study design to overcome the misconception that free-text data in electronic medical records cannot be successfully used to measure and assess adherence and/or quality. Though desirable, it is not practical in the short term for all data in electronic medical record systems to be codified. This study should encourage all of us not to see free-text data in our electronic medical record systems as a barrier to measuring health care performance, quality and outcomes. Though this smoking cessation CPG example may not be as complex as other CPGs, I think the study design and findings are mostly generalizable and display the value of using well-designed NLP technology to measure health care quality.