Use of a support vector machine for categorizing free-text notes

From Clinfowiki

The use of natural language processing (NLP) to extract discrete data from free-text documentation in an electronic health record (EHR) is one of the most challenging and rewarding fields of study in healthcare informatics. In the article "Use of a support vector machine for categorizing free-text notes: assessment of accuracy across two institutions," published in the Journal of the American Medical Informatics Association (JAMIA) in September-October 2013, the authors describe the use of NLP to identify EHR progress notes that pertain to diabetes.[1]



Background

Electronic health record (EHR) users must regularly review large amounts of data in order to make informed clinical decisions, and such review is time-consuming and often overwhelming. Technologies like automated summarization tools, EHR search engines and natural language processing have been shown to help clinicians manage this information.


Objective

To develop a support vector machine (SVM)-based system for identifying EHR progress notes pertaining to diabetes, and to validate it at two institutions.

Materials and Methods

We retrieved 2000 EHR progress notes from patients with diabetes at the Brigham and Women's Hospital (1000 for training and 1000 for testing) and another 1000 notes from the University of Texas Physicians (for validation). We manually annotated all notes and trained an SVM using a bag-of-words approach. We then applied the SVM to the testing and validation sets and evaluated its performance with the area under the curve (AUC) and F statistics.
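The training step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a linear SVM fitted by subgradient descent on the hinge loss, uses raw word counts as the bag-of-words features, and the six toy notes, the hyperparameters and all function names are invented for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase a note and count its word tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def score(weights, bias, counts):
    """Linear decision function: w . x + b over the words present in a note."""
    return sum(weights.get(w, 0.0) * c for w, c in counts.items()) + bias

def train_linear_svm(notes, labels, epochs=50, lam=0.001, lr=0.1):
    """Fit a linear SVM by subgradient descent on L2-regularized hinge loss.

    labels: +1 for diabetes-related notes, -1 for all others.
    Hyperparameter values are arbitrary choices for this toy example.
    """
    features = [bag_of_words(n) for n in notes]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * score(weights, bias, x)
            for w in weights:      # regularization step: shrink every weight
                weights[w] *= 1 - lr * lam
            if margin < 1:         # hinge loss is active: push toward the label
                for w, c in x.items():
                    weights[w] = weights.get(w, 0.0) + lr * y * c
                bias += lr * y
    return weights, bias

# Toy notes invented for illustration; the study trained on 1000 annotated notes.
train_notes = [
    "patient with type 2 diabetes, continue metformin, check hba1c",
    "insulin dose adjusted for elevated glucose, diabetes follow up",
    "hba1c 8.2, discussed diabetes diet and glucose monitoring",
    "ankle sprain after fall, advised rest and ice",
    "seasonal allergies, prescribed loratadine",
    "follow up for knee pain, physical therapy referral",
]
train_labels = [1, 1, 1, -1, -1, -1]
weights, bias = train_linear_svm(train_notes, train_labels)

def is_diabetes_related(note):
    """Classify an unseen note by the sign of the decision function."""
    return score(weights, bias, bag_of_words(note)) > 0

print(is_diabetes_related("diabetes follow up, hba1c improved, glucose controlled on metformin"))
print(is_diabetes_related("knee pain after fall, rest advised"))
```

Because the features are only word counts, the classifier knows nothing about word order or context, which is what makes the approach simple and also what limits it.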


Results

The model accurately identified diabetes-related notes in both the Brigham and Women's Hospital testing set (AUC=0.956, F=0.934) and the external University of Texas Faculty Physicians validation set (AUC=0.947, F=0.935).
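For readers unfamiliar with the two metrics, the sketch below shows how they can be computed from a classifier's output. It assumes that "F" denotes the balanced F-measure (the harmonic mean of precision and recall) and that AUC is the area under the ROC curve, computed here as the fraction of positive/negative pairs ranked correctly. The function names and the five-note example are hypothetical, not taken from the article.

```python
def f_measure(y_true, y_pred):
    """Balanced F-measure from binary labels (1 = diabetes-related, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """AUC as the probability that a positive note outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: five notes, three of them truly diabetes-related.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]              # classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold chosen arbitrarily

print(f_measure(y_true, y_pred))
print(roc_auc(y_true, scores))
```

The F-measure depends on the decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is presumably why the summary reports both.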


Discussion

Overall, the model we developed was quite accurate. Furthermore, it generalized, without loss of accuracy, to another institution with a different EHR and a distinct patient and provider population.


Conclusion

It is possible to use an SVM-based classifier to identify EHR progress notes pertaining to diabetes, and the model generalizes well.


Comments

My interest in this article stems from my ongoing work in an acute care hospital system to identify patients who fall into specific disease-process classifications, in order to activate Clinical Decision Support (CDS) tools that both provide evidence-based practice options to healthcare providers and capture documentation required for regulatory reporting.

Using a fairly straightforward NLP program, the authors identified Progress Notes pertaining to diabetes with high accuracy, which they demonstrated convincingly on an external validation set. Their method relied on keyword matches, and the authors noted that it is therefore limited for Progress Notes that involve diabetic patients but contain none of the keywords.

This approach could provide some of the desired identification of patients in the acute care setting. However, that task calls for a much more comprehensive review of the entire medical record (labs, radiology reports, orders), looking not only for keywords but also for specific parameters (vital signs and labs outside specified ranges, procedures performed within a certain amount of time, medications ordered). Such a review increases the complexity of the NLP required, especially as negation must also be taken into account.
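To make the negation problem concrete, here is a minimal sketch of a keyword check that ignores a mention when a negation cue appears shortly before it. It is only an assumption about how such a rule might look: the trigger list, the three-token window and the function name are illustrative choices, not something described in the article.

```python
import re

# Illustrative negation cues only; a real system would need a far larger list.
NEGATION_TRIGGERS = {"no", "not", "denies", "without", "negative"}

def mentions_affirmed(note, keyword, window=3):
    """Return True if `keyword` occurs with no negation cue in the
    `window` tokens immediately before it."""
    tokens = re.findall(r"[a-z0-9]+", note.lower())
    for i, tok in enumerate(tokens):
        if tok == keyword:
            preceding = tokens[max(0, i - window):i]
            if not NEGATION_TRIGGERS.intersection(preceding):
                return True
    return False

print(mentions_affirmed("Patient has diabetes, on metformin", "diabetes"))
print(mentions_affirmed("No history of diabetes or hypertension", "diabetes"))
```

Even this small rule shows where the complexity comes from: it ignores sentence boundaries, so a cue at the end of one sentence can wrongly negate a keyword at the start of the next, and it handles only single-word keywords.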

Of note, the high degree of success in this study associated with a relatively simple approach did give me hope that some complex NLP challenges may actually be much less difficult to overcome given the right viewpoint and approach.

Related Resources

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98 (pp. 137-142). Springer Berlin Heidelberg.

Using natural language processing to identify problem usage of prescription opioids

Visualizing unstructured patient data for assessing diagnostic and therapeutic history


1. Wright A, McCoy AB, Henkin S, Kale A, Sittig DF. Use of a support vector machine for categorizing free-text notes: assessment of accuracy across two institutions. Journal of the American Medical Informatics Association: JAMIA. 2013;20(5):887-890. doi:10.1136/amiajnl-2012-001576.