Multi-label classification of chronically ill patients with bag of words and supervised dimensionality reduction algorithms

As patients live longer, physicians must manage a growing number of chronic illnesses over long periods of time. Classifying patients affected by multiple illnesses by means of natural language processing (NLP) can enhance decision support for physicians. Two challenges must be overcome to build a system that correctly classifies the multiple illnesses affecting a chronically ill patient: (a) dealing with irregular multivariate time series; and (b) dealing with the interaction of multiple co-morbidities in a heterogeneous population of patients. In this article, the authors describe a method for performing multi-label classification on the health records of chronically ill patients. [1]



Multi-label classification, kernels, locality preserving projections and multi-class Fisher discriminant analysis are combined in a system for the multi-label classification of chronically ill patients.


This research is motivated by the issue of classifying illnesses of chronically ill patients for decision support in clinical settings. The authors' main objective is to propose multi-label classification of multivariate time series contained in medical records of chronically ill patients, by means of quantization methods, such as bag of words (BoW), and multi-label classification algorithms. The second objective is to compare supervised dimensionality reduction techniques to state-of-the-art multi-label classification algorithms. The hypothesis is that kernel methods and locality preserving projections make such algorithms good candidates to study multi-label medical time series.
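As a rough illustration of the quantization idea, the sketch below turns an irregular, variable-length series of measurements into a fixed-length BoW histogram by assigning each observation to its nearest codeword. The codebook values, the patient data, the function names and the choice to skip missing measurements in the distance are all assumptions made for this example; they are not taken from the article, where the codebook would normally be learned from data (for instance by clustering).

```python
from math import sqrt

def nearest_codeword(observation, codebook):
    """Return the index of the codeword closest to one observation.

    Missing measurements (None) are skipped in the distance.
    This handling is an assumption of the sketch, not the authors' method.
    """
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sqrt(sum((o - w) ** 2
                        for o, w in zip(observation, word) if o is not None))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bag_of_words(series, codebook):
    """Map a variable-length series to a normalized codeword histogram."""
    counts = [0] * len(codebook)
    for observation in series:
        counts[nearest_codeword(observation, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical codebook over (systolic blood pressure, HbA1c) pairs.
codebook = [(120.0, 5.5), (150.0, 8.0), (135.0, 6.5)]

# Two hypothetical patients with different numbers of visits.
patient_a = [(118.0, 5.4), (152.0, None), (149.0, 8.1), (121.0, 5.6)]
patient_b = [(136.0, 6.4), (134.0, 6.6)]

hist_a = bag_of_words(patient_a, codebook)
hist_b = bag_of_words(patient_b, codebook)
print(hist_a, hist_b)
```

Both patients end up with a vector of the same length regardless of how many visits they had, which is what allows standard classifiers to be applied afterwards.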

Study Design and Method

The authors combine BoW and supervised dimensionality reduction algorithms to perform multi-label classification on the health records of chronically ill patients. The algorithms considered are compared with state-of-the-art multi-label classifiers on two real-world datasets. The Portavita dataset contains 525 type 2 diabetes (DT2) patients, with co-morbidities of DT2 such as hypertension, dyslipidemia, and microvascular or macrovascular issues. The MIMIC II dataset contains 2635 patients affected by thyroid disease, diabetes mellitus, lipoid metabolism disease, fluid electrolyte disease, hypertensive disease, thrombosis, hypotension, chronic obstructive pulmonary disease (COPD), liver disease and kidney disease. The algorithms are evaluated using multi-label evaluation metrics: Hamming loss, one-error, coverage, ranking loss, and average precision.
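For readers unfamiliar with these metrics, the following sketch implements their commonly used definitions in plain Python. It is an illustration only: the toy labels and scores are invented, the function names are hypothetical, and the code assumes every instance has at least one relevant and one irrelevant label. The article's exact metric variants may differ in detail.

```python
def ranks(scores):
    """Rank labels by descending score; rank 1 is the top-scored label."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        rank[j] = position
    return rank

def hamming_loss(Y, Z):
    """Fraction of label slots where the predicted set Z disagrees with Y."""
    wrong = sum(y != z for yr, zr in zip(Y, Z) for y, z in zip(yr, zr))
    return wrong / (len(Y) * len(Y[0]))

def one_error(Y, S):
    """Fraction of instances whose top-scored label is not relevant."""
    errors = 0
    for y, s in zip(Y, S):
        top = max(range(len(s)), key=lambda j: s[j])
        errors += y[top] == 0
    return errors / len(Y)

def coverage(Y, S):
    """Average depth in the ranking needed to cover all relevant labels."""
    total = 0
    for y, s in zip(Y, S):
        r = ranks(s)
        total += max(r[j] for j in range(len(y)) if y[j]) - 1
    return total / len(Y)

def ranking_loss(Y, S):
    """Average fraction of (relevant, irrelevant) pairs ordered wrongly."""
    total = 0.0
    for y, s in zip(Y, S):
        rel = [j for j in range(len(y)) if y[j]]
        irr = [j for j in range(len(y)) if not y[j]]
        bad = sum(s[a] <= s[b] for a in rel for b in irr)
        total += bad / (len(rel) * len(irr))
    return total / len(Y)

def average_precision(Y, S):
    """Average, over relevant labels, of precision at that label's rank."""
    total = 0.0
    for y, s in zip(Y, S):
        r = ranks(s)
        rel = [j for j in range(len(y)) if y[j]]
        total += sum(sum(r[k] <= r[j] for k in rel) / r[j] for j in rel) / len(rel)
    return total / len(Y)

# Toy data: 2 patients, 3 labels. S holds scores, Z thresholds them at 0.5.
Y = [[1, 0, 1], [0, 1, 0]]
S = [[0.9, 0.6, 0.4], [0.2, 0.7, 0.1]]
Z = [[int(v >= 0.5) for v in row] for row in S]

results = {
    "hamming_loss": hamming_loss(Y, Z),
    "one_error": one_error(Y, S),
    "coverage": coverage(Y, S),
    "ranking_loss": ranking_loss(Y, S),
    "average_precision": average_precision(Y, S),
}
print(results)
```

Lower values are better for the first four metrics, while higher is better for average precision. Hamming loss is computed on the predicted label sets, whereas the other four are computed on the label ranking induced by the scores.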


Non-linear dimensionality reduction approaches perform well on medical time series quantized using the BoW algorithm, with results comparable to state-of-the-art multi-label classification algorithms. Chaining the projected features improves the performance of the algorithm compared with pure binary relevance approaches.
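The difference between binary relevance and chaining can be sketched as follows. Binary relevance trains one independent classifier per label, while a chain feeds the labels decided so far into the classifier for the next label. This is a minimal sketch under stated assumptions: the nearest-centroid base classifier, the toy data and all function names are hypothetical stand-ins, and the example works on raw features rather than on the projected features used in the article.

```python
class NearestCentroid:
    """Tiny binary classifier: predict the class whose mean vector is closer.

    Assumes both classes (0 and 1) appear in the training data.
    """

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min((0, 1), key=lambda label: dist(self.centroids[label]))

def fit_binary_relevance(X, Y):
    """One independent classifier per label, all on the same features."""
    return [NearestCentroid().fit(X, [row[j] for row in Y])
            for j in range(len(Y[0]))]

def predict_binary_relevance(models, x):
    return [m.predict_one(x) for m in models]

def fit_chain(X, Y):
    """Classifier j sees the features plus the true labels 0..j-1."""
    models = []
    for j in range(len(Y[0])):
        X_aug = [list(x) + list(row[:j]) for x, row in zip(X, Y)]
        models.append(NearestCentroid().fit(X_aug, [row[j] for row in Y]))
    return models

def predict_chain(models, x):
    """At prediction time, earlier predictions replace the true labels."""
    predicted = []
    for m in models:
        predicted.append(m.predict_one(list(x) + predicted))
    return predicted

# Toy data: one feature, two perfectly correlated labels.
X = [[0.0], [0.2], [0.8], [1.0]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]

br_models = fit_binary_relevance(X, Y)
chain_models = fit_chain(X, Y)
br_pred = predict_binary_relevance(br_models, [0.1])
chain_pred = predict_chain(chain_models, [0.9])
print(br_pred, chain_pred)
```

On data this simple both strategies agree. The chain only has room to help when later labels depend on earlier ones in ways the features alone do not capture.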


The evaluation highlights the feasibility of representing medical health records using BoW for multi-label classification tasks. The study also shows that dimensionality reduction algorithms based on kernel methods, locality preserving projections, or both are good candidates for multi-label classification tasks on medical time series with many missing values and high label density.


The use of natural language processing (NLP) to aid physicians in classifying patient conditions and to expedite clinical decision support (CDS) tools is an extremely important enterprise. The plethora of laboratory tests and diagnostic exams provides increasing opportunities to identify both the condition and the severity of specific illnesses, but physicians' ability to recognize and treat those conditions is limited by their ability to sift this information out of the mountain of data present in a patient's electronic health record (EHR). Methods such as the one described in this article could yield meaningful classifications from this data for use in clinical decision support, giving physicians an exciting opportunity to move to a more intuitive approach to diagnosing and treating chronic conditions.

Related Articles

Visualizing unstructured patient data for assessing diagnostic and therapeutic history


  1. Bromuri S, Zufferey D, Hennebert J, Schumacher M. Multi-label classification of chronically ill patients with bag of words and supervised dimensionality reduction algorithms. J Biomed Inform. 2014;51:165-75.