Probabilistic Case Detection for Disease Surveillance Using Data in Electronic Medical Records

From Clinfowiki


Disease surveillance is critical to public health. The overall objective is to detect, at the earliest possible point, the outbreak of a communicable disease or toxic event. This detection relies on accurate diagnosis and reporting of individual cases. Public health informatics is the area of information science that specializes in matters of public health. The World Health Organization defines public health as "all organized measures (whether public or private) to prevent disease, promote health, and prolong life among the population as a whole." In 2011, the Online Journal of Public Health Informatics published a study by Tsui et al. [1] examining the efficacy of a probabilistic reasoning engine combined with natural language processing (NLP) to capture clinical findings and compute the probability of influenza and influenza-like illness (ILI). Accurate reporting of aggregate data from individual cases of influenza and ILI is a key factor in the effective detection of outbreaks. Cases are identified using a case definition, "a written statement of findings that are both necessary and sufficient to classify an individual as having a disease or syndrome" [1]. One current challenge to effective surveillance is the degree of clinical subjectivity, which leads to errors in diagnosis. A second challenge is that much of the critical data resides in medical records as free text, which makes it unavailable for aggregation and accurate reporting.

  • Note: Although the article abbreviates "case detection system" as "CDS", this summary does not use that abbreviation, to avoid confusion with clinical decision support.


The authors report using a case detection system for the 3 years prior to the publication of this study in 2011. Cases were randomly selected from emergency departments in the UPMC Health System. Data for these cases from the electronic medical record were converted to LOINC and SNOMED codes, then sent via HL7 messaging standards to the case detection system. The Bayesian probabilistic case detection system consisted of:

  • (1) Topaz NLP to extract and standardize free-text clinical reports and chief complaints
  • (2) individual and population disease models, and
  • (3) a Bayesian inference engine

Example Boolean diagnostic determination using NLP data:

  • Topaz NLP extraction of symptoms or findings including 'fever', 'cough', 'sore throat', and
  • Boolean case definition of ILI: (Fever) AND (Cough OR Sore Throat)
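
The Boolean case definition above can be expressed directly as a predicate over the findings extracted for a patient. The following is a minimal sketch, not the authors' implementation: the function name `meets_ili_definition` and the representation of NLP output as a plain list of finding strings are assumptions made for illustration.

```python
def meets_ili_definition(findings):
    """Return True if findings satisfy (Fever) AND (Cough OR Sore Throat).

    `findings` is a hypothetical list of finding names, standing in for
    whatever structured output the NLP step produces.
    """
    # Normalize case and whitespace so "Fever " and "fever" match.
    present = {f.strip().lower() for f in findings}
    return "fever" in present and ("cough" in present or "sore throat" in present)


if __name__ == "__main__":
    print(meets_ili_definition(["Fever", "cough"]))        # fever plus cough
    print(meets_ili_definition(["cough", "sore throat"]))  # no fever
```

A Boolean definition like this yields only a yes/no classification, which is the limitation the probabilistic approach is meant to address.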

This system calculates, for each individual patient, the likelihood of the observed data given influenza or ILI, e.g. P(Data | influenza), then passes these likelihoods to the outbreak detection system at the regional department of public health. The system was evaluated on (1) case detection performance for one illness based on one data type (e.g. ED reports), and (2) NLP performance in the extraction of findings from ED reports. A total of 580 test cases were evaluated.
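
To make the quantity P(Data | influenza) concrete, the sketch below computes it for a single patient under a simplifying assumption that findings are conditionally independent given the disease state (a naive Bayes model). The study used Bayesian network models whose structure and parameters are not given in this summary, so the probability table, the two disease states, and the function names here are all invented for illustration.

```python
# Hypothetical values of P(finding present | disease state).
# These numbers are made up for illustration and are NOT from Tsui et al.
P_FINDING = {
    "influenza": {"fever": 0.80, "cough": 0.75, "sore throat": 0.50},
    "other":     {"fever": 0.20, "cough": 0.30, "sore throat": 0.15},
}


def likelihood(observed, state):
    """P(Data | state), assuming findings are independent given the state.

    `observed` maps each finding name to True (present) or False (absent).
    """
    p = 1.0
    for finding, present in observed.items():
        p_present = P_FINDING[state][finding]
        p *= p_present if present else (1.0 - p_present)
    return p


def posterior_influenza(observed, prior):
    """P(influenza | Data) by Bayes' rule over the two assumed states."""
    l_flu = likelihood(observed, "influenza")
    l_other = likelihood(observed, "other")
    return (l_flu * prior) / (l_flu * prior + l_other * (1.0 - prior))


if __name__ == "__main__":
    patient = {"fever": True, "cough": True, "sore throat": False}
    print(likelihood(patient, "influenza"))
    print(posterior_influenza(patient, prior=0.10))
```

Passing likelihoods rather than posteriors downstream leaves the choice of prior (for example, current regional prevalence) to the outbreak detection system.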


The authors concluded that their Bayesian case detection system was highly accurate and computationally efficient. Diagnostic accuracy was reported to be 95% for each of two reasoning network variants (the expert model and the EM-MAP trained model). Computational speed was reported as 15 milliseconds per case. The Topaz NLP performed with a kappa value of 0.79 and an overall accuracy of 91%.
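
For readers unfamiliar with the two NLP metrics, the sketch below shows how overall accuracy and a kappa statistic can be computed from a 2x2 agreement table. It assumes the reported kappa is Cohen's kappa between NLP output and a reference annotation, which this summary does not state explicitly. The function name and the counts used in the example are hypothetical and are not the study's data.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (observed agreement, Cohen's kappa) for a 2x2 table.

    tp/tn: both raters agree the finding is present/absent.
    fp/fn: the raters disagree.
    """
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement: product of each rater's marginal rates, summed
    # over the "present" and "absent" categories.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_expected = p_yes + p_no
    kappa = (p_observed - p_expected) / (1.0 - p_expected)
    return p_observed, kappa


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formula.
    acc, kappa = accuracy_and_kappa(tp=40, fp=5, fn=5, tn=50)
    print(acc, kappa)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the accuracy figure for the same table.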


It is reasonable to expect that disease surveillance and outbreak detection will rise in importance in relation to increases in population, mobility, and violent social unrest. To this end, effective processing of surveillance data by computers will be critical. Structured data is, by extension, a prerequisite to the aggregation of data on a scale sufficient for regional and national analysis. Although constrained to a narrow domain, the results of this study support further development of NLP as a mechanism to extract and standardize clinical findings (e.g. symptoms) originally captured as free text. One such study, published in 2014 [2], demonstrated that human-annotated findings outperformed Topaz NLP, which in turn significantly outperformed MedLEE.

Related Articles

Innovative uses of electronic health records and social media for public health surveillance

IcuARM-An ICU Clinical Decision Support System Using Association Rule Mining


  1. Tsui, F., Wagner, M., Cooper, G., Que, J., Harkema, H., Dowling, J., … Voorhees, R. (2011). Probabilistic Case Detection for Disease Surveillance Using Data in Electronic Medical Records. Online Journal of Public Health Informatics, 3(3), ojphi.v3i3.3793. doi:10.5210/ojphi.v3i3.3793
  2. Ye, Y., Tsui, F. R., Wagner, M., Espino, J. U., & Li, Q. (2014). Influenza detection from emergency department reports using natural language processing and Bayesian network classifiers. Journal of the American Medical Informatics Association, 21(5), 815-823.