From Clinfowiki

Apache cTAKES: Natural Language Processor for Electronic Health Records


Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) is a software tool that uses Natural Language Processing (NLP) to convert clinical free text from Electronic Medical Records into structured, ontology-based representations that software can parse. It follows the standard NLP objective of extracting discrete atoms of information from free text (an individual term, a relationship between terms, the kind of information a term represents, etc.), which can then be processed individually or in combination with other atoms to derive meaning. This has many applications in biomedical informatics, because information held in free text, as is common in clinical notes, cannot be used by software applications without further processing. In particular, NLP could facilitate the transition from paper records to EHRs and could significantly aid any clinical research that requires large-scale analysis of clinical text.

Infrastructure and Workflow

cTAKES is a Java application built on the UIMA (Unstructured Information Management Architecture) framework, using components that were trained or developed specifically for clinical text[1]. As a result, it can be extended in the same ways as any UIMA application, and it follows the UIMA workflow. Documents processed by cTAKES go through the following steps[2]:

  • Basic Tokenization:
    • Sentence Splitting
    • Tokenization (splitting sentences into individual tokens, most commonly individual words)
    • Normalization (mapping tokens to a common standard form, e.g. converting synonyms to a standard term or plurals to singular)
    • Part-of-speech Tagging (identifying each token's part of speech, such as noun, verb, or adjective)
  • Entity Recognition:
    • Identifies the tokens in a tokenized sentence that represent clinical concepts
    • Classifies them using UMLS classifications (see Capabilities section)
    • Assigns each a unique UMLS concept ID from a dictionary (other vocabularies, such as SNOMED-CT, can also be used)
  • Relation Recognition:
    • Chunking (grouping multiple tokens together into noun phrases, verb phrases, etc.)
    • Constituency/Dependency Parsing (identifying relationships between chunks, e.g. which chunk is the subject of the sentence and how the other chunks modify it or each other)
    • Semantic Role Labeling (using the relationships obtained above to identify how the significant tokens within the sentence relate to one another, e.g. in "the test revealed no signs of lesions", "revealed" is grouped with "test", "no", and "lesions")
  • Entity Integration [3]
    • Assigns relational data obtained from relation recognition to the entities discovered in entity recognition: what an entity pertains to (e.g. "Electrocardiogram" pertains to "Patient 1") and whether negation exists (e.g. "Test" showed "NO" "lesions")
    • Extracts temporal and locational relations and integrates them with the relevant entities
    • Attempts to fit each identified entity to an HL7 template with the relevant fields filled out (one of Sign/Symptom, Procedure, Disease/Disorder, Lab, Medication, Anatomical Site)
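
The first two stages of the workflow above can be illustrated with a toy pipeline. The sketch below is written in Python for brevity and is not cTAKES code or the UIMA API: the `Document` class, the stage functions, and the dictionary are all hypothetical, and the concept IDs are placeholders rather than real UMLS identifiers. It only shows the general idea of a chain of stages that each add annotations to one shared document object.

```python
import re
from dataclasses import dataclass, field

# Hypothetical mini-dictionary: normalized term -> (placeholder concept ID, type).
# The IDs are invented for illustration and are NOT real UMLS concept IDs.
DICTIONARY = {
    "lesion": ("CUI-0001", "Sign/Symptom"),
    "appendectomy": ("CUI-0002", "Procedure"),
    "osteoporosis": ("CUI-0003", "Disease/Disorder"),
}


@dataclass
class Document:
    """Shared container that every stage reads from and writes to."""
    text: str
    sentences: list = field(default_factory=list)
    tokens: list = field(default_factory=list)    # one token list per sentence
    entities: list = field(default_factory=list)  # (sentence index, term, ID, type)


def split_sentences(doc):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc.text)
    doc.sentences = [p.strip() for p in parts if p.strip()]


def tokenize(doc):
    # Keep runs of letters and digits; punctuation is dropped.
    doc.tokens = [re.findall(r"[A-Za-z0-9]+", s) for s in doc.sentences]


def normalize(token):
    # Lowercase, then map a simple plural to its singular if the
    # singular form is a known dictionary term.
    t = token.lower()
    if t in DICTIONARY:
        return t
    if t.endswith("s") and t[:-1] in DICTIONARY:
        return t[:-1]
    return t


def recognize_entities(doc):
    # Dictionary lookup on each normalized token.
    for i, sentence_tokens in enumerate(doc.tokens):
        for tok in sentence_tokens:
            term = normalize(tok)
            if term in DICTIONARY:
                concept_id, semantic_type = DICTIONARY[term]
                doc.entities.append((i, term, concept_id, semantic_type))


# Stages run in a fixed order, each one building on the previous stage's output.
PIPELINE = [split_sentences, tokenize, recognize_entities]


def run(text):
    doc = Document(text)
    for stage in PIPELINE:
        stage(doc)
    return doc


doc = run("The test revealed no signs of lesions. Appendectomy done in 2012.")
for entity in doc.entities:
    print(entity)
```

A real system replaces each of these rules with a trained or curated component; the point here is only the ordering of the stages and the way later stages depend on the annotations produced by earlier ones.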

Because it is a Java project, cTAKES can run on any platform that supports the Java Runtime Environment, including Windows, macOS, and Linux.


Capabilities

According to the project, cTAKES has (but is not limited to) the following capabilities for recognizing medical information[4]:

  • Identification of items (events) that are significant in a medical context, namely examinations or observations performed and their results
  • Ability to classify individual events using UMLS/HL7 classifications, such as
    • Sign/Symptoms (...Enlarged Thyroid)
    • Test (Lab)/Procedure Performed (...appendectomy done in 2012...)
    • Disease/Diagnosis (Disorder) (...osteoporosis)
    • Medication Administered (...patient was put under x anaesthetic...)
    • Anatomical/General Observations (...patient had normal reflexes...)
  • Ability to identify relational information pertaining to events, such as
    • Negation (... no signs of difficulty breathing)
    • Uncertainty (...possible myocardial infarction)
    • Temporal (...[event] occurred 3 hours prior to patient arrival at hospital)
    • Locational (...lesion on right lobe of the lung)
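
The negation and uncertainty attributes listed above can be illustrated with a simple cue-word rule. The Python sketch below is an assumption-laden simplification, not the method cTAKES implements: the cue lists, the `WINDOW` size, and the `annotate` function are all hypothetical. It flags an entity as negated or uncertain when a cue word appears within a few tokens before it.

```python
import re

# Hypothetical cue lists; a real system would use far larger, curated lists.
NEGATION_CUES = {"no", "not", "without", "denies"}
UNCERTAINTY_CUES = {"possible", "probable", "suspected"}
WINDOW = 5  # number of preceding tokens to inspect (an arbitrary choice)


def annotate(sentence, entity):
    """Return negation/uncertainty flags for the first mention of `entity`,
    or None if the entity does not occur in the sentence."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    entity_tokens = entity.lower().split()
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            # Only look at the tokens immediately before the mention.
            preceding = set(tokens[max(0, i - WINDOW):i])
            return {
                "entity": entity,
                "negated": bool(preceding & NEGATION_CUES),
                "uncertain": bool(preceding & UNCERTAINTY_CUES),
            }
    return None


print(annotate("No signs of difficulty breathing", "difficulty breathing"))
print(annotate("Possible myocardial infarction", "myocardial infarction"))
```

A fixed window of preceding tokens is the simplest possible scope rule. It will misfire on sentences where the cue belongs to a different clause, which is why the relation recognition steps (chunking and parsing) matter for attaching such attributes to the right entity.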


Limitations

Although cTAKES is based on the UIMA framework, as of version 3.2.6 it is not fully compatible with UIMA-AS (Asynchronous Scaleout), which is used to scale UIMA-based NLP applications across clusters of servers or the cloud[5], even though the application is advertised as scalable. This matters for usability, because processing the large datasets common at larger institutions within a reasonable length of time practically requires cluster computing.

See Also

Natural language processing (NLP)


References

  1. Savova, Guergana K., et al. "Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES): Architecture, Component Evaluation and Applications." Journal of the American Medical Informatics Association: JAMIA 17.5 (2010): 507–513. PMC. Web. 19 Oct. 2016.
  2. Bleeker, Troy. "CTAKES 3.0 Component Use Guide." Apache Software Foundation, 22 Feb. 2013. Web. 19 Oct. 2016.
  3. Presentation on CTAKES at the NIH Informatics Technology for Cancer Research Conference. N.p., n.d. Web. <https://www.youtube.com/watch?v=TpGZIKEDYMw>.
  4. "Why CTakes?" Apache CTAKES™ - Clinical Text Analysis Knowledge Extraction System. Apache Software Foundation, n.d. Web. 22 Oct. 2016. <http://ctakes.apache.org/whycTAKES.html>.
  5. Chu, Selina. "[CTAKES-374] Scaleout of CTAKES Pipeline - ASF JIRA." CTAKES Issue Tracker. N.p., n.d. Web. 23 Oct. 2016. <https://issues.apache.org/jira/browse/CTAKES-374>.

Submitted by Andrew Wen