Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) is an open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities from various dictionaries including the Unified Medical Language System (UMLS): medications, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, subject (patient, family member, etc.) and context (negated/not negated, conditional, generic, degree of certainty). Some of the attributes are expressed as relations, for example the location of a clinical condition (locationOf relation) or the severity of a clinical condition (degreeOf relation).
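The attributes described above can be pictured as a small data model. The following is only an illustrative sketch in plain Java, not the cTAKES type system: the names `Entity`, `Relation`, `Subject`, and `mention` are hypothetical, the ontology codes are placeholders, the degree-of-certainty attribute is left out for brevity, and the argument order of the `locationOf` relation (condition first, anatomical site second) is an assumption.

```java
class ClinicalEntitySketch {

    /** Who the mention is about; the values are illustrative. */
    enum Subject { PATIENT, FAMILY_MEMBER, OTHER }

    /** Hypothetical container for one named entity and its attributes. */
    static final class Entity {
        final int begin;            // start offset of the text span
        final int end;              // end offset (exclusive)
        final String ontologyCode;  // ontology mapping code (placeholder values below)
        final Subject subject;
        final boolean negated;
        final boolean conditional;
        final boolean generic;

        Entity(int begin, int end, String ontologyCode, Subject subject,
               boolean negated, boolean conditional, boolean generic) {
            this.begin = begin;
            this.end = end;
            this.ontologyCode = ontologyCode;
            this.subject = subject;
            this.negated = negated;
            this.conditional = conditional;
            this.generic = generic;
        }

        /** Returns the span of the note that this entity covers. */
        String coveredText(String note) {
            return note.substring(begin, end);
        }
    }

    /** Hypothetical binary relation such as locationOf or degreeOf. */
    static final class Relation {
        final String type;
        final Entity arg1;
        final Entity arg2;

        Relation(String type, Entity arg1, Entity arg2) {
            this.type = type;
            this.arg1 = arg1;
            this.arg2 = arg2;
        }
    }

    /** Builds an entity for the first occurrence of text in the note. */
    static Entity mention(String note, String text, String code,
                          Subject subject, boolean negated) {
        int begin = note.indexOf(text);
        if (begin < 0) {
            throw new IllegalArgumentException("text not found: " + text);
        }
        return new Entity(begin, begin + text.length(), code, subject,
                          negated, false, false);
    }

    public static void main(String[] args) {
        String note = "Patient denies severe pain in the left knee.";
        // "C0000000" and "C0000001" are placeholder codes, not real UMLS identifiers.
        Entity pain = mention(note, "pain", "C0000000", Subject.PATIENT, true);
        Entity knee = mention(note, "left knee", "C0000001", Subject.PATIENT, false);
        Relation location = new Relation("locationOf", pain, knee);

        System.out.println(location.type + "(" + location.arg1.coveredText(note)
                + ", " + location.arg2.coveredText(note) + ")"
                + " negated=" + pain.negated);
    }
}
```

Keeping offsets rather than copied strings means every attribute can be traced back to the exact position in the original note.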
Apache cTAKES was built using the Apache UIMA (Unstructured Information Management Architecture) engineering framework and the Apache OpenNLP natural language processing toolkit. Its components are specifically trained for the clinical domain on diverse manually annotated datasets, and create rich linguistic and semantic annotations that can be utilized by clinical decision support systems and clinical research. cTAKES has been used in a variety of use cases in the domain of biomedicine such as phenotype discovery, translational science, pharmacogenomics and pharmacogenetics.
Apache cTAKES employs a number of rule-based and machine learning methods. Apache cTAKES components include:
- Sentence boundary detection
- Tokenization (rule-based)
- Morphologic normalization
- POS tagging
- Shallow parsing
- Named Entity Recognition
- Dictionary mapping
- Semantic typing is based on these UMLS semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, medications
- Assertion module
- Dependency parser
- Constituency parser
- Semantic Role Labeler
- Coreference resolver
- Relation extractor
- Drug Profile module
- Smoking status classifier
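To illustrate how components like these can be chained, here is a toy sketch in plain Java of the first two stages only. It assumes components run in sequence and each adds annotations to a shared document, which is the general pattern of UIMA-style pipelines; it does not use the UIMA or cTAKES APIs. All names (`Document`, `Annotation`, `Component`, `SentenceDetector`, `Tokenizer`, `run`) are hypothetical, and the rules are deliberately naive (for example, "Dr. Smith" would be split into two sentences and punctuation stays attached to tokens).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PipelineSketch {

    /** A typed span over the document text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Shared store that every component reads from and writes to. */
    static final class Document {
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Document(String text) {
            this.text = text;
        }

        /** Returns a copy of the annotations of one type, safe to iterate while adding. */
        List<Annotation> ofType(String type) {
            List<Annotation> result = new ArrayList<>();
            for (Annotation a : annotations) {
                if (a.type.equals(type)) {
                    result.add(a);
                }
            }
            return result;
        }
    }

    interface Component {
        void process(Document doc);
    }

    /** Naive rule: a sentence ends at '.', '?' or '!'. */
    static final class SentenceDetector implements Component {
        public void process(Document doc) {
            String t = doc.text;
            int start = 0;
            for (int i = 0; i < t.length(); i++) {
                char c = t.charAt(i);
                if (c == '.' || c == '?' || c == '!') {
                    doc.annotations.add(new Annotation("Sentence", start, i + 1));
                    start = i + 1;
                    // Skip whitespace before the next sentence.
                    while (start < t.length() && Character.isWhitespace(t.charAt(start))) {
                        start++;
                    }
                }
            }
            if (start < t.length()) {
                doc.annotations.add(new Annotation("Sentence", start, t.length()));
            }
        }
    }

    /** Naive rule: tokens are whitespace-separated runs inside each sentence. */
    static final class Tokenizer implements Component {
        public void process(Document doc) {
            for (Annotation s : doc.ofType("Sentence")) {
                int tokenStart = -1;
                for (int i = s.begin; i < s.end; i++) {
                    boolean space = Character.isWhitespace(doc.text.charAt(i));
                    if (!space && tokenStart < 0) {
                        tokenStart = i;
                    } else if (space && tokenStart >= 0) {
                        doc.annotations.add(new Annotation("Token", tokenStart, i));
                        tokenStart = -1;
                    }
                }
                if (tokenStart >= 0) {
                    doc.annotations.add(new Annotation("Token", tokenStart, s.end));
                }
            }
        }
    }

    /** Runs the components in order over one document. */
    static Document run(String text, List<Component> components) {
        Document doc = new Document(text);
        for (Component c : components) {
            c.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = run("No fever. Cough for two days.",
                Arrays.<Component>asList(new SentenceDetector(), new Tokenizer()));
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + "\t" + doc.text.substring(a.begin, a.end));
        }
    }
}
```

Because the tokenizer reads the `Sentence` annotations left by the previous stage, later stages in such a design (POS tagging, parsing, entity recognition) could build on earlier output in the same way.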
The goal of cTAKES is to be a world-class natural language processing system in the healthcare domain. cTAKES can be used in a great variety of retrievals and use cases. It is intended to be modular and expandable at the information model and method level. The cTAKES community is committed to best practices and R&D (research and development) by using cutting-edge technologies and novel research. The idea is to quickly translate the best performing methods into cTAKES code.