|
Open Health Natural Language Processing
| https://wiki.nci.nih.gov/display/VKC/Open+Health+Natural+Language+Processing+(OHNLP)+Consortium The goal of the Open Health Natural Language Processing (OHNLP) Consortium is to establish an open source consortium that promotes past and current development efforts and encourages participation in advancing future efforts. The consortium aims to facilitate and encourage new annotator and pipeline development, to exchange insights and collaborate on novel biomedical natural language processing systems, and to develop gold-standard corpora for development and testing. The Consortium promotes the open source UIMA framework and SDK as the basis for biomedical NLP systems. Applications created within UIMA consist of software components (referred to as annotators) and their associated configuration files and external resources. Within the framework, one can also create complete pipelines composed of a sequence of annotators and the data flow between them. Source code: http://sourceforge.net/projects/ohnlp/files/ |
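The description above, a pipeline as a sequence of annotators with data flowing between them, can be illustrated with a minimal Python sketch. This is not the UIMA API (UIMA is a Java framework with its own CAS and descriptor files); the names `Annotation`, `Document`, `sentence_annotator`, `token_annotator` and `run_pipeline` are hypothetical, and the regular expressions are deliberately naive.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    kind: str   # annotation type, e.g. "Sentence" or "Token"
    begin: int  # character offset where the span starts
    end: int    # character offset just past the span


@dataclass
class Document:
    """Hypothetical stand-in for a shared analysis structure."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def select(self, kind: str) -> List[Annotation]:
        # Return a new list so callers can append while iterating.
        return [a for a in self.annotations if a.kind == kind]

    def covered_text(self, a: Annotation) -> str:
        return self.text[a.begin:a.end]


# An annotator reads the document and adds annotations to it.
Annotator = Callable[[Document], None]


def sentence_annotator(doc: Document) -> None:
    # Naive assumption: a sentence runs up to the next '.', '!' or '?'.
    for m in re.finditer(r"\S[^.!?]*[.!?]?", doc.text):
        doc.annotations.append(Annotation("Sentence", m.start(), m.end()))


def token_annotator(doc: Document) -> None:
    # Depends on the Sentence annotations produced by an earlier annotator.
    for s in doc.select("Sentence"):
        for m in re.finditer(r"\w+|[^\w\s]", doc.covered_text(s)):
            doc.annotations.append(
                Annotation("Token", s.begin + m.start(), s.begin + m.end())
            )


def run_pipeline(text: str, annotators: List[Annotator]) -> Document:
    doc = Document(text)
    for annotate in annotators:  # order matters: later steps read earlier output
        annotate(doc)
    return doc


if __name__ == "__main__":
    doc = run_pipeline("No fever. Denies chest pain.",
                       [sentence_annotator, token_annotator])
    for a in doc.annotations:
        print(a.kind, repr(doc.covered_text(a)))
```

In this sketch the token annotator produces nothing unless the sentence annotator has run first, which is the kind of data-flow dependency a pipeline definition has to capture.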
Citations (4)
Link[1] Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model
Author: Anni Coden, Guergana Savova, Igor Sominsky, Michael Tanenblatt, James Masanz, Karin Schuler - other authors: James Cooper, Wei Guan, Piet C. de Groen Publication info: 2009, Journal of Biomedical Informatics, Volume 42, Issue 5, October 2009 Cited by: Jack Park 1:55 AM 6 February 2013 GMT URL:
| Excerpt / Summary We introduce an extensible and modifiable knowledge representation model to represent cancer disease characteristics in a comparable and consistent fashion. We describe a system, MedTAS/P, which automatically instantiates the knowledge representation model from free-text pathology reports. MedTAS/P is based on an open-source framework and its components use natural language processing principles, machine learning and rules to discover and populate elements of the model. To validate the model and measure the accuracy of MedTAS/P, we developed a gold-standard corpus of manually annotated colon cancer pathology reports. MedTAS/P achieves F1-scores of 0.97–1.0 for instantiating classes in the knowledge representation model such as histologies or anatomical sites, and F1-scores of 0.82–0.93 for primary tumors or lymph nodes, which require the extraction of relations. An F1-score of 0.65 is reported for metastatic tumors, a lower score predominantly due to a very small number of instances in the training and test sets. |
Link[3] Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications
Author: Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng - other authors: Sunghwan Sohn, Karin C Kipper-Schuler, Christopher G Chute Publication info: 2010 Cited by: Jack Park 2:07 AM 6 February 2013 GMT URL:
| Excerpt / Summary We aim to build and evaluate an open-source natural language processing system for information extraction from electronic medical record clinical free-text. We describe and evaluate our system, the clinical Text Analysis and Knowledge Extraction System (cTAKES), released open-source at http://www.ohnlp.org. The cTAKES builds on existing open-source technologies: the Unstructured Information Management Architecture framework and OpenNLP natural language processing toolkit. Its components, specifically trained for the clinical domain, create rich linguistic and semantic annotations. Performance of individual components: sentence boundary detector accuracy=0.949; tokenizer accuracy=0.949; part-of-speech tagger accuracy=0.936; shallow parser F-score=0.924; named entity recognizer and system-level evaluation F-score=0.715 for exact and 0.824 for overlapping spans, and accuracy for concept mapping, negation, and status attributes for exact and overlapping spans of 0.957, 0.943, 0.859, and 0.580, 0.939, and 0.839, respectively. Overall performance is discussed against five applications. The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free-text. |
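The excerpt above reports F-scores separately for exact and overlapping spans. As a rough illustration of why the two numbers differ, the following Python sketch scores predicted character spans against gold spans under both criteria. It is not the cTAKES evaluation code; the function names and the matching rules (identical offsets for "exact", any shared character for "overlap") are assumptions made for this example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    # Two half-open spans overlap if they share at least one character.
    return a[0] < b[1] and b[0] < a[1]


def span_prf(gold: List[Span], predicted: List[Span],
             mode: str = "exact") -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the chosen matching criterion."""
    if mode == "exact":
        match = lambda a, b: a == b
    elif mode == "overlap":
        match = overlaps
    else:
        raise ValueError("mode must be 'exact' or 'overlap'")

    # A predicted span counts as correct if it matches any gold span,
    # and a gold span counts as found if any predicted span matches it.
    correct_pred = sum(1 for p in predicted if any(match(p, g) for g in gold))
    found_gold = sum(1 for g in gold if any(match(g, p) for p in predicted))

    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [(0, 5), (10, 20), (30, 35)]
    pred = [(0, 5), (12, 20), (40, 45)]  # second span has a wrong left boundary
    print("exact  ", span_prf(gold, pred, "exact"))
    print("overlap", span_prf(gold, pred, "overlap"))
```

With these made-up spans, the boundary error on the second prediction counts as a miss under exact matching but as a hit under overlap matching, so the overlap score is the higher of the two.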
Link[4] Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease
Author: Iftikhar J Kullo, Jin Fan, Jyotishman Pathak, Guergana K Savova - other authors: Zeenat Ali, Christopher G Chute Publication info: 2010 Cited by: Jack Park 2:12 AM 6 February 2013 GMT URL:
| Excerpt / Summary Background: There is significant interest in leveraging the electronic medical record (EMR) to conduct genome-wide association studies (GWAS). Methods: A biorepository of DNA and plasma was created by recruiting patients referred for non-invasive lower extremity arterial evaluation or stress ECG. Peripheral arterial disease (PAD) was defined as a resting/post-exercise ankle-brachial index (ABI) less than or equal to 0.9, a history of lower extremity revascularization, or having poorly compressible leg arteries. Controls were patients without evidence of PAD. Demographic data and laboratory values were extracted from the EMR. Medication use and smoking status were established by natural language processing of clinical notes. Other risk factors and comorbidities were ascertained based on ICD-9-CM codes, medication use and laboratory data. Results: Of 1802 patients with an abnormal ABI, 115 had non-atherosclerotic vascular disease such as vasculitis, Buerger’s disease, trauma and embolism (phenocopies) based on ICD-9-CM diagnosis codes and were excluded. The PAD cases (66±11 years, 64% men) were older than controls (61±8 years, 60% men) but had similar geographical distribution and ethnic composition. Among PAD cases, 1444 (85.6%) had an abnormal ABI, 233 (13.8%) had poorly compressible arteries and 10 (0.6%) had a history of lower extremity revascularization. In a random sample of 95 cases and 100 controls, risk factors and comorbidities ascertained from EMR-based algorithms had good concordance compared with manual record review; the precision ranged from 67% to 100% and recall from 84% to 100%. Conclusion: This study demonstrates use of the EMR to ascertain phenocopies, phenotype heterogeneity and relevant covariates to enable a GWAS of PAD.
Biorepositories linked to EMR may provide a relatively efficient means of conducting GWAS. |
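The PAD case definition quoted in the excerpt above (ABI less than or equal to 0.9, prior lower extremity revascularization, or poorly compressible arteries, with phenocopies excluded) can be written as a small rule. The Python sketch below is only an illustration of that rule: the record fields and the `classify` function are hypothetical, and the study's actual EMR algorithms are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VascularRecord:
    """Hypothetical per-patient summary; field names are assumptions."""
    abi: Optional[float]           # lowest resting/post-exercise ABI, if measured
    prior_revascularization: bool  # history of lower extremity revascularization
    poorly_compressible: bool      # poorly compressible leg arteries
    phenocopy_code: bool           # diagnosis code for a non-atherosclerotic cause


def classify(record: VascularRecord) -> str:
    """Return 'case', 'control' or 'excluded' following the quoted definition."""
    meets_definition = (
        (record.abi is not None and record.abi <= 0.9)
        or record.prior_revascularization
        or record.poorly_compressible
    )
    if not meets_definition:
        return "control"
    # Phenocopies (e.g. vasculitis, trauma) are removed from the case group.
    if record.phenocopy_code:
        return "excluded"
    return "case"


if __name__ == "__main__":
    print(classify(VascularRecord(0.85, False, False, False)))
    print(classify(VascularRecord(1.10, False, False, False)))
    print(classify(VascularRecord(0.70, False, False, True)))
```

Keeping the phenocopy check separate from the case criteria mirrors the order described in the excerpt, where phenocopies were identified among patients who otherwise met the definition.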
|
|