UIMA Annotator to identify Chemical Names

http://sujitpal.blogspot.com/2011/12/uima-annotator-to-identify-chemical.html

Some time back, our in-house pharmacists did some work to add systematic (chemical) names for drugs in our taxonomy. The expectation was that we (the search, concept mapping and indexing team) should now be able to find references to these chemical names in medical research journals and map them back to the associated drug concept.

I had almost completely forgotten about this (since I was focusing on a different aspect of the project), but one of the questions that had come up was how we were going to distinguish between these chemical names and regular synonyms for matching purposes. Here are the chemical names of two common drugs (taken from ChemSpider):

Aspirin: 2-Acetoxybenzoic acid
Lipitor: Calcium bis{(3R,5R)-7-[2-(4-fluorophenyl)-5-isopropyl-3-phenyl-4-(phenylcarbamoyl)-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoate}

Recently, I've been working on building a faster loader for my TGNI application (more on that after I am done with it), and I noticed that my analyzer was thrashing on concepts that contained chemical names as synonyms, so I was forced to think about how to handle them. The TGNI approach is to treat these as keywords, which requires them to be identified somehow as chemical names.

As you can see, a human can easily look at a sequence like the ones shown and conclude that it is a chemical name, as opposed to something like, say, "Calcium Hydroxide Poisoning". It is less obvious, however, how a computer program would go about distinguishing them. I had been thinking along the lines of building some sort of super-regex that would match all these sequences, but since I am not much of an organic chemistry person, I did not make much progress.

After a bit of googling, I came upon this thread, where the original poster was stuck at about the same point as I was. In this post, I describe the solution I came up with (based heavily on the advice provided on the thread).

The idea is that these chemical names are built using a finite (or slowly evolving) set of components. Some of these components, such as numeric ones like 3 or 4, or single letters such as R, don't have much power to distinguish the sequences from non-chemical names, but components such as "benzoic" or "diethyl" do, since they are far more likely to occur in chemical names than elsewhere. The other distinguishing feature of chemical names is that they always contain one or more characters from a finite set of separators.
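
To make the idea concrete, here is a minimal Python sketch of the two checks described above: the string must contain at least one separator character, and at least one of its informative tokens must appear in a dictionary of chemical name components. The separator set, the function names and the tiny component dictionary are all assumptions made for illustration; they are not taken from the actual annotator.

```python
import re

# Assumed separator set; the real list may well differ.
SEPARATORS = "-,()[]{}"
# Split on the separators above plus whitespace.
_SPLIT_RE = re.compile(r"[\s\-,()\[\]{}]+")


def informative_tokens(text):
    """Lowercase tokens, minus pure numbers and single characters."""
    tokens = [t.lower() for t in _SPLIT_RE.split(text) if t]
    return [t for t in tokens if len(t) > 1 and not t.isdigit()]


def looks_like_chemical_name(text, components):
    """True if text has a separator and at least one known component."""
    if not any(ch in SEPARATORS for ch in text):
        return False
    return any(tok in components for tok in informative_tokens(text))


# Hypothetical toy dictionary, just enough for the examples in this post.
COMPONENTS = {"acetoxybenzoic", "fluorophenyl", "isopropyl", "phenylcarbamoyl"}

print(looks_like_chemical_name("2-Acetoxybenzoic acid", COMPONENTS))
print(looks_like_chemical_name("Calcium Hydroxide Poisoning", COMPONENTS))
```

With this toy dictionary the first call is accepted and the second is rejected, because "Calcium Hydroxide Poisoning" contains none of the assumed separator characters. A real implementation would need a much larger dictionary and probably a threshold on how many tokens have to match.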

For my "dictionary" of highly distinguishable chemical name components, I downloaded a file from Protein Data Bank's Chemical Component Dictionary page (look for the link titled mmCIF) and parsed it with the Python script shown below.
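
As a rough illustration of what such a parser could look like (this is a sketch, not the original script), the Python below pulls the compound names out of mmCIF-style lines and counts the components they are made of. It assumes that each name sits on a single line after a `_chem_comp.name` tag, optionally in quotes; values that continue on following lines are simply skipped. The function names, the separator set and the sample lines are all made up for the example.

```python
import re
from collections import Counter

# Assumed separators (plus whitespace) used to break names into components.
_SPLIT_RE = re.compile(r"[\s\-,()\[\]{}]+")


def component_names(lines):
    """Yield the value of every single-line _chem_comp.name entry."""
    for line in lines:
        parts = line.rstrip("\n").split(None, 1)
        if len(parts) == 2 and parts[0] == "_chem_comp.name":
            value = parts[1].strip()
            # Drop matching surrounding quotes, if any.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            yield value


def count_components(lines):
    """Count name components, ignoring numbers and single characters."""
    counts = Counter()
    for name in component_names(lines):
        for tok in _SPLIT_RE.split(name.lower()):
            if len(tok) > 1 and not tok.isdigit():
                counts[tok] += 1
    return counts


# Hand-written sample lines in the assumed layout, not real file contents.
sample = [
    "data_XXX\n",
    "_chem_comp.id     XXX\n",
    '_chem_comp.name   "2-(ACETYLOXY)BENZOIC ACID"\n',
    "_chem_comp.type   NON-POLYMER\n",
]
counts = count_components(sample)
print(sorted(counts))
```

To run it over a downloaded file, pass the open file object instead of the sample list, for example `count_components(open(path))` where `path` points to wherever the file was saved. Very common tokens such as "acid" would likely need to be filtered out before the counts are used as a dictionary of distinguishing components.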
