http://sujitpal.blogspot.com/2011/12/uima-annotator-to-identify-chemical.html
Some time back, our in-house pharmacists did some work to add systematic (chemical) names for drugs in our taxonomy. The expectation was that we (the search, concept mapping and indexing team) should now be able to find references to these chemical names in medical research journals and map them back to the associated drug concept.
I had almost completely forgotten about this (since I was focusing on a different aspect of the project), but one of the questions that had come up was how we were going to distinguish between these chemical names and regular synonyms for matching purposes. Here are the chemical names of two common drugs (taken from ChemSpider):
Drug | Chemical name |
Aspirin | 2-Acetoxybenzoic acid |
Lipitor | Calcium bis{(3R,5R)-7-[2-(4-fluorophenyl)-5-isopropyl-3-phenyl-4-(phenylcarbamoyl)-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoate} |
Recently, I've been working on building a faster loader for my TGNI application (more on that after I am done with it), and I noticed that my analyzer was thrashing on concepts that contained chemical names as synonyms, so I was forced to think about how to handle them. The TGNI approach is to treat these as keywords, which requires them to be identified somehow as chemical names.
As the examples above show, a human can easily look at such a sequence and conclude that it is a chemical name, as opposed to something like "Calcium Hydroxide Poisoning". It is less obvious how a computer program would go about distinguishing them, however. I had been thinking along the lines of building some sort of super-regex that would match all these sequences, but since I am not much of an organic chemistry person, I did not make much progress.
After a bit of googling, I came upon this thread, where the original poster was stuck at about the same point as I was. In this post, I describe the solution I came up with (based heavily on the advice provided on the thread).
The idea is that these chemical names are built using a finite (or slowly evolving) set of components. Some of these components, such as numbers like 3 or 4, or single letters such as R, don't have much power to distinguish the sequences from non-chemical names, but components such as "benzoic" or "diethyl" do, since they are far more likely to occur in chemical names than anywhere else. The other distinguishing feature of chemical names is that they always contain one or more of a finite set of separator characters.
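To make the idea concrete, here is a minimal Python sketch of the two checks just described. It is only an illustration, not the annotator itself: the separator set, the tiny COMPONENTS set and the function name looks_like_chemical_name are all assumptions made up for this example, and a real dictionary would be much larger.

```python
import re

# Assumed separator characters; the real set may well differ.
SEPARATORS = "-,()[]{}"
# Split on whitespace or any run of the assumed separators.
SPLIT_RE = re.compile(r"[\s\-,()\[\]{}]+")

# Tiny hypothetical dictionary of distinguishing components,
# standing in for a much larger one built from a real source.
COMPONENTS = {"acetoxybenzoic", "benzoic", "diethyl",
              "fluorophenyl", "isopropyl"}


def looks_like_chemical_name(text, components=COMPONENTS):
    """Return True if text has a separator AND a known component."""
    has_separator = any(ch in SEPARATORS for ch in text)
    tokens = [t.lower() for t in SPLIT_RE.split(text) if t]
    has_component = any(t in components for t in tokens)
    return has_separator and has_component


if __name__ == "__main__":
    for name in ["2-Acetoxybenzoic acid", "Calcium Hydroxide Poisoning"]:
        print(name, "=>", looks_like_chemical_name(name))
```

Requiring both signals is a deliberate choice in this sketch: a separator alone would match any hyphenated phrase, and a component alone would match plain-English phrases that happen to mention a chemical term.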
For my "dictionary" of highly distinguishable chemical name components, I downloaded a file from Protein Data Bank's Chemical Component Dictionary page (look for the link titled mmCIF) and parsed it with the Python script shown below.