UIMA Concept Mapping Interface to Lucene/Neo4j Datastore
http://sujitpal.blogspot.com/2011/08/uima-concept-mapping-interface-to.html
Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in an RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene index as the retrieval backend. The concept mapping interface is a UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.
The aggregate AE is basically a fixed flow chain of 3 primitive AEs.
The first AE extracts non-boilerplate plain text from an HTML page, which I have described here. It uses a combination of text/markup density and chunk length to decide which parts of the page it should keep, similar to Boilerpipe, except that Boilerpipe has more rules. I would actually prefer to use Boilerpipe for this AE, but Boilerpipe has no API to return the character offsets of the non-boilerplate chunks, which I need. I have requested this feature but haven't heard back, so until it becomes available, my homegrown code will have to do.
The second AE in the chain takes each non-boilerplate chunk (marked up as a TextAnnotation), and breaks each chunk into sentences, which I have described here. Each sentence is marked up as a Sentence Annotation.
For both the above AEs, the MIME type of the text (text/html, text/plain or string/plain) indicates whether the stage should be short-circuited. This is set into the CAS using setSofaMimeType().
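To make the short-circuiting idea concrete, here is a minimal sketch in plain Java with no UIMA dependency. The exact mapping from MIME type to skipped stages is my assumption, not something stated above: I assume text/html runs every stage, text/plain skips the HTML text extraction, and string/plain (query strings) skips both extraction and sentence splitting. The names StageSelector, Stage and stagesFor are hypothetical and exist only for this illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper: decides which stages of the chain run for a MIME type.
final class StageSelector {

    enum Stage { HTML_TEXT_EXTRACTION, SENTENCE_SPLITTING, CONCEPT_MAPPING }

    // ASSUMPTION: this mapping is a guess at the intended behavior,
    // not the mapping used by the actual AEs.
    static Set<Stage> stagesFor(String mimeType) {
        switch (mimeType) {
            case "text/html":
                // Full chain: strip boilerplate, split sentences, map concepts.
                return EnumSet.allOf(Stage.class);
            case "text/plain":
                // No markup to strip, so the first stage is short-circuited.
                return EnumSet.of(Stage.SENTENCE_SPLITTING, Stage.CONCEPT_MAPPING);
            case "string/plain":
                // A query string is treated as a single unit of text.
                return EnumSet.of(Stage.CONCEPT_MAPPING);
            default:
                throw new IllegalArgumentException("Unsupported MIME type: " + mimeType);
        }
    }
}
```

In the real chain each primitive AE would make this decision for itself after reading the MIME type back from the CAS, rather than consulting a central selector like this one.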
The third AE (which I will describe in this post) takes each Sentence Annotation, extracts the covered text, creates word shingles out of it (maximum shingle size set to 5), and sends each shingle to NodeService.getConcepts(), described here. Each call can return one or more concepts, which are accumulated by the AE and returned to the caller.
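The shingling step can be sketched in plain Java as follows. This is only an illustration under a few assumptions of mine: words are split on whitespace, shingles of every size from 1 up to the maximum are produced, and they are emitted in order of start position. The class name WordShingler and the method shingles are hypothetical, and the lookup against NodeService.getConcepts() is left out because its signature is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds word shingles (word n-grams) from a sentence.
final class WordShingler {

    // Returns every run of 1..maxSize consecutive words, ordered by start position.
    // ASSUMPTION: whitespace tokenization; the real AE may tokenize differently.
    static List<String> shingles(String sentence, int maxSize) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty() || maxSize < 1) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder shingle = new StringBuilder();
            // Grow the shingle one word at a time until it hits maxSize
            // or runs off the end of the sentence.
            for (int len = 1; len <= maxSize && start + len <= words.length; len++) {
                if (len > 1) {
                    shingle.append(' ');
                }
                shingle.append(words[start + len - 1]);
                result.add(shingle.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Each printed shingle is a candidate string for a concept lookup.
        for (String s : shingles("heart attack risk factors", 5)) {
            System.out.println(s);
        }
    }
}
```

With a maximum size of 5, a sentence of n words yields at most 5n shingles, so the number of lookups per sentence grows linearly with sentence length.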