UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

http://sujitpal.blogspot.com/2011/08/uima-concept-mapping-interface-to.html

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in an RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene index as the retrieval backend. The concept mapping interface is a UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.

The aggregate AE is basically a fixed flow chain of 3 primitive AEs.

The first AE extracts non-boilerplate plain text from an HTML page, which I have described here. It uses a combination of text/markup density and chunk length to decide which parts of the page it should keep, similar to Boilerpipe, except that Boilerpipe has more rules. I would actually prefer to use Boilerpipe for this AE, but Boilerpipe has no API to return the character offsets of the non-boilerplate chunks, which I need. I have requested this feature but haven't heard back, so until it becomes available, my homegrown code will have to do.
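To make the density idea concrete, here is a minimal sketch of a keep/discard decision based on text/markup density and chunk length. It is not the code of the actual AE: the class name, the two thresholds and the naive regex-based tag stripping are all assumptions made for illustration, and it does not track the character offsets the real AE needs.

```java
class DensityFilterSketch {

    // Hypothetical thresholds, chosen only for this illustration.
    static final double MIN_DENSITY = 0.5;
    static final int MIN_TEXT_LENGTH = 40;

    /** Removes anything that looks like a tag (naive; real HTML needs a parser). */
    static String stripTags(String htmlChunk) {
        return htmlChunk.replaceAll("<[^>]*>", "");
    }

    /** Fraction of the chunk's characters that lie outside markup tags. */
    static double density(String htmlChunk) {
        if (htmlChunk.isEmpty()) {
            return 0.0;
        }
        return (double) stripTags(htmlChunk).length() / htmlChunk.length();
    }

    /** Keeps a chunk only if it is long enough and mostly text. */
    static boolean keep(String htmlChunk) {
        String text = stripTags(htmlChunk).trim();
        return text.length() >= MIN_TEXT_LENGTH
            && density(htmlChunk) >= MIN_DENSITY;
    }
}
```

With these made-up thresholds, a paragraph of running text wrapped in a single tag is kept, while a short navigation link is dropped because its text is too short.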

The second AE in the chain takes each non-boilerplate chunk (marked up as a TextAnnotation), and breaks each chunk into sentences, which I have described here. Each sentence is marked up as a Sentence Annotation.

For both of the above AEs, the mime-type of the text (text/html, text/plain or string/plain) indicates whether the stage should be short-circuited. The mime-type is set into the CAS using setSofaMimeType().
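The short-circuiting can be pictured as a small decision table keyed on the mime-type. The sketch below is plain Java with no UIMA dependency, and the mapping it encodes is my assumption about the intent: HTML goes through all three stages, plain text skips text extraction, and a query string (string/plain) skips sentence splitting as well. The class, method and stage names are hypothetical. In a real annotator the check would presumably live inside each AE and read the value back from the CAS.

```java
import java.util.ArrayList;
import java.util.List;

class MimeTypeFlowSketch {

    /** Returns the names of the stages that would run for a mime-type. */
    static List<String> stagesFor(String mimeType) {
        List<String> stages = new ArrayList<>();
        // Assumption: only HTML needs boilerplate removal.
        if ("text/html".equals(mimeType)) {
            stages.add("text-extraction");
        }
        // Assumption: a query string is treated as a single unit,
        // so sentence splitting is skipped for it.
        if (!"string/plain".equals(mimeType)) {
            stages.add("sentence-splitting");
        }
        // The concept mapping stage always runs.
        stages.add("concept-mapping");
        return stages;
    }

    public static void main(String[] args) {
        for (String mime : new String[] {"text/html", "text/plain", "string/plain"}) {
            System.out.println(mime + " -> " + stagesFor(mime));
        }
    }
}
```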

The third AE (which I will describe in this post) takes each Sentence Annotation, extracts the covered text, creates word shingles out of it (maximum shingle size set to 5), and sends each shingle to NodeService.getConcepts(), described here. Each call can return one or more concepts, which are accumulated by the AE and returned to the caller.
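For readers unfamiliar with word shingles, the sketch below shows one way to generate them from a sentence. It is only an illustration under my own assumptions: it splits on whitespace rather than using a proper tokenizer, and the class and method names are hypothetical. Each resulting string is what would be handed to the concept lookup.

```java
import java.util.ArrayList;
import java.util.List;

class ShingleSketch {

    /** Returns every run of 1..maxSize consecutive words in the sentence. */
    static List<String> shingles(String sentence, int maxSize) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        // Assumption: whitespace tokenization is good enough for a sketch.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the shingle one word at a time, up to maxSize words.
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + size - 1]);
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : shingles("heart attack risk factors", 5)) {
            System.out.println(s);
        }
    }
}
```

A four-word sentence yields 10 shingles (4 + 3 + 2 + 1), because the maximum size of 5 is never reached. Longer sentences produce at most 5 shingles per starting word.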
