Annotating text in HTML with UIMA and Jericho

http://sujitpal.blogspot.com/2011/04/annotating-text-in-html-with-uima-and.html

Some time back, I wrote about a UIMA Sentence Annotator component that identifies and annotates sentences in a chunk of text. This works well for plain text input, but the application I am planning to build needs to annotate both HTML and plain text.

The annotator that I ended up building is a two-pass annotator. In the first pass, it iterates through the document text node by node, applying the include and skip rules for tags and attributes. In the second pass, it iterates through the (pre-processed) document text line by line, filtering by density as described here. The annotator then marks each retained text block with its original character positions in the document.
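
To make the second pass more concrete, here is a minimal sketch of line-by-line density filtering that reports each kept line by its character offsets in the input. It is not the annotator's actual code: the class and method names are hypothetical, the density measure (share of characters outside `<...>` markup) and the 0.5 threshold are assumptions rather than the formula from the referenced post, and it uses only the Java standard library instead of UIMA or Jericho.

```java
import java.util.ArrayList;
import java.util.List;

public class DensityFilterSketch {

    /** A kept text block, located by character offsets in the input string. */
    public static final class Span {
        public final int begin;  // inclusive
        public final int end;    // exclusive
        Span(int begin, int end) { this.begin = begin; this.end = end; }
    }

    /** Assumed density measure: fraction of characters outside <...> markup. */
    static double density(String line) {
        if (line.isEmpty()) return 0.0;
        int textChars = 0;
        boolean inTag = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) textChars++;
        }
        return (double) textChars / line.length();
    }

    /** Walks the text line by line and keeps lines at or above minDensity. */
    static List<Span> denseLines(String doc, double minDensity) {
        List<Span> spans = new ArrayList<>();
        int begin = 0;
        while (begin < doc.length()) {
            int nl = doc.indexOf('\n', begin);
            int end = (nl < 0) ? doc.length() : nl;
            String line = doc.substring(begin, end);
            if (!line.trim().isEmpty() && density(line) >= minDensity) {
                // Offsets are taken from the input itself, so they still
                // point at the right characters after filtering.
                spans.add(new Span(begin, end));
            }
            begin = end + 1;  // skip past the newline
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "<div><a href=\"/\">Home</a></div>\n"
                   + "<p>This is a long paragraph of body text.</p>\n";
        for (Span s : denseLines(doc, 0.5)) {
            System.out.println(s.begin + "-" + s.end + ": "
                    + doc.substring(s.begin, s.end));
        }
    }
}
```

Because each span carries offsets into the text that was scanned, a downstream annotator can attach annotations at those positions without re-locating the text, provided the pre-processing step preserves character positions.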
