Annotating text in HTML with UIMA and Jericho
http://sujitpal.blogspot.com/2011/04/annotating-text-in-html-with-uima-and.html
Some time back, I wrote about a UIMA Sentence Annotator component that identified and annotated sentences in a chunk of text. This works well for plain text input, but in the application I am planning to build, I need to be able to annotate both HTML and plain text.
The annotator that I ended up building is a two-pass annotator. In the first pass, it iterates through the document text node by node, applying the include and skip rules for tags and attributes. In the second pass, it iterates through the (pre-processed) document text line by line, filtering by density as described here. The annotator marks each text block it keeps with that block's character positions in the original document.
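To make the second pass concrete, here is a minimal sketch in plain Java. It is not the actual annotator, and it rests on two assumptions: that the first pass has replaced skipped markup with spaces, so character offsets still match the original document, and that "density" means the fraction of non-whitespace characters in a line (the density definition referred to above may differ). The names DensityFilterSketch, Span, denseLines and minDensity are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: class and method names are hypothetical.
class DensityFilterSketch {

  /** Character span [begin, end) in the original document text. */
  static final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }
  }

  /**
   * Walks the pre-processed text line by line and returns the spans of the
   * lines whose density is at least minDensity. Density here is assumed to be
   * (non-whitespace characters) / (line length).
   */
  static List<Span> denseLines(String text, double minDensity) {
    List<Span> spans = new ArrayList<>();
    int lineStart = 0;
    while (lineStart < text.length()) {
      int newline = text.indexOf('\n', lineStart);
      int lineEnd = (newline == -1) ? text.length() : newline;
      int length = lineEnd - lineStart;

      // Count the characters that survived the first pass as real text.
      int nonBlank = 0;
      for (int i = lineStart; i < lineEnd; i++) {
        if (!Character.isWhitespace(text.charAt(i))) {
          nonBlank++;
        }
      }

      // Offsets are relative to the full text, so they can be used
      // directly as annotation begin/end positions.
      if (length > 0 && (double) nonBlank / length >= minDensity) {
        spans.add(new Span(lineStart, lineEnd));
      }
      lineStart = lineEnd + 1; // step over the newline
    }
    return spans;
  }
}
```

In a UIMA annotator, each returned span would presumably become an annotation whose begin and end are set from these offsets. Because the blanked-out text has the same length as the original, no offset translation is needed.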