boilerpipe

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

https://code.google.com/p/boilerpipe/

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extraction is fast (on the order of milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

The algorithms used by the library build on and extend concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, in New York City, NY, USA. The paper and the presentation slides are available online. A video of the presentation is freely available on Videolectures.net (turn the speaker balance to the left to improve audio quality).
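To give a feel for what "shallow text features" can mean, here is a minimal sketch in plain Java. It is not boilerpipe's code and does not use its API: the class `ShallowFeatureSketch`, the `Block` type, and both thresholds are hypothetical names and values chosen for illustration. It assumes the page has already been split into text blocks, and it keeps a block only when it has enough words and a low share of words inside links.

```java
import java.util.ArrayList;
import java.util.List;

public class ShallowFeatureSketch {

    /** A text block plus the number of its words that sit inside links. */
    public static final class Block {
        final String text;
        final int linkedWords;

        public Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        /** Number of whitespace-separated tokens in the block. */
        int wordCount() {
            String t = text.trim();
            return t.isEmpty() ? 0 : t.split("\\s+").length;
        }

        /** Fraction of words that are link text (0.0 for an empty block). */
        double linkDensity() {
            int n = wordCount();
            return n == 0 ? 0.0 : (double) linkedWords / n;
        }
    }

    // Hypothetical thresholds, picked only for this illustration.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not mostly links. */
    public static boolean isContent(Block b) {
        return b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Returns the text of every block classified as content, in order. */
    public static List<String> extract(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (isContent(b)) {
                kept.add(b.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block("Home News Sport Contact", 4));
        page.add(new Block("The boilerpipe library provides algorithms to detect and "
                + "remove the surplus clutter around the main textual content of a web page.", 0));
        page.add(new Block("Share Edit Delete", 3));

        // Prints only the blocks that pass both checks.
        for (String text : extract(page)) {
            System.out.println(text);
        }
    }
}
```

The sketch looks at each block in isolation and needs nothing beyond the document itself, which matches the "no global or site-level information" property described above. A real extractor would use more carefully derived rules than these two fixed cut-offs.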

Commercial support is available through Kohlschütter Search Intelligence.
