Webstemmer

Free and open source web crawler and HTML layout analyzer

Webstemmer Ranking & Summary


  • Rating:
  • License: Freeware
  • Price: FREE
  • Publisher Name: Yusuke Shinyama
  • Publisher web site: http://www.unixuser.org/~euske/
  • Operating Systems: Mac OS X
  • File Size: 317 KB



Webstemmer Description

Webstemmer is a web crawler and HTML layout analyzer that automatically extracts the main text of a news site, without banners, ads, or navigation links mixed in.

Extracting text content from web sites (especially news sites) usually drags in a lot of unwanted material such as ads and banners. You could craft regular expression patterns to pick out only the desired parts, but constructing such patterns is often tricky and time-consuming. Furthermore, some patterns need to be aware of the surrounding context, and some news sites even use several different layouts.

Webstemmer analyzes the layout of each page on a given web site and figures out where the main text is located. The analysis can be done automatically with little human intervention: you only need to give the URL of the top page.

Requirements:
  • Python

What's New in This Release:
  • setup.py added
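To make the idea of locating main text by comparing pages from the same site more concrete, here is a minimal Python sketch. It is not Webstemmer's actual algorithm or API. It is a toy illustration under a simple assumption: text blocks repeated across several pages of a site (navigation bars, ad slogans) are boilerplate, while blocks that occur on only one page are likely main text. The names BlockCollector, text_blocks, main_text, PAGE_A and PAGE_B are hypothetical and exist only for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect the visible text of a page as a list of non-empty text blocks."""

    SKIPPED = {"script", "style"}  # tags whose contents are never visible text

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def text_blocks(html):
    """Return the text blocks of one HTML page, in document order."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def main_text(pages):
    """For each page, keep only the blocks that appear on no other page."""
    per_page = [text_blocks(page) for page in pages]
    seen_on = Counter()  # block -> number of pages containing it
    for blocks in per_page:
        seen_on.update(set(blocks))
    return [[b for b in blocks if seen_on[b] == 1] for blocks in per_page]


# Two made-up pages sharing a navigation bar and an ad (illustrative data only).
PAGE_A = ("<html><body><div>Home | World | Sports</div>"
          "<p>Storm hits the coast.</p><div>Buy now!</div></body></html>")
PAGE_B = ("<html><body><div>Home | World | Sports</div>"
          "<p>Local team wins final.</p><div>Buy now!</div></body></html>")

if __name__ == "__main__":
    print(main_text([PAGE_A, PAGE_B]))
    # [['Storm hits the coast.'], ['Local team wins final.']]
```

Unlike a hand-written regular expression, this kind of comparison needs no per-site pattern, although a real analyzer has to cope with far messier layouts than this exact-match toy does.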

