Webstemmer: Free and open source web crawler and HTML layout analyzer
Webstemmer Ranking & Summary
- License: Freeware
- Price: FREE
- Publisher Name: Yusuke Shinyama
- Publisher web site: http://www.unixuser.org/~euske/
- Operating Systems: Mac OS X
- File Size: 317 KB
Webstemmer Description
Webstemmer is a free and open source web crawler and HTML layout analyzer. It automatically extracts the main text of a news site, without banners, ads, or navigation links mixed in.

Extracting text from web sites (especially news sites) usually yields a lot of unwanted material such as ads and banners. You could craft regular expression patterns to pick out only the desired parts, but constructing such patterns is often tricky and time-consuming. Furthermore, some patterns need to be aware of the surrounding context, and some news sites even use several different layouts.

Webstemmer analyzes the layout of each page in a given web site and figures out where the main text is located. The analysis runs automatically with little human intervention: you only need to give the URL of the site's top page.

Requirements:
- Python

What's New in This Release:
- setup.py added
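The idea of locating main text by comparing page layouts can be illustrated with a small sketch. This is not Webstemmer's code, API, or actual algorithm. It is a simplified heuristic that assumes pages from the same site share a layout, that boilerplate (navigation, ads) repeats verbatim across pages, and that the main text is the longest block that changes from page to page. The names `BlockCollector`, `text_blocks`, and `find_main_text_path` are hypothetical, and the sample pages are made up for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockCollector(HTMLParser):
    """Collects visible text, keyed by the tag path it appears under."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not {"script", "style"} & set(self.stack):
            self.blocks["/".join(self.stack)].append(text)


def text_blocks(html):
    """Map each tag path (e.g. 'html/body/p') to the text found under it."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return {path: " ".join(parts) for path, parts in parser.blocks.items()}


def find_main_text_path(pages):
    """Guess the main-text tag path from two or more same-layout pages."""
    per_page = [text_blocks(page) for page in pages]
    common = set(per_page[0]).intersection(*per_page[1:])
    best_path, best_score = None, 0.0
    for path in common:
        texts = [blocks[path] for blocks in per_page]
        if len(set(texts)) == 1:
            continue  # identical on every page: assumed to be boilerplate
        score = sum(len(t) for t in texts) / len(texts)  # average length
        if score > best_score:
            best_path, best_score = path, score
    return best_path


if __name__ == "__main__":
    page_a = """<html><body>
    <div><a href="/">Home</a> <a href="/world">World</a></div>
    <h1>Storm hits coast</h1>
    <p>Heavy rain and strong winds reached the coast on Monday evening.</p>
    <span>Subscribe to our newsletter!</span>
    </body></html>"""
    page_b = """<html><body>
    <div><a href="/">Home</a> <a href="/world">World</a></div>
    <h1>Election results in</h1>
    <p>Officials announced the final vote count early on Tuesday morning.</p>
    <span>Subscribe to our newsletter!</span>
    </body></html>"""
    path = find_main_text_path([page_a, page_b])
    print(path)
    print(text_blocks(page_a)[path])
```

Keying blocks by tag path stands in for a real layout analysis. It breaks down on sites that use several different layouts, which is one of the cases the description above says Webstemmer is designed to handle.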