Web Crawler (aka Spider)

A crawler is a program that fetches a page and then follows the links on that page, repeating the process for every new page it finds. Search engines use crawlers to index all the pages of a website, starting from only the first page. A page is found only if it can be reached through links.
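The fetch-and-follow loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation discussed on this page: it is written in Python and uses the standard-library html.parser module rather than the HTMLParser library. The names LinkExtractor, crawl, fetch, and SITE are hypothetical, and the "site" is an in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl; returns URLs in the order they were visited.

    `fetch` is a hypothetical callable mapping a URL to its HTML text,
    or to None when the page does not exist.
    """
    visited = []
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A made-up site; note the unclosed tags on some pages.
SITE = {
    "http://example.test/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<p><a href="/">home</a> <a href="c.html">C',
    "http://example.test/b.html": '<a href="a.html">A again</a>',
    "http://example.test/c.html": "<p>no links here",
    "http://example.test/orphan.html": "<p>never linked",
}

order = crawl("http://example.test/", SITE.get)
print(order)
```

The page orphan.html is never visited because no other page links to it, which is the "as long as it is linked" condition in practice. A real crawler would replace SITE.get with an HTTP fetch and would normally also restrict itself to one host and honor robots.txt.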

Several crawlers are available, but few of them are good-quality open-source crawlers. The problem is that a crawler is only as robust as the parser it uses: if the parser cannot cope with malformed markup, the crawler fails. With HTMLParser, it is possible to crawl through dirty HTML, and to do so at great speed.

There are two types of crawlers: