Web Crawler (aka Spider)
A crawler is a program that fetches a page and follows the links on it. Search engines use crawlers to index all the pages of a website starting from only its first page, provided every page is reachable through links.
Several crawlers exist, but few open-source ones are of good quality. Most crawlers fail when the parser they rely on cannot cope with malformed markup. With HTMLParser, it is possible to crawl through dirty HTML at great speed.
There are two types of crawlers:
Breadth First crawlers use the BFS (Breadth-First Search) algorithm. Here's a brief description:
Get all links from the starting page and add them to the queue.
Pick the first link from the queue, get all links on that page, and add them to the queue.
Repeat Step 2 until the queue is empty.
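The BFS steps above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the "site" is an in-memory map from page to outgoing links, so a map lookup stands in for downloading and parsing a page, and the class name BfsCrawlSketch is made up for this example. A set of already-queued pages is added so that no page is fetched twice, which the description above leaves implicit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch class, not part of HTMLParser.
final class BfsCrawlSketch {

    /** Returns the pages in the order a breadth-first crawl would visit them. */
    static List<String> crawl(String start, Map<String, List<String>> links) {
        // Pages already queued; insertion order equals visiting order in BFS.
        Set<String> seen = new LinkedHashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            // Assumption: looking the page up in the map replaces fetch + parse.
            for (String link : links.getOrDefault(page, List.of())) {
                if (seen.add(link)) {   // true only the first time a link is seen
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/c", "/"),
            "/b", List.of("/c"),
            "/c", List.of());
        System.out.println(crawl("/", site));   // prints [/, /a, /b, /c]
    }
}
```

Both pages linked from the start page are visited before the deeper page "/c", which is the defining property of the breadth-first order.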
Depth First crawlers use the DFS (Depth-First Search) algorithm. Here's a brief description:
Get the first link that has not yet been visited from the starting page.
Visit the link and get its first non-visited link.
Repeat Step 2 until there are no further non-visited links.
Go back to the next non-visited link in the previous level of recursion and repeat Step 2, until no non-visited links remain.
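For comparison, here is a minimal recursive sketch of the DFS steps, again over an in-memory map of links rather than real pages. The class name DfsCrawlSketch, the depth limit, and the visited set are assumptions made for this illustration and are not taken from Robot.java.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch class, not part of HTMLParser.
final class DfsCrawlSketch {

    /** Returns the pages in the order a depth-limited, depth-first crawl visits them. */
    static List<String> crawl(String start, int maxDepth, Map<String, List<String>> links) {
        List<String> order = new ArrayList<>();
        visit(start, maxDepth, links, new HashSet<>(), order);
        return order;
    }

    private static void visit(String page, int depth, Map<String, List<String>> links,
                              Set<String> visited, List<String> order) {
        if (!visited.add(page)) {
            return;                     // already visited: skip it
        }
        order.add(page);
        if (depth == 0) {
            return;                     // depth budget used up: do not follow links
        }
        // Follow each link fully before moving on to its sibling (the recursion
        // unwinding is the "go back to the previous level" step).
        for (String link : links.getOrDefault(page, List.of())) {
            visit(link, depth - 1, links, visited, order);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/c", "/"),
            "/b", List.of("/c"),
            "/c", List.of());
        System.out.println(crawl("/", 3, site));   // prints [/, /a, /c, /b]
    }
}
```

Here "/c" is reached through "/a" before "/b" is visited at all, whereas a breadth-first crawl of the same map would visit "/b" first.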
BFS crawlers are simple to write. DFS can be slightly more involved, so we present a simple DFS crawler program below. This basic program is included in the org.htmlparser.parserapplications package as Robot.java. Feel free to modify it or add functionality to it.
The method that does the crawling is the recursive method crawl(parser,depth). The crawler creates a new parser for each link it follows and moves through sites using the DFS approach.
Be careful with the depth you give the crawler, because the number of pages visited can grow quickly with each extra level. Studying the time taken to map all the links is itself an interesting research project. A word of caution: some sites don't like crawlers going through them. Such sites usually publish a file called robots.txt in their root directory, which a crawler should read to learn the rules and honor them. The Robot program is only a demonstration. Please note that it only follows links whose URL contains "htm", "com" or "org" (compared case-insensitively). In real-life situations, you'd also want to support dynamic links.
Before you set out to design an open-source or commercial crawler, please study what others have already researched in this area.
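To make the robots.txt advice concrete, here is a deliberately simplified sketch of checking a path against the rules. It is not part of Robot.java, and the class and method names are invented for this example. It assumes the robots.txt content has already been downloaded into a string, honors only Disallow lines in groups addressed to all crawlers (User-agent: *), and ignores Allow lines and wildcard patterns, which a production crawler would need to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch class, not part of HTMLParser.
final class RobotsTxtSketch {

    /** Collects the Disallow path prefixes from groups that apply to every crawler. */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean appliesToUs = false;    // are we inside a "User-agent: *" group?
        boolean lastWasAgent = false;   // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#');                       // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                                      // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                appliesToUs = lastWasAgent ? (appliesToUs || star) : star;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp # scratch\n\n"
                      + "User-agent: otherbot\nDisallow: /\n";
        List<String> prefixes = disallowedPrefixes(robots);
        System.out.println(prefixes);                                // [/private/, /tmp]
        System.out.println(isAllowed("/index.html", prefixes));      // true
        System.out.println(isAllowed("/private/a.html", prefixes));  // false
    }
}
```

A crawler would call isAllowed on the path of each link before creating a parser for it, and skip the link when the result is false.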
Some Useful Links on Crawlers
--SomikRaha, Sunday, February 16, 2003 2:13:46 pm.
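package org.htmlparser.parserapplications;

// Import locations assume the HTMLParser 1.x package layout.
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.DefaultParserFeedback;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.ParserException;
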
public class Robot {
    private Parser parser;

    /**
     * Robot crawler - Provide the starting url
     */
    public Robot(String resourceLocation) {
        try {
            parser = new Parser(resourceLocation, new DefaultParserFeedback());
            parser.registerScanners();
        }
        catch (ParserException e) {
            System.err.println("Error, could not create parser object");
            e.printStackTrace();
        }
    }

    /**
     * Crawl using a given crawl depth.
     * @param crawlDepth Depth of crawling
     */
    public void crawl(int crawlDepth) throws ParserException {
        try {
            crawl(parser, crawlDepth);
        }
        catch (ParserException e) {
            throw new ParserException("ParserException at crawl(" + crawlDepth + ")", e);
        }
    }

    /**
     * Crawl using a given parser object, and a given crawl depth.
     * @param parser Parser object
     * @param crawlDepth Depth of crawling
     */
    public void crawl(Parser parser, int crawlDepth) throws ParserException {
        System.out.println(" crawlDepth = " + crawlDepth);
        for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
            Node node = e.nextNode();
            if (!(node instanceof LinkTag)) {
                continue;   // only link tags are of interest
            }
            LinkTag linkTag = (LinkTag) node;
            if (linkTag.isMailLink()) {
                continue;   // skip mailto: links
            }
            String link = linkTag.getLink();
            String upperLink = link.toUpperCase();
            // Crude filter: follow only links containing "htm", "com" or "org".
            if (upperLink.indexOf("HTM") == -1 &&
                upperLink.indexOf("COM") == -1 &&
                upperLink.indexOf("ORG") == -1) {
                continue;
            }
            if (crawlDepth > 0) {
                // Depth-first step: parse the linked page before the next sibling link.
                Parser newParser = new Parser(link, new DefaultParserFeedback());
                newParser.registerScanners();
                System.out.print("Crawling to " + link);
                crawl(newParser, crawlDepth - 1);
            }
            else {
                // Depth exhausted: just report the link.
                System.out.println(link);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Robot Crawler v" + Parser.VERSION_STRING);
        if (args.length < 2 || args[0].equals("-help")) {
            System.out.println();
            System.out.println("Syntax : java -classpath htmlparser.jar org.htmlparser.parserapplications.Robot <resourceLocn/website> <depth>");
            System.out.println();
            System.out.println(" <resourceLocn> the name of the file to be parsed (with complete path ");
            System.out.println(" if not in current directory)");
            System.out.println(" <depth> Number of link levels to follow from the starting page");
            System.out.println(" -help This screen");
            System.out.println();
            System.out.println("HTML Parser home page : http://htmlparser.sourceforge.net");
            System.out.println();
            System.out.println("Example : java -classpath htmlparser.jar org.htmlparser.parserapplications.Robot http://www.google.com 3");
            System.out.println();
            System.out.println("If you have any doubts, please join the HTMLParser mailing list (user/developer) from the HTML Parser home page instead of mailing any of the contributors directly. You will be surprised with the quality of open source support. ");
            System.exit(-1);
        }
        // Both arguments are guaranteed to be present at this point.
        String resourceLocation = args[0];
        int crawlDepth = Integer.parseInt(args[1]);
        Robot robot = new Robot(resourceLocation);
        System.out.println("Crawling Site " + resourceLocation);
        try {
            robot.crawl(crawlDepth);
        }
        catch (ParserException e) {
            e.printStackTrace();
        }
    }
}