nsitemap - functions for generating a site map for a given site URL
use nsitemap;
use LWP::UserAgent;
my $ua = new LWP::UserAgent;
my $sitemap = new nsitemap(
    EMAIL => 'your@email.address',
    USERAGENT => $ua,
    ROOT => 'http://your.ip.address/'
);
$sitemap->generate();
$sitemap->option( 'VERBOSE' => 1 );
my $len = $sitemap->option( 'SUMMARY_LENGTH' );
my $root = $sitemap->root();
for my $url ( $sitemap->urls() )
{
    if ( $sitemap->is_internal_url( $url ) )
    {
        # do something ...
    }
    my @links = $sitemap->links( $url );
    my $title = $sitemap->title( $url );
    my $summary = $sitemap->summary( $url );
    my $depth = $sitemap->depth( $url );
    my $digest = $sitemap->MD5digest( $url );
}
$sitemap->traverse(
    sub {
        my ( $sitemap, $url, $depth, $flag ) = @_;
        if ( $flag == 0 )
        {
            # do something at the start of a list of sub-pages ...
        }
        elsif( $flag == 1 )
        {
            # do something for each page ...
        }
        elsif( $flag == 2 )
        {
            # do something at the end of a list of sub-pages ...
        }
    }
);
The nsitemap module creates a site map for a WWW site by traversing the
site using the WWW::Robot module. The nsitemap object has a number of methods
to access a list of all the URLs in the site; a list of all the links for each
URL; page titles; page summaries; page fingerprints (MD5digest); and the depth,
or minimum number of links from the root URL to a page.
my $sitemap = new nsitemap(
    EMAIL => 'your@email.address',
    USERAGENT => new LWP::UserAgent,
    ROOT => 'http://www.my.com/'
);
Possible options are:
Method for generating the site map, based on the constructor options.
$site->generate();
Interface to get / set options after object construction.
$site->option( 'VERBOSE' => 1 );
my $len = $site->option( 'SUMMARY_LENGTH' );
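A combined get / set accessor of this kind is a common Perl idiom. The following is only a minimal sketch of that idiom, not the module's own code: the package name My::Options and the SUMMARY_LENGTH value of 200 are hypothetical and chosen purely for illustration.

```perl
use strict;
use warnings;

package My::Options;    # hypothetical class, for illustration only

sub new {
    my ( $class, %opts ) = @_;
    return bless { %opts }, $class;
}

# With two arguments, store the value; in both cases return the current value.
sub option {
    my ( $self, $name, $value ) = @_;
    $self->{ $name } = $value if @_ > 2;
    return $self->{ $name };
}

package main;

my $obj = My::Options->new( SUMMARY_LENGTH => 200 );    # 200 is an assumed value
$obj->option( 'VERBOSE' => 1 );                         # set
my $len = $obj->option( 'SUMMARY_LENGTH' );             # get
print "VERBOSE=", $obj->option( 'VERBOSE' ), " SUMMARY_LENGTH=$len\n";
```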
Returns the root URL for the site.
my $root = $site->root();
Returns a list of all the URLs on the site map.
my @urls = $site->urls();
Returns 1 (one) if $url is an internal URL, as determined by the ROOT value; otherwise returns 0 (zero).
if ( $site->is_internal_url( $url ) )
{
    # do something ...
}
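One plausible way to decide whether a URL is internal is a simple prefix comparison against the root URL. The sketch below assumes exactly that; the helper name looks_internal is hypothetical, and the module's actual test may differ (for example, it might normalise URLs first).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of nsitemap: a URL counts as internal
# when it begins with the root URL string.
sub looks_internal {
    my ( $root, $url ) = @_;
    return index( $url, $root ) == 0 ? 1 : 0;
}

my $root = 'http://www.my.com/';
for my $url ( 'http://www.my.com/docs/index.html', 'http://www.other.com/' )
{
    my $kind = looks_internal( $root, $url ) ? 'internal' : 'external';
    print "$kind $url\n";
}
```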
Returns a list of all the links from a given URL in the site map.
my @links = $site->links( $url );
Returns the title of the URL, taken from the page's TITLE tag.
my $title = $site->title( $url );
Returns the MD5_hex (fingerprint) of the URL.
my $fingerprint = $site->MD5digest( $url );
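A fingerprint of this kind is handy for spotting pages that are reachable under several URLs. The sketch below uses the core Digest::MD5 module directly and assumes the fingerprint is computed over the page content; the %pages data and the fingerprint helper are hypothetical, not part of nsitemap.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Hypothetical helper: fingerprint a page body as a 32-character hex string.
sub fingerprint {
    my ( $content ) = @_;
    return md5_hex( $content );
}

# Made-up page contents keyed by URL, for illustration only.
my %pages = (
    '/index.html' => '<html><title>Home</title></html>',
    '/home.html'  => '<html><title>Home</title></html>',
    '/about.html' => '<html><title>About</title></html>',
);

# Group URLs by fingerprint so that identical pages end up together.
my %by_digest;
for my $url ( sort keys %pages )
{
    push @{ $by_digest{ fingerprint( $pages{ $url } ) } }, $url;
}

for my $digest ( sort keys %by_digest )
{
    my @urls = @{ $by_digest{ $digest } };
    print "duplicate content: @urls\n" if @urls > 1;
}
```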
Returns a summary of the URL, generated using HTML::Summary. If the page has a META tag with NAME='description', the value of its CONTENT attribute is returned; otherwise the module attempts to summarize the page text.
my $summary = $site->summary( $url );
Returns the minimum number of links to traverse from the root URL of the site to this URL. The root URL is at depth zero.
my $depth = $sitemap->depth( $url );
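The minimum number of links from the root to each page is what a breadth-first search over the link graph produces. The sketch below illustrates that idea on a hypothetical in-memory link table; the %links data and the min_depths helper are assumptions for illustration and say nothing about how nsitemap computes depth internally.

```perl
use strict;
use warnings;

# Hypothetical link graph: each URL maps to the URLs it links to.
my %links = (
    '/'       => [ '/a.html', '/b.html' ],
    '/a.html' => [ '/c.html' ],
    '/b.html' => [ '/c.html', '/' ],
    '/c.html' => [ '/d.html' ],
    '/d.html' => [],
);

# Breadth-first search: the first time a URL is reached is via a shortest path.
sub min_depths {
    my ( $root, $links ) = @_;
    my %depth = ( $root => 0 );
    my @queue = ( $root );
    while ( @queue )
    {
        my $url = shift @queue;
        for my $next ( @{ $links->{ $url } || [] } )
        {
            next if exists $depth{ $next };    # already reached by a shorter path
            $depth{ $next } = $depth{ $url } + 1;
            push @queue, $next;
        }
    }
    return %depth;
}

my %depth = min_depths( '/', \%links );
print "$_ => $depth{ $_ }\n" for sort keys %depth;
```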
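The traverse method walks the site map, starting at the root node (specified by -url), and visits each URL in the order that they would be displayed in a sequential site map of the site. The callback receives the arguments ( $sitemap, $url, $depth, $flag ) and is called at a number of places in the traversal, as indicated by the $flag argument:
$flag == 0: at the start of a list of sub-pages
$flag == 1: for each page
$flag == 2: at the end of a list of sub-pages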
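To show how the three flag values fit together, here is a self-contained sketch that walks a small hypothetical tree and emits a nested HTML list. The walk function and the %children data are stand-ins invented for illustration; they are not the module's traverse method, and the exact order in which nsitemap issues its callbacks may differ.

```perl
use strict;
use warnings;

# Hypothetical site structure: each URL maps to its sub-pages.
my %children = (
    '/'       => [ '/a.html', '/b.html' ],
    '/a.html' => [ '/a1.html' ],
);

# Stand-in for traverse: flag 1 for each page, flag 0 before its
# sub-pages and flag 2 after them.
sub walk {
    my ( $tree, $url, $depth, $callback ) = @_;
    $callback->( $tree, $url, $depth, 1 );
    my @kids = @{ $tree->{ $url } || [] };
    return unless @kids;
    $callback->( $tree, $url, $depth, 0 );
    walk( $tree, $_, $depth + 1, $callback ) for @kids;
    $callback->( $tree, $url, $depth, 2 );
}

my @out;
walk(
    \%children, '/', 0,
    sub {
        my ( $tree, $url, $depth, $flag ) = @_;
        my $pad = '  ' x $depth;
        if    ( $flag == 0 ) { push @out, "$pad<ul>" }
        elsif ( $flag == 1 ) { push @out, "$pad<li>$url" }
        elsif ( $flag == 2 ) { push @out, "$pad</ul>" }
    }
);
print "$_\n" for @out;
```

With the real module, the same callback body could be passed to traverse, with the sitemap object arriving as the first argument in place of the tree.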
LWP::UserAgent
HTML::Summary
WWW::Robot
Steve Horsburgh <shorsburgh@horsburgh.com>
This utility was inspired by the 1997 Sitemap.pm utility by Ave Wrigley <wrigley@cre.canon.co.uk>
Copyright (c) 2000, Horsburgh.com. All rights reserved.
This script is free software; you can redistribute it and/or modify it under GNU GPL. (See the file COPYING)