You can learn more about and experiment with Harvest starting with the Harvest Home Page (which includes lots of information and demos).
You can also retrieve the Harvest software.
A comprehensive User's Manual is also available.
Users seeking assistance should consult the Harvest newsgroup comp.infosystems.harvest and the newsgroup archive broker.
We will gladly accept well-documented bug reports.
Anyone with a World Wide Web client (e.g., NCSA Mosaic) can access and use Harvest servers. World Wide Web clients are available for most platforms, including DOS, Windows, OS/2, Macintosh, and UNIX/X-Windows. Most of these clients will work over any high-speed modem (e.g., 9600 baud or better). The World-Wide Web Consortium maintains a list of WWW clients.
A Unix system is required to run the Harvest servers. We support, and provide executables for, SunOS 4.1.3, Solaris 2.4, and OSF/1 3.0. The code is also known to compile and operate, with perhaps a few adjustments, on IRIX 5.x, HP-UX 9.05, and FreeBSD 2.0. With a little more work it will even compile on AIX and Linux. Unsupported binary ports are available in our FTP contrib directory.
Compiling Harvest requires GNU gcc 2.5.8, bison 1.22, and flex 2.4.7 (or later versions). Running Harvest requires Perl 4.0 or 5.0 and the GNU gzip compression program. All of these are available on the GNU FTP server.
Note that due to limited resources we are only able to support OSF/1, SunOS, and Solaris operating systems. We always welcome suggestions, patches, and binary distributions for other platforms via email. For more information on the unsupported platforms please see our notes on porting.
The Harvest Server Registry provides information on available Harvest servers (including instances of Gatherers, Brokers, Object Caches, and Replication Managers). You can register your Harvest server with the HSR via a forms interface.
There are several efforts in progress to re-implement some of the above systems on top of Harvest, to provide better performance and a simpler implementation than their original implementations provided.
The main downside to Harvest is that it is more complex: there is more software, and there are more choices about how to distribute things, what to customize, etc. However, you can make basic use of Harvest without understanding the complexities, by following the installation instructions in ftp://ftp.cs.colorado.edu/pub/distribs/harvest/INSTRUCTIONS.html. These instructions will walk you through the steps needed to install and make basic use of Harvest. A site with httpd already installed should be able to get Harvest installed and running in about 30 minutes, using one of the binary distributions of the software.
Here's a brief synopsis of our plans, in rough priority-order:
The easiest way to do this is to use the CreateBroker
program. CreateBroker will ask for a collection
point. Use the host and port on which your Gatherer is running
for the collection point. The Broker will then collect the
indexing information from your Gatherer.
See the User's Manual for more information.
If you are new to Harvest, you should use the RunHarvest
command to create and run both the Broker and the Gatherer.
If you suspect that your gatherer failed for some reason, here are some steps you can take to figure out what went wrong.
gdbmutil print PRODUCTION.gdbm | more
You can view only the URLs in the gatherer database with
gdbmutil keys PRODUCTION.gdbm | more
The ``gdbmutil'' program is located in $HARVEST_HOME/lib/gatherer and may not be on your PATH.
ps -ax | grep gatherd
Also try connecting to the gatherer's TCP port. For example:
gather localhost 8500 | more
The output from this command should be the same as for ``gdbmutil print'' above.
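Before running the gather client, it can help to confirm that anything at all is listening on the gatherer's port. The following is a minimal sketch in Python, not part of the Harvest distribution; the function name port_is_open is our own, and 8500 is simply the port from the example above. It only tests whether a TCP connection succeeds and does not speak the gatherer protocol.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", "Host is unreachable", and timeouts.
        return False


if __name__ == "__main__":
    # 8500 is the example port used above; change it to match your gatherer.
    print("gatherd reachable:", port_is_open("localhost", 8500))
```

If this reports False while ``ps'' shows a gatherd process, the gatherer is probably listening on a different port than the one you tried.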
When experimenting with the gatherer, it is usually a good idea to start it fresh each time before running the RunGatherer script. The following command will remove all files from previous gathering runs:
rm -rf log.* data tmp
Beginning with release 1.2 of Harvest, HTML files are summarized with an SGML parser and the HTML Document Type Definition (DTD). The SGML parser is much stricter than WWW browsers have traditionally been. HTML that appears correct and good-looking in Mosaic may in fact be written incorrectly.
When the SGML parser encounters an error, it tends to generate a large number of warnings and/or errors. Because a large percentage of HTML pages on the WWW contain errors, we have turned off error messages from the SGML parser by default. If you are seeing SOIF summaries of HTML files which appear to be too small or incomplete, then there may be an error in the HTML which you do not know about. To turn on the error messages from the SGML parser, edit $HARVEST_HOME/lib/gatherer/SGML.sum, look for the string ``syntax_check'' and make that line read
$syntax_check = 1;
Then re-run your gatherer and look for messages from ``sgmls'' in the log.errors file. A good reference for the HTML 2.0 DTD is at http://www.oac.uci.edu/indiv/ehood/html2.0/DTD-HOME.html
If you would prefer to not deal with SGML and the HTML DTD, you may use our older, less-strict HTML summarizer. Simply change your HTML.sum shell script to
#!/bin/sh
exec HTML-lax.sum $*
Harvest 1.2 assigned relatively long time-to-live values to the objects in its databases. A number of steps have been taken in version 1.3 to keep databases more up-to-date:
Time-To-Live: 604800
Keep in mind that objects are removed from Harvest databases only when they time out. So, for example, just removing the gatherer database is not sufficient to remove objects from the broker.
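To get a feel for the Time-To-Live value shown above, the short Python sketch below converts it to days and computes an expiry time. It assumes the value is expressed in seconds; the helper name expires_at and the sample timestamp are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta

# Value from the Time-To-Live line above; the unit is assumed to be seconds.
TTL_SECONDS = 604800


def expires_at(last_update: datetime, ttl_seconds: int) -> datetime:
    """Return the moment an object would time out (hypothetical helper)."""
    return last_update + timedelta(seconds=ttl_seconds)


ttl = timedelta(seconds=TTL_SECONDS)
print("TTL in days:", ttl.days)

# Sample timestamp chosen only for illustration.
print("Expires:", expires_at(datetime(1995, 9, 6, 23, 11, 2), TTL_SECONDS))
```

Under that assumption, 604800 corresponds to exactly one week.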
By default, the gatherer will only retrieve objects from the host specified in a rootnode URL. A host-filter file can be used to let the gatherer wander away from the rootnode host while still controlling where it goes. For example, to keep the gatherer on ``colorado.edu'' machines, we use
allow \.colorado\.edu
deny .*
If this file is called cu-host-filter, then it is specified with the rootnode URL as
http://www.colorado.edu/ Host=300,cu-host-filter
It is also possible to limit the gatherer by IP address:
allow 128\.138\..*
deny .*
Note that the IP address must still be written as a regular expression so the ``dots'' are escaped with a backslash.
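The two host-filter examples above can be tried out with a small Python sketch. This is not the gatherer's own code. It assumes that rules are checked in order, that the first matching rule decides, that patterns are matched anywhere in the string (unanchored), and that a host matching no rule is denied; none of these details is stated here, so treat them as assumptions. The name host_allowed is ours.

```python
import re

# Rules mirror the two example filter files above: (action, regular expression).
NAME_RULES = [("allow", r"\.colorado\.edu"), ("deny", r".*")]
ADDR_RULES = [("allow", r"128\.138\..*"), ("deny", r".*")]


def host_allowed(rules, host):
    """Apply rules in order; the first matching rule decides (assumed)."""
    for action, pattern in rules:
        if re.search(pattern, host):  # unanchored match (assumed)
            return action == "allow"
    return False  # assumption: deny when no rule matches


for host in ("www.cs.colorado.edu", "www.ncsa.uiuc.edu"):
    print(host, host_allowed(NAME_RULES, host))

for addr in ("128.138.243.151", "192.0.2.1"):
    print(addr, host_allowed(ADDR_RULES, addr))
```

If matching is indeed unanchored, a pattern such as 128\.138\..* would also match an address like 10.128.138.7, so beginning the pattern with ^ may be the safer choice.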
Harvest 1.3 includes a default URL filter to avoid retrieving non-textual objects such as images, sounds and movies. The default URL filter file is $HARVEST_HOME/lib/gatherer/URL-filter-default and looks something like:
deny \.gif$
deny \.GIF$
deny \.jpg$
deny \.JPG$
deny \.mpg$
deny \.mpeg$
deny \.mov$
deny \.au$
deny \.snd$
allow .*
The URL-filter-default file can be used as a template to create your own filter. It might be specified on the rootnode URL line as
http://www.ncsa.uiuc.edu/ URL=500,my.filter
NOTE: do not include access protocols or hostnames in the URL filter specifications. The URL filter only checks the pathname part of a URL. It is incorrect to write
allow ^http://.*\.colorado\.edu/.*
deny .*
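To see why a rule containing the protocol and hostname cannot work, the Python sketch below applies filter rules to the pathname part of a URL only, as described above. It is an illustration rather than the gatherer's implementation: the functions parse_rules and url_allowed are our own names, and first-match-wins ordering is an assumption.

```python
import re
from urllib.parse import urlsplit

# A shortened copy of the default filter shown above.
DEFAULT_FILTER = r"""
deny \.gif$
deny \.GIF$
deny \.jpg$
allow .*
"""


def parse_rules(text):
    """Turn 'allow REGEX' / 'deny REGEX' lines into (action, compiled) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        action, pattern = line.split(None, 1)
        rules.append((action, re.compile(pattern)))
    return rules


def url_allowed(rules, url):
    """Check only the pathname part of the URL; first match decides (assumed)."""
    path = urlsplit(url).path or "/"
    for action, pattern in rules:
        if pattern.search(path):
            return action == "allow"
    return False  # assumption: deny when no rule matches


rules = parse_rules(DEFAULT_FILTER)
print(url_allowed(rules, "http://www.colorado.edu/index.html"))
print(url_allowed(rules, "http://www.colorado.edu/logo.gif"))

# The incorrect filter from above: the path "/index.html" never starts with
# "http://", so the allow rule cannot match and "deny .*" rejects everything.
wrong = parse_rules(r"allow ^http://.*\.colorado\.edu/.*" + "\n" + "deny .*")
print(url_allowed(wrong, "http://www.colorado.edu/index.html"))
```

Restricting the gatherer to particular hosts is the job of the host filter, not the URL filter.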
In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either
To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup(8) command.
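If nslookup is not at hand, a quick resolution check can also be written in a few lines of Python. This is only a sketch; the function name resolve is our own, and it uses whatever name services the system resolver is configured for (DNS, NIS, or /etc/hosts), so a success does not prove that DNS specifically is working.

```python
import socket
import sys


def resolve(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return None


if __name__ == "__main__":
    # Usage: python resolve.py www.colorado.edu ftp.dec.com
    for name in sys.argv[1:]:
        address = resolve(name)
        print(name, "->", address if address else "Unknown Host")
```

Try it with a fully qualified name from outside your organization, since that is the case the gatherer depends on.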
The Harvest executables for SunOS (4.1.3_U1) are statically linked with the stock resolver library from /usr/lib/libresolv.a. If you seem to have problems with the statically linked executables, please try to compile Harvest from the source code. This will make use of your local libraries which may have been modified for your particular organization.
Some sites may use NIS instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (ypwhich(1)) must be configured to query DNS servers for hostnames they do not know about. See the -b option in ypxfr(8).
We would welcome reports of Harvest successfully working with NIS. Please write to: harvest-dvl@cs.colorado.edu.
Harvest can now retrieve HTTP objects through a proxy server. Gopher and FTP objects cannot be retrieved across a strict firewall.
If you see the ``Host is unreachable'' message, these are the likely problems:
If you see the ``Connection refused'' message, this is the likely problem:
The Harvest gatherer is essentially a WWW client. You should expect it to work the same as Mosaic. We would be very interested to hear about problems with Harvest and hostnames under the following condition:
When debugging a broker, it is a good idea to ``clean out'' the broker before running it. You should kill the broker process and then issue
make clean
If you see
Finished collection - received 0 objects
then there is a good chance your gatherer contains no objects. Go to the
Configuring your httpd server
Also make sure that the userid which runs the HTTP server has read access to the query.html file.
Problems with BrokerQuery.pl.cgi are usually related to Perl. Be sure that
Try running BrokerQuery.pl.cgi from the Unix shell. You should see
Content-Type: text/html
No query information to decode.
If you do not see this, then something is probably wrong with your
Perl installation. Perhaps it was compiled for an earlier operating
system release. Seriously consider installing the most recent version.
On Solaris and other SVR4 machines you may see
socket: protocol not supported
This problem is described below.
dumpregistry -count
will tell you how many objects are in the broker.
brkclient localhost 8501 '#USER #End foo'
When using Glimpse, you can bypass the broker altogether and directly search the Glimpse indexes:
glimpse -a -H $HARVEST_HOME/brokers/test -y -i -w 'foo'
Yes. Typically, the broker and httpd run on the same machine, but the broker can run on a different machine from httpd. If you want users to be able to view the Broker's object files (the content summaries), then the broker's files will need to be accessible to httpd. You can NFS mount those files, or manually copy them over periodically.
The Broker uses the GNU bison and flex programs (or yacc and lex) to build the grammar for the Query Manager. If you have problems compiling the Broker, then verify that you have flex v2.4.7 and bison v1.22. You can get them here:
ftp://ftp.gnu.ai.mit.edu/pub/gnu/bison-1.22.tar.gz
ftp://ftp.gnu.ai.mit.edu/pub/gnu/flex-2.4.7.tar.gz
Here's an example of what a compile problem might look like:
Making all in broker
bison -y -d query.y
flex query.l
gcc -I../common/include -I. -target sun4 -c lex.yy.c
In file included from broker.h:44, from lex.yy.c:45:
/usr/include/stdlib.h:27: conflicting types for `free'
lex.yy.c:38: previous declaration of `free'
/usr/include/stdlib.h:29: conflicting types for `malloc'
lex.yy.c:37: previous declaration of `malloc'
*** Error code 1
make: Fatal error: Command failed for target `lex.yy.o'
On Solaris and other SVR4 machines, Perl scripts which use TCP sockets may produce this error. It arises because the ``socket.ph'' which gets included is incorrect. Our installation scripts try to produce the correct socket.ph, but may fail.
Harvest 1.3 includes a script which will test your Perl and socket.ph installation. Simply run
test-socket-ph.pl
If you don't see the message
Perl and socket.ph tests okay.
then you need to figure out what is wrong. It is either
h2ph - < /usr/include/sys/socket.h > $HARVEST_HOME/lib/socket.ph
By default, Glimpse does not index numbers. If you need to search for numbers, add this line to your admin/broker.conf file:
GlimpseIndex-Flags -n
cached relies on the ftpget.pl program to retrieve FTP files and directories. Verify that ftpget.pl is in your path when you execute cached, or that cache_ftp_program is correctly set in your cached.conf file. You can verify that ftpget.pl works by running:
ftpget.pl - ftp.dec.com / I anonymous harvest@
It is possible to specify a CERN (or other) proxy cache as a parent to a Harvest cache. The ASCII port should be where the parent listens for HTTP proxy requests. The UDP port should be the echo port (usually 7) of the parent machine. The TCP port should be zero.
For example, if a CERN cache at bigcache.world.net:8080 will act as a parent, it would be specified in cached.conf as
cache_host bigcache.world.net parent 8080 7
The cache now has the ability to limit client requests based on IP addresses. For example:
access_allow 128.125.51.0
access_allow 128.126.0.0
access_deny all
It is not (currently) possible to limit access based on domain names.
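The access list above can be explored with the following Python sketch. It is not taken from the cached source. It assumes that trailing zero octets in an address act as a wildcard for the whole network (so 128.126.0.0 covers every 128.126.x.y client), that rules are checked in order with the first match deciding, and that a client matching no rule is denied. The function names are hypothetical.

```python
import ipaddress

# (directive, argument) pairs copied from the example above.
RULES = [
    ("access_allow", "128.125.51.0"),
    ("access_allow", "128.126.0.0"),
    ("access_deny", "all"),
]


def rule_network(addr):
    """Map a rule argument to a network; trailing zero octets widen it (assumed)."""
    if addr == "all":
        return ipaddress.ip_network("0.0.0.0/0")
    octets = addr.split(".")
    prefix = 32
    # Drop 8 bits of prefix for each trailing octet that is zero.
    while prefix > 0 and octets[prefix // 8 - 1] == "0":
        prefix -= 8
    return ipaddress.ip_network(f"{addr}/{prefix}")


def client_allowed(rules, client_ip):
    """First matching rule decides (assumed); deny if nothing matches (assumed)."""
    client = ipaddress.ip_address(client_ip)
    for directive, addr in rules:
        if client in rule_network(addr):
            return directive == "access_allow"
    return False


for ip in ("128.125.51.17", "128.126.200.3", "128.125.52.1"):
    print(ip, client_allowed(RULES, ip))
```

Check the behaviour against your own cached before relying on this reading of the rules.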
Last Updated: $Date: 1995/09/06 23:11:02 $