Frequently Asked Questions (and Answers) about Harvest


Table of Contents

General

Gatherer

Broker

Cache


GENERAL


1.1 What is Harvest?

Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information across the Internet. With modest effort users can tailor Harvest to digest information in many different formats, and offer custom search services on the Internet. Moreover, Harvest makes very efficient use of network traffic, remote servers, and disk space.

1.2 Where can I get more information about Harvest?

You can learn more about and experiment with Harvest starting with the Harvest Home Page (which includes lots of information and demos).

You can also retrieve the Harvest software.

A comprehensive User's Manual is also available.

Users seeking assistance should consult the Harvest newsgroup comp.infosystems.harvest and the newsgroup archive broker.

We will gladly accept well-documented bug reports.


1.3 On which platforms does Harvest run?

Anyone with a World Wide Web client (e.g., NCSA Mosaic) can access and use Harvest servers. World Wide Web clients are available for most platforms, including DOS, Windows, OS/2, Macintosh, and UNIX/X-Windows. Most of these clients will work over any high-speed modem (e.g., 9600 baud or better). The World-Wide Web Consortium maintains a list of WWW clients.

A Unix system is required to run the Harvest servers. We support and provide executables for SunOS 4.1.3, Solaris 2.4, and OSF/1 3.0. The code is also known to compile and operate, with perhaps a few adjustments, on IRIX 5.x, HP-UX 9.05, and FreeBSD 2.0. With a little more work it will even compile on AIX and Linux. Unsupported binary ports are available in our FTP contrib directory.

Compiling Harvest requires GNU gcc 2.5.8, bison 1.22, and flex 2.4.7 (or later versions). Running Harvest requires Perl 4.0 or 5.0 and the GNU gzip compression program. All of these are available on the GNU FTP server.

Note that due to limited resources we are only able to support OSF/1, SunOS, and Solaris operating systems. We always welcome suggestions, patches, and binary distributions for other platforms via email. For more information on the unsupported platforms please see our notes on porting.


1.4 What is the Harvest Server Registry (HSR)?

The Harvest Server Registry provides information on available Harvest servers (including instances of Gatherers, Brokers, Object Caches, and Replication Managers). You can register your Harvest server with the HSR via a forms interface.


1.5 How does Harvest compare to related efforts such as WAIS, Archie, Veronica, GILS, WHOIS++, Web robots, etc.?

  1. Harvest provides a more scalable architecture than any of these systems, in terms of network bandwidth, server load, and disk space. In comparison with previous systems, our measurements indicate that Harvest can reduce FTP/HTTP/Gopher server load by a factor of 4 while extracting indexing information and by a factor of 6,600 while delivering this information to remote indexers; network traffic by a factor of 59; and index space requirements by a factor of 43. More details are available here.

  2. Harvest defines a structured indexing format called the Summary Object Interchange Format (SOIF) that permits structured queries (e.g., matching keywords only against author or title lines in documents). This format is more powerful than the Internet Anonymous FTP Archives IETF Working Group (IAFA) format, since SOIF permits streams of object summaries (IAFA templates hold only individual items) which in turn permit a very efficient Broker/Gatherer stream retrieval protocol, and it allows arbitrary data within the fields. SOIF's support for arbitrary data means it can be used for more complex search applications, such as image and audio searching.

  3. Harvest provides an automated means of populating indexes with structured, high-quality indexing information. WHOIS++ depends on site administrators manually filling out IAFA templates to be indexed, while GILS does not define how index data are collected. Harvest can be used to provide the indexing data for these systems.

  4. FreeWAIS supports only AND/OR queries against potentially structured fields. Harvest uses Glimpse as the default engine, which supports AND/OR queries, approximate searches, regular expressions, case insensitive (or sensitive) searches, the ability to match parts of words, whole words, or multiple word phrases, variable granularity result sets, and other features.

  5. Harvest provides a flexible index/search interface that lets you plug in many search engines, including original (Thinking Machines) WAIS, Commercial WAIS, freeWAIS, Glimpse, and Nebula. Therefore, Harvest can allow users to benefit from the strengths of each engine (and can help prevent users from getting "locked" into one engine). We are currently working with several commercial search and retrieval vendors to integrate support for their engines into Harvest. Doing so involves writing eleven "C" routines, most of which perform simple bookkeeping operations.

  6. FreeWAIS and other indexers support only particular indexing arrangements (full text in the case of WAIS, anchors + HTML pointers in the case of many of the Web robots). These systems deal with particular data formats using a set of hard-coded content extractors, for example a particular piece of "C" code to extract content from bibtex. In contrast, Harvest allows users to customize what information is extracted and indexed, often using standard UNIX programs (like sed) or easily written regular expressions. (freeWAIS-sf also uses a mechanism that avoids hard-coded C content extractors.) In addition, Harvest provides better default summarizers in many cases (e.g., our PostScript extractor performs better than most, because it treats PostScript differently depending on whether it was generated by troff, TeX, WordPerfect, etc.). Note that Harvest does support full text indexing for users who need that functionality, although our measurements indicate that Harvest's use of customized indexing achieves precision and recall comparable to that of WAIS, at only 3-11% of the space requirements.

  7. Overall, the expressive power of Harvest subsumes the other existing distributed indexing systems. As a brief indication, the following list shows how one can use Harvest to implement some of the other well-known indexing systems:

    Archie, Veronica, WWWW, etc.:
    Gatherer configuration + Essence extraction script.
    Content Router:
    WAIS enumerator + Essence extraction script.
    WAIS:
    Essence full text extraction + ranking search engine.
    WebCrawler:
    Gatherer configuration + Essence full text extraction.
    WHOIS++:
    Gatherer configuration + Essence extraction script for centroids. Query front-end to search SOIF records.

    There are several efforts in progress to re-implement some of the above systems on top of Harvest, to provide better performance and a simpler implementation than their original implementations provided.

The main downside to Harvest is that it is more complex: there is more software, and there are more choices about how to distribute things, what to customize, etc. However, you can make basic use of Harvest without understanding the complexities, by following the installation instructions in ftp://ftp.cs.colorado.edu/pub/distribs/harvest/INSTRUCTIONS.html. These instructions will walk you through the steps needed to install and make basic use of Harvest. A site with httpd already installed should be able to get Harvest installed and running in about 30 minutes, using one of the binary distributions of the software.


1.6 What future plans do you have for Harvest?

Here's a brief synopsis of our plans, in rough priority-order:


The GATHERER


2.1 I have a Gatherer running, now how do I run a Broker that uses it?

The easiest way to do this is to use the CreateBroker program. CreateBroker will ask for a collection point. Use the host and port on which your Gatherer is running for the collection point. The Broker will then collect the indexing information from your Gatherer.

See the User's Manual for more information.

If you are new to Harvest, you should use the RunHarvest command to create and run both the Broker and the Gatherer.


2.2 General Debugging

If you suspect that your gatherer failed for some reason, here are some steps you can take to figure out what went wrong.

  1. Examine the log files (log.gatherer, log.errors) for obvious errors.

  2. A successful gathering session should produce the following files in the data directory (plus others):

  3. Verify the contents of the production database. You can view the gatherer database with
            gdbmutil print PRODUCTION.gdbm | more
    You can view only the URLs in the gatherer database with
            gdbmutil keys PRODUCTION.gdbm | more
    The ``gdbmutil'' program is located in $HARVEST_HOME/lib/gatherer and may not be on your PATH.

  4. Make sure the gatherd process is running. For example:
            ps -ax | grep gatherd
    Also try connecting to the gatherer's TCP port. For example:
            gather localhost 8500 | more
    The output from this command should be the same as for ``gdbmutil print'' above.

When experimenting with the gatherer, it is usually a good idea to start it fresh each time before running the RunGatherer script. The following command will remove all files from previous gathering runs:

        rm -rf log.* data tmp
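
The first debugging step above (examining the log files) can be scripted as a quick first pass. The following is only a sketch, not part of the Harvest distribution: check_gatherer_logs is a hypothetical helper, and it assumes that log.gatherer and log.errors live directly in the directory you pass to it. It treats a non-empty log.errors as worth a closer look, not as proof of failure.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Harvest).
# Usage: check_gatherer_logs <gatherer-directory>
# Assumes log.gatherer and log.errors sit directly in that directory.
check_gatherer_logs() {
    dir=$1
    status=0
    # Both log files named in step 1 should exist and be readable.
    for f in log.gatherer log.errors; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    # A non-empty log.errors deserves a closer look.
    if [ -s "$dir/log.errors" ]; then
        n=$(wc -l < "$dir/log.errors" | tr -d ' ')
        echo "log.errors has $n line(s)"
        status=1
    fi
    return $status
}
```

Run it from the directory that holds your gatherer, for example `check_gatherer_logs .`, and then continue with the gdbmutil and gather checks above if it reports nothing.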

2.3 HTML not being properly summarized

Beginning with release 1.2 of Harvest, HTML files are summarized with an SGML parser and the HTML Document Type Definition (DTD). The SGML parser is much more strict than WWW browsers have traditionally been. What may appear to be correct and good-looking HTML in Mosaic may in fact be written incorrectly.

When the SGML parser encounters an error, it tends to generate a large number of warnings and/or errors. Because a large percentage of HTML pages on the WWW contain errors, we have turned off error messages from the SGML parser by default. If you are seeing SOIF summaries of HTML files which appear to be too small or incomplete, then there may be an error in the HTML which you do not know about. To turn on the error messages from the SGML parser, edit $HARVEST_HOME/lib/gatherer/SGML.sum, look for the string ``syntax_check'' and make that line read

        $syntax_check   = 1;

Then re-run your gatherer and look for messages from ``sgmls'' in the log.errors file. A good reference for the HTML 2.0 DTD is at http://www.oac.uci.edu/indiv/ehood/html2.0/DTD-HOME.html
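
If you would rather script the edit to SGML.sum than make it by hand, something like the following may serve as a starting point. It is a sketch under assumptions: enable_syntax_check is a hypothetical helper, it assumes the relevant line begins with ``$syntax_check'' followed by an equals sign, and the ``.orig'' backup name is our own choice. The file is rewritten in place (rather than replaced) so that its permissions are kept.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Harvest).
# Usage: enable_syntax_check <path-to-SGML.sum>
# Assumes the setting is on a line of the form:  $syntax_check = <value>;
enable_syntax_check() {
    # Keep a backup copy; the ".orig" suffix is an arbitrary choice.
    cp "$1" "$1.orig" || return 1
    # Rewrite only the $syntax_check assignment, preserving its spacing
    # up to the "=" sign. Writing into the existing file keeps its mode.
    sed 's/^\([[:space:]]*\$syntax_check[[:space:]]*=\).*/\1 1;/' \
        "$1.orig" > "$1"
}
```

For example: `enable_syntax_check $HARVEST_HOME/lib/gatherer/SGML.sum`. Check the result by eye before re-running the gatherer, since the actual layout of the line in your copy may differ from what is assumed here.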

If you would prefer to not deal with SGML and the HTML DTD, you may use our older, less-strict HTML summarizer. Simply change your HTML.sum shell script to

    #!/bin/sh
    exec HTML-lax.sum $*

2.4 Keeping the gatherer database up-to-date

Harvest 1.2 assigned relatively long time-to-live values to the objects in its databases. A number of steps have been taken in version 1.3 to keep databases more up-to-date:

Keep in mind that objects are removed from Harvest databases only when they time out. So, for example, just removing the gatherer database is not sufficient to remove objects from the broker.


2.5 Using a host filter to restrict the gatherer

By default, the gatherer will only retrieve objects from the host specified in a rootnode URL. A host-filter file can be used to let the gatherer wander away from the rootnode host while still controlling where it goes. For example, to keep the gatherer on ``colorado.edu'' machines, we use

        allow    \.colorado\.edu
        deny     .*

If this file is called cu-host-filter, then it is specified with the rootnode URL as

        http://www.colorado.edu/  Host=300,cu-host-filter

It is also possible to limit the gatherer by IP address:

        allow    128\.138\..*
        deny     .*

Note that the IP address must still be written as a regular expression, so the ``dots'' are escaped with a backslash.
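
The filter files above read as ordered rules. The sketch below lets you try a filter file against a host name or IP address before starting a gathering run. It rests on two assumptions that this FAQ does not spell out: that rules are tried from top to bottom and the first matching regular expression decides, and that grep -E is close enough to the gatherer's own regular-expression matching for simple patterns like these. filter_decision is a hypothetical helper, not a Harvest program.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Harvest).
# Usage: filter_decision <filter-file> <host-name-or-IP>
# Prints "allow" or "deny" for the first rule whose pattern matches,
# or "no-match" if no rule applies. First-match-wins is an assumption.
filter_decision() {
    while read -r action pattern; do
        case $action in
            allow|deny) ;;
            *) continue ;;      # skip blank lines and anything else
        esac
        if printf '%s\n' "$2" | grep -E -q -- "$pattern"; then
            echo "$action"
            return 0
        fi
    done < "$1"
    echo "no-match"
}
```

With the cu-host-filter file shown above, `filter_decision cu-host-filter www.cs.colorado.edu` would print ``allow'', and any host outside colorado.edu would fall through to the final ``deny'' rule.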

2.6 Using a URL filter

Harvest 1.3 includes a default URL filter to avoid retrieving non-textual objects such as images, sounds and movies. The default URL filter file is $HARVEST_HOME/lib/gatherer/URL-filter-default and looks something like:

        deny     \.gif$
        deny     \.GIF$
        deny     \.jpg$
        deny     \.JPG$
        deny     \.mpg$
        deny     \.mpeg$
        deny     \.mov$
        deny     \.au$
        deny     \.snd$
        allow    .*

The URL-filter-default file can be used as a template to create your own filter. It might be specified on the rootnode URL line as

        http://www.ncsa.uiuc.edu/  URL=500,my.filter

NOTE: do not include access protocols or hostnames in the URL filter specifications. The URL filter only checks the pathname part of a URL. It is incorrect to write

        allow    ^http://.*\.colorado\.edu/.*
        deny     .*
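
To see why a pattern such as the one above can never match, it helps to look at what remains of a URL once the access protocol and hostname are removed. The following sketch uses a hypothetical url_path helper built on sed; exactly how the gatherer derives the pathname is not described here, so treat this as an approximation for experimenting with your own patterns.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Harvest).
# Usage: url_path <url>
# Prints only the pathname part of a URL: the "scheme://host[:port]"
# prefix is removed, and an empty result is shown as "/".
url_path() {
    printf '%s\n' "$1" |
        sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*||' -e 's|^$|/|'
}

# Example (as comments, so nothing runs when this file is sourced):
#   url_path http://www.ncsa.uiuc.edu/images/logo.gif | grep -E '\.gif$'
#       matches, so a "deny \.gif$" rule applies to this URL
#   url_path http://www.colorado.edu/index.html | grep -E '^http://'
#       prints nothing, because the pathname no longer contains "http://"
```

Restrictions on hostnames belong in a host filter; the URL filter should only contain patterns that make sense against the pathname.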

2.7 What about resolving hostnames on SunOS?

DNS

In order to gather data from hosts outside of your organization, your system must be able to resolve fully qualified domain names into IP addresses. If your system cannot resolve hostnames, you will see error messages such as ``Unknown Host.'' In this case, either

To verify that your system is configured for DNS, make sure that the file /etc/resolv.conf exists and is readable. Read the resolv.conf(5) manual page for information on this file. You can verify that DNS is working with the nslookup(8) command.

The Harvest executables for SunOS (4.1.3_U1) are statically linked with the stock resolver library from /usr/lib/libresolv.a. If you seem to have problems with the statically linked executables, please try to compile Harvest from the source code. This will make use of your local libraries which may have been modified for your particular organization.

NIS / Yellow Pages

Some sites may use NIS instead of, or in addition to, DNS. We believe that Harvest works on systems where NIS has been properly configured. The NIS servers (ypwhich(1)) must be configured to query DNS servers for hostnames they do not know about. See the -b option in ypxfr(8).

We would welcome reports of Harvest successfully working with NIS. Please write to: harvest-dvl@cs.colorado.edu.

Firewalls

Harvest can now retrieve HTTP objects through a proxy server. Gopher and FTP objects cannot be retrieved across a strict firewall.

If you see the ``Host is unreachable'' message, these are the likely problems:

If you see the ``Connection refused'' message, this is the likely problem:

Reporting Problems

The Harvest gatherer is essentially a WWW client. You should expect it to work the same as Mosaic. We would be very interested to hear about problems with Harvest and hostnames under the following condition:


2.8 Known bugs in the Gatherer (Harvest 1.3 release)


The BROKER


3.1 General Debugging

When debugging a broker, it is a good idea to ``clean out'' the broker before running it. You should kill the broker process and then issue

        make clean

3.2 Can my Broker run on a different machine than my HTTP server?

Yes. Typically the broker and httpd run on the same machine, but the broker can run on a different machine. However, if you want users to be able to view the Broker's object files (the content summaries), then those files must be accessible to httpd. You can NFS mount them or manually copy them over periodically.


3.3 The Broker doesn't compile correctly.

The Broker uses the GNU bison and flex programs (or yacc and lex) to build the grammar for the Query Manager. If you have problems compiling the Broker, then verify that you have flex v2.4.7 and bison v1.22. You can get them here:

ftp://ftp.gnu.ai.mit.edu/pub/gnu/bison-1.22.tar.gz
ftp://ftp.gnu.ai.mit.edu/pub/gnu/flex-2.4.7.tar.gz

Here's an example of what a compile problem might look like:

    Making all in broker
    bison -y -d query.y
    flex  query.l
    gcc  -I../common/include -I.  -target sun4 -c  lex.yy.c
    In file included from broker.h:44, from lex.yy.c:45:
    /usr/include/stdlib.h:27: conflicting types for `free'
    lex.yy.c:38: previous declaration of `free'
    /usr/include/stdlib.h:29: conflicting types for `malloc'
    lex.yy.c:37: previous declaration of `malloc'
    *** Error code 1
    make: Fatal error: Command failed for target `lex.yy.o'
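
Before rebuilding, it can help to compare the versions you have installed against the minimums above. The sketch below defines a hypothetical version_ge helper that compares dotted version numbers field by field; it is not part of Harvest. How you obtain the installed version string depends on your flex and bison (many releases print it in response to ``--version'', but that is an assumption about your installation).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Harvest).
# Usage: version_ge <installed> <required>
# Succeeds if <installed> is greater than or equal to <required>,
# comparing dot-separated fields numerically (missing fields count as 0).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            p = (i <= n) ? x[i] + 0 : 0
            q = (i <= m) ? y[i] + 0 : 0
            if (p > q) exit 0
            if (p < q) exit 1
        }
        exit 0
    }'
}

# Example (as comments; substitute the versions your tools report):
#   version_ge 2.4.6 2.4.7 || echo "flex is older than 2.4.7"
#   version_ge 1.22  1.22  || echo "bison is older than 1.22"
```

Fields are compared as numbers, so 1.125 counts as newer than 1.22.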

3.4 socket: protocol not supported

On Solaris and other SVR4 machines, Perl scripts which use TCP sockets may produce this error. It arises because the ``socket.ph'' which gets included is incorrect. Our installation scripts try to produce the correct socket.ph, but may fail.

Harvest 1.3 includes a script which will test your Perl and socket.ph installation. Simply run

    test-socket-ph.pl

If you don't see the message

    Perl and socket.ph tests okay.

then you need to figure out what is wrong. It is either

3.5 Searching on numbers with Glimpse

By default Glimpse does not index numbers. If you need to search for numbers then add this line to your admin/broker.conf file

        GlimpseIndex-Flags -n

3.6 Known bugs in the Broker (Harvest 1.3 release):


The CACHE


4.1 My cache doesn't work with FTP URLs.

cached relies on the ftpget.pl program to retrieve FTP files and directories. Verify that ftpget.pl is in your path when you execute cached, or that cache_ftp_program is correctly set in your cached.conf file. You can verify that ftpget.pl works by running:

        ftpget.pl - ftp.dec.com / I anonymous harvest@
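
The PATH half of that check can be done without contacting an FTP server. The sketch below is a hypothetical helper (find_ftpget is our name, not a Harvest command) that reports where the program would be found from the current environment. Run it from the same shell or startup script that launches cached, since that is the PATH that matters.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Harvest).
# Usage: find_ftpget [program-name]
# Reports where the FTP helper would be found on the current PATH.
# Defaults to ftpget.pl, the program named in this FAQ.
find_ftpget() {
    prog=${1:-ftpget.pl}
    if path=$(command -v "$prog" 2>/dev/null); then
        echo "$prog found at $path"
    else
        echo "$prog not found on PATH; check cache_ftp_program in cached.conf" >&2
        return 1
    fi
}
```

If the program is not found, either extend PATH before starting cached or set cache_ftp_program in cached.conf as described above.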

4.2 Using a non-Harvest cache parent

It is possible to specify a CERN (or other) proxy cache as a parent to a Harvest cache. The ASCII port should be where the parent listens for HTTP proxy requests. The UDP port should be the echo port (usually 7) of the parent machine. The TCP port should be zero.

For example, if a CERN cache at bigcache.world.net:8080 will act as a parent, it would be specified in cached.conf as

        cache_host bigcache.world.net    parent   8080    7

4.3 Security

The cache now has the ability to limit client requests based on IP addresses. For example:

    access_allow    128.125.51.0
    access_allow    128.126.0.0
    access_deny     all

It is not (currently) possible to limit access based on domain names.

4.4 Known bugs in the Cache (Harvest 1.3 release):


Last Updated: $Date: 1995/09/06 23:11:02 $