
The SGML-based HTML summarizer

 

  Starting with Version 1.2, Harvest summarizes HTML using the generic SGML summarizer described in Section 4.4.2. The advantage of this approach is that the summarizer is more easily customizable and fits the well-conceived SGML model (where you define DTDs for individual document types and build interpretation software to understand DTDs rather than individual document types). The downside is that the summarizer is now pickier about syntax, and many Web documents are not syntactically correct. Because of this pickiness, the HTML summarizer runs with syntax-checking output disabled by default. If your documents are so badly formed that they confuse the parser, the summarizing process may die unceremoniously. If you find that some of your HTML documents do not get summarized, or get summarized only in part, you can turn syntax-checking output on by setting syntax_check = 1 in $HARVEST_HOME/lib/gatherer/SGML.sum. That will allow you to see which documents are invalid and where.

Part of the reason for this problem is that Web browsers (like Netscape) do not insist on well-formed documents, so users can easily create documents that are not completely valid yet display fine. The problem should become less pronounced if and when people shift to creating HTML documents with HTML editors rather than editing the raw HTML themselves.

  Below is the default SGML-to-SOIF table used by the HTML summarizer. The pathname to this file is $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl. Individual Gatherers may do customized HTML summarizing by placing a modified version of this file in the Gatherer lib directory.

HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    <A>             keywords,parent
    <A:HREF>        url-references
    <ADDRESS>       address
    <B>             keywords,parent
    <BODY>          body
    <CITE>          references
    <CODE>          ignore
    <EM>            keywords,parent
    <H1>            headings
    <H2>            headings
    <H3>            headings
    <H4>            headings
    <H5>            headings
    <H6>            headings
    <HEAD>          head
    <I>             keywords,parent
    <IMG:SRC>       images
    <META:CONTENT>  $NAME
    <STRONG>        keywords,parent
    <TITLE>         title
    <TT>            keywords,parent
    <UL>            keywords,parent
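
As an illustration of how the two-column table above can be read, here is a minimal Python sketch that turns such a listing into a mapping from element to SOIF attribute names. The function name parse_sum_table is hypothetical, and the sketch assumes the on-disk file looks like the listing shown here; it is not the parser Harvest itself uses.

```python
def parse_sum_table(text):
    """Parse a two-column SGML-to-SOIF listing into {element: [attributes]}."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed rule:
        # only rows of the form "<ELEMENT>  attr[,attr]" are kept.
        if len(fields) != 2 or not fields[0].startswith("<"):
            continue
        element, attributes = fields
        table[element.strip("<>")] = attributes.split(",")
    return table


# A few rows copied from the table above.
SAMPLE = """\
HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    <A>             keywords,parent
    <A:HREF>        url-references
    <CODE>          ignore
    <META:CONTENT>  $NAME
    <TITLE>         title
"""

table = parse_sum_table(SAMPLE)
print(table["A"])       # ['keywords', 'parent']
print(table["A:HREF"])  # ['url-references']
```

Keeping the attribute column as a list makes the comma-separated entries, such as keywords,parent, easy to handle one name at a time.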

In HTML, the document title is written as:

    <TITLE>My Home Page</TITLE>

The above translation table will place this in the SOIF summary as:

    title{13}:  My Home Page

Note that "keywords,parent" occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words will be copied into the keywords attribute and also left in the content of the parent element. This keeps the body of the text readable by not removing certain words.

Any text that appears inside a pair of CODE tags will not show up in the summary because we specified "ignore" as the SOIF attribute.
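
The "parent" and "ignore" rules described above can be sketched in a few lines of Python. This is only an approximation built on the standard html.parser module, not the SGML parser Harvest uses; the class name TableSummarizer and the reduced TABLE mapping are assumptions made for the example.

```python
from html.parser import HTMLParser

# Subset of the translation table, keyed by lowercase tag name.
TABLE = {
    "title": ["title"],
    "body": ["body"],
    "b": ["keywords", "parent"],
    "em": ["keywords", "parent"],
    "code": ["ignore"],
}


class TableSummarizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # (tag, attribute names) for each open mapped element
        self.summary = {}  # SOIF attribute name -> list of text pieces

    def handle_starttag(self, tag, attrs):
        if tag in TABLE:
            self.stack.append((tag, TABLE[tag]))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def targets(self, depth):
        """Attribute names that receive text found at stack[depth]."""
        if depth < 0:
            return set()
        names = set()
        for name in self.stack[depth][1]:
            if name == "parent":
                # "parent" also leaves the text in the enclosing element.
                names |= self.targets(depth - 1)
            else:
                names.add(name)
        return names

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for name in self.targets(len(self.stack) - 1):
            if name != "ignore":  # ignored text is dropped from the summary
                self.summary.setdefault(name, []).append(text)


parser = TableSummarizer()
parser.feed("<TITLE>My Home Page</TITLE>"
            "<BODY>Read the <B>Harvest</B> manual, not <CODE>rm -rf</CODE>.</BODY>")
parser.close()
print(parser.summary["keywords"])
print(" ".join(parser.summary["body"]))
```

In this sketch the word inside B lands in both keywords and body, while the text inside CODE appears in neither, which mirrors the behavior the table describes.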

URLs in HTML anchors are written as:

    <A HREF="http://harvest.cs.colorado.edu/">

The specification for <A:HREF> in the above translation table causes this to appear as:

    url-references{32}: http://harvest.cs.colorado.edu/
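
The number in braces is the size of the value. In the examples in this section that size is one more than the number of visible characters, which would be consistent with a trailing newline being counted; that explanation is a guess, not something stated here. A small Python sketch under that assumption (soif_line is a hypothetical helper, not part of Harvest):

```python
def soif_line(attribute, value):
    """Format one attribute in the name{size}: value style shown above."""
    # Assumption: the size is the value's byte length plus one trailing
    # newline, which reproduces the counts in this section's examples.
    size = len(value.encode("utf-8")) + 1
    return f"{attribute}{{{size}}}: {value}"


print(soif_line("title", "My Home Page"))
# title{13}: My Home Page
print(soif_line("url-references", "http://harvest.cs.colorado.edu/"))
# url-references{32}: http://harvest.cs.colorado.edu/
```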

One of the most useful HTML tags is META. This allows the document writer to include arbitrary metadata in an HTML document. A typical use of the META element is:

    <META NAME="author" CONTENT="Joe T. Slacker">

By specifying "<META:CONTENT> $NAME" in the translation table, this comes out as:

    author{15}: Joe T. Slacker

Using the META tags, HTML authors can easily add a list of keywords to their documents:

    <META NAME="keywords"  CONTENT="word1 word2">
    <META NAME="keywords"  CONTENT="word3 word4">
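
The $NAME rule can be pictured as "use the NAME attribute of each META tag as the SOIF attribute name and its CONTENT as the value". The Python sketch below illustrates that idea with the standard html.parser module; MetaCollector is a hypothetical class. It keeps repeated names as separate list entries, because how Harvest combines them is not shown in this section.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.attributes = {}  # META NAME -> list of CONTENT values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        fields = dict(attrs)  # html.parser lowercases attribute names
        name, content = fields.get("name"), fields.get("content")
        if name and content is not None:
            self.attributes.setdefault(name.lower(), []).append(content)


collector = MetaCollector()
collector.feed('<META NAME="author" CONTENT="Joe T. Slacker">'
               '<META NAME="keywords"  CONTENT="word1 word2">'
               '<META NAME="keywords"  CONTENT="word3 word4">')
collector.close()
print(collector.attributes["author"])    # ['Joe T. Slacker']
print(collector.attributes["keywords"])  # ['word1 word2', 'word3 word4']
```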






Darren Hardy
Thu Sep 7 16:00:45 PDT 1995