DSPAM v2.10 <jonathan@nuclearelephant.com>
Copyright (c) 2003 Network Dweebs Corporation
http://www.nuclearelephant.com/projects/dspam/

LICENSE

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.

TABLE OF CONTENTS

General DSPAM Information

  1.0 About DSPAM
  1.1 Installation
  1.2 Testing
  1.3 Troubleshooting
  1.4 DSPAM Tools
  1.5 Agent Commandline Arguments

Advanced DSPAM functionality

  2.0 Linking with libdspam
  2.1 Configuring groups
  2.2 Participating in the global stats project
  2.3 External Inoculation Theory

Miscellaneous

  3.0 Bugs, Ports, and the like 
  3.1 Known Bugs
  3.2 Adding the dspam logo button to your website
  3.3 CVS Access

1.0 ABOUT DSPAM

DSPAM is an open-source, freely available anti-spam solution designed to combat
unsolicited commercial email using an advanced implementation of statistical
analysis coupled with deobfuscation techniques and other related approaches.

DSPAM provides an administratively maintenance-free system capable of learning
each user's email behaviors with very few false positives.  DSPAM is one of
the more popular and successful attempts at truly accurate spam filtering,
and is rapidly gaining a large support forum.  Contributions to the project are
welcome via the dspam-dev mailing list.

DSPAM can be implemented in one of two ways:

1. The DSPAM mailer-agent provides server-side spam filtering, quarantine
box, and a mechanism for forwarding spams into the system to be automatically
analyzed.  Advanced features, such as opt-in/opt-out filtering, inoculation,
and shared groups are supported.

2. Developers may link their projects to the dspam core engine (libdspam) in
accordance with the GPL license agreement.  This enables developers to
incorporate libdspam as a "drop-in" for instant spam filtering within their
applications - such as mail clients, other anti-spam tools, and so on.

Many of the foundational principles incorporated into this agent were
drawn from Paul Graham's white paper on combating spam, which can be
found at http://paulgraham.com/spam.html.  Many new approaches have been layered
on top of the original core, thanks to the efforts of contributors to the DSPAM
project.

DSPAM is designed to run with a separate dictionary for each user, in
order to provide the best accuracy and learning potential.  A composite
dictionary can be created to seed a new user's mailbox; however, over time
each user will develop their own statistics based on the behavior of their
email.

The DSPAM Solution is split up into the following pieces:

DSPAM CORE ENGINE

The DSPAM core engine, also known as libdspam, provides all major spam
filtering functions.  The engine is linked to other dspam components (or
shells) to provide functionality.  The core engine is capable of being linked
in with any other application as a "drop-in" to provide spam filtering to
mail clients, other anti-spam tools, and other projects that
would benefit from its use.  Both static and shared versions are built by
libtool into .libs.

libdspam provides a storage driver abstraction layer, enabling developers to 
easily change how information is stored on the system (for example Berkeley 
DB, MySQL, Oracle, etc.) with enough flexibility to write a storage
driver utilizing stone tablets and chisels.   

DSPAM AGENT

The DSPAM agent is a shell for libdspam providing server-side spam filtering.
The agent masquerades as the mail server's local delivery agent where 
it processes the email and then either delivers it using the real local 
delivery agent, or quarantines it.  The agent is where all learning and 
statistical calculation take place.  The agent also handles spam submissions,
that is, spam emails forwarded into the system by the user.  This is critical
to the learning operations of the DSPAM agent.  DSPAM can also be configured to
tag messages as spam, rather than quarantine them.

The MTA (sendmail, exim, qmail, etc.) calls DSPAM with parameters identifying
the destination user (DSPAM recognizes "--user [user]").  DSPAM performs its
internal calculations and will then call the real local delivery agent 
(mail.local, procmail, etc.) with any non-DSPAM-specific parameters passed to 
it, or move the message into the user's quarantine box.

When an email is delivered to the end-user, a DSPAM signature containing
a key to reversal data stored on the server is appended.  If the email
is then forwarded into DSPAM as a false positive or missed spam, DSPAM will use
this information to relearn the message.  All signatures are kept on the
server so as not to flood emails.  As a result, the tools provided should be
run to maintain each user's dictionary and signature databases.

CGI CLIENT

The CGI client is a very basic tool enabling a mail user to view their
quarantine box, reverse the occasional false positive, and delete
SPAMs permanently.  The CGI client works in conjunction with the DSPAM agent. 
It is possible to eliminate the quarantine box in favor of an alternative
solution, such as client-filtering/forwarding.

TOOLS

Some basic tools have been provided to manage dictionaries, automate
corpus feeding, and create seeded [composite] dictionaries.  Some
driver-specific tools also exist for maintenance of any proprietary files or
objects a particular driver may need.

1.1 INSTALLATION

UPGRADING

   Upgrade steps are cumulative.  The upgrade notes below are listed from
   newest to oldest; follow the steps in every note whose version is newer
   than the version you are presently running.

   You can stop once you reach an upgrade note that is versioned at or
   below your present version.

   -------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.10
   -------------------------------------------------------

   No changes are required to update to 2.10, however you may wish to
   consider implementing the following enhancements:

   1. An --enable-client-compression configure flag has been added, so if you
      are running your MySQL database on a separate machine, this will speed
      things up considerably by compressing the data between the server and
      the DSPAM agents, at the expense of some CPU.

   2. If you are using the MySQL storage driver, you should consider
      issuing the following commands to speed up processing.  You will need to
      use --myisam-recover when starting MySQL if you implement these changes,
      as some indexes may need to be rebuilt in the event of a server crash.

      alter table dspam_token_data DELAY_KEY_WRITE=1;
      alter table dspam_signature_data DELAY_KEY_WRITE=1;
      alter table dspam_stats DELAY_KEY_WRITE=1;

      Optionally, you may also convert the token data type to a fixed-length
      character field if you favor speed over storage space.  The following
      statement will accomplish this:

      alter table dspam_token_data modify token char(20);

   3. Please see the section entitled NOTIFICATIONS for instructions on
      configuring automated notifications if you would like to add this
      feature to your system. 
 
   ------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.8
   ------------------------------------------------------

   Depending on how DSPAM is configured, .dspam (opt-in) or .nodspam
   (opt-out) files may be present in USERDIR.  These files have been
   moved to directories titled USERDIR/opt-in and USERDIR/opt-out, respectively.
   Prior to upgrading, you should create the necessary directory and perform
   a command such as "mv */*.nodspam ./opt-out" to move the files to their new
   location.
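
   The move described above could be scripted roughly as follows.  This is
   only a sketch: the function name move_dotfiles is hypothetical, and it
   assumes the dotfiles live one directory below USERDIR (USERDIR/user/...),
   as the "mv */*.nodspam" example suggests.  Back up USERDIR first.

```shell
#!/bin/sh
# move_dotfiles USERDIR
# Hypothetical helper: moves per-user .dspam files into USERDIR/opt-in and
# .nodspam files into USERDIR/opt-out.  Assumes a USERDIR/<user>/ layout.
move_dotfiles() {
    userdir=$1
    mkdir -p "$userdir/opt-in" "$userdir/opt-out"
    for f in "$userdir"/*/*.dspam "$userdir"/*/*.nodspam; do
        [ -e "$f" ] || continue      # the glob matched nothing
        case "$f" in
            "$userdir"/opt-in/*|"$userdir"/opt-out/*)
                continue ;;          # already in its new location
            *.nodspam)
                mv "$f" "$userdir/opt-out/" ;;
            *.dspam)
                mv "$f" "$userdir/opt-in/" ;;
        esac
    done
}
```

   For example, "move_dotfiles /etc/mail/dspam" would process the default
   USERDIR.  Files already under opt-in or opt-out are skipped, so running
   the function a second time should be harmless.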

   -------------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.8-beta-2
   -------------------------------------------------------------

   If you are using a SQL-based driver, you will need to add two columns to 
   your dspam_stats table.  The following commands should suffice:

   alter table dspam_stats add spam_corpusfed int;
   alter table dspam_stats add innocent_corpusfed int;
   update dspam_stats set spam_corpusfed = 0;
   update dspam_stats set innocent_corpusfed = 0;

   If you are using a BDB-based driver, this information will automatically be 
   created for each user with zero values.

   -------------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.8-beta-1
   -------------------------------------------------------------

   Version 2.8-beta-1 made some changes to the file storage structure,
   in order to better support Berkeley DB storage drivers.  Regardless of
   which storage driver you use, follow the steps below to upgrade to the new
   filesystem structure.

   1. Shut down your MTA so that no mail will be delivered to DSPAM.
      Ensure that all dspam processes have exited.

   2. Run the tools/dspam_movefiles script with the following syntax:
      tools/dspam_movefiles /path/to/userdir

   3. Reset the permissions of USERDIR to be group writable and owned
      by the correct group.  For example:

      chown -R root:mail /path/to/userdir
      chmod -R 770 /path/to/userdir

      This may need to be adjusted to suit your specific implementation.

   4. Once the script completes, restart your MTA.

   ----------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.7.6.9
   ----------------------------------------------------------

   Version 2.7.6.9 made a change to the MySQL database structure to
   compensate for a bug in MySQL.  If you are using the mysql_drv
   storage driver, you will need to follow these instructions prior
   to upgrading:

     1. Shut down your MTA so that no mail will be delivered to DSPAM.
        Ensure that all dspam processes have exited.

     2. Log into your MySQL database and issue the following command:

        alter table dspam_token_data modify token varchar(32);

     3. Perform your upgrade

     4. Restart the MTA

   Version 2.7.6.9 also changed the format of the signature record.  Prior
   to upgrading, the administrator should delete all temporary signature data
   from the system.  If using a libdb driver, delete the .sig files; if using
   a SQL-based driver, issue the command:

        delete from dspam_signature_data;

   ----------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.7.6.2
   ----------------------------------------------------------

   VERSION 2.7.6.2 implements a small change to the group file format.

   If you are using groups, you will need to add a second column after
   the group name.  For example:

   groupname:bob,travis

   becomes:

   groupname:grouptype:bob,travis

   The group type for any groups prior to 2.7.6.2 is 'shared'.  Other group
   types include 'inoculation' and 'classification'.  This document will cover
   the different types of groups now supported.
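
   The format change above can be applied mechanically.  The sketch below
   inserts the 'shared' type after the group name on a single line; the
   function name convert_group_line is hypothetical, and it assumes every
   existing line uses the old two-column format.

```shell
#!/bin/sh
# convert_group_line LINE
# Hypothetical helper: rewrites "name:members" as "name:shared:members".
# Do not run it on lines that already carry a group type.
convert_group_line() {
    printf '%s\n' "$1" | sed 's/^\([^:]*\):/\1:shared:/'
}

convert_group_line "groupname:bob,travis"
```

   The same sed expression can be run over a copy of your group file.
   Because it cannot tell old lines from new ones, it should be applied
   only once, before upgrading.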

   ------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.7
   ------------------------------------------------------

   VERSION 2.7 IMPLEMENTED ENHANCED SECURITY

   Please read the section TRUSTED USERS SECURITY in this document for 
   instructions on setting up your trusted user file and your overrides file.  
   You MUST perform this step BEFORE upgrading to version 2.7 otherwise dspam
   will run in 'untrusted user' mode and function incorrectly.

   Version 2.7 also changed the rules for username identification:

   DSPAM no longer recognizes -d to identify the user, but instead --user
   must be used.  This is to make things simpler in that:

    --user will never be passed on to the local delivery agent.
    -d, when specified, will always be passed on to the local delivery agent.

    This means if you're upgrading from 2.6, you'll need to change your
    MTA configuration from:

    /path/to/dspam -d [user variable]
  
    to
  
    /path/to/dspam --user [user variable] -d %u

   Supplying %u instead of your normal user variable tells dspam to specify
   the current user when calling the LDA.  This is important when the email
   is being delivered to multiple local users.  %u is an alias for $u, which
   is used by most popular MTAs.  %u allows you to specify "current user" to
   DSPAM without having to worry about your MTA interpolating it.

   In summary, we've decided to give DSPAM its own user flag to avoid confusion
   with your local delivery agent's user flag.  The %u flag has been created
   to ensure you don't have to specify your user variable twice (some MTAs,
   such as sendmail, have a problem with this).  Technically, $u may be used,
   but %u is a safe alias that prevents your MTA from substituting the
   parameter with a list of users (bad).

   Once you have implemented the changes above, you may upgrade by building 
   the distribution and installing over the old one.  See the next section 
   (FRESH INSTALLATION) for more information.

   ------------------------------------------------------
   IMPORTANT UPGRADE STEPS FOR USERS UPGRADING FROM < 2.6
   ------------------------------------------------------

   If you are running a DSPAM version prior to 2.6, you will need to convert
   your databases to the new 64-bit database format before using this version 
   of DSPAM.  PREVIOUS VERSIONS OF 2.6-BETA ARE NOT COMPATIBLE with this
   version's databases.  If you are running with one such beta, you will need
   to convert from a backup of the 2.5-Release databases, or start with
   a fresh set of databases.

   To convert from a 2.5 database, you will need the dspam_convert tool
   available in the v2.6 distribution (no longer supplied with later
   distributions).  Once you have downloaded and built dspam_convert,
   the following operations may be used.  Be sure to keep a backup of your 
   2.5 databases until 2.6 goes production.

   cp -pr /etc/mail/dspam /tmp/dspam_crc32
   ./configure --with-userdir=/tmp/dspam_crc32
   make
   ./tools/dspam_convert
     (ignore conflicts, make sure all databases were converted successfully)
   shut down sendmail
   cp /tmp/dspam_crc32/*.dict /etc/mail/dspam
   rm -f /etc/mail/dspam/*.sig
     (the signature files contain only temporary information)
   chown root:mail /etc/mail/dspam/*.dict
   chmod 660 /etc/mail/dspam/*.dict
   make clean
   ./configure [options]
   make
   make install

   Once you have converted your databases (if necessary), all you should 
   have to do, unless otherwise stated in the CHANGE log or the website, is:

   ./configure [options]
   make
   make install

   Please see the section 'driver specific tools' for instructions on
   building any driver-specific tools.

FRESH INSTALLATION

First you will need to download a few prerequisite tools:

   Depending on which storage driver you want to use, you will need:
 
   libdb4_drv: Berkeley DB-4. 
   libdb3_drv: Berkeley DB-3.
   mysql_drv:  MySQL client libraries (and a server to connect to) 
   ora_drv:    Oracle Call Interface (and a server to connect to)

   The default is libdb4_drv.  In general, MySQL is a faster solution with
   a smaller storage footprint, and is better suited for large-scale
   implementations; however, if you don't want to run MySQL, libdb3 or
   libdb4 are both quite sufficient.

   You can download Berkeley DB from http://www.sleepycat.com.  
   You can download MySQL from http://www.mysql.com.
   You can obtain more information about Oracle at http://www.oracle.com.

   Be sure the necessary libraries are available to root, the MTA user, and 
   the CGI user. The easiest way to do this is to copy them to /usr/lib or 
   /lib.

   NOTE: Some operating system distributions include their own version of
         libdb3_drv and libdb4_drv.  A majority of these packaged versions
         do work correctly with DSPAM, however a few do not.  If you experience
         problems with one of the libdb storage drivers, consider downloading
         and compiling the official source tree from http://www.sleepycat.com.

1. CONFIGURATION

   ./configure [options]

   The most widely used options for DSPAM include:

   --with-local-delivery-agent=PROG
   Specify an alternative local delivery agent, other than the one specific
   to your operating system.  If you are building on an unsupported platform,
   you will need to specify this.  You may use quotes if you wish to include
   additional commandline flags.  DSPAM will automatically relay the 
   commandline parameters it was initially given, with the exception of any
   DSPAM-specific parameters (such as --user, --corpus, etc.)

   Currently, DSPAM has a default local delivery agent selected for Linux, 
   FreeBSD, and Solaris platforms.

   NOTE: When specifying a series of arguments, you will need to use quotes
   around PROG.  You may also use the $u identifier to specify that you
   wish DSPAM to place the destination user's ID in the corresponding space
   in the arguments list.  For example:

       --with-local-delivery-agent="/path/to/lda -d \$u" 

   Where $u will be replaced by the destination user prior to calling the LDA.
   This could potentially cause problems, however, if your MTA requires the
   user argument list to come last, which is why DSPAM, by default, will allow
   you to set this in the MTA configuration.

   NOTE: be sure to escape the $ in $u. Only do this when specifying $u on the 
   commandline.  This will prevent $u from being overwritten with the shell's 
   environment variable 'u'.
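
   The effect of the escape can be checked in any POSIX shell.  The sketch
   below assumes, purely for illustration, that the environment happens to
   define a variable named 'u'; the value "alice" is a placeholder.

```shell
#!/bin/sh
# Assume the shell environment happens to define a variable named 'u'.
u="alice"

# Unescaped: the shell expands $u before configure ever sees the argument.
unescaped="--with-local-delivery-agent=/path/to/lda -d $u"

# Escaped: the backslash keeps the literal text $u for DSPAM to replace later.
escaped="--with-local-delivery-agent=/path/to/lda -d \$u"

echo "$unescaped"
echo "$escaped"
```

   The first line printed contains "alice" in place of $u, while the second
   still contains the literal $u that DSPAM expects.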

   --with-storage-driver=DRIVER
   Specify an alternative storage driver.  A storage driver is a driver
   written specifically for DSPAM to store tokens, signature data, and
   perform other proprietary operations.  The default driver is libdb4_drv,
   which incorporates Berkeley DB-4.  The following drivers have been provided:

     libdb4_drv: Berkeley DB4 Library
     libdb3_drv: Berkeley DB3 Library
     mysql_drv:  MySQL Drivers
     ora_drv:    Oracle Drivers (BETA)

   You may also need to use some of the driver-specific configure flags
   (discussed later).

   --enable-neural-networking
   Enables neural networking support (see the section NEURAL NETWORKING).  This
   feature is presently supported only by the mysql_drv storage driver, and is
   still considered experimental.

   --with-quarantine-agent=PROG
   By default, DSPAM automatically quarantines spams in its internal
   user quarantine box.  If you wish to override this default behavior,
   however, you may do so by specifying your own quarantine agent.  The same
   notes from the --with-local-delivery-agent option apply here. 

   --enable-delivery-to-stdout
   When enabled, messages get delivered to stdout, instead of a local delivery 
   agent.  It is then up to the MTA to parse and deliver the message. 

   --enable-signature-attachments
   Instead of storing the DSPAM signatures on the server (which could take
   up considerable disk space), this option will cause DSPAM to rewrite
   each message to include a dspam.dat attachment, which contains all of
   the tokens used to calculate the original message.  When the spam or
   false positive is processed back into the system, this signature will
   be read.  This may increase bandwidth by an average of 2k-32k per
   message, depending on the original message's size.

   NOTE: This option doesn't work correctly with mail clients that quote an
   embedded, forwarded message (such as some or all versions of elm) and
   should only be used on networks where all clients can properly understand
   an embedded multipart message (Outlook, Ximian Evolution, etc.), and
   forward the attachment as an attachment instead of quoted text.  In other
   words, this breaks a lot of stuff if you're not on a standardized GUI-based
   client network.  Server-side signatures are still the most reliable method
   and work for all known clients.

   This also puts a "paper clip" on every message you receive.

   --disable-trusted-user-security
   Administrators who wish to disable trusted user security may do so by 
   using this configure flag.  This will cause DSPAM to treat each user as
   if they were "trusted" which could allow them to potentially execute
   arbitrary commands on the server via DSPAM.  Because of this, administrators
   should only use this option on either a closed server, or configure their 
   DSPAM binary to be executable only by users who can be trusted.  This 
   option SHOULD NOT be used as a solution to your MTA dropping privileges
   prior to calling DSPAM.  Instead, see the TRUSTED USERS SECURITY section of
   this document.

   --with-signature-life=DAYS
   Specifies the length (in days) a signature should remain stored on the
   server.  The default is 14 days.  This value should accurately represent
   the maximum amount of time a user would need to identify and forward
   a missed spam, or mark a false positive.  Consider vacations.

   --with-userdir=DIR
   Specify an alternative storage directory for dspam user information.  The
   default is /etc/mail/dspam.

   --disable-chained-tokens
   Disables chained tokens.  Please read the white paper on chained tokens
   before you disable them, as using them makes DSPAM significantly more
   accurate.  Only use this option if disk space is a serious issue.

   --enable-toe
   Enables TOE (Train on Errors) at 4000 innocent messages.  The default is
   Train on Everything (TEFT).  TOE will cause DSPAM only to learn when it
   has made an error, rather than learn every time a message is processed.
   Bill Yerazunis illustrated that this mode of training can be more
   accurate in many cases.  It does, however, cause the filter to adapt to new
   types of messages and mailing lists only on error.

   --enable-debug
   Turns on debugging output to USERDIR/dspam.debug and
   USERDIR/dspam.messages (see description of --with-userdir option for
   details about USERDIR).  Never use this flag in production!

   --enable-verbose-debug
   Turns on extremely verbose debugging output to USERDIR/dspam.debug
   and USERDIR/dspam.messages (see description of --with-userdir option
   for details about USERDIR).  Never use this flag in production!

   --prefix=DIR
   Specify an alternative root prefix for installation.  The default is
   /usr/local

   --enable-large-scale
   Switch for large-scale implementation.  Currently only affects the filesystem
   layout for dspam.  User data will be stored as $USERDIR/u/s/user instead
   of $USERDIR/user
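
   The two layouts can be compared with a small sketch.  The function name
   user_path is hypothetical, and the rule used here (the first and second
   characters of the username become the two intermediate directories) is an
   assumption inferred from the "u/s/user" example, not taken from the DSPAM
   source.  Behavior for one-character usernames is not covered.

```shell
#!/bin/sh
# user_path USERDIR USER LARGE_SCALE
# Hypothetical helper: prints where a user's data would live.
# Assumption: large-scale mode nests by the first two characters of USER.
user_path() {
    userdir=$1
    user=$2
    large_scale=$3
    if [ "$large_scale" = "yes" ]; then
        first=$(printf '%s' "$user" | cut -c1)
        second=$(printf '%s' "$user" | cut -c2)
        printf '%s/%s/%s/%s\n' "$userdir" "$first" "$second" "$user"
    else
        printf '%s/%s\n' "$userdir" "$user"
    fi
}

user_path /etc/mail/dspam user yes
user_path /etc/mail/dspam user no
```

   Spreading users across nested directories keeps any single directory from
   holding an entry for every user on a large installation.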

   --enable-source-address-tracking
   Logs the source address of spams and innocent messages via syslog.
   You can create a file USERDIR/mta.whitelist which can contain a list of
   local MTA IPs, which will cause DSPAM to skip to the next 'Received' header.
   Each IP should be on a new line.

   Also writes SBL blacklist files for use with the Streamlined Blackhole
   Server (http://www.nuclearelephant.com/projects/sbl/).
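
   A USERDIR/mta.whitelist file of the kind described above might be
   created as follows.  The function name write_mta_whitelist is
   hypothetical, and the addresses in the usage note are placeholders from
   the documentation address range; substitute the IPs of your own local
   MTAs.

```shell
#!/bin/sh
# write_mta_whitelist USERDIR IP...
# Hypothetical helper: writes the given IPs to USERDIR/mta.whitelist,
# one per line, replacing any existing file.
write_mta_whitelist() {
    userdir=$1
    shift
    printf '%s\n' "$@" > "$userdir/mta.whitelist"
}
```

   For example, "write_mta_whitelist /etc/mail/dspam 192.0.2.10 192.0.2.11"
   would produce a two-line file in the default USERDIR.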

   --disable-traditional-bayesian
   Disables the traditional Bayesian algorithm (enabled by default).

   --enable-alternative-bayesian
   Enables Brian Burton's alternative Bayesian algorithm.  The differences are:
     - 27 Samples are used instead of 15
     - Tokens appearing more than once may take up to 2 slots in the
       calculation.  This is ideal when there is very limited data

   --enable-chi-square
   Enables Robinson's Chi-Square algorithm.  The differences are:
     - 25 Samples are used instead of 15
     - The entire combination algorithm is different.  See:
       http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
       for more information.

   NOTE: You may have multiple algorithms enabled simultaneously; if any of
     the enabled algorithms believe the message is spam, it will be marked
     accordingly.  Naturally, you also have the potential problem of any
     false positives generated by the enabled algorithms, so it is recommended
     to either stick with a single algorithm, or use only Bayesian or only
     Chi-Square type algorithms.  Bayesian+Alt-Bayesian seems to be the most
     effective combination (not using Chi-Square at all). 

     Generally, the alternative-Bayesian algorithm appears to catch some spams
     that the traditional Bayesian algorithm does not, however it also misses
     far more spams than the traditional algorithm.  Therefore, an 
     implementation using both Bayesian algorithms appears to be the most
     effective in catching spam.
       
   --disable-bias
   When bias is disabled, dspam no longer biases the statistics in favor of
   innocent mail, but weighs both spam and innocent tokens equally in the
   calculation.  This may provide more effective spam filtering,
   but could potentially weaken false positive protection.

   --disable-test-conditional
   Disables test-conditional training.  Test-conditional training is a more
   aggressive approach to training than traditional training, and provides more
   inoculous results rapidly.

   Enabled by default, this mode of training will automatically re-train the 
   user's dictionary on spam or false positive until the training condition is 
   met (e.g. until the user's dictionary no longer results in 
   misclassification of the message being retrained).  This training has a 
   maximum number of 5 iterations, and will only invoke when:
                                                                                
   - The user has > 4000 innocent messages in their corpus, and is reporting
     a spam
                                                                                
   - The user is reporting a false positive (regardless of the number of 
     messages in their corpus)
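
   The invocation conditions and iteration cap above can be sketched in
   shell.  This only illustrates the stated rules and is not DSPAM's code:
   the function names are hypothetical, and the passes_needed argument is a
   stand-in for "the dictionary now classifies the message correctly".

```shell
#!/bin/sh
MAX_ITERATIONS=5          # maximum retraining passes, per the text above
INNOCENT_THRESHOLD=4000   # innocent messages required before spam retraining

# should_retrain REPORT_TYPE INNOCENT_COUNT
# Returns 0 when test-conditional retraining would be invoked.
should_retrain() {
    report_type=$1
    innocent_count=$2
    case "$report_type" in
        false-positive) return 0 ;;
        spam) [ "$innocent_count" -gt "$INNOCENT_THRESHOLD" ] ;;
        *) return 1 ;;
    esac
}

# retrain_until_correct PASSES_NEEDED
# Simulates retraining a message that stays misclassified until
# PASSES_NEEDED passes have run; prints the number of passes performed.
retrain_until_correct() {
    passes_needed=$1
    passes=0
    while [ "$passes" -lt "$MAX_ITERATIONS" ]; do
        passes=$((passes + 1))
        if [ "$passes" -ge "$passes_needed" ]; then
            break             # training condition met
        fi
    done
    echo "$passes"
}
```

   In this sketch a message that would need nine passes is abandoned after
   five, which mirrors the stated maximum.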

   This method of training has its controversial points as well.  All of these
   issues revolve around the assumption this approach makes: that you are
   likely to receive the same (or a very similar) message again one or more
   times in the future.

   - Since the message is being retrained repeatedly, the learning curve is
     going to be based solely on that one message rather than the natural flow
     of similar messages that may contain slightly different text.

   - It's possible a user may aggressively train a spam they will only receive
     once, and could potentially increase their risk of false positives by
     training this aggressively.

   - If there is a significant overlap of dictionary tokens between a user's
     regular mail and the incoming spams being aggressively trained, the user
     could potentially end up retraining with spam, then retraining with
     false positives, then retraining with spam again.

   In spite of these controversial points, this approach to training has had
   successful results with several implementations.

   --enable-spam-delivery
   When messages are marked as spam, they are tagged using X-DSPAM headers.
   Using this configure flag will cause DSPAM to deliver spams to the user's
   mailbox (where they should be filtered using the client's own mechanism)
   instead of quarantining them.  This presents a potential learning curve
   for end-users as they will still need to forward in missed spams and false
   positives (instead of just moving them to the correct mailbox).

   --enable-homedir-dotfiles
   When enabled, instead of checking for $USERDIR/$USER/$USER[.nodspam|.dspam],
   DSPAM will check for a .nodspam|.dspam file in the user's home directory.
   These two dotfiles are used for opt-out or opt-in filtering.
                                                                                
   --enable-opt-in
   Causes DSPAM to filter mail only for users with a .dspam dotfile.  The 
   default is opt-out, which requires a .nodspam file to exist to bypass 
   filtering.

   --enable-client-compression
   Enables data source client compression for storage drivers where it is
   available (presently only mysql_drv).  This causes data between the
   data source and its clients to be compressed.  You should use this option
   if your data source is on a separate machine from the DSPAM agent(s) as it
   conserves bandwidth, but at the expense of a few CPU cycles.

   DRIVER SPECIFIC:

   libdb4_drv:
     --with-db4-includes=DIR
     Specify a path to the Berkeley db4 includes

     --with-db4-libraries=DIR
     Specify a path to the Berkeley db4 libraries

   libdb3_drv:
     --with-db3-includes=DIR
     Specify a path to the Berkeley db3 includes

     --with-db3-libraries=DIR
     Specify a path to the Berkeley db3 libraries
     (Currently links to -ldb3, so you may need to symlink libdb-3.3.so to
      libdb3.so if it doesn't exist)

   mysql_drv:
     --with-mysql-includes=DIR
     Specify a path to the MySQL includes

     --with-mysql-libraries=DIR
     Specify a path to the MySQL libraries
     (Currently links to -lmysqlclient, also -lcrypto on some systems)

     --enable-virtual-users
     Tells DSPAM to create virtual user ids.  Use this if your users don't
     actually exist on the system (e.g. in /etc/passwd if using a password file)

     NOTE: Please see the file tools.mysql_drv/README for more information
     about configuring the mysql_drv storage driver.

   ora_drv:
     --with-oracle-home=DIR
     Specify the Oracle Home (or client home)

     --enable-virtual-users
     Tells DSPAM to create virtual user ids.  Use this if your users don't
     actually exist on the system (e.g. in /etc/passwd if using a password file)

     NOTE: Please see the file tools.ora_drv/README for more information
     about configuring the ora_drv storage driver.

2. BUILDING AND INSTALLING

   After you have run configure with the correct options, build and install
   DSPAM by performing:

   make && make install

   If you are a developer wanting to link to the core engine of dspam,
   libdspam will be built during this process.  Please see the
   example.c file for examples of how to link to and use libdspam.  Static
   and dynamic libraries are built in the .libs directory.

3. PERMISSIONS

   After install, the USERDIR will have been created for you automatically
   (the default is /etc/mail/dspam).  Ensure that the directory is writable
   by both your MTA and CGI users.

   You may need to add root, your MTA user, and your CGI user to the
   directory's [mail] group in /etc/group.  The MTA user is usually 'daemon' 
   or 'smmsp' although on FreeBSD the default is 'mailnull'.  This is very 
   important, as your MTA user needs to be able to lock and work with files.

   IMPORTANT!!!

   FreeBSD's mail.local changes its effective uid, and so in order to use it
   dspam must be installed as setuid root to work on the commandline properly.
   This is done automatically on install.

   DSPAM CGI

   The CGI will need to function in the same group as the dspam agent.  The
   best way to do this is to create a separate virtualhost specifically for
   the CGI and assign it to run in the MTA group.  If you are using
   procmail, DSPAM may also need to be setuid root in order to deliver
   false positives through the CGI.  Be sure not to open up the permissions
   of the dspam agent so that another user or CGI could call dspam with
   setuid or setgid permissions, as this is insecure.

   TRUSTED USERS SECURITY

   DSPAM has tighter security for untrusted users on the system, to prevent
   them from being able to spoof other users or specify their own passthru
   arguments to potentially hijack the local delivery agent.  This method
   of security has been implemented due to the fact that some implementations
   (such as those using procmail) may require the DSPAM agent to be setuid or
   setgid.

   The trusted.users file should be created in $USERDIR (defaulted to
   /etc/mail/dspam).  This file should contain a list of trusted users who 
   should be allowed to set the dspam user, passthru parameters, and other 
   information that would be potentially dangerous for a malicious user to 
   be able to set.  The file should contain one username per line, and will
   generally list the usernames of the MTA and CGI users.  Example:

   root
   smmsp
   daemon
   cgi
   mailnull

   Where cgi represents the special CGI user you configure Apache to
   run your dspam.cgi as.
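
   The one-username-per-line format can be illustrated with a short Python
   sketch.  This is not DSPAM's own code: the helper names load_trusted_users
   and is_trusted are hypothetical, and both the skipping of blank lines and
   the exact (case-sensitive) comparison are assumptions.

```python
import io

def load_trusted_users(lines):
    """Return the set of usernames, one per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}

def is_trusted(username, trusted):
    """Exact, case-sensitive comparison (an assumption for this sketch)."""
    return username in trusted

# The example list from above, read from memory instead of $USERDIR.
example = io.StringIO("root\nsmmsp\ndaemon\ncgi\nmailnull\n")
trusted = load_trusted_users(example)
print(is_trusted("smmsp", trusted))  # True
print(is_trusted("bob", trusted))    # False
```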

   Be sure to examine USERDIR/dspam.debug to ensure that you don't get any
   untrusted user warnings when submitting spam or a false positive, as both
   of these actions frequently call dspam from a different user than
   standard mail delivery.

     NOTE ON CGI USERS:  It is far more secure to create a separate virtual
     host for the DSPAM CGI running as a different user than any other
     scripts on the system.  This avoids giving trusted user privileges to
     another CGI.  If you do this, be sure to add the CGI user to the trusted
     users list.

   If you are using an MTA that changes its userid before calling DSPAM to
   match the destination user, you should NOT add each user to the trusted
   users file, but instead configure a preset commandline.  DSPAM will see
   that the user is not trusted and automatically set their DSPAM user id
   and optionally the passthru local delivery agent arguments.
   
   To override an untrusted user's passthru local delivery agent arguments
   (arguments which could be used to hijack the local delivery agent to gain
   privileged access to the system) you will need to set up a file called 
   untrusted.mailer_args in the same directory ($USERDIR).  The first line 
   should contain the path to the local delivery agent followed by a list of 
   all the LDA arguments to pass through (including a user identity flag if 
   necessary).  This file's information will override any passthru commandline 
   parameters specified by the user.  For example:

   /bin/mail -d $u

   The variable $u informs DSPAM that you would like the destination username
   to be used in the position $u is specified, so when DSPAM calls your LDA
   for user 'bob', it will call it with:

   /bin/mail -d bob
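
   The $u substitution described above can be sketched in Python as follows.
   This is only an illustration of the idea, not how DSPAM implements it; the
   function name build_lda_command is hypothetical, and replacing only whole
   arguments equal to $u is an assumption.

```python
import shlex

def build_lda_command(mailer_args_line, username):
    """Split the configured line into arguments and substitute $u."""
    args = shlex.split(mailer_args_line)
    # Substitute after splitting so the username is never re-parsed.
    return [username if arg == "$u" else arg for arg in args]

print(build_lda_command("/bin/mail -d $u", "bob"))
# ['/bin/mail', '-d', 'bob']
```

   Building an argument list instead of a single shell string means a
   username containing spaces or shell metacharacters stays one argument.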

   NOTE: In the event that ALL of the following are true:

    - Your MTA performs a setuid() to the destination user prior to calling
      DSPAM

    - There are additional _dynamically assigned_ parameters that must be
      passed to DSPAM which cannot be specified in configuration

    - The local delivery agent has no potentially dangerous commandline
      options, or you are placing a wrapper around the local delivery
      agent

    Then you may want to remove the untrusted.mailer_args file altogether.
    If the file cannot be found, dspam will permit the user to specify their 
    own passthru arguments to the preconfigured LDA (with some basic sanity 
    checking), which COULD POTENTIALLY BE INSECURE if improperly set up.  It
    is strongly recommended you use this file to override the user.

    DSPAM logs a warning when it is unable to open the untrusted.mailer_args
    file.

    If you don't want to see this warning, create the untrusted.mailer_args
    file but leave it empty.

4. MAIL SERVER CONFIGURATION

   Modify your mail server configuration to use /usr/local/bin/dspam as the
   default local delivery agent.  

   SENDMAIL EXAMPLES
   If you are using sendmail, modify the Mlocal tag, as shown below:

Mlocal,		P=/usr/local/bin/dspam, F=lsDFMAw5:/|@qfSmn9, S=EnvFromL/HdrFromL, R=EnvToL/HdrToL,
		T=DNS/RFC822/X-Unix,
		A=dspam --user $u -d %u

   You may also need to add this to DSPAM_ARGS in dspam.cgi:

   $CONFIG{'DSPAM_ARGS'} = "--user $ENV{'REMOTE_USER'} -d \$u";

   If you are using procmail with sendmail, you can preserve the existing
   procmail flags with the exception of LMTP.  For example:

Mlocal,		P=/usr/local/bin/dspam, F=lsDFMAw5:/|@qSPfhn9, S=EnvFromL/HdrFromL, R=EnvToL/HdrToL,
		T=DNS/RFC822/X-Unix,
		A=dspam -t -Y -a $h --user $u -d %u

   The --user flag identifies the user to DSPAM, and does not actually get
   passed through to the LDA.  Be sure to change "-d $u" to "-d %u" to let
   DSPAM supply the username to the LDA.

   NOTE: be sure to escape the $ in $u ONLY when specifying it on the 
   commandline.  This will prevent $u from being overwritten with the shell's 
   environment variable 'u'.
                                                                                
   Finally, if you are using sendmail restricted shell (smrsh), you will
   need to either disable it, or add /usr/local/bin/dspam to the allow list. 
   Some people have experienced problems using smrsh with DSPAM.

   It is required that dspam be called with the --user flag followed by the user
   name of the local recipient.  

   EXIM
   To integrate DSPAM with exim, you'll need to create a new director in the
   exim configuration.  First, add the following code to the directors:

spamscan:
  no_verify
  condition = "${if and { {!eq {$received_protocol}{spam-scanned}} } {1}{0}}"
  driver = localuser
  transport = spamcheck

  Then add the following code to the transports:

spamcheck:
  driver = pipe
  command = /usr/local/bin/dspam --user $local_part -d %u 
  user = mail
  group = mail
  return_path_add = false
  log_output = true
  return_fail_output = true

   Finally, you will need to configure your local delivery agent into dspam.
   This may end up calling exim again, if you are using maildir
   format for example.  An example of calling back to exim for delivery might
   look like:

   ./configure --with-local-delivery-agent="/usr/sbin/exim -oMr spam-scanned"

   Of course, any parameters you pass to DSPAM will also get passed on to
   the local delivery agent, so you can safely keep the parameters in
   your exim configuration in case you want to change them later:

   command = /usr/local/bin/dspam --user $local_part -d $local_part -oMr spam-scanned

   then configure with just:

   ./configure --with-local-delivery-agent=/usr/sbin/exim

   This was tested under Debian Sarge with Exim 3.36-6.

   COURIER

   Step 1: Configuring Courier to use DSPAM

   To set Courier to use DSPAM as its local delivery agent, you'll want to
   change the DEFAULTDELIVERY parameter.  In the file /etc/courier/courierd,
   find the line for DEFAULTDELIVERY and change it to:

   DEFAULTDELIVERY="| /usr/local/bin/dspam --user $USER"

   Step 2: Setting DSPAM to use Maildrop as LDA
   Maildrop is Courier's native equivalent of procmail.  To configure DSPAM
   to use it, use the following configure flags:

   --with-local-delivery-agent="/usr/bin/maildrop -d \$u"

   You'll also need to configure untrusted.mailer_args, as Courier drops to the
   uid of the destination user and changes to that user's home directory before
   calling the LDA.  Accomplish this by using the following parameters in your
   untrusted.mailer_args file:

   /usr/bin/maildrop -d $u

   Step 3: Configuring DSPAM aliases for missed spam/false positives

   The aliases for missed spam/false positives should be added to the file
   /etc/courier/aliases/system.  After adding new aliases, run 'makealiases'
   and restart Courier.

   OTHER MAIL SERVERS
   Consult your mail server's documentation on how to replace the local 
   delivery agent.  It may be necessary to recompile the software.  Worst 
   case scenario, you may have to symlink dspam to the filename of the local 
   delivery agent, and rename the local delivery agent.  If you do this, be 
   sure the change is reflected in your configuration of dspam using
   the --with-local-delivery-agent= flag.  We have reports of dspam working 
   with postfix, and will post documentation for configuring it as soon as 
   it's available.

5. ALIASES

   For each user, you will need to create an email address the user can
   send spam to, so that DSPAM can analyze and learn.  The easiest way to
   do this is to create a new alias.  For example:

   spam-bob:      "|/usr/local/bin/dspam --user bob --addspam"

   You will end up having one alias per mail user on the system.  Be sure the
   aliases are unique and each username matches the name after the --user flag.
   A tool has been provided called dspam_genaliases.  This tool will read the
   /etc/passwd file and write out a dspam aliases file that can be included
   in your master aliases table.  

   To report spam, the user should be instructed to forward each spam to
   spam-user@yourhost, where user is their own username.

   If you will be using the --enable-spam-delivery mechanism, you will also
   need an alias to forward false positives into.  The following example should
   suffice:

   fp-bob:	"|/usr/local/bin/dspam --user bob --falsepositive"

   It doesn't really matter what you name these aliases, so long as the flags
   being passed to dspam are correct for each user.  It might be a good idea
   to create an alias custom to your network, so that spammers don't forward
   spam into it.  For example, fp-yourcompany-bob.
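
   As a rough illustration of the alias layout described above, the Python
   sketch below prints a spam alias and a false positive alias for each user.
   It is not the bundled dspam_genaliases tool; the function name
   alias_lines, the prefix parameter, and the hard-coded user list are all
   hypothetical.

```python
DSPAM = "/usr/local/bin/dspam"  # agent path used in the examples above

def alias_lines(username, prefix=""):
    """Return the spam and false positive alias entries for one user."""
    return [
        f'spam-{prefix}{username}: "|{DSPAM} --user {username} --addspam"',
        f'fp-{prefix}{username}: "|{DSPAM} --user {username} --falsepositive"',
    ]

# A site-specific prefix makes the alias names harder to guess.
for user in ["bob", "alice"]:
    print("\n".join(alias_lines(user, prefix="yourcompany-")))
```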

6. DRIVER SPECIFIC TOOLS

   Depending on the storage driver you are using, there may be tools specific
   to that driver that will be built.
   
THE CLEANUP AND PURGE TOOLS

   CLEANUP

   You should configure dspam_clean to run under cron nightly.

   This clean tool will read all signature databases and purge signatures that
   are older than 14 days (configurable).  Without this tool, old
   signatures will continue to pile up.  A cron like the one below should
   suffice.  Be sure the user running cleanup has full read/write permissions
   on the DSPAM data files.

   0 0 * * * /usr/local/bin/dspam_clean

   NOTE: If you are using the mysql_drv storage driver, run the purge.sql
         script nightly instead of dspam_clean.

   PURGE

   Depending on which storage driver you choose, there may be a driver-specific
   purge tool designed to keep user dictionaries clean, small, and optimized by
   deleting unimportant tokens and rewriting the database from scratch every
   time it is run.  Unimportant tokens are tokens that have fewer than the
   minimum number of hits to be assigned a true probability; presently these
   are tokens for which Spam Hits + 2 * Innocent Hits < 5.  This tool can be
   run as often as you like, but on a large-scale installation it should only
   be run when the system is fairly quiescent, as it locks the user it is
   processing, creating potential delays in mail delivery.  Depending on how
   much disk space you wish to conserve with this tool, you may run it every
   day, week, or two weeks.  To configure this tool in cron, use an entry
   similar to the one below:

   0 0 5,10,15,20,25,30 * * /usr/local/bin/libdb4_purge

   The above entry will run the libdb4-specific purge tool on the 5th, 10th, 
   15th, and so on.  It is not absolutely necessary to run the purge tool 
   (especially if you wish to collect all available data over a long term 
   period), but is healthy in maintaining a smaller set of user dictionaries 
   containing only interesting tokens.  Without it, your users' dictionaries 
   could easily grow in size over time. 

   Obviously if you are using a SQL-based driver, you will not need to compress
   files, but may want to run some basic SQL commands to delete unused tokens, 
   etc.

   DEADLOCK

   If you are using one of the libdb drivers, you may also wish to run the
   deadlock detection tool in the background to help prevent deadlocks.  The
   tool 'libdb3_deadlock' or 'libdb4_deadlock' can be run at startup
   (be sure to append an ampersand so that it runs in the background) and will
   continue to monitor the system for deadlocks every 100ms.

7. NOTIFICATIONS

   DSPAM is capable of sending three different notifications:

   - A "First Run" message sent to each user when they receive their first 
     message through DSPAM.

   - A "First Spam" message sent to each user when they receive their first
     spam.

   - A "Quarantine Full" message sent to each user when their quarantine box
     is > 2MB in size.

   These notifications can be activated by copying the txt/ directory from the
   distribution into USERDIR (by default /etc/mail/dspam).  You will want to
   modify these templates prior to installing them to reflect the correct
   email addresses and URLs (look for 'configureme' and 'yourdomain').

   NOTE: The quarantine warning is reset when the user clicks 'Delete All', but
   is not reset if they use "Delete Selected".  If the user doesn't wish to
   receive reminders, they should use the "Delete Selected" function instead
   of "Delete All".

THE CGI CLIENT

   The CGI client (dspam.cgi) can be run from any executable location on
   a web server, and detects its user's identity from the REMOTE_USER
   environment variable.  This means you'll need to use HTTP password
   authentication to access the CGI (any type of authentication will work,
   so long as Apache supports the module).  You'll want the usernames to match
   the actual username on the system.  A copy of the shadow password file
   will suffice for authentication. 

   The accompanying files in the cgi/ folder should be copied into the same
   location as dspam.cgi, as they are needed by the tool to generate output.
   Be sure to copy the templates and graphics into the cgi-bin as well.

   NOTE: Some authentication mechanisms are case insensitive and will 
   authenticate the user regardless of the case they type it in.  DSPAM, 
   on the other hand, is case sensitive and the case of the username used
   will need to match the case on the system.  If you suffer from this
   authentication problem, and are certain all of your users' usernames are
   in lowercase, you can add the following line of code to the CGI right
   after the call to &ReadParse...

   $ENV{'REMOTE_USER'} = lc($ENV{'REMOTE_USER'});

   You will also want to set the LARGE_SCALE variable in dspam.cgi if you are
   compiling with --enable-large-scale.  This will adjust the CGI to use the
   correct filesystem paths for dspam's large-scale filesystem implementation.

   Please see the section on permissions for more information about setting up
   the CGI's group and agent permissions.

1.2 TESTING

  Most software packages are supplied with a test suite to determine if the
  software is functioning properly.  Since dspam's correct function relies 
  primarily on having the correct permissions and mail server configuration,
  a test script fails to provide the level of testing required for such a
  package.  The following exercise has been provided to test dspam's correct
  functioning on your system.  This exercise does not test the CGI, but only
  the core dspam agent.
  
  Before running the test, you should have completed section 1.1's instructions
  for compiling and installing dspam as well as configured your mail server
  to support dspam.

  1. Create a new user account on your system.  It is important that this be a 
  new account to prevent any unrelated email from being delivered during 
  testing.  Be sure to configure a spam alias for the test account.

  2. Send a short (10 words or fewer) email to the account, and pick it up 
  using your favorite mail client.  

  3. Run dspam_stats [username] on the server.  You should see a value of 1 
  for "TI" or "Total Innocent" as shown below:

  dspam-test            0 TS       1 TI       0 TM       0 FP

  If you receive an error such as "unable to open /etc/mail/dspam/... for
  reading", then the dspam agent is not configured correctly.  The problem
  could lie in either your mail server configuration or one or more of the
  permissions on the directory or agent.  Check your configuration and
  permissions, and repeat this step until the correct results are experienced.

  4. Run dspam_dump [username] to get a complete list of tokens and their 
  statistics.  Each token should have an I: (innocent) hit count of 1. The 
  tokens will be represented as 64-bit values, for example:

3126549390380922317              S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003
13884833415944681423             S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003
14519792632472852948             S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003
8851970219880318167              S:    0  I:    1  LH: Mon Aug  4 11:40:12 2003

  5. Forward the test message to the spam alias you've created for the test 
  account.  Provide enough time for the message to have processed.

  6. Run dspam_stats [username] on the server again.  Now, the value for TI 
  should be zero and the value for TM (total misses) should be 1 as shown
  below:

  dspam-test            0 TS       0 TI       1 TM       0 FP

  If this is not the case, check the group permissions of the dspam agent as
  well as the permissions your MTA uses when piping to aliases.
  
  7. Run dspam_dump [username] again.  Make sure that _EVERY_ token now has an 
  I: of zero and an S: of 1:

3126549390380922317              S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
13884833415944681423             S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
14519792632472852948             S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
8851970219880318167              S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003

  If you have some tokens that do not have an S: of 1 or an I: of 0, the dspam
  signature was not found on the email, and this could be due to a lot of
  things.
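
  The check in step 7 can be automated with a small script.  The Python
  sketch below assumes the dspam_dump output layout shown above (token, S:,
  I:, LH:); other versions may print a different layout.  The names
  parse_dump and all_retrained_as_spam are hypothetical.

```python
import re

# One dump line: token id, spam hits, innocent hits, last-hit timestamp.
TOKEN_LINE = re.compile(r"^\s*(\d+)\s+S:\s*(\d+)\s+I:\s*(\d+)\s+LH:\s*(.*)$")

def parse_dump(text):
    """Return (token, spam_hits, innocent_hits) tuples from dump output."""
    rows = []
    for line in text.splitlines():
        match = TOKEN_LINE.match(line)
        if match:
            token, spam, innocent = (int(match.group(n)) for n in (1, 2, 3))
            rows.append((token, spam, innocent))
    return rows

def all_retrained_as_spam(rows):
    """True when every token has S: 1 and I: 0, as step 7 expects."""
    return bool(rows) and all(s == 1 and i == 0 for _, s, i in rows)

sample = """\
3126549390380922317              S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
13884833415944681423             S:    1  I:    0  LH: Mon Aug  4 11:44:29 2003
"""
print(all_retrained_as_spam(parse_dump(sample)))  # True
```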

1.3 TROUBLESHOOTING

    Problem: I get an error similar to 'cannot find -ldb-4.1'
   Solution: Your compiler can't locate your db libraries.  Try installing
             them into /usr/lib, or add them to your (and your MTA's)
             LD_LIBRARY_PATH.  You may also use --with-db4-includes and
             --with-db4-libraries as configure flags.  If you are using libdb3,
             use the db3-specific configure parameters.
 
    Problem: Dictionary isn't updating
   Solution: Check the file permissions of both the .dict and the .mbox files.
             These files will need to be writable by the dspam agent as well
             as the CGI user.

    Problem: No files are being created in the user directory
   Solution: Check the permissions of the user directory.  The user
             directory must be writable by the user the dspam agent is running
             as, as well as by the CGI user.

    Problem: False positives are never being delivered
   Solution: Your CGI most likely doesn't have the privileges required by
             the LDA to deliver the messages.  Make sure the CGI user is in
             the correct group.  Also consider setting the dspam agent to
             setuid or setgid with the correct permissions.

1.4 DSPAM TOOLS

  A few miscellaneous tools have been provided to make DSPAM management 
  a bit easier.  These tools include:

  dspam_corpus - Used to feed an existing corpus of mail (in mailbox format)
    into the dspam system.  
    Syntax: dspam_corpus [username] [filename] [--addspam]
    where username is the username of the user to apply the corpus to,
    filename is the filename of the mailbox, and the optional --addspam flag
    specifies that this corpus is known spam (to be added as spam to the user's
    dictionary).  
 
  dspam_dump - Dumps a DSPAM dictionary. This can be used to view the 
    entire contents of a user's dictionary, or used in combination 
    with grep to view a subset of data.  Syntax: dspam_dump [username] 
    where username is the DSPAM user's username.

  dspam_clean - Used in conjunction with the SERVER_SIDE_SIGNATURE mode
    of operation where signatures are stored locally in a database.  The
    clean tool should be run nightly and is responsible for deleting old
    signatures (>14 days by default) from the datafiles.

  dspam_purge - Processes every user dictionary on the system and keeps each
    dictionary clean of unimportant tokens over a long term period.  This tool 
    deletes non-qualifying tokens that have not been hit after a certain 
    duration.  The defaults specified in config.h are:

    - Delete any tokens with a (SH)+(2)(IH) value of less than 5 after 
    not being hit for 60 days
    - Delete any tokens with a (2)(IH) value of less than 5 and zero
    SH after not being hit for 30 days
    - Delete any tokens with an IH of 1, SH of zero after not being
    hit for 15 days.
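
    The three default rules can be written out as a single predicate.  The
    Python sketch below only restates the thresholds listed above; the
    function name should_purge is hypothetical, and treating "not being hit
    for N days" as "N days or more" is an assumption.

```python
def should_purge(spam_hits, innocent_hits, days_since_hit):
    """Apply the three default purge rules listed above."""
    # Rule 1: SH + 2*IH below 5 and idle for 60 days.
    if spam_hits + 2 * innocent_hits < 5 and days_since_hit >= 60:
        return True
    # Rule 2: no spam hits, 2*IH below 5, idle for 30 days.
    if spam_hits == 0 and 2 * innocent_hits < 5 and days_since_hit >= 30:
        return True
    # Rule 3: a single innocent hit, no spam hits, idle for 15 days.
    if spam_hits == 0 and innocent_hits == 1 and days_since_hit >= 15:
        return True
    return False

print(should_purge(0, 1, 20))  # True: IH of 1, SH of zero, idle 15+ days
print(should_purge(1, 1, 45))  # False: SH + 2*IH = 3, but idle under 60 days
print(should_purge(3, 1, 90))  # False: SH + 2*IH = 5, not below the minimum
```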

  dspam_stats - Displays the spam statistics for one or all users on the system.
    Syntax: dspam_stats [username].  If no username is provided, all users 
    will be displayed.  Displays TS (Total Spams), TI (Total Innocent), TM
    (Total Spam Misses) and FP (Total False Positives).  Spam misses are
    spams that were forwarded in by the user.  To calculate the total number
    of spams caught by DSPAM, subtract TM from TS. 
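
    The TS - TM calculation can be scripted against dspam_stats output.  The
    Python sketch below assumes the one-line-per-user layout shown in section
    1.2; the names parse_stats and spams_caught are hypothetical and the
    figures for 'bob' are made up for illustration.

```python
import re

# Layout assumed from section 1.2: "user   0 TS   1 TI   0 TM   0 FP"
STATS_LINE = re.compile(
    r"^\s*(\S+)\s+(\d+) TS\s+(\d+) TI\s+(\d+) TM\s+(\d+) FP\s*$"
)

def parse_stats(line):
    """Parse one dspam_stats line into a dictionary of totals."""
    match = STATS_LINE.match(line)
    if match is None:
        raise ValueError("unrecognized dspam_stats line: %r" % line)
    user, ts, ti, tm, fp = match.groups()
    return {"user": user, "TS": int(ts), "TI": int(ti),
            "TM": int(tm), "FP": int(fp)}

def spams_caught(stats):
    """Total spams minus the misses the user had to forward in."""
    return stats["TS"] - stats["TM"]

stats = parse_stats("bob                 120 TS     300 TI       8 TM       2 FP")
print(spams_caught(stats))  # 112
```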

  dspam_genaliases - Reads the /etc/passwd file and outputs a dspam aliases
    table which can be included in the master aliases table.  You may try
    Art Sackett's generate_dspam_aliases tool at 
    http://www.artsackett.com/freebies/generate_dspam_aliases/ if you need
    some better functionality.  This will eventually be merged in as a
    replacement for the existing tool.
 
  dspam_merge - Merges multiple users' dictionaries together into one user's
    dictionary (does not affect the merge users).  This can be used to create
    a seeded dictionary for a new user, or to copy a single user's dictionary
    to a new file.

  dspam_ngstats - A global statistics tool for those who wish to participate
    in global statistics tracking.  This tool reports your system's spam
    totals to a master server on a daily or weekly basis, where they are
    merged with the statistics of other participating systems to provide
    global statistics of DSPAM's efficiency and how many spams DSPAM has
    put a stop to.  See section 2.2 for more information.

1.5 AGENT COMMANDLINE ARGUMENTS

  The DSPAM agent (dspam) recognizes the following commandline arguments:

  --user [user1 user2 userN]
  Specifies the destination users of the incoming message.  DSPAM then 
  processes the message once for each user individually.  If the message is to
  be delivered, the $u (or %u) parameters of the arguments string will be
  interpolated for the current user being processed.

  --addspam
  Tells DSPAM that the message being presented should be treated as spam.  This
  affects the learning phase so that the message tokens' spam counts will
  be incremented.  If a valid signature is found in the message, the message
  tokens' innocent counts will also be decremented, as well as the innocent
  message total.
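
  The effect of --addspam on token counts can be modelled with a short Python
  sketch.  It is a simplified illustration, not DSPAM's storage code: the
  dictionary layout and the name add_spam are hypothetical, never letting a
  count drop below zero is an assumption, and message totals are left out.

```python
def add_spam(tokens, message_tokens, has_valid_signature):
    """Update per-token spam (S) and innocent (I) counts for reported spam."""
    for token in message_tokens:
        counts = tokens.setdefault(token, {"S": 0, "I": 0})
        counts["S"] += 1
        # With a valid signature the earlier innocent hit is reversed too.
        if has_valid_signature and counts["I"] > 0:
            counts["I"] -= 1
    return tokens

# A token first learned as innocent, then reported as spam with a signature.
tokens = {3126549390380922317: {"S": 0, "I": 1}}
add_spam(tokens, [3126549390380922317], has_valid_signature=True)
print(tokens[3126549390380922317])  # {'S': 1, 'I': 0}
```

  This mirrors the transition from S: 0, I: 1 to S: 1, I: 0 seen in the test
  exercise in section 1.2.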

  --falsepositive
  Tells DSPAM that the message being presented should be treated as a false
  positive.  This results in a re-learning phase where the message tokens'
  spam counts will be decremented and the innocent counts incremented.  Totals
  will also be updated in the same fashion. 

  --corpus
  Tells DSPAM that the message being presented is from a corpus, and is not
  a misclassification.  Depending on whether --addspam was specified or not,
  the message will be treated as spam OR innocent.  Use this flag when 
  submitting any messages that do not have a valid signature, otherwise the
  message's headers will be ignored to avoid classifying the user's own 
  headers.

  --inoculate
  Treats the message as an inoculation resulting in accelerated learning of
  the message.  Inoculation should only be used in conjunction with users having
  a mature dictionary (4000+ messages of spam/nonspam).  If the message being
  presented does not have a valid signature, the --corpus flag must be used
  in conjunction with this flag.  This flag implies that the message is being
  submitted as a corpus message (whether via actual message corpus or a
  signature of the message from another user's corpus) and not as a result of a
  misclassification.  As a result, corpus totals will be incremented when this
  flag is used.

  --deliver-fp
  Causes false positives to be delivered to the user.  It is only necessary to
  specify this flag if you configure with --enable-spam-delivery.

  --stdout
  Causes the message to be delivered to stdout.  The default behavior will only
  deliver innocent messages.  If you wish to have spams delivered to stdout
  as well, use this argument in conjunction with --deliver-spam, or configure
  with --enable-spam-delivery.

  --deliver-spam
  Causes the message, if spam, to be delivered rather than quarantined.

2.0 LINKING WITH LIBDSPAM

  Developers are able to link to the DSPAM core engine (libdspam) to provide 
  "drop-in" spam-filtering for their applications.  Examples of the libdspam
  API can be found in the example.c file included with this distribution.

  To link to libdspam, follow the instructions in section 1.1 for compiling
  and installing dspam.  When dspam is compiled, the libdspam static and
  shared libraries are also built.  This library contains all the functions 
  necessary to use dspam's filtering in your application. 

  Your application will also need to link to Berkeley db4.  If you would
  prefer to use your own proprietary storage mechanisms, modify the 
  localdb.c code and recompile.  

  To build with the dspam API, you will also need the following header
  files: 

  libdspam.h         - library API
  libdspam_objects.h - structures and macros used
  buffer.h           - used internally
  decode.h           - used internally
  lht.h              - used internally
  nodetree.h         - used internally

  If you are interested in linking libdspam with your project and have 
  questions or concerns, please contact the dspam-dev mailing list.

2.1 CONFIGURING GROUPS

  Groups enable a group of users to share information.  The following
  group types are supported:

  SHARED
  Enables users with similar email behavior to share the same dictionary 
  while still maintaining a private quarantine box.  The benefits of this
  type of group are faster learning, and sharing a single spam alias.  Shared
  groups can have both positive and negative effects on accuracy.  If a shared
  group consists of users with similar, predictable email behavior, the users 
  in the group can benefit from a larger dictionary of spam and faster 
  learning (especially for newcomers in the group).  If a group consists of 
  users with different email behavior, however, the users in the group will 
  experience poor spam filtering and a higher number of false positives.

  SHARED GROUP NOTES:

  1. The mysql_drv storage driver supports shared groups, but has one caveat:
     If you are NOT enabling "virtual users" support, you will need to create
     an actual user on your system named after each group you create.

  2. The ora_drv storage driver does not yet support shared groups

  On top of shared group support, a shared group can also be made to be
  'managed'.  Using the group type 'SHARED,MANAGED' will cause the group to
  share a single quarantine mailbox which could be managed by the group's
  administrator.  This would enable one individual to monitor quarantine for
  the entire group, however personal emails marked as false positives could
  potentially be viewed as well.  For this reason, managed groups should only
  be used when this is not an issue.

  INOCULATION
  An inoculation group allows users to maintain their own private dictionaries
  with their own spam alias, but all members of the group will inoculate other
  members with spams they manually forward into their alias.  This allows 
  users to report spams to one another and maintain their own private
  dictionary.  Another advantage to this is that users do not necessarily have
  to share the same email behavior.  

  NOTE: Users should only be added to an inoculation group after their initial
        learning period, to avoid potential false positives due to lack of data.

  To create groups, you'll want to create a file with the filename 'group' 
  located in the DSPAM user directory.  The default is /etc/mail/dspam/group.  
  The format of the file should look like this:

  group1:shared:user1,user2,user3
  group2:inoculation:user4,user5,user6

  A user can be a member of multiple inoculation groups, but a user cannot be
  a member of both an inoculation group and a shared group.
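
  The file format and the membership rule above can be checked with a small
  script.  The Python sketch below is not part of DSPAM; the names
  parse_groups and conflicting_users are hypothetical, and skipping blank
  lines and treating the group type as case-insensitive are assumptions.

```python
def parse_groups(text):
    """Parse 'name:type:user1,user2' lines into (name, type, users) tuples."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, group_type, members = line.split(":", 2)
        users = [user for user in members.split(",") if user]
        groups.append((name, group_type.lower(), users))
    return groups

def conflicting_users(groups):
    """Users listed in both a shared group and an inoculation group."""
    shared = {u for _, kind, users in groups
              if kind.startswith("shared") for u in users}
    inoculated = {u for _, kind, users in groups
                  if kind == "inoculation" for u in users}
    return sorted(shared & inoculated)

example = "group1:shared:user1,user2,user3\ngroup2:inoculation:user4,user5,user6\n"
print(conflicting_users(parse_groups(example)))  # []
```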

  DSPAM will read this file upon startup and determine if the user fits into
  any particular group.  
  
  Use the dspam_stats tool to keep an eye on the effectiveness of shared groups.
  If a shared group experiences poor performance, find the users whose email 
  behavior is inconsistent with that of the group and remove them from the 
  group.

  CLASSIFICATION
  Classification groups allow a group of users to network their results
  together.  If DSPAM is uncertain of whether a message is spam or nonspam for
  a group member, all other members of the group are queried.  If another
  member believes the message to be spam, it will be marked as spam.

  A user can simultaneously be a member of a classification and inoculation
  group, but a user cannot be a member of both a classification group and a
  shared group.

  VERSATILE LANGUAGE INOCULATION MESSAGES

  A new Internet-Draft has been released to the public to create a message
  format standard for sending inoculation data via email:

    http://www.ietf.org/internet-drafts/draft-spamfilt-inoculation-00.txt

  This will allow users on different servers, and even users of different 
  anti-spam tools, to share inoculation information with one another.

  DSPAM presently implements support for this message standard with the 
  following limitations:

  - Only inbound inoculation messages are supported.  DSPAM does not yet send
    out inoculations using this message format.  This should not be confused
    with local inoculation, which *is* supported.
  
  - The message/inoculation format is the only inoculation type presently
    supported.  Support for text/inoculation and multipart/inoculation is
    coming soon.

  - The only supported authentication mechanism is presently md5 verification
    codes/checksums.

  Any unsupported inoculations will simply be dropped.

  A list of identities and authentication information can be set up in the file
  [username].inoc or in the user's home directory in a .inoc file if
  homedir-dotfiles is enabled.  The format of this file is:

  sender1:shared secret
  sender2:shared secret

  Each sender should specify the correct sender id when sending an 
  inoculation, and should generate their checksum based on the shared secret
  established between both parties.

  NEURAL NETWORK

  Neural networks are similar to classification networks, but with some 
  differences.  First, all nodes in the network are queried sequentially,
  increasing execution time depending on the number of nodes in a network.
  Once the results from all nodes have been returned, the results from the most 
  reliable nodes are used.  Reliable nodes are determined based on how accurate
  they have been in the past.  Depending on the size of the network, the top
  20% of nodes (with a minimum of two nodes) are used. The reliability (and 
  results) are then combined to form a probability based on the results.  
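
  The node selection step can be sketched in Python as follows.  This is an
  illustration only, not the mysql_drv implementation: the name
  select_reliable_nodes and the reliability scores are hypothetical, and
  rounding the 20% figure up is an assumption.  How reliability and results
  are combined into a probability is not specified here, so that step is
  omitted.

```python
def select_reliable_nodes(reliability):
    """Return the most reliable node names: the top 20%, at least two."""
    ranked = sorted(reliability, key=reliability.get, reverse=True)
    # (n + 4) // 5 is 20% of n rounded up, using integer arithmetic only.
    count = max(2, (len(ranked) + 4) // 5)
    return ranked[:count]

# Ten nodes with made-up reliability scores between 0.0 and 0.9.
scores = {"node%d" % i: i / 10 for i in range(10)}
print(select_reliable_nodes(scores))  # ['node9', 'node8']
```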

  The advantage of using a neural network over a classification network is
  that the filter is capable of "learning" which users have dictionaries
  closer to their own mail behavior, and therefore provides better results.
  This data can be used in the future to create dynamic classification of 
  groups.

  Neural networking must be explicitly enabled using the configure flag
  --enable-neural-networking.  Neural networking is presently only
  supported by the mysql_drv storage driver, and is still experimental.
 
2.2 PARTICIPATING IN THE GLOBAL STATS PROJECT
                                                                                
  A small tool has been provided for administrators wishing to participate in
  the dspam global stats project.  The dspam_ngstats tool may be run nightly,
  weekly, or at any time interval you choose, and will report the global stats
  of your entire installation to the dspam stats collector, where they will
  be merged with the stats from other dspam implementations.  This will provide
  the community with one global statistic for the entire dspam project.  If you
  wish to participate, simply add dspam_ngstats to your crontab.  No
  information about your users or implementation is broadcast; only your
  system's IP address (for tracking purposes) is recorded along with your
  global stats.
                                                                                
  NOTE: If your server does not have a static public IP address, you will need
  to configure a unique id inside dspam_ngstats.c.  Change the definition of
  NGSTATS_UID from "" to "my_unique_id".  Your unique id may consist of
  underscores, periods, and alphanumeric characters.  An invalid id will
  cause the stats for that machine to be ignored.
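
  Before editing dspam_ngstats.c, a candidate id can be checked against the
  character rule above.  The following Python sketch is not part of DSPAM;
  the function name is_valid_uid is hypothetical, and restricting
  "alphanumeric" to ASCII letters and digits is an assumption.

```python
import re

# Underscores, periods, and ASCII alphanumerics only (assumed reading of
# the rule); at least one character is required by this check.
_UID_RE = re.compile(r"[A-Za-z0-9_.]+")


def is_valid_uid(uid):
    """Return True if uid uses only the permitted characters."""
    return _UID_RE.fullmatch(uid) is not None


if __name__ == "__main__":
    for candidate in ("my_unique_id", "mail.example_01", "bad id!"):
        print(candidate, is_valid_uid(candidate))
```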

2.3 EXTERNAL INOCULATION THEORY

  Bill Yerazunis recently expressed his theory of inoculation on an anti-spam
  development list, using the term "vaccination":

  "Part of the problem is that spam isn't stationary, it evolves. That 
   pesky .1% error rate is in some part due to the base mutation rate of spam 
   itself.  Maybe the answer is "vaccination".  Vaccination is using _one_ 
   person's misery be used to generate some protective agent that protects the 
   rest of the population; only the first person to get the spam actually has 
   to read it. 

   My expectation is this: say you have ten friends, and you all agree to share 
   your training errors.  Each of you will (statistically) expect to be the 
   first to see a new mutation of spam about 9% of the time; the other ten 
   friends in this group will have their bayesian filter trained preemptively 
   to prevent this.  Net result: you get a tenfold decrease in error rate - 
   down to 99.99% accuracy.  With a hundred such (trusted) friends, you may be 
   down to 99.999% accuracy."

   DSPAM has taken this concept and rolled it into support for what we call
   "inoculation groups" providing the exact functionality Bill describes.  This
   could be considered an "internal inoculation" practice.
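
   The arithmetic in the quote can be reproduced with a short Python sketch.
   It assumes, for simplicity, that every filtering error is caused by a new
   spam mutation and that each member of the group is equally likely to see
   it first; the quote itself only attributes "some part" of the error rate
   to mutation.  The function name group_error_rate is hypothetical.

```python
def group_error_rate(base_error, friends):
    """Estimate the error rate when training errors are shared.

    You are the first to see a new mutation one time in (friends + 1);
    the rest of the time a friend's report has already trained your
    filter.  Simplifying assumption: all errors come from new mutations.
    """
    return base_error / (friends + 1)


if __name__ == "__main__":
    base_error = 0.001  # the 0.1% error rate mentioned in the quote
    for friends in (10, 100):
        error = group_error_rate(base_error, friends)
        print(f"{friends} friends: accuracy about {100 * (1 - error):.4f}%")
```

   With ten friends the chance of being first is 1 in 11, or about 9%, which
   matches the figure given in the quote.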

   On top of this, DSPAM has been designed to support external inoculation as
   a complement to internal inoculation.  Here, instead of having your
   internal circle of friends inoculate you, you rely on external elements -
   namely spammers themselves - to inoculate you.

   The theory behind external inoculation is this: why put _anyone_ through
   the misery of being the first to receive a new spam when you can have
   the spammers themselves send it directly to you?  On top of this,
   external inoculation can be combined with internal inoculation by taking
   the spam you received externally and inoculating your friends with it
   internally.

   Inoculation is a little different from learning, as inoculation causes
   tokens to be given additional hit counts in an attempt to learn from a
   single email.  As a result, any form of inoculation should _only_ be
   attempted after an initial learning phase (perhaps when your filtering
   accuracy exceeds 99.0%).  DSPAM inoculates like this:

   1. Every token that does not already exist in the database, or that has
      fewer than two hits, is hit five times.

   2. All other tokens are hit twice.
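
   The two rules above can be sketched in Python as follows.  This is an
   illustration only, not DSPAM's implementation: the function name
   inoculate, the use of a plain dictionary of spam hit counts, and counting
   each distinct token once per message are assumptions.

```python
def inoculate(hits, tokens):
    """Apply the inoculation rules to a table of spam hit counts.

    hits maps token -> current spam hit count (a stand-in for the
    database); tokens is the list of tokens found in the message.
    """
    for token in set(tokens):  # assumption: each token counts once
        current = hits.get(token, 0)
        # Rule 1: unknown tokens, or tokens with fewer than two hits,
        # are hit five times.  Rule 2: all other tokens are hit twice.
        increment = 5 if current < 2 else 2
        hits[token] = current + increment
    return hits


if __name__ == "__main__":
    table = {"viagra": 10, "free": 1}
    print(inoculate(table, ["viagra", "free", "newtoken"]))
```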

   External inoculation is accomplished by creating a covert, external alias
   that is configured to automatically inoculate your dictionary from any
   messages it receives.  The covert alias can then be published onto a series
   of public newsgroups and websites where it is sure to be harvested by
   a spammer's tools.  One could even proactively subscribe oneself to
   several different opt-in spam lists, and so on.

   The first step is to configure an alias.  To do this you would use something
   like:

   bob_c:	"|/path/to/dspam --addspam --inoculate --user bob --corpus"

   The 'c' in bob_c stands for 'covert'.  A covert alias is needed because
   harvester tools will automatically strip the suffix from something obvious
   like 'bob-spam' and spam your real account.

   Once the alias is set up, make sure this alias gets out only on lists where
   harvesters will grab it, and nobody will send legitimate email to it.  
   It may even be a good idea to put it at the bottom of your tagline in all
   your publicly archived emails, something like...

   Spammers, send me mail here: bob_c@yourdomain.com

   Finally, you can multiply the effects of this by sharing an inoculation
   group with your friends.  If all of your friends have a public covert
   alias, then you will all be able to inoculate each other should one of you
   receive a spam to the account.  What a great way to train your filter!

   On top of this, should external inoculation become commonplace to the
   point where harvesters pick up as many covert aliases as legitimate
   email addresses, spammers will start to realize that harvesters are just
   plain too dumb to tell the difference (the spammers themselves could not
   tell whether a given address is covert).  In the best case, this could
   put an end to harvester bots by making them counter-productive.

3.0 BUGS, PORTS, AND THE LIKE

  Please report any questions, bugs, suggestions, and the like to the 
  dspam-users mailing list.  See the project website for details.

  If you port DSPAM to another platform, or would like to submit changes to
  the distribution, please email a diff along with any other pertinent 
  information to the dspam-dev mailing list.

  If you like DSPAM and want to buy the author pizza (or a ferrari),
  paypals may be sent to jonathan@nuclearelephant.com.

  Thanks =)

3.1 KNOWN BUGS

  - DSPAM presently does _not_ handle a mass forward of emails, only one
    forwarded message at a time.  Be sure to tell your users not to select
    multiple messages and forward them together, since this results in a
    single message being sent into DSPAM.  Users should forward each spam
    individually.

  - The Oracle storage driver is slow; this is primarily due to the fact that
    the agent has to establish a new connection with Oracle every time it is
    run.  This adds another 0.5 - 1.5 seconds of delay.

3.2 ADDING THE DSPAM LOGO BUTTON TO YOUR WEBSITE

  A small button has been included for those who would like to advertise dspam
  on their web page.  To use, copy the graphic (dspam-button.gif) into your
  web page's directory and use the following code wherever you'd like the
  button displayed:

  <A HREF="http://www.networkdweebs.com/software/dspam/">
  <IMG BORDER=0 SRC="dspam-button.gif"></A>

3.3 CVS ACCESS

  The DSPAM source tree can be downloaded via read-only cvs access using the
  following commands:

  cvs -d :pserver:cvs@cvs.nuclearelephant.com:/usr/local/cvsroot login
  cvs -d :pserver:cvs@cvs.nuclearelephant.com:/usr/local/cvsroot co dspam 

  DSPAM has been version-tagged in cvs so that you can checkout a particular
  version by using this format:

  cvs -d :pserver:cvs@cvs.nuclearelephant.com:/usr/local/cvsroot co -r dspam-2_7_4 dspam

