     System Requirements
     -------------------

To use Clean Feed you will need to be running INN 1.5.1 or later.  You
will need to have Perl installed, and you need to have the Perl hooks
compiled into INN.  The MD5 portion of the filter requires at least Perl
5.004, and the MD5 module from CPAN.  Running without MD5 will work with
Perl 5.003 and probably with 5.002 as well.

The MD5 Perl module can be found at:
   http://www.perl.com/CPAN-local/modules/by-module/MD5/
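
A few lines of Perl can check both requirements before you start.  This
is only a sketch: it assumes the module is loadable under the name MD5,
as the CPAN directory above suggests, and it merely reports what it
finds.

```perl
#!/usr/bin/perl
# Report whether this Perl meets the MD5 filter's requirements.
# Assumption: the CPAN module is loadable as "MD5".

# $] holds the interpreter version as a number, e.g. 5.004.
my $perl_ok = ($] >= 5.004) ? 1 : 0;

# eval keeps a missing module from aborting the script.
my $md5_ok = eval { require MD5; 1 } ? 1 : 0;

print "Perl version ", $], " is ",
      ($perl_ok ? "new enough" : "too old"), " for the MD5 filter\n";
print "MD5 module ", ($md5_ok ? "found" : "not found"), "\n";
```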


     Installation
     ------------

These directions assume you have the Perl hooks compiled into INN
already.  If you do not, you must add them and rebuild the whole INN
distribution before continuing.

If you are running INN 1.7.2+insync1.1d, skip the patch step described
next; that release already includes the patch.

First you must apply the included small patch to innd.  The patch is
against clean 1.5.1.  The patch will apply to 1.7, and I'm told it works
against 1.6b3 as well.  Apply it in the innd directory, rebuild the innd
executable, replace the old one with the new one, and restart innd:

cp filter.patch inn-1.5.1/innd
cd inn-1.5.1/innd
make clean
patch <filter.patch
make
mv /usr/news/bin/innd /usr/news/bin/innd.old
cp innd /usr/news/bin/innd
ctlinnd throttle "replacing innd"
ctlinnd shutdown "replacing innd"
/usr/news/bin/rc.news

...or whatever works on your system.  If you want to use the MD5
filter, see the additional instructions below.

  INN 1.7+insync:  The insync "jumbo patch" set contains the Perl filter
  body patch already.  Unfortunately, all versions prior to 1.1b also
  contained a bug in that patch.  The bug was fixed in the 1.1b release,
  so if you have that version, just let the body-filter portion of the
  patch fail.  To apply to an older 1.7+insync, first apply the patch;
  let the body filter part fail (if it asks you whether to apply anyway,
  tell it no).  Then load up innd/perl.c into an editor, go to line 59
  (line 62 in version 1.1), and change the 4 to an 8 (you'll see what
  I mean). 

  1.7.2+insync1.1d:  The 1.1d release of the insync patch set includes
  both patches.  You need not apply any patches to use Clean Feed with
  1.1d.  The 1.1c release included filter.patch but not the dynamic
  loading patch described below.  If you have 1.1c you can either apply
  this patch, or upgrade to 1.1d.

MD5 FILTER: New as of version 0.95.1 is a filter which uses MD5 checksums
of article message bodies to detect and reject spam.  If you want to use
this filter (trust me, you want to use it) there are a few more things
you have to do.  Note: this filter REQUIRES at least Perl 5.004.  If you
have 5.003 or earlier, you will have to run without MD5, or upgrade your
Perl; it is one of the easiest software installs around.
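
To make the idea concrete, here is a minimal sketch of checksum-based
duplicate detection.  It is not the filter's actual code: Digest::MD5
stands in for the older MD5 module, the names seen_too_often() and
%md5_seen are hypothetical, and the threshold value is illustrative
rather than the shipped default.

```perl
#!/usr/bin/perl
# Sketch of checksum-based duplicate detection; not the filter's code.
# Assumptions: Digest::MD5 replaces the older MD5 module, and
# seen_too_often() and %md5_seen are hypothetical names.
use strict;
use Digest::MD5 qw(md5_hex);

my $md5maxmultiposts = 3;   # illustrative value, not the shipped default
my %md5_seen;               # body checksum => number of copies seen

# Counts this body and returns true once it has been seen more than
# $md5maxmultiposts times.
sub seen_too_often {
    my ($body) = @_;
    my $sum = md5_hex($body);
    return ++$md5_seen{$sum} > $md5maxmultiposts;
}

for my $n (1 .. 5) {
    my $verdict = seen_too_often("Make money fast!\n") ? "reject" : "accept";
    print "copy $n: $verdict\n";
}
```

With these values the first three copies are accepted and the fourth
and fifth are rejected.  Headers play no part in the decision, since
only the body is checksummed.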

First, you need to apply another patch (included) to enable innd to load
dynamic Perl modules.  There are two versions of the patch included;
dynamic-load.patch is against 1.7+insync, and dynamic-1.5.1.patch is
against 1.5.1.  The patch is for lib/perl.c (NOT innd/perl.c).  You don't
need this patch if you are running 1.7.2+insync1.1d.

In order to compile INN with the new patch, you need to edit the PERL_LIB
entry in config.data.  Type this command at the shell, and paste its output
into config.data as PERL_LIB:
    perl -MExtUtils::Embed -e ldopts

Reportedly, you can instead enter that command, enclosed in backquotes,
directly as the PERL_LIB value.

After doing this you will need to rebuild the whole of INN and do a make
update.  You can do this step at the same time as applying the first
patch, above, if you are installing Clean Feed for the first time.

Finally, you need to install the MD5 Perl module, available at:
   http://www.perl.com/CPAN-local/modules/by-module/MD5

  AIX: There is a problem with Perl dynamic loading from INN under
  the AIX operating system.  In simple terms, it doesn't work.
  This seems to be a problem with the gcc compiler.  Success has
  been reported by rebuilding both Perl and INN with IBM's
  commercial compiler CSet (a.k.a. xlC).

Next, install the filter file (filter_innd.pl) under the same name
wherever your system expects to find it (the location is set in
config.data), then reload the filter:

cd /usr/news/bin/control
mv filter_innd.pl filter_innd.pl.old
cp /wherever/cleanfeed/filter_innd.pl .
emacs filter_innd.pl      # edit the configuration, see below
perl -cw filter_innd.pl   # check for mistakes!
ctlinnd reload filter.perl "raise the shields"

Now it's running!  To watch the fun ensue:

tail -f /var/log/news/news.notice


     Configuration
     -------------

MD5: If you are not running MD5 for whatever reason, the filter should
work as-is.  If it does not, comment out the two lines below the
config section that refer to MD5.

There are a few variables at the top of the filter file, which you can
set to your liking.
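
For orientation, this is how the settings whose defaults are stated in
the descriptions that follow might look in the filter file.  It is a
sketch assembled from this document, not a copy of the shipped file;
variables whose defaults are not stated here are left out, and writing
ON and OFF as 1 and 0 is an assumption.

```perl
# Defaults as stated in this document; check them against your own copy
# of filter_innd.pl before relying on them.
$EMPMaxLife         = 24;     # hours
$MD5MaxLife         = 24;     # hours
$EMPHistSize        = 500;    # raise to at least 3000 if not using MD5
$MD5HistSize        = 4500;
$block_binaries     = 1;      # ON by default (1/0 is an assumption)
$max_encoded_lines  = 15;
$block_mime_html    = 1;      # ON by default
$block_html         = 0;      # OFF by default
$block_late_cancels = 0;      # OFF by default
$statfile           = undef;  # stats file disabled
```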

$maxgroups - Reject crossposts to more than this number of groups.

$maxfollowups - Allow crossposts where followups are set to fewer
   than this many groups, even if posted to more than $maxgroups groups.
   This allows you to reject wide crossposts while allowing FAQs, etc.

$maxmultiposts - Start rejecting articles after we've seen this
   many copies (as determined by the EMP, or excessive multi-posting,
   filter).

$md5maxmultiposts - Same as above, but for the MD5 filter.  This is
   far less prone to false positives, so it can be set lower.

$tjfmaxgroups - A separate crosspost limit for test, forsale, and
   jobs newsgroups.

$ArticleHistory - (formerly ArticleHistSize) How many ids to
   remember for header-based EMP comparison.  Setting this higher
   will catch more spam because there will be a larger "window"
   to look in.  Larger settings will also consume more memory and
   have a (very small) impact on performance.  Most articles
   will actually take up two entries in this history.

$MD5History - How many articles to remember for MD5-based EMP
   comparison.  Since the MD5 filter is not prone to false
   positives, setting this higher will result in more spam
   caught, if you have the RAM to spare.

$EMPMaxLife - How long to keep the history of EMP identified by
   the header filter, for continual rejection.  In hours.  Default
   is 24 hours.

$MD5MaxLife - How long to keep the history of EMP identified by
   the MD5 filter.  In hours.  Default is 24 hours.

$EMPHistSize - The maximum allowed size of the EMP memory.  Use
   this as a "sanity check" so a sudden burst won't use up all of
   your memory.  Set this high enough so that you normally never
   hit this number; use the $EMPMaxLife to expire the hash instead.
   The default of 500 works well *if* you are running the MD5 filter
   as well; if you are not using MD5, raise this to at least 3000.

$MD5HistSize - Same as above, but for the MD5 EMP memory.  The default
   here is 4500, which works well for me.

$EMPstarttrimming - The filter doesn't waste time trimming the EMP
   memory until it has this many entries in it.  Just a minor
   performance enhancement during the first hours the filter is up.

$trimcycles - The EMP memories are trimmed every $trimcycles times
   through the filter.

$MIDmaxlife - How long to remember rejected message-ids so cancels
   for these posts can also be rejected.  This only has an effect
   if cancel-rejection is enabled (below).

$verbose - When on (set to 1, or any true value), the filter logs
   verbosely to news.notice; spam domains will be listed, etc.  When
   off, only general messages are logged, making the news.daily
   reports less interesting but much shorter and more to the point.

$block_binaries - Enables blocking of binary posts in non-binary
   newsgroups.  This is now ON by default, since the filter works
   great.

$max_encoded_lines - Sets the number of uuencoded or base64-encoded
   lines to allow before considering a post to be a binary.  This
   should be set high enough to pass regular PGP signatures.  Default
   is 15 lines, which may be a little low if you are lenient.

$block_mime_html - Enables blocking of MIME-encapsulated HTML posts.
   This does NOT affect straight text/html or multipart/alternative
   posts of the type created by Netscape and IE, but ONLY posts which
   are MIME-encapsulated HTML, a favorite format of sex spammers which
   often sneaks in under the EMP radar.  This is ON by default.

$block_html - Enables blocking HTML and multipart/alternative posts.
   Added by request, OFF by default.

$block_late_cancels - If set, cancels for recently rejected articles
   will be rejected.  Set the window with $MIDmaxlife (above).  This
   will result in a *huge* number of rejections.  If you're concerned
   about your downstream sites receiving the cancels, leave this off.
   If you need a major performance boost, turn it on.  OFF by default.

$statfile - If this is set to the full path of a file, a crude stats
   file will be written each time the filter is reloaded with ctlinnd
   reload filter.perl, or started up.  The file just shows how many
   entries are present in each of the EMP histories and the MID
   history.  This is useful to ensure that your retention-time is not
   too high, and that your max-size is not too low.  You want the EMP
   memory to expire by time, not max size, for best performance.  The
   default for this is undef, which disables the stat file.

$bin_allowed - This is a regular expression telling the anti-binary
   filter in which groups binaries are allowed.  If all groups in
   the Newsgroups header match this pattern, binaries are allowed.
   (This obviously has no effect when the binary filter is disabled.)
   Default is '\.binaries|alt\.sex\.pictures|alt\.anonymous\.messages|
   de\.alt\.dateien'.  alt.sex.pictures* groups are, of course, not
   binary groups, and were replaced by alt.binaries.pictures.erotica
   long ago, but many legit binary posts are crossposted there.  I
   don't carry these groups, so allowing binaries crossposted to them
   is fine.  You may disagree.  de.alt.dateien is the German version
   of alt.binaries.

$md5exclude - A regular expression indicating what groups should be
   exempt from the MD5 filter.  Default is to exclude *.test groups.

$allexclude - A regular expression indicating what groups should be
   completely exempt from the filter.  Default is to exclude clari.*
   groups.  If you carry ClariNet, use this, or some of those posts
   will hit the filter.
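
The "all groups must match" rule described for $bin_allowed can be
sketched as follows.  binaries_allowed() is a hypothetical helper
written for this illustration, not a function from filter_innd.pl, and
the group names passed to it are only examples.

```perl
#!/usr/bin/perl
# Sketch of the "all groups must match" rule; binaries_allowed() is a
# hypothetical helper, not a function from filter_innd.pl.
use strict;

# The default pattern quoted above, joined onto one line.
my $bin_allowed =
    '\.binaries|alt\.sex\.pictures|alt\.anonymous\.messages|de\.alt\.dateien';

# Takes the Newsgroups header value (comma-separated group names) and
# returns 1 only if every group matches $bin_allowed.
sub binaries_allowed {
    my ($newsgroups) = @_;
    for my $group (split /\s*,\s*/, $newsgroups) {
        return 0 unless $group =~ /$bin_allowed/;
    }
    return 1;
}

for my $header ('alt.binaries.misc,de.alt.dateien.misc',
                'alt.binaries.misc,comp.lang.perl.misc') {
    print "$header: ", (binaries_allowed($header) ? "allowed" : "blocked"), "\n";
}
```

The second header is blocked because one of its groups does not match
the pattern, even though the other does.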


Not really configuration, but some other variables of interest that you
may or may not care to look at and/or modify:

$badguys - This is a monster regular expression containing domains
   of known spammers.  Only the "middle" part of each domain is
   listed; these are checked as email addresses by appending a list
   of top-level domains to the end, and as URLs by adding http://
   and looking for an optional "www." before them.  If you add to
   this list, be *very* careful not to end up with "||" in there:
   the empty alternative between the two bars matches anything, so
   the expression will match every single post that comes through.

$badips - Same as above, but for the spammers who use IP addresses
   instead of domain names.

$exempt - Regular expression of NNTP-Posting-Hosts that are exempt
   from the posting-host-based EMP filter.  This is for systems where
   all posts contain the same NNTP-Posting-Host header, such as AOL,
   which if not exempted would end up hitting the EMP filter with
   all of their posts.  There aren't many of these out there; a
   "regular" multi-user system does not present a problem because the
   filter doesn't kick in until it sees a large number of posts from
   the same posting-host and also of the same length, in a short period
   of time.
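
The "||" hazard mentioned for $badguys is easy to demonstrate.  In this
sketch the domain fragments are made up, and the URL pattern wrapped
around the list is an assumption based on the description above, not
the filter's actual expression.

```perl
#!/usr/bin/perl
# Why "||" inside an alternation is dangerous.  The domain fragments
# are made up, and the surrounding URL pattern is an assumption.
use strict;

my $good = 'spammer-one|spammer-two';
my $bad  = 'spammer-one||spammer-two';   # note the empty alternative

my $innocent = "See http://www.innocent-example.com/ for details";

# An empty alternative matches the empty string, so with $bad the
# pattern succeeds as soon as "http://" has been matched.
for my $list ($good, $bad) {
    my $hit = ($innocent =~ m{http://(?:www\.)?(?:$list)}i)
        ? "match" : "no match";
    print "$list: $hit\n";
}
```

The first list leaves the innocent URL alone; the second matches it,
as it would any URL at all.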


Any of the individual filters can be disabled by commenting them out.

After modifying the code, *always* check your work by typing:
   perl -cw filter_innd.pl
You should get no errors and no warnings.

That's it for now.  Have fun.


     Credits
     -------

Written by Jeremy Nixon <jeremy@exit109.com>.
Cyclone port by David Riley.
Based on Jeff Garzik's EMP filter.

I can't possibly mention everyone who has submitted ideas or fixes
for the filter, but I'd like to acknowledge the substantial contributions
of several people:  Danhiel Baker, Frank Copeland, and Brian Moore.
Thanks, guys.

Copyright 1997 by Jeremy Nixon, All Rights Reserved.
This software may be distributed freely, provided it is intact (including
all the files from the original archive).  You may modify it, and you
may distribute your modified version, provided the original work is
credited to the appropriate authors, and your work is credited to you.

This filter is available at:

http://www.exit109.com/~jeremy/news/antispam.html
