		The CRM114 and CRM114 Mailfilter FAQ


*** What does CRM114 stand for?

- CRM stands for Controllable Regex Mutilator, concept # 114.  It's a
mutilating regex engine, designed to slice and dice text with the
vigor of a Cuisinart on an overripe zucchini.   There is no truth
to the rumor that it really means "Crash's Regex Monstrosity".


*** Very funny.  What does CRM114 _really_ stand for?

- CRM114, or more accurately, "the CRM114 Discriminator" is from the
Stanley Kubrick movie "Dr. Strangelove".  (an _excellent_ movie- you
should go get it and watch it.  Really.  Some critics have said it is
the greatest movie of its era; others are more accurate and call it
the greatest movie _ever_ made.  In my opinion, a hundred years from
now, Dr. Strangelove will be considered the _definitive_ satirical
history of the Cold War and perhaps of the second half of the 20th
Century).  But I digress.

Anyway, the "CRM114 Discriminator" in the movie is a fictional
accessory for a radio receiver that's "designed not to receive _at
all_", that is, unless the message is properly authenticated.  That
was the original goal of CRM114 - to discriminate between authentic
messages of known importance, and get rid of the rest.

Note the emphasis on "get rid of".  Unlike many other "filters", 
CRM114's default action is to read all of input, and put NOTHING onto
output.  The simplest possible CRM114 program does exactly that, 
in fact- read all of stdin, and throw it away.


*** I've got a _ton_ of spam in my library.  Why shouldn't I just 
    load it into CRM114 and get a head start on training?

Bad idea.  CRM114's learning algorithm is predicated on using the
TOE strategy - that is, Train Only Errors.  When CRM114's mailfilter
makes a mistake, you train in the right result and it will do better
next time.  

I've tested this _exhaustively_, spending a few CPU-weeks in the
process; CRM114 really does work best if you train in only errors, and
in the order encountered.  It's about a factor of two times more
accurate, and about a factor of two times faster during the execution.

The actual numbers work out something like this: I used a torture-test
corpus of 4147 messages, split roughly 60/40 between nonspam and spam.
Running TOE, with the 5th order polynomial and entropic correction,
the error rate curve showed a nice exponential approach to zero errors.
Reshuffling the corpus of 4147 messages ten different ways, the final 
error rate (that is, the error rate in the last 500 messages) was just
about 6.9 errors per 500 final messages, or 1.3% (very good on such
a difficult corpus.  I _personally_, when hand-scoring these messages,
get about a 30% error rate).

Training _every_ sample yielded about 14.9 errors in the final 500 
messages, or an error rate of about 2.9 %.  Interestingly, the error
curve for training every sample dove more quickly initially, but 
then _rose_ again as new items were trained.

The relative runtimes were roughly 14 minutes for TOE (training
only errors), versus roughly 29 minutes for training everything,
averaged across the 10 runs of 4147 messages each.
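
The ratios behind the "factor of two" claims can be checked directly.
The following is a small sketch that uses only the figures quoted
above; the variable names are illustrative.  The raw quotients come out
near 1.38% and 2.98%, which the text gives as roughly 1.3% and 2.9%.

```python
# Figures quoted in the text above; the names are illustrative only.
FINAL_WINDOW = 500        # messages in the final scoring window
TOE_ERRORS = 6.9          # mean errors in that window, train-only-errors
TRAIN_ALL_ERRORS = 14.9   # mean errors in that window, train-everything
TOE_MINUTES = 14          # approximate runtime per pass, TOE
TRAIN_ALL_MINUTES = 29    # approximate runtime per pass, train-everything

toe_rate = TOE_ERRORS / FINAL_WINDOW
all_rate = TRAIN_ALL_ERRORS / FINAL_WINDOW
error_ratio = TRAIN_ALL_ERRORS / TOE_ERRORS
time_ratio = TRAIN_ALL_MINUTES / TOE_MINUTES

print(f"TOE error rate:              {toe_rate:.2%}")
print(f"train-everything error rate: {all_rate:.2%}")
print(f"error ratio (all / TOE):     {error_ratio:.2f}")
print(f"runtime ratio (all / TOE):   {time_ratio:.2f}")
```

Both ratios land a little above 2, which is where "more than a factor
of 2 less accurate, and twice as slow" comes from.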

So, if you don't mind being something more than a factor of 2 less
accurate, and twice as slow, you can go ahead and train everything.
Seriously- if you want accuracy, start from an empty .css file and
train only errors, as you encounter them.
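
As a rough illustration of the TOE loop itself, here is a minimal
Python sketch.  The ToyFilter class is a hypothetical word-count
stand-in, NOT the CRM114 classifier; only the control flow (classify
first, learn only on a mistake, keep arrival order) reflects the
strategy described above.

```python
from collections import Counter


class ToyFilter:
    """Stand-in word-count classifier; not CRM114's actual classifier."""

    def __init__(self):
        self.counts = {"spam": Counter(), "nonspam": Counter()}

    def classify(self, text):
        words = text.lower().split()
        score = {label: sum(seen[w] for w in words)
                 for label, seen in self.counts.items()}
        # Ties (including the empty filter) default to nonspam.
        return "spam" if score["spam"] > score["nonspam"] else "nonspam"

    def learn(self, text, label):
        self.counts[label].update(text.lower().split())


def train_only_errors(filt, stream):
    """Feed messages in arrival order; learn only the misclassified ones."""
    trained = 0
    for text, truth in stream:
        if filt.classify(text) != truth:
            filt.learn(text, truth)
            trained += 1
    return trained


# Hypothetical four-message stream, in the order it "arrived".
stream = [
    ("cheap pills buy now", "spam"),
    ("lunch meeting at noon", "nonspam"),
    ("buy cheap pills today", "spam"),
    ("meeting moved to noon", "nonspam"),
]
filt = ToyFilter()
print(train_only_errors(filt, stream), "of", len(stream), "messages trained")
```

Most of the stream never gets trained at all, which is also why TOE
runs faster than training everything.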


*** But WHY does it work better to train only errors?

- Intuitively, here's how you can understand it:

If you train only on an error, that's close to the minimal change
necessary to obtain correct behavior from the filter.  

If you train in something that would have been classified correctly
anyway, you have now set up a prejudice (an inappropriately strong
reaction) to that particular text.

Now, that prejudice will make it _harder_ to re-learn correct behavior on
the next piece of text that isn't right.  Instead of just learning
the correct behavior, we first have to unlearn the prejudice, and
_then_ learn the correct behavior.  It can be done- but it doesn't
converge on the right answer as fast as never getting these unwarranted
prejudices in the first place.

In filters as in people, prejudices are generalizations that are best
avoided.  

There is a secondary effect as well, due to the limits to growth of
the .css files.  If you train everything, you will typically start
seeing CRM114 go into microgrooming at around a megabyte of text.
This is because there is a limited amount of space in a growth-limited
.css file.  When you reach this point, for every new feature added, at
least one old feature must be forgotten.  This loss of information is
a mixed blessing- although useful information is now being lost, old
prejudices are also being forgotten.  This slow tracking allows even
an aging, saturated CRM114 system data base to adapt to an evolving
spam stream.
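
The "one in, one out" behavior of a saturated file can be pictured
with a toy model.  This Python sketch is an assumption-laden
simplification: the BoundedFeatureTable class and its evict-the-weakest
rule are hypothetical and are not the real .css layout or the real
microgrooming policy.

```python
class BoundedFeatureTable:
    """Toy growth-limited feature store (hypothetical; not the .css format)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, feature):
        """Count a feature; return the feature forgotten to make room, if any."""
        if feature in self.counts:
            self.counts[feature] += 1
            return None
        evicted = None
        if len(self.counts) >= self.capacity:
            # Table is full: forget the weakest feature (assumed policy).
            evicted = min(self.counts, key=self.counts.get)
            del self.counts[evicted]
        self.counts[feature] = 1
        return evicted


table = BoundedFeatureTable(capacity=3)
for feature in ["a", "b", "b", "c", "c", "d"]:
    gone = table.add(feature)
    if gone is not None:
        print(f"added {feature!r}, forgot {gone!r}")
```

Once the table is full, every genuinely new feature costs one old one,
whether that old one was useful information or a stale prejudice.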



*** Why are the bucket files called .css files?  They aren't cascading
    style sheets.

The .css suffix for SBPH bucket files originally stood for CRM Sparse
Spectra, until it was pointed out by a colleague that "sparse spectra"
was actually taken by another related but different method.  The name
stuck, even though it was no longer strictly accurate.


*** How accurate is CRM114 for anti-spam filtering?

Depending on your spam/nonspam mix, _very_ accurate.  I regularly
clock over 99%; I've had months where it was over 99.9%.

DON'T expect this level of performance without training on your
errors for a week or two.


*** Why did you make the CRM114 language so weird?

- Because I had some ideas about how I thought about "filtering" and
wanted to see how they worked out in practice.  I had a bad experience
with PERL, so I wanted a language where everything was easy to
understand, where the actions of a particular statement could
always be determined without referring to ANY other statement, let
alone "magic mind reading" and "action-at-a-distance"...  I probably
would do it differently now that I've done it this way.


*** So, is CRM114 a mailfilter, or what?

- No, CRM114 is actually a language that makes it easy to write filters
of any sort.  The most useful of these so far is for mail filtering;
the CRM114 distribution pack contains a pretty reasonable mail filter
for people who want it to "just work".


*** What algorithm does the mailfilter use?

- There's a whole file that just describes it ("classify_details.txt")
in the distribution, but in short, it matches short phrases from the
incoming text with phrases you supplied previously as example text.
In reality, it does a lot of hashing and polynomials to make the run
time acceptable.

I call the filtering algorithm Sparse Binary Polynomial Hashing with
Bayesian Chain Rule evaluation (SBPH/BCR), which gives you a vague
idea of how it might work inside.
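
To make "sparse binary polynomial" a little less vague, here is a
Python sketch of how such phrase features could be generated.  The
details are assumptions for illustration: a window of 5 tokens, the
newest token always kept, a "<skip>" placeholder for dropped tokens,
and CRC32 as a stand-in hash.  The real specifics are in
classify_details.txt.

```python
import zlib
from itertools import product

WINDOW = 5  # assumed window length, in tokens


def sbph_features(tokens, window=WINDOW):
    """Yield sparse phrases: newest token kept, older ones optionally skipped."""
    for end in range(len(tokens)):
        older = tokens[max(0, end - window + 1):end]
        # Every keep/skip pattern over the older tokens gives one feature.
        for mask in product((False, True), repeat=len(older)):
            phrase = [tok if keep else "<skip>"
                      for tok, keep in zip(older, mask)]
            phrase.append(tokens[end])
            yield " ".join(phrase)


def bucket(feature, n_buckets=1 << 20):
    """Map a phrase to a bucket index (CRC32 is a stand-in hash)."""
    return zlib.crc32(feature.encode("utf-8")) % n_buckets


features = list(sbph_features("a b c d e".split()))
print(len(features), "features; a few of them:")
for feature in features[-3:]:
    print(repr(feature), "->", bucket(feature))
```

A full 5-token window yields 2**4 = 16 features in this sketch, so the
feature count grows much faster than the word count; hence the hashing.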

Note that in CRM114's Mailfilter, we do NOT do "special tagging", such
as creating special tokens saying "This was in the Subject" or 
"This was in the Received header chain".  The short-phrase sliding
window is long enough that such tokens aren't necessary.

Minor Update- by altering the weightings of different lengths of short
phrases, it's possible to change the behavior of SBPH/BCR from a
strict Bayesian, to an entropically-corrected Bayesian, to a
Markovian matcher. 


*** So that's it?

- Mathematically, yes.  But since about 2003-11-xx, the chain 
rule function has been updated with entropic correction; this puts 
more weight on longer chains.  

In effect, this is a Markov model of the data stream with lots
of hidden states.  So, SBPH/BCR is really not SBPH/BCR; it's more
of a Sparse Binary Polynomial Hashing / Bayesian-Markov Model (SBPH/BMM).

The really nice thing about SBPH/BMM is that it's slightly more
accurate than the previous SBPH/BCR and it's 100% upward compatible
with /BCR data files.  All the information was there, it just needed
to be used properly.  
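
The general shape of "chain rule with more weight on longer chains"
can be sketched in Python.  Everything specific here is an assumption
made for illustration: the WEIGHTS table, the form of
local_probability(), and the symmetric two-class treatment are
hypothetical and are not the formula CRM114 actually uses.

```python
# Hypothetical weights: phrase length -> weight (longer phrase, more weight).
WEIGHTS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}


def local_probability(spam_hits, nonspam_hits, weight, max_weight):
    """Assumed per-feature P(feature | spam); low weights stay near 0.5."""
    total = spam_hits + nonspam_hits
    confidence = weight / max_weight
    return 0.5 + confidence * (spam_hits - nonspam_hits) / (2.0 * (total + 1))


def chain_rule(prior_spam, p_feature_given_spam):
    """One Bayesian chain-rule step for a two-class problem."""
    # Assumption: P(feature | nonspam) is the complement of P(feature | spam).
    p_feature_given_nonspam = 1.0 - p_feature_given_spam
    numerator = p_feature_given_spam * prior_spam
    denominator = numerator + p_feature_given_nonspam * (1.0 - prior_spam)
    return numerator / denominator


def spam_probability(features, prior=0.5):
    """Fold (spam_hits, nonspam_hits, phrase_length) features into one score."""
    p = prior
    top = max(WEIGHTS.values())
    for spam_hits, nonspam_hits, length in features:
        p = chain_rule(p, local_probability(spam_hits, nonspam_hits,
                                            WEIGHTS[length], top))
    return p


short_evidence = [(3, 0, 1)]  # a 1-word phrase seen 3 times, only in spam
long_evidence = [(3, 0, 5)]   # a 5-word phrase with the same hit counts
print(spam_probability(short_evidence), spam_probability(long_evidence))
```

With identical hit counts, the longer phrase moves the result much
further from 0.5 than the single word does.  Note that the stored
counts are untouched; only how they are weighted changes, which fits
the point above about compatibility with existing data files.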


*** Why didn't you just use Bayesian filtering?

- I had played with single-word Bayesian filtering from '96 through
2000 and found that it could behave very well on very large input
texts (typically, tens to hundreds of megabytes).  But my first brutally
naive implementation was far too memory-intensive to use for real
filtering; Paul Graham and others have refined Bayesian filtering to
the point where it's actually very useful for large numbers of people
to use (by clipping the less significant words).

In short, I didn't think that Bayesian filtering would work as well as
it does; I was wrong.  So, I tried a different idea and it seems to
work pretty well too.  The two methods are closely related; SBPH/BCR
with a polynomial of order 1 (that is, phrase length == 1 word) is
completely equivalent to 1-word Bayesian filtering without
insignificant-word and hapax clipping.

(addendum: as of the Nov 5th 2002 edition of CRM114, the classifier
does indeed do full Bayesian matching on these polynomial features.
This improves accuracy out into the >99.9% region, and
December-2003-onward versions default to use Markov weighting as well,
which gives somewhat better accuracy than entropically corrected
Bayesian weighting. )


*** Can I use the CRM114 mailfilter from inside PROCMAIL?

- Yes.  You'll want to edit the file mailfilter.crm to change the
actions from "accept" to "exit /0/" when the mail is good, and
from mailing your spambucket account with "syscall ..." to an "exit /1/"
when the mail is spam.  But yes, you can.
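
The exit-status convention (0 for good mail, 1 for spam) is what lets
any calling program, procmail or otherwise, make the routing decision.
As a sketch of that contract only, here is a small Python wrapper.  The
deliver() function, the folder names, and the stand-in filter commands
are all hypothetical; substitute the real filter invocation for your
installation.

```python
import subprocess
import sys


def deliver(message_bytes, filter_cmd):
    """Pipe a message through a filter; exit status 0 = good, nonzero = spam."""
    result = subprocess.run(filter_cmd, input=message_bytes)
    return "inbox" if result.returncode == 0 else "spam-folder"


# Stand-in filter commands so the sketch runs without CRM114 installed.
fake_good = [sys.executable, "-c",
             "import sys; sys.stdin.read(); sys.exit(0)"]
fake_spam = [sys.executable, "-c",
             "import sys; sys.stdin.read(); sys.exit(1)"]

print(deliver(b"Subject: hi\n\nhello\n", fake_good))
print(deliver(b"Subject: BUY\n\nnow\n", fake_spam))
```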


*** It's making too many mistakes!  What did I do wrong?

- You probably didn't do anything wrong.  What's probably happened
is that your spam/nonspam mix is very different than mine.  This causes
the words and phrases in your spam/nonspam to not match up with the
words/phrases in mine.

The fix is to train the mailfilter anytime it makes an error.  The
filter learns very fast; you should see drastic improvement after a
single day of error feedback.  I usually pass 99% accuracy at two or
three days, starting from zero.

In extreme cases, delete the pre-generated spam.css and nonspam.css
files, and start from scratch with the training.  In one day, (and
assuming sufficient spam and nonspam) you should be around 97%, two
days 98%, and three days > 99%.



*** How much data does it take to get that accurate?

- Not a lot.  At 99.67% accuracy, I only had 84K of nonspam and 185K of
spam text.  Interestingly, because spam contains a lot of run-on HTML,
the total number of hashed datapoint features is roughly equal.
 

*** I tried training in a huge amount of spam or nonspam, and it hung!

- Actually, it probably didn't hang.  Training is slow, only 10K or
so per second, so a half-meg spam bucket may take on the order of
a minute or more to train in.  Give it time.  :)



*** I trained in (some huge amount) of spam and nonspam, and it
    doesn't work any more!!!	    

- As noted above, you can overflow the buckets in the .css files
if you train in too much spam or nonspam.  You should get very good
results with less than 100K each of spam and nonspam text
(roughly equal numbers of messages is good too).  

Use the most recent spam and nonspam you can get your hands on.
Don't use spam more than a few months old for training.

And realize, if you're doing any "bulk training", rather than
Train Only Errors, that you could be doing 2x _better_ if you
trained only errors.  So there.  :)


*** Does CRM114 or the mailfilter work for any language besides English?

CRM114 uses 8-bit ASCII characters, and is 8-bit clean except for NULL
string terminations (which are forced by the GNU REGEX library, not my
decision).  If you use the included (and defaulted) TRE regex engine instead,
it's a NULL-clean system and you should be OK for 8-bit languages.

BUT if you use a unicode-based or other wide-character language,
you'll need to port up CRM114 to use wchar instead of char, as well as
getting unicode-clean regex libraries (there is a version of TRE that
does that, nicely enough).  This is not a minor undertaking, but if
you do it, please let me know and I'd gladly roll your changes back
into the standard CRM114 kit.

That said, if you get _mail_ in any language other than English, there
are two possibilities.  If you're lucky, you use a language that fits
in 8-bit characters.  In that case, you can just delete the spam.css
and nonspam.css files, and re-train the mailfilter for your local spam
mix.  Otherwise, you're stuck with wchars, so see above.

(Note: new versions of CRM114 since August 2003 default to use the TRE
library, which is both 8-bit-safe and has fewer edge errors than the
GNU library.  The GNU-based version remains available as a Makefile
option for those who depend on the GNU idiosyncrasies.)


*** Why is LEARNing or CLASSIFYing so slow?

- It's not _that_ slow.  In fact, it's really quite fast.  With a
(relatively slow) Transmeta Crusoe 666MHz and a slow laptop disk,
CRM114 can "learn" at about 10 Kbytes/second, and can classify text
about twice as fast (20 Kbytes/second), which compares very favorably with
genetic algorithms or neural nets.  Of course, that assumes that the
.css file is already in the UMB's (a reasonable assumption); if
they're not, add a reasonable amount of time for disk I/O to page in
the needed bits.

Note that this is still quite a bit slower than things like Paul
Graham's stuff, or Eric Raymond's Bogofilter.  It's a trade-off; 
that's what Open Source is all about.

Note that because LEARNing and CLASSIFYing do a lot of very randomized
accesses into the bucket files, these two verbs will thrash cache
pretty intensely.  I've had reports that 16MB bucket files will learn
or classify at horrendously slow rates- the results are still correct
and accurate, but it's very annoying.  We have a workaround plan (to
do sorted access, or use a tree structure) in consideration.
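
The "sorted access" idea can be illustrated with a toy measurement in
Python.  This is only a sketch of the general technique, not a
description of anything CRM114 does: CRC32 stands in for the real hash,
and total_seek_distance() is a crude, hypothetical proxy for cache and
paging cost.

```python
import zlib


def bucket_indices(features, n_buckets):
    """Hash each feature to a bucket index (CRC32 is a stand-in hash)."""
    return [zlib.crc32(f.encode("utf-8")) % n_buckets for f in features]


def total_seek_distance(indices):
    """Sum of jumps between consecutive accesses; a crude locality proxy."""
    return sum(abs(b - a) for a, b in zip(indices, indices[1:]))


features = [f"phrase number {i}" for i in range(1000)]
indices = bucket_indices(features, n_buckets=1 << 20)

print("hash order:  ", total_seek_distance(indices))
print("sorted order:", total_seek_distance(sorted(indices)))
```

Visiting the same buckets in sorted order sweeps through the file once
instead of jumping all over it, at the price of computing and sorting
all the hashes up front.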

(Note: in the ever-improving scheme of Free Software, the speed of
CRM114 has been continuously improved; we are now several times FASTER
than SpamAssassin.)

*** Why is CRM114 such a memory pig?

- It's not _that piggish_.  To keep speed up, the CRM114 engine
preallocates five buffers for data; each buffer is the size of a data
window (default 8 megabytes each, change it with the -w option).
Small buffers are allocated dynamically on the stack; expect to see
50K or so there.  LEARN and CLASSIFY use mmap to access the .css files
as part of virtual memory, so each .css file will consume a fair
amount of virtual memory (by default, 24 megabytes per .css file, but
this is released as soon as it's no longer needed, and since it's
mmaped rather than malloced, it does not require paging file or
swapfile space).
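
A back-of-the-envelope estimate from those defaults, as a Python
sketch.  The estimated_footprint() helper is hypothetical, and the
count of two .css files (one spam, one nonspam) is an assumption; the
small stack buffers are ignored.

```python
MB = 1024 * 1024


def estimated_footprint(window_mb=8, n_buffers=5, css_files=2, css_mb=24):
    """Return (preallocated bytes, mmapped bytes) from the quoted defaults."""
    preallocated = n_buffers * window_mb * MB
    mapped = css_files * css_mb * MB
    return preallocated, mapped


pre, mapped = estimated_footprint()
print(f"preallocated buffers: {pre // MB} MB")
print(f"mmapped .css files:   {mapped // MB} MB (virtual, not swap-backed)")

# A smaller data window shrinks only the preallocated part.
small_pre, _ = estimated_footprint(window_mb=1)
print(f"with a 1 MB window:   {small_pre // MB} MB preallocated")
```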


*** Aren't you afraid spammers will dissect CRM114 in order to beat it?

- Not really.  The basis of the LEARN/CLASSIFY filter is to look at
significant phrases in human language.  At least in English, there are
relatively few "natural" phrases one can use to sell Viagra, porn, or
low-interest mortgages.  So, a spammer trying to beat CRM114 would
have to avoid those phrases, and instead use phrases used in normal
non-sales-pitch discourse.

The cool part is that the non-sales-pitch discourse has no way to
express the sales pitch!  The medium cannot carry the message, there's
just no way to say it.  So the spammers are simply unable to function.


*** That sounds awfully close to 1984 and Newspeak.

- Yes, I realize this, and _yes_, it bothers me.  CRM114 could provide a
uniquely powerful tool for censorship.  But from what I can tell from
the public literature, the concept of phrase analysis is nothing new
to the CIA or the NSA.


*** Why can't you give me your sample spam and nonspam files?

- I can't give the text out because I don't own the copyright on it!
Spam text often has a copyright notice at the bottom, and nonspam text
(stuff my friends/cow-orkers/etc send me) is clearly copyright _them_,
not _me_.  So, it would be a gross breach of confidence at the very
least, if not an outright violation of any reasonable copyright law,
for me to distribute that text.

Fear not, you don't _have_ to trust my "magic files" to not contain
a hidden agenda.  You can rebuild the .css files with your own
spamtext.txt and nonspamtext.txt files easily.  Just delete *.css and
then create two files of spam and nonspam "spamtext.txt" and
"nonspamtext.txt".  Run the "make cssfiles" command and new .css files
will be built.

Even better, delete the .css files, type 

     cssutil -b -s spam.css
     cssutil -b -s nonspam.css

and train only errors for a few days; you'll end up with a highly accurate
filter that matches exactly the kind of mail you get, and the kind
of spam you get.


*** When will CRM114 go to full Bayesian?

As of Nov 1 2002, it has.  :-)  See the file "classify_details.txt" for 
the full scoop.

We may change the Bayesian Chain Rule at some point in the future; the
reason is that the standard Bayesian Chain Rule (BCR) has an
underlying assumption of statistical independence on the input events.
Unfortunately, spam features and nonspam features are NOT independent
and so BCR is really quite incorrect to use.  I'm working on better
alternatives and they will appear as they are found, tested, and
proven to work better than BCR.

	