Bogofilter and Repetitive Training

Introduction and general description:

Bogofilter is a Bayesian spam filter; background is presented on my general bogofilter page.  Bogofilter runs a modification of Paul Graham's original calculation method that was suggested by Gary Robinson.  The basic principle is that known spam and nonspam messages are presented to bogofilter for training; the messages are broken up into tokens (words, IP addresses and so forth) and the frequency of occurrence of each token in spam and nonspam is counted on a cumulative basis.  Bogofilter can then compare these frequencies to token frequencies in a message to be classified, and calculate a score that indicates the likelihood that the message is spam or nonspam.

Initially people thought that every available message should be used to train bogofilter.  Some even do this automatically; bogofilter has an option for that purpose.  Of course the messages have to be verified and errors "backed out" of the training database afterward, or else bogofilter's inevitable mistakes will degrade the accuracy of training.

The suggestion was soon made that, after the training database had reached a reasonable size (some 10,000 each of spam and nonspam had been used for training), one could save time and storage by training only on errors (messages that were classed wrongly, or classed as "unsure."  Tests soon showed that this worked quite well, both for bogofilter users and for participants in the Spambayes project, which did a lot of useful research into both tokenization and scoring.

Training on error is theoretically questionable, because in principle the distribution of tokens in the training database should match the distribution in the message population.  Nevertheless, it works for those who have used it extensively, and success is deemed to excuse some departure from strict theoretical rigour. (In fact, doing Bayesian analysis even with full training already violates an assumption of independence on which Bayesian theory depends.)

Once people started training on error -- some folks even skipped the initial stage of building the database to reasonable size by full training, though that has been shown to be less efficient -- the question naturally arose, "Once the filter has made a mistake, why not train with that message repeatedly till the program gets it right?"  And indeed many people had the empirical experience that doing this with the occasional "difficult" message did improve accuracy to some extent.  However, concern was raised that this could lead to much worse distortion of the token distribution within the training database, and eventually to degraded accuracy: the ability to recognize the erroneous messages used in repeated training might improve, but the ability to recognize similar but not identical messages might suffer.  Putting it glibly, training repeatedly on a dachshund is probably not going to help a Bayesian classifier learn to distinguish a St. Bernard from a cat.

The purpose of the experimentation described here, then, was to determine whether, after one has begun by building a good-sized training database without repetition, training repetitively helps or harms bogofilter's ability to classify hitherto unseen messages correctly.

Procedure and Results:

Two separate experiments were run, both using the following protocol:

The available spam and nonspam corpora were each split into two parts, one of which contained a third of the messages and the other the remaining two thirds.  This was done by "dealing out" the messages as one would deal a deck of cards, putting two cards in one pile and one in the other, then repeating the process till the cards had all been dealt.  The smaller parts should thus have been closely representative of the total spam and nonspam.

The smaller parts were used to create a training database by "full training" and the optimal bogofilter parameters were determined using bogotune, with the newly created training database and the larger parts of the original corpora for testing.  These test messages remained "new" in that none of them were ever used for training.

The following steps were then iterated: the smaller parts were classified using the newly determined parameters, and the messages that were classified wrongly or as unsure were used to train the database further.  Then bogotune was run again (training usually alters the optimal parameter values), still with the larger parts as test messages.  At the beginning, and each time through this loop, bogotune's expected false-negative count was recorded, using the cutoff setting for the lowest false-positive count in the bogotune recommendations.  Also recorded were the numbers of spams and nonspams to be used in training for the next round.

According to the hypothesis under test, some initial improvement in discrimination might be observed.  I would then expect the training database to become too closely representative of the idiosyncracies of the training set (the smaller parts of the corpora), at which point bogofilter's classification accuracy should begin to degrade.  These experiments were designed to test that latter expectation.

There were two experiments.  The first one used messages taken from my personal mailbox, 30,198 nonspam and 30,155 spam.  The training set thus comprised 10,066 of the nonspam and 10,051 of the spam.  There were six iterations of repetitive training-on-error; the first four showed the expected improvement in accuracy, but the fifth and sixth displayed significant degradation. (The training leading to the fifth round was carefully verified in the log to make sure there was no error in manipulation.) The results are as follows (the columns show the number of iterations of training, the expected false negatives reported by bogotune, and the numbers of nonspam and spam that were wrongly or uncertainly classified and were therefore used in the next round of training; the right-hand three columns show the same results in percentage form):


 run testfn trainfp trainfn testfnpc trainfppc trainfnpc
   0    520       1     160     2.59   0.00993      1.59
   1    505       1     147     2.51   0.00993      1.46
   2    481       1     126     2.39   0.00993      1.25
   3    510       0     127     2.54   0.00000      1.26
   4    477       1     113     2.37   0.00993      1.12
   5    649      54     173     3.23   0.53646      1.72
   6    724       9     174     3.60   0.08941      1.73

The percentages of false negatives in test and training are shown on the left panel of the figure.

graphs

So far, the working hypothesis seemed to be holding up.  While this first experiment was running, a second one was being run as well, this time with messages from an organization of about 80 users with diverse interests.  There were 54,229 nonspam and 59,371 spam altogether, so the training set included 18,076 nonspam and 19,791 spam.  The discrimination was, in this case, superior to that observed in the first experiment, though a higher proportion of false positives was encountered; given that the starting training database was larger, and the number of messages used in each round of training-on-error was smaller, it was to be expected that this experiment would have to run to more iterations before a conclusion was reached:


 run testfn trainfp trainfn testfnpc trainfppc trainfnpc
   0   1065      67      21     2.69     0.371     0.106
   1   1022      67      21     2.58     0.371     0.106
   2    941      67      21     2.38     0.371     0.106
   3    678      67      21     1.71     0.371     0.106
   4    619      67      21     1.56     0.371     0.106
   5    643      67      21     1.62     0.371     0.106
   6    589      67      21     1.49     0.371     0.106
   7    567      67      21     1.43     0.371     0.106
   8    576      67      21     1.46     0.371     0.106
   9    521      67      21     1.32     0.371     0.106
  10    521      67      21     1.32     0.371     0.106
  11    505     204      69     1.28     1.129     0.349
  12    544     204      69     1.37     1.129     0.349
  13    578     204      69     1.46     1.129     0.349

The right-hand panel of the figure shows the corresponding data for this experiment.  The accuracy of classifying the test data at first improved quite significantly (the error rate being more than halved), but eventually degradation began to set in.

In the following two tables the parameters determined by tuning are listed. The columns are number of rounds of repetitive training, x, minimum deviation, s, the spam and ham cutoffs, the numbers of correctly classified spam and nonspam, the false positives and false negatives, and the error rate among the test messages (the last five columns repeat data already presented):


 run        x    md      s    sc  hc   sp    ns fp  fn errpc
   0 0.430980 0.100 0.0100 0.629 0.1 9891 10065  1 160  2.59
   1 0.434814 0.070 0.0100 0.669 0.1 9904 10065  1 147  2.51
   2 0.489179 0.065 0.0100 0.688 0.1 9925 10065  1 126  2.39
   3 0.441655 0.035 0.0316 0.843 0.1 9924 10066  0 127  2.54
   4 0.423983 0.100 0.1000 0.793 0.1 9938 10065  1 113  2.37
   5 0.585336 0.265 0.1000 0.977 0.1 9878 10012 54 173  3.23
   6 0.498972 0.275 0.1778 0.995 0.1 9877 10057  9 174  3.60

 run    x    md      s    sc    hc    sp    ns  fp fn errpc
   0 0.52 0.435 0.1000 0.999 0.364 19770 18009  67 21  2.69
   1 0.52 0.435 0.1000 0.997 0.380 19770 18009  67 21  2.58
   2 0.52 0.450 0.0562 0.995 0.449 19770 18009  67 21  2.38
   3 0.48 0.070 0.0562 0.999 0.232 19770 18009  67 21  1.71
   4 0.60 0.100 0.1000 0.997 0.241 19770 18009  67 21  1.56
   5 0.55 0.050 0.0316 0.998 0.265 19770 18009  67 21  1.62
   6 0.54 0.065 0.0562 0.995 0.231 19770 18009  67 21  1.49
   7 0.55 0.065 0.0562 0.993 0.223 19770 18009  67 21  1.43
   8 0.53 0.070 0.0562 0.993 0.204 19770 18009  67 21  1.46
   9 0.60 0.115 0.0562 0.978 0.190 19770 18009  67 21  1.32
  10 0.60 0.115 0.0562 0.978 0.190 19770 18009  67 21  1.32
  11 0.60 0.115 0.1000 0.978 0.163 19722 17872 204 69  1.28
  12 0.48 0.100 0.1000 0.985 0.152 19722 17872 204 69  1.37
  13 0.55 0.100 0.1000 0.991 0.133 19722 17872 204 69  1.46

The resistance of the training database to change during the first ten rounds of training on error may be due to the presence of short messages that are particularly difficult to classify.

Conclusion:

The conclusion can be drawn that some number of repetitions of training -- perhaps four would be a safe choice -- can indeed "boost" bogofilter's accuracy; there may, however, be an optimum that varies with the size and homogeneity of the message corpus.

Note that the sensitivity of bogofilter's accuracy to overtraining should decrease as the size of the training database increases.  I would expect, though I have not tested this and some people report otherwise, that training on error ab initio would be yet more vulnerable to overtraining than the present results indicate.

Appendix: Notes on experimental procedure

The following notes should suffice if anyone wishes to repeat the experiment:

cat distrib
FILE=($1.msg $1.msg $1.train)
let n=${FILENO}%3
fname=${FILE[$n]}
cat >>$fname

cat runex
#! /bin/bash
#  runex for repetitive training experiment
#  needs sp.train, ns.train, sp.msg, ns.msg and directory db

# build initial training db
/bin/rm -f db/wordlist.db
cat sp*.train | bogofilter -d db -s
cat ns*.train | bogofilter -d db -n

# 20 iterations is plenty
for iter in `seq 0 20`; do
    # get parameter values
    /usr/bin/bogotune -d db -v -s sp*.msg -n ns*.msg | tee log.$iter
    eval `fgrep = log.$iter`

    # check for wrongly classified nonspam
    classify ns.train 3 -d db -m $min_dev,$robs,$robx \
      -o $spam_cutoff,$ham_cutoff
    grep -c '^From ' corpus.* | tee -a log.$iter
    mv corpus.unsure corpus.nstrain
    test -f corpus.bad && cat corpus.bad >>corpus.nstrain
    /bin/rm -f corpus.good corpus.bad

    # check for wrongly classified spam
    classify sp.train 3 -d db -m $min_dev,$robs,$robx \
      -o $spam_cutoff,$ham_cutoff
    grep -c '^From ' corpus.* | tee -a log.$iter
    mv corpus.unsure corpus.sptrain
    test -f corpus.good && cat corpus.good >>corpus.sptrain
    /bin/rm -f corpus.good corpus.bad

    # train on wrongly classified messages
    bogofilter -d db -s < corpus.sptrain
    bogofilter -d db -n < corpus.nstrain
    /bin/rm corpus.*
done

cat classify
helper=bogowrap
src=$1;
if [ "x$2" = "x3" ]; then helper=class3; shift; fi
shift
formail -s $helper $* < $src

cat class3
#! /bin/sh
cat >msg.$$
/usr/bin/bogofilter $* < msg.$$
res=$?
if [ $res = 0 ]; then
    cat msg.$$ >>corpus.bad
elif [ $res = 1 ]; then
    cat msg.$$ >>corpus.good
elif [ $res = 2 ]; then
    cat msg.$$ >>corpus.unsure
fi
/bin/rm msg.$$


runex

The data reduction (calculating and plotting percentages) is sufficiently trivial that I don't think it's necessary to show them here. Those, and the detailed logs, are available on request.

Greg Louis, 2004; last modified 2004-03-11]