Initially people thought that every available message should be used to train bogofilter. Some even do this automatically; bogofilter has an option for that purpose. Of course the messages have to be verified and errors "backed out" of the training database afterward, or else bogofilter's inevitable mistakes will degrade the accuracy of training.
The suggestion was soon made that, after the training database had reached a reasonable size (some 10,000 each of spam and nonspam had been used for training), one could save time and storage by training only on errors (messages that were classed wrongly, or classed as "unsure." Tests soon showed that this worked quite well, both for bogofilter users and for participants in the Spambayes project, which did a lot of useful research into both tokenization and scoring.
Training on error is theoretically questionable, because in principle the distribution of tokens in the training database should match the distribution in the message population. Nevertheless, it works for those who have used it extensively, and success is deemed to excuse some departure from strict theoretical rigour. (In fact, doing Bayesian analysis even with full training already violates an assumption of independence on which Bayesian theory depends.)
Once people started training on error -- some folks even skipped the initial stage of building the database to reasonable size by full training, though that has been shown to be less efficient -- the question naturally arose, "Once the filter has made a mistake, why not train with that message repeatedly till the program gets it right?" And indeed many people had the empirical experience that doing this with the occasional "difficult" message did improve accuracy to some extent. However, concern was raised that this could lead to much worse distortion of the token distribution within the training database, and eventually to degraded accuracy: the ability to recognize the erroneous messages used in repeated training might improve, but the ability to recognize similar but not identical messages might suffer. Putting it glibly, training repeatedly on a dachshund is probably not going to help a Bayesian classifier learn to distinguish a St. Bernard from a cat.
The purpose of the experimentation described here, then, was to determine whether, after one has begun by building a good-sized training database without repetition, training repetitively helps or harms bogofilter's ability to classify hitherto unseen messages correctly.
The available spam and nonspam corpora were each split into two parts, one of which contained a third of the messages and the other the remaining two thirds. This was done by "dealing out" the messages as one would deal a deck of cards, putting two cards in one pile and one in the other, then repeating the process till the cards had all been dealt. The smaller parts should thus have been closely representative of the total spam and nonspam.
The smaller parts were used to create a training database by "full training" and the optimal bogofilter parameters were determined using bogotune, with the newly created training database and the larger parts of the original corpora for testing. These test messages remained "new" in that none of them were ever used for training.
The following steps were then iterated: the smaller parts were classified using the newly determined parameters, and the messages that were classified wrongly or as unsure were used to train the database further. Then bogotune was run again (training usually alters the optimal parameter values), still with the larger parts as test messages. At the beginning, and each time through this loop, bogotune's expected false-negative count was recorded, using the cutoff setting for the lowest false-positive count in the bogotune recommendations. Also recorded were the numbers of spams and nonspams to be used in training for the next round.
According to the hypothesis under test, some initial improvement in discrimination might be observed. I would then expect the training database to become too closely representative of the idiosyncracies of the training set (the smaller parts of the corpora), at which point bogofilter's classification accuracy should begin to degrade. These experiments were designed to test that latter expectation.
There were two experiments. The first one used messages taken from my personal mailbox, 30,198 nonspam and 30,155 spam. The training set thus comprised 10,066 of the nonspam and 10,051 of the spam. There were six iterations of repetitive training-on-error; the first four showed the expected improvement in accuracy, but the fifth and sixth displayed significant degradation. (The training leading to the fifth round was carefully verified in the log to make sure there was no error in manipulation.) The results are as follows (the columns show the number of iterations of training, the expected false negatives reported by bogotune, and the numbers of nonspam and spam that were wrongly or uncertainly classified and were therefore used in the next round of training; the right-hand three columns show the same results in percentage form):
run testfn trainfp trainfn testfnpc trainfppc trainfnpc 0 520 1 160 2.59 0.00993 1.59 1 505 1 147 2.51 0.00993 1.46 2 481 1 126 2.39 0.00993 1.25 3 510 0 127 2.54 0.00000 1.26 4 477 1 113 2.37 0.00993 1.12 5 649 54 173 3.23 0.53646 1.72 6 724 9 174 3.60 0.08941 1.73

So far, the working hypothesis seemed to be holding up. While this first experiment was running, a second one was being run as well, this time with messages from an organization of about 80 users with diverse interests. There were 54,229 nonspam and 59,371 spam altogether, so the training set included 18,076 nonspam and 19,791 spam. The discrimination was, in this case, superior to that observed in the first experiment, though a higher proportion of false positives was encountered; given that the starting training database was larger, and the number of messages used in each round of training-on-error was smaller, it was to be expected that this experiment would have to run to more iterations before a conclusion was reached:
run testfn trainfp trainfn testfnpc trainfppc trainfnpc 0 1065 67 21 2.69 0.371 0.106 1 1022 67 21 2.58 0.371 0.106 2 941 67 21 2.38 0.371 0.106 3 678 67 21 1.71 0.371 0.106 4 619 67 21 1.56 0.371 0.106 5 643 67 21 1.62 0.371 0.106 6 589 67 21 1.49 0.371 0.106 7 567 67 21 1.43 0.371 0.106 8 576 67 21 1.46 0.371 0.106 9 521 67 21 1.32 0.371 0.106 10 521 67 21 1.32 0.371 0.106 11 505 204 69 1.28 1.129 0.349 12 544 204 69 1.37 1.129 0.349 13 578 204 69 1.46 1.129 0.349
In the following two tables the parameters determined by tuning are listed. The columns are number of rounds of repetitive training, x, minimum deviation, s, the spam and ham cutoffs, the numbers of correctly classified spam and nonspam, the false positives and false negatives, and the error rate among the test messages (the last five columns repeat data already presented):
run x md s sc hc sp ns fp fn errpc 0 0.430980 0.100 0.0100 0.629 0.1 9891 10065 1 160 2.59 1 0.434814 0.070 0.0100 0.669 0.1 9904 10065 1 147 2.51 2 0.489179 0.065 0.0100 0.688 0.1 9925 10065 1 126 2.39 3 0.441655 0.035 0.0316 0.843 0.1 9924 10066 0 127 2.54 4 0.423983 0.100 0.1000 0.793 0.1 9938 10065 1 113 2.37 5 0.585336 0.265 0.1000 0.977 0.1 9878 10012 54 173 3.23 6 0.498972 0.275 0.1778 0.995 0.1 9877 10057 9 174 3.60 run x md s sc hc sp ns fp fn errpc 0 0.52 0.435 0.1000 0.999 0.364 19770 18009 67 21 2.69 1 0.52 0.435 0.1000 0.997 0.380 19770 18009 67 21 2.58 2 0.52 0.450 0.0562 0.995 0.449 19770 18009 67 21 2.38 3 0.48 0.070 0.0562 0.999 0.232 19770 18009 67 21 1.71 4 0.60 0.100 0.1000 0.997 0.241 19770 18009 67 21 1.56 5 0.55 0.050 0.0316 0.998 0.265 19770 18009 67 21 1.62 6 0.54 0.065 0.0562 0.995 0.231 19770 18009 67 21 1.49 7 0.55 0.065 0.0562 0.993 0.223 19770 18009 67 21 1.43 8 0.53 0.070 0.0562 0.993 0.204 19770 18009 67 21 1.46 9 0.60 0.115 0.0562 0.978 0.190 19770 18009 67 21 1.32 10 0.60 0.115 0.0562 0.978 0.190 19770 18009 67 21 1.32 11 0.60 0.115 0.1000 0.978 0.163 19722 17872 204 69 1.28 12 0.48 0.100 0.1000 0.985 0.152 19722 17872 204 69 1.37 13 0.55 0.100 0.1000 0.991 0.133 19722 17872 204 69 1.46
Note that the sensitivity of bogofilter's accuracy to overtraining should decrease as the size of the training database increases. I would expect, though I have not tested this and some people report otherwise, that training on error ab initio would be yet more vulnerable to overtraining than the present results indicate.
cat distrib
FILE=($1.msg $1.msg $1.train)
let n=${FILENO}%3
fname=${FILE[$n]}
cat >>$fname
cat runex
#! /bin/bash
# runex for repetitive training experiment
# needs sp.train, ns.train, sp.msg, ns.msg and directory db
# build initial training db
/bin/rm -f db/wordlist.db
cat sp*.train | bogofilter -d db -s
cat ns*.train | bogofilter -d db -n
# 20 iterations is plenty
for iter in `seq 0 20`; do
# get parameter values
/usr/bin/bogotune -d db -v -s sp*.msg -n ns*.msg | tee log.$iter
eval `fgrep = log.$iter`
# check for wrongly classified nonspam
classify ns.train 3 -d db -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff
grep -c '^From ' corpus.* | tee -a log.$iter
mv corpus.unsure corpus.nstrain
test -f corpus.bad && cat corpus.bad >>corpus.nstrain
/bin/rm -f corpus.good corpus.bad
# check for wrongly classified spam
classify sp.train 3 -d db -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff
grep -c '^From ' corpus.* | tee -a log.$iter
mv corpus.unsure corpus.sptrain
test -f corpus.good && cat corpus.good >>corpus.sptrain
/bin/rm -f corpus.good corpus.bad
# train on wrongly classified messages
bogofilter -d db -s < corpus.sptrain
bogofilter -d db -n < corpus.nstrain
/bin/rm corpus.*
done
cat classify
helper=bogowrap
src=$1;
if [ "x$2" = "x3" ]; then helper=class3; shift; fi
shift
formail -s $helper $* < $src
cat class3
#! /bin/sh
cat >msg.$$
/usr/bin/bogofilter $* < msg.$$
res=$?
if [ $res = 0 ]; then
cat msg.$$ >>corpus.bad
elif [ $res = 1 ]; then
cat msg.$$ >>corpus.good
elif [ $res = 2 ]; then
cat msg.$$ >>corpus.unsure
fi
/bin/rm msg.$$
runex
The data reduction (calculating and plotting percentages) is sufficiently trivial that I don't think it's necessary to show them here. Those, and the detailed logs, are available on request.
[© Greg Louis, 2004; last modified 2004-03-11]