Bogofilter Training: Comparing Full Training with Training-on-error

Introduction and general description:

There are two possible ways to train bogofilter starting from scratch, assuming that corpora of spam and nonspam messages have been accumulated for the purpose.  One is simply to run
bogofilter -s < spam_corpus
bogofilter -n < nonspam_corpus
which registers every message in both corpora.  The other extreme is to train on error: messages are fed to the classifier in random order, and whenever a classification is wrong or uncertain, the message in question is used for training before the next message is classified.

I run bogofilter with the Robinson-Fisher method of calculation, in which values of a guess a priori and a weight parameter for that guess are used in calculating individual token-probability estimates, and Fisher's method of combining probabilities is applied in calculation of the overall message "spamicity."  It was of interest to know three things about training for this approach:

  1. Does full training or on-error training lead to fewer classification errors?
  2. Does the number of classification errors diminish if more than one round of on-error training is performed?
  3. Does the number of classification errors diminish if a round of on-error training is performed after full training?

Another experiment has been performed since this one, with more data, and slightly different objectives.

Procedure and Results:

For this experiment I used a training corpus of 6372 nonspams and 6372 spam emails, and a test corpus consisting of a further 4872 nonspams and 1851 spams.  The nonspams and spams were "dealt" out into three separate mailboxes each so as to provide three test runs per experimental condition. The training methods, for full training and for training on error, were briefly described in the foregoing section.  For training on error, I wrote a script called randomtrain that produces a list of messages, in random order, with flags to indicate whether each message is spam or nonspam, and then uses the list to feed messages to bogofilter for classification and, when needed, for training.  I produced the following set of training databases, each in its own directory:

Effect of repeated training-on-error

The first table shows the numbers of misclassifications (and therefore the number of messages used in training) for each round of training-on-error:

  rounds spam good
1      1 1090  764
2      2  193   56
3      3   28   15
4      4   10    5
5      5    8    3

These results are plotted on the left-hand graph below, spam in red and nonspam in blue, on an arbitrary scale.

Next we show the data for false positives (fpos), false negatives (fneg), total error (err) and percent error (percent) for each round of training-on-error:

   round run fpos fneg err percent
1      1   0   22  126 148    6.60
2      1   1   17  123 140    6.25
3      1   2   19  121 140    6.25
4      2   0   23  105 128    5.71
5      2   1   18  113 131    5.85
6      2   2   22  109 131    5.85
7      3   0   23  104 127    5.67
8      3   1   18  111 129    5.76
9      3   2   22  108 130    5.80
10     4   0   23  104 127    5.67
11     4   1   18  111 129    5.76
12     4   2   22  108 130    5.80
13     5   0   23  103 126    5.62
14     5   1   19  108 127    5.67
15     5   2   22  107 129    5.76

It appears that a small but significant improvement in discrimination was achieved by running two rounds of training, but that after that, the law of diminishing returns set in and further improvement was minimal.  Summarizing the five rounds, with means and 95% confidence limits (these values also appear on the left-hand graph below):

  round meanerrpc lcl95 ucl95
1     1      6.37  6.13  6.60
2     2      5.80  5.56  6.04
3     3      5.74  5.50  5.98
4     4      5.74  5.50  5.98
5     5      5.68  5.44  5.92

Choice of training method We now compare the results obtained with a two-round training-on-error database with those obtained with full training, and with full training followed by one round of training-on-error.  Also included in the comparison is the production training database described above.

        train run fpos fneg err percent
1  production   0    2   47  49    2.19
2  production   1    2   43  45    2.01
3  production   2    3   45  48    2.14
4    errtwice   0   23  105 128    5.71
5    errtwice   1   18  113 131    5.85
6    errtwice   2   22  109 131    5.85
7        full   0   57   65 122    5.44
8        full   1   48   54 102    4.55
9        full   2   56   63 119    5.31
10    fullerr   0   52   69 121    5.40
11    fullerr   1   46   58 104    4.64
12    fullerr   2   49   69 118    5.27

Summarizing (these means and confidence limits appear in the graph on the right below):

       train meanerrpc lcl95 ucl95
1 production      2.11  1.79  2.44
2   errtwice      5.80  5.48  6.12
3       full      5.10  4.78  5.43
4    fullerr      5.10  4.78  5.43

Not surprisingly, the much larger production training set had an advantage over the experimental training corpus.  More surprising was that the full training performed slightly better (though not significantly so, in the statistical sense) than did training on error.  Following full training with a round of training on error made no difference; rather few messages were used in that round.

Here are the two graphs showing these results.  As mentioned above, the red line shows the number of spams used in each round of on-error training, scaled so that the maximum appears at 5.0 on the y axis, and the blue line shows the number of nonspams on the same scale.  The black crosses indicate the percentage errors, with 95% confidence limits indicated by the vertical lines.

graphs

Conclusions:

  1. The Fisher-based method of calculation worked as well with full training as it did with training on error.
  2. When training on error, two rounds were better than one, but more than that brought too little benefit to justify the effort.
  3. There was no advantage in following full training with a round of training on error.

Appendix A: Calculation methods:

The Fisher-based calculation uses
scalefactor = badlist_messagecount / goodlist_messagecount
f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
The scale factor is the ratio of messages used to make up the list of spam tokens to messages used to make up the list of nonspam tokens.

The f(w) value is calculated for each unique token in the message; the number of such tokens is represented below by n.

Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2n degrees of freedom:

P = prbx(-2 * sum(ln(1-f(w))), 2*n)
Q = prbx(-2 * sum(ln(f(w))), 2*n)
S = (1 + Q - P) / 2

Appendix B: Notes on experimental procedure:

The following notes should suffice if anyone wishes to repeat the experiment:

alias rt='randomtrain exh -s sieved.bad -n sieved.good'
mkdir exh
rt
 spam  reg   good  reg
 6372 1090   6372  764

cp -a exh once
rt
 spam  reg   good  reg
 6372  193   6372   56 

cp -a exh twice
rt
 spam  reg   good  reg
 6372   28   6372   15

cp -a exh thrice
rt
 spam  reg   good  reg
 6372   10   6372    5 

cp -a exh fourX
rt
 spam  reg   good  reg
 6372    8   6372    3 

mkdir full
bogofilter -d full -s < /store/safe/mail/sieved.bad
bogofilter -d full -n < /store/safe/mail/sieved.good

cp -a full fullerr
randomtrain fullerr -s /store/safe/mail/sieved.bad -n \
  /store/safe/mail/sieved.good
 spam  reg   good  reg
 6372   18   6372  221 

The classify script used below looks like this:
  cd
  helper=bin/bogowrap
  test "x$2" = "x3" && helper=bin/class3
  formail -s $helper <$1

bogowrap is just
  #! /bin/sh
  cat >msg.$$
  if /usr/bin/bogofilter >corpus.bad
  else
    cat msg.$$ >>corpus.good
  fi
  /bin/rm msg.$$

thirds is equally simple:
  #! /bin/sh
  let n=${FILENO}%3
  fname=cgx-$n
  cat >>$fname

classify mbox1129
classify mbox1202

(manual reclassification performed)

FILENO=0 formail -s ~/bin/thirds < corpus.good
rename gx ns cgx*
FILENO=0 formail -s ~/bin/thirds < corpus.bad
rename gx sp cgx*

for round in .bogofilter once twice thrice fourX exh; do
  for run in 0 1 2; do
    formail -s bogofilter -d $round -v < csp-$run &> sp-$round-$run
    formail -s bogofilter -d $round -v < cns-$run &> ns-$round-$run
  done
done

for round in full fullerr; do
  for run in 0 1 2; do
    formail -s bogofilter -d $round -v < csp-$run &> sp-$round-$run
    formail -s bogofilter -d $round -v < cns-$run &> ns-$round-$run
  done
done

grep -c ^1 ns-*
ns-.bogofilter-0:2
ns-.bogofilter-1:2
ns-.bogofilter-2:3
ns-exh-0:23
ns-exh-1:19
ns-exh-2:22
ns-fourX-0:23
ns-fourX-1:18
ns-fourX-2:22
ns-full-0:57
ns-full-1:48
ns-full-2:56
ns-fullerr-0:52
ns-fullerr-1:46
ns-fullerr-2:49
ns-once-0:22
ns-once-1:17
ns-once-2:19
ns-thrice-0:23
ns-thrice-1:18
ns-thrice-2:22
ns-twice-0:23
ns-twice-1:18
ns-twice-2:22

grep -c -v ^1 sp-*
sp-.bogofilter-0:47
sp-.bogofilter-1:43
sp-.bogofilter-2:45
sp-exh-0:103
sp-exh-1:108
sp-exh-2:107
sp-fourX-0:104
sp-fourX-1:111
sp-fourX-2:108
sp-full-0:65
sp-full-1:54
sp-full-2:63
sp-fullerr-0:69
sp-fullerr-1:58
sp-fullerr-2:69
sp-once-0:126
sp-once-1:123
sp-once-2:121
sp-thrice-0:104
sp-thrice-1:111
sp-thrice-2:108
sp-twice-0:105
sp-twice-1:113
sp-twice-2:109

grep -c '^From ' csp* cns*

retrain$err <- retrain$fpos+retrain$fneg
retrain$percent <- retrain$err * 100 / (1624+617)

   round run fpos fneg err percent
1      1   0   22  126 148    6.60
2      1   1   17  123 140    6.25
3      1   2   19  121 140    6.25
4      2   0   23  105 128    5.71
5      2   1   18  113 131    5.85
6      2   2   22  109 131    5.85
7      3   0   23  104 127    5.67
8      3   1   18  111 129    5.76
9      3   2   22  108 130    5.80
10     4   0   23  104 127    5.67
11     4   1   18  111 129    5.76
12     4   2   22  108 130    5.80
13     5   0   23  103 126    5.62
14     5   1   19  108 127    5.67
15     5   2   22  107 129    5.76

summary(aov(percent ~ round, data=retrain))
            Df  Sum Sq Mean Sq F value   Pr(>F)   
round        1 0.61170 0.61170  16.884 0.001232 **
Residuals   13 0.47099 0.03623                    

raov <- aov(percent ~ round, data=retrain) 

rdf <- 13
rms <- deviance(raov) / rdf
d <- c(1.95996, 0.412, 0.423)
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanerr <- apply(array(retrain$percent, dim=c(3,5)), 2, mean)
lcl95 <- meanerr - z
ucl95 <- meanerr + z
retrainres <- data.frame(round=c(1,2,3,4,5), meanerrpc=meanerr,        
  lcl95=lcl95, ucl95=ucl95)

print(retrainres, digits=3)
  round meanerrpc lcl95 ucl95
1     1      6.37  6.13  6.60
2     2      5.80  5.56  6.04
3     3      5.74  5.50  5.98
4     4      5.74  5.50  5.98
5     5      5.68  5.44  5.92

retrainres$round <- factor(retrainres$round)
plot(retrainres$round, retrainres$meanerrpc, ylim=c(0,1.2),
  main="Effect of repeated training-on-error",
  xlab="Number of training cycles", ylab="Percent error")
lines(retrainres$round, retrainres$ucl95, type="h")
lines(retrainres$round, retrainres$lcl95, type="h", col="white")
lines(retrainres$round, retrainres$meanerrpc, lty="dotted", lwd=2)
read.table("training.reg.tbl") -> reg
reg
  rounds spam good
1      1 1090  764
2      2  193   56
3      3   28   15
4      4   10    5
5      5    8    3

lines(reg$rounds, reg$spam * 5 / 1090, col="red")
points(reg$rounds, reg$spam * 5 / 1090, col="red")
points(reg$rounds, reg$good * 5 / 1090, col="blue")
lines(reg$rounds, reg$good * 5 / 1090, col="blue")

data.frame(train=c(rep("production",3), rep("errtwice",3),
rep("full",3), rep("fullerr",3)), run=c(0,1,2,0,1,2,0,1,2,0,1,2),
fpos=c(2,2,3,23,18,22,57,48,56,52,46,49),
fneg=c(47,43,45,105,113,109,65,54,63,69,58,69)) -> vartrain

vartrain$err <- vartrain$fpos+vartrain$fneg
vartrain$percent <- vartrain$err * 100 / (1624+617)

print(vartrain, digits=3)
        train run fpos fneg err percent
1  production   0    2   47  49    2.19
2  production   1    2   43  45    2.01
3  production   2    3   45  48    2.14
4    errtwice   0   23  105 128    5.71
5    errtwice   1   18  113 131    5.85
6    errtwice   2   22  109 131    5.85
7        full   0   57   65 122    5.44
8        full   1   48   54 102    4.55
9        full   2   56   63 119    5.31
10    fullerr   0   52   69 121    5.40
11    fullerr   1   46   58 104    4.64
12    fullerr   2   49   69 118    5.27

write.table(vartrain,"vartrain.tbl")
vaov <- aov(percent ~ train, data=vartrain)
summary(vaov)
            Df  Sum Sq Mean Sq F value    Pr(>F)    
train        3 24.3465  8.1155  79.139 2.742e-06 ***
Residuals    8  0.8204  0.1025                      

rdf <- 8
rms <- deviance(raov) / rdf
d <- c(1.95996, 0.412, 0.423)
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanerr <- apply(array(vartrain$percent, dim=c(3,4)), 2, mean)
lcl95 <- meanerr - z
ucl95 <- meanerr + z
vartrainres <- data.frame(train=c("production","errtwice","full","fullerr"),
  meanerrpc=meanerr, lcl95=lcl95, ucl95=ucl95)

print(vartrainres,digits=3)
       train meanerrpc lcl95 ucl95
1 production      2.11  1.79  2.44
2   errtwice      5.80  5.48  6.12
3       full      5.10  4.78  5.43
4    fullerr      5.10  4.78  5.43

plot(vartrainres$train, vartrainres$meanerrpc, ylim=c(0,7),
  main="Comparison of training methods",
  xlab="Method", ylab="Percent error")
lines(vartrainres$train, vartrainres$ucl95, type="h")
lines(vartrainres$train, vartrainres$lcl95, type="h", col="white")

Greg Louis, 2002, 2003; last modified 2003-04-18]