bogofilter -s < spam_corpus bogofilter -n < nonspam_corpus
I run bogofilter with the Robinson-Fisher method of calculation, in which values of a guess a priori and a weight parameter for that guess are used in calculating individual token-probability estimates, and Fisher's method of combining probabilities is applied in calculation of the overall message "spamicity." It was of interest to know three things about training for this approach:
Another experiment has been performed since this one, with more data, and slightly different objectives.
Effect of repeated training-on-error
The first table shows the numbers of misclassifications (and therefore the number of messages used in training) for each round of training-on-error:
rounds spam good 1 1 1090 764 2 2 193 56 3 3 28 15 4 4 10 5 5 5 8 3
These results are plotted on the left-hand graph below, spam in red and nonspam in blue, on an arbitrary scale.
Next we show the data for false positives (fpos), false negatives (fneg), total error (err) and percent error (percent) for each round of training-on-error:
round run fpos fneg err percent 1 1 0 22 126 148 6.60 2 1 1 17 123 140 6.25 3 1 2 19 121 140 6.25 4 2 0 23 105 128 5.71 5 2 1 18 113 131 5.85 6 2 2 22 109 131 5.85 7 3 0 23 104 127 5.67 8 3 1 18 111 129 5.76 9 3 2 22 108 130 5.80 10 4 0 23 104 127 5.67 11 4 1 18 111 129 5.76 12 4 2 22 108 130 5.80 13 5 0 23 103 126 5.62 14 5 1 19 108 127 5.67 15 5 2 22 107 129 5.76
It appears that a small but significant improvement in discrimination was achieved by running two rounds of training, but that after that, the law of diminishing returns set in and further improvement was minimal. Summarizing the five rounds, with means and 95% confidence limits (these values also appear on the left-hand graph below):
round meanerrpc lcl95 ucl95 1 1 6.37 6.13 6.60 2 2 5.80 5.56 6.04 3 3 5.74 5.50 5.98 4 4 5.74 5.50 5.98 5 5 5.68 5.44 5.92
Choice of training method We now compare the results obtained with a two-round training-on-error database with those obtained with full training, and with full training followed by one round of training-on-error. Also included in the comparison is the production training database described above.
train run fpos fneg err percent
1 production 0 2 47 49 2.19
2 production 1 2 43 45 2.01
3 production 2 3 45 48 2.14
4 errtwice 0 23 105 128 5.71
5 errtwice 1 18 113 131 5.85
6 errtwice 2 22 109 131 5.85
7 full 0 57 65 122 5.44
8 full 1 48 54 102 4.55
9 full 2 56 63 119 5.31
10 fullerr 0 52 69 121 5.40
11 fullerr 1 46 58 104 4.64
12 fullerr 2 49 69 118 5.27
Summarizing (these means and confidence limits appear in the graph on the right below):
train meanerrpc lcl95 ucl95
1 production 2.11 1.79 2.44
2 errtwice 5.80 5.48 6.12
3 full 5.10 4.78 5.43
4 fullerr 5.10 4.78 5.43
Not surprisingly, the much larger production training set had an advantage over the experimental training corpus. More surprising was that the full training performed slightly better (though not significantly so, in the statistical sense) than did training on error. Following full training with a round of training on error made no difference; rather few messages were used in that round.
Here are the two graphs showing these results. As mentioned above, the red line shows the number of spams used in each round of on-error training, scaled so that the maximum appears at 5.0 on the y axis, and the blue line shows the number of nonspams on the same scale. The black crosses indicate the percentage errors, with 95% confidence limits indicated by the vertical lines.

scalefactor = badlist_messagecount / goodlist_messagecount f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
The f(w) value is calculated for each unique token in the message; the number of such tokens is represented below by n.
Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2n degrees of freedom:
P = prbx(-2 * sum(ln(1-f(w))), 2*n) Q = prbx(-2 * sum(ln(f(w))), 2*n) S = (1 + Q - P) / 2
alias rt='randomtrain exh -s sieved.bad -n sieved.good' mkdir exh rt spam reg good reg 6372 1090 6372 764 cp -a exh once rt spam reg good reg 6372 193 6372 56 cp -a exh twice rt spam reg good reg 6372 28 6372 15 cp -a exh thrice rt spam reg good reg 6372 10 6372 5 cp -a exh fourX rt spam reg good reg 6372 8 6372 3 mkdir full bogofilter -d full -s < /store/safe/mail/sieved.bad bogofilter -d full -n < /store/safe/mail/sieved.good cp -a full fullerr randomtrain fullerr -s /store/safe/mail/sieved.bad -n \ /store/safe/mail/sieved.good spam reg good reg 6372 18 6372 221 The classify script used below looks like this: cd helper=bin/bogowrap test "x$2" = "x3" && helper=bin/class3 formail -s $helper <$1 bogowrap is just #! /bin/sh cat >msg.$$ if /usr/bin/bogofilter>corpus.bad else cat msg.$$ >>corpus.good fi /bin/rm msg.$$ thirds is equally simple: #! /bin/sh let n=${FILENO}%3 fname=cgx-$n cat >>$fname classify mbox1129 classify mbox1202 (manual reclassification performed) FILENO=0 formail -s ~/bin/thirds < corpus.good rename gx ns cgx* FILENO=0 formail -s ~/bin/thirds < corpus.bad rename gx sp cgx* for round in .bogofilter once twice thrice fourX exh; do for run in 0 1 2; do formail -s bogofilter -d $round -v < csp-$run &> sp-$round-$run formail -s bogofilter -d $round -v < cns-$run &> ns-$round-$run done done for round in full fullerr; do for run in 0 1 2; do formail -s bogofilter -d $round -v < csp-$run &> sp-$round-$run formail -s bogofilter -d $round -v < cns-$run &> ns-$round-$run done done grep -c ^1 ns-* ns-.bogofilter-0:2 ns-.bogofilter-1:2 ns-.bogofilter-2:3 ns-exh-0:23 ns-exh-1:19 ns-exh-2:22 ns-fourX-0:23 ns-fourX-1:18 ns-fourX-2:22 ns-full-0:57 ns-full-1:48 ns-full-2:56 ns-fullerr-0:52 ns-fullerr-1:46 ns-fullerr-2:49 ns-once-0:22 ns-once-1:17 ns-once-2:19 ns-thrice-0:23 ns-thrice-1:18 ns-thrice-2:22 ns-twice-0:23 ns-twice-1:18 ns-twice-2:22 grep -c -v ^1 sp-* sp-.bogofilter-0:47 sp-.bogofilter-1:43 sp-.bogofilter-2:45 sp-exh-0:103 sp-exh-1:108 sp-exh-2:107 sp-fourX-0:104 sp-fourX-1:111 sp-fourX-2:108 sp-full-0:65 sp-full-1:54 sp-full-2:63 sp-fullerr-0:69 sp-fullerr-1:58 sp-fullerr-2:69 sp-once-0:126 sp-once-1:123 sp-once-2:121 sp-thrice-0:104 sp-thrice-1:111 sp-thrice-2:108 sp-twice-0:105 sp-twice-1:113 sp-twice-2:109 grep -c '^From ' csp* cns* retrain$err <- retrain$fpos+retrain$fneg retrain$percent <- retrain$err * 100 / (1624+617) round run fpos fneg err percent 1 1 0 22 126 148 6.60 2 1 1 17 123 140 6.25 3 1 2 19 121 140 6.25 4 2 0 23 105 128 5.71 5 2 1 18 113 131 5.85 6 2 2 22 109 131 5.85 7 3 0 23 104 127 5.67 8 3 1 18 111 129 5.76 9 3 2 22 108 130 5.80 10 4 0 23 104 127 5.67 11 4 1 18 111 129 5.76 12 4 2 22 108 130 5.80 13 5 0 23 103 126 5.62 14 5 1 19 108 127 5.67 15 5 2 22 107 129 5.76 summary(aov(percent ~ round, data=retrain)) Df Sum Sq Mean Sq F value Pr(>F) round 1 0.61170 0.61170 16.884 0.001232 ** Residuals 13 0.47099 0.03623 raov <- aov(percent ~ round, data=retrain) rdf <- 13 rms <- deviance(raov) / rdf d <- c(1.95996, 0.412, 0.423) z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3) meanerr <- apply(array(retrain$percent, dim=c(3,5)), 2, mean) lcl95 <- meanerr - z ucl95 <- meanerr + z retrainres <- data.frame(round=c(1,2,3,4,5), meanerrpc=meanerr, lcl95=lcl95, ucl95=ucl95) print(retrainres, digits=3) round meanerrpc lcl95 ucl95 1 1 6.37 6.13 6.60 2 2 5.80 5.56 6.04 3 3 5.74 5.50 5.98 4 4 5.74 5.50 5.98 5 5 5.68 5.44 5.92 retrainres$round <- factor(retrainres$round) plot(retrainres$round, retrainres$meanerrpc, ylim=c(0,1.2), main="Effect of repeated training-on-error", xlab="Number of training cycles", ylab="Percent error") lines(retrainres$round, retrainres$ucl95, type="h") lines(retrainres$round, retrainres$lcl95, type="h", col="white") lines(retrainres$round, retrainres$meanerrpc, lty="dotted", lwd=2) read.table("training.reg.tbl") -> reg reg rounds spam good 1 1 1090 764 2 2 193 56 3 3 28 15 4 4 10 5 5 5 8 3 lines(reg$rounds, reg$spam * 5 / 1090, col="red") points(reg$rounds, reg$spam * 5 / 1090, col="red") points(reg$rounds, reg$good * 5 / 1090, col="blue") lines(reg$rounds, reg$good * 5 / 1090, col="blue") data.frame(train=c(rep("production",3), rep("errtwice",3), rep("full",3), rep("fullerr",3)), run=c(0,1,2,0,1,2,0,1,2,0,1,2), fpos=c(2,2,3,23,18,22,57,48,56,52,46,49), fneg=c(47,43,45,105,113,109,65,54,63,69,58,69)) -> vartrain vartrain$err <- vartrain$fpos+vartrain$fneg vartrain$percent <- vartrain$err * 100 / (1624+617) print(vartrain, digits=3) train run fpos fneg err percent 1 production 0 2 47 49 2.19 2 production 1 2 43 45 2.01 3 production 2 3 45 48 2.14 4 errtwice 0 23 105 128 5.71 5 errtwice 1 18 113 131 5.85 6 errtwice 2 22 109 131 5.85 7 full 0 57 65 122 5.44 8 full 1 48 54 102 4.55 9 full 2 56 63 119 5.31 10 fullerr 0 52 69 121 5.40 11 fullerr 1 46 58 104 4.64 12 fullerr 2 49 69 118 5.27 write.table(vartrain,"vartrain.tbl") vaov <- aov(percent ~ train, data=vartrain) summary(vaov) Df Sum Sq Mean Sq F value Pr(>F) train 3 24.3465 8.1155 79.139 2.742e-06 *** Residuals 8 0.8204 0.1025 rdf <- 8 rms <- deviance(raov) / rdf d <- c(1.95996, 0.412, 0.423) z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3) meanerr <- apply(array(vartrain$percent, dim=c(3,4)), 2, mean) lcl95 <- meanerr - z ucl95 <- meanerr + z vartrainres <- data.frame(train=c("production","errtwice","full","fullerr"), meanerrpc=meanerr, lcl95=lcl95, ucl95=ucl95) print(vartrainres,digits=3) train meanerrpc lcl95 ucl95 1 production 2.11 1.79 2.44 2 errtwice 5.80 5.48 6.12 3 full 5.10 4.78 5.43 4 fullerr 5.10 4.78 5.43 plot(vartrainres$train, vartrainres$meanerrpc, ylim=c(0,7), main="Comparison of training methods", xlab="Method", ylab="Percent error") lines(vartrainres$train, vartrainres$ucl95, type="h") lines(vartrainres$train, vartrainres$lcl95, type="h", col="white")
[© Greg Louis, 2002, 2003; last modified 2003-04-18]