Bogofilter is a Bayesian spam filter; background is presented on my general bogofilter page. Bogofilter runs a modification of Paul Graham's original calculation method that was suggested by Gary Robinson. (In mainline bogofilter, the original method is still optional.) Since this modification is the basis for the present experiment, a brief review of the calculations follows:
Paul Graham's original calculation uses a parameter p(w) that is calculated for each unique token in a message, based on the numbers of times that the token has been encountered in spam during training (badcount) and in nonspam (goodcount), scaled by the numbers of messages used to build the training database (bad- or goodlist_messagecount):
(badcount/badlist_messagecount)
p(w) = -----------------------------------------------------------------
(badcount/badlist_messagecount + goodcount/goodlist_messagecount)
n = badcount + goodcount f(w) = (s * x + n * p(w)) / (s + n)
scalefactor = badlist_messagecount / goodlist_messagecount f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
n' = badcount + goodcount * scalefactor
-- this should have a negligible effect, unless the training
set is extremely lopsided.Robinson recommends applying Fisher's method of combining probabilities to obtain the actual index used to classify a message as spam or nonspam. Fisher's method uses an inverse chi-squared function, which I'll refer to as prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2N degrees of freedom, N being the number of unique tokens in the message:
P = prbx(-2 * sum(ln(1-f(w))), 2*N) Q = prbx(-2 * sum(ln(f(w))), 2*N) S = (1 + Q - P) / 2
Graham's original calculation involved combining only the fifteen most extreme values of p(w), i.e. the fifteen values farthest from 0.5 (nearest to 1 or 0). With Robinson's approach, there is no need to restrict the calculation in this way; however, bogofilter supports the possibility of ignoring values near to 0.5 if the user wishes to do so. The rationale is that such values contribute little to the overall result of the calculation and it saves time to concentrate on the more "characteristic" tokens; the counterargument is that one might as well take advantage of all available information.
In practice, it has been found -- for example, in this experiment -- that in some typical operating conditions, a low minimum deviation from 0.5 (around 0.1 or 0.15) gives better discrimination than when no minimum is applied. Using a minimum deviation (hereinafter abbreviated mindev) violates one of the assumptions on which the Bayesian classification is based, but that assumption is violated anyway by the nature of the data; it may even be (I won't discuss this in detail here but a reference may soon be available) that the two violations act to cancel one another out and approximate the required uniformity of distribution more precisely.
Robinson discusses choosing the value of the parameter x in his essay on spam detection. Basically, a good starting point might be 0.5, or the average p(w) for words in the training set that have counts of 10 or more. Notice that this sets up a potential interaction of x and s with mindev, since x may well be close enough to 0.5 to fall beneath the mindev threshold. Parameter s is a weight factor, as described above; Robinson suggests starting with 1 and tuning from there. I found in practice that a value of 0.001 for s gave better discrimination, even with mindev set to zero; the optimum size of s seems related to the size of the training set in a way that is not mathematically obvious (to me, anyway).
It seemed desirable to repeat the process with larger corpora. I took 21,010 spams and 34,000 nonspams and "dealt" them out into two sets of four files. The "dealing" was done like this:
$cat distrib
FILE=(t.$1 r0.$1 t.$1 r1.$1 t.$1 r2.$1)
let n=${FILENO}%${#FILE[*]}
fname=${FILE[$n]}
cat >>$fname
FILENO=0 formail -s ./distrib ns < corpus.nonspam
FILENO=0 formail -s ./distrib sp < corpus.spam
$ grep -c '^From ' *.ns *.sp r0.ns:3502 r1.ns:3502 r2.ns:3501 t.ns:21116 r0.sp:5667 r1.sp:5667 r2.sp:5666 t.sp:20935
The t.* files were used to build a bogofilter training set, and the r0,
r1 and r2 files were used for classification. This experiment
employed a special version of bogofilter with an option
-m mindev[,cutoff[,s]] so that mindev, s and
the spam-cutoff value (the value of the spam index that determines the
classification) could be set from the command line.
(Note that as of bogofilter version 0.11.1.9, the -m and -o command line options allow the specification of these parameters; the special version mentioned above is no longer needed. An example of the use of the new version is presented in the Appendix to the report of another experiment.)
The training database was built with the commands
bogofilter -d db -s <t.sp andbogofilter -d db -n <t.ns.For s in 0.01 to 1e-8 by logarithmic steps of a half decade each (0.01, 0.0032, 1e-3 ...), and within that loop, for md in 0.025 to 0.475 by steps of 0.025, I first determined the spam cutoff that would give closest to 30 (approximately 0.3%) false positives; that cutoff was then used to classify the messages in the three spam files r0.sp, r1.sp and r2.sp, and the number of false negatives was determined. The counts of false positives and false negatives were added together and expressed as a percentage error. These percentage errors were subtracted from 100 to give percentage-correct values, because it's easier to understand the perspective graphs when they're plotted this way.
There are many data in this experiment, so the table of results is presented separately rather than being reproduced here.
The data are presented on the perspective plots below. On the right is a graph showing the cutoff values used at each point (chosen to give 30 false positives, as nearly as possible). The actual experimental results appear on the graph at the left.

A summary of the analysis of variance (unsurprisingly, s, mindev (md) and the interaction between them are all highly significant) and a list of the fifteen most successful combinations of s and mindev follow:
# ./smindev.R bogolog/smindev30a.tbl 9168.33
Df Sum Sq Mean Sq F value Pr(>F)
s 12 43.941 3.662 337.604 < 2.2e-16 ***
md 16 52.218 3.264 300.895 < 2.2e-16 ***
s:md 192 38.889 0.203 18.674 < 2.2e-16 ***
Residuals 442 4.794 0.011
---
rs md cutoff percent
13 0.01000 0.325 0.640 98.43665
14 0.01000 0.350 0.618 98.42938
30 0.00320 0.325 0.613 98.42938
16 0.01000 0.400 0.538 98.42574
29 0.00320 0.300 0.659 98.40393
33 0.00320 0.400 0.526 98.40029
12 0.01000 0.300 0.679 98.39665
47 0.00100 0.325 0.593 98.37848
15 0.01000 0.375 0.583 98.37484
46 0.00100 0.300 0.659 98.31667
63 0.00032 0.300 0.631 98.31303
11 0.01000 0.275 0.788 98.30576
32 0.00320 0.375 0.577 98.30576
31 0.00320 0.350 0.633 98.30213
48 0.00100 0.350 0.590 98.29122
Omitting mindev values 0.45 and 0.475 from the plot allows the use of a higher resolution in the graph of percent correct:

It was clear from these results that my decision to begin the scan of s values at 0.01 was an error. There might well be even better results, it seemed, with mindev in the 0.3 to 0.4 range and s values larger than 0.01. That being so, I performed another scan, this time with s values ranging from 1.0 to 1e-4, again in half decades, and mindev from 0.2 to 0.425. The raw data are again presented separately.

The anova summary and the fifteen best results are shown below. It does seem that an s value between 0.032 and 0.32 is optimal for the data used in this experiment, and that a relatively high mindev (between 0.3 and 0.4) is to be preferred.
# ./smindev.R bogolog/smindev30b.tbl 9168.33
Df Sum Sq Mean Sq F value Pr(>F)
s 8 40.215 5.027 300.981 < 2.2e-16 ***
md 9 19.205 2.134 127.762 < 2.2e-16 ***
s:md 72 19.566 0.272 16.271 < 2.2e-16 ***
Residuals 180 3.006 0.017
---
rs md cutoff percent
28 0.1000 0.375 0.618 98.53845
38 0.0320 0.375 0.582 98.53845
37 0.0320 0.350 0.630 98.52390
29 0.1000 0.400 0.601 98.50209
27 0.1000 0.350 0.674 98.49845
39 0.0320 0.400 0.561 98.49845
46 0.0100 0.325 0.640 98.49118
47 0.0100 0.350 0.618 98.48391
56 0.0032 0.325 0.613 98.48391
49 0.0100 0.400 0.538 98.48028
17 0.3200 0.350 0.722 98.45846
55 0.0032 0.300 0.659 98.45846
36 0.0320 0.325 0.693 98.45483
59 0.0032 0.400 0.526 98.45483
45 0.0100 0.300 0.679 98.45119
#! /bin/sh /usr/bin/setR
# This file is smindev.R, an R script to perform the data reduction for
# experiments in which min_dev and s are varied over a wide range of
# values; for each combination, a spam cutoff is first determined such
# that there are no more than 30 false positives when three nonspam
# files are pooled and evaluated, and with that cutoff, the number of
# false negatives is determined for each of three spam files.
# The following script distributes the messages into four files, t for
# training and r0, r1 and r2 for the experiment.
# It's run from formail as follows:
# cat [list of spam mbox files] | formail -s ./distrib sp
# cat [list of nonspam mboxen] | formail -s ./distrib ns
# #! /bin/sh
# # distrib - deal messages from an mbox into files
# # usage: FILENO=0 formail -s ./distrib extension < mbox
#
# # put names of files to be produced into this array
# FILE=(t.$1 r0.$1 t.$1 r1.$1 t.$1 r2.$1)
#
# # no user serviceable parts beyond this point
# let n=${FILENO}%${#FILE[*]}
# fname=${FILE[$n]}
# cat >>$fname
# In the experiment for which this script was written, extra files
# were added to the training mailboxes after the distrib script had
# been run:
# # grep -c '^From ' *.ns *.sp
# r0.ns:3502
# r1.ns:3502
# r2.ns:3501
# t.ns:21116
# r0.sp:5667
# r1.sp:5667
# r2.sp:5666
# t.sp:20935
# The training database is built with the command
# bogofilter -d db -s < t.sp; bogofilter -d db -n < t.ns
# /**/
# This is the script to run the experiment:
# #! /bin/bash
# # runex for studying effects of s and mindev
#
# function getco () {
# mopt=$1; shift
# res=`cat $* | formail -s bogofilter -d db -m $mopt -v 2>&1 | \
# perl -e ' $target = 30; while (<>) { ' \
# -e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
# -e ' die "dainbramage" unless scalar @diffs > 15;' \
# -e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
# -e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
# -e ' printf("%8.6f %d",1.0-$s[$target],$target-1);'`
# }
#
# function wrapper () {
# mopt=$1; shift; v=-v
# res=`cat $* | formail -s bogofilter -d db -m $mopt -v 2>&1 | \
# grep -c $v '^1'`
# }
#
# for s in 1e-2 3.2e-3 1e-3 3.2e-4 1e-4 3.2e-5 1e-5 3.2e-6 1e-6 3.2e-7 \
# 1e-7 3.2e-8 1e-8; do
# for md in `seq 0.025 0.025 0.47501`; do
# echo -n "$s $md fpos... "
# getco $md,0.1,$s r0.ns r1.ns r2.ns
# fpos=${res##* }; co=${res%% *}; let fpos=$fpos/3
# echo -n "$fpos at cutoff $co, run0... "
# run=0; wrapper $md,$co,$s r0.sp; fneg=$res
# echo "$s $md $co $run $fpos $fneg" >>parms.tbl
# echo -n "$fneg, run1... "
# run=1; wrapper $md,$co,$s r1.sp; fneg=$res
# echo "$s $md $co $run $fpos $fneg" >>parms.tbl
# echo -n "$fneg, run2... "
# run=2; wrapper $md,$co,$s r2.sp; fneg=$res
# echo "$s $md $co $run $fpos $fneg" >>parms.tbl
# echo $fneg
# done
# done
# /**/
# For use in R, parms.tbl from runex is stored as bogolog/smindev.tbl
# by default; otherwise run ./smindev.R filename msg-count
graphics.off(); setwd("/proj/Rwork")
if(length(argv) > 0) fn <- argv[1] else fn <- "bogolog/smindev.tbl"
if(file.exists(fn) == FALSE) stop(paste("file", fn, "not found"))
read.table(fn, col.names=c("s", "md", "cutoff", "run", "fp", "fn")) -> parms
if(length(argv) > 1) msgcount <- as.real(argv[2]) else msgcount <- 9168.33
sval <- sort(unique(parms$s), decreasing=TRUE)
x <- -log10(sval)
y <- sort(unique(parms$md))
n <- length(unique(parms$run))
parms$percent = (parms$fp + parms$fn) * 100 / msgcount
parms$s = factor(parms$s)
parms$md = factor(parms$md)
paov <- aov(percent ~ s + md + s*md, data=parms)
print(summary(paov))
pcs <- array(parms$percent, dim=c(n, length(parms$percent)/n))
meanerr <- apply(100-pcs, 2, mean)
cutoffs <- array(parms$cutoff, dim=c(n, length(parms$cutoff)/n))[1,]
parmres <- data.frame(rs=rep(sval, each=length(parms$s)/(n*length(sval))),
md=rep(y,length(sval)), cutoff=cutoffs, percent=meanerr)
z <- t(array(parmres$percent, dim=c(length(parms$md)/(n * length(sval)),
length(sval))))
co <- t(array(parmres$cutoff, dim=c(length(parms$md)/(n * length(sval)),
length(sval))))
X11(width=4.5, height=4.5)
pplot <- function(th,ph) {
persp(x, y, z, ticktype="detailed", theta=th, phi=ph,
main="Percent correct vs s and mindev",
xlab="-log(10) s", ylab="mindev",
zlab="percent correct", shade=0.6, border=4, r=sqrt(2), d=2.5)
}
pplot(-110,15)
X11(width=4.5, height=4.5)
qplot <- function(th,ph) {
persp(x, y, co, ticktype="detailed", theta=th, phi=ph,
main="Cutoff vs s and mindev",
xlab="-log(10) s", ylab="mindev",
zlab="cutoff", shade=0.6, border=4, r=sqrt(2), d=2.5)
}
qplot(-110,15)
sortlist <- sort(parmres$percent, index.return=TRUE, decreasing=TRUE)
print(parmres[sortlist$ix,][1:15,])
[© Greg Louis, 2003; last modified 2003-04-18]