Practical instructions for tuning bogofilter are presented in the accompanying HOWTO document.
That experiment found that "The best discrimination was achieved with a mindev of 0.35 and s of 0.0032; the cutoff point for those values was 0.992. There are, however, other local minima; it's not clear at this point how generally applicable this minimum might prove." It seemed desirable to repeat the experiment with larger training and test message corpora, and that was the object of the second in the series. It used final sizes of:
$ grep -c '^From ' *.ns *.sp
r0.ns:3502
r1.ns:3502
r2.ns:3501
t.ns:21116
r0.sp:5667
r1.sp:5667
r2.sp:5666
t.sp:20935
The t.* files were used to build a bogofilter training set, and the r0,
r1 and r2 files were used for classification. These experiments
employed a special version of bogofilter with an option
-m mindev[,cutoff[,s]] so that mindev, s and
the spam-cutoff value (the value of the spam index that determines the
classification) could be set from the command line.
(Note that starting with version 0.11.1.9, bogofilter's -o and -m options together allow specification of these parameters; a special version is no longer needed. The listing of the "runex" script in the Appendix has been modified to work with the new command-line options.)
It seemed that an s value between 0.032 and 0.32 was optimal for the data used in that experiment, and that a relatively high mindev (between 0.3 and 0.4) was to be preferred.
In the conclusions I stated that this experiment should be repeated with mail corpora derived from other sources, in order to determine how generally applicable its conclusions might be. That was the purpose of the present work, the third in the series.
Message set      GL-w    GL-h    DR-h
ns (training)   21116   26732   14182
sp (training)   20935    9462    5238
ns (run 1)       3502   13365    4728
ns (run 2)       3502   13366    4727
ns (run 3)       3501     n/a    4727
sp (run 1)       5667    4732    1746
sp (run 2)       5667    4730    1746
sp (run 3)       5666     n/a    1746
smallest md      0.02    0.02    0.02
largest md       0.48    0.48    0.48
smallest s       1e-8    1e-8    1e-8
largest s        4.64    4.64    4.64
cutoff target      21       8       6
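The parameter grid implied by this table can be sketched in R. The spacing is an inference, not something the text states: the s values printed in the results (1e-2, 2.15e-2, 4.64e-2, 1e-1) suggest three values per decade, the md values listed are all even hundredths, and the analysis-of-variance tables report 26 and 23 degrees of freedom for s and md. The names s_values, md_values and grid are illustrative only.

```r
# Assumed spacing: three values of s per decade, from 1e-8 up to 10^(2/3) (about 4.64).
s_values  <- 10^(seq(-24, 2) / 3)
# Assumed spacing: minimum deviation from 0.02 to 0.48 in steps of 0.02.
md_values <- seq(2, 48, by = 2) / 100

# Every combination of s and md that would be evaluated.
grid <- expand.grid(md = md_values, s = s_values)

print(length(s_values))    # 27 levels of s
print(length(md_values))   # 24 levels of md
print(nrow(grid))          # 648 combinations
print(signif(range(s_values), 3))
```

If the assumption holds, each message set needs 648 cutoff determinations, which is consistent with the 598 degrees of freedom (26 x 23) reported for the s:md interaction.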
Each message set was individually evaluated. First, the training files were used to create new training databases; then a combination of bogolexer and bogoutil was used to build, for each test file, a set of message digests that could be fed into a specially written classifier (apclass). This greatly accelerated the processing, in comparison with the formerly-used method of reclassifying the original messages repeatedly with bogofilter (the Appendix describes that method, since apclass is not generally available). Then, for each combination of s and md, the following procedure was applied. The ns (nonspam) digest files were pooled and the pool was classified with apclass. A spam cutoff was chosen, such that the number of false positives (messages with spam scores greater than or equal to the cutoff) was equal to the "cutoff target" value. The current s, md and spam cutoff values were then used by apclass to classify each message in the spam (sp) digest files; the number of false negatives (messages with spam scores lower than the cutoff) in each run was determined.
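The cutoff-selection step for a single s / md combination might look like the following minimal R sketch. It is not the apclass code, which is not shown here; the function names choose_cutoff and count_false_negatives and the score vectors are hypothetical, and the scores are made up for illustration.

```r
# Pick the cutoff so that `target` pooled nonspam scores are >= the cutoff.
# (With tied scores at the cutoff, the false-positive count can exceed the target.)
choose_cutoff <- function(ns_scores, target) {
  sorted <- sort(ns_scores, decreasing = TRUE)
  sorted[target]
}

# A false negative is a spam message scoring below the cutoff.
count_false_negatives <- function(sp_scores, cutoff) {
  sum(sp_scores < cutoff)
}

# Made-up spam scores for pooled nonspam and for one spam run.
ns_scores <- c(0.10, 0.20, 0.90, 0.95, 0.50)
sp_scores <- c(0.99, 0.85, 0.92, 0.30)

cutoff <- choose_cutoff(ns_scores, target = 2)
fpos   <- sum(ns_scores >= cutoff)
fneg   <- count_false_negatives(sp_scores, cutoff)
print(c(cutoff = cutoff, fpos = fpos, fneg = fneg))
```

Fixing the false-positive count first and then counting false negatives means every s / md combination is compared at the same false-positive rate.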
As these evaluations produced many data, the results appear in separate tables: GL-w, GL-h and DR-h. The first two lines of each table repeat the numbers of spam and nonspam messages in the run files; the remaining lines contain columns of s, md, cutoff, run ordinal, false-positive count and false-negative count in that order.
Script smindev.R (listed in the Appendix) was then run with the three tables named in the preceding paragraph. The output of this script consists of four parts: (1) the summary of an analysis of variance of the percent error with factors s and md and the interaction between them; (2) a list of the 15 combinations of s and md that gave the lowest percent error for the message set; (3) a perspective plot of percent correct (not percent error this time) vs. s and md; and (4) a perspective plot of the cutoff value vs. s and md.
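As a rough illustration of parts (1) and (2), the sketch below computes percent error as smindev.R does, (fp + fn) * 100 / msgcount, and runs a two-way analysis of variance. The message counts are those of the GL-w run files; the false-negative counts are made up, and the tiny 2 x 2 grid stands in for the full set of s / md combinations.

```r
# Message counts per run, as in the first two lines of the GL-w table.
ns_runs <- c(3502, 3502, 3501)
sp_runs <- c(5667, 5667, 5666)
msgcount <- mean(ns_runs) + mean(sp_runs)   # mean messages classified per run

# Illustrative layout: 2 s values x 2 md values x 3 runs (run varies fastest).
res <- expand.grid(run = 0:2, md = c(0.40, 0.42), s = c(0.01, 0.1))
res$fp <- 7    # 21 pooled false positives shared over 3 runs
res$fn <- c(99, 101, 97, 104, 108, 103, 110, 107, 112, 102, 105, 100)  # made up

# Percent error per run, then the mean over runs for each (s, md) pair.
res$percent <- (res$fp + res$fn) * 100 / msgcount
mean_err <- aggregate(percent ~ s + md, data = res, FUN = mean)

# Analysis of variance with s, md and their interaction as factors.
fit <- aov(percent ~ factor(s) * factor(md), data = res)
print(mean_err[order(mean_err$percent), ])
print(summary(fit))
```

With three runs per combination the residual degrees of freedom are twice the number of combinations, which is why the GL-w and DR-h tables show 1296 and the two-run GL-h table shows 648.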
Here are the results for dataset GL-w. In the table, the leftmost column is just the original ordinal of the record, and may be ignored; column rs has the s value, and md the minimum deviation. Both parameters and their interaction are highly significant:
             Df  Sum Sq Mean Sq  F value    Pr(>F)
s            26  1964.8    75.6  3231.97 < 2.2e-16 ***
md           23  6348.9   276.0 11805.69 < 2.2e-16 ***
s:md        598  1601.9     2.7   114.57 < 2.2e-16 ***
Residuals  1296    30.3 0.02338
---
rs md cutoff percent
212 0.01000 0.40 0.504 1.156153
188 0.02150 0.40 0.508 1.159789
141 0.10000 0.42 0.519 1.199782
165 0.04640 0.42 0.514 1.207053
260 0.00215 0.40 0.502 1.217960
189 0.02150 0.42 0.507 1.221596
115 0.21500 0.38 0.565 1.247046
210 0.01000 0.36 0.533 1.254317
140 0.10000 0.40 0.554 1.265225
213 0.01000 0.42 0.504 1.268860
164 0.04640 0.40 0.536 1.276132
138 0.10000 0.36 0.603 1.283403
187 0.02150 0.38 0.530 1.283403
284 0.00100 0.40 0.502 1.283403
211 0.01000 0.38 0.523 1.294310

Here are the corresponding results for dataset GL-h:
             Df  Sum Sq  Mean Sq  F value    Pr(>F)
s            26 1067.41    41.05 16984.15 < 2.2e-16 ***
md           23  453.68    19.73  8160.33 < 2.2e-16 ***
s:md        598  336.14     0.56   232.55 < 2.2e-16 ***
Residuals   648    1.57 0.002417
---
rs md cutoff percent
166 4.64e-02 0.44 0.608 0.8206007
329 2.15e-04 0.34 0.503 0.8288896
352 1.00e-04 0.32 0.503 0.8399414
305 4.64e-04 0.34 0.507 0.8454674
353 1.00e-04 0.34 0.503 0.8703340
304 4.64e-04 0.32 0.514 0.8730970
328 2.15e-04 0.32 0.509 0.8786229
190 2.15e-02 0.44 0.610 0.8841489
281 1.00e-03 0.34 0.517 0.8841489
142 1.00e-01 0.44 0.709 0.8924378
165 4.64e-02 0.42 0.655 0.9145415
375 4.64e-05 0.30 0.506 0.9173045
377 4.64e-05 0.34 0.503 0.9173045
327 2.15e-04 0.30 0.518 0.9200674
280 1.00e-03 0.32 0.537 0.9228304

And here are the results for dataset DR-h:
             Df  Sum Sq Mean Sq F value    Pr(>F)
s            26  493.23   18.97 1720.22 < 2.2e-16 ***
md           23   68.09    2.96  268.45 < 2.2e-16 ***
s:md        598  549.74    0.92   83.36 < 2.2e-16 ***
Residuals  1296   14.29    0.01
---
rs md cutoff percent
406 2.15e-05 0.44 0.5 0.8599382
407 2.15e-05 0.46 0.5 0.9320288
383 4.64e-05 0.46 0.5 0.9474768
312 4.64e-04 0.48 0.5 0.9577755
360 1.00e-04 0.48 0.5 0.9629248
336 2.15e-04 0.48 0.5 0.9783728
431 1.00e-05 0.46 0.5 0.9783728
384 4.64e-05 0.48 0.5 0.9835221
502 1.00e-06 0.44 0.5 0.9835221
526 4.64e-07 0.44 0.5 0.9938208
408 2.15e-05 0.48 0.5 1.0041195
432 1.00e-05 0.48 0.5 1.0195675
479 2.15e-06 0.46 0.5 1.0453141
574 1.00e-07 0.44 0.5 1.0504634
456 4.64e-06 0.48 0.5 1.0556128

The next graphic shows the percentage-correct plots again, grouped to make comparison easier. The personal (single-user) email dataset results appear on the left and the work-environment (multi-user) result on the right:

Relatively low values of s (1e-3 and below), and relatively high values of minimum deviation (0.25 and above) seem consistently to produce good results, but low s values are dangerous (see the first item in the Conclusions). It would be helpful to know if there were a region within the s / md surface that gave near-optimal discrimination for all three data sets, as this might then be taken as a worthwhile starting point for new bogofilter installations. If we take the results for the three datasets and rank the s / md pairs according to the percentage correct that was obtained in each case, each s / md combination will have three ranks. If we then take the worst rank for each point as a measure of the overall merit of that s / md combination, we can list and/or plot the "best" combinations.
The following table shows the best fifteen values of s (rs) and minimum deviation (md), with the percentage error (pc) and cutoff (co) for each data set, the rank in each data set (a smaller number means a higher rank), the maximum (worst) of the three ranks (maxr) and the greatest difference between ranks (difr); the rows of the table are sorted by maximum rank, and within rank by difference. The graph on the left is a perspective plot of the best 150 points; on the right, to facilitate picking the optimum parameter values, the same data are plotted in a "thermal image" form:
rs md glwpc glwco glhpc glhco drhpc drhco glwr glhr drhr maxr difr
142 1.00e-01 0.44 1.35 0.526 0.892 0.709 1.71 0.747 28 10 34 34 24
166 4.64e-02 0.44 1.41 0.525 0.821 0.608 1.72 0.691 40 1 36 40 39
190 2.15e-02 0.44 1.46 0.517 0.884 0.610 1.92 0.689 56 8 57 57 49
214 1.00e-02 0.44 1.52 0.512 0.970 0.606 1.86 0.611 78 69 49 78 29
141 1.00e-01 0.42 1.20 0.519 0.970 0.768 2.04 0.943 3 68 111 111 108
165 4.64e-02 0.42 1.21 0.514 0.915 0.655 2.04 0.912 4 11 112 112 108
400 2.15e-05 0.32 1.59 0.539 0.975 0.507 2.05 0.669 101 88 126 126 38
399 2.15e-05 0.30 1.63 0.582 0.997 0.510 2.03 0.696 123 135 107 135 28
375 4.64e-05 0.30 1.60 0.587 0.917 0.506 2.07 0.766 107 12 140 140 128
189 2.15e-02 0.42 1.22 0.507 0.926 0.610 2.09 0.873 6 16 159 159 153
376 4.64e-05 0.32 1.55 0.540 0.931 0.506 2.09 0.722 91 19 161 161 142
423 1.00e-05 0.30 1.65 0.573 1.014 0.507 2.02 0.650 128 166 95 166 71
117 2.15e-01 0.42 1.54 0.544 1.025 0.759 1.93 0.895 83 178 58 178 120
237 4.64e-03 0.42 1.42 0.509 1.025 0.604 2.05 0.744 45 180 120 180 135
352 1.00e-04 0.32 1.55 0.557 0.840 0.503 2.11 0.779 90 3 180 180 177
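The worst-rank merging used to build this table can be sketched as follows. The percent-error figures are copied from the first five rows of the table, so the ranks computed here are ranks among those five rows only, not among all the combinations; the tie-breaking rule is an assumption.

```r
# Percent error for five (s, md) pairs in each of the three data sets.
p <- data.frame(
  rs    = c(0.1, 0.0464, 0.0215, 0.01, 0.1),
  md    = c(0.44, 0.44, 0.44, 0.44, 0.42),
  glwpc = c(1.35, 1.41, 1.46, 1.52, 1.20),
  glhpc = c(0.892, 0.821, 0.884, 0.970, 0.970),
  drhpc = c(1.71, 1.72, 1.92, 1.86, 2.04)
)

# Rank within each data set: rank 1 is the lowest percent error.
# Ties are broken by row order (an assumption for this sketch).
p$glwr <- rank(p$glwpc, ties.method = "first")
p$glhr <- rank(p$glhpc, ties.method = "first")
p$drhr <- rank(p$drhpc, ties.method = "first")

# Worst rank over the three data sets, and the spread between best and worst
# rank (equal to the largest pairwise rank difference).
p$maxr <- pmax(p$glwr, p$glhr, p$drhr)
p$difr <- p$maxr - pmin(p$glwr, p$glhr, p$drhr)

# Sort by worst rank, and within equal worst rank by spread.
ranked <- p[order(p$maxr, p$difr), ]
print(ranked)
```

Using the worst rank rather than the mean rank favours combinations that are acceptable everywhere over ones that are excellent for one data set and poor for another.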

In the following table, these 15 "best-compromise" values are compared against the individual optima of the work and personal data sets. The glwd, glhd and drhd columns show the amount by which the percent error figures exceed the optima (1.16, 0.82 and 0.86% respectively):
rs md glwd glhd drhd
1 1.00e-01 0.44 0.19270 0.07184 0.84964
2 4.64e-02 0.44 0.25087 0.00000 0.85994
3 2.15e-02 0.44 0.30540 0.06355 1.06076
4 1.00e-02 0.44 0.36721 0.14920 0.99897
5 1.00e-01 0.42 0.04363 0.14920 1.17919
6 4.64e-02 0.42 0.05090 0.09394 1.17919
7 2.15e-05 0.32 0.42902 0.15473 1.19464
8 2.15e-05 0.30 0.47628 0.17683 1.17405
9 4.64e-05 0.30 0.44719 0.09670 1.21009
10 2.15e-02 0.42 0.06545 0.10499 1.22554
11 4.64e-05 0.32 0.39630 0.11052 1.22554
12 1.00e-05 0.30 0.49082 0.19341 1.15860
13 2.15e-01 0.42 0.38175 0.20446 1.07106
14 4.64e-03 0.42 0.26541 0.20446 1.18949
15 1.00e-04 0.32 0.39630 0.01934 1.24614
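The arithmetic behind the glwd, glhd and drhd columns is a simple subtraction, shown here for the first row (s = 0.1, md = 0.44). The optima are the first entries of the three top-15 lists; the compromise figures are the three-digit values printed in the merged table, so the result agrees with the tabulated differences only to about two decimal places.

```r
# Lowest percent error found for each data set.
optimum <- c(glw = 1.156153, glh = 0.8206007, drh = 0.8599382)

# Percent error at the compromise point s = 0.1, md = 0.44 (as printed, 3 digits).
compromise <- c(glw = 1.35, glh = 0.892, drh = 1.71)

# Amount by which the compromise exceeds each individual optimum.
excess <- compromise - optimum
print(round(excess, 2))
```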
The following R scripts were used in data reduction in this experiment. Other (non-R) scripts needed are included as comments in the smindev.R listing (the runex script shown is one that uses bogofilter for all its classification work, since apclass is not generally available). The setR script is used to render smindev.R and mergeparm.R executable, with parameters, from the command line.
#! /bin/sh /usr/bin/setR
# This file is smindev.R, an R script to perform the data reduction for
# experiments in which min_dev and s are varied over a wide range of
# values; for each combination, a spam cutoff is first determined such
# that there are no more than some target number of false positives when
# nonspam files are pooled and evaluated, and with that cutoff, the number
# of false negatives is determined for each spam file.
# The following script distributes the messages into four files, t for
# training and r0, r1 and r2 for the experiment.
# It's run from formail as follows:
# cat [list of spam mbox files] | formail -s ./distrib sp
# cat [list of nonspam mboxen] | formail -s ./distrib ns
# #! /bin/sh
# # distrib - deal messages from an mbox into files
# # usage: FILENO=0 formail -s ./distrib extension < mbox
#
# # put names of files to be produced into this array
# FILE=(t.$1 r0.$1 t.$1 r1.$1 t.$1 r2.$1)
#
# # no user serviceable parts beyond this point
# let n=${FILENO}%${#FILE[*]}
# fname=${FILE[$n]}
# cat >>$fname
# In the experiment for which this script was written, extra files
# were added to the training mailboxes after the distrib script had
# been run:
# $ ./sizes
# ns 3502 3502 3501
# sp 5667 5667 5666
# The training database is built with the command
# bogofilter -d db -s &1 | \
# perl -e ' $target = 10; while (<>) { ' \
# -e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
# -e ' die "dainbramage" unless scalar @diffs > 15;' \
# -e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
# -e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
# -e ' printf("%8.6f %d",1.0-$s[$target],$target-1);'`
# }
#
# function wrapper () {
# mopt=$1; oopt=$2; shift; v=-v
# res=`cat $* | formail -s bogofilter -d db -m $mopt -o $oopt -v -t 2>&1 | \
# grep -c $v '^1'`
# }
#
# sizes >parms.tbl
# for s in 1e-2 3.2e-3 1e-3 3.2e-4 1e-4 3.2e-5 1e-5 3.2e-6 1e-6 3.2e-7 \
# 1e-7 3.2e-8 1e-8; do
# for md in `seq 0.025 0.025 0.47501`; do
# echo -n "$s $md fpos... "
# getco $md,$s 0.1 r0.ns r1.ns r2.ns
# fpos=${res##* }; co=${res%% *}; let fpos=$fpos/3
# echo -n "$fpos at cutoff $co, run0... "
# run=0; wrapper $md,$s $co r0.sp; fneg=$res
# echo "$s $md $co $run $fpos $fneg" >>parms.tbl
# echo -n "$fneg, run1... "
# run=1; wrapper $md,$s $co r1.sp; fneg=$res
# echo "$s $md $co $run $fpos $fneg" >>parms.tbl
# echo -n "$fneg, run2... "
# run=2; wrapper $md,$s $co r2.sp; fneg=$res
# echo "$s $md $co $run $fpos $fneg" >>parms.tbl
# echo $fneg
# done
# done
# /**/
# For use in R, parms.tbl from runex is expected in bogolog/smindev.tbl
# by default; otherwise run ./smindev.R filename
graphics.off(); setwd("/proj/Rwork")
if(length(argv) > 0) fn <- argv[1] else fn <- "bogolog/smindev.tbl"
if(file.exists(fn) == FALSE) stop(paste("file", fn, "not found"))
if(length(argv) > 1) sub <- argv[2] else sub <- "--"
### First read the message counts:
read.table(fn, nrows=2) -> meta
msgcount <- sum(apply(meta[,2:length(meta)],1,mean))
### Now read the data
parms <- read.table(fn, col.names=c("s", "md", "cutoff", "run", "fp", "fn"),
skip=2)
### Get axis values and number of replicates
sval <- sort(unique(parms$s), decreasing=TRUE)
x <- -log10(sval)
y <- sort(unique(parms$md))
n <- length(unique(parms$run))
### Express error in percentage and perform an anova
parms$percent = (parms$fp + parms$fn) * 100 / msgcount
parms$s = factor(parms$s)
parms$md = factor(parms$md)
paov <- aov(percent ~ s + md + s*md, data=parms)
print(summary(paov))
### Now express results as mean percent error over the runs
pcs <- array(parms$percent, dim=c(n, length(parms$percent)/n))
meanerr <- apply(pcs, 2, mean)
cutoffs <- array(parms$cutoff, dim=c(n, length(parms$cutoff)/n))[1,]
### Create a data frame with these results
parmres <- data.frame(rs=rep(sval, each=length(parms$s)/(n*length(sval))),
md=rep(y,length(sval)), cutoff=cutoffs, percent=meanerr)
### calculate the z-axis values for percent correct and for cutoffs
z <- t(array(100 - parmres$percent, dim=c(length(parms$md)/(n * length(sval)),
length(sval))))
co <- t(array(parmres$cutoff, dim=c(length(parms$md)/(n * length(sval)),
length(sval))))
X11(width=4.5, height=4.5)
### produce a trial plot, making it easy to try other rotation values
pplot <- function(th,ph) {
persp(x, y, z, ticktype="detailed", theta=th, phi=ph,
main="Percent correct vs s and mindev", sub=sub,
xlab="-log(10) s", ylab="mindev", zlim=c(90,100),
zlab="percent correct", shade=0.6, border=4, r=sqrt(2), d=2.5)
}
pplot(70,15)
### another trial plot, this time for cutoffs
X11(width=4.5, height=4.5)
qplot <- function(th,ph) {
persp(x, y, co, ticktype="detailed", theta=th, phi=ph,
main="Cutoff vs s and mindev", sub=sub, xlab="-log(10) s",
ylab="mindev", zlab="cutoff", shade=0.6, border=4, r=sqrt(2), d=2.5)
}
qplot(70,15)
### get the 15 best combinations of s and mindev
sortlist <- sort(parmres$percent, index.return=TRUE)
system("echo")
print(parmres[sortlist$ix,][1:15,])
#! /bin/sh /usr/bin/setR
# mergeparm.R -- find common optima
graphics.off(); setwd("/proj/Rwork")
if(length(argv) < 3) {
read.table("bogolog/parm21glw") -> glw
read.table("bogolog/parm08glh") -> glh
read.table("bogolog/parm06drh") -> drh
sub <- "GL-w / GL-h / DR-h, best 150"
} else {
glw <- read.table(argv[1])
glh <- read.table(argv[2])
drh <- read.table(argv[3])
if(length(argv) > 3) sub <- argv[4] else sub <- "--"
}
# Get the individual md and rs values and make rs and md columns for
# the data frame to be written to parmres.merge
md <- sort(unique(drh$md))
rs <- sort(unique(drh$rs), decreasing=TRUE)
prs <- rep(rs, each=length(md))
pmd <- rep(md,length(rs))
# Create vectors of percentages and cutoffs for data points in common
drhpc <- seq(1,length(drh$rs))
drhco <- drhpc
glwpc <- drhpc
glwco <- drhpc
glhpc <- drhpc
glhco <- drhpc
for(i in 1:length(drh$rs)) {
for(j in 1:length(prs)) {
if(drh$rs[i] == prs[j] && drh$md[i] == pmd[j]) {
drhpc[j] <- drh$percent[i]
drhco[j] <- drh$cutoff[i]
}
}
}
for(i in 1:length(glw$rs)) {
for(j in 1:length(prs)) {
if(glw$rs[i] == prs[j] && glw$md[i] == pmd[j]) {
glwpc[j] <- glw$percent[i]
glwco[j] <- glw$cutoff[i]
}
}
}
for(i in 1:length(glh$rs)) {
for(j in 1:length(prs)) {
if(glh$rs[i] == prs[j] && glh$md[i] == pmd[j]) {
glhpc[j] <- glh$percent[i]
glhco[j] <- glh$cutoff[i]
}
}
}
# Create and write the merged table
p <- data.frame(rs=prs, md=pmd, glwpc=glwpc, glwco=glwco,
glhpc=glhpc, glhco=glhco, drhpc=drhpc, drhco=drhco)
sink("bogolog/parmres.merge"); print(p); sink()
# Rank the percentage figures and add columns of ranks
sglw <- sort(p$glwpc, index.return=TRUE)
sglh <- sort(p$glhpc, index.return=TRUE)
sdrh <- sort(p$drhpc, index.return=TRUE)
rsglw <- sort(sglw$ix, index.return=TRUE)
rsglh <- sort(sglh$ix, index.return=TRUE)
rsdrh <- sort(sdrh$ix, index.return=TRUE)
p$glwr <- rsglw$ix
p$glhr <- rsglh$ix
p$drhr <- rsdrh$ix
# Get the maximum (lowest) rank and the greatest difference in rank
# for each rs/md pair
p$maxr <- pmax(p$glwr, p$glhr, p$drhr)
dif1 <- abs(p$glwr - p$glhr)
dif2 <- abs(p$glwr - p$drhr)
dif3 <- abs(p$glhr - p$drhr)
p$difr <- pmax(dif1, dif2, dif3)
# Sort by rank, and within rank, by difference
sortrank <- sort(p$maxr + p$difr / 10000, index.return=TRUE)
# Create a table in sortrank order, copy and print the top 150 records
ranka <- p[sortrank$ix,]
rank150 <- ranka[1:150,]
print(rank150,digits=3)
# Make x, y and z and do thermal and perspective plots
x <- sort(unique(p$md))
sval <- sort(unique(p$rs))
y <- log10(sval)
z <- array(dim=c(length(y),length(x)))
for(i in seq(along=x)) {
for(j in seq(along=y)) {
for(k in 1:length(rank150$rs)) {
if((rank150$rs[k] == sval[j]) && (rank150$md[k] == x[i])) z[j,i] <-
rank150$maxr[k]
}
}
}
X11(width=3.5,height=3.5)
image(x,y,t(z),col=heat.colors(15),
main="Discrimination (darker is better) vs s and mindev",
sub=sub,ylab="log(10) s", xlab="mindev")
X11(width=3.5,height=3.5)
persp(x,y,300-t(z),ticktype="detailed",shade=0.6, phi=15, theta=-20,
expand=0.7, border=4,r=sqrt(2),d=2.5,xlab="mindev",ylab="log(10) s",
zlab="rank",main="Discrimination vs. s and mindev", sub=sub)
[© Greg Louis, 2003; last modified 2003-04-19]