robx=0.415 robs=0.01 min_dev=0.1 spam_cutoff=0.95
Four corpora were used in the present evaluation. They were provided by Andrew Partan, David Relson, Consultronics Limited (to all of whom many thanks) and myself. Each was supplied in message-count format, together with the wordlist from which the counts had been generated.
source robx min_dev robs spam_cutoff ham_cutoff
ap 0.514174 0.355 0.0562 0.9978 0.450
dr 0.531589 0.465 0.0562 0.7037 0.167
csl 0.500000 0.035 0.0562 0.9946 0.100
gl 0.431615 0.110 0.1778 0.6515 0.100
There were too many messages in the corpora for me to run them all in a single invocation of bogotune, so I wrote another script to distribute them ("deal them out") into three separate spam files and three separate nonspam files, each sampled in proportion from the original four corpora. This provided an opportunity to check whether bogotune, when fed very similar message corpora, would produce very similar recommendations.
The merged wordlist contained 8,571,796 individual tokens; its message counts were 190,852 spam and 226,204 nonspam. It occupied 271 megabytes.
A discussion on the bogofilter mailing list had raised the issue that for beginning users with small training databases, it might be wise to restrict the range of x so that it would always fall within 0.5 plus or minus min_dev. A one-line change to bogotune.c imposed this restriction. (The one-line change didn't completely correct outlier detection, but in this experiment that problem wasn't encountered.)
The three merged corpora produced the following three sets of bogotune recommendations:
robx=0.520000 min_dev=0.375 robs=0.0100 spam_cutoff=0.9755 # for 0.47% fpos (477); expect 2.51% fneg (2049). ham_cutoff=0.450 robx=0.530000 min_dev=0.375 robs=0.0178 spam_cutoff=0.9753 # for 0.49% fpos (506); expect 1.87% fneg (1524). ham_cutoff=0.450 robx=0.510000 min_dev=0.345 robs=0.0178 spam_cutoff=0.9752 # for 0.51% fpos (517); expect 1.78% fneg (1451). ham_cutoff=0.450
With these new parameters, then, the individual corpora with their own wordlists were re-evaluated. In the following table, the fp and fn columns present percentages of false positives and false negatives achieved with the new parameters; the ofp and ofn columns present the corresponding percentages obtained with the individual bogotuned parameters:
source fp fn ofp ofn spam nonspam
ap 0.008 0.39 0.01 0.47 76,324 132,849
dr 0.030 4.87 0.05 2.74 56,172 60,462
csl 0.180 5.45 0.20 3.59 60,120 62,312
gl 0.013 1.19 0.01 0.99 52,127 51,951
robx=0.52 min_dev=0.375 robs=0.0178 spam_cutoff=0.99 ham_cutoff=0 # (or 0.45 if we want to make ternary outcomes the default).
Script to merge wordlists
Follow with bogoutil -l wordlist.db < wordlist.counts
#! /usr/bin/perl
# mergedb -- merge wordlist files named on command line, output counts
my $file, %spcounts, %nscounts;
my $token, $nsp, $nns, $junk;
foreach $file (@ARGV) {
next unless open(DUMP, "bogoutil -d $file |");
while() { # that's "while(<DUMP>)" if you download from here
($token, $nsp, $nns, $junk) = split;
$spcounts{$token} += $nsp;
$nscounts{$token} += $nns;
}
close DUMP;
}
open(LOAD, ">wordlist.count") || die "couldn't open wordlist.count";
foreach $token (sort keys %spcounts) {
printf(LOAD "%s %d %d\n", $token, $spcounts{$token}, $nscounts{$token})
|| die "couldn't write $token to wordlist.count";
}
close LOAD || die "couldn't close wordlist.count";
exit 0;
Script to recalculate token counts in message-count files
#! /bin/sh
# Usage: mergemc path/to/dir/containing/wordlist.db mc.file.to.merge ...
# reads stdin if no mc files named, outputs to stdout
export BOGOFILTER_DIR=$1
shift
if [ ! -r $BOGOFILTER_DIR/wordlist.db ]; then
echo "$BOGOFILTER_DIR/wordlist.db not found"
exit 1
fi
cat $* | awk 'BEGIN {FS="\""; n=0} \
{if($2 != ".MSG_COUNT") print $2; else if(n++ > 0) print ""}' | \
blookup
In the above script, the blookup command does the most important part of the work. Some time ago, I had split bogofilter into three parallel processes, a lexer, a database lookup module and a classifier (I didn't implement registration). The lookup module takes a stream of tokens, one per line, with a blank line separating messages; looks up message counts for the tokens; and outputs a message-count file to standard output. Code (an ugly hack of bogofilter's then-current database lookup) is available on request.
Script to deal out messages from message-count or mbox files
#! /usr/bin/perl
sub usage {
print("NAME\n",
" distrib -- distribute from mbox or message-count files\n",
"SYNOPSIS\n",
" distrib destfile ... -(b|c) [sourcefile ...]\n",
" distrib -h\n",
"DESCRIPTION\n",
" distrib deals out message text from either a msg-count file\n",
" or an mbox file, writing messages to each of the specified\n",
" destination files in turn until all the source messages have\n",
" been distributed.\n",
"COMMAND LINE PARAMETERS\n",
" destfile File into which to write messages.\n",
" -b Source is in mbox format.\n",
" -c Source is in message-count format.\n",
" The -b and -c options are exclusive,\n",
" and one of them must be given.\n",
" sourcefile File from which to read messages. If\n",
" no source file is supplied, distrib\n",
" reads from stdin.\n",
" -h Display help.\n");
exit(0);
}
my $arg, $pat, $file, $writing, @files;
for ($arg = shift @ARGV; $#ARGV >= 0 && ($arg ne "-b" or $arg eq "-c");
$arg = shift @ARGV) {
usage if ($arg eq "-h");
push @files, $arg;
}
usage unless ($#files >= 0 and ($arg eq "-b" or $arg eq "-c"));
$pat = $arg eq "-c" ? "\.MSG_COUNT" : "^From ";
$writing = 0; $file = 0;
while (<>) {
if (/$pat/) {
if ($writing) { close OUTF; }
$writing = 1;
open (OUTF, ">>$files[$file]") || die "couldn't open $files[$file]";
if (++$file > $#files) { $file = 0; }
}
if ($writing) { print OUTF "$_"; }
}
if ($writing) { close OUTF; }
Patch to bogotune.c to limit x within 0.5 plus or minus min_dev:
--- bogotune.c 2004-03-20 08:48:26.000000000 -0500
+++ bogotune.c.limit-x 2004-03-20 08:48:26.000000000 -0500
@@ -1475,6 +1475,7 @@
for (rxi = 0; rxi < rxval->cnt; rxi += 1) {
uint fp, fn;
robx = rxval->data[rxi];
+ if(fabs(robx - 0.5) > min_dev) { r_count--; continue; }
/* save parms */
r = &results[cnt++];
The database merge required a gigabyte of RAM and even then it took up 600 Mb of swap as well, and slowed down considerably due to thrashing. Pavel Kankovsky (peak at argo dot troja dot mff dot cuni dot cz) kindly supplied this alternative and much more efficient version (though he warns that it's only been lightly tested):
#!/usr/bin/perl
use IO::File;
@pipes = ();
@pibuf = ();
sub fetch($)
{
my $i = $_[0];
my $io = $pipes[$i];
unless ($io->opened) {
$pibuf[$i] = undef;
return;
}
my $line = $io->getline;
if (defined $line) {
my ($token, $nsp, $nns, $junk) = split(/\s+/, $line);
die("unsorted bogoutil output on ", $ARGV[$i], "\n")
if (defined($pibuf[$i]) && $token le $pibuf[$i]->{token});
$pibuf[$i] = { token=>$token, nsp=>$nsp, nns=>$nns };
}
else {
$io->close;
die("bogoutil failed on ", $ARGV[$i], ": $?\n")
unless ($? == 0);
$pibuf[$i] = undef;
}
}
foreach my $file(@ARGV) {
die("$file: not found or not a regular file\n")
unless (-f $file);
my $pipe = new IO::File("bogoutil -d $file |");
die("bogoutil failed on $file\n")
unless defined($pipe);
push(@pipes, $pipe);
}
foreach my $i(0..$#pipes) {
fetch($i);
}
for (;;) {
my $min = -1;
foreach my $i(0..$#pipes) {
next unless defined($pibuf[$i]);
my $cmp = $min < 0 ? -1 :
$pibuf[$i]->{token} cmp $pibuf[$min]->{token};
if ($cmp < 0) {
$min = $i;
}
elsif ($cmp == 0) {
$pibuf[$min]->{nsp} += $pibuf[$i]->{nsp};
$pibuf[$min]->{nns} += $pibuf[$i]->{nns};
fetch($i);
}
}
last if ($min < 0);
printf("%s %d %d\n",
$pibuf[$min]->{token},
$pibuf[$min]->{nsp},
$pibuf[$min]->{nns});
fetch($min);
}