Default Parameters for Bogofilter

Dedicated to my little sister, who turned 50 today.

Introduction and general description:

Up through bogofilter 0.17.4, default values for parameters s (robs), x (robx), the minimum deviation (min_dev) and the score above which a message is deemed to be spam (spam_cutoff) have been supplied based on relatively unsystematic, anecdotal information obtained in the early days of bogofilter development.  It's more or less a matter of luck that these default values (shown in the table below) have worked fairly well for many users.
robx=0.415
robs=0.01
min_dev=0.1
spam_cutoff=0.95
Two bogofilter developers started to feel that it might be worthwhile to try to find more optimal, generally applicable default parameters for the bogofilter distribution.  Whether this could be done wasn't obvious from the outset: it seemed possible that no truly general parameter set exists, and that some users will always find whatever defaults we choose to be inappropriate to their message populations. Nevertheless, here is an experiment to start the search.  The general idea was to obtain several large collections of messages with their associated wordlists, to optimize parameters individually and collectively for these corpora, and then compare results obtained with the collectively and individually optimized parameter sets.

Four corpora were used in the present evaluation.  They were provided by Andrew Partan, David Relson, Consultronics Limited (to all of whom many thanks) and myself.  Each was supplied in message-count format, together with the wordlist from which the counts had been generated.

Procedure and results:

Each corpus was first submitted to a bogotune run to determine optimum parameters for that corpus.  The individual parameter optima were quite different:
 source      robx min_dev   robs spam_cutoff ham_cutoff
     ap  0.514174   0.355 0.0562      0.9978      0.450
     dr  0.531589   0.465 0.0562      0.7037      0.167
    csl  0.500000   0.035 0.0562      0.9946      0.100
     gl  0.431615   0.110 0.1778      0.6515      0.100
Scripts were written to merge the wordlists and to recalculate message counts in message-count files.  These appear in the Appendix.  The four wordlists were merged, and each message-count file was processed to generate a copy with updated token counts based on the merged wordlist.

There were too many messages in the corpora for me to run them all in a single invocation of bogotune, so I wrote another script to distribute them ("deal them out") into three separate spam files and three separate nonspam files, each sampled in proportion from the original four corpora.  This provided an opportunity to check whether bogotune, when fed very similar message corpora, would produce very similar recommendations.

The merged wordlist contained 8,571,796 individual tokens; its message counts were 190,852 spam and 226,204 nonspam.  It occupied 271 megabytes.

A discussion on the bogofilter mailing list had raised the issue that for beginning users with small training databases, it might be wise to restrict the range of x so that it would always fall within 0.5 plus or minus min_dev.  A one-line change to bogotune.c imposed this restriction.  (The one-line change didn't completely correct outlier detection, but in this experiment that problem wasn't encountered.)

The three merged corpora produced the following three sets of bogotune recommendations:

robx=0.520000
min_dev=0.375
robs=0.0100
spam_cutoff=0.9755      # for 0.47% fpos (477); expect 2.51% fneg (2049).
ham_cutoff=0.450        

robx=0.530000
min_dev=0.375
robs=0.0178
spam_cutoff=0.9753      # for 0.49% fpos (506); expect 1.87% fneg (1524).
ham_cutoff=0.450        

robx=0.510000
min_dev=0.345
robs=0.0178
spam_cutoff=0.9752      # for 0.51% fpos (517); expect 1.78% fneg (1451).
ham_cutoff=0.450        
I chose to go by majority, so the parameters I used to test the individual corpora again were x=0.52, min_dev=0.375 and s=0.0178.  Obviously the recommended spam cutoffs were too low; after a bit of trial and error I chose 0.99, which gave false-positive counts reasonably similar to those seen in the individual optimization runs (these data appear in the table below).

With these new parameters, then, the individual corpora with their own wordlists were re-evaluated.  In the following table, the fp and fn columns present percentages of false positives and false negatives achieved with the new parameters; the ofp and ofn columns present the corresponding percentages obtained with the individual bogotuned parameters:

 source    fp   fn   ofp  ofn    spam nonspam
     ap 0.008 0.39  0.01 0.47  76,324 132,849
     dr 0.030 4.87  0.05 2.74  56,172  60,462
    csl 0.180 5.45  0.20 3.59  60,120  62,312
     gl 0.013 1.19  0.01 0.99  52,127  51,951  
Interestingly, the common parameter set actually proved better than what bogotune came up with for the ap messages alone.  For the dr and csl corpora, the difference was minor (and could likely be almost eliminated by adjusting the cutoff slightly).  Only my own mail's results were notably improved when parameters determined with that corpus were applied.

Conclusions:

  1. It may be acceptable to set bogofilter's default parameters as follows:
    robx=0.52
    min_dev=0.375
    robs=0.0178
    spam_cutoff=0.99
    ham_cutoff=0 # (or 0.45 if we want to make ternary outcomes the default).
    
  2. For users with large corpora of their own, bogotune may (but is not guaranteed to) help them do significantly better than these defaults.
  3. When truly similar message corpora are fed to it, bogotune gives quite comparable recommendations.

Appendix:

Script to merge wordlists 
Follow with bogoutil -l wordlist.db < wordlist.counts

#! /usr/bin/perl
#  mergedb -- merge wordlist files named on command line, output counts
my $file, %spcounts, %nscounts;
my $token, $nsp, $nns, $junk;
foreach $file (@ARGV) {
    next unless open(DUMP, "bogoutil -d $file |");
    while() {  # that's "while(<DUMP>)" if you download from here
	($token, $nsp, $nns, $junk) = split;
	$spcounts{$token} += $nsp;
	$nscounts{$token} += $nns;
    }
    close DUMP;
}
open(LOAD, ">wordlist.count") || die "couldn't open wordlist.count";
foreach $token (sort keys %spcounts) {
    printf(LOAD "%s %d %d\n", $token, $spcounts{$token}, $nscounts{$token})
	|| die "couldn't write $token to wordlist.count";
}
close LOAD || die "couldn't close wordlist.count";
exit 0;

Script to recalculate token counts in message-count files

#! /bin/sh
#  Usage: mergemc path/to/dir/containing/wordlist.db mc.file.to.merge ...
#         reads stdin if no mc files named, outputs to stdout
export BOGOFILTER_DIR=$1
shift
if [ ! -r $BOGOFILTER_DIR/wordlist.db ]; then
    echo "$BOGOFILTER_DIR/wordlist.db not found"
    exit 1
fi
cat $* | awk 'BEGIN {FS="\""; n=0} \
    {if($2 != ".MSG_COUNT") print $2; else if(n++ > 0) print ""}' | \
    blookup

In the above script, the blookup command does the most important part of the work.  Some time ago, I had split bogofilter into three parallel processes, a lexer, a database lookup module and a classifier (I didn't implement registration).  The lookup module takes a stream of tokens, one per line, with a blank line separating messages; looks up message counts for the tokens; and outputs a message-count file to standard output. Code (an ugly hack of bogofilter's then-current database lookup) is available on request.


Script to deal out messages from message-count or mbox files

#! /usr/bin/perl
sub usage {
    print("NAME\n",
	"  distrib -- distribute from mbox or message-count files\n",
	"SYNOPSIS\n",
	"  distrib destfile ... -(b|c) [sourcefile ...]\n",
	"  distrib -h\n",
	"DESCRIPTION\n",
	"  distrib deals out message text from either a msg-count file\n",
	"  or an mbox file, writing messages to each of the specified\n",
	"  destination files in turn until all the source messages have\n",
	"  been distributed.\n",
	"COMMAND LINE PARAMETERS\n",
	"  destfile    File into which to write messages.\n",
	"  -b          Source is in mbox format.\n",
	"  -c          Source is in message-count format.\n",
	"              The -b and -c options are exclusive,\n",
	"              and one of them must be given.\n",
	"  sourcefile  File from which to read messages.  If\n",
	"              no source file is supplied, distrib\n",
	"              reads from stdin.\n",
	"  -h          Display help.\n");
    exit(0);
}
my $arg, $pat, $file, $writing, @files;
for ($arg = shift @ARGV; $#ARGV >= 0 && ($arg ne "-b" or $arg eq "-c");
    $arg = shift @ARGV) {
    usage if ($arg eq "-h");
    push @files, $arg;
}
usage unless ($#files >= 0 and ($arg eq "-b" or $arg eq "-c"));
$pat = $arg eq "-c" ? "\.MSG_COUNT" : "^From ";
$writing = 0; $file = 0;
while (<>) {
    if (/$pat/) {
	if ($writing) { close OUTF; }
	$writing = 1;
	open (OUTF, ">>$files[$file]") || die "couldn't open $files[$file]";
	if (++$file > $#files) { $file = 0; }
    }
    if ($writing) { print OUTF "$_"; }
}
if ($writing) { close OUTF; }

Patch to bogotune.c to limit x within 0.5 plus or minus min_dev:

--- bogotune.c  2004-03-20 08:48:26.000000000 -0500
+++ bogotune.c.limit-x  2004-03-20 08:48:26.000000000 -0500
@@ -1475,6 +1475,7 @@
                for (rxi = 0; rxi < rxval->cnt; rxi += 1) {
                    uint fp, fn;
                    robx = rxval->data[rxi];
+                   if(fabs(robx - 0.5) > min_dev) { r_count--; continue; }
 
                    /* save parms */
                    r = &results[cnt++];

The database merge required a gigabyte of RAM and even then it took up 600 Mb of swap as well, and slowed down considerably due to thrashing.  Pavel Kankovsky (peak at argo dot troja dot mff dot cuni dot cz) kindly supplied this alternative and much more efficient version (though he warns that it's only been lightly tested):


#!/usr/bin/perl

use IO::File;

@pipes = ();
@pibuf = ();

sub fetch($)
{
  my $i = $_[0];
  my $io = $pipes[$i];
  unless ($io->opened) {
    $pibuf[$i] = undef;
    return;
  }
  my $line = $io->getline;
  if (defined $line) {
    my ($token, $nsp, $nns, $junk) = split(/\s+/, $line);
    die("unsorted bogoutil output on ", $ARGV[$i], "\n")
        if (defined($pibuf[$i]) && $token le $pibuf[$i]->{token});
    $pibuf[$i] = { token=>$token, nsp=>$nsp, nns=>$nns };
  }
  else {
    $io->close;
    die("bogoutil failed on ", $ARGV[$i], ": $?\n")
        unless ($? == 0);
    $pibuf[$i] = undef;
  }
}

foreach my $file(@ARGV) {
  die("$file: not found or not a regular file\n")
      unless (-f $file);
  my $pipe = new IO::File("bogoutil -d $file |");
  die("bogoutil failed on $file\n")
      unless defined($pipe);
  push(@pipes, $pipe);
}

foreach my $i(0..$#pipes) {
  fetch($i);
}

for (;;) {
  my $min = -1;
  foreach my $i(0..$#pipes) {
    next unless defined($pibuf[$i]);
    my $cmp = $min < 0 ? -1 :
        $pibuf[$i]->{token} cmp $pibuf[$min]->{token};
    if ($cmp < 0) {
      $min = $i;
    }
    elsif ($cmp == 0) {
      $pibuf[$min]->{nsp} += $pibuf[$i]->{nsp};
      $pibuf[$min]->{nns} += $pibuf[$i]->{nns};
      fetch($i);
    }
  }
  last if ($min < 0);
  printf("%s %d %d\n",
         $pibuf[$min]->{token},
         $pibuf[$min]->{nsp},
         $pibuf[$min]->{nns});
  fetch($min);
}
Greg Louis, 2004; posted 2004-03-24, last modified 2004-03-25]