Bogofilter Subject Tagging

Introduction and general description:

Subject tagging refers to putting subj: in front of each token taken from a Subject: header.  The idea is that some tokens may be more characteristic when found in a Subject: header than they are when they occur in the body of a message.

The purpose of this experiment is to determine whether subject tagging actually helps discrimination between spam and nonspam messages.  It was anticipated that there may be some interaction with the minimum deviation (threshold below which a token may be ignored as having a probability too close to 0.5).

Procedure and Results:

I run bogofilter with the Robinson-Fisher method of calculation (described in Appendix A below), in which values of a guess a priori and a weight parameter for that guess are used in calculating individual token-probability estimates, and Fisher's method of combining probabilities is applied in calculation of the overall message "spamicity."

The experimental design is factorial: tagging (two levels, without and with) and min_dev (five levels, 0.025 to 0.125 in steps of 0.025).  For this experiment I used a training corpus of 19,823 nonspams and 21,969 spam emails.  Two test runs were performed, each with 3,204 nonspams and 5,948 spams.  For each of the ten experimental configurations, the nonspams were first classified to determine a spam-cutoff level that would give four false positives (nonspams classified as spam).  That cutoff was then used to determine the number of false negatives (spams classified as nonspam or uncertain) for the given configuration in each of the two runs. The results appear in the following table and in the graph below.

   tagging mindev  cutoff run fp  fn
1    notag  0.025 0.99169   0  4 241
2    notag  0.025 0.99169   1  4 266
3    notag   0.05 0.99109   0  4 246
4    notag   0.05 0.99109   1  4 262
5    notag  0.075 0.98480   0  4 225
6    notag  0.075 0.98480   1  4 241
7    notag    0.1 0.99444   0  4 255
8    notag    0.1 0.99444   1  4 284
9    notag  0.125 0.99643   0  4 284
10   notag  0.125 0.99643   1  4 304
11     tag  0.025 0.98510   0  4 216
12     tag  0.025 0.98510   1  4 222
13     tag   0.05 0.98150   0  4 214
14     tag   0.05 0.98150   1  4 214
15     tag  0.075 0.98750   0  4 224
16     tag  0.075 0.98750   1  4 234
17     tag    0.1 0.99717   0  4 271
18     tag    0.1 0.99717   1  4 296
19     tag  0.125 0.99765   0  4 292
20     tag  0.125 0.99765   1  4 309
Analysing the variance found in this set of results, we see that tagging, the minimum deviation, the message composition in the run, and the interaction between tagging and the minimum deviation are all highly significant:
               Df  Sum Sq Mean Sq F value    Pr(>F)
tagging         1   672.8   672.8  16.401 0.0028858 **
mindev          4 14463.5  3615.9  88.144 3.263e-07 ***
run             1  1344.8  1344.8  32.782 0.0002849 ***
tagging:mindev  4  2371.7   592.9  14.454 0.0005913 ***
Residuals       9   369.2    41.0                      
From the data, and more easily from the graph to follow, it may be seen that tagging has a beneficial effect provided a low minimum deviation is applied. Since the second run generally had more false negatives than the first, and since that effect is statistically significant, the two runs are plotted separately (solid and dashed lines) instead of being treated as replicates.

graph

At minimum deviations of 0.1 and above, subject tagging had a small deleterious effect on discrimination, possibly by withdrawing a number of tokens from consideration (moving them below the minimum-deviation threshold).  When low-deviation tokens were permitted to participate in the decision, however, tagging produced better discrimination than could be achieved without it.

Conclusions:

Subject tagging appeared to provide a noticeable improvement in discrimination, but this effect depends on the minimum deviation in effect when subject tagging is used.  I don't think I'm willing to speculate, at this point, on why the effect is greatest among the less characteristic tokens (those that are ignored at higher minimum deviation values).

Appendix A: Calculation methods:

Robinson's approach involves calculating a value of f(w) for each unique token in a message, based on the numbers of times that the token has been encountered in spam during training (badcount) and in nonspam (goodcount), scaled by the numbers of messages used to build the training database (bad- or goodlist_messagecount):
                     (badcount/badlist_messagecount)
p(w) = -----------------------------------------------------------------
       (badcount/badlist_messagecount + goodcount/goodlist_messagecount)

n = badcount + goodcount

f(w) = (s * x + n * p(w)) / (s + n)
An alternative used in some implementations, that gives the same result, is:
scalefactor = badlist_messagecount / goodlist_messagecount

f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
The scale factor is the ratio of the number of messages used to make up the list of spam tokens to the number of messages used to make up the list of nonspam tokens.  The f(w) value to use when an unknown token is encountered is represented by x, and the degree to which x should have weight in the calculation of f(w) when a token has been seen only a few times before is determined by parameter s.  The number of unique tokens in the message is represented below by N.  Not obvious, but implicit in the second formula, is the replacement of n by n' = badcount + goodount * scalefactor; this should have a negligible effect, unless the training set is extremely lopsided.

Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2N degrees of freedom:

P = prbx(-2 * sum(ln(1-f(w))), 2*N)
Q = prbx(-2 * sum(ln(f(w))), 2*N)
S = (1 + Q - P) / 2

Appendix B: Notes on experimental procedure:

The following notes should suffice if anyone wishes to repeat the experiment:

The experimental design is factorial: tagging (two levels, without and
with) and min_dev (five levels, 0.025 to 0.125 in steps of 0.025).

The following scripts were used to distribute messages into test and
training files:

# mkdb
echo -n "distributing spam... "
bzip2 -dc ../mail/corpus*.bad.bz2 | FILENO=0 formail -s ./distrib sp
echo done
echo -n "training on uncertain spam... "
bzip2 -dc ../mail/corpus*.sptrain.bz2 >>t.sp
echo done
echo -n "distributing nonspam... "
bzip2 -dc ../mail/corpus0*.good.bz2 | FILENO=0 formail -s ./distrib ns
echo done
echo -n "training on additional and uncertain nonspam... "
bzip2 -dc  ../mail/corpus1223.good.bz2 ../mail/corpus*.nstrain.bz2 >>t.ns
echo done
echo building nontagged wordlists
./bogonotag -v -s -d dbnotag < t.sp
./bogonotag -v -n -d dbnotag < t.ns
bogoutil -w dbnotag .MSG_COUNT
echo building tagged wordlists
./bogotag -v -s -d dbtag < t.sp
./bogotag -v -n -d dbtag < t.ns
bogoutil -w dbtag .MSG_COUNT
echo all done

# distrib
FILE=(t.$1 r0.$1 t.$1 r1.$1 t.$1)
let n=${FILENO}%5
fname=${FILE[$n]}
cat >>$fname

bogotag and bogonotag are bogofilter, compiled from cvs checked out on
2003-02-13 at 1504, with modifications as per patch file patch-bogl0213
(available on request).  The default for the tag option in the two
binaries is on and off respectively.

The corpora and training database had the following properties:

$ bogoutil -w dbnotag .MSG_COUNT
                       spam   good
.MSG_COUNT            21969  19823
$ bogoutil -w dbtag .MSG_COUNT
                       spam   good
.MSG_COUNT            21969  19823
$ grep -c '^From ' *.??
r0.ns:3204
r0.sp:5948
r1.ns:3204
r1.sp:5948
t.ns:19823
t.sp:21969

The test runs were conducted according to the following script:

#! /bin/bash
#  runex for studying effects of subject tagging and mindev

function getco () {
  mopt=$1; bogo=$2; shift 2
  res=`cat $* | formail -s ./bogo$bogo -d db$bogo -m $mopt -v 2>&1 | \
    perl -e ' $target = 10; while (<>) { ' \
	 -e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
	 -e ' die "dainbramage" unless scalar @diffs > 15;' \
         -e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
	 -e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
	 -e ' printf("%8.6f %d",1.0-$s[$target],$target-1);'`
}

function wrapper () {
  mopt=$1; bogo=$2; shift 2; v=-v
  res=`cat $* | formail -s ./bogo$bogo -d db$bogo -m $mopt -v 2>&1 | \
    grep -c $v '^1'`
}

for mode in notag tag; do
  for md in 0.025 0.05 0.075 0.1 0.125; do
    echo -n "$mode $md fpos... "
    getco $md,0.1 $mode r0.ns r1.ns
    fpos=${res##* }
    co=${res%% *};
    let fpos=$fpos/2
    echo -n "$fpos at cutoff $co, run0... "
    run=0; wrapper $md,$co $mode r0.sp
    fneg=$res
    echo "$mode $md $co $run $fpos $fneg" >>parms.tbl
    echo -n "$fneg, run1... "
    run=1; wrapper $md,$co $mode r1.sp
    fneg=$res
    echo "$mode $md $co $run $fpos $fneg" >>parms.tbl
    echo $fneg
  done
done

Running this script produced the following data in parms.tbl
(with row and column labels added manually):

  tagging mindev cutoff run fp fn
1 notag 0.025 0.991690 0 4 241
2 notag 0.025 0.991690 1 4 266
3 notag 0.05 0.991090 0 4 246
4 notag 0.05 0.991090 1 4 262
5 notag 0.075 0.984800 0 4 225
6 notag 0.075 0.984800 1 4 241
7 notag 0.1 0.994440 0 4 255
8 notag 0.1 0.994440 1 4 284
9 notag 0.125 0.996430 0 4 284
10 notag 0.125 0.996430 1 4 304
11 tag 0.025 0.985100 0 4 216
12 tag 0.025 0.985100 1 4 222
13 tag 0.05 0.981500 0 4 214
14 tag 0.05 0.981500 1 4 214
15 tag 0.075 0.987500 0 4 224
16 tag 0.075 0.987500 1 4 234
17 tag 0.1 0.997170 0 4 271
18 tag 0.1 0.997170 1 4 296
19 tag 0.125 0.997650 0 4 292
20 tag 0.125 0.997650 1 4 309

This was read into R and manipulated as follows:

read.table("subjtag.tbl") -> bogo
bogo$tagging <- factor(bogo$tagging)
bogo$mindev <- factor(bogo$mindev)
bogo$run <- factor(bogo$run)
attach(bogo)       
bogaov <- aov(fn ~ tagging + mindev + tagging * mindev + run)
summary(bogaov)

               Df  Sum Sq Mean Sq F value    Pr(>F)    
tagging         1   672.8   672.8  16.401 0.0028858 ** 
mindev          4 14463.5  3615.9  88.144 3.263e-07 ***
run             1  1344.8  1344.8  32.782 0.0002849 ***
tagging:mindev  4  2371.7   592.9  14.454 0.0005913 ***
Residuals       9   369.2    41.0                      

Since the second run had significantly more false negatives than the
first, the runs are plotted separately instead of being treated as
replicates.

fn0n <- fn[run==0][1:5]
fn1n <- fn[run==1][1:5]
fn0t <- fn[run==0][6:10]
fn1t <- fn[run==1][6:10]
as.real(levels(mindev)) -> md

X11(width=3.5, height=3.5)

plot(md, fn0t/59.48, ylim=c(3.5,5.3), xlim=c(0.02,0.13),
  main="Subject tagging effect with varying min_dev", axes=FALSE,
  xlab="min_dev", ylab="Percent false negatives", type="p", col="blue")
axis(1, at=c(0.025, 0.05, 0.075, 0.1, 0.125))
axis(2)
box()
lines(md, fn0t/59.48, col="blue")
points(md,fn0n/59.48, col="red")
lines(md,fn0n/59.48, col="red")
text(0.02, 5.2, labels="Untagged", col="red", pos=4)
text(0.02, 5.1, labels="Tagged", col="blue", pos=4)
points(md,fn1n/59.48, col="red") 
lines(md,fn1n/59.48, col="red", lty="dashed") 
points(md,fn1t/59.48, col="blue") 
lines(md,fn1t/59.48, col="blue", lty="dashed") 

Greg Louis, 2003; last modified 2003-04-12]