Paul Graham's Refinements to Bayesian Filtering

Introduction and general description:

Three refinements to message processing for Bayesian filtering were suggested in a recent paper by Paul Graham.  David Relson implemented them for bogofilter, and this document records the results of tests performed to evaluate them.

Bogofilter is a Bayesian spam filter; background is presented on my general bogofilter page.  Bogofilter runs a modification of Paul Graham's original calculation method that was suggested by Gary Robinson.  (In mainline bogofilter, the original method is still optional.)  Graham now suggests:

  1. Preserving case in tokens
  2. Tagging tokens extracted from To:, From:, Subject: and Return-Path: headers to distinguish them from tokens found in the message body
  3. Extracting tokens from html A, IMG and FONT tags

Four collections of email were subjected to a factorial analysis according to the following table:

Factorial:
0 I H T
1 I H t
2 I h T
3 I h t
4 i H T
5 i H t
6 i h T
7 i h t

In the above table, letters i, h and t refer to ignoring case, header tagging, and html tag content processing, respectively; lower case means the feature is enabled, upper case means the feature is not used. This convention may be rather unintuitive to the lay reader, but it reflects the options one would use in bogofilter to obtain the corresponding behaviour; for example, -PIht on the bogofilter command line means preserve (do not ignore) case, tag headers, and process html tags.

Procedure and Results:

Four existing corpora of email were used. Each had been classified and verified so that spam and nonspam messages were separate. The spam and nonspam messages had been separately distributed into either three files (named t, r0 and r1) or four (t, r0, r1 and r2).  The t.sp (spam) and t.ns (nonspam) files were used in training; the r?.sp and r?.ns files, in testing.  In the appendix is a sample of the script used to run these evaluations.

In each experiment, with each of the eight combinations of factors, a training database was first built, and then the r?.ns files were used to determine a spam cutoff that would produce a chosen number (3, 5 or 7) of false positives per run.  Finally the r?.sp files were classified with that spam cutoff value, and the number of false negatives was determined and expressed as a percentage of the total number of spam messages in the file being tested.

Experiment 1

This experiment was performed by David Relson, with whose permission I publish the results here.  Message counts were as follows:

t.ns:11388
t.sp:4824
r0.ns:5694
r0.sp:2411
r1.ns:5694
r1.sp:2412
r2.ns:5694
r2.sp:2413

The results obtained were:

      case head html run       co fp  fn   sp       pc
0uHT0    I    H    T   0 0.841203  3 145 2411 6.014102
0uHT1    I    H    T   1 0.841203  3 131 2412 5.431177
0uHT2    I    H    T   2 0.841203  3 176 2413 7.293825
1uHt0    I    H    t   0 0.868294  3 105 2411 4.355039
1uHt1    I    H    t   1 0.868294  3  90 2412 3.731343
1uHt2    I    H    t   2 0.868294  3 127 2413 5.263158
2uhT0    I    h    T   0 0.860041  3 131 2411 5.433430
2uhT1    I    h    T   1 0.860041  3 124 2412 5.140962
2uhT2    I    h    T   2 0.860041  3 157 2413 6.506424
3uht0    I    h    t   0 0.867431  3 104 2411 4.313563
3uht1    I    h    t   1 0.867431  3  88 2412 3.648425
3uht2    I    h    t   2 0.867431  3 118 2413 4.890178
4UHT0    i    H    T   0 0.879890  3 170 2411 7.051016
4UHT1    i    H    T   1 0.879890  3 163 2412 6.757877
4UHT2    i    H    T   2 0.879890  3 188 2413 7.791131
5UHt0    i    H    t   0 0.877228  3 113 2411 4.686852
5UHt1    i    H    t   1 0.877228  3  98 2412 4.063018
5UHt2    i    H    t   2 0.877228  3 127 2413 5.263158
6UhT0    i    h    T   0 0.881506  3 141 2411 5.848196
6UhT1    i    h    T   1 0.881506  3 140 2412 5.804312
6UhT2    i    h    T   2 0.881506  3 167 2413 6.920845
7Uht0    i    h    t   0 0.883164  3 101 2411 4.189133
7Uht1    i    h    t   1 0.883164  3  93 2412 3.855721
7Uht2    i    h    t   2 0.883164  3 124 2413 5.138831

In all of the result tables, the same convention is followed with regard to the letters i, h and t: lower case means enabled, upper case means the feature was not used.  The sp column gives the number of spam messages in the run, and the pc column expresses the number of false negatives as a percentage of the total.

Analysis of variance facilitates interpretation of the numbers:

               Df  Sum Sq Mean Sq F value    Pr(>F)    
case            1  1.1919  1.1919  2.4505    0.1370    
head            1  1.5058  1.5058  3.0959    0.0976 .  
html            1 21.2720 21.2720 43.7343 5.964e-06 ***
case:head       1  0.1205  0.1205  0.2477    0.6255    
case:html       1  0.4700  0.4700  0.9662    0.3403    
head:html       1  0.4699  0.4699  0.9661    0.3403    
case:head:html  1  0.0448  0.0448  0.0920    0.7655    
Residuals      16  7.7823  0.4864

Ignoring case slightly increased the number of false negatives, though in this experiment the effect was not statistically significant.  Header tagging slightly decreased the false-negative percentages, but again the effect was not statistically significant.  Extracting html tag contents reduced the false-negative counts by about a third, a highly significant effect.

Experiment 2

This time the data were from a collection of email from a multi-user bogofilter installation. Message counts were:

t.ns:21116
t.sp:20935
r0.ns:3502
r0.sp:5667
r1.ns:3502
r1.sp:5667
r2.ns:3501
r2.sp:5666

The results:

      case head html run       co fp  fn   sp       pc
0fht0    I    H    T   0 0.501924  7 100 5667 1.764602
0fht1    I    H    T   1 0.501924  7 111 5667 1.958708
0fht2    I    H    T   2 0.501924  7 113 5666 1.994352
1fhT0    I    H    t   0 0.502053  7  97 5667 1.711664
1fhT1    I    H    t   1 0.502053  7 113 5667 1.994000
1fhT2    I    H    t   2 0.502053  7 110 5666 1.941405
2fHt0    I    h    T   0 0.504537  7  98 5667 1.729310
2fHt1    I    h    T   1 0.504537  7 112 5667 1.976354
2fHt2    I    h    T   2 0.504537  7 104 5666 1.835510
3fHT0    I    h    t   0 0.500009  7  91 5667 1.605788
3fHT1    I    h    t   1 0.500009  7 104 5667 1.835186
3fHT2    I    h    t   2 0.500009  7  96 5666 1.694317
4Fht0    i    H    T   0 0.862938  7 156 5667 2.752779
4Fht1    i    H    T   1 0.862938  7 180 5667 3.176284
4Fht2    i    H    T   2 0.862938  7 185 5666 3.265090
5FhT0    i    H    t   0 0.865804  7 152 5667 2.682195
5FhT1    i    H    t   1 0.865804  7 172 5667 3.035116
5FhT2    i    H    t   2 0.865804  7 180 5666 3.176844
6FHt0    i    h    T   0 0.722647  7 139 5667 2.452797
6FHt1    i    h    T   1 0.722647  7 147 5667 2.593965
6FHt2    i    h    T   2 0.722647  7 140 5666 2.470879
7FHT0    i    h    t   0 0.722647  7 134 5667 2.364567
7FHT1    i    h    t   1 0.722647  7 149 5667 2.629257
7FHT2    i    h    t   2 0.722647  7 137 5666 2.417932

The analysis:

               Df Sum Sq Mean Sq  F value    Pr(>F)    
case            1 5.0202  5.0202 173.1783 5.347e-10 ***
head            1 0.6167  0.6167  21.2740 0.0002883 ***
html            1 0.0324  0.0324   1.1191 0.3058296    
case:head       1 0.2543  0.2543   8.7738 0.0091773 ** 
case:html       1 0.0002  0.0002   0.0072 0.9336020    
head:html       1 0.0008  0.0008   0.0286 0.8677154    
case:head:html  1 0.0117  0.0117   0.4028 0.5345953    
Residuals      16 0.4638  0.0290

Ignoring case significantly increased the number of false negatives (that is, preserving case was shown, as Paul Graham suggested, to be better than not doing so). Header tagging decreased the number of false negatives; again, this validates Paul's suggestion. Extracting html tag contents didn't help much in this case.

Experiment 3

The messages used for the third run were from a collection of my own email.  The message counts were:

t.ns:26731
t.sp:9463
r0.ns:13367
r0.sp:4730
r1.ns:13365
r1.sp:4731

The results:

      case head html run       co fp  fn   sp       pc
0fht0    I    H    T   0 0.555546  5 138 4730 2.917548
0fht1    I    H    T   1 0.555546  5 134 4731 2.832382
1fhT0    I    H    t   0 0.555593  5 135 4730 2.854123
1fhT1    I    H    t   1 0.555593  5 130 4731 2.747833
2fHt0    I    h    T   0 0.507192  5 113 4730 2.389006
2fHt1    I    h    T   1 0.507192  5 118 4731 2.494187
3fHT0    I    h    t   0 0.507195  5 109 4730 2.304440
3fHT1    I    h    t   1 0.507195  5 113 4731 2.388501
4Fht0    i    H    T   0 0.812195  5 182 4730 3.847780
4Fht1    i    H    T   1 0.812195  5 184 4731 3.889241
5FhT0    i    H    t   0 0.812195  5 170 4730 3.594080
5FhT1    i    H    t   1 0.812195  5 174 4731 3.677869
6FHt0    i    h    T   0 0.752086  5 173 4730 3.657505
6FHt1    i    h    T   1 0.752086  5 168 4731 3.551046
7FHT0    i    h    t   0 0.752487  5 169 4730 3.572939
7FHT1    i    h    t   1 0.752487  5 160 4731 3.381949

This time all three factors made a difference:

               Df Sum Sq Mean Sq  F value    Pr(>F)    
case            1 4.2481  4.2481 729.0553 3.812e-09 ***
head            1 0.4294  0.4294  73.7006 2.619e-05 ***
html            1 0.0698  0.0698  11.9829  0.008547 ** 
case:head       1 0.0541  0.0541   9.2814  0.015904 *  
case:html       1 0.0090  0.0090   1.5530  0.247946    
head:html       1 0.0018  0.0018   0.3068  0.594780    
case:head:html  1 0.0040  0.0040   0.6903  0.430159    
Residuals       8 0.0466  0.0058

Ignoring case significantly increased the number of false negatives; preserving case is preferable.  Header tagging decreased the false-negative counts, as did html-tag content extraction. The effect of tagging headers was greater when case was preserved.

Experiment 4

The messages used in the last run were from a more recent collection of my personal email. Message counts:

t.ns:6697
t.sp:9815
r0.ns:3349
r0.sp:4908
r1.ns:3348
r1.sp:4908

Results:

      case head html run       co fp fn        pc
0fht0    I    H    T   0 0.500686  3 57 1.1613692
0fht1    I    H    T   1 0.500686  3 60 1.2224939
1fhT0    I    H    t   0 0.500866  3 50 1.0187449
1fhT1    I    H    t   1 0.500866  3 53 1.0798696
2fHt0    I    h    T   0 0.500021  3 46 0.9372453
2fHt1    I    h    T   1 0.500021  3 48 0.9779951
3fHT0    I    h    t   0 0.500025  3 42 0.8557457
3fHT1    I    h    t   1 0.500025  3 43 0.8761206
4Fht0    i    H    T   0 0.503843  3 69 1.4058680
4Fht1    i    H    T   1 0.503843  3 76 1.5484923
5FhT0    i    H    t   0 0.505731  3 62 1.2632437
5FhT1    i    H    t   1 0.505731  3 68 1.3854931
6FHt0    i    h    T   0 0.500011  3 48 0.9779951
6FHt1    i    h    T   1 0.500011  3 46 0.9372453
7FHT0    i    h    t   0 0.500053  3 43 0.8761206
7FHT1    i    h    t   1 0.500053  3 43 0.8761206

The effects were similar to what was seen in the previous experiment, except that this time the effect of tagging headers was greater when case was not preserved:

               Df    Sum Sq   Mean Sq   F value    Pr(>F)    
case            1   0.08137   0.08137   28.0000 0.0007359 ***
head            1   0.47990   0.47990  165.1429  1.27e-06 ***
html            1   0.05490   0.05490   18.8929 0.0024565 ** 
case:head       1   0.07566   0.07566   26.0357 0.0009270 ***
case:html       1 1.615e-36 1.615e-36 5.557e-34 1.0000000    
head:html       1   0.00374   0.00374    1.2857 0.2896722    
case:head:html  1   0.00010   0.00010    0.0357 0.8548131    
Residuals       8   0.02325   0.00291

Conclusion:

Each of Paul Graham's three suggestions (preserve case, tag headers, use contents of A, IMG and FONT tags) was beneficial, although in some cases not strongly so, in each of the four experiments performed.  It would seem reasonable to make case preservation, header tagging and html tag content extraction the defaults for future bogofilter versions.

Appendix: Details of the experimental procedure

The following bash script was used to run the second, third and fourth experiments, with variations in the $target parameter and the upper num limit to accomodate the different message corpora.  The experiment run by David Relson (#1 above) used a different script, not shown here.
#! /bin/sh
#  runex for the new -P parameters
#  set train to yes to rebuild the training databases
train=yes

function getco () {
  res=`cat $* | ./bogofilter -d $fnam -o 0.1 -Mv | \
    perl -e ' $target = 6; while (<>) { ' \
         -e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
         -e ' die "dainbramage" unless scalar @diffs > 15;' \
         -e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
         -e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
         -e ' printf("%8.6f %d",1.0-$s[$target],$target);'`
}

files=(0fht 1fhT 2fHt 3fHT 4Fht 5FhT 6FHt 7FHT)
factors=("I H T" "I H t" "I h T" "I h t" "i H T" "i H t" "i h T" "i h t")
#opts=(-Pfh -Pfht -Pf -Pft -Ph -Pht "" -Pt)
#opts=(-PuH -PuHt -Pu -Put -PH -PHt "" -Pt)
opts=(-PIHT -PIHt -PIhT -PIht -PiHT -PiHt -PihT -Piht)

echo "fold head html run co fp fn" > Pparms.tbl
for i in `seq 0 7`; do
    fnam=${files[$i]}
    popt=${opts[$i]}
    if [ $train = yes ]; then
	/bin/rm -f $fnam/*
	./bogofilter -d $fnam $popt -s < t.sp
	./bogofilter -d $fnam $popt -n < t.ns
    fi
    getco r0.ns r1.ns
    fp=${res##* }; co=${res%% *}; let fp=$fp/2
    for num in 0 1; do
	./bogofilter -d $fnam $popt -o $co -Mv < r$num.sp \
	    > r$num.sp.$fnam
	fn=`grep -c -v '^1' r$num.sp.$fnam`
	echo "$fnam$num ${factors[$i]} $num $co $fp $fn" >> Pparms.tbl
    done
done

Greg Louis, 2003; last modified 2003-05-21]