Bogofilter is a Bayesian spam filter; background is presented on my general bogofilter page. Bogofilter runs a modification of Paul Graham's original calculation method that was suggested by Gary Robinson. (In mainline bogofilter, the original method is still optional.) Graham now suggests:
Four collections of email were subjected to a factorial analysis according to the following table:
Factorial: 0 I H T 1 I H t 2 I h T 3 I h t 4 i H T 5 i H t 6 i h T 7 i h t
In the above table, letters i, h and t refer to ignoring case, header tagging, and html tag content processing, respectively; lower case means the feature is enabled, upper case means the feature is not used. This convention may be rather unintuitive to the lay reader, but it reflects the options one would use in bogofilter to obtain the corresponding behaviour; for example, -PIht on the bogofilter command line means preserve (do not ignore) case, tag headers, and process html tags.
In each experiment, with each of the eight combinations of factors, a training database was first built, and then the r?.ns files were used to determine a spam cutoff that would produce a chosen number (3, 5 or 7) of false positives per run. Finally the r?.sp files were classified with that spam cutoff value, and the number of false negatives was determined and expressed as a percentage of the total number of spam messages in the file being tested.
Experiment 1
This experiment was performed by David Relson, with whose permission I publish the results here. Message counts were as follows:
t.ns:11388 t.sp:4824 r0.ns:5694 r0.sp:2411 r1.ns:5694 r1.sp:2412 r2.ns:5694 r2.sp:2413
The results obtained were:
case head html run co fp fn sp pc
0uHT0 I H T 0 0.841203 3 145 2411 6.014102
0uHT1 I H T 1 0.841203 3 131 2412 5.431177
0uHT2 I H T 2 0.841203 3 176 2413 7.293825
1uHt0 I H t 0 0.868294 3 105 2411 4.355039
1uHt1 I H t 1 0.868294 3 90 2412 3.731343
1uHt2 I H t 2 0.868294 3 127 2413 5.263158
2uhT0 I h T 0 0.860041 3 131 2411 5.433430
2uhT1 I h T 1 0.860041 3 124 2412 5.140962
2uhT2 I h T 2 0.860041 3 157 2413 6.506424
3uht0 I h t 0 0.867431 3 104 2411 4.313563
3uht1 I h t 1 0.867431 3 88 2412 3.648425
3uht2 I h t 2 0.867431 3 118 2413 4.890178
4UHT0 i H T 0 0.879890 3 170 2411 7.051016
4UHT1 i H T 1 0.879890 3 163 2412 6.757877
4UHT2 i H T 2 0.879890 3 188 2413 7.791131
5UHt0 i H t 0 0.877228 3 113 2411 4.686852
5UHt1 i H t 1 0.877228 3 98 2412 4.063018
5UHt2 i H t 2 0.877228 3 127 2413 5.263158
6UhT0 i h T 0 0.881506 3 141 2411 5.848196
6UhT1 i h T 1 0.881506 3 140 2412 5.804312
6UhT2 i h T 2 0.881506 3 167 2413 6.920845
7Uht0 i h t 0 0.883164 3 101 2411 4.189133
7Uht1 i h t 1 0.883164 3 93 2412 3.855721
7Uht2 i h t 2 0.883164 3 124 2413 5.138831In all of the result tables, the same convention is followed with regard to the letters i, h and t: lower case means enabled, upper case means the feature was not used. The sp column gives the number of spam messages in the run, and the pc column expresses the number of false negatives as a percentage of the total.
Analysis of variance facilitates interpretation of the numbers:
Df Sum Sq Mean Sq F value Pr(>F)
case 1 1.1919 1.1919 2.4505 0.1370
head 1 1.5058 1.5058 3.0959 0.0976 .
html 1 21.2720 21.2720 43.7343 5.964e-06 ***
case:head 1 0.1205 0.1205 0.2477 0.6255
case:html 1 0.4700 0.4700 0.9662 0.3403
head:html 1 0.4699 0.4699 0.9661 0.3403
case:head:html 1 0.0448 0.0448 0.0920 0.7655
Residuals 16 7.7823 0.4864Ignoring case slightly increased the number of false negatives, though in this experiment the effect was not statistically significant. Header tagging slightly decreased the false-negative percentages, but again the effect was not statistically significant. Extracting html tag contents reduced the false-negative counts by about a third, a highly significant effect.
Experiment 2
This time the data were from a collection of email from a multi-user bogofilter installation. Message counts were:
t.ns:21116 t.sp:20935 r0.ns:3502 r0.sp:5667 r1.ns:3502 r1.sp:5667 r2.ns:3501 r2.sp:5666
The results:
case head html run co fp fn sp pc
0fht0 I H T 0 0.501924 7 100 5667 1.764602
0fht1 I H T 1 0.501924 7 111 5667 1.958708
0fht2 I H T 2 0.501924 7 113 5666 1.994352
1fhT0 I H t 0 0.502053 7 97 5667 1.711664
1fhT1 I H t 1 0.502053 7 113 5667 1.994000
1fhT2 I H t 2 0.502053 7 110 5666 1.941405
2fHt0 I h T 0 0.504537 7 98 5667 1.729310
2fHt1 I h T 1 0.504537 7 112 5667 1.976354
2fHt2 I h T 2 0.504537 7 104 5666 1.835510
3fHT0 I h t 0 0.500009 7 91 5667 1.605788
3fHT1 I h t 1 0.500009 7 104 5667 1.835186
3fHT2 I h t 2 0.500009 7 96 5666 1.694317
4Fht0 i H T 0 0.862938 7 156 5667 2.752779
4Fht1 i H T 1 0.862938 7 180 5667 3.176284
4Fht2 i H T 2 0.862938 7 185 5666 3.265090
5FhT0 i H t 0 0.865804 7 152 5667 2.682195
5FhT1 i H t 1 0.865804 7 172 5667 3.035116
5FhT2 i H t 2 0.865804 7 180 5666 3.176844
6FHt0 i h T 0 0.722647 7 139 5667 2.452797
6FHt1 i h T 1 0.722647 7 147 5667 2.593965
6FHt2 i h T 2 0.722647 7 140 5666 2.470879
7FHT0 i h t 0 0.722647 7 134 5667 2.364567
7FHT1 i h t 1 0.722647 7 149 5667 2.629257
7FHT2 i h t 2 0.722647 7 137 5666 2.417932The analysis:
Df Sum Sq Mean Sq F value Pr(>F)
case 1 5.0202 5.0202 173.1783 5.347e-10 ***
head 1 0.6167 0.6167 21.2740 0.0002883 ***
html 1 0.0324 0.0324 1.1191 0.3058296
case:head 1 0.2543 0.2543 8.7738 0.0091773 **
case:html 1 0.0002 0.0002 0.0072 0.9336020
head:html 1 0.0008 0.0008 0.0286 0.8677154
case:head:html 1 0.0117 0.0117 0.4028 0.5345953
Residuals 16 0.4638 0.0290Ignoring case significantly increased the number of false negatives (that is, preserving case was shown, as Paul Graham suggested, to be better than not doing so). Header tagging decreased the number of false negatives; again, this validates Paul's suggestion. Extracting html tag contents didn't help much in this case.
Experiment 3
The messages used for the third run were from a collection of my own email. The message counts were:
t.ns:26731 t.sp:9463 r0.ns:13367 r0.sp:4730 r1.ns:13365 r1.sp:4731
The results:
case head html run co fp fn sp pc
0fht0 I H T 0 0.555546 5 138 4730 2.917548
0fht1 I H T 1 0.555546 5 134 4731 2.832382
1fhT0 I H t 0 0.555593 5 135 4730 2.854123
1fhT1 I H t 1 0.555593 5 130 4731 2.747833
2fHt0 I h T 0 0.507192 5 113 4730 2.389006
2fHt1 I h T 1 0.507192 5 118 4731 2.494187
3fHT0 I h t 0 0.507195 5 109 4730 2.304440
3fHT1 I h t 1 0.507195 5 113 4731 2.388501
4Fht0 i H T 0 0.812195 5 182 4730 3.847780
4Fht1 i H T 1 0.812195 5 184 4731 3.889241
5FhT0 i H t 0 0.812195 5 170 4730 3.594080
5FhT1 i H t 1 0.812195 5 174 4731 3.677869
6FHt0 i h T 0 0.752086 5 173 4730 3.657505
6FHt1 i h T 1 0.752086 5 168 4731 3.551046
7FHT0 i h t 0 0.752487 5 169 4730 3.572939
7FHT1 i h t 1 0.752487 5 160 4731 3.381949This time all three factors made a difference:
Df Sum Sq Mean Sq F value Pr(>F)
case 1 4.2481 4.2481 729.0553 3.812e-09 ***
head 1 0.4294 0.4294 73.7006 2.619e-05 ***
html 1 0.0698 0.0698 11.9829 0.008547 **
case:head 1 0.0541 0.0541 9.2814 0.015904 *
case:html 1 0.0090 0.0090 1.5530 0.247946
head:html 1 0.0018 0.0018 0.3068 0.594780
case:head:html 1 0.0040 0.0040 0.6903 0.430159
Residuals 8 0.0466 0.0058Ignoring case significantly increased the number of false negatives; preserving case is preferable. Header tagging decreased the false-negative counts, as did html-tag content extraction. The effect of tagging headers was greater when case was preserved.
Experiment 4
The messages used in the last run were from a more recent collection of my personal email. Message counts:
t.ns:6697 t.sp:9815 r0.ns:3349 r0.sp:4908 r1.ns:3348 r1.sp:4908
Results:
case head html run co fp fn pc
0fht0 I H T 0 0.500686 3 57 1.1613692
0fht1 I H T 1 0.500686 3 60 1.2224939
1fhT0 I H t 0 0.500866 3 50 1.0187449
1fhT1 I H t 1 0.500866 3 53 1.0798696
2fHt0 I h T 0 0.500021 3 46 0.9372453
2fHt1 I h T 1 0.500021 3 48 0.9779951
3fHT0 I h t 0 0.500025 3 42 0.8557457
3fHT1 I h t 1 0.500025 3 43 0.8761206
4Fht0 i H T 0 0.503843 3 69 1.4058680
4Fht1 i H T 1 0.503843 3 76 1.5484923
5FhT0 i H t 0 0.505731 3 62 1.2632437
5FhT1 i H t 1 0.505731 3 68 1.3854931
6FHt0 i h T 0 0.500011 3 48 0.9779951
6FHt1 i h T 1 0.500011 3 46 0.9372453
7FHT0 i h t 0 0.500053 3 43 0.8761206
7FHT1 i h t 1 0.500053 3 43 0.8761206The effects were similar to what was seen in the previous experiment, except that this time the effect of tagging headers was greater when case was not preserved:
Df Sum Sq Mean Sq F value Pr(>F)
case 1 0.08137 0.08137 28.0000 0.0007359 ***
head 1 0.47990 0.47990 165.1429 1.27e-06 ***
html 1 0.05490 0.05490 18.8929 0.0024565 **
case:head 1 0.07566 0.07566 26.0357 0.0009270 ***
case:html 1 1.615e-36 1.615e-36 5.557e-34 1.0000000
head:html 1 0.00374 0.00374 1.2857 0.2896722
case:head:html 1 0.00010 0.00010 0.0357 0.8548131
Residuals 8 0.02325 0.00291
#! /bin/sh
# runex for the new -P parameters
# set train to yes to rebuild the training databases
train=yes
function getco () {
res=`cat $* | ./bogofilter -d $fnam -o 0.1 -Mv | \
perl -e ' $target = 6; while (<>) { ' \
-e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
-e ' die "dainbramage" unless scalar @diffs > 15;' \
-e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
-e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
-e ' printf("%8.6f %d",1.0-$s[$target],$target);'`
}
files=(0fht 1fhT 2fHt 3fHT 4Fht 5FhT 6FHt 7FHT)
factors=("I H T" "I H t" "I h T" "I h t" "i H T" "i H t" "i h T" "i h t")
#opts=(-Pfh -Pfht -Pf -Pft -Ph -Pht "" -Pt)
#opts=(-PuH -PuHt -Pu -Put -PH -PHt "" -Pt)
opts=(-PIHT -PIHt -PIhT -PIht -PiHT -PiHt -PihT -Piht)
echo "fold head html run co fp fn" > Pparms.tbl
for i in `seq 0 7`; do
fnam=${files[$i]}
popt=${opts[$i]}
if [ $train = yes ]; then
/bin/rm -f $fnam/*
./bogofilter -d $fnam $popt -s < t.sp
./bogofilter -d $fnam $popt -n < t.ns
fi
getco r0.ns r1.ns
fp=${res##* }; co=${res%% *}; let fp=$fp/2
for num in 0 1; do
./bogofilter -d $fnam $popt -o $co -Mv < r$num.sp \
> r$num.sp.$fnam
fn=`grep -c -v '^1' r$num.sp.$fnam`
echo "$fnam$num ${factors[$i]} $num $co $fp $fn" >> Pparms.tbl
done
done[© Greg Louis, 2003; last modified 2003-05-21]