Paul Graham's original calculation uses a parameter p(w) that is calculated for each unique token in a message, based on the numbers of times that the token has been encountered in spam during training (badcount) and in nonspam (goodcount), scaled by the numbers of messages used to build the training database (bad- or goodlist_messagecount):
(badcount/badlist_messagecount)
p(w) = -----------------------------------------------------------------
(badcount/badlist_messagecount + goodcount/goodlist_messagecount)
n = badcount + goodcount f(w) = (s * x + n * p(w)) / (s + n)
scalefactor = badlist_messagecount / goodlist_messagecount f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
n' = badcount + goodcount * scalefactor
-- this should have a negligible effect, unless the training
set is extremely lopsided.Robinson recommends applying Fisher's method of combining probabilities to obtain the actual index used to classify a message as spam or nonspam. Fisher's method uses an inverse chi-squared function, which I'll refer to as prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2N degrees of freedom, N being the number of unique tokens in the message:
P = prbx(-2 * sum(ln(1-f(w))), 2*N) Q = prbx(-2 * sum(ln(f(w))), 2*N) S = (1 + Q - P) / 2
Graham's original calculation involved combining only the fifteen most extreme values of p(w), i.e. the fifteen values farthest from 0.5 (nearest to 1 or 0). With Robinson's approach, there is no need to restrict the calculation in this way; however, bogofilter supports the possibility of ignoring values near to 0.5 if the user wishes to do so. The rationale is that such values contribute little to the overall result of the calculation and it saves time to concentrate on the more "characteristic" tokens; the counterargument is that one might as well take advantage of all available information.
In practice, it has been found -- for example, in this experiment -- that in some typical operating conditions, a low minimum deviation from 0.5 (around 0.1 or 0.15) gives better discrimination than when no minimum is applied. Using a minimum deviation (hereinafter abbreviated mindev) violates one of the assumptions on which the Bayesian classification is based, but that assumption is violated anyway by the nature of the data; it may even be (I won't discuss this in detail here but a reference may soon be available) that the two violations act to cancel one another out and approximate the required uniformity of distribution more precisely.
Robinson discusses choosing the value of the parameter x in his essay on spam detection. Basically, a good starting point might be 0.5, or the average p(w) for words in the training set that have counts of 10 or more. Notice that this sets up a potential interaction of x and s with mindev, since x may well be close enough to 0.5 to fall beneath the mindev threshold. Parameter s is a weight factor, as described above; Robinson suggests starting with 1 and tuning from there. I found in practice that a value of 0.001 for s gave better discrimination, even with mindev set to zero; the optimum size of s seems related to the size of the training set in a way that is not mathematically obvious (to me, anyway).
Note added after publication: The value of s should be kept between 1 and 0.01; lower values give too much weight to tokens that appear a few times in one list (spam or nonspam) but not at all in the other. See a later experiment's first conclusion for further information.
I'm not sure I've answered that question, but I hope the results to be presented here may contribute. What I did was to take a corpus of 6,375 spams and 6,375 nonspams and "deal" them out into two sets of 3 files. The dealing was done like this:
$cat distrib
FILE=(t.$1 r0.$1 t.$1 r1.$1 t.$1)
let n=${FILENO}%5
fname=${FILE[$n]}
cat >>$fname
FILENO=0 formail -s ./distrib ns < corpus.nonspam
FILENO=0 formail -s ./distrib sp < corpus.spam
-m mindev[,cutoff[,s]] so that mindev, s and
the spam-cutoff value (the value of the spam index that determines the
classification) could be set from the command line.(Note that as of bogofilter version 0.11.1.9, the -m and -o command line options allow the specification of these parameters; the special version mentioned above is no longer needed. An example of the use of the new version is presented in the Appendix to the report of another experiment.)
For mindev in 0 to 0.45 by steps of 0.05, and within that loop, for s in 1 to 1e-12 by logarithmic steps of a half decade each (1, 0.32, 0.1, 0.032 ...) I first determined the spam cutoff that would give closest to 12 (approx. 1%) false positives; then, for each of r0.sp and r1.sp, I applied that spam cutoff and determined the number of false negatives. The counts of false positives and false negatives were added together and expressed as a percentage error. These percentage errors are rather high, of course, given the small size of the training set, but they serve to indicate the intricate way in which mindev and s interact to influence the discrimination.
There are five hundred data in this experiment, so the table of results is presented in the Appendix rather than here. The best discrimination was achieved with a mindev of 0.35 and s of 0.0032; the cutoff point for those values was 0.992. There are, however, other local minima; it's not clear at this point how generally applicable this minimum might prove. Note that in this experiment the value of x was 0.415, so at all values of mindev other than 0.05 and 0, unknown tokens did not figure in the calculation, and little-known tokens were pulled toward the minimum deviation as s increased.
The data are presented on the perspective plots below. On the left is a graph showing the cutoff values used at each point (chosen to give 12 false positives, as nearly as possible). The actual experimental results appear on the graph at the right. It would be nice if I could offer the reader a joystick with which to rotate the perspective plot; the angles I have chosen seem to me to present the data most clearly, but there's no doubt that viewing the plots in slow rotation helps one to visualize the relationships. The lower pair of plots is the same as the upper, but rotated 30 degrees clockwise, which is about the best I can do statically.

cat distrib
FILE=(t.$1 r0.$1 t.$1 r1.$1 t.$1)
let n=${FILENO}%5
fname=${FILE[$n]}
cat >>$fname
cat runex
#! /bin/bash
# runex for studying effects of s and mindev
function getco () {
mopt=$1
shift
res=`cat $* | formail -s bogofilter -d db -m $mopt -v 2>&1 | \
perl -e ' $target = 12; while (<>) { ' \
-e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
-e ' die "dainbramage" unless scalar @diffs > 501;' \
-e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
-e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
-e ' printf("%8.6f %d",1.0-$s[$target],$target-1);'`
}
function getfn () {
mopt=$1
shift
v=-v
res=`cat $* | formail -s bogofilter -d db -m $mopt -v 2>&1 | grep -c $v '^1'`
}
for md in 0.0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45; do
for rs in 1 0.32 0.1 0.032 1e-2 3.2e-3 1e-3 3.2e-4 1e-4 3.2e-5\
1e-5 3.2e-6 1e-6 3.2e-7 1e-7 3.2e-8 1e-8 3.2e-9 1e-9\
3.2e-10 1e-10 3.2e-11 1e-11 3.2e-12 1e-12; do
echo -n "$md $rs fpos... "
getco $md,0.1,$rs r0.ns r1.ns
fpos=${res##* }
co=${res%% *}
let fpos=$fpos/2
echo -n "$fpos at cutoff $co, run0... "
run=0
getfn $md,$co,$rs r0.sp
fneg=$res
echo "$md $rs $co $run $fpos $fneg" >>parms.tbl
echo -n "$fneg, run1... "
run=1
getfn $md,$co,$rs r1.sp
fneg=$res
echo "$md $rs $co $run $fpos $fneg" >>parms.tbl
echo $fneg
done
done
bzip2 -dc ../mail/sieved.bad.bz2 | FILENO=0 formail -s ./distrib sp
bzip2 -dc ../mail/sieved.good.bz2 | FILENO=0 formail -s ./distrib ns
bogofilter -d db -s < t.sp; bogofilter -d db -n < t.ns
runex
Now in R:
read.table("parms.tbl") -> parms
sval <- sort(unique(parms$V2), decreasing=TRUE)
x <- -log10(sval)
y <- sort(unique(parms$V1))
parms$percent = (parms$V5 + parms$V6) * 100 / 1275
parms$V1 = factor(parms$V1) # mindev
parms$V2 = factor(parms$V2) # s
paov <- aov(percent ~ V1 + V2, data=parms)
summary(paov)
Df Sum Sq Mean Sq F value Pr(>F)
V1 9 2889.52 321.06 77.621 < 2.2e-16 ***
V2 24 1678.88 69.95 16.913 < 2.2e-16 ***
Residuals 466 1927.47 4.14
meanerr <- apply(array(parms$percent,dim=c(2,length(parms$V1)/2)), 2,
mean)
cutoffs <- array(parms$V3,dim=c(2,length(parms$V1)/2))[1,]
parmres <- data.frame(md=rep(y, each=length(sval)),
rs=rep(sval,length(parms$V1)/(2 * length(sval))),
cutoff=cutoffs, percent=meanerr)
parmres
md rs cutoff percent
1 0.00 1.0e+00 0.945700 21.13725
2 0.00 3.2e-01 0.995790 22.66667
3 0.00 1.0e-01 0.999937 25.49020
4 0.00 3.2e-02 0.999998 25.84314
5 0.00 1.0e-02 0.999998 22.15686
6 0.00 3.2e-03 0.999999 21.25490
7 0.00 1.0e-03 0.999999 19.84314
8 0.00 3.2e-04 0.999999 19.49020
9 0.00 1.0e-04 0.999998 18.35294
10 0.00 3.2e-05 0.999998 18.00000
11 0.00 1.0e-05 0.999998 17.96078
12 0.00 3.2e-06 0.999999 18.98039
13 0.00 1.0e-06 0.999998 18.98039
14 0.00 3.2e-07 0.999998 19.41176
15 0.00 1.0e-07 0.999995 18.82353
16 0.00 3.2e-08 0.999995 19.17647
17 0.00 1.0e-08 0.999994 19.45098
18 0.00 3.2e-09 0.999986 19.17647
19 0.00 1.0e-09 0.999943 18.27451
20 0.00 3.2e-10 0.999860 17.37255
21 0.00 1.0e-10 0.999790 17.49020
22 0.00 3.2e-11 0.999861 18.39216
23 0.00 1.0e-11 0.999861 18.74510
24 0.00 3.2e-12 0.999893 19.45098
25 0.00 1.0e-12 0.999861 19.37255
26 0.05 1.0e+00 0.998050 21.25490
27 0.05 3.2e-01 0.999994 25.52941
28 0.05 1.0e-01 0.999998 21.64706
29 0.05 3.2e-02 0.999998 19.13725
30 0.05 1.0e-02 0.999999 18.35294
31 0.05 3.2e-03 0.999998 16.98039
32 0.05 1.0e-03 0.999999 17.56863
33 0.05 3.2e-04 0.999999 17.96078
34 0.05 1.0e-04 0.999995 16.50980
35 0.05 3.2e-05 0.999999 18.66667
36 0.05 1.0e-05 0.999995 17.49020
37 0.05 3.2e-06 0.999999 19.25490
38 0.05 1.0e-06 0.999998 19.33333
39 0.05 3.2e-07 0.999997 19.49020
40 0.05 1.0e-07 0.999994 19.49020
41 0.05 3.2e-08 0.999959 18.43137
42 0.05 1.0e-08 0.999918 18.15686
43 0.05 3.2e-09 0.999814 17.80392
44 0.05 1.0e-09 0.999814 18.35294
45 0.05 3.2e-10 0.999814 18.90196
46 0.05 1.0e-10 0.999810 19.25490
47 0.05 3.2e-11 0.999462 18.54902
48 0.05 1.0e-11 0.999462 19.17647
49 0.05 3.2e-12 0.999560 19.68627
50 0.05 1.0e-12 0.999691 20.50980
51 0.10 1.0e+00 0.999997 25.17647
52 0.10 3.2e-01 0.999998 20.23529
53 0.10 1.0e-01 0.999999 17.29412
54 0.10 3.2e-02 0.999999 16.27451
55 0.10 1.0e-02 0.999999 15.56863
56 0.10 3.2e-03 0.999999 15.60784
57 0.10 1.0e-03 0.999998 15.56863
58 0.10 3.2e-04 0.999997 15.49020
59 0.10 1.0e-04 0.999999 16.78431
60 0.10 3.2e-05 0.999999 17.45098
61 0.10 1.0e-05 0.999995 16.82353
62 0.10 3.2e-06 0.999996 17.84314
63 0.10 1.0e-06 0.999995 18.27451
64 0.10 3.2e-07 0.999975 17.37255
65 0.10 1.0e-07 0.999834 15.84314
66 0.10 3.2e-08 0.999834 16.62745
67 0.10 1.0e-08 0.999722 16.70588
68 0.10 3.2e-09 0.999779 17.60784
69 0.10 1.0e-09 0.999522 17.17647
70 0.10 3.2e-10 0.999519 17.76471
71 0.10 1.0e-10 0.999592 18.35294
72 0.10 3.2e-11 0.999672 18.74510
73 0.10 1.0e-11 0.999726 19.64706
74 0.10 3.2e-12 0.999703 20.15686
75 0.10 1.0e-12 0.999519 20.23529
76 0.15 1.0e+00 0.999996 21.45098
77 0.15 3.2e-01 0.999999 18.58824
78 0.15 1.0e-01 0.999999 16.11765
79 0.15 3.2e-02 0.999999 15.41176
80 0.15 1.0e-02 0.999999 15.25490
81 0.15 3.2e-03 0.999999 15.76471
82 0.15 1.0e-03 0.999998 15.37255
83 0.15 3.2e-04 0.999995 15.45098
84 0.15 1.0e-04 0.999996 16.23529
85 0.15 3.2e-05 0.999996 17.05882
86 0.15 1.0e-05 0.999980 16.35294
87 0.15 3.2e-06 0.999755 14.47059
88 0.15 1.0e-06 0.999847 15.72549
89 0.15 3.2e-07 0.999768 15.96078
90 0.15 1.0e-07 0.999248 15.29412
91 0.15 3.2e-08 0.999037 15.80392
92 0.15 1.0e-08 0.999037 16.27451
93 0.15 3.2e-09 0.999037 16.86275
94 0.15 1.0e-09 0.999058 17.45098
95 0.15 3.2e-10 0.999072 17.88235
96 0.15 1.0e-10 0.999082 18.58824
97 0.15 3.2e-11 0.999037 19.09804
98 0.15 1.0e-11 0.998870 19.37255
99 0.15 3.2e-12 0.998240 19.64706
100 0.15 1.0e-12 0.997280 19.60784
101 0.20 1.0e+00 0.999999 21.25490
102 0.20 3.2e-01 0.999998 15.80392
103 0.20 1.0e-01 0.999999 14.74510
104 0.20 3.2e-02 0.999999 14.47059
105 0.20 1.0e-02 0.999998 14.23529
106 0.20 3.2e-03 0.999999 15.76471
107 0.20 1.0e-03 0.999995 15.25490
108 0.20 3.2e-04 0.999996 16.47059
109 0.20 1.0e-04 0.999943 14.62745
110 0.20 3.2e-05 0.999915 15.21569
111 0.20 1.0e-05 0.999948 16.43137
112 0.20 3.2e-06 0.999700 15.09804
113 0.20 1.0e-06 0.999348 15.25490
114 0.20 3.2e-07 0.999354 16.00000
115 0.20 1.0e-07 0.998260 15.52941
116 0.20 3.2e-08 0.998260 16.07843
117 0.20 1.0e-08 0.998260 16.82353
118 0.20 3.2e-09 0.998260 17.21569
119 0.20 1.0e-09 0.998010 17.76471
120 0.20 3.2e-10 0.997840 18.31373
121 0.20 1.0e-10 0.996400 18.23529
122 0.20 3.2e-11 0.993680 18.03922
123 0.20 1.0e-11 0.989300 18.07843
124 0.20 3.2e-12 0.985200 18.00000
125 0.20 1.0e-12 0.985200 18.90196
126 0.25 1.0e+00 0.999999 20.39216
127 0.25 3.2e-01 0.999998 15.45098
128 0.25 1.0e-01 0.999999 15.13725
129 0.25 3.2e-02 0.999999 15.41176
130 0.25 1.0e-02 0.999998 15.56863
131 0.25 3.2e-03 0.999998 16.78431
132 0.25 1.0e-03 0.999925 14.27451
133 0.25 3.2e-04 0.999806 14.00000
134 0.25 1.0e-04 0.999481 13.72549
135 0.25 3.2e-05 0.999420 14.62745
136 0.25 1.0e-05 0.999357 15.17647
137 0.25 3.2e-06 0.999357 16.27451
138 0.25 1.0e-06 0.998600 16.00000
139 0.25 3.2e-07 0.997100 15.92157
140 0.25 1.0e-07 0.994240 15.76471
141 0.25 3.2e-08 0.989600 15.68627
142 0.25 1.0e-08 0.986000 15.88235
143 0.25 3.2e-09 0.986000 16.43137
144 0.25 1.0e-09 0.986000 16.94118
145 0.25 3.2e-10 0.981500 17.09804
146 0.25 1.0e-10 0.967900 17.09804
147 0.25 3.2e-11 0.962000 17.68627
148 0.25 1.0e-11 0.962000 18.31373
149 0.25 3.2e-12 0.962000 18.78431
150 0.25 1.0e-12 0.962000 19.88235
151 0.30 1.0e+00 0.999998 19.80392
152 0.30 3.2e-01 0.999998 16.39216
153 0.30 1.0e-01 0.999999 16.82353
154 0.30 3.2e-02 0.999998 16.50980
155 0.30 1.0e-02 0.999968 14.54902
156 0.30 3.2e-03 0.999839 13.84314
157 0.30 1.0e-03 0.999469 13.56863
158 0.30 3.2e-04 0.998700 13.41176
159 0.30 1.0e-04 0.998530 14.03922
160 0.30 3.2e-05 0.997400 14.66667
161 0.30 1.0e-05 0.994370 14.39216
162 0.30 3.2e-06 0.986800 14.35294
163 0.30 1.0e-06 0.983300 14.78431
164 0.30 3.2e-07 0.970000 14.70588
165 0.30 1.0e-07 0.970000 15.33333
166 0.30 3.2e-08 0.970000 15.80392
167 0.30 1.0e-08 0.970000 16.50980
168 0.30 3.2e-09 0.960900 17.09804
169 0.30 1.0e-09 0.935400 17.09804
170 0.30 3.2e-10 0.924700 17.88235
171 0.30 1.0e-10 0.924700 18.31373
172 0.30 3.2e-11 0.924700 18.86275
173 0.30 1.0e-11 0.924700 19.68627
174 0.30 3.2e-12 0.924700 20.31373
175 0.30 1.0e-12 0.917300 20.74510
176 0.35 1.0e+00 0.999999 22.43137
177 0.35 3.2e-01 0.999998 18.07843
178 0.35 1.0e-01 0.999964 15.21569
179 0.35 3.2e-02 0.999452 12.43137
180 0.35 1.0e-02 0.998330 12.11765
181 0.35 3.2e-03 0.992310 11.60784
182 0.35 1.0e-03 0.986400 12.03922
183 0.35 3.2e-04 0.983700 13.09804
184 0.35 1.0e-04 0.982800 13.88235
185 0.35 3.2e-05 0.978100 14.31373
186 0.35 1.0e-05 0.978100 15.41176
187 0.35 3.2e-06 0.973600 15.80392
188 0.35 1.0e-06 0.973600 16.74510
189 0.35 3.2e-07 0.973400 17.56863
190 0.35 1.0e-07 0.949100 17.37255
191 0.35 3.2e-08 0.913900 17.37255
192 0.35 1.0e-08 0.901300 17.80392
193 0.35 3.2e-09 0.901300 18.19608
194 0.35 1.0e-09 0.901300 19.17647
195 0.35 3.2e-10 0.893000 20.00000
196 0.35 1.0e-10 0.893000 20.74510
197 0.35 3.2e-11 0.893000 21.37255
198 0.35 1.0e-11 0.893000 21.80392
199 0.35 3.2e-12 0.893000 22.62745
200 0.35 1.0e-12 0.891000 22.86275
201 0.40 1.0e+00 0.999982 23.01961
202 0.40 3.2e-01 0.999951 17.29412
203 0.40 1.0e-01 0.999709 15.09804
204 0.40 3.2e-02 0.999703 15.56863
205 0.40 1.0e-02 0.998890 14.74510
206 0.40 3.2e-03 0.997960 15.21569
207 0.40 1.0e-03 0.998440 16.74510
208 0.40 3.2e-04 0.998600 18.31373
209 0.40 1.0e-04 0.998440 19.37255
210 0.40 3.2e-05 0.986400 18.19608
211 0.40 1.0e-05 0.984000 18.74510
212 0.40 3.2e-06 0.963100 18.27451
213 0.40 1.0e-06 0.963000 19.17647
214 0.40 3.2e-07 0.963000 20.27451
215 0.40 1.0e-07 0.945200 20.58824
216 0.40 3.2e-08 0.916600 20.70588
217 0.40 1.0e-08 0.881000 21.05882
218 0.40 3.2e-09 0.840000 21.05882
219 0.40 1.0e-09 0.823000 21.45098
220 0.40 3.2e-10 0.823000 22.15686
221 0.40 1.0e-10 0.823000 22.82353
222 0.40 3.2e-11 0.823000 23.33333
223 0.40 1.0e-11 0.823000 24.07843
224 0.40 3.2e-12 0.823000 24.66667
225 0.40 1.0e-12 0.823000 25.13725
226 0.45 1.0e+00 0.999972 32.03922
227 0.45 3.2e-01 0.998970 19.01961
228 0.45 1.0e-01 0.969800 14.43137
229 0.45 3.2e-02 0.999523 17.96078
230 0.45 1.0e-02 0.998330 17.68627
231 0.45 3.2e-03 0.999132 19.56863
232 0.45 1.0e-03 0.999416 21.01961
233 0.45 3.2e-04 0.999519 22.66667
234 0.45 1.0e-04 0.999037 23.37255
235 0.45 3.2e-05 0.996870 23.49020
236 0.45 1.0e-05 0.990030 23.29412
237 0.45 3.2e-06 0.973200 22.82353
238 0.45 1.0e-06 0.958900 23.05882
239 0.45 3.2e-07 0.954500 23.72549
240 0.45 1.0e-07 0.954500 25.13725
241 0.45 3.2e-08 0.954500 26.07843
242 0.45 1.0e-08 0.954500 26.74510
243 0.45 3.2e-09 0.954500 27.33333
244 0.45 1.0e-09 0.954500 27.68627
245 0.45 3.2e-10 0.954500 28.23529
246 0.45 1.0e-10 0.954500 28.58824
247 0.45 3.2e-11 0.954500 29.17647
248 0.45 1.0e-11 0.954500 29.56863
249 0.45 3.2e-12 0.954500 29.96078
250 0.45 1.0e-12 0.954500 30.58824
z <- array(parmres$percent,dim=c(length(sval),
length(parms$V1)/(2 * length(sval))))
co <- array(parmres$cutoff,dim=c(length(sval),
length(parms$V1)/(2 * length(sval))))
X11(width=4.5, height=4.5)
pplot <- function(th,ph) {
persp(x, y, z, ticktype="detailed", zlim=c(10,34), theta=th, phi=ph,
main="Percent error vs s and mindev",
xlab="-log(10) s", ylab="mindev",
zlab="percent error", shade=0.6, border=4, r=sqrt(2), d=2.5)
}
pplot(-60,20)
X11(width=4.5, height=4.5)
qplot <- function(th,ph) {
persp(x, y, co, ticktype="detailed", zlim=c(0.8,1), theta=th, phi=ph,
main="Cutoff vs s and mindev", xlab="-log(10) s",
ylab="mindev", zlab="cutoff", shade=0.6, border=4, r=sqrt(2), d=2.5)
}
qplot(-60,20)
[© Greg Louis, 2002, 2003; last modified 2003-07-25]