Bogofilter is a Bayesian spam filter; background is presented on my general bogofilter page. Bogofilter runs a modification of Paul Graham's original calculation method that was suggested by Gary Robinson. Since this modification is the basis for the present experiment, a brief review of the calculations follows:
Paul Graham's original calculation uses a parameter p(w) that is calculated for each unique token in a message, based on the numbers of times that the token has been encountered in spam during training (badcount) and in nonspam (goodcount), scaled by the numbers of messages used to build the training database (bad- or goodlist_messagecount):
(badcount/badlist_messagecount)
p(w) = -----------------------------------------------------------------
(badcount/badlist_messagecount + goodcount/goodlist_messagecount)
n = badcount + goodcount f(w) = (s * x + n * p(w)) / (s + n)
scalefactor = badlist_messagecount / goodlist_messagecount f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
n' = badcount + goodcount * scalefactor
-- this should have a negligible effect, unless the training
set is extremely lopsided.It has been found helpful to set a minimum deviation, or exclusion radius as Robinson aptly calls it. Tokens that fall within this distance from 0.5 in their f(w) scores are ignored.
Robinson recommends applying Fisher's method of combining probabilities to obtain the actual index used to classify a message as spam or nonspam. Fisher's method uses an inverse chi-squared function, which I'll refer to as prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2N degrees of freedom, N being the number of unique tokens in the message:
P = prbx(-2 * sum(ln(1-f(w))), 2*N) Q = prbx(-2 * sum(ln(f(w))), 2*N) S = (1 + Q - P) / 2
In the new article, Robinson suggests that since spam and nonspam messages both contain a degree of token redundancy, and that degree differs between the two types of messages, it might be valuable to apply an effective size factor (ESF), thus:
P = prbx(-2 * ln(prod(1-f(w))^y), 2*N*y) Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z) S = Q / (Q + P)
Robinson's article includes a report of a verification experiment in which he found that using ESF could reduce classification error by an amount that was statistically significant at the 0.05 level. The experiment now to be described was similar to his, though there were some method differences. Also, I used a much larger corpus of messages in the hope of obtaining more statistically definitive results.
s: 1 to 0.01
min_dev: 0.05 to 0.45
x: Robinson's suggested starting value, plus or minus 0.1,
bounded between 0.4 and 0.6
sp_esf 1 to 0.003171
ns_esf 1 to 0.003171
Bogotune operates by identifying a target number of false positives (nonspam classified as spam). For each set of parameters to be tried, it determines, using the nonspam test messages, what spam cutoff value will produce the target false-positive count. Using that value, it then processes the test spam to get the false-negative count. The objective is to find the parameter set that gives the lowest false-negative count.
Bogotune tries not to use "outliers" -- that is, parameter sets that give good results but are extremely sensitive to small changes in any parameter.
In the end, bogotune will report the parameter values it recommends, and will (if numbers permit) offer alternative spam cutoff values for 0.01, 0.05, 0.1 and 0.2% false positives. In this experiment, only the values suggested for 0.2% were compared.
The following table gives the results:
run esf x md s spesf nsesf spco nsco fn 1 no 0.70 0.450 0.0562 1.000000 1.000000 0.991 0.100 2010 1 yes 0.55 0.200 0.1000 0.003171 0.013630 0.766 0.351 1569 2 no 0.70 0.435 0.1000 1.000000 1.000000 0.996 0.100 2270 2 yes 0.68 0.070 0.1000 0.004228 0.750000 0.847 0.100 2239 3 no 0.70 0.415 0.0316 1.000000 1.000000 0.941 0.100 1824 3 yes 0.65 0.115 0.3160 0.003171 0.100113 0.818 0.235 1541 4 no 0.72 0.435 0.1000 1.000000 1.000000 0.996 0.100 1811 4 yes 0.58 0.200 0.0562 0.003171 0.013363 0.762 0.347 1439 5 no 0.74 0.435 0.0562 1.000000 1.000000 0.960 0.100 1759 5 yes 0.64 0.185 0.0100 0.003171 0.017818 0.791 0.223 1521

It seems that using ESF does lead to a useful degree of reduction in the error rate. Gary Robinson has proposed the following way of testing the statistical significance of these results. If there were no effect of ESF, one would expect equal numbers of fn with and without. Therefore, we have, for each run, a binomial distribution with the sum of the with and without values as the number of trials, and the number of fn with ESF as the number of failures, with a probability of 0.5 for the null hypothesis. Each run is an independent test of this hypothesis (Ho), yielding the following probabilites:
run P(Ho) 1 8.93e-14 2 0.33 3 5.75e-07 4 3.64e-11 5 1.74e-05
esf x md s spesf nsesf spco nsco fp fn fppc fnpc no 0.463 0.115 0.1000 1.0000 1.0000 0.946 0.1 5 632 0.01 1.21 yes 0.463 0.145 0.0562 0.7500 1.0000 0.968 0.1 5 623 0.01 1.20
The binomial-distribution significance test gives a P(Ho) value of 0.41 for this run.
The same training database and the above parameter values were used to classify a further 8,611 spam and 31,279 nonspam that had not been included in any previous testing:
esf fp fn fppc fnpc no 1 87 0.0032 1.01 yes 1 67 0.0032 0.78 (P=0.0627)
esf x md s spesf nsesf spco nsco fp fn fppc fnpc no 0.48 0.035 0.1778 1.0000 1.0000 0.998 0.1 124 3215 0.2 5.35 yes 0.50 0.035 0.0100 0.0317 0.5626 0.813 0.1 124 2158 0.2 3.59
esf fp fn fppc fnpc no 7 3627 0.067 9.45 yes 7 2765 0.067 7.21 (P=2e-27)
The ranges written into bogotune (after some experimentation) were as follows (both spam and nonspam esf are varied over the range shown):
Coarse scan: s: 10**0 to -2 by -0.5 five md: 0.05 to 0.45 by 0.05 nine x: calculated x +/- 0.2 by 0.1 five esf: 0.75**0 to 20 by 5 five total 5,625 Fine scan: s: 10**optimum +/- 0.5 by 0.25 five md: optimum +/- 0.075 by 0.015 eleven x: optimum +/- 0.04 by 0.02 five esf: 0.75**optimum +/- 2 by 1 five total (max) 6,875
These are the new ranges covered by bogotune:
Coarse scan: s: 10**0 to -2 by -1 three md: 0.06 to 0.38 by 0.08 five x: calculated x +/- 0.1 by 0.05 five esf: 0.75**2 to 20 by 3 seven total 3,675 Fine scan: s: 10**optimum +/- 0.5 by 0.25 five md: optimum +/- 0.042 by 0.014 seven x: optimum +/- 0.026 by 0.013 five esf: 0.75**optimum +/- 1.5 by 0.5 seven total (max) 8,575
The highest md that can be inspected with this scheme is 0.422; some may wish to range from -0.042 to +0.07 in the fine scan, thereby setting the upper limit at 0.45. The total number of combinations in the fine scan would then rise to 11,025. Alternatively, the coarse scan could include 0.46, which should then be the upper bound for the fine scan also; that would raise the coarse scan's total to 4,410 combinations. I did neither of these because of a wish to avoid the "shelf effect" that can occur with md values above 0.44.
To test the value of these modifications, tuning runs were performed with a training database built from 23,799 nonspam and 23,953 spam, and a corpus of 51,907 nonspam and 52,127 spam, with the old and new grid settings, with the following result:
Old grid settings (repeated, from above, for convenience):
esf x md s spesf nsesf spco nsco fp fn fppc fnpc no 0.463 0.115 0.1000 1.0000 1.0000 0.946 0.1 5 632 0.01 1.21 yes 0.463 0.145 0.0562 0.7500 1.0000 0.968 0.1 5 623 0.01 1.20
esf x md s spesf nsesf spco nsco fp fn fppc fnpc no 0.589 0.132 0.3162 1.000000 1.000000 0.846 0.10 5 603 0.01 1.16 yes 0.500 0.118 0.1000 0.003171 0.004228 0.559 0.45 5 523 0.01 1.00
Wishing to try this scheme in production, I took messages received in my personal mailbox between April 29 and May 6. There were 3,490 spam and 9,136 nonspam.
esf x md s spesf nsesf spco nsco fp fn fppc fnpc no 0.333 0.19 0.1778 1.000000 1.00000 0.422 0.100 4 65 0.05 1.86 yes 0.543 0.04 0.0316 0.013363 0.36535 0.586 0.205 4 39 0.05 1.12
Mail for my personal box is now being filtered with bogofilter with the above ESF parameters (0.543, 0.04 etc.)
[© Greg Louis, 2004; last modified 2004-05-09]