Token Redundancy and the Effective Size Factor

Introduction and general description:

This report describes an experiment intended to characterize the effects on bogofilter's discrimination capability of applying Gary Robinson's concept of an effective size factor (ESF), which he describes in an article available from this site.  For convenience (the reader's and mine  ;)  the introduction from one of my earlier experiments is repeated below.

Bogofilter is a Bayesian spam filter; background is presented on my general bogofilter page.  Bogofilter runs a modification of Paul Graham's original calculation method that was suggested by Gary Robinson.  Since this modification is the basis for the present experiment, a brief review of the calculations follows:

Paul Graham's original calculation uses a parameter p(w) that is calculated for each unique token in a message, based on the numbers of times that the token has been encountered in spam during training (badcount) and in nonspam (goodcount), scaled by the numbers of messages used to build the training database (bad- or goodlist_messagecount):

                     (badcount/badlist_messagecount)
p(w) = -----------------------------------------------------------------
       (badcount/badlist_messagecount + goodcount/goodlist_messagecount)
Special handling is needed when a token has not been encountered before, and there is some uncertainty involved when the token has been seen before, but very few times.  Robinson proposed that we calculate a value f(w) using a prior estimate, x, of the probability to be assigned to an unknown word, and a weight, s, that determines how strongly the prior influences the outcome when the counts are low:
n = badcount + goodcount

f(w) = (s * x + n * p(w)) / (s + n)
An alternative used in some implementations, that gives the same result, is:
scalefactor = badlist_messagecount / goodlist_messagecount

f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
The scale factor is the ratio of the number of messages used to make up the list of spam tokens to the number of messages used to make up the list of nonspam tokens.  Not obvious, but implicit in the second formula, is the replacement of n by  n' = badcount + goodcount * scalefactor -- this should have a negligible effect, unless the training set is extremely lopsided.

It has been found helpful to set a minimum deviation, or exclusion radius as Robinson aptly calls it.  Tokens that fall within this distance from 0.5 in their f(w) scores are ignored.

Robinson recommends applying Fisher's method of combining probabilities to obtain the actual index used to classify a message as spam or nonspam.  Fisher's method uses an inverse chi-squared function, which I'll refer to as prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2N degrees of freedom, N being the number of unique tokens in the message:

P = prbx(-2 * sum(ln(1-f(w))), 2*N)
Q = prbx(-2 * sum(ln(f(w))), 2*N)
S = (1 + Q - P) / 2

In the new article, Robinson suggests that since spam and nonspam messages both contain a degree of token redundancy, and that degree differs between the two types of messages, it might be valuable to apply an effective size factor (ESF), thus:

P = prbx(-2 * ln(prod(1-f(w))^y), 2*N*y)
Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z)
S = Q / (Q + P)
where y and z are the spam and nonspam ESF values, the prod function returns the product of all its arguments, and S is forced to 0.5 if Q and P are both very near zero.

Robinson's article includes a report of a verification experiment in which he found that using ESF could reduce classification error by an amount that was statistically significant at the 0.05 level.  The experiment now to be described was similar to his, though there were some method differences.  Also, I used a much larger corpus of messages in the hope of obtaining more statistically definitive results.

Procedure and Results:

The bogofilter package now includes a program called bogotune, which can be used to search for optimal values for the x, s, and minimum deviation parameters used by bogofilter in classification.  This program operates by doing an exhaustive grid search of the useful range of values for these three parameters; to save time, the optimum found in a first, coarsely-spaced scan is made the centre of a finer but smaller range for a second pass.  I modified bogotune to search in five dimensions instead of three: s, min_dev, x, sp_esf and ns_esf (the last two being spam and nonspam ESF respectively).  The ranges searched are:
s:        1 to 0.01
min_dev:  0.05 to 0.45
x:        Robinson's suggested starting value, plus or minus 0.1,
          bounded between 0.4 and 0.6
sp_esf    1 to 0.003171
ns_esf    1 to 0.003171
A corpus of 62,312 nonspam and 60,120 spam was used.  The spam and nonspam were each divided into five equal parts by dealing out the messages as one might deal a deck of cards to five players; that is, each file got every fifth message, starting with offsets of 0, 1, 2, 3 and 4 messages in the source corpus.  Each of the five pairs of spam and nonspam files was then used to make a bogofilter training database, which was provided to bogotune, with the four remaining parts as test messages.  In the first round of 5 bogotune runs, the ESF values were held constant at 1.0 (in effect, no ESF was used); in the second, they were allowed to vary as indicated above.

Bogotune operates by identifying a target number of false positives (nonspam classified as spam).  For each set of parameters to be tried, it determines, using the nonspam test messages, what spam cutoff value will produce the target false-positive count.  Using that value, it then processes the test spam to get the false-negative count.  The objective is to find the parameter set that gives the lowest false-negative count.

Bogotune tries not to use "outliers" -- that is, parameter sets that give good results but are extremely sensitive to small changes in any parameter.

In the end, bogotune will report the parameter values it recommends, and will (if numbers permit) offer alternative spam cutoff values for 0.01, 0.05, 0.1 and 0.2% false positives.  In this experiment, only the values suggested for 0.2% were compared.

The following table gives the results:

 run esf    x    md      s    spesf    nsesf  spco  nsco   fn
   1  no 0.70 0.450 0.0562 1.000000 1.000000 0.991 0.100 2010
   1 yes 0.55 0.200 0.1000 0.003171 0.013630 0.766 0.351 1569
   2  no 0.70 0.435 0.1000 1.000000 1.000000 0.996 0.100 2270
   2 yes 0.68 0.070 0.1000 0.004228 0.750000 0.847 0.100 2239
   3  no 0.70 0.415 0.0316 1.000000 1.000000 0.941 0.100 1824
   3 yes 0.65 0.115 0.3160 0.003171 0.100113 0.818 0.235 1541
   4  no 0.72 0.435 0.1000 1.000000 1.000000 0.996 0.100 1811
   4 yes 0.58 0.200 0.0562 0.003171 0.013363 0.762 0.347 1439
   5  no 0.74 0.435 0.0562 1.000000 1.000000 0.960 0.100 1759
   5 yes 0.64 0.185 0.0100 0.003171 0.017818 0.791 0.223 1521
Column headings are self-explanatory, except perhaps for the last three: spco is the spam cutoff required for 0.2% false positives, nsco is the nonspam cutoff that defines an "unsure" region, and fn is the count of false negatives.  A bar graph of the fn values follows:

bar graph

It seems that using ESF does lead to a useful degree of reduction in the error rate.  Gary Robinson has proposed the following way of testing the statistical significance of these results.  If there were no effect of ESF, one would expect equal numbers of fn with and without.  Therefore, we have, for each run, a binomial distribution with the sum of the with and without values as the number of trials, and the number of fn with ESF as the number of failures, with a probability of 0.5 for the null hypothesis.  Each run is an independent test of this hypothesis (Ho), yielding the following probabilites:

 run     P(Ho)
   1  8.93e-14
   2  0.33
   3  5.75e-07
   4  3.64e-11
   5  1.74e-05
These independently derived probabilities can be combined by using Fisher's method (the same calculation as is used in scoring the messages): under the null hypothesis, -2 * sum(ln P(Ho)) is distributed as chi-squared with 2N degrees of freedom, N being the number of probabilities to be combined.  The overall result is that the combined P(Ho) is equal to 1.96e-29 -- we are entitled to reject the idea that using ESF has no effect.

Further Testing:

Two further attempts were made to validate these results independently, with messages not used elsewhere. A tuning run was performed with a training database built from 23,799 nonspam and 23,953 spam, and a corpus of 51,907 nonspam and 52,127 spam, none of which had been used in the five-way test reported above.  The results appear below:
esf      x     md       s   spesf   nsesf   spco  nsco  fp   fn  fppc  fnpc
 no  0.463  0.115  0.1000  1.0000  1.0000  0.946   0.1   5  632  0.01  1.21
yes  0.463  0.145  0.0562  0.7500  1.0000  0.968   0.1   5  623  0.01  1.20
The last two columns are fp and fn as percentages.

The binomial-distribution significance test gives a P(Ho) value of 0.41 for this run.

The same training database and the above parameter values were used to classify a further 8,611 spam and 31,279 nonspam that had not been included in any previous testing:

esf  fp   fn    fppc  fnpc
 no   1   87  0.0032  1.01
yes   1   67  0.0032  0.78  (P=0.0627)
In this case the difference was not statistically significant. Another similar test was performed, this time with messages from a group of 80 users.  The training database had been built from 72,434 spam and 51,697 nonspam, and there were 60,120 spam and 62,312 nonspam in the test set, none of which had been used in training.
esf     x     md       s   spesf   nsesf   spco  nsco  fp   fn  fppc  fnpc
 no  0.48  0.035  0.1778  1.0000  1.0000  0.998   0.1 124 3215   0.2  5.35
yes  0.50  0.035  0.0100  0.0317  0.5626  0.813   0.1 124 2158   0.2  3.59
A probability of 1.2e-47 was obtained in this case.  The parameters from this run were then used to classify a further 10,504 nonspam and 38,365 spam from the same user group.  The spam cutoff values were adjusted to reduce the percent of false positives.
esf  fp    fn   fppc  fnpc
 no   7  3627  0.067  9.45
yes   7  2765  0.067  7.21  (P=2e-27)
The results obtained with ESF were uniformly better than those obtained when ESF values were held at 1.0.  Although there were some cases in which the difference was slight, the overall probability of ESF having no beneficial effect seems to be vanishingly small.

Conclusions:

  1. The use of ESF with bogofilter should, in general, lead to better discrimination between spam and nonspam.
  2. Determining the correct values of ESF to use requires a significant corpus of test data.  Users with small corpora will probably do better without ESF.
  3. Tuning, for ESF as well as the other parameters, will need to be repeated from time to time as the properties of the message corpora change.  It is quite unlikely that values can be found that will be optimal for all users, for all time.

Appendix: Improving Bogotune

When bogotune was first written, it was deemed essential to save time by doing the grid search in two parts. First the whole grid range was searched with relatively large intervals between points, to identify a region of interest. This was then rescanned, over a smaller range and with smaller intervals, to find the parameter set that gave the least error (more precisely, the lowest false-negative count when the spam cutoff was set to give a predetermined target false-positive count).

The ranges written into bogotune (after some experimentation) were as follows (both spam and nonspam esf are varied over the range shown):

Coarse scan:
s: 10**0 to -2 by -0.5			five
md: 0.05 to 0.45 by 0.05		nine
x: calculated x +/- 0.2 by 0.1		five
esf: 0.75**0 to 20 by 5			five
total					5,625

Fine scan:
s: 10**optimum +/- 0.5 by 0.25		five
md: optimum +/- 0.075 by 0.015		eleven
x: optimum +/- 0.04 by 0.02		five
esf: 0.75**optimum +/- 2 by 1		five
total (max)				6,875
Bogotune has now been altered experimentally to rebalance the intervals used in its grid search, and to eliminate gaps and overlap. In addition, the ranges of the md and esf parameters were reduced in the coarse scan, in order to reduce the likelihood of getting caught in a local minimum.

These are the new ranges covered by bogotune:

Coarse scan:
s: 10**0 to -2 by -1			three
md: 0.06 to 0.38 by 0.08		five
x: calculated x +/- 0.1 by 0.05		five
esf: 0.75**2 to 20 by 3			seven
total					3,675

Fine scan:
s: 10**optimum +/- 0.5 by 0.25		five
md: optimum +/- 0.042 by 0.014		seven
x: optimum +/- 0.026 by 0.013		five
esf: 0.75**optimum +/- 1.5 by 0.5	seven
total (max)				8,575
The totals for the fine scans are maxima, since the ranges are all bounded above and below; values outside the bounds are not tried.

The highest md that can be inspected with this scheme is 0.422; some may wish to range from -0.042 to +0.07 in the fine scan, thereby setting the upper limit at 0.45. The total number of combinations in the fine scan would then rise to 11,025. Alternatively, the coarse scan could include 0.46, which should then be the upper bound for the fine scan also; that would raise the coarse scan's total to 4,410 combinations. I did neither of these because of a wish to avoid the "shelf effect" that can occur with md values above 0.44.

To test the value of these modifications, tuning runs were performed with a training database built from 23,799 nonspam and 23,953 spam, and a corpus of 51,907 nonspam and 52,127 spam, with the old and new grid settings, with the following result:

Old grid settings (repeated, from above, for convenience):

esf      x     md       s   spesf   nsesf   spco  nsco  fp   fn  fppc  fnpc
 no  0.463  0.115  0.1000  1.0000  1.0000  0.946   0.1   5  632  0.01  1.21
yes  0.463  0.145  0.0562  0.7500  1.0000  0.968   0.1   5  623  0.01  1.20
New:
esf      x     md       s    spesf    nsesf   spco  nsco  fp   fn  fppc  fnpc
 no  0.589  0.132  0.3162 1.000000 1.000000  0.846  0.10   5  603  0.01  1.16
yes  0.500  0.118  0.1000 0.003171 0.004228  0.559  0.45   5  523  0.01  1.00
The P(Ho) value for this set of messages went from 0.41 to 0.0093. Even without ESF, the new scan did a bit better than the old, though the difference is not statistically (or functionally) significant.

Wishing to try this scheme in production, I took messages received in my personal mailbox between April 29 and May 6. There were 3,490 spam and 9,136 nonspam.

esf      x    md       s     spesf    nsesf   spco   nsco  fp  fn  fppc  fnpc
 no  0.333  0.19  0.1778  1.000000  1.00000  0.422  0.100   4  65  0.05  1.86
yes  0.543  0.04  0.0316  0.013363  0.36535  0.586  0.205   4  39  0.05  1.12
P(Ho) for this run is 0.007.

Mail for my personal box is now being filtered with bogofilter with the above ESF parameters (0.543, 0.04 etc.)

Greg Louis, 2004; last modified 2004-05-09]