Really Bayesian Bogofilter

Altering the calculation slightly

Greg Louis, Aug. 2003

 

Note added September 2004

I'd been meaning to do some followup work on the ideas presented here, because the results haven't been consistently reproducible in further experiments.  I think that might be explained by my failure to account for token redundancy as described by Gary Robinson. It's conceivable that my changes unmask the distortion caused by token redundancy and that that's why some populations of messages do much worse than expected when equation [#5] below is applied. It might be interesting to revisit this work with effective size factors (ESF) enabled.

Introduction

The way Paul Graham calculates token probability in his article A Plan for Spam is a simplification of Bayes' theorem.  I have been wondering if using the simplification is actually better than just using Bayes' theorem directly.  What Paul wrote is:
 (let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
 (unless (< (+ g b) 5)
  (max .01
       (min .99 (float (/ (min 1 (/ b nbad))
                          (+ (min 1 (/ g ngood))
                             (min 1 (/ b nbad)))))))))
Which in ordinary math, and shorn of limits and omitting doubling of g, reads
Let g be the number of times a token is found in nonspam
    b be the number of times the token is found in spam
    G be the number of nonspam messages used in building the database
    B be the number of spam messages used in building the database
    p be the likelihood that a mail containing the token is spam

Then

                  b/B
[#1]       p = ---------
               b/B + g/G
We've all been using that, and it works well.  Paul wrote:
"The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities.  This adds another slight bias to protect against false positives."

As I hope to demonstrate below by deriving the equation for p from first principles, the concept is right but the application may not be.  I will also point out a possible difficulty in training on error, and suggest one way to work around it.

 

Derivation

Bayes' rule relates the probability of a binary condition C, given a positive result T of a test for C, to the power of the test (the probability of T given C) and the likelihood of C.  If P[x|y] is read as "the probability of x given y", and C' is the complement of C, Bayes' rule is

                             P[T|C] * P[C]
[#2]       P[C|T] = ---------------------------------
                    P[T|C] * P[C]  +  P[T|C'] * P[C']
Let's express each of these terms in spam-filtering language.  C is the condition "the message is spam".  T is "the token is present".

P[C|T] is the likelihood that a message containing the token is spam.


           P[C|T] == p
P[T|C] is the likelihood that a spam will contain the token.  Our estimate of that is the proportion of spam messages in which the token was found.

           P[T|C] = b / B
P[C] is the likelihood that a random message is spam.  Our estimate of that is the proportion of spam messages used to build the training database.

           P[C] = B / (B + G)
(If we have done full training, this may be a good estimate; but what if we train on errors and unsures?  The balance may not be perfect, and to the extent it isn't, our P[C] and P[C'] values will be askew.  It might be preferable, in such a case, to have bogofilter keep a running count of the proportion of spam in the messages it has processed, and use that instead; it will be less than perfectly accurate because of bogofilter's classification errors, but for this purpose it would likely be good enough.)

P[T|C'] is the likelihood that a nonspam will contain the token.  Our estimate of that is the proportion of nonspam messages in which the token was found.


           P[T|C'] = g / G
P[C'] is the likelihood that a random message is not spam.

           P[C'] = G / (B + G)
Given these definitions, we can see that Paul's equation above, expressed in the notation I used to present Bayes' rule, is

                        P[T|C]
[#3]       P[C|T] = ----------------
                    P[T|C] + P[T|C']
This simplification is correct if and only if the numbers of spam and nonspam messages are equal (P[C] == P[C'] == 0.5).

If the training database and the population of messages as a whole have the same ratio of spam to nonspam, the above definitions apply.  Substituting them into equation [#2] (Bayes' rule) yields a very simple result:


                     b    B              b
                     - * ---            ---
                     B   B+G            B+G       b 
[#4]       p = -------------------  ==  ---  ==  ---
               b    B      g    G       b+g      b+g
               - * ---  +  - * ---      --- 
               B   B+G     G   B+G      B+G

 

Hypothesis

So in fact, for bogofilter to work optimally, it would appear that:
  1. Whatever training method we employ, we should try to keep the ratio of spam to nonspam the same as it is in the messages we are receiving.  If we can do that, then we can use p = b / (b+g) to calculate token probabilities.
  2. If we can't do that, then we need to count spam and nonspam in our incoming messages in some other way (call these counts B' and G'), and calculate
    
                           b     B'
                           - * -----
                           B   B'+G'
    [#5]       p = -----------------------
                   b     B'      g     G'
                   - * -----  +  - * -----
                   B   B'+G'     G   B'+G'
    
    instead.

 

Experimental verification

A corpus of messages collected during April and May of this year contained 21,345 (56.64%) spam and 16,340 (43.36%) nonspam.  These messages were classified with bogofilter 0.14.3.  The training database was generated from another corpus with a similar spam/nonspam ratio, but fewer spam were used in training, making the database (message counts 15,240 spam and 21,040 nonspam) deliberately unrepresentative of the population spam/nonspam ratio.  I did that because I wanted to show the effect such a distortion would have.

When classified with normal bogofilter (equation [#1]), the test messages yielded 136 (0.83%) false positives. The messages were then classified with a modified bogofilter using equation [#5], with B' 0.5664 and G' 0.4336.  The spam cutoff was adjusted to give exactly the same number of false positives.  For comparison, the same was done with another modified bogofilter using equation [#4].  This would be expected to perform worst of the three, because of the large discrepancy between the spam/nonspam ratios of the database and in the message population.


        Equation used    False negatives
            [#1]           311 (1.46%)
            [#5]           303 (1.42%)
            [#4]           377 (1.76%)
These results are as expected.  Taking the population distribution into account (equation [#5]) gave the most accurate discrimination, though ignoring it (equation [#1]) only worsened discrimination by 2.6%, which is not extreme.  Using equation [#4], clearly unjustified given the skewed spam/ham ratio, had a more severe effect (24.4%) on the number of false negatives.  These differences, of course, reflect the degree to which the spam/nonspam ratio used in calculation differed from that present in the source population.  The effect, though small, seems to be of practical significance.

 

Conclusion

The suggestions presented above under "Hypothesis" should be adopted in bogofilter.

 

List of equations


[#1]  Graham's p(w) as currently used in bogofilter
[#2]  Bayes' theorem
[#3]  Graham's p(w) in the notation of [#2]
[#4]  Theoretically correct p(w) if spam/nonspam is the same in the
      training database as in the incoming messages
[#5]  p(w) if spam/nonspam has to be estimated separately in the
      population of incoming messages

 

Acknowledgement

Thanks go to David Relson for reviewing this document in draft. Several of his helpful suggestions have been adopted; of course, responsibility for any mistakes remains with the author.

Greg Louis, 2003; last modified 2004-09-10]