Greg Louis, Aug. 2003
Note added September 2004
I'd been meaning to do some followup work on the ideas presented here,
because the results haven't been consistently reproducible in further
experiments. I think that might be explained by my failure to
account for token redundancy as described
by Gary Robinson. It's conceivable that my changes unmask the
distortion caused by token redundancy and that that's why some
populations of messages do much worse than expected when equation [#5]
below is applied. It might be interesting to revisit this work with
effective size factors (ESF) enabled.
(let ((g (* 2 (or (gethash word good) 0)))
(b (or (gethash word bad) 0)))
(unless (< (+ g b) 5)
(max .01
(min .99 (float (/ (min 1 (/ b nbad))
(+ (min 1 (/ g ngood))
(min 1 (/ b nbad)))))))))
Let g be the number of times a token is found in nonspam
b be the number of times the token is found in spam
G be the number of nonspam messages used in building the database
B be the number of spam messages used in building the database
p be the likelihood that a mail containing the token is spam
Then
b/B
[#1] p = ---------
b/B + g/G
"The especially observant will notice that while I consider each corpus to be a single long stream of text for purposes of counting occurrences, I use the number of emails in each, rather than their combined length, as the divisor in calculating spam probabilities. This adds another slight bias to protect against false positives."
As I hope to demonstrate below by deriving the equation for p from first principles, the concept is right but the application may not be. I will also point out a possible difficulty in training on error, and suggest one way to work around it.
P[T|C] * P[C]
[#2] P[C|T] = ---------------------------------
P[T|C] * P[C] + P[T|C'] * P[C']
P[C|T] is the likelihood that a message containing the token is spam.
P[C|T] == p
P[T|C] = b / B
P[C] = B / (B + G)
P[T|C'] is the likelihood that a nonspam will contain the token. Our estimate of that is the proportion of nonspam messages in which the token was found.
P[T|C'] = g / G
P[C'] = G / (B + G)
P[T|C]
[#3] P[C|T] = ----------------
P[T|C] + P[T|C']
If the training database and the population of messages as a whole have the same ratio of spam to nonspam, the above definitions apply. Substituting them into equation [#2] (Bayes' rule) yields a very simple result:
b B b
- * --- ---
B B+G B+G b
[#4] p = ------------------- == --- == ---
b B g G b+g b+g
- * --- + - * --- ---
B B+G G B+G B+G
b B'
- * -----
B B'+G'
[#5] p = -----------------------
b B' g G'
- * ----- + - * -----
B B'+G' G B'+G'
When classified with normal bogofilter (equation [#1]), the test messages yielded 136 (0.83%) false positives. The messages were then classified with a modified bogofilter using equation [#5], with B' 0.5664 and G' 0.4336. The spam cutoff was adjusted to give exactly the same number of false positives. For comparison, the same was done with another modified bogofilter using equation [#4]. This would be expected to perform worst of the three, because of the large discrepancy between the spam/nonspam ratios of the database and in the message population.
Equation used False negatives
[#1] 311 (1.46%)
[#5] 303 (1.42%)
[#4] 377 (1.76%)
[#1] Graham's p(w) as currently used in bogofilter
[#2] Bayes' theorem
[#3] Graham's p(w) in the notation of [#2]
[#4] Theoretically correct p(w) if spam/nonspam is the same in the
training database as in the incoming messages
[#5] p(w) if spam/nonspam has to be estimated separately in the
population of incoming messages
[© Greg Louis, 2003; last modified 2004-09-10]