More recently, Gary suggested using Fisher's method for combining probabilities to perform the S calculation. A brief overview of the theory, and results of an experiment to compare Fisher's method with the geometric-mean calculation, are also on this site.
More recently still, the question was raised whether yet another means of selecting and combining key (that is, highly characteristic) probabilities would offer further improvement. This method employes the Bayesian chain rule; it's used in a project called crm114. The method is described in a file on that site named classify_details.txt. I attempted to duplicate the algorithm from crm114's November 26 release in a special version of bogofilter. The algorithm was modified in accordance with a suggestion of Gary Robinson's: either p(w) (as used in crm114) or f(w) (as defined in Gary's paper) can be used in calculating the "spamicity" score.
Bill Yerazunis, the author of crm114, suggested that the classifier should be trained on errors only. That is, messages from the training corpus should be picked at random, without replacement (no message is used more than once), and fed one by one to the classifier for evaluation. If the classification is done correctly, nothing more is done with that message. If it's wrong, or if the classification result is uncertain, the message is then used to train the classifier. Either way, we then move on to the next message.
A specially-modified version of bogofilter was produced, that reported "spamicity" values from the usual Fisher (chi-squared) calculation, plus values calculated with the Bayesian chain rule with p(w) as input, and again with f(w) as input. The code for calculating the probability for an individual token (an ugly hack of the routine in bogofilter-0.8.0's bogofilter.c file) appears in Appendix A.
The training files were used to build training databases by training-on-error as described in the previous section. Three training databases were produced, one for each algorithm. The scripting used to do this was excessively clumsy, and I subsequently wrote a bash script to perform this type of training. The original quick-and-dirty stuff appears in Appendix B for verification, but I commend it to the reader's inattention ;-)
Next, the test files were evaluated with the three training databases. Because of the way the modified bogofilter was written, all three algorithms were applied in all three tests; however, results were saved only for the algorithm that had been used to create the training set for the current test. The Robinson-Fisher method was also run, for comparison, against a training set created with all the training messages (full training, instead of training on error). Training on error was superior to full training in at least one respect: at the same spam cutoff (0.93), it led to fewer false positives in the test runs:
run fp-full fp-error
1 49 12
2 40 6
3 39 6
Now in R:
t.test(c(49,40,39), c(12,6,6), paired=TRUE, conf.level=0.99)
Paired t-test
data: c(49, 40, 39) and c(12, 6, 6)
t = 28.8444, df = 2, p-value = 0.001200
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
22.73849 46.59484
sample estimates:
mean of the differences
34.66667
Setting spam cutoffs to yield 12 false positives proved impractical with the Bayesian chain rule method, as this led to an extreme excess of false negatives:
calc run fneg percent
1 chi 0 145 10.24
2 chi 1 119 8.40
3 chi 2 139 9.82
4 bcr-pw 0 508 35.88
5 bcr-pw 1 533 37.64
6 bcr-pw 2 556 39.27
7 bcr-fw 0 405 28.60
8 bcr-fw 1 422 29.80
9 bcr-fw 2 429 30.30
calc meanfnpc lcl95 ucl95
1 chi 9.49 7.66 11.3
2 bcr-pw 37.59 35.77 39.4
3 bcr-fw 29.57 27.74 31.4
All the algorithms we've tried produce values of the "spamicity" index between 0 and 1, 0 being nonspam and 1 being spam. If one plots the cumulative distributions of these values for a large number of messages, the result is something like this:
Robinson-geometric-mean Robinson-Fisher Bayes Chain Rule
--- ---- --------
/ / |
/ | |
/ | |
/ | |
/ | |
/ / |
--- ---- --------
All three methods have an area of doubt, within which the discrimination between spam and nonspam is uncertain. With the Robinson-geometric-mean model, spams return values ranging from about 0.45 to 1, and nonspams return values ranging from 0 to about 0.55; there's a grey area where they overlap and a decision is difficult. With Robinson-Fisher, the grey area is more clearly delineated: most spams score very close to 1, most nonspams are close to zero, and anything from about 0.1 to 0.9 clearly belongs in the grey area. Nevertheless it's possible to choose a spam cutoff value, somewhere in the range of 0.93 to 0.98, that will give quite strong discrimination with relatively few errors with this algorithm. With the Bayes chain rule, from what I've seen, the grey area practically disappears; if a mistake is made, the algorithm leaves almost no room to express doubt. It follows that the best way to compare the Robinson-Fisher method with the Bayesian chain rule method is to set the spam_cutoff to 0.5 in each case and count errors in either direction:
calc run fpos fneg err percent
1 chi 0 54 56 110 7.77
2 chi 1 36 40 76 5.37
3 chi 2 41 55 96 6.78
4 bcr-pw 0 47 152 199 14.05
5 bcr-pw 1 38 180 218 15.40
6 bcr-pw 2 45 172 217 15.32
7 bcr-fw 0 92 78 170 12.01
8 bcr-fw 1 76 94 170 12.01
9 bcr-fw 2 76 105 181 12.78
calc meanfnpc lcl95 ucl95
1 chi 6.64 5.1 8.18
2 bcr-pw 14.92 13.4 16.47
3 bcr-fw 12.26 10.7 13.81
Even so, the Robinson-Fisher method ("chi" in these tables) gave fewer errors, by almost half, than the Bayesian chain rule method with these test data.
Here is a graph of the results in the last table above. The vertical lines indicate the 95% confidence limits.

b = badcount / badlist_messagecount g = goodcount / goodlist_messagecount f(w) = (s * x + b) / (s + b + g) fn(w) = (s * x + g) / (s + b + g)
Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2n degrees of freedom:
P = prbx(-2 * sum(ln(1-f(w))), 2*n) Q = prbx(-2 * sum(ln(f(w))), 2*n) S = (1 + Q - P) / 2
Without s, special precautions have to be taken in the event that a token occurs nowhere in the training set (b == g == 0). Also, a truncation step has to be performed so that probabilities of 1.0 and 0.0 are never included in the calculation. These aren't needed if s is nonzero.
The Bayesian chain rule works like this: We calculate Ps (spam) and Pns (nonspam), setting them initially to 0.5, and then for each token:
temp = f(w) * Ps / (f(w) * Ps + fn(w) * Pns) Pns = fn(w) * Pns / (f(w) * Ps + fn(w) * Pns) Ps = temp
Next we deal with underflow: if either Ps or Pns has become zero, we set it to the smallest possible floating-point number, 1e-308. Finally, we use the smaller of Ps and Pns to recalculate the other:
if(*Pns < *Ps) { *Ps = 1.0 - *Pns; } else { *Pns = 1.0 - *Ps; }
If Ps is over 0.5, the message is deemed to be spam. As discussed in the Procedure and Results section, tuning spam_cutoff values is of little use with this calculation method. Finally, here is the code used to get the probabilities in bogofilter for this experiment. This function is called once with fwflag true, to get Robinson-Fisher and Bayesian chain rule probabilities with fw in the calculation; and again with fwflag false, to get the Bayesian chain rule probabilities with p(w) as input. If anyone notices an error, I would be most grateful for a report; I wrote this without a full understanding of what some parts were doing.
double wordprob_result(wordprob_t* wordstats, int fwflag,
double *Ps, double *Pns)
{
double s = fwflag ? ROBS/badmsgs : 0.0; double b, g, prob, probns, c;
b = wordstats->bad/badmsgs; g = wordstats->good/goodmsgs;
if(!fwflag && wordstats->bad == 0.0 && wordstats->good == 0.0) {
prob = probns = 0.5;
} else {
prob = (s * ROBX + b) / (s + b + g);
probns = (s * ROBX + g) / (s + b + g);
}
if(!fwflag) {
c = 1.0 / (wordstats->bad + wordstats->good + 2.0) +
1.0 / (wordstats->bad + wordstats->good + 1.0);
c = min(c, 0.5);
prob = min(prob, 1.0 - c);
prob = max(prob, c);
probns = min(probns, 1.0 - c);
probns = max(probns, c);
}
c = prob * *Ps / (prob * *Ps + probns * *Pns);
*Pns = probns * *Pns / (prob * *Ps + probns * *Pns);
*Ps = c;
*Ps = max(*Ps, MINDOUBLE);
*Pns = max(*Pns, MINDOUBLE);
if(*Pns < *Ps) { *Ps = 1.0 - *Pns; } else { *Pns = 1.0 - *Ps; }
return (prob);
}
Appendix B: Notes on experimental procedure:
The following notes should suffice if anyone wishes to repeat the
experiment:
This is a repeat of bogobcr2 but with training only on errors.
6372 spam and 6372 nonspam, filtered to remove duplicates and errors.
cat distrib
#! /bin/sh
# usage: FILENO=0 formail -s ./distrib [bad|good]