Bogofilter Calculations: Comparing Bayes Chain Rule with Fisher's Method for Combining Probabilities

Introduction and general description:

The original version of bogofilter uses the computation method presented in Paul Graham's paper A Plan for Spam.  Gary Robinson took an interest in Graham's paper, and wrote an insightful commentary in which he presented several untested suggestions for improvements to Graham's method.  I was intrigued, and modified bogofilter 0.7 (and subsequently 0.7.4 and 0.7.5) to try them out.  They worked; I ran a comparison that's one of several indicating that implementing Robinson's f(w) and geometric-mean S calculations leads to improved discrimination between spam and nonspam.

More recently, Gary suggested using Fisher's method for combining probabilities to perform the S calculation.  A brief overview of the theory, and results of an experiment to compare Fisher's method with the geometric-mean calculation, are also on this site.

More recently still, the question was raised whether yet another means of selecting and combining key (that is, highly characteristic) probabilities would offer further improvement.  This method employes the Bayesian chain rule; it's used in a project called crm114.  The method is described in a file on that site named classify_details.txt.  I attempted to duplicate the algorithm from crm114's November 26 release in a special version of bogofilter.  The algorithm was modified in accordance with a suggestion of Gary Robinson's: either p(w) (as used in crm114) or f(w) (as defined in Gary's paper) can be used in calculating the "spamicity" score.

Bill Yerazunis, the author of crm114, suggested that the classifier should be trained on errors only.  That is, messages from the training corpus should be picked at random, without replacement (no message is used more than once), and fed one by one to the classifier for evaluation.  If the classification is done correctly, nothing more is done with that message.  If it's wrong, or if the classification result is uncertain, the message is then used to train the classifier.  Either way, we then move on to the next message.

Procedure and Results:

For this experiment I used a test corpus of 6372 nonspams and 6372 spam emails.  The nonspams and spams were "dealt" out, as you'd deal a deck of cards, into four separate mailboxes each; one containing 2124 messages, to be used for training, and three that contained 1416 messages each, to be used in the test.  The scripts used to accomplish this appear in Appendix B, below.

A specially-modified version of bogofilter was produced, that reported "spamicity" values from the usual Fisher (chi-squared) calculation, plus values calculated with the Bayesian chain rule with p(w) as input, and again with f(w) as input.  The code for calculating the probability for an individual token (an ugly hack of the routine in bogofilter-0.8.0's bogofilter.c file) appears in Appendix A.

The training files were used to build training databases by training-on-error as described in the previous section.  Three training databases were produced, one for each algorithm.  The scripting used to do this was excessively clumsy, and I subsequently wrote a bash script to perform this type of training.  The original quick-and-dirty stuff appears in Appendix B for verification, but I commend it to the reader's inattention  ;-)

Next, the test files were evaluated with the three training databases.  Because of the way the modified bogofilter was written, all three algorithms were applied in all three tests; however, results were saved only for the algorithm that had been used to create the training set for the current test.  The Robinson-Fisher method was also run, for comparison, against a training set created with all the training messages (full training, instead of training on error). Training on error was superior to full training in at least one respect: at the same spam cutoff (0.93), it led to fewer false positives in the test runs:

run fp-full fp-error
1        49       12         
2        40        6
3        39        6

Now in R:
t.test(c(49,40,39), c(12,6,6), paired=TRUE, conf.level=0.99)

        Paired t-test

data:  c(49, 40, 39) and c(12, 6, 6) 
t = 28.8444, df = 2, p-value = 0.001200
alternative hypothesis: true difference in means is not equal to 0 
99 percent confidence interval:
 22.73849 46.59484 
sample estimates:
mean of the differences 
               34.66667 

Setting spam cutoffs to yield 12 false positives proved impractical with the Bayesian chain rule method, as this led to an extreme excess of false negatives:

    calc run fneg percent
1    chi   0  145   10.24
2    chi   1  119    8.40
3    chi   2  139    9.82
4 bcr-pw   0  508   35.88
5 bcr-pw   1  533   37.64
6 bcr-pw   2  556   39.27
7 bcr-fw   0  405   28.60
8 bcr-fw   1  422   29.80
9 bcr-fw   2  429   30.30

    calc meanfnpc lcl95 ucl95
1    chi     9.49  7.66  11.3
2 bcr-pw    37.59 35.77  39.4
3 bcr-fw    29.57 27.74  31.4

All the algorithms we've tried produce values of the "spamicity" index between 0 and 1, 0 being nonspam and 1 being spam.  If one plots the cumulative distributions of these values for a large number of messages, the result is something like this:


Robinson-geometric-mean  Robinson-Fisher  Bayes Chain Rule
           ---                  ----              --------
          /                    /                  |
         /                    |                   |
        /                     |                   |
       /                      |                   |
      /                       |                   |
     /                       /                    |
  ---                    ----              --------

All three methods have an area of doubt, within which the discrimination between spam and nonspam is uncertain.  With the Robinson-geometric-mean model, spams return values ranging from about 0.45 to 1, and nonspams return values ranging from 0 to about 0.55; there's a grey area where they overlap and a decision is difficult.  With Robinson-Fisher, the grey area is more clearly delineated: most spams score very close to 1, most nonspams are close to zero, and anything from about 0.1 to 0.9 clearly belongs in the grey area.  Nevertheless it's possible to choose a spam cutoff value, somewhere in the range of 0.93 to 0.98, that will give quite strong discrimination with relatively few errors with this algorithm.  With the Bayes chain rule, from what I've seen, the grey area practically disappears; if a mistake is made, the algorithm leaves almost no room to express doubt.  It follows that the best way to compare the Robinson-Fisher method with the Bayesian chain rule method is to set the spam_cutoff to 0.5 in each case and count errors in either direction:

    calc run fpos fneg err percent
1    chi   0   54   56 110    7.77
2    chi   1   36   40  76    5.37
3    chi   2   41   55  96    6.78
4 bcr-pw   0   47  152 199   14.05
5 bcr-pw   1   38  180 218   15.40
6 bcr-pw   2   45  172 217   15.32
7 bcr-fw   0   92   78 170   12.01
8 bcr-fw   1   76   94 170   12.01
9 bcr-fw   2   76  105 181   12.78

    calc meanfnpc lcl95 ucl95
1    chi     6.64   5.1  8.18
2 bcr-pw    14.92  13.4 16.47
3 bcr-fw    12.26  10.7 13.81

Even so, the Robinson-Fisher method ("chi" in these tables) gave fewer errors, by almost half, than the Bayesian chain rule method with these test data.

Here is a graph of the results in the last table above.  The vertical lines indicate the 95% confidence limits.

graph

Conclusion:

The Fisher-based method of calculation (labelled "chi" in the graph and tables above) seems at this point to be superior to the Bayesian chain rule method in terms of discrimination.

Appendix A: Calculation methods:

The Bayesian chain rule method uses p(w), which is equivalent to f(w) with s set to zero, to get the initial probability estimate for a single token. The f(w) calculation is done like this:
b = badcount / badlist_messagecount
g = goodcount / goodlist_messagecount
f(w) = (s * x + b) / (s + b + g)
fn(w) = (s * x + g) / (s + b + g)
The fn(w) value is not used in the Robinson-Fisher method.

Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2n degrees of freedom:

P = prbx(-2 * sum(ln(1-f(w))), 2*n)
Q = prbx(-2 * sum(ln(f(w))), 2*n)
S = (1 + Q - P) / 2

Without s, special precautions have to be taken in the event that a token occurs nowhere in the training set (b == g == 0).  Also, a truncation step has to be performed so that probabilities of 1.0 and 0.0 are never included in the calculation.  These aren't needed if s is nonzero.

The Bayesian chain rule works like this: We calculate Ps (spam) and Pns (nonspam), setting them initially to 0.5, and then for each token:

temp = f(w) * Ps / (f(w) * Ps + fn(w) * Pns)
Pns  = fn(w) * Pns / (f(w) * Ps + fn(w) * Pns)
Ps   = temp

Next we deal with underflow: if either Ps or Pns has become zero, we set it to the smallest possible floating-point number, 1e-308. Finally, we use the smaller of Ps and Pns to recalculate the other:

if(*Pns < *Ps) { *Ps = 1.0 - *Pns; } else { *Pns = 1.0 - *Ps; }

If Ps is over 0.5, the message is deemed to be spam.  As discussed in the Procedure and Results section, tuning spam_cutoff values is of little use with this calculation method. Finally, here is the code used to get the probabilities in bogofilter for this experiment. This function is called once with fwflag true, to get Robinson-Fisher and Bayesian chain rule probabilities with fw in the calculation; and again with fwflag false, to get the Bayesian chain rule probabilities with p(w) as input.  If anyone notices an error, I would be most grateful for a report; I wrote this without a full understanding of what some parts were doing.


double wordprob_result(wordprob_t* wordstats, int fwflag,
    double *Ps, double *Pns)
{
    double s = fwflag ? ROBS/badmsgs : 0.0; double b, g, prob, probns, c;
    b = wordstats->bad/badmsgs; g = wordstats->good/goodmsgs;
    if(!fwflag && wordstats->bad == 0.0 && wordstats->good == 0.0) {
        prob = probns = 0.5;
    } else {
        prob = (s * ROBX + b) / (s + b + g);
        probns = (s * ROBX + g) / (s + b + g);
    }
    if(!fwflag) {
        c = 1.0 / (wordstats->bad + wordstats->good + 2.0) +
            1.0 / (wordstats->bad + wordstats->good + 1.0);
        c = min(c, 0.5);
        prob = min(prob, 1.0 - c);
        prob = max(prob, c);
        probns = min(probns, 1.0 - c);
        probns = max(probns, c);
        }
    c = prob * *Ps / (prob * *Ps + probns * *Pns);
    *Pns = probns * *Pns / (prob * *Ps + probns * *Pns);
    *Ps = c;
    *Ps = max(*Ps, MINDOUBLE);
    *Pns = max(*Pns, MINDOUBLE);
    if(*Pns < *Ps) { *Ps = 1.0 - *Pns; } else { *Pns = 1.0 - *Ps; }
    return (prob);
}

Appendix B: Notes on experimental procedure:

The following notes should suffice if anyone wishes to repeat the experiment:

This is a repeat of bogobcr2 but with training only on errors.

6372 spam and 6372 nonspam, filtered to remove duplicates and errors.
cat distrib
#! /bin/sh
# usage: FILENO=0 formail -s ./distrib [bad|good] >$fname

FILENO=0 formail -s ./distrib sp < sieved.bad
FILENO=0 formail -s ./distrib ns < sieved.good

grep -c '^From ' t.*   
t.ns:2124
t.sp:2124

grep -c '^From ' r*
r0.ns:1416
r0.sp:1416
r1.ns:1416
r1.sp:1416
r2.ns:1416
r2.sp:1416

Use formail to separate training spams and nonspams into individual
$files named {RANDOM}s or ${RANDOM}n respectively.

cat randomfn
#! /bin/sh
while true; do
  fnam=${RANDOM}$1
  if [ ! -f $fnam ]; then break; fi
done
cat >F/$fnam

mkdir F
formail -s randomfn s < t.sp
formail -s randomfn n < t.ns

Use "for file in *" to train or not, based on error

cat trainerr
#! /bin/sh                                               
file=$1                                                  
expect=${file:0-1:1}                                     
# print $1 for Fisher, $3 for bcrPw, $5 for bcrFw
got=`~/bin/bogobcr -d ~/bcr/db -b <$file 2>&1 | awk '{print $1}'`
test "$got" = ${got:0:1} || got=3
echo -n "bogo=$got, "
if [ $got -eq 0 ]; then got="n"; elif [ $got -eq 1 ]; then got="s"; fi
echo -n "expected $expect, got $got"
if [ $got != $expect ]; then
    echo -n ", registering $file"
    ~/bin/bogobcr -d ~/bcr/db -$expect < $file
fi
echo

===== PART 1: Train with Robinson-Fisher =====

for file in *; do ~/bin/trainerr $file; done

bogoutil -w ../db .MSG_COUNT
                       spam   good
.MSG_COUNT              466    415

for run in 0 1 2; do
  formail -s ~/bin/bogobcr -d db -b < r$run.sp &> sp-$run
  formail -s ~/bin/bogobcr -d db -b < r$run.ns &> ns-$run
done  

grep -c -v ^1 ns-?
ns-0:1404  # 12
ns-1:1408  # 6
ns-2:1408  # 6

Compare this with the results obtained when all messages in t.ns and
t.sp were used for training:

ns-0:1367  # 49
ns-1:1376  # 40
ns-2:1377  # 39

===== PART 2: Train with Bcr-fw =====

for file in *; do ~/bin/trainerr $file; done

bogoutil -w ../db .MSG_COUNT
                       spam   good
.MSG_COUNT              170    137

for run in 0 1 2; do
  formail -s ~/bin/bogobcr -d db -b < r$run.sp &> sp-$run
  formail -s ~/bin/bogobcr -d db -b < r$run.ns &> ns-$run
done

grep -c ". ........   . ........   1" ns-bfw-?
ns-bfw-0:92
ns-bfw-1:76
ns-bfw-2:76

======= PART 3: Train with Bcr-pw =========

for file in *; do ~/bin/trainerr $file; done

bogoutil -w ../db .MSG_COUNT
                       spam   good
.MSG_COUNT              183    115

for run in 0 1 2; do
  formail -s ~/bin/bogobcr -d db -b < r$run.sp &> sp-$run
  formail -s ~/bin/bogobcr -d db -b < r$run.ns &> ns-$run
done

grep -c "^. ........   1" ns-bpw-?
ns-bpw-0:47
ns-bpw-1:38
ns-bpw-2:45

pr -l 9999 -F -w 990 -m -n -J -S" " -t ns-???-0 >ns0.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns-???-1 >ns1.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns-???-2 >ns2.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp-???-0 >sp0.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp-???-1 >sp1.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp-???-2 >sp2.dat

for num in 0 1 2; do
  awk '{print $1,"   ",$14," ",$15,"   ",$10," ",$11,"   ",$6," ",$7}' \
    ns$num.dat >ns-$num
  awk '{print $1,"   ",$14," ",$15,"   ",$10," ",$11,"   ",$6," ",$7}' \
    sp$num.dat >sp-$num
done
joe hdr
for file in ns-0 ns-1 ns-2 sp-0 sp-1 sp-2; do
  cat hdr $file >work; mv work $file
done

At this point we have data for each algorithm's performance when
training was performed with that algorithm.

Now in R:
> t.test(c(49,40,39), c(12,6,6), paired=TRUE, conf.level=0.99)

        Paired t-test

data:  c(49, 40, 39) and c(12, 6, 6) 
t = 28.8444, df = 2, p-value = 0.001200
alternative hypothesis: true difference in means is not equal to 0 
99 percent confidence interval:
 22.73849 46.59484 
sample estimates:
mean of the differences 
               34.66667 

read.table("F/ns-0") -> ns0
read.table("F/ns-1") -> ns1
read.table("F/ns-2") -> ns2
read.table("F/sp-0") -> sp0
read.table("F/sp-1") -> sp1
read.table("F/sp-2") -> sp2

comb <- function (x) {
    y <- if(x[1] == 1) 1 - as.real(x[2]) else as.real(x[2])
    for(i in c(3,5))
        y <- c(y, if(x[i] == 1) 1 - as.real(x[i+1]) else
as.real(x[i+1]))
    y
}
 
cns0 <- t(apply(ns0,1,comb))
cns1 <- t(apply(ns1,1,comb))
cns2 <- t(apply(ns2,1,comb))
csp0 <- t(apply(sp0,1,comb))
csp1 <- t(apply(sp1,1,comb))
csp2 <- t(apply(sp2,1,comb))

fp <- 12
ts <- apply(cns0,2,sort,decreasing=TRUE)
fp0 <- pmin(ts[fp+1,], 1-1e-15)
ts <- apply(cns1,2,sort,decreasing=TRUE)
fp1 <- pmin(ts[fp+1,], 1-1e-15)
ts <- apply(cns2,2,sort,decreasing=TRUE)
fp2 <- pmin(ts[fp+1,], 1-1e-15)
fp3 <- t(array(c(fp0,fp1,fp2), dim=c(3,3)))
fcutoff <- apply(fp3,2,mean)

fn0 <- c(0,0,0)
fn1 <- fn0
fn2 <- fn0
for(n in 1:3) fn0[n] <- sum(csp0[,n] <= fcutoff[n])
for(n in 1:3) fn1[n] <- sum(csp1[,n] <= fcutoff[n])
for(n in 1:3) fn2[n] <- sum(csp2[,n] <= fcutoff[n])
fnf <- t(array(c(fn0,fn1,fn2), dim=c(3,3)))
fnf
     [,1] [,2] [,3]
[1,]  145  508  405
[2,]  119  533  422
[3,]  139  556  429

fn <- c(fnf)
n <- rep(c(0,1,2),3)
calc <- c(rep("chi",3),rep("bcr-pw",3),rep("bcr-fw",3))
bcr3 <- data.frame(calc=calc, run=n, fneg=fn, percent=100*fn/1416)

print(bcr3,digits=3)
    calc run fneg percent
1    chi   0  145   10.24
2    chi   1  119    8.40
3    chi   2  139    9.82
4 bcr-pw   0  508   35.88
5 bcr-pw   1  533   37.64
6 bcr-pw   2  556   39.27
7 bcr-fw   0  405   28.60
8 bcr-fw   1  422   29.80
9 bcr-fw   2  429   30.30

attach(bcr3)
bcr3$calc <- factor(bcr3$calc)
bcr3$run <- factor(run)
bcr3aov <- aov(percent ~ calc + run, data=bcr3)
summary(bcr3aov)
            Df  Sum Sq Mean Sq  F value    Pr(>F)    
calc         2 1257.67  628.83 485.9799 1.680e-05 ***
run          2    3.94    1.97   1.5229    0.3223    
Residuals    4    5.18    1.29                       

rdf <- 4
rms <- deviance(bcr3aov)/rdf
d <- c(1.95996, 0.412, 0.423)
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanfn <- apply(array(bcr3$percent,dim=c(3,3)), 2, mean)
lcl95 <- meanfn - z       
ucl95 <- meanfn + z       

bcr3res <- data.frame(calc=c("chi", "bcr-pw", "bcr-fw"),
  meanfnpc=meanfn, lcl95=lcl95, ucl95=ucl95)
print(bcr3res,digits=3)
    calc meanfnpc lcl95 ucl95
1    chi     9.49  7.66  11.3
2 bcr-pw    37.59 35.77  39.4
3 bcr-fw    29.57 27.74  31.4

plot(bcr3res$calc,bcr3res$meanfnpc,ylim=c(0,40),
  main="Bayes Chain Rule vs Fisher Chi-Squared",
  xlab="Calculation", ylab="Percent false negatives")
lines(bcr3res$calc,bcr3res$ucl95,type="h")
lines(bcr3res$calc,bcr3res$lcl95,type="h",col="white")

fcutoff <- c(rep(0.5,3))
fn0 <- c(0,0,0)
fn1 <- fn0
fn2 <- fn0
for(n in 1:3) fn0[n] <- sum(csp0[,n] <= fcutoff[n])
for(n in 1:3) fn1[n] <- sum(csp1[,n] <= fcutoff[n])
for(n in 1:3) fn2[n] <- sum(csp2[,n] <= fcutoff[n])
fnf <- t(array(c(fn0,fn1,fn2), dim=c(3,3)))
fnf
     [,1] [,2] [,3]
[1,]   56  152   78
[2,]   40  180   94
[3,]   55  172  105

fn <- c(fnf)
fp0 <- c(0,0,0)
fp1 <- fp0
fp2 <- fp0 
for(n in 1:3) fp0[n] <- sum(cns0[,n] > fcutoff[n])
for(n in 1:3) fp1[n] <- sum(cns1[,n] > fcutoff[n])
for(n in 1:3) fp2[n] <- sum(cns2[,n] > fcutoff[n])
fpf <- t(array(c(fp0,fp1,fp2), dim=c(3,3)))
fpf
     [,1] [,2] [,3]
[1,]   54   47   92
[2,]   36   38   76
[3,]   41   45   76

fp <- c(fpf)
bcr3 <- data.frame(calc=calc, run=n, fpos=fp, fneg=fn, err=fp+fn,
  percent=100*(fp+fn)/1416)

print(bcr3,digits=3)
    calc run fpos fneg err percent
1    chi   0   54   56 110    7.77
2    chi   1   36   40  76    5.37
3    chi   2   41   55  96    6.78
4 bcr-pw   0   47  152 199   14.05
5 bcr-pw   1   38  180 218   15.40
6 bcr-pw   2   45  172 217   15.32
7 bcr-fw   0   92   78 170   12.01
8 bcr-fw   1   76   94 170   12.01
9 bcr-fw   2   76  105 181   12.78

attach(bcr3)
bcr3$calc <- factor(bcr3$calc)
bcr3$run <- factor(bcr3$run)
bcraov <- aov(percent ~ calc + run, data=bcr3)
summary (bcraov)
            Df  Sum Sq Mean Sq F value   Pr(>F)   
calc         2 107.392  53.696 57.9354 0.001114 **
run          2   0.748   0.374  0.4036 0.692373   
Residuals    4   3.707   0.927                    

rms <- deviance(bcraov)/rdf
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanfn <- apply(array(bcr3$percent,dim=c(3,3)), 2, mean)
lcl95 <- meanfn - z          
ucl95 <- meanfn + z       

bcr3res <- data.frame(calc=c("chi", "bcr-pw", "bcr-fw"),
  meanfnpc=meanfn, lcl95=lcl95, ucl95=ucl95)

print(bcr3res,digits=3)
    calc meanfnpc lcl95 ucl95
1    chi     6.64   5.1  8.18
2 bcr-pw    14.92  13.4 16.47
3 bcr-fw    12.26  10.7 13.81

plot(bcr3res$calc,bcr3res$meanfnpc,ylim=c(0,20),
  main="Bayes Chain Rule vs Fisher Chi-Squared",
  xlab="Calculation", ylab="Percent error")
lines(bcr3res$calc,bcr3res$ucl95,type="h")
lines(bcr3res$calc,bcr3res$lcl95,type="h",col="white")

Greg Louis, 2002, 2003; last modified 2003-04-12]