Bogofilter Calculations: Comparing Beta Distribution with Fisher's Method for Combining Probabilities

Introduction and general description:

The original version of bogofilter uses the computation method presented in Paul Graham's paper A Plan for Spam.  Gary Robinson took an interest in Graham's paper, and wrote an insightful commentary in which he presented several untested suggestions for improvements to Graham's method.  I was intrigued, and modified bogofilter 0.7 (and subsequently 0.7.4 and 0.7.5) to try them out.  They worked; I ran a comparison that's one of several indicating that implementing Robinson's f(w) and geometric-mean S calculations leads to improved discrimination between spam and nonspam.

More recently, Gary suggested using Fisher's method for combining probabilities to perform the S calculation.  A brief overview of the theory, and results of an experiment to compare Fisher's method with the geometric-mean calculation, are also on this site.

More recently still, the question was raised whether yet another means of selecting and combining key (that is, highly characteristic) probabilities would offer further improvement.  The idea is that we should, for some relatively small number r, determine the rth-closest f(w) value to 0 and the rth-closest f(w) value to 1; these will present beta distributions with parameters r and n-r+1, where n is the number of f(w) values from among which we chose. It seemed possible that the selected values, plus the fact that there must be r-1 other f(w) values more extreme than those chosen, would constitute adequate information on which to base a decision. This idea too was presented by Gary Robinson, and the object of the experiment to be reported here was to try it out.

Procedure and Results:

For this experiment I used a test corpus of 2499 nonspams and 2526 spam emails.  The nonspams and spams were "dealt" out into three separate mailboxes each; these were fed through a specially-modified version of bogofilter that reported "spamicity" values from the usual Fisher (chi-squared) calculation, plus values calculated via the beta distribution with r ranging from 1 to 10.

The nonspam runs with Fisher's method each produced 4 false positives out of 833 nonspams.  This was taken as the standard for the remaining runs. For each of these, the nonspam "spamicity" values were sorted in decreasing order and the mean of the 5th values on the three sorted lists was taken as the cutoff for the corresponding spam run. This produced the following cutoff values:


   calc     fcutoff
1   chi   0.9433333
2   r01   0.9940433
3   r02   0.9600000
4   r03   0.9026667
5   r04   0.9221667
6   r05   0.9350000
7   r06   0.8591000
8   r07   0.8936667
9   r08   0.8903333
10  r09   0.9172667
11  r10   0.9095000

For each spam run, the wanted datum was the count of false negatives (spam messages with "spamicity" values below the cutoff).  In the following table, the "percent" column gives the percentage of false negatives; see Appendix A for details.


   calc run fneg
1   chi   0   89
2   chi   1   92
3   chi   2  100
4   r01   0  430
5   r01   1  429
6   r01   2  437
7   r02   0  262
8   r02   1  281
9   r02   2  283
10  r03   0  210
11  r03   1  236
12  r03   2  239
13  r04   0  194
14  r04   1  221
15  r04   2  225
16  r05   0  198
17  r05   1  217
18  r05   2  223
19  r06   0  173
20  r06   1  191
21  r06   2  194
22  r07   0  180
23  r07   1  198
24  r07   2  209
25  r08   0  174
26  r08   1  192
27  r08   2  208
28  r09   0  182
29  r09   1  200
30  r09   2  222
31  r10   0  185
32  r10   1  206
33  r10   2  222

 

Summarizing, with means and 95% confidence limits:


   calc percent lcl95 ucl95
1   chi    11.1  10.3  12.0
2   r01    51.4  50.5  52.2
3   r02    32.7  31.9  33.6
4   r03    27.2  26.3  28.0
5   r04    25.4  24.5  26.2
6   r05    25.3  24.4  26.2
7   r06    22.1  21.2  23.0
8   r07    23.3  22.4  24.1
9   r08    22.8  21.9  23.6
10  r09    23.9  23.1  24.8
11  r10    24.3  23.4  25.2

Here's a graph of these results.  The vertical lines show the 95% confidence limits.

graph

Conclusion:

The Fisher-based method of calculation (labelled "chi" in the graph and tables above) seems to offer a distinct advantage over the beta-distribution method in terms of discrimination power.

Appendix A: Calculation methods:

Both beta-distribution- and Fisher-based calculation use
scalefactor = badlist_messagecount / goodlist_messagecount
f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
The scale factor is the ratio of messages used to make up the list of spam tokens to messages used to make up the list of nonspam tokens.

The f(w) value is calculated for each unique token in the message; the number of such tokens is represented below by n.

Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2n degrees of freedom:

P = prbx(-2 * sum(ln(1-f(w))), 2*n)
Q = prbx(-2 * sum(ln(f(w))), 2*n)
S = (1 + Q - P) / 2

With the beta distribution, we choose a value of r and rank the r f(w) values closest to 0 and the r f(w) values closest to 1.  We then obtain, with a beta-distribution function, the values of p that correspond to the rth values; 1-p for the f(w) close to zero is our P, and 1-p for the f(w) close to 1 is our Q in the calculation:

S = (1 + Q - P) / 2

Appendix B: Notes on experimental procedure:

The following notes should suffice if anyone wishes to repeat the experiment:

for file in corpus.*; do echo $file; grep -c "^From " $file; done
 corpus.bad
 2526
 corpus.good
 2499

cat ~/bin/thirds
 #! /bin/sh
 let n=${FILENO}%3
 fname=cgx-$n
 cat >>$fname

FILENO=0 formail -s ~/bin/thirds sp-$run
  formail -s ~/bin/bogobeta -b ns-$run
done

wc -l *
    833 ns-0
    833 ns-1
    833 ns-2
    841 sp-0
    841 sp-1
    841 sp-2
   5022 total
grep -c -v ^1 ns-?
ns-0:829
ns-1:829
ns-2:829

# with cutoff 0.952, chi gave 4 false positives in each case.
# cutoffs for beta will be set to match.

# now in R:
read.table("F/ns-0") -> ns0
read.table("F/ns-1") -> ns1
read.table("F/ns-2") -> ns2
read.table("F/sp-0") -> sp0
read.table("F/sp-1") -> sp1
read.table("F/sp-2") -> sp2

comb <- function (x) {
    y <- if(x[1] == 1) 1 - as.real(x[2]) else as.real(x[2])
    for(i in seq(3,21,by=2)) 
	y <- c(y, if(x[i] == 1) 1 - as.real(x[i+1]) else as.real(x[i+1]))
    y
}

cns0 <- t(apply(ns0,1,comb))
cns1 <- t(apply(ns1,1,comb))
cns2 <- t(apply(ns2,1,comb))
csp0 <- t(apply(sp0,1,comb))
csp1 <- t(apply(sp1,1,comb))
csp2 <- t(apply(sp2,1,comb))

fp <- 4
ts <- apply(cns0,2,sort,decreasing=TRUE)
fp0 <- pmin(ts[fp+1,],0.997)
ts <- apply(cns1,2,sort,decreasing=TRUE)
fp1 <- pmin(ts[fp+1,],0.997)
ts <- apply(cns2,2,sort,decreasing=TRUE)
fp2 <- pmin(ts[fp+1,],0.997)
fp3 <- t(array(c(fp0,fp1,fp2), dim=c(11,3)))
fcutoff <- apply(fp3,2,mean)

fn0 <- c(0,0,0,0,0,0,0,0,0,0,0)
fn1 <- fn0
fn2 <- fn0
for(n in 1:11) fn0[n] <- sum(csp0[,n] <= fcutoff[n])
for(n in 1:11) fn1[n] <- sum(csp1[,n] <= fcutoff[n])
for(n in 1:11) fn2[n] <- sum(csp2[,n] <= fcutoff[n])
fnf <- (t(array(c(fn0,fn1,fn2),dim=c(11,3))))
fn <- c(fnf)       
n <- rep(c(0,1,2),11)
calc <- c(rep("chi",3),paste("r0",c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,
  6,6,6,7,7,7,8,8,8,9,9,9),sep=""),rep("r10",3))
beta <- data.frame(calc=calc,run=n,fneg=fn)
write.table(beta,"beta.tbl")
attach(beta)
beta$calc <- factor(beta$calc)
beta$run <- factor(run)
betaov <- aov(fneg ~ calc + run, data=beta)
d <- c(1.95996, 0.412, 0.423)
rdf <- 20
rms <- deviance(betaov) / rdf
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanfn <- apply(array(beta$fneg,dim=c(3,11)), 2, mean)
xcalc <- c("chi", paste("r0", 1:9, sep=""),"r10")
lcl95 <- meanfn - z       
ucl95 <- meanfn + z
betares <- data.frame(calc=xcalc,
  percent=round(meanfn*100/841,digits=1),
  lcl95=round(lcl95*100/841,digits=1),
  ucl95=round(ucl95*100/841,digits=1))
write.table(betares,"beta.result")
betares$calc = factor(betares$calc)
plot(betares$calc,betares$percent,ylim=c(0,60),
  main="Beta Distribution vs Fisher Chi-Squared",
  xlab="Calculation", ylab="Percent false negatives", type="p")
> lines(betares$calc,betares$ucl95,type="h")
> lines(betares$calc,betares$lcl95,type="h",col="white")
> lines(betares$calc[2:11],betares$percent[2:11])

Greg Louis, 2002, 2003; last modified 2003-04-12]