Bogofilter Calculations: Comparing Naïve Bayes with Fisher's Method for Combining Probabilities

Introduction and general description:

The original version of bogofilter uses the computation method presented in Paul Graham's paper A Plan for Spam.  Gary Robinson took an interest in Graham's paper, and wrote an insightful commentary in which he presented several untested suggestions for improvements to Graham's method.  I was intrigued, and modified bogofilter 0.7 (and subsequently 0.7.4 and 0.7.5) to try them out.  They worked; I ran a comparison that's one of several indicating that implementing Robinson's f(w) and geometric-mean S calculations leads to improved discrimination between spam and nonspam.

More recently, Gary suggested using Fisher's method for combining probabilities to perform the S calculation.  A brief overview of the theory, and results of an experiment to compare Fisher's method with the geometric-mean calculation, are also on this site.

Scott Lenser has modified an early version of bogofilter to apply naïve Bayesian learning.  In the present experiment, we compare that method with Robinson-Fisher.

Procedure and Results:

In the tables, chi is used to designate the Robinson-Fisher results (since a chi-squared combination of probabilities is involved), and srl the naïve Bayesian data.  The "proc" and "lost" columns document a bug in bogofilter_srl that causes some few messages to be lost, in the sense that no outcome report is issued for those messages. For this experiment, both programs were trained with 2525 spam and 2499 nonspam, and then run against three files of nonspams containing 2829, 2829 and 2828 messages, and three files of spam containing 1267 messages each.
# wc -l *[il]
   2829 ns-0-chi
   2769 ns-0-srl
   2829 ns-1-chi
   2777 ns-1-srl
   2828 ns-2-chi
   2774 ns-2-srl
   1267 sp-0-chi
   1266 sp-0-srl
   1267 sp-1-chi
   1266 sp-1-srl
   1267 sp-2-chi
   1265 sp-2-srl
  24404 total

# grep -c "^1" *[il]
ns-0-chi:390
ns-0-srl:2334
ns-1-chi:360
ns-1-srl:2358
ns-2-chi:377
ns-2-srl:2346
sp-0-chi:1231
sp-0-srl:160
sp-1-chi:1146
sp-1-srl:153
sp-2-chi:1151
sp-2-srl:145

The comparison is easier to interpret if we adjust the spam cutoff so the false-positive counts are comparable.  Since I wasn't sure how to do that in the srl calculation, and I can adjust the threshold for the chi results without rerunning bogofilter, I set the chi spam_cutoff to give counts matching the weighted mean of the srl counts.  When that was done, false-negative counts look like this:

  calc run proc lost  fn percent
1  chi   1 1267    0  31   2.447
2  chi   2 1267    0  90   7.103
3  chi   3 1267    0  91   7.182
4  srl   1 1266    1 160  12.638
5  srl   2 1266    1 153  12.085
6  srl   3 1265    2 145  11.462

  calc meanfnpc lcl95  ucl95
1  chi    5.577 2.433  8.722
2  srl   12.062 8.917 15.207

The difference is statistically significant at the 0.05 level.

Conclusion:

The Fisher-based method of calculation (labelled "chi" in the tables above) seems to offer a slight advantage over this implementation of the naïve Bayesian learning method in terms of discrimination power.  It also runs far faster (data not shown).

Appendix A: Notes on experimental procedure:

The following notes should suffice if anyone wishes to repeat the experiment:

In R:

data.frame(calc=c(rep("chi",3),rep("srl",3)),
  proc=c(2829,2829,2828,2769,2777,2774),
  lost=c(0,0,0,2829-2769,2829-2777,2828-2774)) -> ns

data.frame(calc=c(rep("chi",3),rep("srl",3)),
  proc=c(1267,1267,1267,1266,1266,1265),
  lost=c(0,0,0,1,1,2)) -> sp

ns$fp=c(390,360,337,2769-2334,2777-2358,2774-2346)
sp$fn=c(1267-1231,1267-1146,1267-1151,160,153,145)

> ns
  calc proc lost  fp
1  chi 2829    0 390
2  chi 2829    0 360
3  chi 2828    0 337
4  srl 2769   60 435
5  srl 2777   52 419
6  srl 2774   54 428

  calc proc lost  fn
1  chi 1267    0  36
2  chi 1267    0 121
3  chi 1267    0 116
4  srl 1266    1 160
5  srl 1266    1 153
6  srl 1265    2 145

Adjusting the spam cutoff

sum(c(435*2769,419*2777,428*2774))/(2769+2777+2774)
[1] 427.3257

read.table("F/ns-0-chi") -> cns0
read.table("F/ns-1-chi") -> cns1
read.table("F/ns-2-chi") -> cns2

comb <- function(x) {
  if(x[1] == 1) 1.0 - as.real(x[2]) else as.real(x[2])
}

apply(cns0[,1:2],1,comb) -> ccns0
apply(cns1[,1:2],1,comb) -> ccns1
apply(cns2[,1:2],1,comb) -> ccns2

mean(c(sort(ccns0,decreasing=TRUE)[428],
  sort(ccns1,decreasing=TRUE)[428],
  sort(ccns2,decreasing=TRUE)[428])) -> cutoff

ns$fp[1] <- sum(ccns0 > cutoff)
ns$fp[2] <- sum(ccns1 > cutoff)
ns$fp[3] <- sum(ccns2 > cutoff)

ns
  calc proc lost  fp
1  chi 2829    0 434
2  chi 2829    0 421
3  chi 2828    0 438
4  srl 2769   60 435
5  srl 2777   52 419
6  srl 2774   54 428

Now compare false-negative counts

read.table("F/sp-0-chi") -> csp0
read.table("F/sp-1-chi") -> csp1
read.table("F/sp-2-chi") -> csp2

apply(csp0[,1:2],1,comb) -> ccsp0
apply(csp1[,1:2],1,comb) -> ccsp1
apply(csp2[,1:2],1,comb) -> ccsp2

sp$fn[1] <- sum(ccsp0 <= cutoff)
sp$fn[2] <- sum(ccsp1 <= cutoff)
sp$fn[3] <- sum(ccsp2 <= cutoff)

sp$percent <- sp$fn * 100 / sp$proc

data.frame(calc=sp$calc, run=c(1,2,3,1,2,3), proc=sp$proc,
  lost=sp$lost,fn=sp$fn,percent=sp$percent) -> sp

print(sp,digits=4)
  calc run proc lost  fn percent
1  chi   1 1267    0  31   2.447
2  chi   2 1267    0  90   7.103
3  chi   3 1267    0  91   7.182
4  srl   1 1266    1 160  12.638
5  srl   2 1266    1 153  12.085
6  srl   3 1265    2 145  11.462

> summary(aov(percent ~ calc + run))
            Df Sum Sq Mean Sq F value  Pr(>F)  
calc         1 63.073  63.073 15.4724 0.02926 *
run          1  3.168   3.168  0.7772 0.44293  
Residuals    3 12.230   4.077                  

Since run has no significant effect, we can treat it as replication.

aov(percent ~ calc) -> srlaov
d <- c(1.95996, 0.412, 0.423)
rdf <- 4
rms <- deviance(srlaov) / rdf
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanfn <- apply(array(sp$percent, dim=c(3,2)), 2, mean)
lcl95 <- meanfn - z
ucl95 <- meanfn + z
srlres <- data.frame(calc=c("chi","srl"), meanfnpc=meanfn,         
  lcl95=lcl95,ucl95=ucl95)

print(srlres,digits=4)
  calc meanfnpc lcl95  ucl95
1  chi    5.577 2.433  8.722
2  srl   12.062 8.917 15.207

Greg Louis, 2002; last modified 2002-12-01]