More recently, Gary suggested using Fisher's method for combining probabilities to perform the S calculation. The idea is theoretically sound: every distinct token in the message contributes, through its f(w) value, an independent estimate of how likely it is that the message is a spam. Fisher noted in the early fifties that if one has a number k of independent estimates of probability, and the null hypothesis (that these k values arose by pure chance) is true, then if you sum the natural logarithms of the k estimates and multiply by -2, the result will be distributed as chi-squared with 2k degrees of freedom. Accordingly, we can use our message's f(w) values to calculate -2 * sum(ln(f(w))), with twice the number of tokens as the degrees of freedom, and apply an inverse chi-squared function to obtain the probability that the message is spam. We can also use the opposite probabilities (1 - f(w)) to calculate a similar probability that the message is nonspam. It's been proposed that both be calculated and that S, the "spamicity," should be the difference between the two. An advantage of using Fisher's method is that when different messages have different numbers of words, or of characteristic words, the difference in degrees of freedom should so affect the calculations as to make "spamicity" values more comparable from message to message. On the other hand, more calculation is needed (computing overhead is greater) in Fisher's method because of the two inverse chi-squared approximations that have to be performed.
I wanted to test whether this approach would produce a further improvment in discrimination, and I also had another question to ask. In the original calculation method, only the most characteristic words (those with f(w) values nearest to 0 or 1) are used to calculate S. In the geometric-mean S calculation, I had tried a high minimum deviation (looking only at f(w) over 0.9 or under 0.1) and none at all (looking at every token), and the latter seemed to work better. I wanted to investigate this systematically.
The nonspam runs with minimum deviation zero produced, on average, 86 false positives from the geometric mean calculation with a spam cutoff of 0.55 (there were many difficult messages among the nonspams, with lots of "spammy" words). This was taken as the standard for the remaining runs. For each of these, the nonspam "spamicity" values were sorted in decreasing order and the mean of the 87th values on the three sorted lists was taken as the cutoff for the corresponding spam run. This produced the following cutoff values, where "g" and "f" stand for geometric-mean and Fisher-based calculation respectively:
mindev g f 1 0.00 0.5507887 0.8639143 2 0.05 0.5608033 0.9263109 3 0.10 0.5826133 0.9529288 4 0.15 0.5852377 0.9199869 5 0.20 0.5897027 0.8601642 6 0.25 0.5950363 0.8005189 7 0.30 0.5938437 0.7276493 8 0.35 0.5895233 0.6632147 9 0.40 0.5841883 0.6157993 10 0.45 0.5806370 0.6097056
For each spam run, the wanted datum was the count of false negatives (spam messages with "spamicity" values below the cutoff). This was obtained with the geometric-mean calculation, and then the runs were repeated with calculation based on Fisher's method as described above (see Appendix A for details). In the following table, the "percent" column gives the percentage of false negatives.
calc mindev run fneg percent 1 g 0 0 151 12.03 2 g 0 1 160 12.75 3 g 0 2 161 12.83 4 g 0.05 0 154 12.27 5 g 0.05 1 154 12.27 6 g 0.05 2 159 12.67 7 g 0.1 0 152 12.11 8 g 0.1 1 155 12.35 9 g 0.1 2 163 12.99 10 g 0.15 0 149 11.87 11 g 0.15 1 155 12.35 12 g 0.15 2 163 12.99 13 g 0.2 0 152 12.11 14 g 0.2 1 157 12.51 15 g 0.2 2 163 12.99 16 g 0.25 0 162 12.91 17 g 0.25 1 164 13.07 18 g 0.25 2 174 13.86 19 g 0.3 0 167 13.31 20 g 0.3 1 169 13.47 21 g 0.3 2 184 14.66 22 g 0.35 0 172 13.71 23 g 0.35 1 181 14.42 24 g 0.35 2 189 15.06 25 g 0.4 0 172 13.71 26 g 0.4 1 184 14.66 27 g 0.4 2 207 16.49 28 g 0.45 0 194 15.46 29 g 0.45 1 212 16.89 30 g 0.45 2 220 17.53 31 f 0 0 168 13.39 32 f 0 1 167 13.31 33 f 0 2 160 12.75 34 f 0.05 0 152 12.11 35 f 0.05 1 160 12.75 36 f 0.05 2 152 12.11 37 f 0.1 0 148 11.79 38 f 0.1 1 159 12.67 39 f 0.1 2 157 12.51 40 f 0.15 0 147 11.71 41 f 0.15 1 157 12.51 42 f 0.15 2 162 12.91 43 f 0.2 0 152 12.11 44 f 0.2 1 155 12.35 45 f 0.2 2 163 12.99 46 f 0.25 0 155 12.35 47 f 0.25 1 158 12.59 48 f 0.25 2 174 13.86 49 f 0.3 0 163 12.99 50 f 0.3 1 168 13.39 51 f 0.3 2 183 14.58 52 f 0.35 0 165 13.15 53 f 0.35 1 177 14.10 54 f 0.35 2 189 15.06 55 f 0.4 0 172 13.71 56 f 0.4 1 187 14.90 57 f 0.4 2 204 16.25 58 f 0.45 0 189 15.06 59 f 0.45 1 204 16.25 60 f 0.45 2 216 17.21
There is no demonstrable effect of calculation type, though minimum deviation does have a statistically significant impact, on the numbers of false negatives:
> gfaov <- aov(pfn ~ calc + mindev + calc * mindev)
> summary(gfaov)
Df Sum Sq Mean Sq F value Pr(>F)
calc 1 0.138 0.138 0.2419 0.6255
mindev 9 98.590 10.954 19.1669 6.957e-12 ***
calc:mindev 9 1.135 0.126 0.2207 0.9897
Residuals 40 22.861 0.572
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Summarizing, with means and 95% confidence limits:
calc mindev meanfn lcl95 ucl95 1 g 0.00 12.54 11.66 13.42 2 g 0.05 12.40 11.52 13.28 3 g 0.10 12.48 11.60 13.36 4 g 0.15 12.40 11.52 13.28 5 g 0.20 12.54 11.66 13.42 6 g 0.25 13.28 12.40 14.16 7 g 0.30 13.81 12.93 14.69 8 g 0.35 14.40 13.52 15.28 9 g 0.40 14.95 14.07 15.83 10 g 0.45 16.63 15.75 17.51 11 f 0.00 13.15 12.27 14.03 12 f 0.05 12.32 11.44 13.20 13 f 0.10 12.32 11.44 13.20 14 f 0.15 12.38 11.50 13.26 15 f 0.20 12.48 11.60 13.36 16 f 0.25 12.93 12.05 13.81 17 f 0.30 13.65 12.77 14.53 18 f 0.35 14.10 13.22 14.98 19 f 0.40 14.95 14.07 15.83 20 f 0.45 16.17 15.29 17.05
Here's a graph of these results. Black points show counts of false negatives from the geometric-mean method of calculation, and red points show counts of false negatives from the Fisher-based method. They're slightly offset on the x axis so they can be seen more clearly. The vertical lines show the 95% confidence limits.

scalefactor = badlist_messagecount / goodlist_messagecount f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
The f(w) value is calculated for each unique token in the message; the number of such tokens is represented below by n.
In the geometric-mean calculation,
P = 1 - exp(sum(ln(1-f(w))) / n) Q = 1 - exp(sum(ln(f(w))) / n) S = (1 + (P-Q) / (P+Q)) / 2
Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2n degrees of freedom:
P = prbx(-2 * sum(ln(1-f(w))), 2*n) Q = prbx(-2 * sum(ln(f(w))), 2*n) S = (1 + Q - P) / 2
cd mail
# 3765 spam in corpus.bad
# 8487 nonspam in corpus.good
cat ~/bin/thirds
#! /bin/sh
let n=${FILENO}%3
fname=cgx-$n
cat >>$fname
FILENO=0 formail -s ~/bin/thirds < corpus.good
rename gx ns cgx*
FILENO=0 formail -s ~/bin/thirds < corpus.bad
rename gx sp cgx*
cd ..
# /usr/bin/bogofilter is standard Robinson with -m option to set min_dev
# bin/bogofilter is Robinson-Fisher with -m option to set min_dev
for run in 0 1 2; do
for md in 0.0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45; do
formail -s /usr/bin/bogofilter -m $md -v < mail/csp-$run &> sp-$md-$run.lm
formail -s /usr/bin/bogofilter -m $md -v < mail/cns-$run &> ns-$md-$run.lm
formail -s bin/bogofilter -m $md -v < mail/csp-$run &> sp-$md-$run.F
formail -s bin/bogofilter -m $md -v < mail/cns-$run &> ns-$md-$run.F
done
done
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns*0.lm >ns0lm.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns*1.lm >ns1lm.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns*2.lm >ns2lm.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp*0.lm >sp0lm.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp*1.lm >sp1lm.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp*2.lm >sp2lm.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns*0.F >ns0f.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns*1.F >ns1f.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t ns*2.F >ns2f.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp*0.F >sp0f.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp*1.F >sp1f.dat
pr -l 9999 -F -w 990 -m -n -J -S" " -t sp*2.F >sp2f.dat
The rest of the data reduction is done in R:
read.table("F/ns0f.dat") -> ns0f
read.table("F/ns0g.dat") -> ns0g
read.table("F/ns1f.dat") -> ns1f
read.table("F/ns1g.dat") -> ns1g
read.table("F/ns2f.dat") -> ns2f
read.table("F/ns2g.dat") -> ns2g
read.table("F/sp0f.dat") -> sp0f
read.table("F/sp0g.dat") -> sp0g
read.table("F/sp1f.dat") -> sp1f
read.table("F/sp1g.dat") -> sp1g
read.table("F/sp2f.dat") -> sp2f
read.table("F/sp2g.dat") -> sp2g
scutoff <- 0.55
fp0 <- sum(ns0g[,1] > scutoff)
fp1 <- sum(ns1g[,1] > scutoff)
fp2 <- sum(ns2g[,1] > scutoff)
fp <- mean(fp0,fp1,fp2)
ts <- apply(ns0g,2,sort,decreasing=TRUE)
fp0 <- ts[fp+1,]
ts <- apply(ns1g,2,sort,decreasing=TRUE)
fp1 <- ts[fp+1,]
ts <- apply(ns2g,2,sort,decreasing=TRUE)
fp2 <- ts[fp+1,]
fp3 <- t(array(c(fp0,fp1,fp2), dim=c(10,3)))
gcutoff <- apply(fp3,2,mean)
n <- 1; fn0g <- c(0,0,0,0,0,0,0,0,0,0)
while(n<11) { fn0g[n] <- sum(sp0g[n] <= gcutoff[n]) ; n <- n+1 }
n <- 1; fn1g <- c(0,0,0,0,0,0,0,0,0,0)
while(n<11) { fn1g[n] <- sum(sp1g[n] <= gcutoff[n]) ; n <- n+1 }
n <- 1; fn2g <- c(0,0,0,0,0,0,0,0,0,0)
while(n<11) { fn2g[n] <- sum(sp2g[n] <= gcutoff[n]) ; n <- n+1 }
ts <- apply(ns0f,2,sort,decreasing=TRUE)
fp0 <- ts[fp+1,]
ts <- apply(ns1f,2,sort,decreasing=TRUE)
fp1 <- ts[fp+1,]
ts <- apply(ns2f,2,sort,decreasing=TRUE)
fp2 <- ts[fp+1,]
fp3 <- t(array(c(fp0,fp1,fp2), dim=c(10,3)))
fcutoff <- apply(fp3,2,mean)
n <- 1; fn0f <- c(0,0,0,0,0,0,0,0,0,0)
while(n<11) { fn0f[n] <- sum(sp0f[n] <= fcutoff[n]) ; n <- n+1 }
n <- 1; fn1f <- c(0,0,0,0,0,0,0,0,0,0)
while(n<11) { fn1f[n] <- sum(sp1f[n] <= fcutoff[n]) ; n <- n+1 }
n <- 1; fn2f <- c(0,0,0,0,0,0,0,0,0,0)
while(n<11) { fn2f[n] <- sum(sp2f[n] <= fcutoff[n]) ; n <- n+1 }
md <- c(0,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45)
cutoff <- data.frame(mindev=md, g=gcutoff, f=fcutoff)
xmd <- c(t(array(c(md,md,md),dim=c(10,3))))
n <- rep(c(0,1,2),10)
calc <- c(rep("g",30),rep("f",30))
fng <- c(t(array(c(fn0g,fn1g,fn2g),dim=c(10,3))))
fnf <- c(t(array(c(fn0f,fn1f,fn2f),dim=c(10,3))))
gf <- data.frame(calc=calc, mindev=c(xmd,xmd), run=c(n,n),
fneg=c(fng,fnf))
gf$percent <- round(gf$fneg * 100 / length(sp0f[,1]),digits=2)
write.table(gf,"robfish.tbl")
attach(gf)
gf$calc <- factor(gf$calc)
gf$run <- factor(run)
gf$mindev <- factor(mindev)
gfaov <- aov(percent ~ calc + mindev + calc * mindev, data=gf)
summary(gfaov)
d <- c(1.95996, 0.412, 0.423)
rdf <- 40
rms <- deviance(gfaov) / rdf
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanfn <- round(apply(array(gf$percent,dim=c(3,20)), 2, mean),
digits=2)
xcalc <- c(rep("g",10), rep("f",10))
gfres <- data.frame(calc=xcalc, mindev=c(md,md), meanfn=meanfn,
lcl95=lcl, ucl95=ucl)
write.table(gfres,"robfish.result")
X11(width=5,height=5)
plot(gfres$mindev[1:10]-0.002, gfres$meanfn[1:10], ylim=c(0,18),
main="Geometric Mean vs Fisher Chi-Squared",
xlab="Minimum deviation", ylab="Percent error")
points(gfres$mindev[11:20]+0.002, gfres$meanfn[11:20], col="red")
lines(gfres$mindev[1:10]-0.002, gfres$ucl[1:10], type="h")
lines(gfres$mindev[1:10]-0.002, gfres$lcl[1:10],
type="h",col="white")
lines(gfres$mindev[11:20]+0.002, gfres$ucl[11:20], type="h",
col="red")
lines(gfres$mindev[11:20]+0.002, gfres$lcl[11:20], type="h",
col="white")
[© Greg Louis, 2002, 2003; last modified 2003-04-12]