More recently, Gary suggested using Fisher's method for combining probabilities to perform the S calculation. A brief overview of the theory, and results of an experiment to compare Fisher's method with the geometric-mean calculation, are also on this site.
More recently still, the question was raised whether yet another means of selecting and combining key (that is, highly characteristic) probabilities would offer further improvement. The idea is that we should, for some relatively small number r, determine the rth-closest f(w) value to 0 and the rth-closest f(w) value to 1; these will present beta distributions with parameters r and n-r+1, where n is the number of f(w) values from among which we chose. It seemed possible that the selected values, plus the fact that there must be r-1 other f(w) values more extreme than those chosen, would constitute adequate information on which to base a decision. This idea too was presented by Gary Robinson, and the object of the experiment to be reported here was to try it out.
The nonspam runs with Fisher's method each produced 4 false positives out of 833 nonspams. This was taken as the standard for the remaining runs. For each of these, the nonspam "spamicity" values were sorted in decreasing order and the mean of the 5th values on the three sorted lists was taken as the cutoff for the corresponding spam run. This produced the following cutoff values:
calc fcutoff 1 chi 0.9433333 2 r01 0.9940433 3 r02 0.9600000 4 r03 0.9026667 5 r04 0.9221667 6 r05 0.9350000 7 r06 0.8591000 8 r07 0.8936667 9 r08 0.8903333 10 r09 0.9172667 11 r10 0.9095000
For each spam run, the wanted datum was the count of false negatives (spam messages with "spamicity" values below the cutoff). In the following table, the "percent" column gives the percentage of false negatives; see Appendix A for details.
calc run fneg 1 chi 0 89 2 chi 1 92 3 chi 2 100 4 r01 0 430 5 r01 1 429 6 r01 2 437 7 r02 0 262 8 r02 1 281 9 r02 2 283 10 r03 0 210 11 r03 1 236 12 r03 2 239 13 r04 0 194 14 r04 1 221 15 r04 2 225 16 r05 0 198 17 r05 1 217 18 r05 2 223 19 r06 0 173 20 r06 1 191 21 r06 2 194 22 r07 0 180 23 r07 1 198 24 r07 2 209 25 r08 0 174 26 r08 1 192 27 r08 2 208 28 r09 0 182 29 r09 1 200 30 r09 2 222 31 r10 0 185 32 r10 1 206 33 r10 2 222
Summarizing, with means and 95% confidence limits:
calc percent lcl95 ucl95 1 chi 11.1 10.3 12.0 2 r01 51.4 50.5 52.2 3 r02 32.7 31.9 33.6 4 r03 27.2 26.3 28.0 5 r04 25.4 24.5 26.2 6 r05 25.3 24.4 26.2 7 r06 22.1 21.2 23.0 8 r07 23.3 22.4 24.1 9 r08 22.8 21.9 23.6 10 r09 23.9 23.1 24.8 11 r10 24.3 23.4 25.2
Here's a graph of these results. The vertical lines show the 95% confidence limits.

scalefactor = badlist_messagecount / goodlist_messagecount f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
The f(w) value is calculated for each unique token in the message; the number of such tokens is represented below by n.
Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2n degrees of freedom:
P = prbx(-2 * sum(ln(1-f(w))), 2*n) Q = prbx(-2 * sum(ln(f(w))), 2*n) S = (1 + Q - P) / 2
With the beta distribution, we choose a value of r and rank the r f(w) values closest to 0 and the r f(w) values closest to 1. We then obtain, with a beta-distribution function, the values of p that correspond to the rth values; 1-p for the f(w) close to zero is our P, and 1-p for the f(w) close to 1 is our Q in the calculation:
S = (1 + Q - P) / 2
for file in corpus.*; do echo $file; grep -c "^From " $file; done
corpus.bad
2526
corpus.good
2499
cat ~/bin/thirds
#! /bin/sh
let n=${FILENO}%3
fname=cgx-$n
cat >>$fname
FILENO=0 formail -s ~/bin/thirds sp-$run
formail -s ~/bin/bogobeta -b ns-$run
done
wc -l *
833 ns-0
833 ns-1
833 ns-2
841 sp-0
841 sp-1
841 sp-2
5022 total
grep -c -v ^1 ns-?
ns-0:829
ns-1:829
ns-2:829
# with cutoff 0.952, chi gave 4 false positives in each case.
# cutoffs for beta will be set to match.
# now in R:
read.table("F/ns-0") -> ns0
read.table("F/ns-1") -> ns1
read.table("F/ns-2") -> ns2
read.table("F/sp-0") -> sp0
read.table("F/sp-1") -> sp1
read.table("F/sp-2") -> sp2
comb <- function (x) {
y <- if(x[1] == 1) 1 - as.real(x[2]) else as.real(x[2])
for(i in seq(3,21,by=2))
y <- c(y, if(x[i] == 1) 1 - as.real(x[i+1]) else as.real(x[i+1]))
y
}
cns0 <- t(apply(ns0,1,comb))
cns1 <- t(apply(ns1,1,comb))
cns2 <- t(apply(ns2,1,comb))
csp0 <- t(apply(sp0,1,comb))
csp1 <- t(apply(sp1,1,comb))
csp2 <- t(apply(sp2,1,comb))
fp <- 4
ts <- apply(cns0,2,sort,decreasing=TRUE)
fp0 <- pmin(ts[fp+1,],0.997)
ts <- apply(cns1,2,sort,decreasing=TRUE)
fp1 <- pmin(ts[fp+1,],0.997)
ts <- apply(cns2,2,sort,decreasing=TRUE)
fp2 <- pmin(ts[fp+1,],0.997)
fp3 <- t(array(c(fp0,fp1,fp2), dim=c(11,3)))
fcutoff <- apply(fp3,2,mean)
fn0 <- c(0,0,0,0,0,0,0,0,0,0,0)
fn1 <- fn0
fn2 <- fn0
for(n in 1:11) fn0[n] <- sum(csp0[,n] <= fcutoff[n])
for(n in 1:11) fn1[n] <- sum(csp1[,n] <= fcutoff[n])
for(n in 1:11) fn2[n] <- sum(csp2[,n] <= fcutoff[n])
fnf <- (t(array(c(fn0,fn1,fn2),dim=c(11,3))))
fn <- c(fnf)
n <- rep(c(0,1,2),11)
calc <- c(rep("chi",3),paste("r0",c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,
6,6,6,7,7,7,8,8,8,9,9,9),sep=""),rep("r10",3))
beta <- data.frame(calc=calc,run=n,fneg=fn)
write.table(beta,"beta.tbl")
attach(beta)
beta$calc <- factor(beta$calc)
beta$run <- factor(run)
betaov <- aov(fneg ~ calc + run, data=beta)
d <- c(1.95996, 0.412, 0.423)
rdf <- 20
rms <- deviance(betaov) / rdf
z <- (d[1] + 1 / (rdf * d[2] - d[3])) * sqrt(rms/3)
meanfn <- apply(array(beta$fneg,dim=c(3,11)), 2, mean)
xcalc <- c("chi", paste("r0", 1:9, sep=""),"r10")
lcl95 <- meanfn - z
ucl95 <- meanfn + z
betares <- data.frame(calc=xcalc,
percent=round(meanfn*100/841,digits=1),
lcl95=round(lcl95*100/841,digits=1),
ucl95=round(ucl95*100/841,digits=1))
write.table(betares,"beta.result")
betares$calc = factor(betares$calc)
plot(betares$calc,betares$percent,ylim=c(0,60),
main="Beta Distribution vs Fisher Chi-Squared",
xlab="Calculation", ylab="Percent false negatives", type="p")
> lines(betares$calc,betares$ucl95,type="h")
> lines(betares$calc,betares$lcl95,type="h",col="white")
> lines(betares$calc[2:11],betares$percent[2:11])
[© Greg Louis, 2002, 2003; last modified 2003-04-12]