The purpose of this experiment is to determine whether subject tagging actually helps discrimination between spam and nonspam messages. It was anticipated that there may be some interaction with the minimum deviation (threshold below which a token may be ignored as having a probability too close to 0.5).
The experimental design is factorial: tagging (two levels, without and with) and min_dev (five levels, 0.025 to 0.125 in steps of 0.025). For this experiment I used a training corpus of 19,823 nonspams and 21,969 spam emails. Two test runs were performed, each with 3,204 nonspams and 5,948 spams. For each of the ten experimental configurations, the nonspams were first classified to determine a spam-cutoff level that would give four false positives (nonspams classified as spam). That cutoff was then used to determine the number of false negatives (spams classified as nonspam or uncertain) for the given configuration in each of the two runs. The results appear in the following table and in the graph below.
tagging mindev cutoff run fp fn 1 notag 0.025 0.99169 0 4 241 2 notag 0.025 0.99169 1 4 266 3 notag 0.05 0.99109 0 4 246 4 notag 0.05 0.99109 1 4 262 5 notag 0.075 0.98480 0 4 225 6 notag 0.075 0.98480 1 4 241 7 notag 0.1 0.99444 0 4 255 8 notag 0.1 0.99444 1 4 284 9 notag 0.125 0.99643 0 4 284 10 notag 0.125 0.99643 1 4 304 11 tag 0.025 0.98510 0 4 216 12 tag 0.025 0.98510 1 4 222 13 tag 0.05 0.98150 0 4 214 14 tag 0.05 0.98150 1 4 214 15 tag 0.075 0.98750 0 4 224 16 tag 0.075 0.98750 1 4 234 17 tag 0.1 0.99717 0 4 271 18 tag 0.1 0.99717 1 4 296 19 tag 0.125 0.99765 0 4 292 20 tag 0.125 0.99765 1 4 309
Df Sum Sq Mean Sq F value Pr(>F)
tagging 1 672.8 672.8 16.401 0.0028858 **
mindev 4 14463.5 3615.9 88.144 3.263e-07 ***
run 1 1344.8 1344.8 32.782 0.0002849 ***
tagging:mindev 4 2371.7 592.9 14.454 0.0005913 ***
Residuals 9 369.2 41.0

(badcount/badlist_messagecount)
p(w) = -----------------------------------------------------------------
(badcount/badlist_messagecount + goodcount/goodlist_messagecount)
n = badcount + goodcount
f(w) = (s * x + n * p(w)) / (s + n)
scalefactor = badlist_messagecount / goodlist_messagecount f(w) = (s * x + badcount) / (s + badcount + goodcount * scalefactor)
Fisher's method uses an inverse chi-squared function, prbx, to get the probability associated with -2 times the sum of the logs of f(w) with 2N degrees of freedom:
P = prbx(-2 * sum(ln(1-f(w))), 2*N) Q = prbx(-2 * sum(ln(f(w))), 2*N) S = (1 + Q - P) / 2
The experimental design is factorial: tagging (two levels, without and
with) and min_dev (five levels, 0.025 to 0.125 in steps of 0.025).
The following scripts were used to distribute messages into test and
training files:
# mkdb
echo -n "distributing spam... "
bzip2 -dc ../mail/corpus*.bad.bz2 | FILENO=0 formail -s ./distrib sp
echo done
echo -n "training on uncertain spam... "
bzip2 -dc ../mail/corpus*.sptrain.bz2 >>t.sp
echo done
echo -n "distributing nonspam... "
bzip2 -dc ../mail/corpus0*.good.bz2 | FILENO=0 formail -s ./distrib ns
echo done
echo -n "training on additional and uncertain nonspam... "
bzip2 -dc ../mail/corpus1223.good.bz2 ../mail/corpus*.nstrain.bz2 >>t.ns
echo done
echo building nontagged wordlists
./bogonotag -v -s -d dbnotag < t.sp
./bogonotag -v -n -d dbnotag < t.ns
bogoutil -w dbnotag .MSG_COUNT
echo building tagged wordlists
./bogotag -v -s -d dbtag < t.sp
./bogotag -v -n -d dbtag < t.ns
bogoutil -w dbtag .MSG_COUNT
echo all done
# distrib
FILE=(t.$1 r0.$1 t.$1 r1.$1 t.$1)
let n=${FILENO}%5
fname=${FILE[$n]}
cat >>$fname
bogotag and bogonotag are bogofilter, compiled from cvs checked out on
2003-02-13 at 1504, with modifications as per patch file patch-bogl0213
(available on request). The default for the tag option in the two
binaries is on and off respectively.
The corpora and training database had the following properties:
$ bogoutil -w dbnotag .MSG_COUNT
spam good
.MSG_COUNT 21969 19823
$ bogoutil -w dbtag .MSG_COUNT
spam good
.MSG_COUNT 21969 19823
$ grep -c '^From ' *.??
r0.ns:3204
r0.sp:5948
r1.ns:3204
r1.sp:5948
t.ns:19823
t.sp:21969
The test runs were conducted according to the following script:
#! /bin/bash
# runex for studying effects of subject tagging and mindev
function getco () {
mopt=$1; bogo=$2; shift 2
res=`cat $* | formail -s ./bogo$bogo -d db$bogo -m $mopt -v 2>&1 | \
perl -e ' $target = 10; while (<>) { ' \
-e ' ($i, $d) = split; push @diffs, $d unless $i != 1; }' \
-e ' die "dainbramage" unless scalar @diffs > 15;' \
-e ' @s = sort { $a <=> $b } (@diffs); $co = $s[$target];' \
-e ' while($co < 0.000001) { ++$target; $co = $s[$target]; }' \
-e ' printf("%8.6f %d",1.0-$s[$target],$target-1);'`
}
function wrapper () {
mopt=$1; bogo=$2; shift 2; v=-v
res=`cat $* | formail -s ./bogo$bogo -d db$bogo -m $mopt -v 2>&1 | \
grep -c $v '^1'`
}
for mode in notag tag; do
for md in 0.025 0.05 0.075 0.1 0.125; do
echo -n "$mode $md fpos... "
getco $md,0.1 $mode r0.ns r1.ns
fpos=${res##* }
co=${res%% *};
let fpos=$fpos/2
echo -n "$fpos at cutoff $co, run0... "
run=0; wrapper $md,$co $mode r0.sp
fneg=$res
echo "$mode $md $co $run $fpos $fneg" >>parms.tbl
echo -n "$fneg, run1... "
run=1; wrapper $md,$co $mode r1.sp
fneg=$res
echo "$mode $md $co $run $fpos $fneg" >>parms.tbl
echo $fneg
done
done
Running this script produced the following data in parms.tbl
(with row and column labels added manually):
tagging mindev cutoff run fp fn
1 notag 0.025 0.991690 0 4 241
2 notag 0.025 0.991690 1 4 266
3 notag 0.05 0.991090 0 4 246
4 notag 0.05 0.991090 1 4 262
5 notag 0.075 0.984800 0 4 225
6 notag 0.075 0.984800 1 4 241
7 notag 0.1 0.994440 0 4 255
8 notag 0.1 0.994440 1 4 284
9 notag 0.125 0.996430 0 4 284
10 notag 0.125 0.996430 1 4 304
11 tag 0.025 0.985100 0 4 216
12 tag 0.025 0.985100 1 4 222
13 tag 0.05 0.981500 0 4 214
14 tag 0.05 0.981500 1 4 214
15 tag 0.075 0.987500 0 4 224
16 tag 0.075 0.987500 1 4 234
17 tag 0.1 0.997170 0 4 271
18 tag 0.1 0.997170 1 4 296
19 tag 0.125 0.997650 0 4 292
20 tag 0.125 0.997650 1 4 309
This was read into R and manipulated as follows:
read.table("subjtag.tbl") -> bogo
bogo$tagging <- factor(bogo$tagging)
bogo$mindev <- factor(bogo$mindev)
bogo$run <- factor(bogo$run)
attach(bogo)
bogaov <- aov(fn ~ tagging + mindev + tagging * mindev + run)
summary(bogaov)
Df Sum Sq Mean Sq F value Pr(>F)
tagging 1 672.8 672.8 16.401 0.0028858 **
mindev 4 14463.5 3615.9 88.144 3.263e-07 ***
run 1 1344.8 1344.8 32.782 0.0002849 ***
tagging:mindev 4 2371.7 592.9 14.454 0.0005913 ***
Residuals 9 369.2 41.0
Since the second run had significantly more false negatives than the
first, the runs are plotted separately instead of being treated as
replicates.
fn0n <- fn[run==0][1:5]
fn1n <- fn[run==1][1:5]
fn0t <- fn[run==0][6:10]
fn1t <- fn[run==1][6:10]
as.real(levels(mindev)) -> md
X11(width=3.5, height=3.5)
plot(md, fn0t/59.48, ylim=c(3.5,5.3), xlim=c(0.02,0.13),
main="Subject tagging effect with varying min_dev", axes=FALSE,
xlab="min_dev", ylab="Percent false negatives", type="p", col="blue")
axis(1, at=c(0.025, 0.05, 0.075, 0.1, 0.125))
axis(2)
box()
lines(md, fn0t/59.48, col="blue")
points(md,fn0n/59.48, col="red")
lines(md,fn0n/59.48, col="red")
text(0.02, 5.2, labels="Untagged", col="red", pos=4)
text(0.02, 5.1, labels="Tagged", col="blue", pos=4)
points(md,fn1n/59.48, col="red")
lines(md,fn1n/59.48, col="red", lty="dashed")
points(md,fn1t/59.48, col="blue")
lines(md,fn1t/59.48, col="blue", lty="dashed")
[© Greg Louis, 2003; last modified 2003-04-12]