Gary Robinson took an interest in Graham's paper and wrote an insightful commentary suggesting several possible improvements to Graham's calculation method. I had been following bogofilter since shortly after esr released it, and it seemed worthwhile to evaluate Robinson's suggestions, so I modified bogofilter to try them out. I won't go into detail here; in summary, I found that two of Gary's three suggestions significantly improved bogofilter's discrimination capability. Accordingly, I finalized a Robinson variant of esr's bogofilter 0.7 with the S and f(w) calculations, and began using it to divert spam from my mailbox. When bogofilter-0.7.4 appeared, I ported my mods to that and submitted a patch to the bogofilter mailing list. David Relson was interested, and his tests seemed to confirm my conclusion that Robinson's calculations gave better results.
Since then, I've been involved with bogofilter in three ways. I've been trying out various suggestions Gary has made -- he's working on reducing the ways in which our classification calculations violate the assumptions on which Bayesian classification depends. Where these have been successful, I've been passing my bogofilter modifications along to the project, so that the improvements can be incorporated into mainstream bogofilter. I'm also maintaining a production version of bogofilter with a lot of options hard-coded, which I use to filter email both for myself and for my employer's computer users. It's my hope that this branch will someday wither away as mainstream bogofilter matures.
20020918
Graham, without or with a minimum deviation of 0.4, vs Robinson
msgcount_good = 5106
msgcount_bad = 1773
test=training
                 Graham_original  Graham_mindev  Robinson
false negative:  6.65%            3.0%           2.1%
false positive:  0.01%            0.01%          0
20020919
Graham vs Robinson as above
msgcount_good = 5106
msgcount_bad = 1773
test_msgcount_good = 144
test_msgcount_bad = 111
                 Graham_original  Graham_mindev  Robinson
false negative:  8.1%             6.3%           0.9%
false positive:  0.7%             0.7%           0.7%
NOTE: the one false positive detected by all 3 methods was
found to be due to a training artefact; in this test,
all three values should have been zero.
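The percentages in these tables can be read as simple ratios over the test set. Here is a minimal Python sketch, assuming that the false-negative rate is the share of spam test messages classified as nonspam and the false-positive rate is the share of nonspam test messages classified as spam. `error_rates` is a hypothetical helper, not part of bogofilter. The single false positive is taken from the note above; the single missed spam is inferred from 0.9% of 111.

```python
def error_rates(missed_spam, n_spam, flagged_good, n_good):
    """Return (false-negative %, false-positive %) for one test run."""
    false_negative = 100.0 * missed_spam / n_spam
    false_positive = 100.0 * flagged_good / n_good
    return false_negative, false_positive

# 20020919 run, Robinson column: 111 spam and 144 nonspam test messages.
# Counts are inferred from the reported percentages (assumption).
fn, fp = error_rates(missed_spam=1, n_spam=111, flagged_good=1, n_good=144)
print(f"false negative: {fn:.1f}%  false positive: {fp:.1f}%")
```

With test sets this small, each misclassified message moves the rate by almost a full percentage point, which is worth keeping in mind when comparing columns.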
20020820
Graham vs Robinson (f(w)), as above, plus Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test = training
                 Graham_original  Graham_mindev  Robinson  Robinson_g(w)
false negative:  7.7%             6.55%          3.4%      3.6%
false positive:  0.6%             0.6%           0.19%     0.19%
20020822
Robinson with f(w) vs Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test_msgcount_good = 120
test_msgcount_bad = 86
                 Robinson  Robinson_g(w)
false negative:  2.3%      2.3%
false positive:  0         0
20020822
Robinson with f(w) vs Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test_msgcount_good = 428
test_msgcount_bad = 382
                 Robinson  Robinson_g(w)
false negative:  2.1%      1.8%
false positive:  0         0
Users of either of Robinson's methods have access to the option -R, which tells bogofilter to print a data frame suitable for use with the mathematics package R. The data frame contains information about each word processed by bogofilter, as well as the calculation results, so it's quite useful in investigating misclassifications. Here's an example:
   Token                   pgood     pbad       fw invfwlog     fwlog U
1  glouis               0.986498 0.328654 0.249898 -0.28755  -1.38670 +
2  dynamicro.on.ca      0.233706 0.272581 0.538393 -0.77304  -0.61917 -
3  tue                  0.217626 0.214905 0.496854 -0.68687  -0.69946 -
4  dec                  0.064441 0.192363 0.749066 -1.38257  -0.28893 +
5  from                 1.000000 1.000000 0.500000 -0.69315  -0.69315 -
6  greg                 0.647478 0.005455 0.008354 -0.00839  -4.78497 +
7  louis                0.398061 0.006899 0.017036 -0.01718  -4.07244 +
8  gary                 0.034246 0.001845 0.051122 -0.05247  -2.97355 +
9  robinson             0.016693 0.000802 0.045851 -0.04694  -3.08236 +
10 grobinson            0.006137 0.000080 0.012902 -0.01299  -4.35036 +
11 transpose.com        0.006137 0.000080 0.012902 -0.01299  -4.35036 +
12 subject              0.999386 0.999037 0.499913 -0.69297  -0.69332 -
13 testing              0.056585 0.013076 0.187703 -0.20789  -1.67289 +
14 farther              0.000982 0.000722 0.423707 -0.55114  -0.85871 -
15 reply-to             0.195164 0.522782 0.728164 -1.30255  -0.31723 +
16 mime-version         0.922671 0.838441 0.476086 -0.64643  -0.74216 -
17 content-type         0.927090 0.946414 0.505157 -0.70352  -0.68289 -
18 text                 0.931018 0.947457 0.504376 -0.70194  -0.68443 -
19 plain                0.913097 0.579737 0.388347 -0.49159  -0.94586 +
20 iso-8859-1           0.261200 0.459811 0.637730 -1.01537  -0.44984 +
21 content-disposition  0.157850 0.148965 0.485522 -0.66460  -0.72253 -
22 inline               0.103719 0.121450 0.539373 -0.77517  -0.61735 -
23 organization         0.094882 0.026552 0.218656 -0.24674  -1.52026 +
24 dynamicro            0.011661 0.000562 0.045943 -0.04703  -3.08035 +
25 consulting           0.015711 0.008102 0.340231 -0.41587  -1.07813 +
26 limited              0.175402 0.078774 0.309920 -0.37095  -1.17144 +
27 x-confirm-reading-to 0.057567 0.000000 0.000000 -0.00000 -23.50485 +
28 status               0.967350 0.051500 0.050547 -0.05187  -2.98485 +
29 content-length       0.693384 0.464945 0.401393 -0.51315  -0.91282 -
30 lines                0.702590 0.468956 0.400288 -0.51131  -0.91557 -
31 again                0.152449 0.076047 0.332815 -0.40469  -1.10017 +
32 i'm                  0.128391 0.037542 0.226249 -0.25650  -1.48612 +
33 trying               0.054130 0.020055 0.270332 -0.31517  -1.30810 +
34 the                  0.868295 0.763035 0.467738 -0.63062  -0.75985 -
35 experiment           0.010556 0.001444 0.120328 -0.12821  -2.11754 +
36 with                 0.999755 1.000000 0.500061 -0.69327  -0.69302 -
37 smaller              0.011170 0.003530 0.240119 -0.27459  -1.42662 +
38 corpus               0.002087 0.000160 0.071397 -0.07407  -2.63950 +
39 that                 0.591874 0.407829 0.407950 -0.52416  -0.89661 -
40 consists             0.003191 0.002166 0.404292 -0.51800  -0.90562 -
41 spams                0.003069 0.001364 0.307674 -0.36770  -1.17872 +
42 and                  0.710446 0.709450 0.499649 -0.69245  -0.69385 -
43 nonspams             0.002332 0.000000 0.000000 -0.00000 -20.29869 +
44 carefully            0.017798 0.017086 0.489802 -0.67296  -0.71375 -
45 minimize             0.003560 0.001364 0.276991 -0.32433  -1.28377 +
46 duplication          0.005033 0.000722 0.125461 -0.13406  -2.07576 +
47 oddities             0.001227 0.000000 0.000000 -0.00000 -19.65683 +
48 like                 0.203633 0.272902 0.572680 -0.85022  -0.55743 -
49 uuencode             0.000614 0.000160 0.207240 -0.23223  -1.57388 +
50 will                 0.425310 0.377988 0.470545 -0.63591  -0.75386 -
51 report               0.135387 0.080780 0.373691 -0.46791  -0.98433 +
52 later                0.043943 0.040270 0.478192 -0.65046  -0.73774 -
53 today                0.102123 0.195652 0.657046 -1.07016  -0.42000 +
54 gpg                  0.186326 0.000401 0.002148 -0.00215  -6.14322 +
55 public               0.238247 0.142066 0.373551 -0.46769  -0.98470 +
56 key                  0.124586 0.047249 0.274966 -0.32154  -1.29111 +
57 http                 0.516509 0.823761 0.614623 -0.95353  -0.48675 +
58 www.bgl.nu           0.184117 0.000080 0.000436 -0.00044  -7.73901 +
59 finger               0.187799 0.002326 0.012236 -0.01231  -4.40339 +
60 bgl.nu               0.123972 0.002567 0.020286 -0.02049  -3.89782 +
61 P_Q_S_invs_logs_md   1.00e-00 0.00e+00 2.04e-09  -18.098  -150.810 0.10
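To get a feel for how the columns relate, a few rows can be checked by hand. The Python sketch below parses four rows copied from the table and tests whether fw is close to pbad/(pgood+pbad), fwlog to ln(fw), and invfwlog to ln(1-fw). These relationships are inferred from the printed numbers and are an assumption, not a statement of how bogofilter computes f(w). The helper names `parse_rows` and `check_row` are hypothetical.

```python
import math

# Four rows copied from the example table:
# row number, Token, pgood, pbad, fw, invfwlog, fwlog, U
SAMPLE = """\
1 glouis 0.986498 0.328654 0.249898 -0.28755 -1.38670 +
2 dynamicro.on.ca 0.233706 0.272581 0.538393 -0.77304 -0.61917 -
6 greg 0.647478 0.005455 0.008354 -0.00839 -4.78497 +
57 http 0.516509 0.823761 0.614623 -0.95353 -0.48675 +
"""

def parse_rows(text):
    """Split whitespace-separated rows into dicts; skip lines without 8 fields."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 8:
            continue
        token, used = parts[1], parts[7]
        pgood, pbad, fw, invfwlog, fwlog = (float(x) for x in parts[2:7])
        rows.append({"token": token, "pgood": pgood, "pbad": pbad, "fw": fw,
                     "invfwlog": invfwlog, "fwlog": fwlog, "used": used})
    return rows

def check_row(row, tol=1e-3):
    """True if the printed columns agree with the assumed relationships."""
    ratio = row["pbad"] / (row["pgood"] + row["pbad"])
    return (abs(ratio - row["fw"]) < tol
            and abs(math.log(row["fw"]) - row["fwlog"]) < tol
            and abs(math.log(1.0 - row["fw"]) - row["invfwlog"]) < tol)

for row in parse_rows(SAMPLE):
    print(row["token"], check_row(row))
```

Rows where fw prints as 0.000000 (x-confirm-reading-to, for instance) are left out of the sample: log(0) is undefined, and the finite fwlog printed for them suggests the underlying value is tiny but not zero.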
Loading the above table into R and getting a histogram is as easy as
bogo <- read.table("bogo2.tbl")
attach(bogo)
# drop the last row (the P_Q_S_invs_logs_md summary line) before plotting
hist(fw[-length(fw)], plot=TRUE, breaks=20)
From the histogram it's clear that this message contains (unsurprisingly) a lot of words that don't contribute much to the decision either way, but there are very few words with f(w) values on the right (spam) side of the graph. There's a cluster of nonspam values around 0.22 and another at 0.0; these suffice to identify the message as nonspam (S=2e-09, very close to zero and far below cutoff value 0.99).
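The same binning can be sketched outside R. This minimal Python version counts the 60 fw values from the example table (the summary row is left out) in 20 equal-width bins on [0, 1] and prints a text histogram. It is an approximation of the R plot, not a reproduction: R treats `breaks=20` as a suggestion and closes intervals on the right by default, whereas this sketch uses fixed bins closed on the left. `bin_counts` is a hypothetical helper name.

```python
# fw column from the example table, rows 1-60 (summary row 61 left out).
FW = [
    0.249898, 0.538393, 0.496854, 0.749066, 0.500000, 0.008354,
    0.017036, 0.051122, 0.045851, 0.012902, 0.012902, 0.499913,
    0.187703, 0.423707, 0.728164, 0.476086, 0.505157, 0.504376,
    0.388347, 0.637730, 0.485522, 0.539373, 0.218656, 0.045943,
    0.340231, 0.309920, 0.000000, 0.050547, 0.401393, 0.400288,
    0.332815, 0.226249, 0.270332, 0.467738, 0.120328, 0.500061,
    0.240119, 0.071397, 0.407950, 0.404292, 0.307674, 0.499649,
    0.000000, 0.489802, 0.276991, 0.125461, 0.000000, 0.572680,
    0.207240, 0.470545, 0.373691, 0.478192, 0.657046, 0.002148,
    0.373551, 0.274966, 0.614623, 0.000436, 0.012236, 0.020286,
]

def bin_counts(values, nbins=20):
    """Count values in nbins equal-width bins on [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * nbins
    for v in values:
        idx = min(int(v * nbins), nbins - 1)
        counts[idx] += 1
    return counts

counts = bin_counts(FW)
width = 1.0 / len(counts)
for i, c in enumerate(counts):
    lo = i * width
    print(f"{lo:4.2f}-{lo + width:4.2f} {'#' * c}")
```

In this sketch no value lands at or above 0.75, and the lowest bin is the fullest, which matches the picture described above.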
Last modified 2004-03-24 -- feedback to Greg