Initially people thought that every available message should be used to train bogofilter. Some even do this automatically; bogofilter has an option for that purpose. Of course the messages have to be verified and errors "backed out" of the training database afterward, or else bogofilter's inevitable mistakes will degrade the accuracy of training.
The suggestion was soon made that, after the training database had reached a reasonable size (some 10,000 each of spam and nonspam had been used for training), one could save time and storage by training only on errors (messages that were classed wrongly, or classed as "unsure"). Tests soon showed that this worked quite well, both for bogofilter users and for participants in the Spambayes project, which did a lot of useful research into both tokenization and scoring.
Training on error is theoretically questionable, because in principle the distribution of tokens in the training database should match the distribution in the message population. Nevertheless, it works for those who have used it extensively, and success is deemed to excuse some departure from strict theoretical rigour. (In fact, doing Bayesian analysis even with full training already violates an assumption of independence on which Bayesian theory depends.)
Once people started training on error -- some folks even skipped the initial stage of building the database to reasonable size by full training, though that has been shown to be less efficient -- the question naturally arose, "Once the filter has made a mistake, why not train with that message repeatedly till the program gets it right?" And indeed many people had the empirical experience that doing this with the occasional "difficult" message did improve accuracy to some extent. However, concern was raised that this could lead to much worse distortion of the token distribution within the training database, and eventually to degraded accuracy: the ability to recognize the erroneous messages used in repeated training might improve, but the ability to recognize similar but not identical messages might suffer. Putting it glibly, training repeatedly on a dachshund is probably not going to help a Bayesian classifier learn to distinguish a St. Bernard from a cat.
Notwithstanding this concern, however, the preceding experiment seemed to indicate that registering erroneously classified messages a limited number of times might improve overall accuracy.
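The repeated-registration loop under discussion can be sketched in shell roughly as follows. This is an illustration only, not the randomtrain script used in the experiments: the function name train_on_error and the BOGOFILTER override variable are invented for the sketch. It assumes bogofilter's exit codes (0 = spam, 1 = nonspam, 2 = unsure, 3 = error) and its -s and -n registration options.

```shell
#!/bin/sh
# Sketch only: train on error with a cap on repeated registration.
# BOGOFILTER is a hypothetical override so the loop can be dry-run
# against a stub; by default the real binary is used.
: "${BOGOFILTER:=bogofilter}"

# train_on_error DBDIR EXPECT MAXITER MSGFILE
#   EXPECT is "s" (spam) or "n" (nonspam).
#   Prints how many times the message was registered.
train_on_error() {
    toe_db=$1; toe_expect=$2; toe_max=$3; toe_msg=$4
    if [ "$toe_expect" = "s" ]; then toe_want=0; else toe_want=1; fi
    toe_regs=0
    while [ "$toe_regs" -lt "$toe_max" ]; do
        # Classify: exit status 0=spam, 1=nonspam, 2=unsure, 3=error.
        toe_got=0
        "$BOGOFILTER" -d "$toe_db" < "$toe_msg" >/dev/null 2>&1 || toe_got=$?
        if [ "$toe_got" -eq "$toe_want" ]; then break; fi
        # Wrong, unsure or error: register the message with its true class.
        "$BOGOFILTER" -d "$toe_db" "-$toe_expect" < "$toe_msg" >/dev/null 2>&1 || true
        toe_regs=$((toe_regs + 1))
    done
    echo "$toe_regs"
}
```

With MAXITER set to 1 this is plain training on error; larger values keep registering the same message until it is classified correctly or the cap is reached, which is exactly the behaviour whose effect on the token distribution is in question.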
Some people, however, passionately maintain that training only on
error, from scratch, with up to ten iterations of registration if a
message is still wrongly classified after the ninth -- or even beyond
that number -- gives excellent results. I therefore decided to
compare five different training methods with a corpus of 51,907 nonspam
and 52,127 spam, "dealt" out into four files each, two of which were to
be used for training and the other two for testing:
bogofilter -Q | fgrep ' 0.'
# bogofilter version 0.17.2
# robx = 0.610612 (6.11e-01)
# robs = 0.017800 (1.78e-02)
# min_dev = 0.020000 (2.00e-02)
# ham_cutoff = 0.281000 (2.81e-01)
# spam_cutoff = 0.532200 (5.32e-01)
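The "dealing" of the corpus into four files per class was done with formail and a helper script, distrib, which is not reproduced here. As a stand-in, the idea can be sketched with awk alone. The function name deal_mbox and the cycle order r0, t0, t1, r1 are assumptions; the order was picked only because it leaves the r1 file one message short, as in the counts recorded for this experiment. Splitting on lines that begin with "From " is also cruder than what formail does.

```shell
#!/bin/sh
# Sketch only: deal an mbox on stdin round-robin into four files,
# PREFIX.r0, PREFIX.t0, PREFIX.t1 and PREFIX.r1 (r = test, t = train).
# deal_mbox PREFIX   (PREFIX is e.g. "ns" or "sp")
deal_mbox() {
    awk -v prefix="$1" '
        BEGIN { split("r0 t0 t1 r1", suffix, " "); n = -1 }
        # A new message starts at each mbox "From " separator line.
        /^From / { n++; out = prefix "." suffix[n % 4 + 1] }
        # Append every line of the current message to its file.
        n >= 0   { print >> out }
    '
}

# Usage (the input name is a placeholder):
#   deal_mbox ns < all_nonspam.mbox
#   grep -c '^From ' ns.??
```

Dealing one message at a time, rather than cutting the corpus into four contiguous blocks, keeps the training and test files similar in age and composition.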
error run full half max1 max4 max10
fp 0 1 1 21 34 20
fp 1 1 2 33 28 28
fn 0 244 197 66 58 70
fn 1 259 202 71 70 61
value full half max1 max4 max10
x 0.5962 0.5231 0.6106 0.6106 0.6106
md 0.0950 0.0700 0.0200 0.0200 0.0200
s 0.0100 0.1000 0.0178 0.0178 0.0178
co 0.50004 0.5001 0.9997 0.99998 0.9999
fp0 1 1 1 3 1
fp1 1 1 4 3 3
fn0 149 165 5960 7175 6278
fn1 147 172 5954 7119 6266
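The raw counts are easier to compare as rates. The sketch below converts a count into a percentage of the messages tested. The helper name error_rate is made up for this illustration; the test-set sizes are the sums of the two test files per class (12977 + 12976 nonspam, 13032 + 13031 spam).

```shell
#!/bin/sh
# Sketch only: express an error count as a percentage of messages tested.
# error_rate ERRORS TOTAL  ->  percentage with two decimals
error_rate() {
    awk -v e="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * e / t }'
}

# Test-set sizes: ns.r0 + ns.r1 and sp.r0 + sp.r1.
NS_TEST=$((12977 + 12976))
SP_TEST=$((13032 + 13031))

# Full training, tuned: fp0 + fp1 = 1 + 1, fn0 + fn1 = 149 + 147.
error_rate $((1 + 1)) "$NS_TEST"
error_rate $((149 + 147)) "$SP_TEST"
# Train on error from scratch, max1: fn0 + fn1 = 5960 + 5954.
error_rate $((5960 + 5954)) "$SP_TEST"
```

On these figures full training misses a little over 1% of the spam, while the max1 database, once its cutoff is raised far enough to hold false positives down, misses nearly half of it.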
With these small databases, I thought, there will be lots of unknown and low-count tokens, which might be inherently unreliable. Using a high s and a slightly nonspammy x should bring the spam cutoff down and give better results. I therefore arbitrarily chose 0.45 for x and 1 for s, determined optimal cutoffs (for max 4 fp), and built new training databases with those parameters for the train-on-error methods:
value max1 max4 max10
co 0.538 0.524 0.512
fp0 1 1 1
fp1 3 3 3
fn0 3261 3020 2458
fn1 3196 3004 2443
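The reasoning behind a high s can be illustrated with Robinson's smoothing formula, f(w) = (s*x + n*p(w)) / (s + n), where n is the number of times the token has been seen and p(w) is its raw spam probability; robs and robx are bogofilter's names for s and x. The sketch below is a worked example only, and the helper name robinson_f is made up for it.

```shell
#!/bin/sh
# Sketch only: Robinson's smoothed token probability
#   f(w) = (s*x + n*p) / (s + n)
# robinson_f S X N P  ->  f(w) with four decimals
robinson_f() {
    awk -v s="$1" -v x="$2" -v n="$3" -v p="$4" \
        'BEGIN { printf "%.4f\n", (s * x + n * p) / (s + n) }'
}

# A token seen once, and only in spam (n = 1, p = 1.0):
robinson_f 0.0178 0.6106 1 1.0   # s and x as in the default parameters
robinson_f 1      0.45   1 1.0   # the s = 1, x = 0.45 chosen here
```

With the small default s, a single sighting is taken almost at face value (about 0.993); with s = 1 it is pulled back to 0.725, so low-count tokens in a small database have less influence on the score.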
Message counts in training databases:
messages full half max1 max4 max10
spam 26064 13267 220 251 279
nonspam 25954 13004 796 805 870
Message counts in training databases:
cutoffs messages max1 max4 max10
0.05/0.95 spam 831 836 1023
0.05/0.95 nonspam 740 962 1098
0.01/0.99 spam 1118 1117 1435
0.01/0.99 nonspam 858 1141 1488
Test results for databases trained with cutoffs 0.05/0.95:
value max1 max4 max10
co 0.80 0.65 0.79
fp0 1 1 2
fp1 3 3 2
fn0 471 286 467
fn1 471 298 476
Test results for databases trained with cutoffs 0.01/0.99:
value max1 max4 max10
co 0.804 0.804 0.78
fp0 1 1 3
fp1 3 3 1
fn0 469 304 381
fn1 484 293 396
After being built, each training database was used in a bogotune run to determine optimal parameters for the test. These parameters were then used to reclassify the test messages, as shown in the table below; the full-training results are repeated as well, for comparison, and the first three rows give message counts and sizes in kilobytes for the training databases:
value full hmax1 hmax4 hmax10
trainsp 26,064 13,267 13,572 13,938
trainns 25,954 13,004 13,008 13,017
dbkB 40,688 25,612 25,600 25,576
x 0.5962 0.5231 0.3959 0.4967
md 0.0950 0.0700 0.02 0.02
s 0.0100 0.1000 0.1778 0.3162
co 0.50004 0.5001 0.50002 0.61263
fp0 1 1 1 1
fp1 1 1 1 1
fn0 149 165 188 198
fn1 147 172 198 205
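Carrying the tuned parameters over into the reclassification run means reading them back from the saved bogotune output. The sketch below does that one value at a time instead of evaluating the file. It assumes the saved output contains lines of the form name=value, optionally followed by a comment; the helper name tuned_value is made up for this illustration.

```shell
#!/bin/sh
# Sketch only: pull one tuned parameter out of saved bogotune output
# without eval. If a name occurs more than once, the last value wins.
# tuned_value NAME FILE
tuned_value() {
    sed -n "s/^$1=\([0-9.]*\).*/\1/p" "$2" | tail -n 1
}

# Usage, with db.f.tune as the saved tuning output:
#   md=$(tuned_value min_dev db.f.tune)
#   s=$(tuned_value robs db.f.tune)
#   x=$(tuned_value robx db.f.tune)
#   sc=$(tuned_value spam_cutoff db.f.tune)
#   hc=$(tuned_value ham_cutoff db.f.tune)
#   bogofilter -d db.f -vM -m "$md,$s,$x" -o "$sc,$hc" < ns.r0 | grep -c '^1'
```

Reading named values explicitly avoids executing whatever else happens to be in the file, at the cost of listing the five parameters by hand.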
As would be expected, more spam than nonspam errors were encountered during training. The graph shows the numbers of spam and nonspam messages actually used in training at each pass:

value half1 half4 half10
trainsp 13,267 13,586 13,914
trainns 13,004 13,010 13,019
dbkB 25,612 25,612 25,612
x 0.5231 0.4362 0.4817
md 0.0700 0.035 0.025
s 0.1000 0.3162 0.3162
co 0.5001 0.50004 0.61193
fp0 1 1 1
fp1 1 1 1
fn0 165 185 192
fn1 172 190 193
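The trainsp and trainns rows can be rebuilt from the per-pass registration counts that randomtrain reports. The sketch below keeps running totals, starting from the 13,032 spam and 12,977 nonspam of the fully trained first half. The helper name cumulative_counts is made up for this illustration; the per-pass figures are copied from the randomtrain output in the transcript.

```shell
#!/bin/sh
# Sketch only: running totals of messages in the training database
# after each train-on-error pass over the second half.
# cumulative_counts SPAM_START NONSPAM_START
#   (reads one "spam_reg good_reg" pair per pass on stdin)
cumulative_counts() {
    awk -v sp="$1" -v ns="$2" '
        { sp += $1; ns += $2
          printf "pass %d: spam %d nonspam %d\n", NR, sp, ns }
    '
}

# Registrations per pass: the first pass, then passes 2 to 10.
PASS_REGS='235 27
135 4
100 1
84 1
67 1
61 3
57 1
50 2
47 1
46 1'

printf '%s\n' "$PASS_REGS" | cumulative_counts 13032 12977
```

Passes 1, 4 and 10 give 13267/13004, 13586/13010 and 13914/13019, matching the half1, half4 and half10 columns above. Spam registrations fall steadily from pass to pass, while nonspam registrations are few throughout.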
# Distribute messages
cat ../corpus0127.good ../corpus0127.nstrain ../corpus.good \
../corpus.nstrain | FILENO=0 formail -s ./distrib ns
cat /store/spam_corpus ../corpus0127.bad ../corpus0127.sptrain \
../corpus.bad ../corpus.nstrain | FILENO=0 formail -s ./distrib sp
grep -c '^From ' ??.?? >counts
cat counts
# ns.r0:12977
# ns.r1:12976
# ns.t0:12977
# ns.t1:12977
# sp.r0:13032
# sp.r1:13031
# sp.t0:13032
# sp.t1:13032
# Create wordlist subdirs
mkdir db.f db.h db.1 db.4 db.x
bogofilter -Q | fgrep ' 0.'
# bogofilter version 0.17.2
# robx = 0.610612 (6.11e-01)
# robs = 0.017800 (1.78e-02)
# min_dev = 0.020000 (2.00e-02)
# ham_cutoff = 0.281000 (2.81e-01)
# spam_cutoff = 0.532200 (5.32e-01)
# Full training
cat sp.t0 sp.t1 | bogofilter -d db.f -vs
# 7129501 words, 26064 messages
cat ns.t0 ns.t1 | bogofilter -d db.f -vn
# 10371105 words, 25954 messages
# First half full, second half on-error, register 1x only
bogofilter -d db.h -vs < sp.t0
# 3552598 words, 13032 messages
bogofilter -d db.h -vn < ns.t0
# 5085829 words, 12977 messages
randomtrain db.h s sp.t1 n ns.t1
# db.h, max reg is 1
# spam reg good reg
# 13032 235 12977 27
# Train exclusively on error, register 1x only
randomtrain db.1 s sp.t0 s sp.t1 n ns.t0 n ns.t1
# db.1, max reg is 1
# bogofilter: (db) open( db.1/wordlist.db ), err: 2, No such file or
# directory
# Can't open file 'wordlist.db' in directory 'db.1'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 220 25954 796
# Train exclusively on error, register up to 4x
randomtrain db.4 i 4 s sp.t0 s sp.t1 n ns.t0 n ns.t1
# db.4, max reg is 4
# bogofilter: (db) open( db.4/wordlist.db ), err: 2, No such file or
# directory
# Can't open file 'wordlist.db' in directory 'db.4'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 233 25954 728
# Train exclusively on error, register up to 10x
randomtrain db.x i 10 s sp.t0 s sp.t1 n ns.t0 n ns.t1
# db.x, max reg is 10
# bogofilter: (db) open( db.x/wordlist.db ), err: 2, No such file or
# directory
# Can't open file 'wordlist.db' in directory 'db.x'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 256 25954 715
# Get message counts for each training db
# Classify test messages with each training db, using the same
# bogofilter parameters used during training, but adjusting cutoff
for db in db.f db.h db.1 db.4 db.x; do
echo -n "# $db: "
bogoutil -w $db .MSG_COUNT | tail -n 1
echo "# $db: fp"
bogofilter -d $db -vM < ns.r0 | grep -c '^1'
bogofilter -d $db -vM < ns.r1 | grep -c '^1'
echo "# $db: fn"
bogofilter -d $db -vM < sp.r0 | grep -c -v '^1'
bogofilter -d $db -vM < sp.r1 | grep -c -v '^1'
echo
done
# db.f: .MSG_COUNT 26064 25954
# db.f: fp
# 1
# 1
# db.f: fn
# 244
# 259
#
# db.h: .MSG_COUNT 13267 13004
# db.h: fp
# 1
# 2
# db.h: fn
# 197
# 202
#
# db.1: .MSG_COUNT 220 796
# db.1: fp
# 21
# 33
# db.1: fn
# 66
# 71
#
# db.4: .MSG_COUNT 251 805
# db.4: fp
# 34
# 28
# db.4: fn
# 58
# 70
#
# db.x: .MSG_COUNT 279 870
# db.x: fp
# 20
# 28
# db.x: fn
# 65
# 61
# Tune db.f and db.h (not enough messages in pure t-o-e dbs)
for db in db.f db.h; do
echo "# $db" | tee $db.tune
bogotune -d $db -T 3 -v -s sp.r0 sp.r1 -n ns.r0 ns.r1 | tee -a $db.tune
eval `fgrep '=' $db.tune | awk '{print $1}'`
echo "$db tuned: fp" | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < ns.r0 | grep -c '^1' | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < ns.r1 | grep -c '^1' | tee -a $db.tune
echo "$db tuned: fn" | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < sp.r0 | grep -c -v '^1' | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < sp.r1 | grep -c -v '^1' | tee -a $db.tune
done
# db.f:
# db_cachesize=14
# robx=0.596176
# min_dev=0.095
# robs=0.0100
# spam_cutoff=0.5000 # for 0.01% fpos (2); expect 1.14% fneg (297).
# ham_cutoff=0.100
# db.f tuned: fp
# 3
# 1
# db.f tuned: fn
# 147
# 147
# db.h
# db_cachesize=9
# robx=0.523060
# min_dev=0.070
# robs=0.1000
# spam_cutoff=0.5001 # for 0.01% fpos (2); expect 1.29% fneg (337).
# ham_cutoff=0.100
# db.h tuned: fp
# 1
# 1
# db.h tuned: fn
# 165
# 172
# For the pure train-on-error experiments, retain the original
# parameter values except adjust spam cutoff to give 4 or fewer fp
# (cutoff must be between 0.5 and 0.99999 inclusive)
# db.1
# cutoffs of 0.99999 and 0.9997 gave 5 fp; 0.9993 gave 6.
# db.1 co 0.9997: fp
bogofilter -d db.1 -vM -o 0.9997 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.1 -vM -o 0.9997 < ns.r1 | grep -c '^1'
# 4
# db.1 co 0.9997: fn
bogofilter -d db.1 -vM -o 0.9997 < sp.r0 | grep -c -v '^1'
# 5960
bogofilter -d db.1 -vM -o 0.9997 < sp.r1 | grep -c -v '^1'
# 5954
# db.4
# cutoffs of 0.99999 and 0.99998 gave 6 fp; 0.99997 gave 7.
# db.4 co 0.99998: fp
bogofilter -d db.4 -vM -o 0.99998 < ns.r0 | grep -c '^1'
# 3
bogofilter -d db.4 -vM -o 0.99998 < ns.r1 | grep -c '^1'
# 3
# db.4 co 0.99998: fn
bogofilter -d db.4 -vM -o 0.99998 < sp.r0 | grep -c -v '^1'
# 7175
bogofilter -d db.4 -vM -o 0.99998 < sp.r1 | grep -c -v '^1'
# 7119
# db.x
# a cutoff of 0.9999 gave 4 fp; 0.9995 gave 5.
# db.x co 0.9999: fp
bogofilter -d db.x -vM -o 0.9999 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.x -vM -o 0.9999 < ns.r1 | grep -c '^1'
# 3
# db.x co 0.9999: fn
bogofilter -d db.x -vM -o 0.9999 < sp.r0 | grep -c -v '^1'
# 6278
bogofilter -d db.x -vM -o 0.9999 < sp.r1 | grep -c -v '^1'
# 6266
# With these small databases there will be lots of unknown and low-count
# tokens, which are inherently unreliable. In these circumstances,
# better results might possibly be obtained with a high s and a slightly
# nonspammy x, which would bring the spam cutoff down:
# db.1 md=0.02 s=1 x=0.45 co=0.538: fp
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < ns.r1 | grep -c '^1'
# 3
# db.1 md=0.02 s=1 x=0.45 co=0.538: fn
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < sp.r0 | grep -c -v '^1'
# 3261
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < sp.r1 | grep -c -v '^1'
# 3196
# db.4 md=0.02 s=1 x=0.45 co=0.524: fp
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < ns.r1 | grep -c '^1'
# 3
# db.4 md=0.02 s=1 x=0.45 co=0.524: fn
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < sp.r0 | grep -c -v '^1'
# 3020
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < sp.r1 | grep -c -v '^1'
# 3004
# db.x md=0.02 s=1 x=0.45 co=0.512: fp
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < ns.r1 | grep -c '^1'
# 3
# db.x md=0.02 s=1 x=0.45 co=0.512: fn
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < sp.r0 | grep -c -v '^1'
# 2458
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < sp.r1 | grep -c -v '^1'
# 2443
# We conclude that training on error, unlike full training, does not
# produce a training database with predictive value. However, it may
# still be a viable procedure if training is done with higher and lower
# than normal spam and nonspam cutoffs.
diff -u randomtrain.bck randomtrain
# --- randomtrain.bck 2004-03-13 13:29:23.000000000 -0500
# +++ randomtrain 2004-03-13 13:29:23.000000000 -0500
# @@ -70,7 +70,7 @@
# dd if=$fnam bs=1 skip=$offset count=$length 2>/dev/null >msg.$pid
# iter=0;
# while [ $iter -lt $maxiter ]; do
# - bogofilter -d $bogodir < msg.$pid
# + bogofilter -d $bogodir -c $bogodir/bogodir.rc < msg.$pid
# got=$? # 0=spam, 1=good, 2=unknown, 3=err
# if [ $iter -eq 0 ]; then
# if [ "$expect" = "s" ]; then let nspam=$nspam+1
cat db.1/bogofilter.rc
# spam_cutoff=0.95
# ham_cutoff=0.05
cp db.1/bogofilter.rc db.4
cp db.1/bogofilter.rc db.x
for iter in 1 4 10; do
db=db.$iter
if [ $iter -eq 10 ]; then db=db.x; fi
echo "$db:"
mv $db/wordlist.db $db/wordlist.db.orig
randomtrain $db i $iter s sp.t0 s sp.t1 n ns.t0 n ns.t1
done
# db.1:
# db.1, max reg is 1
# bogofilter: (db) open( db.1/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.1'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 831 25954 740
# db.4:
# db.4, max reg is 4
# bogofilter: (db) open( db.4/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.4'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 710 25954 868
# db.x:
# db.x, max reg is 10
# Can't open file 'wordlist.db' in directory 'db.x'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 737 25954 913
for iter in 1 4 10; do
db=db.$iter
if [ $iter -eq 10 ]; then db=db.x; fi
echo "$db: fp (0, 1) fn (0, 1)"
bogoutil -w $db .MSG_COUNT | tail -n 1
bogofilter -d $db -vM < ns.r0 | grep -c '^1'
bogofilter -d $db -vM < ns.r1 | grep -c '^1'
bogofilter -d $db -vM < sp.r0 | grep -c -v '^1'
bogofilter -d $db -vM < sp.r1 | grep -c -v '^1'
done
# db.1: fp (0, 1) fn (0, 1)
# .MSG_COUNT 831 740
# 13
# 24
# 67
# 67
# db.4: fp (0, 1) fn (0, 1)
# .MSG_COUNT 836 962
# 15
# 31
# 46
# 58
# db.x: fp (0, 1) fn (0, 1)
# .MSG_COUNT 1023 1098
# 17
# 37
# 41
# 50
cat >db.1/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0
robx=0.415
spam_cutoff=0.8
EOT
cat >db.4/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0
robx=0.415
spam_cutoff=0.65
EOT
cat >db.x/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0
robx=0.415
spam_cutoff=0.79
EOT
for iter in 1 4 10; do
db=db.$iter
if [ $iter -eq 10 ]; then db=db.x; fi
echo "$db: fp (0, 1) fn (0, 1)"
bogoutil -w $db .MSG_COUNT | tail -n 1
bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r0 | grep -c '^1'
bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r1 | grep -c '^1'
bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r0 | grep -c -v '^1'
bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r1 | grep -c -v '^1'
done
# db.1: fp (0, 1) fn (0, 1)
# .MSG_COUNT 831 740
# 1
# 3
# 471
# 471
# db.4: fp (0, 1) fn (0, 1)
# .MSG_COUNT 836 962
# 1
# 3
# 286
# 298
# db.x: fp (0, 1) fn (0, 1)
# .MSG_COUNT 1023 1098
# 2
# 2
# 467
# 476
cat >db.1/bogofilter.rc <<EOT
spam_cutoff=0.99
ham_cutoff=0.01
EOT
cp db.1/bogofilter.rc db.4
cp db.1/bogofilter.rc db.x
for iter in 1 4 10; do
db=db.$iter
if [ $iter -eq 10 ]; then db=db.x; fi
echo "$db:"
mv $db/wordlist.db $db/wordlist.db.orig
randomtrain $db i $iter s sp.t0 s sp.t1 n ns.t0 n ns.t1
done
# db.1:
# db.1, max reg is 1
# bogofilter: (db) open( db.1/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.1'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 1118 25954 858
# db.4:
# db.4, max reg is 4
# bogofilter: (db) open( db.4/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.4'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 922 25954 1029
# db.x:
# db.x, max reg is 10
# bogofilter: (db) open( db.x/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.x'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
# spam reg good reg
# 26064 1009 25954 1160
cat >db.1/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0
robx=0.415
spam_cutoff=0.804
EOT
cat >db.4/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0
robx=0.415
spam_cutoff=0.804
EOT
cat >db.x/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0
robx=0.415
spam_cutoff=0.78
EOT
for iter in 1 4 10; do
db=db.$iter
if [ $iter -eq 10 ]; then db=db.x; fi
echo "$db: fp (0, 1) fn (0, 1)"
bogoutil -w $db .MSG_COUNT | tail -n 1
bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r0 | grep -c '^1'
bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r1 | grep -c '^1'
bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r0 | grep -c -v '^1'
bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r1 | grep -c -v '^1'
done
# db.1: fp (0, 1) fn (0, 1)
# .MSG_COUNT 1118 858
# 1
# 3
# 469
# 484
# db.4: fp (0, 1) fn (0, 1)
# .MSG_COUNT 1117 1141
# 1
# 3
# 304
# 293
# db.x: fp (0, 1) fn (0, 1)
# .MSG_COUNT 1435 1488
# 3
# 1
# 381
# 396
# Create two more training dbs, db.h4 and db.hx
mkdir db.h4 db.hx
# First half full, second half on-error, register up to 4x
bogofilter -d db.h4 -vs < sp.t0
# 3552598 words, 13032 messages
bogofilter -d db.h4 -vn < ns.t0
# 5085829 words, 12977 messages
cp db.h4/wordlist.db db.hx
randomtrain db.h4 i 4 s sp.t1 n ns.t1
# db.h4, max reg is 4
# spam reg good reg
# 13032 181 12977 26
randomtrain db.hx i 10 s sp.t1 n ns.t1
# db.hx, max reg is 10
# spam reg good reg
# 13032 141 12977 29
# Get database sizes
ls -lk db*/wordlist.db
# -rw-r--r-- 1 root root 4920 Mar 14 15:25 db.1/wordlist.db
# -rw-r--r-- 1 root root 4756 Mar 14 16:17 db.4/wordlist.db
# -rw-r--r-- 1 root root 40688 Mar 12 10:14 db.f/wordlist.db
# -rw-r--r-- 1 root root 25612 Mar 12 10:46 db.h/wordlist.db
# -rw-r--r-- 1 root root 25600 Mar 15 07:59 db.h4/wordlist.db
# -rw-r--r-- 1 root root 25576 Mar 15 08:36 db.hx/wordlist.db
# -rw-r--r-- 1 root root 4964 Mar 14 17:10 db.x/wordlist.db
for db in db.h4 db.hx; do
echo "# $db" | tee $db.tune
bogoutil -w $db .MSG_COUNT | tail -n 1 | tee -a $db.tune
bogotune -d $db -T 3 -v -s sp.r0 sp.r1 -n ns.r0 ns.r1 | tee -a $db.tune
eval `fgrep '=' $db.tune | awk '{print $1}'`
echo "$db tuned: fp" | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < ns.r0 | grep -c '^1' | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < ns.r1 | grep -c '^1' | tee -a $db.tune
echo "$db tuned: fn" | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < sp.r0 | grep -c -v '^1' | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff < sp.r1 | grep -c -v '^1' | tee -a $db.tune
done
# db.h4
# .MSG_COUNT 13572 13008
# robx=0.395873
# min_dev=0.020
# robs=0.1778
# spam_cutoff=0.5000
# ham_cutoff=0.100
# db.h4 tuned: fp
# 1
# 1
# db.h4 tuned: fn
# 188
# 198
# db.hx
# .MSG_COUNT 13938 13017
# robx=0.496740
# min_dev=0.020
# robs=0.3162
# spam_cutoff=0.6126 # for 0.01% fpos (2); expect 1.54% fneg (401).
# ham_cutoff=0.100
# db.hx tuned: fp
# 1
# 1
# db.hx tuned: fn
# 198
# 205
# See if maxiter 1 but repeated randomtrain does any better
# first training-on-error registered 235 spam, 27 nonspam
mkdir db.r4 db.rx
cp db.h/wordlist.db db.r4
for i in 2 3 4; do randomtrain db.r4 s sp.t1 n ns.t1; done
# db.r4, max reg is 1
# spam reg good reg
# 13032 135 12977 4
# db.r4, max reg is 1
# spam reg good reg
# 13032 100 12977 1
# db.r4, max reg is 1
# spam reg good reg
# 13032 84 12977 1
cp db.r4/wordlist.db db.rx
for i in 5 6 7 8 9 10; do randomtrain db.rx s sp.t1 n ns.t1; done
# db.rx, max reg is 1
# spam reg good reg
# 13032 67 12977 1
# db.rx, max reg is 1
# spam reg good reg
# 13032 61 12977 3
# db.rx, max reg is 1
# spam reg good reg
# 13032 57 12977 1
# db.rx, max reg is 1
# spam reg good reg
# 13032 50 12977 2
# db.rx, max reg is 1
# spam reg good reg
# 13032 47 12977 1
# db.rx, max reg is 1
# spam reg good reg
# 13032 46 12977 1
# Get database sizes
ls -lk db.r*/wordlist.db
# -rw-r--r-- 1 root root 25612 Mar 15 13:08 db.r4/wordlist.db
# -rw-r--r-- 1 root root 25612 Mar 15 15:31 db.rx/wordlist.db
for db in db.r4 db.rx; do
echo "# $db" | tee $db.tune
bogoutil -w $db .MSG_COUNT | tail -n 1 | tee -a $db.tune
bogotune -d $db -T 3 -v -s sp.r0 sp.r1 -n ns.r0 ns.r1 | tee -a $db.tune
eval `fgrep '=' $db.tune | awk '{print $1}'`
echo "$db tuned: fp" | tee -a $db.tune
bogofilter -d $db -vM -m $min_dev,$robs,$robx \
-o $spam_cutoff,$ham_cutoff
[© Greg Louis <glouis@dynamicro.on.ca>, 2004; last modified 2004-03-15]