Bogofilter and Repetitive Training (Continued)

Introduction and general description:

The following is repeated from an earlier experiment in which I investigated the potential of repetitive training on error after an initial period of full training (these terms are explained below): Bogofilter is a Bayesian spam filter; background is presented on my general bogofilter page.  Bogofilter runs a modification of Paul Graham's original calculation method that was suggested by Gary Robinson.  The basic principle is that known spam and nonspam messages are presented to bogofilter for training; the messages are broken up into tokens (words, IP addresses and so forth) and the frequency of occurrence of each token in spam and nonspam is counted on a cumulative basis.  Bogofilter can then compare these frequencies to token frequencies in a message to be classified, and calculate a score that indicates the likelihood that the message is spam or nonspam.

Initially people thought that every available message should be used to train bogofilter.  Some even do this automatically; bogofilter has an option for that purpose.  Of course the messages have to be verified and errors "backed out" of the training database afterward, or else bogofilter's inevitable mistakes will degrade the accuracy of training.

The suggestion was soon made that, after the training database had reached a reasonable size (some 10,000 each of spam and nonspam had been used for training), one could save time and storage by training only on errors (messages that were classed wrongly, or classed as "unsure").  Tests soon showed that this worked quite well, both for bogofilter users and for participants in the Spambayes project, which did a lot of useful research into both tokenization and scoring.

Training on error is theoretically questionable, because in principle the distribution of tokens in the training database should match the distribution in the message population.  Nevertheless, it works for those who have used it extensively, and success is deemed to excuse some departure from strict theoretical rigour. (In fact, doing Bayesian analysis even with full training already violates an assumption of independence on which Bayesian theory depends.)

Once people started training on error -- some folks even skipped the initial stage of building the database to reasonable size by full training, though that has been shown to be less efficient -- the question naturally arose, "Once the filter has made a mistake, why not train with that message repeatedly till the program gets it right?"  And indeed many people had the empirical experience that doing this with the occasional "difficult" message did improve accuracy to some extent.  However, concern was raised that this could lead to much worse distortion of the token distribution within the training database, and eventually to degraded accuracy: the ability to recognize the erroneous messages used in repeated training might improve, but the ability to recognize similar but not identical messages might suffer.  Putting it glibly, training repeatedly on a dachshund is probably not going to help a Bayesian classifier learn to distinguish a St. Bernard from a cat.

Notwithstanding this concern, however, the preceding experiment seemed to indicate that registering erroneously classified messages a limited number of times might improve overall accuracy.

Some people, however, passionately maintain that training only on error, from scratch, with up to ten iterations of registration if a message is still wrongly classified after the ninth -- or even beyond that number -- gives excellent results.  I therefore decided to compare five different training methods with a corpus of 51,907 nonspam and 52,127 spam, "dealt" out into four files each, two of which were to be used for training and the other two for testing:

  1. Full training: the two training files of spam and then the two training files of nonspam were registered in their entirety.
  2. My normal production method: one file of spam was registered, then one of nonspam; the other training files were used with a script called randomtrain that feeds its input messages to bogofilter in random order for classification; if the classification is wrong or uncertain, the message is used for training, up to some user-defined maximum iterations.  For this training database I set the maximum to 1, as that's my usual practice.
  3. Training on error from scratch, using randomtrain, with all four training files as input, and the iteration limit set to 1.
  4. As above except with the iteration limit set to 4, which the previous experiment seemed to indicate was a "safe" number.
  5. As above, but with the iteration limit at 10, which has been reported by Tom Anderson to give good results.
As a followup to the previous experiment's finding that allowing multiple registrations of wrongly or uncertainly classified messages may improve accuracy, I also prepared two further databases similar to the second one listed above, but allowing up to 4 or 10 maximum iterations respectively.
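
The "dealing" of the corpus into four files can be sketched as follows. This is a hypothetical stand-in for the distrib script that is called through formail in the appendix; that script is not reproduced on this page, so the round-robin order and the function name deal_message are assumptions. The sketch relies only on formail's documented behaviour of setting FILENO to the running message number when splitting with -s.

```shell
# deal_message PREFIX: append the message on stdin to PREFIX.r0, PREFIX.t0,
# PREFIX.t1 or PREFIX.r1, chosen from the message number in $FILENO.
# The order of the four suffixes is an assumption.
deal_message() {
    prefix=$1
    # Strip leading zeros so the shell does not read the number as octal.
    n=$(printf '%s' "${FILENO:-0}" | sed 's/^0*//')
    n=${n:-0}
    case $((n % 4)) in
        0) suffix=r0 ;;   # test file 0
        1) suffix=t0 ;;   # training file 0
        2) suffix=t1 ;;   # training file 1
        3) suffix=r1 ;;   # test file 1
    esac
    cat >> "$prefix.$suffix"
}
```

Dealing 51,907 messages in this way leaves three files with 12,977 messages and one with 12,976, which is the pattern seen in the message counts in the appendix.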

Procedure and Results:

First the messages were distributed into four files and the five training databases were prepared. Initially, bogofilter's parameters were left at the values I currently use in production (more about this below) for training:
bogofilter -Q | fgrep ' 0.'   
# bogofilter version 0.17.2
# robx        = 0.610612 (6.11e-01)
# robs        = 0.017800 (1.78e-02)
# min_dev     = 0.020000 (2.00e-02)
# ham_cutoff  = 0.281000 (2.81e-01)
# spam_cutoff = 0.532200 (5.32e-01)
For full and half-full training, the training databases were large enough to allow parameter optimization with bogotune. This could not be done for the other three training databases (I tried lowering the message-count limit but bogotune produced nonsensical output), so the test runs were first done with the parameters unchanged. The first test consisted merely of classifying the two test nonspam files and the two test spam files, and counting false positives and false negatives:
 error run full half max1 max4 max10
    fp   0    1    1   21   34    20
    fp   1    1    2   33   28    28
    fn   0  244  197   66   58    70
    fn   1  259  202   71   70    61
Comparisons in accuracy are only meaningful when the number of false positives is held constant, or nearly so. As the training-on-error databases all yielded false-positive counts that I deem unacceptably high (the best was 0.15%, much beyond what's tolerable), I decided to try to increase the spam cutoffs to get a maximum of four false positives. For this round, I also ran bogotune for the first two training databases, to get the best performance from those:
 value     full     half      max1      max4     max10
     x   0.5962   0.5231    0.6106    0.6106    0.6106
    md   0.0950   0.0700    0.0200    0.0200    0.0200
     s   0.0100   0.1000    0.0178    0.0178    0.0178
    co  0.50004   0.5001    0.9997   0.99998    0.9999
   fp0        1        1         1         3         1
   fp1        1        1         4         3         3
   fn0      149      165      5960      7175      6278
   fn1      147      172      5954      7119      6266
This obviously didn't work too well. Cutoffs high enough to give reasonable false positive counts with the training-on-error databases allowed far too many false negatives.
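
Finding a cutoff that admits at most a given number of false positives amounts to looking at the highest-scoring nonspam messages. The following is a minimal sketch, assuming the nonspam spamicity scores have already been extracted one per line in plain decimal notation. The function name fp_threshold and the input format are illustrative assumptions, not part of bogofilter.

```shell
# fp_threshold N: read nonspam scores on stdin and print the (N+1)th highest.
# Any spam cutoff strictly above the printed value classes at most N of
# these nonspam messages as spam.
fp_threshold() {
    LC_ALL=C sort -rn | sed -n "$(($1 + 1))p"
}

# Example: allow at most two false positives among five nonspam scores.
printf '%s\n' 0.10 0.90 0.50 0.99 0.30 | fp_threshold 2
```

In the example the third-highest score is printed, so a cutoff just above it would let only the two highest-scoring nonspam through as false positives. The price, as the table shows, is that every spam scoring below the chosen cutoff becomes a false negative.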

With these small databases, I thought, there will be lots of unknown and low-count tokens, which might be inherently unreliable. Using a high s and a slightly nonspammy x should bring the spam cutoff down and give better results. I therefore arbitrarily chose 0.45 for x and 1 for s, determined optimal cutoffs (for max 4 fp), and built new training databases with those parameters for the train-on-error methods:

 value  max1  max4 max10
    co 0.538 0.524 0.512
   fp0     1     1     1
   fp1     3     3     3
   fn0  3261  3020  2458
   fn1  3196  3004  2443
Better, but not at all adequate. Note how few messages were actually used in building the train-on-error databases:
Message counts in training databases:
 messages  full  half max1 max4 max10
     spam 26064 13267  220  251   279
  nonspam 25954 13004  796  805   870
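
The reasoning about s and x can be made concrete with Robinson's token-score formula, f(w) = (s*x + n*p(w)) / (s + n), where n is the number of training messages containing the token and p(w) is its raw spam probability. The sketch below only evaluates that formula; the function name fw is an illustrative assumption, and bogofilter's own computation may differ in detail.

```shell
# fw S X N P: Robinson's f(w) = (s*x + n*p) / (s + n), to four decimals.
fw() {
    awk -v s="$1" -v x="$2" -v n="$3" -v p="$4" \
        'BEGIN { printf "%.4f\n", (s * x + n * p) / (s + n) }'
}

fw 1      0.45   0 0   # unknown token (n = 0): the score is simply x
fw 1      0.45   1 1   # token seen once, in spam, with the high s
fw 0.0178 0.6106 1 1   # the same token with the production s and x
```

With s = 1 a token seen in a single spam scores 0.725, whereas with s = 0.0178 the same token scores above 0.99. The larger s thus keeps low-count tokens from dominating the result.
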
Two further modifications were then tried: training with cutoffs of 0.95 and 0.05 or 0.99 and 0.01, and increasing min_dev so that unknown tokens were always excluded from the scoring. The wider uncertain windows caused more messages to be used in training:
Message counts in training databases:
   cutoffs messages max1 max4 max10
 0.05/0.95     spam  831  836  1023
 0.05/0.95  nonspam  740  962  1098
 0.01/0.99     spam 1118 1117  1435
 0.01/0.99  nonspam  858 1141  1488
This led to a considerable improvement in accuracy. The test runs had min_dev set to 0.1, s to 1 and x to 0.415. For the 0.05/0.95 training:
 value  max1  max4 max10
    co  0.80  0.65  0.79
   fp0     1     1     2
   fp1     3     3     2
   fn0   471   286   467
   fn1   471   298   476
The 0.01/0.99 training produced these test results:
 value  max1  max4 max10
    co 0.804 0.804  0.78
   fp0     1     1     3
   fp1     3     3     1
   fn0   469   304   381
   fn1   484   293   396
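
Why these test settings keep unknown tokens out of the score can be checked with a little arithmetic. An unknown token scores x, and bogofilter ignores tokens whose score lies within min_dev of 0.5. The sketch below assumes a strict "less than" comparison at the boundary, which is a guess, and is_excluded is an illustrative name.

```shell
# is_excluded SCORE MIN_DEV: succeed if |SCORE - 0.5| < MIN_DEV, that is,
# if a token with this score would be left out of the calculation.
is_excluded() {
    awk -v f="$1" -v md="$2" \
        'BEGIN { d = f - 0.5; if (d < 0) d = -d; exit !(d < md) }'
}

# Compare the settings used at different stages of the experiment.
for pair in "0.415 0.1" "0.45 0.02" "0.610612 0.02"; do
    set -- $pair
    if is_excluded "$1" "$2"; then verdict=excluded; else verdict=scored; fi
    echo "x=$1 min_dev=$2: unknown tokens are $verdict"
done
```

Only the first pair, x = 0.415 with min_dev = 0.1, excludes unknown tokens, since 0.085 is less than 0.1. With the earlier settings the deviation of x from 0.5 exceeds min_dev, so unknown tokens still contribute.
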
The final phase of this testing compares training databases that were created by full training with half the training data, followed by training on error with the second half. Three such databases are compared. They differ only in the number of iterations permitted during the training on error: 1, 4 and 10. (A maximum of ten iterations means that a message, if wrongly classified or deemed unsure, is used for training and then reclassified, training and classification being repeated up to nine times; if the classification is still wrong or unsure, we train one more time and move on to the next message.)
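
The per-message loop described in the parenthesis can be sketched as below. This is not the randomtrain script itself, only an illustration of the logic; the function name train_on_error and the BOGOFILTER variable are assumptions. It relies on bogofilter's exit codes (0 for spam, 1 for nonspam, 2 for unsure, 3 for error) and on the -s and -n registration options used in the appendix.

```shell
BOGOFILTER=${BOGOFILTER:-bogofilter}

# train_on_error DIR EXPECT MSGFILE MAXITER
# EXPECT is "s" for spam or "n" for nonspam.  The message is classified and,
# whenever the result is anything other than the expected class, registered;
# this is repeated until it is classified correctly or MAXITER registrations
# have been made.  Prints the number of registrations performed.
train_on_error() {
    dir=$1 expect=$2 msg=$3 maxiter=$4
    if [ "$expect" = s ]; then want=0; else want=1; fi
    iter=0
    while [ "$iter" -lt "$maxiter" ]; do
        got=0
        "$BOGOFILTER" -d "$dir" < "$msg" > /dev/null || got=$?
        if [ "$got" -eq "$want" ]; then break; fi
        "$BOGOFILTER" -d "$dir" "-$expect" < "$msg"
        iter=$((iter + 1))
    done
    echo "$iter"
}
```

With MAXITER set to 1 the message is registered at most once and never rechecked, which corresponds to a maximum iteration count of 1 in the experiments.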

After being built, each training database was used in a bogotune run to determine optimal parameters for the test. These parameters were then used to reclassify the test messages, as shown in the table below; the full-training results are repeated as well, for comparison, and the first three rows give message counts and sizes in kilobytes for the training databases:

   value    full   hmax1   hmax4  hmax10
 trainsp  26,064  13,267  13,572  13,938
 trainns  25,954  13,004  13,008  13,017
    dbkB  40,688  25,612  25,600  25,576
       x  0.5962  0.5231  0.3959  0.4967
      md  0.0950  0.0700    0.02    0.02
       s  0.0100  0.1000  0.1778  0.3162
      co 0.50004  0.5001 0.50002 0.61263
     fp0       1       1       1       1
     fp1       1       1       1       1
     fn0     149     165     188     198
     fn1     147     172     198     205
Note that this differs from the procedure used in the previously reported experiment, where the whole corpus to be retrained was submitted, each message that was wrongly classified was used once in training, and then the whole corpus was resubmitted, up to the maximum number of iterations.  It seemed appropriate to try the experiment that way as well.  Accordingly, the training database made half by full training and half by training-on-error with a maximum iteration of 1 (hmax1 in the above table) was copied, and training-on-error was repeated nine more times; at the third repetition (four training passes total) a copy was made, so that we had 1, 4 and 10 passes of training on error.

As would be expected, more spam than nonspam errors were encountered during training.  The graph shows the numbers of spam and nonspam messages actually used in training at each pass:

[Graph of messages vs. iteration]

The test results for these databases are shown in the following table:
   value   half1   half4  half10
 trainsp  13,267  13,586  13,914
 trainns  13,004  13,010  13,019
    dbkB  25,612  25,612  25,612
       x  0.5231  0.4362  0.4817
      md  0.0700   0.035   0.025
       s  0.1000  0.3162  0.3162
      co  0.5001 0.50004 0.61193
     fp0       1       1       1
     fp1       1       1       1
     fn0     165     185     192
     fn1     172     190     193

Conclusions:

  1. This experiment yielded evidence that using training on error exclusively, be it with or without repetition, has poor predictive value if one requires a small number of false positives, though low overall error rates can be achieved if that's not a criterion.  This is different from some people's experience, but I can only report what I found.
  2. Although full training gave the best results of all training methods that were tested, training with a substantial number of spam and nonspam messages, followed thereafter by training on error, can do almost as well at a considerable saving in database size.  (The sizes reported above are as built, and could be reduced if the databases were to be dumped and reloaded with bogoutil.)
  3. When adopting the method of training on error after an initial period of full training, it isn't worthwhile to allow repetitions of registration/classification message by message.  Nor was batchwise registration, though somewhat successful in the experiment reported earlier, able to improve accuracy with this set of messages.
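
The dump-and-reload mentioned in the second conclusion might look like the following sketch. It uses bogoutil's -d (dump) and -l (load) options; the function name compact_wordlist and the .dump and .new file names are illustrative assumptions. The original database is left untouched so that the result can be checked before it is swapped in.

```shell
BOGOUTIL=${BOGOUTIL:-bogoutil}

# compact_wordlist DBFILE: dump DBFILE to text and load the text into a
# fresh database DBFILE.new.  Prints the name of the new file.
compact_wordlist() {
    db=$1
    "$BOGOUTIL" -d "$db" > "$db.dump" || return 1
    "$BOGOUTIL" -l "$db.new" < "$db.dump" || return 1
    rm -f "$db.dump"
    echo "$db.new"
}
```

The sizes of the old and new files can then be compared with ls -lk, as is done for the as-built databases in the appendix.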

Appendix: Detailed log

# Distribute messages
cat ../corpus0127.good ../corpus0127.nstrain ../corpus.good \
    ../corpus.nstrain | FILENO=0 formail -s ./distrib ns
cat /store/spam_corpus ../corpus0127.bad ../corpus0127.sptrain \
    ../corpus.bad ../corpus.sptrain | FILENO=0 formail -s ./distrib sp
grep -c '^From ' ??.?? >counts
cat counts
# ns.r0:12977
# ns.r1:12976
# ns.t0:12977
# ns.t1:12977
# sp.r0:13032
# sp.r1:13031
# sp.t0:13032
# sp.t1:13032

# Create wordlist subdirs
mkdir db.f db.h db.1 db.4 db.x
bogofilter -Q | fgrep ' 0.'
# bogofilter version 0.17.2
# robx        = 0.610612 (6.11e-01)
# robs        = 0.017800 (1.78e-02)
# min_dev     = 0.020000 (2.00e-02)
# ham_cutoff  = 0.281000 (2.81e-01)
# spam_cutoff = 0.532200 (5.32e-01)

# Full training
cat sp.t0 sp.t1 | bogofilter -d db.f -vs
# 7129501 words, 26064 messages
cat ns.t0 ns.t1 | bogofilter -d db.f -vn
# 10371105 words, 25954 messages

# First half full, second half on-error, register 1x only
bogofilter -d db.h -vs < sp.t0
# 3552598 words, 13032 messages
bogofilter -d db.h -vn < ns.t0
# 5085829 words, 12977 messages
randomtrain db.h s sp.t1 n ns.t1
# db.h, max reg is 1
#  spam  reg   good  reg
# 13032  235  12977   27 

# Train exclusively on error, register 1x only
randomtrain db.1 s sp.t0 s sp.t1 n ns.t0 n ns.t1
# db.1, max reg is 1
# bogofilter: (db) open( db.1/wordlist.db ), err: 2, No such file or
# directory
# Can't open file 'wordlist.db' in directory 'db.1'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064  220  25954  796

# Train exclusively on error, register up to 4x
randomtrain db.4 i 4 s sp.t0 s sp.t1 n ns.t0 n ns.t1
# db.4, max reg is 4
# bogofilter: (db) open( db.4/wordlist.db ), err: 2, No such file or
# directory
# Can't open file 'wordlist.db' in directory 'db.4'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064  233  25954  728 

# Train exclusively on error, register up to 10x
randomtrain db.x i 10 s sp.t0 s sp.t1 n ns.t0 n ns.t1
# db.x, max reg is 10
# bogofilter: (db) open( db.x/wordlist.db ), err: 2, No such file or
# directory
# Can't open file 'wordlist.db' in directory 'db.x'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064  256  25954  715 

# Get message counts for each training db
# Classify test messages with each training db, using the same
# bogofilter parameters used during training, but adjusting cutoff
for db in db.f db.h db.1 db.4 db.x; do
    echo -n "# $db: "
    bogoutil -w $db .MSG_COUNT | tail -n 1
    echo "# $db: fp"
    bogofilter -d $db -vM < ns.r0 | grep -c '^1'
    bogofilter -d $db -vM < ns.r1 | grep -c '^1'
    echo "# $db: fn"
    bogofilter -d $db -vM < sp.r0 | grep -c -v '^1'
    bogofilter -d $db -vM < sp.r1 | grep -c -v '^1'
    echo
done
# db.f: .MSG_COUNT                      26064  25954
# db.f: fp
# 1
# 1
# db.f: fn
# 244
# 259
#
# db.h: .MSG_COUNT                      13267  13004
# db.h: fp
# 1
# 2
# db.h: fn
# 197
# 202
#
# db.1: .MSG_COUNT                        220    796
# db.1: fp
# 21
# 33
# db.1: fn
# 66
# 71
#
# db.4: .MSG_COUNT                        251    805
# db.4: fp
# 34
# 28
# db.4: fn
# 58
# 70
#
# db.x: .MSG_COUNT                        279    870
# db.x: fp
# 20
# 28
# db.x: fn
# 65
# 61

# Tune db.f and db.h (not enough messages in pure t-o-e dbs)
for db in db.f db.h; do
    echo "# $db" | tee $db.tune
    bogotune -d $db -T 3 -v -s sp.r0 sp.r1 -n ns.r0 ns.r1 | tee -a $db.tune
    eval `fgrep '=' $db.tune | awk '{print $1}'`
    echo "$db tuned: fp" | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
	-o $spam_cutoff,$ham_cutoff < ns.r0 | grep -c '^1' | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
	-o $spam_cutoff,$ham_cutoff < ns.r1 | grep -c '^1' | tee -a $db.tune
    echo "$db tuned: fn" | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
	-o $spam_cutoff,$ham_cutoff < sp.r0 | grep -c -v '^1' | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
	-o $spam_cutoff,$ham_cutoff < sp.r1 | grep -c -v '^1' | tee -a $db.tune
done

# db.f:
# db_cachesize=14
# robx=0.596176
# min_dev=0.095
# robs=0.0100
# spam_cutoff=0.5000      # for 0.01% fpos (2); expect 1.14% fneg (297).
# ham_cutoff=0.100        

# db.f tuned: fp
# 3
# 1
# db.f tuned: fn
# 147
# 147

# db.h
# db_cachesize=9
# robx=0.523060
# min_dev=0.070
# robs=0.1000
# spam_cutoff=0.5001      # for 0.01% fpos (2); expect 1.29% fneg (337).
# ham_cutoff=0.100        

# db.h tuned: fp
# 1
# 1
# db.h tuned: fn
# 165
# 172

# For the pure train-on-error experiments, retain the original
# parameter values except adjust spam cutoff to give 4 or fewer fp
# (cutoff must be between 0.5 and 0.99999 inclusive)

# db.1
# cutoffs of 0.99999 and 0.9997 gave 5 fp; 0.9993 gave 6.
# db.1 co 0.9997: fp
bogofilter -d db.1 -vM -o 0.9997 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.1 -vM -o 0.9997 < ns.r1 | grep -c '^1'
# 4
# db.1 co 0.9997: fn
bogofilter -d db.1 -vM -o 0.9997 < sp.r0 | grep -c -v '^1'
# 5960
bogofilter -d db.1 -vM -o 0.9997 < sp.r1 | grep -c -v '^1'
# 5954

# db.4
# cutoffs of 0.99999 and 0.99998 gave 6 fp; 0.99997 gave 7.
# db.4 co 0.99998: fp
bogofilter -d db.4 -vM -o 0.99998 < ns.r0 | grep -c '^1'
# 3
bogofilter -d db.4 -vM -o 0.99998 < ns.r1 | grep -c '^1'
# 3
# db.4 co 0.99998: fn
bogofilter -d db.4 -vM -o 0.99998 < sp.r0 | grep -c -v '^1'
# 7175
bogofilter -d db.4 -vM -o 0.99998 < sp.r1 | grep -c -v '^1'
# 7119

# db.x
# a cutoff of 0.9999 gave 4 fp; 0.9995 gave 5.
# db.x co 0.9999: fp
bogofilter -d db.x -vM -o 0.9999 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.x -vM -o 0.9999 < ns.r1 | grep -c '^1'
# 3
# db.x co 0.9999: fn
bogofilter -d db.x -vM -o 0.9999 < sp.r0 | grep -c -v '^1'
# 6278
bogofilter -d db.x -vM -o 0.9999 < sp.r1 | grep -c -v '^1'
# 6266

# With these small databases there will be lots of unknown and low-count
# tokens, which are inherently unreliable.  In these circumstances,
# better results might possibly be obtained with a high s and a slightly
# nonspammy x, which would bring the spam cutoff down:

# db.1 md=0.02 s=1 x=0.45 co=0.538: fp
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < ns.r1 | grep -c '^1'
# 3
# db.1 md=0.02 s=1 x=0.45 co=0.538: fn
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < sp.r0 | grep -c -v '^1'
# 3261
bogofilter -d db.1 -vM -m 0.02,1,0.45 -o 0.538 < sp.r1 | grep -c -v '^1'
# 3196

# db.4 md=0.02 s=1 x=0.45 co=0.524: fp
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < ns.r1 | grep -c '^1'
# 3
# db.4 md=0.02 s=1 x=0.45 co=0.524: fn
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < sp.r0 | grep -c -v '^1'
# 3020
bogofilter -d db.4 -vM -m 0.02,1,0.45 -o 0.524 < sp.r1 | grep -c -v '^1'
# 3004

# db.x md=0.02 s=1 x=0.45 co=0.512: fp
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < ns.r0 | grep -c '^1'
# 1
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < ns.r1 | grep -c '^1'
# 3
# db.x md=0.02 s=1 x=0.45 co=0.512: fn
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < sp.r0 | grep -c -v '^1'
# 2458
bogofilter -d db.x -vM -m 0.02,1,0.45 -o 0.512 < sp.r1 | grep -c -v '^1'
# 2443

# We conclude that training on error, unlike full training, does not
# produce a training database with predictive value.  However, it may
# still be a viable procedure if training is done with higher and lower
# than normal spam and nonspam cutoffs.

diff -u randomtrain.bck randomtrain
# --- randomtrain.bck     2004-03-13 13:29:23.000000000 -0500
# +++ randomtrain 2004-03-13 13:29:23.000000000 -0500
# @@ -70,7 +70,7 @@
#         dd if=$fnam bs=1 skip=$offset count=$length 2>/dev/null >msg.$pid
#         iter=0;
#         while [ $iter -lt $maxiter ]; do
# -           bogofilter -d $bogodir < msg.$pid
# +           bogofilter -d $bogodir -c $bogodir/bogofilter.rc < msg.$pid
#             got=$?      # 0=spam, 1=good, 2=unknown, 3=err
#             if [ $iter -eq 0 ]; then
#                 if [ "$expect" = "s" ]; then let nspam=$nspam+1

cat db.1/bogofilter.rc
# spam_cutoff=0.95
# ham_cutoff=0.05
cp db.1/bogofilter.rc db.4
cp db.1/bogofilter.rc db.x

for iter in 1 4 10; do
    db=db.$iter
    if [ $iter -eq 10 ]; then db=db.x; fi
    echo "$db:"
    mv $db/wordlist.db $db/wordlist.db.orig
    randomtrain $db i $iter s sp.t0 s sp.t1 n ns.t0 n ns.t1
done

# db.1:
# db.1, max reg is 1
# bogofilter: (db) open( db.1/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.1'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064  831  25954  740 

# db.4:
# db.4, max reg is 4
# bogofilter: (db) open( db.4/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.4'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064  710  25954  868

# db.x:
# db.x, max reg is 10
# Can't open file 'wordlist.db' in directory 'db.x'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064  737  25954  913

for iter in 1 4 10; do
    db=db.$iter
    if [ $iter -eq 10 ]; then db=db.x; fi
    echo "$db: fp (0, 1) fn (0, 1)"
    bogoutil -w $db .MSG_COUNT | tail -n 1
    bogofilter -d $db -vM < ns.r0 | grep -c '^1'
    bogofilter -d $db -vM < ns.r1 | grep -c '^1'
    bogofilter -d $db -vM < sp.r0 | grep -c -v '^1'
    bogofilter -d $db -vM < sp.r1 | grep -c -v '^1'
done

# db.1: fp (0, 1) fn (0, 1)
# .MSG_COUNT                        831    740
# 13
# 24
# 67
# 67
# db.4: fp (0, 1) fn (0, 1)
# .MSG_COUNT                        836    962
# 15
# 31
# 46
# 58
# db.x: fp (0, 1) fn (0, 1)
# .MSG_COUNT                       1023   1098
# 17
# 37
# 41
# 50

cat >db.1/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0
robx=0.415
spam_cutoff=0.8
EOT
cat >db.4/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0   
robx=0.415 
spam_cutoff=0.65
EOT
cat >db.x/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0   
robx=0.415 
spam_cutoff=0.79
EOT

for iter in 1 4 10; do
    db=db.$iter
    if [ $iter -eq 10 ]; then db=db.x; fi
    echo "$db: fp (0, 1) fn (0, 1)"
    bogoutil -w $db .MSG_COUNT | tail -n 1
    bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r0 | grep -c '^1'
    bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r1 | grep -c '^1'
    bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r0 | grep -c -v '^1'
    bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r1 | grep -c -v '^1'
done   

# db.1: fp (0, 1) fn (0, 1)
# .MSG_COUNT                        831    740
# 1
# 3
# 471
# 471
# db.4: fp (0, 1) fn (0, 1)
# .MSG_COUNT                        836    962
# 1
# 3
# 286
# 298
# db.x: fp (0, 1) fn (0, 1)
# .MSG_COUNT                       1023   1098
# 2
# 2
# 467
# 476

cat >db.1/bogofilter.rc <<EOT
spam_cutoff=0.99
ham_cutoff=0.01
EOT
cp db.1/bogofilter.rc db.4
cp db.1/bogofilter.rc db.x
 
for iter in 1 4 10; do
    db=db.$iter
    if [ $iter -eq 10 ]; then db=db.x; fi
    echo "$db:"
    mv $db/wordlist.db $db/wordlist.db.orig
    randomtrain $db i $iter s sp.t0 s sp.t1 n ns.t0 n ns.t1
done

# db.1:
# db.1, max reg is 1
# bogofilter: (db) open( db.1/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.1'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064 1118  25954  858 

# db.4:
# db.4, max reg is 4
# bogofilter: (db) open( db.4/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.4'.
# error #2 - No such file or directory.
# 
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064  922  25954 1029 

# db.x:
# db.x, max reg is 10
# bogofilter: (db) open( db.x/wordlist.db ), err: 2, No such file or directory
# Can't open file 'wordlist.db' in directory 'db.x'.
# error #2 - No such file or directory.
#
# Remember to register some spam and ham messages before you
# use bogofilter to evaluate mail for its probable spam status!
#  spam  reg   good  reg
# 26064 1009  25954 1160

cat >db.1/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0   
robx=0.415
spam_cutoff=0.804
EOT

cat >db.4/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0   
robx=0.415
spam_cutoff=0.804
EOT

cat >db.x/bogofilter.rc <<EOT
min_dev=0.1
robs=1.0   
robx=0.415
spam_cutoff=0.78
EOT

for iter in 1 4 10; do
    db=db.$iter
    if [ $iter -eq 10 ]; then db=db.x; fi
    echo "$db: fp (0, 1) fn (0, 1)"
    bogoutil -w $db .MSG_COUNT | tail -n 1
    bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r0 | grep -c '^1'   
    bogofilter -d $db -c $db/bogofilter.rc -vM < ns.r1 | grep -c '^1'   
    bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r0 | grep -c -v '^1'
    bogofilter -d $db -c $db/bogofilter.rc -vM < sp.r1 | grep -c -v '^1'
done

# db.1: fp (0, 1) fn (0, 1)
# .MSG_COUNT                       1118    858
# 1
# 3
# 469
# 484
# db.4: fp (0, 1) fn (0, 1)
# .MSG_COUNT                       1117   1141
# 1
# 3
# 304
# 293
# db.x: fp (0, 1) fn (0, 1)
# .MSG_COUNT                       1435   1488
# 3
# 1
# 381
# 396
   
# Create two more training dbs, db.h4 and db.hx
mkdir db.h4 db.hx
# First half full, second half on-error, register up to 4x
bogofilter -d db.h4 -vs < sp.t0
# 3552598 words, 13032 messages
bogofilter -d db.h4 -vn < ns.t0
# 5085829 words, 12977 messages
cp db.h4/wordlist.db db.hx
randomtrain db.h4 i 4 s sp.t1 n ns.t1
# db.h4, max reg is 4
#  spam  reg   good  reg  
# 13032  181  12977   26
randomtrain db.hx i 10 s sp.t1 n ns.t1
# db.hx, max reg is 10  
#  spam  reg   good  reg
# 13032  141  12977   29

# Get database sizes
ls -lk db*/wordlist.db                
# -rw-r--r--  1 root root  4920 Mar 14 15:25 db.1/wordlist.db
# -rw-r--r--  1 root root  4756 Mar 14 16:17 db.4/wordlist.db
# -rw-r--r--  1 root root 40688 Mar 12 10:14 db.f/wordlist.db
# -rw-r--r--  1 root root 25612 Mar 12 10:46 db.h/wordlist.db
# -rw-r--r--  1 root root 25600 Mar 15 07:59 db.h4/wordlist.db
# -rw-r--r--  1 root root 25576 Mar 15 08:36 db.hx/wordlist.db
# -rw-r--r--  1 root root  4964 Mar 14 17:10 db.x/wordlist.db 

for db in db.h4 db.hx; do
    echo "# $db" | tee $db.tune
    bogoutil -w $db .MSG_COUNT | tail -n 1 | tee -a $db.tune
    bogotune -d $db -T 3 -v -s sp.r0 sp.r1 -n ns.r0 ns.r1 | tee -a $db.tune
    eval `fgrep '=' $db.tune | awk '{print $1}'`
    echo "$db tuned: fp" | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
        -o $spam_cutoff,$ham_cutoff < ns.r0 | grep -c '^1' | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
        -o $spam_cutoff,$ham_cutoff < ns.r1 | grep -c '^1' | tee -a $db.tune
    echo "$db tuned: fn" | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
        -o $spam_cutoff,$ham_cutoff < sp.r0 | grep -c -v '^1' | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
        -o $spam_cutoff,$ham_cutoff < sp.r1 | grep -c -v '^1' | tee -a $db.tune
done

# db.h4
# .MSG_COUNT                      13572  13008
# robx=0.395873
# min_dev=0.020
# robs=0.1778
# spam_cutoff=0.5000
# ham_cutoff=0.100
# db.h4 tuned: fp
# 1
# 1
# db.h4 tuned: fn   
# 188
# 198

# db.hx
# .MSG_COUNT                      13938  13017
# robx=0.496740
# min_dev=0.020
# robs=0.3162  
# spam_cutoff=0.6126      # for 0.01% fpos (2); expect 1.54% fneg (401).
# ham_cutoff=0.100        
# db.hx tuned: fp         
# 1
# 1
# db.hx tuned: fn
# 198
# 205

# See if maxiter 1 but repeated randomtrain does any better
# first training-on-error registered 235 spam, 27 nonspam  
mkdir db.r4 db.rx
cp db.h/wordlist.db db.r4
for i in 2 3 4; do randomtrain db.r4 s sp.t1 n ns.t1; done
# db.r4, max reg is 1
#  spam  reg   good  reg
# 13032  135  12977    4
# db.r4, max reg is 1   
#  spam  reg   good  reg
# 13032  100  12977    1
# db.r4, max reg is 1   
#  spam  reg   good  reg
# 13032   84  12977    1
cp db.r4/wordlist.db db.rx
for i in 5 6 7 8 9 10; do randomtrain db.rx s sp.t1 n ns.t1; done
# db.rx, max reg is 1
#  spam  reg   good  reg
# 13032   67  12977    1
# db.rx, max reg is 1   
#  spam  reg   good  reg
# 13032   61  12977    3
# db.rx, max reg is 1   
#  spam  reg   good  reg
# 13032   57  12977    1
# db.rx, max reg is 1   
#  spam  reg   good  reg
# 13032   50  12977    2
# db.rx, max reg is 1   
#  spam  reg   good  reg
# 13032   47  12977    1
# db.rx, max reg is 1   
#  spam  reg   good  reg
# 13032   46  12977    1

# Get database sizes
ls -lk db.r*/wordlist.db
# -rw-r--r--  1 root root 25612 Mar 15 13:08 db.r4/wordlist.db
# -rw-r--r--  1 root root 25612 Mar 15 15:31 db.rx/wordlist.db

for db in db.r4 db.rx; do
    echo "# $db" | tee $db.tune
    bogoutil -w $db .MSG_COUNT | tail -n 1 | tee -a $db.tune
    bogotune -d $db -T 3 -v -s sp.r0 sp.r1 -n ns.r0 ns.r1 | tee -a $db.tune
    eval `grep '=' $db.tune | awk '{print $1}'`
    echo "$db tuned: fp" | tee -a $db.tune
    bogofilter -d $db -vM -m $min_dev,$robs,$robx \
        -o $spam_cutoff,$ham_cutoff 
[© Greg Louis <glouis@dynamicro.on.ca>, 2004; last modified 2004-03-15]