Is Bogofilter Scaleable?

The bogofilter program is extremely well suited for use by an individual, moderately savvy un*x user.  Such a person will have no difficulty maintaining the training database and correcting any errors that may creep in over time.  Bogofilter, however, was written with a design goal of being fast enough to use on a busy mail server with lots of users.  I'm getting ready to roll it out on a site-wide basis, and I'm wondering how to keep the training-set quality up without having to devote a lot of my own time to it.  I think my site is small enough that this problem can be solved fairly easily; but the solution I have in mind clearly won't work for sites ten times the size of mine.  Hence the question that's the title of this paper.

The bogofilter manual says

Integration with Mail Transport Agent (MTA)

bogofilter can also be integrated into an MTA to filter all incoming
mail. While the specific implementation is MTA dependent, the general
steps are as follows:

1. Install bogofilter on the mail server

2. Prime the bogofilter databases with a spam and non-spam corpus.
   Since bogofilter will be serving a larger community, it is important
   to prime it with a representative set of messages.

3. Set up the MTA to invoke bogofilter on each message. While this is
   an MTA specific step, you'll probably need to use the -p, -u, and -e
   options.

4. Set up a mechanism for users to register spam/nonspam messages, as
   well as to correct mis-classifications. The most generic solution is
   to set up alias email addresses to which users bounce messages.

It's my contention that the -u option is a bad thing and should never be used.  That's because bogofilter (and any other spam filter) has a nonzero error rate, and letting it update its training database based on its own decisions is demonstrably unhelpful (see this test, for example).  For a single user, with one or two errors a day, this is not too serious a problem; but if bogofilter is processing thousands of messages between corrections, there will be many errors in the training set most of the time, and suboptimal performance as a result.
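As a rough illustration of the MTA step with -u left out, here is a minimal Python sketch. It assumes bogofilter is on the PATH and that the passthrough output carries an X-Bogosity header whose first comma-separated field is the verdict; the header name and its wording vary with bogofilter version and configuration, so treat both as assumptions. The function names are hypothetical and not part of bogofilter.

```python
from email import message_from_string


def classify_command(auto_update=False):
    """Build the argv an MTA hook might use to pipe one message through bogofilter.

    -p passes the message through with a verdict header added; -e makes the
    exit code 0 whenever classification succeeds. -u (update the wordlist
    from bogofilter's own verdict) is off by default here, following the
    argument above that it should not be used.
    """
    argv = ["bogofilter", "-p", "-e"]
    if auto_update:
        argv.append("-u")
    return argv


def bogosity_verdict(filtered_message):
    """Return the first field of the X-Bogosity header, or None if absent.

    Assumption: the header looks like 'X-Bogosity: <verdict>, tests=...'.
    """
    header = message_from_string(filtered_message).get("X-Bogosity")
    if header is None:
        return None
    return header.split(",")[0].strip()
```

The MTA (or a delivery agent such as procmail) would feed each message to the command on standard input and file the result according to the verdict; nothing is written to the training database at this stage.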

Any sysadmin will tell you that expecting ordinary users to register spam/nonspam messages is... naïve.  The most we can hope for is that, if we set up a mail alias for spam, some portion of the false negatives will be fed to it.  Also fed to it, however, will be legitimate mail from people the user is mad at, and all sorts of such guff.  Manual inspection will be essential.  Worse: there's a strong possibility that bogofilter will be misled by the extra headers that bounced email will carry.

Where does this leave us?

Well, I can do that manual inspection myself.  I have around 100 users, who get some 1200 emails a day of which 350 are spam.  It doesn't take long to scan 350 emails and pick out any false positives.  Say five to ten minutes a day including the training-set updates.  Tolerable.  And if for some reason I can't get to it for a few days, at least the training set grows no worse.
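One way the hand-checked results could be fed back into the training set is sketched below. It assumes each message that bogofilter flagged as spam has been saved to its own file, and that the reviewer supplies the list of files found to be false positives; the file layout and function names are hypothetical. The -s and -n options register a message read from standard input as spam or nonspam respectively.

```python
import subprocess


def training_plan(flagged_as_spam, false_positives):
    """Pair each reviewed message file with the bogofilter registration command.

    Messages the reviewer marked as false positives are registered as
    nonspam (-n); everything else in the flagged batch is registered as
    spam (-s).
    """
    false_positive_set = set(false_positives)
    plan = []
    for path in flagged_as_spam:
        flag = "-n" if path in false_positive_set else "-s"
        plan.append((["bogofilter", flag], path))
    return plan


def apply_plan(plan):
    """Run each registration, feeding the message file on standard input."""
    for argv, path in plan:
        with open(path, "rb") as message_file:
            subprocess.run(argv, stdin=message_file, check=True)
```

Because nothing is registered until the reviewer has looked at the batch, skipping a few days leaves the training set unchanged rather than letting it drift.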

Now imagine a site just ten times the size of mine.  Hand-checking 3500 emails every day is a bit of a pain, right?  What about a site fifty times the size of mine?
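To put rough numbers on this, the sketch below scales the figures given above (100 users, 350 spam a day, five to ten minutes of checking). It assumes that both spam volume and checking time grow linearly with the number of users, which is a simplification and not something measured.

```python
BASE_USERS = 100
BASE_SPAM_PER_DAY = 350
BASE_MINUTES = (5, 10)  # low and high estimates for the 100-user site


def daily_review_load(scale):
    """Return (spam per day, low minutes, high minutes) for a site `scale` times larger.

    Assumption: everything grows in direct proportion to site size.
    """
    spam = BASE_SPAM_PER_DAY * scale
    low, high = (minutes * scale for minutes in BASE_MINUTES)
    return spam, low, high


for scale in (1, 10, 50):
    spam, low, high = daily_review_load(scale)
    print(f"{BASE_USERS * scale} users: {spam} spam/day, "
          f"roughly {low}-{high} minutes of checking")
```

Under that assumption a site ten times the size needs 50 to 100 minutes of checking a day, and a site fifty times the size needs somewhere between four and eight hours.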

We haven't yet addressed the point (it's discussed in Graham's original paper, though) that Bayesian spam filtering is likely to be most efficient when it's done on an individual basis.  Nonspam emails for a large user population are, taken as a whole, less dissimilar from spam than any individual's nonspam email is likely to be.  I don't know whether this effect would be significant at the 100-user, the 1000-user or the 10000-user level.  It would have to be tested.

My conclusion is that bogofilter, and Bayesian filtering in general, aren't necessarily less labour-intensive, on a large scale, than rule-based spam filtering.

Note added two years later (2004-09-10)
With less rigorous training than described above, sites very much larger than mine -- 60,000 users in one case and 150,000 users in another -- are successfully using bogofilter, with one wordlist for all users, in their fight against spam.  See Blosser, J. and Josephsen, D., Scalable Centralized Bayesian Spam Mitigation with Bogofilter, 2004 LISA XVIII - November 14-19, 2004 - Atlanta, GA, USA, in press, for an account.

[Greg Louis, 2002; last modified 2004-09-10]