Since then, Akismet has successfully captured tens of thousands of spam comments to this site. However, since I'm not comfortable completely accepting the results from a Bayesian filter, I've dutifully been stuffing them into my database. It is getting a little silly, though:
$ sqlite3 main.sqlite
sqlite> SELECT is_spam, count(1) FROM blog_comments GROUP BY is_spam;
0|30
1|13656
Ouch. Let's clean that out and see what happens.
$ cp main.sqlite bak.sqlite
$ sqlite3 main.sqlite
sqlite> DELETE FROM blog_comments WHERE is_spam AND NOT visible;
sqlite> vacuum;
sqlite> .quit
$ ls -lh
-rw-rw---- 1 mikeboers mikeboers 19905536 Sep 7 16:35 bak.sqlite
-rw-rw---- 1 mikeboers mikeboers 2811904 Sep 7 16:37 main.sqlite
17MB of my 20MB database was spam comments!
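If you would rather script that cleanup than type it into the sqlite3 shell, here is a minimal sketch using Python's standard `sqlite3` module. It assumes the same `blog_comments` table with `is_spam` and `visible` columns as in the session above; the function name `purge_spam` and its return value are my own invention for illustration, not something from the site's actual code.

```python
import os
import shutil
import sqlite3


def purge_spam(db_path, backup_path):
    """Back up the database, delete hidden spam comments, reclaim the space.

    Hypothetical helper: returns (rows_deleted, size_before, size_after),
    with both sizes in bytes.
    """
    # Keep a copy first, just like the `cp` in the shell session.
    shutil.copyfile(db_path, backup_path)
    size_before = os.path.getsize(db_path)

    # isolation_level=None puts the connection in autocommit mode;
    # VACUUM cannot run while a transaction is still open.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        cur = conn.execute(
            "DELETE FROM blog_comments WHERE is_spam AND NOT visible"
        )
        deleted = cur.rowcount
        # Rebuild the file so the freed pages are returned to the filesystem.
        conn.execute("VACUUM")
    finally:
        conn.close()

    return deleted, size_before, os.path.getsize(db_path)
```

Calling `purge_spam("main.sqlite", "bak.sqlite")` should mirror the manual session: the backup keeps the old size, and the main file shrinks only because of the `VACUUM` step. Without it, SQLite keeps the freed pages around for reuse and the file stays as large as before.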
In my first post I outlined the various methods of spam detection: manual auditing, captchas, honeypots, and contextual filtering (i.e. Akismet). Let's quickly add another one of these to greatly increase our confidence.