Catching Comment Spam in a Honeypot

How a tempting target can reveal automated spammers.

A couple years ago I wrote about using Akismet to catch spam.

Since then, Akismet has successfully captured tens of thousands of spam comments to this site. Since I'm not comfortable completely accepting the results from a Bayesian filter, I've dutifully been stuffing them into my database anyway. It is getting a little silly, though:

$ sqlite3 main.sqlite
sqlite> SELECT is_spam, count(1) FROM blog_comments GROUP BY is_spam;

Ouch. Let's clean that out and see what happens.

$ cp main.sqlite bak.sqlite
$ sqlite3 main.sqlite
sqlite> DELETE FROM blog_comments WHERE is_spam AND NOT visible;
sqlite> vacuum;
sqlite> .quit
$ ls -l
-rw-rw----  1 mikeboers mikeboers 19905536 Sep  7 16:35 bak.sqlite
-rw-rw----  1 mikeboers mikeboers  2811904 Sep  7 16:37 main.sqlite

Roughly 17 MB of my 20 MB database was spam comments!

In my first post I outlined the various methods of spam detection: manual auditing, captchas, honeypots, and contextual filtering (i.e. Akismet). Let's quickly add another one of these to substantially increase our confidence.

A honeypot is an alluring trap designed for the target in mind. In our case, we are targeting automated spam, so we want to set a trap that automated systems will fall for, but humans will not.

A typical spam honeypot is an extra field in the comment form that, from a script's perspective (looking at the HTML source), appears to be a reasonable field to fill. However, we hide it from the user, so that if the field is filled we know (with reasonable certainty) that it was not entered by a human.

An example form could be:

  <label id="firstname">
    Name: <input name="firstname" />
  </label>
  <label id="lastname">
    Leave empty (to detect SPAM): <input name="lastname" />
  </label>

The label and input are then hidden via CSS: #lastname { display: none; }. Now all we need to do is check whether "lastname" came back non-empty in the POST (a browser still submits the hidden field, just with an empty value), and 99% of the time we find that spam has filled it in.
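The server-side check can be sketched in Python as follows. This is a minimal illustration, not the code running on this site: `honeypot_triggered` is a hypothetical helper, and `form` is assumed to be a plain dict of the submitted POST fields.

```python
def honeypot_triggered(form, field="lastname"):
    """Return True if the hidden honeypot field came back with content.

    `form` is assumed to be a dict-like mapping of POST field names to
    string values; `field` is the name of the hidden input.
    """
    # A missing field and an empty field are both treated as "not filled".
    value = form.get(field) or ""
    # Whitespace-only values count as empty.
    return bool(value.strip())
```

A submission from a real browser would typically carry an empty "lastname", so `honeypot_triggered({"firstname": "Mike", "lastname": ""})` is false, while a bot that fills every field it sees trips the check.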

However, the honeypot alone can still suffer from false positives, so I only completely throw away a comment if both the honeypot and Akismet agree that a new comment is spam. Otherwise it gets stuffed into the database, and an email is sent off asking a human to review it.
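That decision rule is small enough to write down directly. The sketch below assumes the two verdicts have already been computed as booleans; `triage_comment` and the two action names are hypothetical and only illustrate the logic described above.

```python
def triage_comment(honeypot_hit, akismet_says_spam):
    """Decide what to do with a new comment.

    Returns "discard" only when both signals agree the comment is spam;
    every other combination returns "store_and_review", meaning the
    comment is saved and a human is emailed to look at it.
    """
    if honeypot_hit and akismet_says_spam:
        return "discard"
    return "store_and_review"
```

Requiring both signals to agree means a single false positive from either the honeypot or Akismet never silently deletes a legitimate comment.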

See the final implementation for my own site.
