Post Archive
Catching Comment Spam in a Honeypot
How a tempting target can reveal automated spammers.
A couple of years ago I wrote about using Akismet to catch spam.
Since then, Akismet has successfully flagged tens of thousands of spam comments on this site. But since I'm not comfortable blindly accepting the verdict of a Bayesian filter, I've dutifully been stuffing them all into my database, and it is getting a little silly:
$ sqlite3 main.sqlite
sqlite> SELECT is_spam, count(1) FROM blog_comments GROUP BY is_spam;
0|30
1|13656
Ouch. Let's clean that out and see what happens.
$ cp main.sqlite bak.sqlite
$ sqlite3 main.sqlite
sqlite> DELETE FROM blog_comments WHERE is_spam AND NOT visible;
sqlite> vacuum;
sqlite> .quit
$ ls -lh
-rw-rw---- 1 mikeboers mikeboers 19905536 Sep  7 16:35 bak.sqlite
-rw-rw---- 1 mikeboers mikeboers  2811904 Sep  7 16:37 main.sqlite
17MB of my 20MB database was spam comments!
In my first post I outlined the various methods of spam detection: manual auditing, captchas, honeypots, and contextual filtering (i.e. Akismet). Let's quickly add another one of these to further increase our confidence.
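Since the honeypot is the new addition here, a quick sketch of the idea: the comment form gains an extra input that is hidden with CSS, and any submission that fills it in gets flagged as spam. The field name, route, and storage stub below are hypothetical, not this site's actual code.

# Hypothetical honeypot check; names are illustrative only.
from flask import Flask, request

app = Flask(__name__)

# The template hides the trap field from humans, e.g.:
#   <input name="last_name" style="display: none" tabindex="-1" autocomplete="off">

def save_comment(post_id, comment):
    """Stub: the real site would insert into the blog_comments table."""

@app.route('/blog/<int:post_id>/comment', methods=['POST'])
def post_comment(post_id):
    comment = {
        'author': request.form.get('name', ''),
        'content': request.form.get('content', ''),
        # A human never sees this field, so anything in it means a robot.
        'is_spam': bool(request.form.get('last_name')),
    }
    save_comment(post_id, comment)
    # Respond identically either way so the bot can't tell it was caught.
    return 'Thanks for your comment!'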
Friendlier (and Safe) Blog Post URLs
Until very recently, the URLs for individual blog posts on this site looked something like:
http://mikeboers.com/blog/601/friendlier-and-safe-blog-post-urls
The 601 is the ID of this post in the site's database. I have always had two issues with this:
- The ID is meaningless to the user, but it is what drives the site.
- The title is meaningless to the site (you could change it to whatever you want), but it is what appears important to the user.
What they would ideally look like is:
http://mikeboers.com/blog/friendlier-and-safe-blog-post-urls
But since I tend to quickly get a new post up and then edit it a dozen times before I am satisfied (title included), the URL would not be stable, and the implementations I have seen in other blog platforms force the URL to keep the post's original title rather than its current one.
So I have come up with something more flexible: it gives me URLs very similar to what I want, while allowing (relatively) safe changes to the title of the post (and therefore the URL).
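The post doesn't need the details to make its point, but one way to get this behaviour (a sketch of the general idea, not necessarily how this site implements it) is to remember every slug a post has ever had: the current title drives the canonical URL, and any older slug simply redirects to it.

import re

def slugify(title):
    # Lowercase the title and collapse anything non-alphanumeric into dashes.
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

# post_id -> current slug, plus every slug each post has ever been published under.
current_slug = {601: 'friendlier-and-safe-blog-post-urls'}
slug_history = {
    'friendlier-and-safe-blog-post-urls': 601,
    'friendlier-blog-post-urls': 601,  # hypothetical earlier title
}

def resolve(slug):
    """Map a requested /blog/<slug> to (post_id, needs_redirect)."""
    post_id = slug_history.get(slug)
    if post_id is None:
        return None, False
    return post_id, current_slug[post_id] != slug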
Cleaning Comments with Akismet
My site recently (finally) started to get hit by automated comment spam. There are a few ways that one can traditionally deal with this sort of thing:
- Manual auditing: Manually approve each and every comment made to the website. Given the low volume of comments I currently get, this wouldn't be too much of a hassle, but what fun would that be?
- Captchas: Force the user to prove they are human. reCAPTCHA is the nicest in the field, but even it has been broken. And captchas don't stop humans who are being paid (very little) to post spam.
- Honeypots: Add an extra field¹ to the form (e.g. a last name, which I currently do not ask for) that is hidden by CSS. If it is filled out, one can assume a robot did it and mark the comment as spam. This still doesn't beat humans.
- Contextual filtering: Use Bayesian spam filtering to score every comment as it comes in. By correcting incorrect classifications we slowly improve the quality of the filter. This is the only automated method that can catch humans.
I decided to go with the last option, as offered by Akismet, the fine folks who also provide Gravatar (which I have talked about before). They have a free API (for personal use) that is really easy to integrate into whatever project you are working on.
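For reference, the core of the integration is a single comment-check call. Here is a minimal sketch using the requests library; the API key and comment details are placeholders:

import requests

AKISMET_KEY = 'your-api-key'       # placeholder
BLOG_URL = 'http://mikeboers.com/'

def is_spam(user_ip, user_agent, author, content):
    resp = requests.post(
        'https://%s.rest.akismet.com/1.1/comment-check' % AKISMET_KEY,
        data={
            'blog': BLOG_URL,
            'user_ip': user_ip,
            'user_agent': user_agent,
            'comment_type': 'comment',
            'comment_author': author,
            'comment_content': content,
        })
    # Akismet replies with the literal string "true" or "false".
    return resp.text.strip() == 'true'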
Now it is time to try it out. I've been averaging about a dozen automated spam comments a day. With luck, none of them will show up here.
*crosses his fingers*
Update:
I was just in touch with Akismet support to offer them a suggestion regarding their documentation. Out of nowhere they took a look at the API calls I was making to their service and pointed out how I could modify them to make my requests more effective at catching spam!
That is spectacular support!
1. The previously linked article is dead as of Sept. 2014. ↩
RoboHash and Gravatar
I recently discovered a charming web service called RoboHash which returns an image of a robot deterministically as a function of some input text. Take a gander at a smattering of random robots:
These would make an awesome fallback avatar for anyone without a Gravatar set up, since the same email address will always give you the same robot. So of course I implemented it for this site!
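The fallback can be wired up by pointing Gravatar's `d` (default image) parameter at a RoboHash URL keyed on the same email hash. This is a sketch of the general approach, not necessarily the exact code running here; the sizes and URLs are illustrative.

import hashlib
import urllib.parse

def avatar_url(email, size=80):
    # Gravatar identifies people by the MD5 of their normalized email address.
    email_hash = hashlib.md5(email.strip().lower().encode('utf-8')).hexdigest()
    # Keying RoboHash on the same hash means the same email always yields the same robot.
    fallback = 'https://robohash.org/%s.png?size=%dx%d' % (email_hash, size, size)
    # Gravatar serves its own image if one exists, otherwise redirects to the fallback.
    return 'https://www.gravatar.com/avatar/%s?%s' % (
        email_hash, urllib.parse.urlencode({'s': size, 'd': fallback}))

print(avatar_url('someone@example.com'))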
New Cards
Finally about to send my new cards off to the printer. First take a look at the front:
The background and the words on it are both black. The words, however, are a rich black, so they should appear glossy against the flat background. The template I'm giving the printer has ~20 cards on it, so I also get slight variation in which words actually appear on each card.
The back is more interesting:
Every card has a unique code on it, along with a QR code that can be scanned with a phone. This should be useful whenever I give someone a card for a specific purpose, since I can customize what is displayed to them when they follow that link. I spent way too much time designing a lovely crypto system to drive the codes before deciding to simply generate them randomly and keep track of them. There are ~35 trillion possible codes of this size, and I'll use at most a couple thousand, so as long as I do a little rate limiting on the website (say, about an hour of delay after 10 wrong codes) nobody should ever be able to guess one.
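For a sense of the numbers: 9 characters drawn from a 32-character alphabet gives 32^9 ≈ 35 trillion combinations, which lines up with the figure above, so that is the (assumed) shape used in this sketch; the actual alphabet and length on the cards may differ. Generating and tracking the codes is then trivial:

import secrets

# Assumed parameters: 32-character alphabet (0/1/l/o dropped to avoid
# confusion) and 9-character codes, so 32**9 ~= 35 trillion possibilities.
ALPHABET = '23456789abcdefghijkmnpqrstuvwxyz'
CODE_LENGTH = 9

def new_code(issued):
    """Generate a random code that hasn't been handed out yet and record it."""
    while True:
        code = ''.join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
        if code not in issued:
            issued.add(code)
            return code

issued = set()
print(new_code(issued))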