Recent Blog Posts


Streamlining MySQL Authentication

Quick tip for easy access.

I've gotten far too used to Postgres' ability to authenticate you by your system uid, and tire of the continual copy-paste of massive passwords for my MySQL servers.

However, there is a way to streamline this: create a .my.cnf file in your home directory that looks like:

[client]
user=myname
password=mypassword

Just make sure that you are the only one who can read it (chmod go= .my.cnf), and you are good to go!
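If you would rather script that step, here is a minimal sketch in Python. The function name write_my_cnf and its arguments are my own invention for illustration; the only things taken from above are the file layout and the owner-only permissions.

```python
import os


def write_my_cnf(path, user, password):
    """Write a MySQL client option file that only its owner can read."""
    content = "[client]\nuser={}\npassword={}\n".format(user, password)
    # Create the file with mode 0600 from the start, so the password is
    # never sitting on disk with looser permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # If the file already existed, os.open kept its old mode; tighten it.
    os.chmod(path, 0o600)
```

Calling write_my_cnf(os.path.expanduser("~/.my.cnf"), "myname", "mypassword") should leave you with the same file as the manual steps.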

Posted . Categories: .

Catching Comment Spam in a Honeypot

How a tempting target can reveal automated spammers.

A couple years ago I wrote about using Akismet to catch spam.

Since then, Akismet has successfully captured tens of thousands of spam comments to this site. However, since I'm not comfortable completely accepting the results from a Bayesian filter, I've dutifully been storing them in my database. It is getting a little silly, though:

$ sqlite3 main.sqlite
sqlite> SELECT is_spam, count(1) FROM blog_comments GROUP BY is_spam;
0|30
1|13656

Ouch. Let's clean that out and see what happens.

$ cp main.sqlite bak.sqlite
$ sqlite3 main.sqlite
sqlite> DELETE FROM blog_comments WHERE is_spam AND NOT visible;
sqlite> vacuum;
sqlite> .quit
$ ls -lh
-rw-rw----  1 mikeboers mikeboers 19905536 Sep  7 16:35 bak.sqlite
-rw-rw----  1 mikeboers mikeboers  2811904 Sep  7 16:37 main.sqlite

17MB of my 20MB database was spam comments!
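The same cleanup can be scripted. This is a rough sketch using Python's sqlite3 module; the function name purge_spam is made up, and it assumes nothing about the table beyond the is_spam and visible columns used in the queries above.

```python
import sqlite3

COUNT_QUERY = "SELECT is_spam, count(1) FROM blog_comments GROUP BY is_spam"


def purge_spam(path):
    """Delete hidden spam comments and return the counts before and after."""
    # isolation_level=None puts the connection in autocommit mode, which
    # matters because VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        before = dict(conn.execute(COUNT_QUERY))
        conn.execute("DELETE FROM blog_comments WHERE is_spam AND NOT visible")
        conn.execute("VACUUM")  # rebuild the file to reclaim the freed pages
        after = dict(conn.execute(COUNT_QUERY))
    finally:
        conn.close()
    return before, after
```

As with the shell session, make a copy of the database file before pointing this at it.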

In my first post I outlined the various methods of spam detection: manual auditing, captchas, honeypots, and contextual filtering (i.e. Akismet). Let's quickly add another one of these to greatly increase our confidence.
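To give a flavour of the honeypot approach: the common trick is a form field that is hidden from humans (usually with CSS) but that automated form-fillers happily complete. The check below is only a sketch of that general idea, not necessarily what the full post does; the field name "website" and the function name are hypothetical.

```python
def looks_like_bot(form, honeypot_field="website"):
    """Return True if the hidden honeypot field was filled in.

    `form` is assumed to be a dict-like mapping of submitted field names
    to string values. Humans never see the field, so it should be empty.
    """
    return bool(form.get(honeypot_field, "").strip())
```

A comment that trips this check can be flagged as spam alongside whatever Akismet reports.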

Read the full post...


Parsing Python with Python

(Ab)using the tokenize module.

A few years ago I started writing PyHAML, a Pythonic version of HAML for Ruby.

Since most of the HAML syntax is pretty straightforward, PyHAML's parser uses a series of regular expressions to get the job done. This proved generally inadequate any time there was Python source to be isolated, since Python isn't quite so straightforward to parse.

The earliest thing to bite me was nested parentheses in tag definitions. In PyHAML you can specify a link via %a(href="http://example.com"), essentially treating the %a tag as a function which accepts keyword arguments. The very next thing you will want to do is include a function call, e.g. %a(href=url_for('my_endpoint')).

At this point, you are going to have A Bad Time™ with regular expressions, since they can't deal with arbitrarily deep nesting. I "solved" this particular problem by scanning character by character until the parentheses are balanced, with something similar to:

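def split_balanced_parens(line):
    """Split `line` (assumed to start with "(") after its matching ")"."""
    depth = 0
    for pos, char in enumerate(line):
        # Track nesting: +1 for "(", -1 for ")", 0 for anything else.
        depth += {'(': 1, ')': -1}.get(char, 0)
        if not depth:
            # Back at depth zero: everything up to here is the balanced group.
            return line[:pos+1], line[pos+1:]
    # Never balanced: report no group and hand the line back untouched.
    return '', line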

And things were great with PyHAML for a long time, until a number of odd restrictions started getting in the way. For example, you can't have a closing parenthesis in a string in a tag (like %img(title="A sad face looks like ):")), you can't have a colon in a control statement, and statements can't span lines via unbalanced brackets.

If only you could use Python to tokenize Python without fully parsing it...
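As a taste of where this is heading, here is a rough sketch of the same splitter built on the standard library's tokenize module. It is my own illustration rather than PyHAML's actual code, and the function name is made up. Because the tokenizer hands back string literals as single tokens, a ")" inside a string no longer confuses the depth count.

```python
import io
import tokenize


def split_balanced_parens_tokenized(line):
    """Split `line` after the ")" that balances its first "(".

    Only real parenthesis tokens are counted, so parentheses inside
    string literals are ignored.
    """
    depth = 0
    readline = io.StringIO(line).readline
    try:
        for tok in tokenize.generate_tokens(readline):
            if tok.type != tokenize.OP:
                continue
            if tok.string == '(':
                depth += 1
            elif tok.string == ')':
                depth -= 1
                if depth == 0:
                    end = tok.end[1]  # column just past the closing paren
                    return line[:end], line[end:]
    except (tokenize.TokenError, SyntaxError):
        # Unbalanced or otherwise untokenizable input.
        pass
    return '', line
```

This sketch only handles a single line; anything the tokenizer rejects (such as input that never closes its parentheses) comes back unsplit, just like the character-scanning version.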

Read the full post...


"Are we getting the same SSL cert?"

A tiny webapp to help answer that question.

A friend had an SSL scare on public WiFi while in a coffee shop today. Her browser was warning her that every SSL certificate was invalid (except for *.google.com). Eventually it stopped, and she overheard others in the coffee shop commenting that their tablet was finally able to connect (it was previously refusing).

I'm not sure, but this could be a man-in-the-middle attack on the WiFi, in which the attacker (somehow) had a valid Google certificate and provided DNS records to point at their own machine.

In this scenario the browser is perfectly content to allow you to connect to this spoofed service. If you are not extremely familiar with SSL certificate authorities, a good way to check that a cert is not a forgery is to compare it to a known-good copy of the certificate. If their fingerprints match, then you are good to go.
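The comparison itself is easy to script. This is a minimal sketch using only Python's standard library; the function names are my own, and it compares SHA-256 fingerprints of the raw certificate, which is one reasonable way to check that two copies are identical.

```python
import hashlib
import ssl


def pem_fingerprint(pem_cert):
    """Return the SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def same_cert(host, known_good_pem, port=443):
    """Fetch the certificate `host` is serving us and compare it to a trusted copy."""
    # get_server_certificate returns the served certificate as PEM text.
    served_pem = ssl.get_server_certificate((host, port))
    return pem_fingerprint(served_pem) == pem_fingerprint(known_good_pem)
```

The hard part is not the comparison but obtaining known_good_pem over a connection you actually trust.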

But where can you get a known-good copy?

To answer this question, I quickly made the SSL Cert Fetcher (the source of which is available on GitHub).

Take a look at the certificates for Twitter, Google, and Facebook and see if they match what you are getting. (I'll sit here with my fingers crossed for a while.)


Which Python?

Finding packages, modules, and entry points.

which is a handy utility for discovering the location of executables in your shell; it prints the full path to the first executable on your $PATH with the given name:

$ which python
/usr/local/bin/python

In a similar vein, I often want to know the location of a Python package or module that I am importing, or what entry points are available in the current environment.

Let's write a few tools which do just that.
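For the module half of that, a minimal sketch could lean on importlib. The name which_module is hypothetical and this is not necessarily how the tools in the full post work; it simply asks the import system where it would load a name from.

```python
import importlib.util


def which_module(name):
    """Return where `name` would be imported from, or None if it can't be found.

    For packages this is the path to their __init__.py. Built-in and frozen
    modules report a label such as "built-in" rather than a file path.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec.
        return None
    return None if spec is None else spec.origin
```

Note that looking up a dotted name such as "package.module" imports the parent package as a side effect.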

Read the full post...


I find fellow #developers to be too afraid of floating point numbers. Yes, you were burned once, but they are still VERY useful.

@mikeboers on . Visit on Twitter.

It is silly that my bank displays comma-separated thousands, but does not accept them in forms; way to break copy-paste! #webdev

@mikeboers on . Visit on Twitter.

Where Does the `sys.path` Start?

Constructing Python's `import` path

Importing modules in Python seems simple enough on the surface: import mymodule looks across the sys.path until it finds your module. But where does the sys.path itself come from?

Sure, there is a $PYTHONPATH variable which "augments the default search path for module files", but what is the default search path, how is it "augmented", how does easy_install or pip fit into this, and where does my package manager install modules?
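A quick way to start poking at this is to label each sys.path entry by whether it came from $PYTHONPATH. The helper below is just an exploratory sketch with a made-up name; it does not explain where the remaining entries come from.

```python
import os
import sys


def label_sys_path(path=None, environ=None):
    """Pair each import path entry with "PYTHONPATH" or "other"."""
    path = sys.path if path is None else path
    environ = os.environ if environ is None else environ
    # $PYTHONPATH is a list of directories separated by os.pathsep.
    from_env = {
        os.path.abspath(entry)
        for entry in environ.get("PYTHONPATH", "").split(os.pathsep)
        if entry
    }
    return [
        (entry, "PYTHONPATH" if entry and os.path.abspath(entry) in from_env else "other")
        for entry in path
    ]
```

Running it with no arguments inspects the current interpreter; everything labelled "other" is what the rest of this post is about.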

Read the full post...


When Two Packages Fight for a Name

Pillow, the [un]friendly fork of PIL.

Many of us have had the "pleasure" of working with a pair of forked projects. Normally, this is an exercise in patience and reading code with extreme precision, but sometimes it is a whole other level of frustration.

In particular, I have spent a lot of time banging my head on FFmpeg and Libav, both of which generally provide identically named shared libraries which provide identically named exports, but are slowly diverging and offer slightly different functionality. Perhaps I am a complete n00b, but I have found it anything but easy to anticipate which one I will get when I install or later call upon anything prefixed with "ffmpeg" or "av", let alone develop code against them.

Because of that, I was very concerned when I started taking a serious look at Pillow, the self-dubbed "friendly" fork of PIL. On the surface, I am in love with the project and its call to action. However, I am afraid of the project's assumption of the "PIL" namespace, and how that will inevitably break other code in interesting ways.

Read the full post...


Dictionary Building for Word-Search

Mining Wikipedia for a "geek" lexicon.

One of the local pubs styles itself after geek/nerd culture (e.g. sci-fi, fantasy, and board games). The backs of the coasters feature a word-search that purports to contain 45 "geek words and phrases".

Writing code to solve a word-search isn't particularly tricky (if you remember to use a prefix trie) as long as you have a list of words to find, but in this case we are given no such clues.
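For reference, the "easy" half might look something like the sketch below: build a prefix trie from the word list, then walk outward from every cell in all eight directions, abandoning a direction as soon as the letters stop being a prefix of any word. The function names and the grid-as-list-of-strings layout are my own choices for illustration.

```python
END = "$end"  # multi-character key, so it can never collide with a grid letter

# All eight compass directions as (row step, column step).
DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def build_trie(words):
    """Build a nested-dict prefix trie; END marks where a whole word stops."""
    root = {}
    for word in words:
        node = root
        for char in word.upper():
            node = node.setdefault(char, {})
        node[END] = word
    return root


def solve(grid, words):
    """Map each word found to ((row, col), (row step, col step)) of its first hit."""
    trie = build_trie(words)
    found = {}
    for row in range(len(grid)):
        for col in range(len(grid[row])):
            for d_row, d_col in DIRECTIONS:
                node, r, c = trie, row, col
                while 0 <= r < len(grid) and 0 <= c < len(grid[r]):
                    node = node.get(grid[r][c].upper())
                    if node is None:
                        break  # no word starts with these letters
                    if END in node:
                        found.setdefault(node[END], ((row, col), (d_row, d_col)))
                    r, c = r + d_row, c + d_col
    return found
```

The trie is what keeps this cheap: each walk stops at the first letter that cannot continue any word, instead of testing every word against every position.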

But, since we are tremendously lazy, how can we solve this with code anyways?

Read the full post...

View posts before January 06, 2014