Parsing Python with Python

(Ab)using the tokenize module.

A few years ago I started writing PyHAML, a Pythonic version of HAML for Ruby.

Since most of the HAML syntax is pretty straightforward, PyHAML's parser uses a series of regular expressions to get the job done. This proved inadequate any time there was Python source to be isolated, since Python isn't quite so straightforward to parse.

The earliest thing to bite me was nested parentheses in tag definitions. In PyHAML you can specify a link via %a(href=""), essentially treating the %a tag as a function which accepts keyword arguments. The very next thing you will want to do is include a function call, e.g. %a(href=url_for('my_endpoint')).

At this point, you are going to have A Bad Time™ with regular expressions, as they can't deal with arbitrarily deep nesting. I "solved" this particular problem by scanning character by character until the parentheses balance, with something similar to:

def split_balanced_parens(line):
    depth = 0
    for pos, char in enumerate(line):
        depth += {'(': 1, ')': -1}.get(char, 0)
        if not depth:
            return line[:pos+1], line[pos+1:]
    return '', line

And things were great with PyHAML for a long time, until a number of odd restrictions started getting in the way. For example, you can't have a closing parenthesis in a string in a tag (like %img(title="A sad face looks like ):")), you can't have a colon in a control statement, and statements can't span lines via unbalanced brackets.

If only you could use Python to tokenize Python without fully parsing it...
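As a rough sketch of that idea (not PyHAML's actual parser), the standard library's tokenize module can do the bracket counting, so that a parenthesis inside a string literal arrives as part of a STRING token and is never counted. The function name split_balanced_tokens is hypothetical, and the sketch assumes a single-line input that starts with the opening bracket:

```python
import io
import tokenize

OPENERS = {'(', '[', '{'}
CLOSERS = {')', ']', '}'}

def split_balanced_tokens(line):
    """Split a single line after the bracket that closes its first bracket.

    Hypothetical helper: assumes `line` starts with an opening bracket.
    Returns ('', line) when no balanced group is found.
    """
    depth = 0
    tokens = tokenize.generate_tokens(io.StringIO(line).readline)
    try:
        for tok in tokens:
            if tok.type == tokenize.OP and tok.string in OPENERS:
                depth += 1
            elif tok.type == tokenize.OP and tok.string in CLOSERS:
                depth -= 1
                if depth == 0:
                    # tok.end is (row, column); the input is one line,
                    # so the column is an index into `line`.
                    end = tok.end[1]
                    return line[:end], line[end:]
            if depth <= 0:
                # The line did not start with an opening bracket.
                break
    except (tokenize.TokenError, SyntaxError):
        # Raised for unbalanced brackets or otherwise untokenizable input.
        pass
    return '', line
```

With this, split_balanced_tokens('(title="A sad face looks like ):") rest') splits after the real closing parenthesis rather than the one in the string.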

"Are we getting the same SSL cert?"

A tiny webapp to help answer that question.

A friend had an SSL scare on public WiFi while in a coffee shop today. Her browser was warning her that every SSL certificate was invalid (except for * Eventually it stopped, and she overheard others in the coffee shop commenting that their tablet was finally able to connect (it was previously refusing).

I'm not sure, but this could be a man in the middle attack on the WiFi, in which the attacker (somehow) had a valid Google certificate and provided DNS records to point at their own machine.

In this scenario the browser is perfectly content to allow you to connect to this spoofed service. If you are not extremely familiar with SSL certificate authorities, a good way to assert a cert is not a forgery is to compare it to a known-good copy of the certificate. If the signatures match, then you are good to go.
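A minimal sketch of such a comparison, using only the standard library. It compares SHA-256 fingerprints of the DER-encoded certificates as a stand-in for "the signatures match"; the function names are hypothetical, and fetch_pem performs no validation at all, it simply returns whatever the server presents:

```python
import hashlib
import ssl

def fetch_pem(host, port=443):
    """Return the PEM certificate the server presents (not validated)."""
    return ssl.get_server_certificate((host, port))

def fingerprint(pem):
    """Return the SHA-256 hex digest of a PEM certificate's DER bytes."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()

def same_cert(local_pem, known_good_pem):
    """True if both PEM strings encode byte-identical certificates."""
    return fingerprint(local_pem) == fingerprint(known_good_pem)
```

Here local_pem would come from fetch_pem() on the suspect network, and known_good_pem from somewhere you trust.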

But where can you get a known-good copy?

To answer this question, I quickly made the SSL Cert Fetcher (the source of which is available on GitHub).

Take a look at the certificates for Twitter, Google, and Facebook and see if they match what you are getting. (I'll sit here with my fingers crossed for a while.)

Which Python?

Finding packages, modules, and entry points.

which is a handy utility for discovering the location of executables in your shell; it prints the full path to the first executable on your $PATH with the given name:

$ which python

In a similar vein, I often want to know the location of a Python package or module that I am importing, or what entry points are available in the current environment.

Let's write a few tools which do just that.
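As a starting point for the module half of that, here is a small sketch built on importlib.util.find_spec. The name which_module is hypothetical, and entry points are left out entirely:

```python
import importlib.util

def which_module(name):
    """Return the origin of module `name`, or None if it cannot be found.

    The origin is usually a file path, but can be a label such as
    'built-in' or 'frozen' for modules that have no file on disk.
    Looking up a dotted name imports its parent packages.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # A missing parent package, or a module with a broken __spec__.
        return None
    return None if spec is None else spec.origin
```

For example, which_module('json') should report where the standard library's json package lives in the current environment.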

Where Does the `sys.path` Start?

Constructing Python's `import` path

Importing modules in Python seems simple enough on the surface: import mymodule looks across the sys.path until it finds your module. But where does the sys.path itself come from?

Sure, there is a $PYTHONPATH variable which "augments the default search path for module files", but what is the default search path, how is it "augmented", how does easy_install or pip fit into this, and where does my package manager install modules?
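One way to start poking at those questions is to label each entry of sys.path with a best guess at where it came from. This is only a sketch: classify_sys_path is a hypothetical name, the 'other' label lumps together the script directory, the standard library, and anything added by .pth files, and it assumes an interpreter whose site module provides getsitepackages() (some older virtualenvs did not).

```python
import os
import site
import sys

def classify_sys_path():
    """Return (entry, label) pairs for every entry on sys.path."""
    env = os.environ.get('PYTHONPATH', '')
    from_env = {os.path.abspath(p) for p in env.split(os.pathsep) if p}

    site_dirs = {os.path.abspath(p) for p in site.getsitepackages()}
    site_dirs.add(os.path.abspath(site.getusersitepackages()))

    labelled = []
    for entry in sys.path:
        # An empty entry means the current directory; abspath handles that.
        path = os.path.abspath(entry)
        if path in from_env:
            label = 'PYTHONPATH'
        elif path in site_dirs:
            label = 'site-packages'
        else:
            label = 'other'
        labelled.append((entry, label))
    return labelled
```

Printing the result with and without $PYTHONPATH set shows where the "augmenting" entries land relative to the defaults.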

When Two Packages Fight for a Name

Pillow, the [un]friendly fork of PIL.

Many of us have had the "pleasure" of working with a pair of forked projects. Normally, this is an exercise in patience and reading code with extreme precision, but sometimes it is a whole other level of frustration.

In particular, I have spent a lot of time banging my head on FFmpeg and Libav, both of which generally provide identically named shared libraries which provide identically named exports, but are slowly diverging and offer slightly different functionality. Perhaps I am a complete n00b, but I have found it anything but easy to anticipate which one I will get when I install or later call upon anything prefixed with "ffmpeg" or "av", let alone develop code against them.

Because of that, I was very concerned when I started taking a serious look at Pillow, the self-dubbed "friendly" fork of PIL. On the surface, I am in love with the project and its call to action. However, I am afraid of the project's assumption of the "PIL" namespace, and how that will inevitably break other code in interesting ways.

Dictionary Building for Word-Search

Mining Wikipedia for a "geek" lexicon.

One of the local pubs styles itself after geek/nerd culture (e.g. sci-fi, fantasy, and board games). The back of each coaster features a word-search which purports to contain 45 "geek words and phrases".

Writing code to solve a word-search isn't particularly tricky (if you remember to use a prefix trie) as long as you have a list of words to find, but in this case we are given no such clues.
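For reference, the "not particularly tricky" part might look something like this sketch: a prefix trie built from nested dicts, walked in all eight directions from every cell. The names build_trie and solve, and the use of None as the end-of-word key, are choices made for illustration.

```python
def build_trie(words):
    """Build a nested-dict prefix trie; the None key marks a complete word."""
    root = {}
    for word in words:
        node = root
        for char in word.upper():
            node = node.setdefault(char, {})
        node[None] = word
    return root

# All eight compass directions as (row step, column step).
DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]

def solve(grid, words):
    """Map each word found to (row, col, row step, col step) of its start."""
    trie = build_trie(words)
    found = {}
    for r in range(len(grid)):
        for c in range(len(grid[r])):
            for dr, dc in DIRECTIONS:
                node, rr, cc = trie, r, c
                while 0 <= rr < len(grid) and 0 <= cc < len(grid[rr]):
                    node = node.get(grid[rr][cc].upper())
                    if node is None:
                        break  # no word continues with this letter
                    if None in node:
                        found.setdefault(node[None], (r, c, dr, dc))
                    rr, cc = rr + dr, cc + dc
    return found
```

The trie lets each walk stop as soon as no word shares the prefix read so far, instead of checking every word against every position.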

But, since we are tremendously lazy, how can we solve this with code anyways?

Autocompleting Python Modules

Simplifying the search for modules to execute from a shell.

The last few times I overhauled an execution environment I required people to execute the bulk of their tools via python -m package.module instead of python package/ (to enable the development environment).

The downside is that you lose shell autocompletion, which can be a big deal if you have dozens of tools that you only occasionally use.

This addition to your ~/.bashrc fixes that.
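The lookup behind such a completion can be sketched in Python, which a shell completion function could then call. This is an illustrative stand-in rather than the addition itself; complete_module is a hypothetical name, and completing a nested name imports its parent packages along the way.

```python
import importlib.util
import pkgutil

def complete_module(prefix):
    """Return the importable module names that start with `prefix`."""
    package, _, partial = prefix.rpartition('.')
    if not package:
        # Top-level names: scan every location on sys.path.
        names = {m.name for m in pkgutil.iter_modules()}
    else:
        try:
            spec = importlib.util.find_spec(package)
        except (ImportError, ValueError):
            return []
        if spec is None or spec.submodule_search_locations is None:
            return []  # not found, or a plain module with no submodules
        names = {m.name
                 for m in pkgutil.iter_modules(spec.submodule_search_locations)}
    dotted = package + '.' if package else ''
    return sorted(dotted + name for name in names if name.startswith(partial))
```

For example, complete_module('json.de') should offer 'json.decoder'.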

Is this an amazing, or terrible idea?

Let's define a classproperty in Python such that it works as a property on both a class and an instance:

class classproperty(object):

    def __init__(self, func):
        self.func = func

    def __get__(self, obj, cls):
        return self.func(cls, obj)

It can be used thusly:

class Example(object):

    @classproperty
    def prop(cls, obj):
        return obj or cls

x = Example()
assert x.prop is x
assert Example.prop is Example

Is this a good idea, or a bad idea?

(Hint: I don't know.)

