Parsing Python with Python

(Ab)using the tokenize module.

A few years ago I started writing PyHAML, a Pythonic version of Ruby's HAML.

Since most of the HAML syntax is pretty straightforward, PyHAML's parser uses a series of regular expressions to get the job done. This proved generally inadequate whenever there was Python source to be isolated, since Python isn't quite so straightforward to parse.

The earliest thing to bite me was nested parentheses in tag definitions. In PyHAML you can specify a link via %a(href="http://example.com"), essentially treating the %a tag as a function which accepts keyword arguments. The very next thing you will want to do is include a function call, e.g. %a(href=url_for('my_endpoint')).

At this point, you are going to have A Bad Time™ with regular expressions, as they can't deal with arbitrarily deep nesting. I "solved" this particular problem by scanning character by character until the parentheses are balanced, with something similar to:

def split_balanced_parens(line):
    # Assumes the line starts with an opening parenthesis.
    depth = 0
    for pos, char in enumerate(line):
        # Track nesting depth: +1 for "(", -1 for ")", 0 otherwise.
        depth += {'(': 1, ')': -1}.get(char, 0)
        if not depth:
            # Back to depth zero: split just after the closing paren.
            return line[:pos + 1], line[pos + 1:]
    # Never balanced; leave the line untouched.
    return '', line
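
A quick check with the earlier tag (the trailing text here is my own):

>>> split_balanced_parens("(href=url_for('my_endpoint')) link text")
("(href=url_for('my_endpoint'))", ' link text')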

And things were great with PyHAML for a long time, until a number of odd restrictions started getting in the way. For example, you can't have a closing parenthesis in a string in a tag (like %img(title="A sad face looks like ):")), you can't have a colon in a control statement, and statements can't span lines via unbalanced brackets.
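
The first of those restrictions falls straight out of the character scan above, which knows nothing about strings; the parenthesis inside the title ends the match early:

>>> split_balanced_parens('(title="A sad face looks like ):")')
('(title="A sad face looks like )', ':")')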

If only you could use Python to tokenize Python without fully parsing it...


Behold the tokenize module, which exposes Python's own internal tokenizer. This will allow us to take one tiny step further than our function above, and have it fully understand strings without us even having to try. It doesn't fully parse the given code, so there are many fewer cases where it can trip up on HAML being very much not Python.
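
To see why strings stop being a problem, run that same troublesome title through the tokenizer (a small sketch; the printing is mine):

import io
import tokenize

source = '(title="A sad face looks like ):")'
for token in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[token.type], repr(token.string))

The entire title arrives as a single STRING token, so the closing parenthesis inside it never looks like a bracket.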

Once you have a token iterator via tokenize.generate_tokens(), you can write a version of the above function which identifies the token that ends a bracketed statement:

import tokenize

def match_python_brackets(token_iter):
    # Expects the iterator to begin at an opening bracket; returns the
    # token that closes it, or None if the stream runs out first.
    brackets = {'(': ')', '[': ']', '{': '}'}
    stack = []
    for token in token_iter:
        type_, value, _, _, _ = token
        if type_ == tokenize.OP:
            if value in brackets:
                # Remember which closer this opener requires.
                stack.append(brackets[value])
            elif stack and value == stack[-1]:
                stack.pop()
                # The outermost bracket has just been closed.
                if not stack:
                    return token

Since tokens contain their line and column numbers, you can quickly work out how much of the original source was consumed. I'll leave the details as an exercise for the reader (or for anyone who wants to read PyHAML's source).
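
As a head start, here is a minimal sketch of the single-line case, building on match_python_brackets above (the helper name is mine, not PyHAML's):

import io
import tokenize

def split_balanced_brackets(line):
    # Stop at the token which closes the first bracket in the line.
    tokens = tokenize.generate_tokens(io.StringIO(line).readline)
    token = match_python_brackets(tokens)
    if token is None:
        return '', line
    # TokenInfo.end is a (row, column) pair; on a single line the
    # column indexes directly into the original string.
    _, end_col = token.end
    return line[:end_col], line[end_col:]

Extending this to multiple lines means looking at the row in token.end as well, and pulling lines from the source until you get there.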

Immediately we can put whatever we want into strings, and expressions (and control statements) can span multiple lines.
