Friendlier (and Safe) Blog Post URLs

Until very recently, the URLs for individual blog posts on this site looked something like:

http://mikeboers.com/blog/601/friendlier-and-safe-blog-post-urls

The 601 is the ID of this post in the site's database. I have always had two issues with this:

The ID is meaningless to the user, but it is what drives the site.
The title is meaningless to the site (you could change it to whatever you want), but it is what appears important to the user.

What they would ideally look like is:

http://mikeboers.com/blog/friendlier-and-safe-blog-post-urls

But I tend to get a new post up quickly and then edit it a dozen times (title included) before I am satisfied, so the URL would not be stable. The implementations I have seen in other blog platforms deal with this by forcing the URL to retain the original title of the post, not the current one.

So I have come up with something more flexible that gives me URLs very similar to what I want, but allows for (relatively) safe changes in the title of the post (and therefore the URL).

The trick is to do a string similarity test with the title in the requested URL against all of the blog posts. I achieve this with the Python stdlib difflib module:

import difflib

def title_similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()
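To get a feel for the numbers, here is a small sketch that compares a few requested titles against this post's canonical one. The two altered titles are made up for illustration; the ratio is 1.0 for identical strings and falls toward 0 as they diverge.

```python
import difflib

def title_similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return difflib.SequenceMatcher(None, a, b).ratio()

canonical = 'friendlier-and-safe-blog-post-urls'

# Hypothetical requested titles: exact, slightly wrong, and unrelated.
for requested in (canonical, 'friendlier-and-safer-blog-urls', 'about-me'):
    print(requested, round(title_similarity(requested, canonical), 2))
```

The slightly wrong title still scores well above 0.5, while the unrelated one stays well below it.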

See the blog controllers in this site's repo (specifically the do_with_title function) for how this is used to select the best blog post. Essentially, it does the following:

from nitrogen import status
from ..main import Response, render, Session
from ..blog import BlogPost

@route(r'/{title:.+}')
def example_controller(request):

    # Start a SQLAlchemy query; filter this by date ranges, tags, or any other info
    # that you can use to restrict the request.
    session = Session()
    query = session.query(BlogPost)

    # The requested title.
    title = request.route['title'].lower()

    # Iterate across all posts looking for the best title match.
    best_match = 0
    best_post = None
    for post in query:
        # The BlogPost class has a url_title property that calculates the
        # canonical title for URLs.
        match = title_similarity(title, post.url_title)
        if match > best_match:
            best_match = match
            best_post = post

    # An arbitrary level of precision is required.
    if best_match < 0.5:
        best_post = None

    # Assert we found a post.
    if not best_post:
        raise status.NotFound('could not find post')

    # Assert the canonical path.
    if request.script_root != best_post.url_path:
        raise status.SeeOther(location=best_post.url_path)

    # Render the page.
    return Response(render('/blog/single.html.mako',
        post=best_post,
    ))
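The selection logic in that controller can be tried without the framework or the database. The following is only a sketch over plain strings: best_title_match is a hypothetical helper, not a function from the site's repo, and the second title in the usage note is invented.

```python
import difflib

def title_similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def best_title_match(requested, url_titles, threshold=0.5):
    """Return the closest entry of url_titles, or None if none reaches threshold."""
    requested = requested.lower()
    # max() keeps the first of several equally good candidates, like the
    # strict ">" comparison in the controller's loop.
    best = max(url_titles, key=lambda t: title_similarity(requested, t), default=None)
    if best is None or title_similarity(requested, best) < threshold:
        return None
    return best
```

For example, best_title_match('Friendlier-Blog-Post-URLs', ['friendlier-and-safe-blog-post-urls', 'another-post-entirely']) picks the first title, while a request such as 'zzzz' matches nothing and would become a 404.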

In order to keep the site from having to iterate across every post, I have decided that the canonical URLs will contain the date of the post, allowing for the database query to return only a handful of posts at most. Ergo, the final URLs look like:

http://mikeboers.com/blog/2012/03/15/friendlier-and-safe-blog-post-urls
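One way to pull the date back out of such a path is sketched below. It assumes the part after /blog/ is an optional YYYY/MM/DD/ prefix followed by the title; split_dated_path and the regular expression are illustrative names, not code from the site.

```python
import datetime
import re

# Assumed layout: an optional "YYYY/MM/DD/" prefix, then the title.
_DATED_PATH = re.compile(r'^(?:(\d{4})/(\d{2})/(\d{2})/)?(.+)$')

def split_dated_path(path):
    """Return (date or None, title) for a path relative to /blog/."""
    match = _DATED_PATH.match(path.strip('/'))
    if match is None:
        raise ValueError('empty path')
    year, month, day, title = match.groups()
    if year is None:
        # No date in the URL: the caller has to consider every post.
        return None, title
    try:
        return datetime.date(int(year), int(month), int(day)), title
    except ValueError:
        # Digits that do not form a real date: fall back to a full search.
        return None, title
```

When a date comes back, the query can be filtered to that day before running the similarity test; when it is None, the search falls back to all posts.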

Now that this is done, I can refer to this post in any of the following ways:

And you can even omit the date and get the title a little wrong and get back here; for example, here are some links that should redirect you back here:

Posted on March 15, 2012. Categories: website / URLs