Mike Levin SEO

Future-proof your technology-skills with Linux, Python, vim & git... and me!

Tackling The Unusual Implementation of Regex in Python

by Mike Levin SEO & Datamaster, 07/10/2012

I started some great work yesterday to add an easier way of screen-scraping to Tiger. I have to support Regular Expressions, and it brings up the whole nuanced mess that is Regular Expressions under Python. It’s not quite as bad as the urllib2 exception use cases ruining it for all the mainstream cases, but it’s close. This is a daily journal entry full of thinking-out-loud about Regex under Python. It’s not a regex tutorial, but it may shed some light on why they’re just not as easy as they should be.

Regular Expressions are an incredibly powerful and maddening technology for matching text strings. So many people have run afoul of them that you can find tons of advice to avoid them entirely. And there are in fact more powerful and less baffling pattern-matching technologies out there, such as SNOBOL, but for whatever reason (probably Perl’s support and popularity) regex has become the de facto standard for advanced text parsing when functions like left, right and substring fall short. Regex is also built into nearly every mainstream programming language, meaning that mastering it is one of those skills that cuts across languages, platforms and time. That makes it worth taking up.

Python has regex built in too, but it’s one of those skills that, no matter how much I use it, just never comes naturally to me. Part of that may be the arbitrary and counterintuitive nature of regex itself, but at least some of it is due to the Python implementation. No matter how much I love Python as the terse and pragmatic choice, there are places where, for whatever reason, regex is just much less elegant (albeit just as powerful) than in other languages I’ve been exposed to lately, namely Ruby.

Start out with the fact that it’s a library you need to import. Ruby doesn’t require that. So right away, all the appeal of slamming out the equivalent of Ruby one-liners for the ubiquitous task of pattern matching is off the table. One-liner elegance is tainted by having to have that import up top somewhere. I know you can’t have everything from the external libraries loaded into memory all the time, but regex is one of those things that straddles the line of something that may actually be worth it. Ruby chose to make it always there, probably because its heritage is in Perl, which is nearly synonymous with regex, so no one’s going to raise the memory-efficiency argument against regex always being there.

The next ugly thing about regex under Python is the nuances of the API. There are functions such as search, match, sub, findall and finditer, all of which do slightly different things, and which have inconsistent-feeling interfaces for specifying such critical details as case insensitivity. You can use the bafflingly named and strangely combinable flags re.I, re.M, and re.S (also spelled re.DOTALL). You combine them as the third parameter of the search function using the pipe symbol, Python’s bitwise “or” operator: re.S | re.I | re.M. Or you can skip the flags argument entirely and open your pattern with the inline (?i) for case insensitivity. So, you have abbreviated attributes of the re module itself being fed back into it as arguments, combined together with the pipe symbol. I mean, where else in all of Python-dom, or any other language for that matter, does something like that occur?
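Here is what those two styles of specifying flags look like side by side (a minimal sketch with a made-up sample string):

```python
import re

text = "First LINE\nsecond line"

# Flags combine with the pipe (bitwise OR) as the optional flags argument;
# re.M lets ^ match at the start of every line, re.I ignores case.
m = re.search(r"^SECOND", text, re.I | re.M)
print(m.group(0))  # second

# The inline (?i) flag inside the pattern itself also turns on case
# insensitivity, no flags argument needed:
print(re.findall(r"(?i)line", text))  # ['LINE', 'line']
```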

Okay, the next thing is that the response objects that come back from queries are “shaped” a bit strangely, in a way that wouldn’t make sense without fairly deep pre-existing knowledge of regex. The concept is match groups. The object that comes back from a pattern match is a match object, which you’d think you could just step through: match[1], match[2], etc. But because of the fairly advanced concept of capturing groups, as delimited by parentheses in your pattern, you actually need to access match.group(0), match.group(1), etc., which subtly complicates everything from checking for matches to the seemingly unnecessary verbosity of your code. I know this is done to support all the use cases and more advanced uses of regex, but it comes at the sacrifice of the easy, mainstream use cases.
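A minimal sketch of how that plays out, using a hypothetical email-ish pattern:

```python
import re

# Parentheses in the pattern define capturing groups. group(0) is always
# the entire match; group(1), group(2), etc. are the parenthesized parts.
m = re.search(r"(\w+)@(\w+)\.com", "contact: sales@example.com")
print(m.group(0))  # sales@example.com
print(m.group(1))  # sales
print(m.group(2))  # example
print(m.groups())  # ('sales', 'example')
```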

The next thing to get used to, if you’re trying to master regex under Python like me, is how nearly all pattern-defining strings are preceded with an “r”, as in r"somepattern". That’s because both Python and regex use the backslash as their special-character escape, as in \n for newline, so mixing the two can get messy. If you wanted to match a newline in a pattern, you might write:

mypat = "end\nbegin"

…which would match the word “end” at the end of a line, followed by “begin” at the beginning of the next line. But the \n by default “belongs” to Python. Python’s special-character escaping will step in and transform the \n into a real linebreak, and it will never reach the regex engine as intended. So the solution is to use the token for “raw” strings in Python, by which no Python character escaping is attempted. You are essentially conceding all backslash handling to regex, which is a good idea. But it is a place where you realize Python’s attempt at terse elegance starts to break down.

mypat = r"end\nbegin"

Now, all languages have their own strange idioms. In this case, Python is taking advantage of the fact that a character immediately before a string would otherwise be illegal, so there are no compatibility issues in using that position as a directive for raw, unicode, or other special treatment of the string. It’s clever, but it creates a very Python-specific signature. Now admittedly, so does forced indenting, which I love. But this is not one of those cases where Python’s decisions result in more beautiful code. It actually makes the code a bit more ugly, and is something I prefer to avoid. Ruby’s solution was to make the double-quote and single-quote NOT interchangeable: in one usage, special characters belong to Ruby, and in the other, special characters remain raw and belong to the character string. So, get used to seeing and using an “r” before pattern strings.
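You can see the difference the “r” makes just by measuring the strings (a quick sketch):

```python
import re

# Without the r prefix, Python collapses \n into a single newline
# character before the regex engine ever sees the string.
cooked = "end\nbegin"
raw = r"end\nbegin"
print(len(cooked))  # 9: backslash-n became one newline character
print(len(raw))     # 10: backslash and n stay two separate characters

# Both happen to match here, because the regex engine understands \n too;
# escapes like \b (backspace to Python, word boundary to regex) are not
# so forgiving.
print(bool(re.search(cooked, "the end\nbegin again")))  # True
print(bool(re.search(raw, "the end\nbegin again")))     # True
```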

And speaking of uncharacteristic lack of terseness in Python, it appears that the most efficient way to use regex patterns is to “compile” them first. Now this is a concept so alien to the interpreter way of working that it sticks out as an oddity every time I see it. What’s worse, if you do compile your regex patterns, the API is different, making use of the methods of compiled pattern objects. This takes a bit of explaining, and is another example of Python’s lack of consistency in contrast with Ruby. What do I mean by compile? Well, one way to use regex in Python looks like this:

patobj = re.compile(r"end\nbegin")
match = patobj.search(string)

If you wanted to make this case-insensitive, you would do this:

patobj = re.compile(r"end\nbegin", re.I)
match = patobj.search(string)

In this case, we are invoking the search method of a pattern object. This is in contrast to using the search function directly without compiling the pattern:

re.search(r"end\nbegin", string, re.I)

Now, while we’re still really using methods and properties, in the latter case it feels a lot like a function. Nowhere is this functions vs. methods-and-properties duality of Python as clear as with regular expressions.

What do I mean by a function vs. a method? Well, in Ruby, basically everything is a method or property of an object. If you want to know something’s length, the one and only way to do it is object.length. It’s a property with no parameters, so there aren’t even parentheses after the word length. But in Python, you have the len() function, used like len(object). The justification is that len() applies across so many object types that the pragmatic decision is to let you “get to know” broadly applicable functions, rather than the internal details of countless object types. But Python also has a rich set of methods and properties on objects that can be addressed much like in Ruby, though they must include the often-unnecessary parentheses. You just get to know what to use when through convention and habit.
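A quick illustration of the two spellings; len() actually delegates to the object’s __len__ method under the hood:

```python
s = "hello"

# The broadly applicable built-in function:
print(len(s))        # 5

# The dunder method that len() delegates to behind the scenes:
print(s.__len__())   # 5

# Ordinary methods always need their parentheses, even with no arguments:
print(s.upper())     # HELLO
```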

…and usually that’s pretty easy. Always use len() as a function. It always works. If it makes sense for the object, it’s going to work. But it’s not as clear with regex until you’ve had LOTS and lots of practice. If you’re compiling, you’re using methods. If you’re doing quick one-offs, you’re using functions. You get the idea. The API subtly changes, as does a large part of the concept of what you’re doing.

So, does compiling your patterns have any real advantage? It sounds like a good optimization step, right? Nope. Python caches recently used regex patterns, so subsequent calls with the same pattern are already optimized. So there’s functionally no difference, and it’s really just a matter of preference. Some people seem to like to do it for organization and easy pattern reuse.
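Both styles produce identical results, so the choice really is stylistic (a sketch with made-up sample text):

```python
import re

text = "apples 12, oranges 7"

# Compiled style: methods on a pattern object.
patobj = re.compile(r"\d+")
print(patobj.findall(text))      # ['12', '7']

# Function style: the module compiles and caches r"\d+" behind the
# scenes, so repeated calls reuse the same compiled pattern.
print(re.findall(r"\d+", text))  # ['12', '7']

# The internal cache can even be cleared explicitly if you ever need to:
re.purge()
```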