I finally grok the group and groups in Python RegEx API

by Mike Levin SEO & Datamaster, 07/18/2012

Okay, I’m up to a fairly tricky RegEx URL rewrite puzzle in Python, and I thought I’d dedicate this journal entry to the solution. I had the sudden insight that the challenge I’m encountering could shed some light on match.group() for those struggling with it, plus some nuances of Python that are a huge benefit but look strange to the un-Python-initiated.

So basically, I’m trying to reproduce the Apache rewrite process. Apache has two relevant directives: RewriteCond (the condition) and RewriteRule (the rule itself). I’m only interested in RewriteRule, because in my case, it will always be applied when encountered. So, the syntax for RewriteRule, according to the Apache rewrite guide (http://httpd.apache.org/docs/2.0/misc/rewriteguide.html) and the mod_rewrite reference (http://httpd.apache.org/docs/2.0/mod/mod_rewrite.html), is:

RewriteRule ^oldstuff\.html$ newstuff.html

Now, if that were the extent of what I had to support, there would be no problem. However, we are almost always TRANSFORMING an old URL into a new one, using parts of the old URL in the new. Anything captured by a parenthesized group in the pattern can be reused in the new URL with a Perl-style regex backreference. So, if you wanted to change:

http://domain.com/foobar.html

…into…

http://domain.com/barfoo.html

RewriteRule ^/(foo)(bar)\.html$ %2%1.html
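For comparison, here is a minimal sketch of the same foo/bar swap done directly in Python. It uses re.sub, whose replacement string refers to groups as \1 and \2; the variable names are only illustrative.

```python
import re

# Swap the two captured groups in the path portion of the URL.
path = "/foobar.html"
swapped = re.sub(r"^/(foo)(bar)\.html$", r"/\2\1.html", path)
print(swapped)
```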

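Notice, when you’re keeping all the traffic on the same domain name, you don’t need to include the beginning part of the URL. There are a number of other subtleties here that do not apply, due to our particular adaptation of this rule. One of them: stock Apache writes RewriteRule backreferences as $1, $2 and reserves %1, %2 for groups captured by a RewriteCond, while my adaptation uses the % form throughout. In fact, because I don’t need to use directives, my rules will look more like this: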

^http://www\.youtube\.com/user/(.*)$ http://www.youtube.com/%1/videos
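A rule in this stripped-down form is just a pattern and a substitution template separated by whitespace. A minimal sketch of pulling the two apart follows. It assumes the pattern itself contains no spaces, and parse_rule is a hypothetical helper name, not something from Apache or the re module.

```python
def parse_rule(line):
    """Split a rule line into (pattern, template).

    Assumption: the pattern has no whitespace in it, so the first run of
    whitespace marks the boundary between pattern and template.
    """
    pattern, template = line.split(None, 1)
    return pattern, template.strip()

rule = r"^http://www\.youtube\.com/user/(.*)$ http://www.youtube.com/%1/videos"
pattern, template = parse_rule(rule)
print(pattern)
print(template)
```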

Now, in order to satisfy my criteria, I could easily just support the back-reference group 1 as needed in this example, but that would be a cop out. Instead, I should support multiple backreferences, and I’ll make gratuitous use of matching groups to make the point:

^http://www\.(youtube)\.com/(user)/(.*)$ http://www.%1.com/%2/%3/videos

Now, I have three matching groups, and if we fed the parts into a Python regex search, it would look something like this:

import re
match = re.search(r'^http://www\.(youtube)\.com/(user)/(.*)$', 'http://www.youtube.com/user/miklevin')

…which would return:

>>> match.group(0)
'http://www.youtube.com/user/miklevin'
>>> match.group(1)
'youtube'
>>> match.group(2)
'user'
>>> match.group(3)
'miklevin'
>>> match.groups()
('youtube', 'user', 'miklevin')
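The difference between the two calls is easy to miss in that session: group(0) is the whole matched text, while groups() returns only the numbered groups, starting from group 1. A short sketch of that relationship, using the same pattern and URL (the variable names are illustrative):

```python
import re

match = re.search(r"^http://www\.(youtube)\.com/(user)/(.*)$",
                  "http://www.youtube.com/user/miklevin")

whole = match.group(0)      # the entire matched text
numbered = match.groups()   # groups 1..N as a tuple; group 0 is not included
pair = match.group(1, 3)    # several group numbers at once come back as a tuple

print(whole)
print(numbered)
print(pair)
print(len(numbered) == match.re.groups)  # the compiled pattern knows its group count
```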

And so, to build this back into the URL, we need to replace %1 with group(1), %2 with group(2), and so on, until we have no more groups in the match. THIS is finally making the power of Python’s strange RegEx API clear to me. When you REALLY need to build things with RegEx, rather than write Ruby-like one-liners, then the Python API is not the nonsensical burden that it seems.

Sooooo, what’s the most Pythonic, elegant and efficient way of getting from the above data structure to replacing…

http://www.%1.com/%2/%3/videos

Well, the first part of that answer is we use enumerate, so we have access to an index without having to create an unnecessary housekeeping counter variable:

for index, value in enumerate(match.groups()):
    print(index, value)

…which outputs…

0 youtube
1 user
2 miklevin
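Since group numbers start at 1 while enumerate starts counting at 0 by default, it may help to know that enumerate accepts a start argument. A minimal sketch, with illustrative variable names, in which the counter lines up with match.group() numbering:

```python
import re

match = re.search(r"^http://www\.(youtube)\.com/(user)/(.*)$",
                  "http://www.youtube.com/user/miklevin")

# start=1 makes the counter agree with the numbering used by match.group().
numbered = list(enumerate(match.groups(), start=1))
for number, value in numbered:
    print(number, value, match.group(number) == value)
```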

But what do I want to do on each iteration? I want to replace each occurrence of the percent symbol followed by index+1 with the tuple value at that index. Sooooo…

>>> apart = 'http://www.%1.com/%2/%3/videos'

>>> for index, val in enumerate(match.groups()):
...     apart.replace('%'+str(index+1), match.group(index+1))
...
'http://www.youtube.com/%2/%3/videos'
'http://www.%1.com/user/%3/videos'
'http://www.%1.com/%2/miklevin/videos'

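Each pass only fills in one placeholder, because strings are immutable and replace() returns a new string rather than changing apart. To make the result actually stick, reassign the name on every pass: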

>>> apart = 'http://www.%1.com/%2/%3/videos'

>>> for index, val in enumerate(match.groups()):
...     apart = apart.replace('%'+str(index+1), match.group(index+1))
...
>>> apart
'http://www.youtube.com/user/miklevin/videos'

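And THAT should be sufficiently robust for rules with up to nine groups. With ten or more, replacing %1 first would also clobber the front of %10.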
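If a rule ever needs a tenth group, one possible way to keep %1 and %10 apart is to let a single regex pass find every placeholder and look up its group in a callback. This is only a sketch under my own assumptions: expand_template is a hypothetical helper, placeholders with no matching group are left untouched, and groups that did not participate in the match become empty strings.

```python
import re

def expand_template(match, template):
    """Fill %N placeholders in template with group N of match (hypothetical helper)."""
    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > len(match.groups()):
            return placeholder.group(0)  # no such group: leave the text as-is
        value = match.group(number)
        return value if value is not None else ""  # unmatched optional group
    # \d+ reads the whole number, so %10 is never mistaken for %1 plus "0".
    return re.sub(r"%(\d+)", substitute, template)

match = re.search(r"^http://www\.(youtube)\.com/(user)/(.*)$",
                  "http://www.youtube.com/user/miklevin")
print(expand_template(match, "http://www.%1.com/%2/%3/videos"))
```

The trade-off is that a literal digit right after a placeholder gets read as part of the group number, so a template like %12 always means group 12.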