URL ReWriting with RegEx - the hardest part of easy screen scraping

by Mike Levin SEO & Datamaster, 07/19/2012

Okay. I slip in a bit of coding here and there at work. Progress is much slower than I would like when interweaving it with meetings and such. But little breakthroughs like yesterday’s keep me going. I’m working on a big, sexy feature nobody will “get” until it exists. And I have to supplement these things with videos, or else they are just more obscure unused features.

This particular feature, called ezscrape, takes the single most common use for Tiger after crawling (arbitrary screen-scraping or API calls) and makes it easy to spin up new ones with copy-and-paste examples that don’t require you to delve into Python programming.

To that end, I had to come up with an approach that doesn’t require Python coding to describe a scrape function, in a way that a new user can understand at a quick glance. The answer is, of course, just make a new optional worksheet tab that contains all the scraping function names and parameters. Let the user ask for it, then change existing examples or copy-and-paste to make new ones. Those cells are all easily described, and none contain Python code.

This is possible because almost all scraping jobs are the same—and even complex ones that use iframes or Ajax calls after the initial page-load are simpler than they might seem—if only you can re-write that initial page-load URL to actually be the secondary one used for the iframe or XMLHttp call where the desired data resides. That was one of the trickier bits to solve, and is what the last few days were about.

Suddenly, with (url) input re-writing, all sorts of wonderful stupid HTTP request tricks become possible (without programming), like taking a simple username as input and building it into a full Twitter profile URL, or better still, building it into a JSON API URL and doing it the official way. I imagine that being given a Web URL and actually wanting to hit an official API is a common need.

THAT was the last real challenge to solve before exposing ezscrape. There are others, like having better path queries for yanking data out of JSON, but I can always start out by regexing JSON and add JSONPath later. The general ability to massage screen-scraping en masse across “functions” without having to hunt down endless disparate actual Python functions is one of the huge wins here. So, ready, fire, aim!
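
To make "regexing JSON" concrete, here is a minimal sketch of pulling one numeric value out of a JSON string with a regular expression, next to the standard json module for comparison. The sample payload, the field names, and the regex_field helper are all made up for illustration; they are not taken from ezscrape or from any real API response.

```python
import json
import re

# Hypothetical sample payload; the field names are illustrative only.
SAMPLE = '{"screen_name": "example_user", "followers_count": 1234}'


def regex_field(json_text, key):
    """Return the integer value stored under `key`, or None if absent.

    This treats the JSON as plain text, so it only handles simple
    integer values and returns the first occurrence of the key.
    """
    match = re.search(r'"%s"\s*:\s*(\d+)' % re.escape(key), json_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(regex_field(SAMPLE, "followers_count"))
    # The structured route, for comparison:
    print(json.loads(SAMPLE)["followers_count"])
```

The regex route is brittle (nesting, repeated keys and string values all need extra care), which is why a proper path query language is the better long-term answer, but it fits in a single cell.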

Apache rewrite rules showed the way. My challenge was orders of magnitude simpler than Apache’s. But basically, I needed to cram something requiring three inputs and a rich data-transformation language into one appropriately named cell, without going through a hackable eval or exec statement. Regex is it, thanks to its back-reference feature, which I take advantage of for URL transformation with very brief rewrite rules, precisely the same way Apache does.
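
As a rough sketch of the idea (not the actual ezscrape code), a rewrite rule can live in one cell as a pattern and a replacement separated by whitespace, and be applied with Python's re module. The cell format, the rewrite_url name, and the api.example.com endpoint are assumptions made up for this example. Python writes back-references as \1 where Apache writes $1.

```python
import re


def rewrite_url(rule, value):
    """Apply one 'pattern replacement' rule to an input value.

    `rule` is a single string holding a regex pattern and a replacement
    template separated by whitespace (a hypothetical cell format).
    Returns `value` unchanged when the pattern does not match.
    """
    pattern, replacement = rule.split(None, 1)
    new_value, count = re.subn(pattern, replacement, value, count=1)
    return new_value if count else value


# Hypothetical rule: profile URL (with or without hashbang) to a JSON API URL.
URL_RULE = r'^https?://twitter\.com/(?:#!/)?(\w+)/?$ https://api.example.com/users/\1.json'

# Hypothetical rule: bare username to a full profile URL.
NAME_RULE = r'^@?(\w+)$ https://twitter.com/\1'

if __name__ == "__main__":
    print(rewrite_url(URL_RULE, "https://twitter.com/#!/miklevin"))
    print(rewrite_url(NAME_RULE, "@miklevin"))
```

Because the rule is only ever handed to the regex engine, nothing in the cell gets executed as Python, which is the point of avoiding eval and exec. One limitation of this particular cell format is that the pattern itself cannot contain whitespace.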

So now, I have URL rewriting working in a way I can expose to users as a single field in an ezscrape tab. That was my last obstacle in exposing the tab itself and getting it into announce-able format. This post is about thinking through the remaining issues.

It is interesting to note that in the time I was working on this, Twitter went from the difficult-to-scrape hashbang (#!) JavaScript approach, which puts the burden on the browser to pull numbers with separate calls, back to older-style URLs in which the numbers are included directly in the HTML of the initial page-load (the stuff you see when you “view source”). Point being, Twitter is now much more easily scrape-able, even though I’ve already done the work to go through the anonymous JSON API, and I may want to try dropping it into ezscrape even before JSONPath is supported.

I still have that Future of SEO script/PowerPoint to deliver this week. I have a large chunk of uninterrupted, meeting-free time coming up today after my last must-attend meeting, which starts in fifteen minutes. I need to use the second half of today to finish BOTH the ezscrape tab AND the second draft of the Future of SEO PowerPoint deck/script. It’s mostly refinement and a picture-hunt on the PowerPoint front. I have A LITTLE bit of new thinking on it, because of Google Now and the Nexus 7 (which I’ll be getting any day now). But I basically nailed it first-pass.

Okay, time to cut the journal entry. The next one will be about the implementation of the ezscrape tab itself, addressing such things as worksheets in Google Spreadsheets being thought of as a list of dicts, for easily sucking all the data in from a worksheet, or spewing it all out. So far, I’ve been pulling data in from worksheets cell-by-cell, losing the feel for that “shape” which you can plug back in row-by-row with the gdata InsertRow method. I’m going to try to do it nice and neat.
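
To illustrate the list-of-dicts "shape" without touching gdata at all, here is a small sketch that converts a worksheet held as plain rows (header row first) into a list of dicts and back. The function names and the column names are hypothetical, and the real implementation may look quite different.

```python
def rows_to_dicts(rows):
    """Turn [[header...], [values...], ...] into a list of dicts."""
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def dicts_to_rows(records, header):
    """Turn a list of dicts back into rows, header first.

    Columns missing from a record are written as empty strings.
    """
    return [list(header)] + [
        [record.get(column, "") for column in header] for record in records
    ]


if __name__ == "__main__":
    # Hypothetical worksheet contents; column names are made up.
    sheet = [
        ["name", "url"],
        ["followers", "https://twitter.com/example_user"],
    ]
    records = rows_to_dicts(sheet)
    print(records)
    print(dicts_to_rows(records, ["name", "url"]) == sheet)
```

Each dict is one row keyed by column name, so the same structure can be read in one pass and written back out row by row.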