Finally Using Python Regular Expression Named Groups For Easy Scraping
by Mike Levin SEO & Datamaster, 07/11/2012
NOTE: The headline is a spoiler. Yesterday I added support for a newurl (URL rewriting) in the ezscrape feature, because you may want to scrape from a slightly different URL than the one provided. For example, YouTube channel upload counts are not visible on the main channel page, but the URL that has the data just adds /videos to the end.
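Here is a minimal sketch of that kind of rewrite, assuming a single regex substitution is enough. The rewrite_url helper and the channel URL are made up for illustration; this is not the actual newurl code in ezscrape.

```python
import re


def rewrite_url(url, pattern, replacement):
    """Rewrite a URL with one regex substitution (hypothetical helper)."""
    # count=1 keeps an optional-match pattern from firing twice at the end.
    return re.sub(pattern, replacement, url, count=1)


# Hypothetical rule: drop any trailing slash, then append /videos.
channel_url = "https://www.youtube.com/user/example/"
videos_url = rewrite_url(channel_url, r"/?$", "/videos")

print(videos_url)  # https://www.youtube.com/user/example/videos
```

The same rule also works when the provided URL has no trailing slash, because the slash in the pattern is optional.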
This URL tweaking capability could also let you switch to an XML or JSON API where no login is required, which should let me turn the Twitter functions into ezscrape entries. That would additionally require supporting argument fields other than just URL (twusername in this case). But that makes it even more powerful, because ezscrape will be able to build up its own URL input string based on minimal data from the user (not having to be URLs every time). In other words, I can keep the twusername input field, but still move the Twitter functions to ezscrape.
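Building the URL from a non-URL field could look something like the sketch below. It assumes the entry stores a URL template with a named placeholder; build_url, the template, and api.example.com are all hypothetical stand-ins, not a real endpoint or the real ezscrape interface.

```python
from urllib.parse import quote


def build_url(template, **fields):
    """Fill a URL template from named input columns (hypothetical helper)."""
    # Percent-encode every value so user input cannot break the URL.
    safe_fields = {name: quote(str(value), safe="") for name, value in fields.items()}
    return template.format(**safe_fields)


# Placeholder endpoint, for illustration only.
template = "https://api.example.com/users/show.json?screen_name={twusername}"

print(build_url(template, twusername="example_user"))
```

The user only supplies the twusername column, and the entry itself knows how to turn that into a full request URL.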
But that brings up the fact that when I hit a real JSON API, xpath won’t cut it for extracting the desired data, and regex is overkill. I’ll make the ezscrape interface work with a JSON data path address as well. There is something called JSONPath, which is to JSON what xpath is to XML, but that’s one more external dependency, and since JSON converts so readily to native Python objects, I’m tempted to just use key names and indexes. The problem there is the necessity of knowing the object structure, which is mind-numbing to look at.
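The key-names-and-indexes idea needs nothing beyond the standard library. This is only a sketch: get_path and the sample JSON are invented for illustration, and the path is simply a list of dict keys and list indexes followed in order.

```python
import json


def get_path(obj, path):
    """Follow a list of keys and indexes down into a parsed JSON object."""
    for step in path:
        obj = obj[step]  # works for both dict keys and list indexes
    return obj


# Made-up API response, for illustration only.
response = json.loads('{"user": {"name": "example", "counts": [10, 20, 30]}}')

print(get_path(response, ["user", "counts", 1]))  # 20
```

Unlike JSONPath, this has no fuzzy selection at all, so the path has to be exactly right.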
This leads to two possible approaches. The first is actually using JSONPath, because it lets you make fuzzy selections just like xpath, then further refine them. The other, more appealing and more challenging, is one in which the system gives you the object path, given some input, such as an API request URL and what you’re looking for (the target). This could apply to xpath, JSON, and jQuery. It’s not so easy for regex, but I imagine it’s doable.
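For JSON, the second approach can be sketched as a recursive search that reports every path leading to the target value. The find_paths name and the sample data are hypothetical, and this handles only exact matches on leaf values.

```python
def find_paths(obj, target, path=()):
    """Yield every key/index path in obj whose leaf value equals target."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_paths(value, target, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_paths(value, target, path + (index,))
    elif obj == target:
        yield path


# Made-up parsed API response, for illustration only.
response = {
    "user": {"name": "example", "followers": 1234},
    "items": [{"n": 1234}, {"n": 5}],
}

print(list(find_paths(response, 1234)))
# [('user', 'followers'), ('items', 0, 'n')]
```

You look up a value you can already see on the page, and the search hands back the address to store in the scraping entry.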
Speaking of regex, I really need to have my plan for today. Make regex a supported type for ezscrape. Stick to pure regex. Forget regpre and regpost. They have their place, but not here. Here, we will be in the land of powerful single-string patterns, probably leveraging named matching groups so the data you’re trying to grab always has a particular name (by convention) and the rest of the match is either non-consuming or consumed but not returned. Both are viable approaches.
What would such regex patterns look like? They would definitely use parentheses to get groups. You can express groups that don’t capture the span of text that they match, which would do away with naming groups, potentially making the patterns simpler. Named groups appear to be a Python-specific extension to regex, so I’ll stay away from them. I may port the system to other languages someday. The one thing named groups have going for them is that they’re a bit more terse. You only need special instructions in the matching group and not everywhere else.
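To make the two approaches concrete, here is a small sketch on a made-up HTML fragment. In the first pattern the surrounding context is matched by lookarounds and never consumed; in the second it is consumed, and a capturing group marks the part to return. Neither pattern comes from ezscrape.

```python
import re

# Made-up fragment, for illustration only.
html = '<span class="count">42</span> videos'

# Approach 1: lookbehind and lookahead check the context without consuming it,
# so the whole match, group(0), is already just the data.
lookaround = re.search(r'(?<=<span class="count">)\d+(?=</span>)', html)

# Approach 2: the context is consumed, and a capturing group marks the data.
captured = re.search(r'<span class="count">(\d+)</span>', html)

print(lookaround.group(0))  # 42
print(captured.group(0))    # <span class="count">42</span>
print(captured.group(1))    # 42
```

One catch with the first approach in Python is that a lookbehind must match text of a fixed width, which limits how flexible the leading context can be.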
Either way, it’s far superior to my regpre and regpost approach of the past. It will shift some more burden onto the user of understanding regex proper, but that will be compensated for with plenty of copy-and-paste examples (I hope to get to the exscrape tab today).
…okay, a whole bunch of meetings and distractions… back to it! I will be going out for lunch today. But regather your thoughts and be ready to hit the ground running after lunch.
Okay, I really lost my groove there. Get it back! Get it back! 1, 2, 3… 1?
Okay, don’t mess around with how the URL rewriting is working right now. I am tempted to make it more pure regex and less Python-dependent, but it’s working. The thing to focus on right now is to make the ezscraping functions take arbitrary columns as their input. Why don’t they now?
Bingo! I have support for arbitrary args now. Things are proceeding well. Hmmmm. I could almost do the Twitter API functions now. Okay, don’t worry about the JSONPath stuff yet. That will be big. But get RegEx done. That’s the priority now.
Remember, 1, 2, 3… 1?
Took a related detour: pulling the gindexcount throws a captcha, but using the virtual machine from your desktop was a temporary workaround, so you could fill out the captcha on your own machine. I did a long-overdue update of the virtual machine, and put the zip back on the corporate network and my USB keychain. But all this stuff is related. Keep that in mind for testing. Keep plowing through. Today could be an awesome or a sucky day. It has yet to be determined by your next steps!
Focus on the Google Analytics ID-getting function. That’s an easy regex exercise. Step 1?
Okay, I got it rudimentarily working. Now change my fully matching pattern:
…for a partially matching one:
Failed experiment. group(0) still contains the leading and trailing quotes. How about the named groups?
Okay, that’s the ticket. Named groups. Explicit wins! But that’s a Python-specific extension to RegEx. So be it. This immediately gives me what I want, and is relatively easy to teach. It’s similar to regpre and regpost, but uses the regex engine directly with no pre or post processing.
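The difference can be reproduced with a sketch like this one. The page fragment, both patterns, and the group name scrape are stand-ins invented for illustration; they are not the patterns I actually used.

```python
import re

# Made-up page fragment containing a Google Analytics ID.
page = """_gaq.push(['_setAccount', 'UA-1234567-1']);"""

# Non-capturing groups still consume the quotes, so group(0) includes them.
partial = re.search(r"(?:')UA-\d+-\d+(?:')", page)
print(partial.group(0))  # 'UA-1234567-1'

# A named group returns only the labeled part of the match.
named = re.search(r"'(?P<scrape>UA-\d+-\d+)'", page)
print(named.group("scrape"))  # UA-1234567-1
```

With a fixed group name as the convention, the calling code never needs to know anything else about the pattern: it just asks for that one group.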
Okay, so start moving functions over into ezscrape. Start with Pinterest, because they’re relatively easy and recent. Okay, I got the first couple over. Easy success! This is going to result in SUCH a code clean-up and SUCH an improved self-help screen scraping / API-calling system reminiscent of how functions work, but MUCH easier for users who just want to scrape and not dabble in Python.
Next step… create an optional exscrape tab, just as you have for documentation and other features. Make it work much like the functions tab, but actually pull down all or an exemplary set of scraping entries for copy-and-paste work. Have the local worksheet versions always override the global scraping functions.
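One way the local-overrides-global rule could work, assuming entries are kept in dictionaries keyed by name, is a ChainMap from the standard library. The entry names and their contents below are hypothetical, not the real ezscrape data.

```python
from collections import ChainMap

# Hypothetical entries: name -> (type, pattern).
global_entries = {
    "gaid": ("regex", r"'(?P<scrape>UA-\d+-\d+)'"),
    "title": ("xpath", "//title/text()"),
}
local_entries = {
    "title": ("xpath", "//h1/text()"),  # the worksheet's own version
}

# ChainMap searches its mappings left to right, so local entries win.
entries = ChainMap(local_entries, global_entries)

print(entries["title"])    # ('xpath', '//h1/text()')
print(entries["gaid"][0])  # regex
```

Nothing is copied or merged, so editing the local worksheet entries changes what the lookup returns without touching the global set.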
Also, start thinking about reverse xpath, reverse jsonpath, and reverse jquery.
Regex users won’t have such a reverse-path-function luxury… or will they? Hmmmmm.