Mike Levin SEO

Future-proof your technology-skills with Linux, Python, vim & git... and me!

Easy Pattern Matching and Screen Scraping for 360iTiger

by Mike Levin SEO & Datamaster, 07/09/2012

Okay, in my quest to reinvent myself online, or perhaps to invent myself for the first time, I have to decide whether what I’m trying to do is build a personal brand, or just be genuine doing my thing and let all the personal brand stuff be the side-effect. Of course, the latter is my plan, and that’s the motivation for the daily journal approach. I just move at too fast a pace at work, and have too much going on in my own life, NOT to get the efficiency of making my daily journal my blog content. The result is a ton of journal entries that can’t be published, because they’re too disjointed and raw—or just refined and sanitized enough to push out, but still not highly edited. So be it, goddamnit.

My most important thing at the office is turning 360iTiger into a better and better internal tool for our SEO Managers and team, as well as a better and better thought leadership device, which has sort of stalled out. No big deal. I’m not in a hurry here, and this is part of a long-term play that dovetails with my mission in life and a Free and Open Source Software (and hardware) initiative, that I’m sure will take years to play out, so as long as I stay on-plan, I’m making progress. Certain aspects of that plan, no matter how appealing they are to me, have to be put on the back-burner, because they’re not what I need to be doing right now to satisfy clients.

Happily, there are other aspects of the project that are perfectly aligned with satisfying my internal company customers (SEO Managers, Community Managers, etc.) while also aligning with my long-term plans. There’s basically a hardware, operating system, and end-user application layer to what I’m doing. The stuff I’m putting on hold is what’s coolest to me—the hardware/OS part, with Raspberry Pi microservers and the best tiny version of QEMU Linux for USB virtual machines. But the end-user application layer IS 360iTiger, and there’s plenty of exciting work to do there.

It’s time to start solving simultaneous equations: my personal online brand, new Tiger features, and rigorous Tiger testing. My plan is to leverage everything about Tiger to boost my personal brand as a part of the battery of tests. I.e. test all Tiger features on myself as part of the built-in Tiger health-check, which I will be running ALL THE TIME. This will also re-immerse me in the Tiger code base, and start me on the path to version 2.0. This should be a real clean-up, organization and preparation of the code for the eventual removal of Apache2 and mod_python from the picture, making my software stack much smaller and lightweight.

Also, in this latest pass of work, I will be filing a bunch of the spurs off of Tiger via more elegant and consistent pattern matching. One of the shortcomings of Python on the Tiger project in my opinion is the awkward interface to regular expressions—especially compared to Ruby one-liners. A Tiger system based on Ruby would really encourage pattern matching more than under Python. Maybe a long-term goal could be to implement Tiger under each language I wish to master: LISP, JavaScript and Ruby. Noble goal, and great potential software app for my FOSS-platform mission in life. Sort of a language benchmark test.

My answer to pattern matching is to support xpath, jquery and regex alike, as appropriate and as people are comfortable with, while making the patterns easier to acquire and derive. A goal here is to make the patterns SO EASY and accessible to Tiger newbies that all those functions littering up the Tiger code—each a couple of lines to do a simple pattern-match that inevitably has to be updated—simply go away. They get replaced by something else, so long as that something else is actually better.

This pattern-matching improvement is just ONE of the pieces of the simultaneous equation. What are the other pieces? What are the BIGGEST wins? That pattern matching thing falls under the category of the biggest self-help wins. It should almost take the place of documentation. It’s kinda like a simple hash-table. The patterns could be just as easy as the documentation. This would apply to 4square, youtube, and a whole lotta screen scraping stuff. Can you give it a very simple workshop-ish identity? 360iTiger Scrapes. Just go to the “Scrapes” tab. Look it over. Add your own. Blah blah blah.

Is that too large of an architectural expansion? Its purpose in life would be to pull in a table and turn it into a dictionary. And I would inevitably want to support indexes, in case patterns pull back multiple matches. Or maybe I should automatically handle indexes, like pulling back the most obvious choice (a number). Any way you look at it, it has to be an easy, easy, easy interface! Immediately obvious and enlightening in nature.
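The table-to-dictionary step is simple enough to sketch. Here’s a minimal illustration, with hypothetical headers and rows, of pulling a worksheet-style table (header row plus data rows) into a list of pattern dicts:

```python
# Minimal sketch: turn a worksheet-style table into pattern dicts.
# The headers and example rows are hypothetical, not Tiger's real schema.
def table_to_rules(table):
    headers = table[0]
    return [dict(zip(headers, row)) for row in table[1:]]

table = [
    ['name',   'type',   'pattern',              'index'],
    ['firstp', 'xpath',  '//p',                  0],
    ['title',  'regexp', '<title>(.*?)</title>', 0],
]

rules = table_to_rules(table)
print(rules[0])  # -> {'name': 'firstp', 'type': 'xpath', 'pattern': '//p', 'index': 0}
```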

Everything can have an xpath, jquery and regex pattern—and a column for each, by extension. That provides a nice fall-back method: if one doesn’t work, use the other, and so on. Maybe they have to agree. Or maybe I have a rule: if one matches, okay; if two, great; if all three… you’ve got a pretty surefire match. I could also provide an index column. Or perhaps a column for the datatype you’re looking for—int, string, decimal, etc. I don’t want to overcomplicate this. All the columns would be optional, but a minimal set would be required to make it work.
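A rough sketch of how that fall-back and agreement idea might work, assuming rules are plain dicts and using only the standard library—ElementTree’s limited XPath subset stands in for a real XPath engine, and the jquery column is omitted here:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule: any combination of the pattern columns may be filled in.
rule = {'name': 'firstp', 'xpath': './/p', 'regexp': r'<p>(.*?)</p>', 'index': 0}

def apply_rule(rule, html):
    """Try each pattern type in turn; agreement across types means
    a near-surefire match."""
    results = []
    if rule.get('xpath'):
        # ElementTree only parses well-formed markup and supports a
        # small XPath subset -- a stand-in for a full XPath engine.
        tree = ET.fromstring(html)
        matches = [el.text for el in tree.findall(rule['xpath'])]
        if matches:
            results.append(matches[rule.get('index', 0)])
    if rule.get('regexp'):
        matches = re.findall(rule['regexp'], html)
        if matches:
            results.append(matches[rule.get('index', 0)])
    # Fall-back: the first pattern that produced anything wins;
    # if two methods agree, confidence goes way up.
    agreed = len(results) > 1 and len(set(results)) == 1
    return (results[0] if results else None), agreed

html = '<html><body><p>hello</p><p>world</p></body></html>'
print(apply_rule(rule, html))  # -> ('hello', True)
```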

This entire endeavor would have one object in the Tiger code itself, effectively taking the place of tons of disparate functions. And it would have another representation, decomposed into spreadsheet form, that could be expanded by the user—perhaps optionally overriding the existing patterns the way functions currently work. The trick is to do this without making it a big refactoring or introducing bugs. But it will be infinitely useful.

Other types of work that fall under the category of game-changers, and that I would like to work into this round, include some form of scheduling and some form of trending. Both of these have very tricky aspects to them. I won’t describe everything, but basically, I am very hesitant to make anything about Tiger automated and unattended. This is not a heavy-duty server app with a database and persistence. It’s lightweight and transient. I’m playing the ultimate shell-game with servers, and scheduling doesn’t lend itself to that model. And trending is tricky for a different reason: I need to figure out how to add rows, and how to rename and manage columns. Do I trend horizontally or vertically? Do I use worksheets? How will the trending visualization occur? I should figure out the easiest trending visualization and work backwards from that to the easiest data layout to produce trending graphs from.

Okay, so this is some pretty good thought-work about the great advances Tiger will be imminently making. But I need rhyme and reason to this approach. I need to touch and personally use nearly every Tiger function. I need to deeply understand and internalize all the functions—even the ones I didn’t write myself. My confidence is pretty high right now with how successful I was with file-writing and emailing, and how cleanly I really could integrate that into the Google Analytics functions, about which I am still greatly clueless.

To make the battery of tests useful, fun, and frame all my soon-to-occur work, I have to work in almost a sort of macro automation workflow. Python provides that, and my “test behavior” is the obvious place to go to start working. I need to remind myself how the battery of tests gets kicked off. There’s an “igniting my interest” and getting the ball rolling event that I’m dancing around here. It IS chasing the rabbit down the rabbit hole, which is what gives me such pause. But if done correctly, I can keep myself precisely on-topic and on-plan, and the rewards are great.

So much about Tiger is about scraping, and scraping shouldn’t require access to functions, knowing how to program, or exposure to Python’s idiosyncrasies. EZScrape is the idea. We need a strong identity for the tab, and EZScrape is it. Just do an easy scrape. Just ezscrape it. Go to the ezscrape tab and copy one of the existing functions. To get your pattern, just fill in the target column, and put a question mark in the three pattern columns. One of them should come back with the correct guess.
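The question-mark guess is the speculative part. Here’s one hedged sketch of how the regexp column’s guess might work: find the target text in the page, escape it, and anchor it to the characters immediately around it. The function name and approach are hypothetical, and guessing xpath or jquery patterns would need a real DOM walk, which this skips:

```python
import re

def guess_regexp(html, target):
    """Hypothetical sketch of the '?' convention for the regexp column:
    derive a pattern from the page source and the desired target text.
    Wraps the escaped target in a capture group anchored to the text
    around it, so the pattern can generalize to similar pages."""
    escaped = re.escape(target)
    # Locate the literal target and grab the non-tag characters on
    # either side of it to use as anchors.
    m = re.search('>([^<>]*)' + escaped + '([^<>]*)<', html)
    if not m:
        return None
    return ('>' + re.escape(m.group(1)) + '(' + escaped + ')'
            + re.escape(m.group(2)) + '<')

html = '<html><body><h1>Price: $42 today</h1></body></html>'
pattern = guess_regexp(html, '$42')
print(pattern)
print(re.search(pattern, html).group(1))  # -> $42
```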

Actually, I’m thinking that in order to prevent ambiguity, only one of the pattern matching methods should ever be filled in. You have the option to use any of them, but only one is necessary; if you do more than one, I will have to figure out how to handle it. For reverse compatibility, these should essentially be turned into global functions, so there will have to be a function-name column. A url and target will be necessary as input, so that question marks under the pattern columns can be replaced with the guessed-at pattern. And each pattern-in needs a match-out; that match-out may have multiple values, so there needs to be an index column, plus one further column for the final output, which uses the index column if a value is there. And finally, each match needs a description.

name, description, url, target, xpath, xmatch, jquery, jmatch, regexp, rmatch, index, match
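For concreteness, here’s what one hypothetical EZScrape row with those twelve columns might look like held internally as a Python dict—every value is illustrative, not real Tiger data:

```python
# One hypothetical EZScrape row using the twelve columns above.
# Only one pattern column need be filled in; the rest stay empty
# until the system fills them (the *match columns) at run time.
row = {
    'name':        'firstp',                    # becomes the global function name
    'description': 'First paragraph of the page',
    'url':         'http://example.com/',       # sample URL for health checks
    'target':      'some expected text',
    'xpath':       '//p',
    'xmatch':      '',                          # filled in by the system
    'jquery':      '',
    'jmatch':      '',
    'regexp':      '',
    'rmatch':      '',
    'index':       0,                           # which match to keep
    'match':       '',                          # final output
}
print(sorted(row))
```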

Okay, I have a plan. But how do I put this into motion without totally nuking productivity? This is truly a killer capability for Tiger. It makes it much more broadly appealing, and provides a form of function-making that can be released to the public without too much power exposed. Let’s follow that 1, 2, 3 step approach that worked so well for you in the past.

1… THIS is the hard part. This is why not just everyone is doing this sort of stuff. Things break down when you try to put the pedal to the metal. Okay, okay. The first steps in chasing the rabbit down the hole are the most critical. All following work for months or even years is colored by this step… this magical moment… this step #1…

1. Just simply understand the principle of what you’re trying to do here. Don’t mess around with a user-exposed worksheet tab yet. Get the data structure correct natively in the Tiger code. Plan a data structure so you can start eliminating the mess of throw-away functions. You’re talking about a list of dictionaries. Why? Because each item on the list represents a potential global function. Hmmm. The internal data object can and should be clearer and more efficient, in a form such as:

def scrapelist():
    sl = []
    sl.append({'name': 'firstp', 'type': 'xpath', 'pattern': '//p', 'index': 0})
    return sl

Okay, simply putting such a function into the Tiger code without crashing IS step #1. Woot!

Okay, step #2?

2. Simply make that object have several functions. Don’t worry about making them a diverse sampling yet.

def scrapelist():
    sl = []
    sl.append({'name': 'firstp', 'type': 'xpath', 'pattern': '//p', 'index': 0})
    sl.append({'name': 'secondp', 'type': 'xpath', 'pattern': '//p', 'index': 1})
    sl.append({'name': 'thirdp', 'type': 'xpath', 'pattern': '//p', 'index': 2})
    return sl

3. Identify the PLACE where you need to start stepping through the list, turning each item into a function. It’s the exact same place where you load optional functions out of the worksheets.

Okay, now we’re getting somewhere. Now, write out the name of each item in the list, treating it as a dict. Done.
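The step-through might look like the sketch below. It uses regexp-type rules rather than xpath so it runs on the standard library alone, and a registry dict stands in for Tiger’s real global-function namespace; make_function is a hypothetical name:

```python
import re

def scrapelist():
    # Regexp-type rules here so the sketch needs only the stdlib.
    sl = []
    sl.append({'name': 'firstp', 'type': 'regexp', 'pattern': '<p>(.*?)</p>', 'index': 0})
    sl.append({'name': 'secondp', 'type': 'regexp', 'pattern': '<p>(.*?)</p>', 'index': 1})
    return sl

def make_function(rule):
    """Close over one rule dict, yielding a callable shaped like any
    other Tiger global function."""
    def func(html):
        matches = re.findall(rule['pattern'], html)
        return matches[rule['index']] if len(matches) > rule['index'] else None
    func.__name__ = rule['name']
    return func

# Step through the list, turning each item into a named function --
# the registry dict stands in for the real global namespace.
registry = {}
for rule in scrapelist():
    print(rule['name'])           # "write out the name of each item"
    registry[rule['name']] = make_function(rule)

html = '<p>alpha</p><p>beta</p>'
print(registry['firstp'](html))   # -> alpha
print(registry['secondp'](html))  # -> beta
```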

Next… wow, I haven’t done mental gymnastics like this in a long time. This is very refreshing and empowering. I feel a major code cleanup of the Tiger system coming with every line of code I write now.

Okay, I have it working with xpath. Now, put a regular expression example in there. Hmmmm. I just realized this model could break how testing works. But once all the scraping rules are in one easy-to-run location (we can include a sample URL), it is actually easier to run mass scrape health checks than it is currently.
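That mass scrape health check could be as simple as one loop over the rules, each bundled with its sample page—inline markup here stands in for fetching a sample URL, and everything below is a hypothetical sketch, not the real Tiger code:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules, each bundled with sample markup for health checks.
rules = [
    {'name': 'firstp', 'type': 'xpath', 'pattern': './/p', 'index': 0,
     'sample': '<html><body><p>hello</p></body></html>'},
    {'name': 'title', 'type': 'regexp', 'pattern': '<title>(.*?)</title>', 'index': 0,
     'sample': '<html><head><title>Tiger</title></head></html>'},
]

def run_rule(rule, html):
    if rule['type'] == 'xpath':
        # ElementTree's limited XPath subset, for well-formed samples only.
        matches = [el.text for el in ET.fromstring(html).findall(rule['pattern'])]
    else:
        matches = re.findall(rule['pattern'], html)
    return matches[rule['index']] if len(matches) > rule['index'] else None

def health_check(rules):
    """One loop exercises every scraping rule against its sample --
    the mass health check scattered per-function tests can't give."""
    return {rule['name']: run_rule(rule, rule['sample']) for rule in rules}

print(health_check(rules))  # -> {'firstp': 'hello', 'title': 'Tiger'}
```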

This is going really well. With all the time in the world, I would combine testing, scraping and trending. It’s a golden trifecta of moving the Tiger system forward. But it’s already 3:10 PM, and the window of opportunity is slamming shut on me. Tiger could use a few old-school all-nighters of the type I used to put in when I was younger and didn’t have a kid. But I have a new baby now.

Okay, so it comes down to strategy and working smart.

I’m looking now at the Twitter, Pinterest, Foursquare, Gowalla and other scraping functions. There are A LOT, and I’m thinking I need a philosophy to determine what gets moved into ezscrape and what doesn’t. I’m not trying to make busy-work here. It’s all about making simple what CAN be simple, and a lot of these things introduce complications that necessitate custom programming. Not in every case, but Twitter definitely does.

Okay, it’s coming up on 5:00 and I have to reach a good stopping point. I did some remarkable work today. AND it can be actively incorporated now, because these rules even get auto-documented with the auto-documenting feature, thanks to their being no different from any other global function. The only drawback is that in its current state, it doesn’t show the column dependencies, but that’s a small loss for such a code-cleanup win.

But this isn’t just about code cleanup. This is about giving a powerful screen scraping interface to Tiger users, so they can rapidly do all sorts of screen scraping tasks by copy-and-paste example, rather than having to interact with Python functions. Today’s work was really just a portion of what needs to be done. What remains is: