Fixing Twitter Screen-Scraping Function - Forced onto API

by Mike Levin SEO & Datamaster, 02/29/2012

I’m running out of time today to fix a very important Tiger function. Both external clients and an internal audience rely on it, and I’m currently doing it through screen-scraping because that’s so much easier than the API. I just really have to focus. I lose focus so easily. Let’s use the 1, 2, 3 method…

1. Establish a before-and-after. Pull up a page that has the symptom. Done.
2. Find a page where the data REALLY IS available in view-source. Done.
3. Make sure the URL you’re pulling is exactly the same as that. Done.
4. Make sure that the HTML you’re getting back through Python is the same as view-source from the Web browser. IT’S NOT!

Ah ha! If you’re not logged in (to the normal Web user interface), the numbers aren’t emitted in the HTML. They must be populated by JavaScript after the page loads. Hmmmmm. Okay.
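A quick way to confirm that diagnosis is to run the same extraction against both versions of the page: the one saved from the browser's view-source and the one fetched from Python. The sketch below is only an illustration. The function name `scrape_count`, the markup in both sample strings, and the number 1,234 are all made up, and the real Twitter HTML of the time surely looked different.

```python
import re

def scrape_count(html, token):
    """Return the first integer found shortly after `token`, or None.

    Allows up to 20 non-digit characters (quotes, tag brackets, spaces)
    between the token and the number.
    """
    pattern = re.escape(token) + r"\D{0,20}?(\d[\d,]*)"
    match = re.search(pattern, html)
    if match is None:
        return None
    # Strip thousands separators before converting to an integer.
    return int(match.group(1).replace(",", ""))

# Hypothetical stand-ins for the two versions of the same page.
browser_html = '<span id="followers_count">1,234</span>'  # logged in
fetched_html = '<span id="followers_count"></span>'       # not logged in

for label, html in (("browser", browser_html), ("python", fetched_html)):
    print(label, scrape_count(html, "followers_count"))
```

If the browser copy yields a number and the fetched copy yields `None`, the values are being filled in after page load and no amount of regex tuning will recover them.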

It’s 4:30, and I think I can have this thing nailed by the time I leave at 6:00. What I’m doing here is taking something that has worked through screen scraping for quite some time, minus the occasional fix, and switching it over to something more formal. Twitter is clearly taking a stand against screen scraping and pushing you toward the API. The problem with using APIs for simple lookups, such as the number of Twitter followers or the number of tweets, is that providing login credentials is silly for things that are available without login on the main website.

But Twitter has taken a series of steps that break this screen-scraping approach. First, they made the old profile URL format, http://twitter.com/username, stop working. Then they started using the new convention for bookmarkable Ajax pages: the hash/bang ( #! ), which Google advocated as a way to bookmark “state” within Ajax or Flash applications. As a result, the new profile addresses started looking like http://twitter.com/#!/username. Most recently, Twitter switched from plain http to always-on secure https, making the new profiles https://twitter.com/#!/username.

And finally, the problem that I ran up against today is that they have started emitting different HTML based on whether you are logged in (to the normal webpage) or not. Only if you are logged in will you get the actual numbers like Tweets, Following and Followers in the view-source HTML (i.e. the HTML of the initial page load). For non-logged-in page views, you get the HTML framework, but the values must be filled in after page load by JavaScript. In other words, they just broke screen scraping, unless you execute the page’s JavaScript with a headless web browser like PhantomJS or HtmlUnit, which is overkill if an API is available. And so I start googling friends_count, statuses_count, and the other field names that were my screen-scraping tokens in the past, and lo and behold, I find this page:

https://dev.twitter.com/docs/api/1/get/search

…which teaches me how to build a URL like:

https://api.twitter.com/1/users/lookup.json?screen_name=miklevin&include_entities=true

…which means that I can now get the function working again, using a credential-free API. Woot!

Okay, I think I would rather just use:

https://api.twitter.com/1/users/lookup.json?screen_name=miklevin
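For reference, here is a minimal sketch of building that URL in code rather than by string-pasting, so the screen name gets properly escaped. It assumes Python 3's `urllib.parse`; the helper name `build_lookup_url` is made up for illustration, and the sketch only constructs the URL without making any request.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs above.
API_BASE = "https://api.twitter.com/1/users/lookup.json"

def build_lookup_url(screen_name, include_entities=False):
    """Build the users/lookup URL for one screen name (hypothetical helper)."""
    params = {"screen_name": screen_name}
    if include_entities:
        params["include_entities"] = "true"
    # urlencode escapes any characters that are not URL-safe.
    return API_BASE + "?" + urlencode(params)

print(build_lookup_url("miklevin"))
print(build_lookup_url("miklevin", include_entities=True))
```

The two calls reproduce the short and long forms of the URL shown above.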

Okay, I got my work done and announced it to my stakeholders. I’m ashamed to say that I just adjusted my regular expressions to pull the correct data out of the JSON as text, rather than parsing it into real data. I will put switching it over to real data on my list. Why does grabbing data out of a JSON object still feel like more work than a regex match? I think I need to find or write a support function to pull a value out of a JSON object (actually a Python object after conversion) by key name, regardless of where the key appears in the object tree.
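That support function could look something like the sketch below: parse the JSON text with the standard `json` module, then walk the resulting dicts and lists depth-first until the key turns up. The name `find_key` and the sample payload are made up for illustration. The field names mirror the ones mentioned earlier, but the values are invented and the real response carries many more fields.

```python
import json

def find_key(obj, key):
    """Return the first value stored under `key` anywhere in the tree.

    Searches dicts and lists depth-first. Returns None when the key is
    missing, so a value that really is None looks the same as "not found".
    """
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None  # strings, numbers, etc. have nothing to search
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

# Invented sample shaped like a list of user objects.
sample = """
[{"screen_name": "miklevin",
  "followers_count": 1234,
  "friends_count": 567,
  "status": {"text": "hello", "retweet_count": 0}}]
"""

data = json.loads(sample)
print(find_key(data, "followers_count"))  # top-level field of the user
print(find_key(data, "retweet_count"))    # nested inside "status"
```

With this in place a lookup is a one-liner, about as short as a regex match, and the result is a real integer instead of a matched string.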

Happy Leap-Year day!