Fixing Twitter Screen-Scraping Function - Forced onto API
by Mike Levin SEO & Datamaster, 02/29/2012
I’m running out of time today to fix a very important Tiger function. Both external clients and an internal audience rely on it, and it’s something I currently accomplish through screen-scraping, because that’s so much easier than the API. I just really have to focus. I lose focus so easily. Let’s use the 1, 2, 3 method…
1. Establish a before-and-after. Pull up a page that has the symptom. Done.
2. Find a page where the data REALLY IS available in view-source. Done.
3. Make sure the URL you’re pulling is exactly the same as that one. Done.
4. Make sure the HTML you’re getting back through Python is the same as view-source in the Web browser. IT’S NOT!
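One common reason step 4 fails is that sites serve different HTML to scripts than to browsers, keying off the User-Agent header. A minimal sketch of a fetch helper that sends a browser-like User-Agent (the function names and the UA string are my own illustration, not from the original post):

```python
import urllib.request

# Hypothetical browser-like User-Agent; sites often serve the default
# Python user agent different (or empty) HTML than the browser sees.
BROWSER_UA = "Mozilla/5.0 (compatible; profile-checker)"

def make_request(url):
    # Build a Request whose headers look more like a real browser's.
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})

def fetch_html(url):
    # Fetch the page and decode it, so it can be diffed against view-source.
    with urllib.request.urlopen(make_request(url)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Diffing `fetch_html(url)` against the browser’s view-source is exactly the step-4 comparison; if they still differ with the header set, the content is likely being injected by JavaScript after page load.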
It’s 4:30, and I think I can have this thing nailed by the time I leave at 6:00. What I’m doing here is taking something that has worked through screen scraping for quite some time, minus the occasional fix, and switching it over to something more formal. Twitter is clearly taking a stand against screen scraping and making you investigate the API. The problem with using APIs for simple lookups, say the number of Twitter followers or the number of tweets, is that providing login credentials is silly for data that’s available without a login on the main website.
But Twitter has taken a series of steps that break this screen-scraping approach. First, they made the old profile URL format, http://twitter.com/username, stop working. Then they adopted the new convention for bookmark-able Ajax pages: the hash-bang ( #! ), which Google advocated as a way to bookmark “state” within Ajax or Flash applications. So the new profile addresses started looking like http://twitter.com/#!/username. Most recently, Twitter switched from plain http to always-on secure https, making the new profiles https://twitter.com/#!/username.
…which teaches me how to build a URL like:
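The original post’s URL snippet is missing here. As a hedged reconstruction: around 2012, Twitter’s REST API v1 exposed an unauthenticated users/show lookup that returned a public profile as JSON, including `followers_count` and `statuses_count`, which fits the credential-free lookup described below. A sketch of building that URL (the endpoint is my assumption about what the post referenced):

```python
from urllib.parse import urlencode

def user_show_url(screen_name):
    # Twitter REST API v1 (circa 2012): unauthenticated JSON lookup of a
    # user's public profile, including followers_count and statuses_count.
    # Assumed endpoint; the post's actual URL is elided.
    return "https://api.twitter.com/1/users/show.json?" + urlencode(
        {"screen_name": screen_name}
    )
```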
…which means now that I can get the function working again, using a credential-free API. Woot!
Okay, I think I would rather just use:
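The snippet the post preferred is also elided. Given the confession in the next paragraph, the approach was to regex the desired fields straight out of the JSON response text. A sketch of what that looks like (field name taken from Twitter’s v1 users/show payload; the function is my own illustration):

```python
import re

def followers_from_text(json_text):
    # Regex-on-JSON-text: pull followers_count out of the raw response
    # string without actually parsing the JSON. Quick, but fragile.
    m = re.search(r'"followers_count"\s*:\s*(\d+)', json_text)
    return int(m.group(1)) if m else None
```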
Okay, I got my work done and announced it to my stakeholders. I’m ashamed to say that I just adjusted my regular expressions to pull the correct data out of the JSON object as text, and not as real data. I’ll put switching it over to real data on my list. Why does grabbing data out of a JSON object still feel like more work than a regex match? I think I need to find or write a support function that pulls a value out of a JSON object (actually a Python object after conversion) by key name, regardless of where the key appears in the object tree.
Happy Leap-Year day!