Okay, I cranked out ezscrape features for both Amie and Jennifer on Monday and Tuesday. ezscrape is turning out as beautifully as I had hoped. It looks like today I may have the focus to do one of those videos. But because tomorrow is my last day in before my Colorado trip, I’m thinking better of that. Today is for polish. Today is for practicing my video-shoot tomorrow. Practice and polish. Deeper familiarity with the internal/external user behavior split… but don’t introduce bugs! And looking at incorporating JSONPath and JQuery for ezscrape. Also maybe filtering which scrapes show for internal vs external users. Perfect subject-matter for today. All inter-related. All of immediate, important use to the organization. Ready, set, go! 1, 2, 3… 1?
Okay, so first and foremost is the 1000-row limit that Samir is running into. I thought that was in my configuration file, but it’s not. Then I thought it was in the main Tiger function. But it’s not. Oh, wait! I remember! It’s there so memory can’t be nuked on the webserver. Ha! Okay, I changed it to be tied to the maxrow value in the configuration file. Now, the Tiger system is hardwired to not let more than 5000 rows at a time be processed for question-mark replacement. I can up it pretty easily now through the config file. I just need to be careful when doing so, because Tiger is not using the most memory-efficient method possible to scan for question-marks in the worksheets. That is something I would do differently if I rewrote it.
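The shape of that fix is simple enough to sketch. This is not Tiger’s actual code — the file name, section name, and function are all illustrative, and it’s written in modern Python 3 — but it shows the idea of capping a batch at a maxrow value read from a config file instead of a hardwired constant:

```python
# Hypothetical sketch of a config-driven row cap (names are illustrative,
# not Tiger's real config keys or functions).
import configparser
import io

config_text = """
[tiger]
maxrow = 5000
"""

config = configparser.ConfigParser()
config.read_file(io.StringIO(config_text))
maxrow = config.getint("tiger", "maxrow")

def rows_to_process(rows, limit):
    """Return at most `limit` rows for question-mark replacement,
    truncating rather than risking webserver memory."""
    if len(rows) > limit:
        return rows[:limit]
    return rows

print(len(rows_to_process(list(range(6000)), maxrow)))
```

Raising the limit then becomes a config edit rather than a code change, which is exactly the point.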
Okay, fixing the max-row issue for Samir was actually pretty quick. Now, I need to both follow through on my work yesterday with Amie and Jennifer, wrapping anything back into the greater system that needs to be, and deal with any learnings. Already with Amie, I had a learning. For her job to fit into ezscrape, some values need to be hardwired into the regex pattern, which is fine for self-help, but terrible for rolling things out system-wide. In general, Amie’s function is a matter of pulling things off of a search results page, and that might be deserving of its own function. Not everything needs to be ezscrape. But it does prove that in a pinch, the self-help operation of ezscrape lets you do amazing new functions with no programming—beyond RegEx, which IS an issue. Regular Expressions really are that strange.
Okay, fix a bug you encountered when working on Jennifer’s Flickr function, in which, when datatype is set to number, the wrong number comes back. If you remove the number datatype and let it get handled like a string, then it comes back correct… oops, found a bug in the twweb function and fixed it. I have to be more explicit in my regex patterns. Keep focused! Next? That number bug. Copy Jennifer’s spreadsheet to recreate it. Now, recreate the bug. Ugh! It’s working perfectly now. Okay, just keep an eye on it. I think it may have been a result of formatting, the way parentheses make negative numbers and such… oops, wait! I am getting different numbers occasionally! I’m going to guess that yesterday, it was just a coincidence or a caching issue where I couldn’t get the different numbers to come up in the browser. Problem identified. There is no problem. Things are working perfectly, but the numbers actually in the data source occasionally jump around.
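That parentheses-make-negative-numbers formatting theory is worth pinning down. Here’s a minimal sketch — not the actual Tiger datatype handler, just an illustration of how accounting-style display values would need to be normalized before being treated as numbers:

```python
def parse_display_number(text):
    """Parse a scraped display value like '1,234' or '(1,234)' into a
    number. Accounting-style parentheses mean negative. Illustrative
    only; not Tiger's real number handling."""
    s = text.strip()
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]               # strip the parentheses
    s = s.replace(",", "")        # strip thousands separators
    value = float(s) if "." in s else int(s)
    return -value if negative else value

print(parse_display_number("1,234"))    # 1234
print(parse_display_number("(1,234)"))  # -1234
```

If the string path happens to pass such values through untouched while the number path chokes on the punctuation, that would explain the symptom.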
Okay, NEXT! Roll these Flickr functions system-wide. They’re ready for prime time. Pshwew! Okay, Flickr functions are in the main code-base and will roll out system-wide an hour after the next commit. NEXT! I’m dying to incorporate JSONPath. That’s a big win, because it makes the ezscrape patterns for JSON APIs not require RegEx, and that makes it just much more pleasant to talk about and promote ezscrape internally. That bit of work is in competition now with actually producing a video… oh, which I decided I wasn’t going to do today. Okay, bear down and get JSONPath done. 1, 2, 3…
Okay, we’re talking about a new Python module not part of the standard library. I avoid that stuff, generally. But the benefit here is just too great. It’s going to have to become part of my process of creating a new server, so I will add it to my install.txt file. Also, because I’m now using multiple load-balanced servers, I will need to do to all the servers whatever I do to one. I have cssh for that, which has been working pretty well, and should do the trick in this case as well. How do we test? What is the 1? I can test this thing on my local machine or a test server. Okay, I have it downloaded onto my local machine from off of GitHub, and untarballed. I want to do a before and after, so I can see it not work before the install.
1. Make a bare-bones file to do an http call to a JSON API. I’ll use my Twitter profile…
import urllib2
response = urllib2.urlopen('https://twitter.com/users/show_for_profile.json?screen_name=miklevin')
somejson = response.read()
print somejson
2. Figure out what piece of data you’re trying to grab out of it. Okay, the JSON object is littered with url fields inside of large complex objects. I will try to pull back everything in a field named url. That is roughly equivalent to the trick I do all the time with the double forward slash in XPath, called recursive descent. For example, this will pull back every h1 in the document:

//h1
And so, the page at http://goessner.net/articles/JsonPath/ tells me that dot dot is the syntax for recursive descent. It also looks like indexes are much easier to use in JSONPath than they are in XPath. It’s a very nuanced point and one of the most annoying things about XPath from my perspective for my use, but if you have multiple h1s, and you want to pull back the first, it’s not simply:

//h1[1]
…as you would think (XPath uses a 1-based index). Because of order of precedence, what this is actually saying is: pull back ALL THE h1s that are the first h1 child of their parent. In other words, this would return a list, and not just the first h1 as you might expect. The index actually belongs to the wrong group. So as with many languages, parentheses deal with that grouping, so if you want to get the absolute first h1 on the page, it’s:

(//h1)[1]
Okay, so how do we say this same thing in JSONPath? I’m hoping the behavior is better. Now, unlike the XML that XPath is designed to work against, JSON doesn’t always guarantee a root element, so you can’t just start with /, assuming root. There is therefore a special token that I believe is borrowed from JQuery to represent the highest you can go in the object, root element or not. And that is $. Then, recursive descent in JSONPath is .., and the indexing system is similar, except that JSONPath indexes are zero-based where XPath’s are one-based. Soooo, we should be able to grab the first h1 in a JSON object with:

$..h1[0]
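To make the recursive descent idea concrete: what $..url does is walk the entire nested structure and collect every value stored under that key, at any depth. Here’s a toy pure-Python sketch of that behavior — not the jsonpath module’s actual implementation, and the sample data is made up:

```python
def descend(obj, key):
    """Collect every value stored under `key` anywhere in a nested
    JSON-style structure of dicts and lists -- roughly what JSONPath's
    $..key does. A toy sketch, not the jsonpath module's code."""
    matches = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                matches.append(v)
            matches.extend(descend(v, key))  # keep descending into values
    elif isinstance(obj, list):
        for item in obj:
            matches.extend(descend(item, key))
    return matches

doc = {
    "user": {"url": "http://example.com/a",
             "entities": [{"url": "http://example.com/b"}]},
    "url": "http://example.com/c",
}
print(descend(doc, "url"))
```

Every url, at every depth, comes back in one flat list — which is both the power and, as it turns out later, the hazard of recursive descent.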
And in Python parlance, that would be expressed as:
import urllib2, jsonpath
response = urllib2.urlopen('https://twitter.com/users/show_for_profile.json?screen_name=miklevin')
somejson = response.read()
match = jsonpath.jsonpath(somejson, '$..url')
Oops… you WISH it was that easy. What we’ve gotten back here is plain text in memory. We need to parse it into real Python objects (dicts and lists) in memory before JSONPath can walk it. So, I have to do json.loads.
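The distinction is easy to see in isolation. Before json.loads you have a string that merely looks like JSON; after, you have real nested dicts that a path expression can actually walk (sample data is made up, shown in modern Python 3):

```python
import json

# What response.read() hands you: just a string of characters.
jsontext = '{"user": {"url": "http://example.com", "followers_count": 42}}'
print(type(jsontext))   # a str, not a data structure

# json.loads parses it into real nested dicts and lists.
somejson = json.loads(jsontext)
print(type(somejson))   # now a dict
print(somejson["user"]["url"])
```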
import urllib2, jsonpath, json
response = urllib2.urlopen('https://twitter.com/users/show_for_profile.json?screen_name=miklevin')
jsontext = response.read()
somejson = json.loads(jsontext)
match = jsonpath.jsonpath(somejson, '$..url')
print match
Go to staging server.
cd /tmp
wget http://www.ultimate.com/phil/python/download/jsonpath-0.54.tar.gz
tar -xvf jsonpath-0.54.tar.gz
cd jsonpath-0.54
python setup.py install
Create the jsonme.py test file on the staging server and make sure it runs like it does on your local machine… check!
Okay! JSONPath installed on all the Tiger servers. Add that documentation to install.txt… okay, done.
Now, do the deed. Find where in Tiger you look for the pattern match type for ezscrape functions. Do a commit first.
Okay, one of my realizations during this adventure is that the datatype column is a complete misnomer. It has one and only one function: grabmaxnum. If grabmaxnum is set to true, it should do exactly that. If not, it should work as you would normally expect. There is too much encouragement to put “number” under datatype if it is a number, and that is bad. It should be left blank until you want that very special automagic functionality. Do a commit first, then change that.
Oops. It’s going to be used to format numbers more than it will to grab the maximum number. Its identity should be formatnumber… okay, done!
Wow. I mean, wow! I didn’t have to use recursive descent for Twitter JSON after all. In fact, it was a problem because so many things got returned with the same field name. It turns out that the correct JSONPaths for what I was looking for was:
user.followers_count
user.friends_count
user.statuses_count
…and so on! It couldn’t be easier (now that it’s done). Here’s some sample code:
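In fact, paths this simple don’t strictly need JSONPath at all — a plain dotted lookup over the parsed object is a few lines of stdlib Python. A sketch only (Python 3, made-up sample data), assuming no wildcards, filters, or recursion:

```python
import json
from functools import reduce

def dotted_get(obj, path):
    """Walk a simple dotted path like 'user.followers_count' through
    nested dicts. Sketch only: no wildcards, filters, or recursion."""
    return reduce(lambda node, key: node[key], path.split("."), obj)

jsontext = '{"user": {"followers_count": 1234, "friends_count": 56}}'
somejson = json.loads(jsontext)
print(dotted_get(somejson, "user.followers_count"))  # 1234
```

The jsonpath module earns its keep once you need recursive descent, slices, or filters; for bare dotted paths, either approach works.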
import urllib2, jsonpath, json
response = urllib2.urlopen('https://twitter.com/users/show_for_profile.json?screen_name=miklevin')
jsontext = response.read()
somejson = json.loads(jsontext)
match = jsonpath.jsonpath(somejson, 'user.followers_count')
print match
Pshwew! I’ve had so much success, I may try to cram in JQuery before I’m done today.