Mike Levin SEO

Future-proof your technology-skills with Linux, Python, vim & git... and me!

Inching Towards Easy Screen Scraping Tab / Python List of List of Dicts

by Mike Levin SEO & Datamaster, 07/14/2012

Okay, I’m going to try to make this a true Focus Friday. There are several particular challenges to overcome. This is my career we are talking about here. I must produce fairly regular inspired work.

Okay, so how to get started? It’s a 1, 2, 3 step procedure again.

Ugh! What point are you even at? A day like yesterday and a morning like today really throw off your groove. THIS is where productivity gets lost. I was so pumped over this project, and just the normal day-to-day that everyone has to deal with is enough to derail inspired work.

I must improve my thought-to-dazzle ratio. I’m thinking too much, and not announcing new, amazing things enough. That needs to change. And ezscrape is VERY announceable. It’s the best of all possible Tiger improvements: simplifying the overall codebase, making new functions easier to deliver, improving the self-help aspect of Tiger, etc.

I just have to hit it home, and make it sexy, sexy, sexy!

1, 2, 3… 1?

Where are you at right now?

Change the url key in scrapelist into testurl. Done. Add testcondition key/value pair to all scrape functions.

Ugh! It’s a bad idea to intermix the sample URL and test conditions into where you define the scrapes. It just makes it too confusing, and wayyyyyy too confusing when exposing it to the user. Looking at the work I did to test functions is VERY enlightening. I will indeed need a testing routine for all the scrapes that does use test URLs / http calls, responses, a test against the response, and whether the response passed the test—basically, exactly what I did for generic tests, but for all the scrapes, and SEPARATE from the work I’m doing today.

Anyway, that was a false start.

The most important thing now is checking for the existence of an ezscrape tab, but only if it WASN’T JUST CREATED, and if found, loading its contents into a list of dicts, and handling it just like we do ascrapelist… except there is extreme order dependency. I want to avoid order dependency issues if I can. So, the scenarios are…

1. the ezscrape tab is not exposed
2. the ezscrape tab HAS JUST BEEN EXPOSED but user hasn’t had a chance to do anything yet
3. the ezscrape tab has already been exposed

In the first case, ONLY the internal scrapelist can be used.
In the second case, ONLY the internal scrapelist can be used.
In the third case, both the internal list and the external list can be used, but the external one always overrides the internal one.

It’s a tiny bit inefficient, but it seems to me that the best way to handle this is to always process the internal scrapelist, then to process the complete external scrapelist.
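A minimal sketch of the three scenarios, to make the rule concrete. The function name and the `tab_exists` / `just_created` flags are hypothetical stand-ins for however the real code detects the tab; only the ordering rule comes from the reasoning above.

```python
def lists_to_process(internal, external=None, tab_exists=False, just_created=False):
    """Return the scrape lists in the order they should be processed.

    Hypothetical helper: the flags stand in for whatever check the real
    code uses to detect the ezscrape tab.
    """
    lists = [internal]  # the internal list is always processed, and always first
    if tab_exists and not just_created and external:
        lists.append(external)  # processed last, so its entries override
    return lists


internal = [{'foo': 'bar', 'spam': 'eggs'}]
external = [{'foo': 'jar', 'spam': 'legs'}]

case1 = lists_to_process(internal)                          # tab not exposed
case2 = lists_to_process(internal, external, True, True)    # tab just exposed
case3 = lists_to_process(internal, external, True, False)   # tab already exposed
```

Scenarios 1 and 2 collapse into the same result, which is why always processing the internal list first costs so little.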

I really hate to go there, but… we’re talking about a list of list of dicts. Ugh! Okay, to clarify, my current looping structure looks something like this:

ascrapelist = [{'foo': 'bar', 'spam': 'eggs'}]
for adict in ascrapelist:
    pass  # do something

But now, we have two lists of dicts, which are very similar, but some overriding has occurred…

bscrapelist = [{'foo': 'jar', 'spam': 'legs'}]

So, now we want to process ascrapelist and bscrapelist identically, but since the keys become function names, the last one executed wins, and there is indeed an unavoidable order dependency if we go this route. So, we must process ascrapelist first, and bscrapelist second. There is also the issue that I may just cherry-pick entries from ascrapelist for exposing to end users, holding back some of the more complex or confusing entries. Therefore, it’s actually quite critical that the whole of ascrapelist gets processed all the time, because bscrapelist may not be everything necessary for the complete system.

Okay, so the looping structure changes like this:

ascrapelist = [{'foo': 'bar', 'spam': 'eggs'}]
bscrapelist = [{'foo': 'jar', 'spam': 'legs'}]
for alistodicts in (ascrapelist, bscrapelist):
    for adict in alistodicts:
        pass  # do something

Thank you, Python! All this means is one more level of indent, and a bit of object name collision avoidance. Do a commit, and get this in place. Your first test won’t actually be loading bscrapelist out of the tab yet, but rather just reusing ascrapelist. Still a good test.
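To see the "last one executed wins" behavior in action, here is a small sketch that uses the same toy lists. The `build_registry` helper and the plain dict it fills are assumptions for illustration; the real system turns keys into function names, which this only imitates.

```python
ascrapelist = [{'foo': 'bar', 'spam': 'eggs'}]
bscrapelist = [{'foo': 'jar', 'spam': 'legs'}]


def build_registry(*lists_of_dicts):
    """Merge lists of dicts in order; later entries overwrite earlier ones."""
    registry = {}
    for alistodicts in lists_of_dicts:
        for adict in alistodicts:
            registry.update(adict)  # same key seen again: the newer value wins
    return registry


# Internal list first, external list second, so the external values survive.
registry = build_registry(ascrapelist, bscrapelist)
print(registry)
```

Swapping the argument order flips the result, which is exactly the order dependency described above.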

Okay, one piece of cleanup I need to do is test whether the object type is a list before trying to use a numbered index on the result, even if an index is provided. I want this to be a “work every time” scenario, and for indexes to not cause constant issues.

Okay, an index of 0 can ALWAYS be included now with no harm. It only applies if the value returned is a list. Checking the type explicitly is typically considered bad practice in Python because of duck typing, but in this case it is fine, because all the xpath, regex, jsonpath and jquery stuff that I plan to use for pattern matching returns a list on multiple matches. For the sake of simplicity, a check against the list datatype is perfectly appropriate.
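A rough sketch of that "index is harmless" rule. The name `apply_index` is hypothetical, and returning None for an out-of-range index is an assumption rather than something the text specifies.

```python
def apply_index(result, index=0):
    """Apply index only when the match result is a list; pass anything else through."""
    if isinstance(result, list):
        try:
            return result[index]
        except IndexError:
            return None  # assumption: an index with no matching item means "no value"
    return result


first = apply_index(['a', 'b'], 0)    # list: index applies
second = apply_index(['a', 'b'], 1)
single = apply_index('a', 0)          # not a list: index is ignored
missing = apply_index([], 0)          # empty list: falls back to None
```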

Okay, I have “slice” supported as a parameter type, but why would you slice? You slice because you’re looking for the NUMBER in a string, and for probably that reason only. I could maybe imagine other scenarios, and certainly can imagine multiple numbers in a string where only one is the one you want. But I feel slice is too confusing, and for the sake of simplicity, I think I should just make a “datatype” field, and if int, number, or any other number of format-suggesting values are put in there, I just feed it through a function that does the work.

The most common use case will be getting a number out of a string, and getting rid of non-number decoration like parentheses and commas that are often used in and around numbers. So basically, I need to check if I already have such a function, and if not, find or make a function for the most common use case.

Okay, I have nothing named “numberify” in my core functions, so think through what it needs to do, particularly for parentheses, commas, and appearing in a string.

Can it be very regular regex? And if the regex match comes back with a list, then you can use index to determine which number you want? Of course!

Okay, so what we do to number-ify is…

This is the 1st of many (3,000)

…and we want to get back just 3000.

Hmmmm. Or actually, I want to get back [1, 3000]

But if it were NOT multiple numbers, it should only return 3000 without a list, so the use of index is not necessary. And how we do that is…

match = re.findall('[0-9,]+', string)

Now, we have:

['1', '3,000']

And I’ve been dying to get started with list comprehensions, and this seems like the perfect baby-step. The basic list comprehension example is this:

squares = [x**2 for x in range(10)]

…and so, my simple chore should be:

match = [x.replace(',', '') for x in match]

Sheesh, that’s easy! Why haven’t I been using list comprehensions everywhere? I really hardly even need a standalone function for this, what I’m doing is so easy.

Wow, I have it implemented most of the way, but I realize a gotcha. I’m relying on the index number to do work in 2 different contexts: first, which pattern match to use, and second, if your matched pattern has multiple numbers and you specify a number datatype, which number to use. I believe 99 times out of 100, you will want the largest number, because the largest number probably has meaning. It’s a counter or such. So, I simply skip the index in that case, and instead take the max of the list. Done.
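Putting the pieces together, a numberify along these lines might look like the sketch below. The function body is an assumption built from the regex and list comprehension shown earlier; converting to int before calling max matters, because max on strings compares them character by character rather than by value. Skipping stray commas and returning None when nothing matches are also assumptions.

```python
import re


def numberify(text):
    """Return the largest number found in text, ignoring thousands commas."""
    matches = re.findall(r'[0-9,]+', text)
    # A lone comma in ordinary prose also matches the pattern, so skip
    # anything that has no digits left once the commas are stripped.
    numbers = [int(m.replace(',', '')) for m in matches if m.strip(',')]
    return max(numbers) if numbers else None


biggest = numberify('This is the 1st of many (3,000)')
nothing = numberify('no digits, here')
```

Here `biggest` ends up as 3000 rather than 1, without any index being involved.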

Okay, in my mind, I’ve all but eliminated the need to use the Python slice ability in the ezscrape function. Keeping it in there means everyone needs to look at and deal with such a confusing thing, and that’s too much overhead. I have datatype, but I get to pre-populate all of those with “number” and most people will know just to leave it. It’s also a nice place for future growth, as I can put any number of functions in there that align with datatypes you’re trying to scrape.
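One way the datatype field could grow over time is a simple lookup table from datatype name to handler function. This is only a sketch: `DATATYPE_HANDLERS`, `convert`, and `digits_only` are hypothetical names, and `digits_only` is a stand-in handler that just keeps the digits.

```python
def digits_only(text):
    """Stand-in handler: keep only the digit characters and return an int."""
    return int(''.join(ch for ch in text if ch.isdigit()))


# Hypothetical table: new datatypes are added by registering another function.
DATATYPE_HANDLERS = {
    'number': digits_only,
}


def convert(value, datatype=None):
    """Run value through the handler for datatype, or return it unchanged."""
    handler = DATATYPE_HANDLERS.get(datatype)
    return handler(value) if handler else value


as_number = convert('(3,000)', 'number')
untouched = convert('(3,000)')
```

An unknown or empty datatype leaves the scraped value alone, so pre-populating the field with "number" stays optional.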

I am sooooo close on ezscrape. What I have to do next is load the worksheet-resident scrape object into blist, whenever a worksheet named ezscrape exists. The rest should take care of itself at this point. Wow!
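For that last step, a sketch of the conversion itself, assuming the worksheet has already been read into a list of rows with the column names in the first row. How the rows are actually fetched from the spreadsheet is not shown here, and `rows_to_dicts` is a hypothetical name.

```python
def rows_to_dicts(rows):
    """Turn a header row plus data rows into a list of dicts, one per data row."""
    if not rows:
        return []  # no worksheet content means nothing to override
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


# Example rows as they might come back from an ezscrape worksheet (made-up values).
rows = [
    ['foo', 'spam'],
    ['jar', 'legs'],
]
blist = rows_to_dicts(rows)
```

The result has the same list-of-dicts shape as the internal list, so it can be fed straight into the existing loop.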