Mike Levin SEO

Future-proof your technology-skills with Linux, Python, vim & git... and me!

Re-architecting proxy-usage process during screen scraping

by Mike Levin SEO & Datamaster, 07/16/2013

Note: This is another of those rambling daily work journal posts. In it, I re-think the architecture of how I use proxy servers from within a data mash-up system I created called 360iTiger. I both re-think it and successfully implement those new thoughts. If such geekery is of interest to you, read on.

Wow, this is the first morning where a recently recorded thought has carried right over into the next morning’s sitting. Wow. This is how my journal is intended to work. How could I ever have thought of throwing out the original Tiger code? I wish I hadn’t thrown out the very first version when I moved it off of that little Israeli FitPC at the back of my Time Warner cable box at home and onto the Rackspace cloud. There was so much thought there, and there is so much new thought here. The chances of a continuous thought-journal going into the very same place over the years are extremely slim in the course of one’s life, and probably through history as a whole.

One of the lessons I wish to impart to a new generation of up-and-coming techies is that you should keep a single master journal. Keep it very secure. Keep other journals too - such as one for work (like this), or one for a book you’re writing on the side, or for recording what you see if you’re an artist. You can keep a purely artistic journal, a private deep-thoughts journal, and a work journal. They are just thought-processing devices - though they could probably be effectively mined for memoirs, even if memoirs are never officially written. There will be memoir search algorithms.

I did a screen capture of my desktop just now. I feel like I’m hitting the best Zen of being at one with my work-at-hand right now. I’ve sort of been falling behind, and I’m committing to deadlines with Jason, so I really have to focus like a laser-beam and deliver. Deliver proxy today.

It started with the attempt to simplify the recording of which proxy was used. And I was missing the best solution. Who cares? Screw the geography of the query. Just go for “works” and “will work eventually if you keep trying.” Get rid of state. Record the proxy only while it is being used, and delete the record as soon as the work is done. That way, a new proxy will be selected EVERY TIME, and there will be no record of what proxy was used EXCEPT FOR WHILE it is being used. And that is the breakthrough today. A masterful application of the 80/20 rule - which seems to apply to almost everything.
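Here is a minimal sketch of that stateless scheme. Only the record-while-in-use, delete-after idea comes from the post; the PROXIES list, the dict standing in for the config tab, and the pretend_fetch() helper are all placeholder assumptions of mine:

```python
import random

# Placeholder proxy pool; the real list would come from elsewhere.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

config = {}  # stands in for the spreadsheet's config tab


def fetch_via_proxy(url):
    proxy = random.choice(PROXIES)   # a NEW proxy every time
    config["proxyused"] = proxy      # recorded only WHILE in use
    try:
        return pretend_fetch(url, proxy)
    finally:
        config.pop("proxyused", None)  # no state survives the call


def pretend_fetch(url, proxy):
    # Dummy stand-in for the real scrape.
    return "fetched %s via %s" % (url, proxy)
```

Because the `proxyused` record is removed in a `finally` block, no state lingers even if the fetch raises - which is exactly the "get rid of state" property described above.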

Actually push more of your daily journal entries out. You are much like the photo-taker who never pushes their photos out the day of the event so everyone can see what’s going on WHILE it’s going on. It’s more generous if you let people experience your joys and thrills and experiences along with you while you have them. That’s how you build audience. You don’t build it with a filtered, sanitized, time-delayed snapshot of that experience you had yesterday. And so in that spirit, I will start pushing these journal entries out more often.

So. So. 1, 2, 3… 1!

1. Rip out the existence of the proxydate config tab entry. It has no place in the new scheme. The existence of the proxy entry in the config tab is all you need.

I’ve got a status update meeting with Jason at 2:30 today.

My daily, recurring self-discipline memes are right now quite weak, having come out of client-facing work and meeting-interrupt head-space only a few months ago. I’m in the process of coding in focus-enabled self-discipline memes so as to get my work done in a super-smart and enjoyable fashion in my new job role.

This is a challenge for me, as I am so distractible, and I can see my previous bread and butter getting chopped up in the saw-mill, as Google re-tools the factories of the content-reward system. See? High stress. High stakes. And the ambition to instill the 2-day work week as a realistic economic option for those smart and motivated enough sometime well within the next 20 years.

And so, on to #2. Determine the desired behavior across the matrix of input possibilities. And try to reduce the complexity of the matrix by controlling input. Behavior? …

proxyme() is invoked, optionally including a URL to test proxying with.

We immediately check to see if there is a recorded proxy in use… do we even need to? Yes: you don’t want a new proxy chosen on every row, and that’s what’s currently happening. So, we do check if proxyused is in cfg.config.
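As a concrete sketch of that check: the proxyme() name and the cfg.config lookup are from the post, but the proxy pool, the Cfg stand-in class, and the random selection are invented placeholders:

```python
import random

PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080"]  # placeholder pool


class Cfg:
    """Stand-in for the real cfg module's config tab."""
    config = {}


cfg = Cfg()


def proxyme(testurl=None):
    """Return a proxy, reusing the one already recorded in the
    config tab so a new one is NOT chosen on every row."""
    if "proxyused" in cfg.config:
        return cfg.config["proxyused"]      # reuse mid-run
    proxy = random.choice(PROXIES)
    if testurl is not None:
        pass  # a real implementation would test the proxy against testurl
    cfg.config["proxyused"] = proxy         # record for subsequent rows
    return proxy
```

The point of the early `in cfg.config` test is that rows two through N all get the same proxy the first row selected, until something deletes the entry.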

Stop being apologetic in your mind about how long these things are taking. When you think it out like this, you really are pulling off some quite complex things - making machinery no one else much uses, but which gives you and your tribe a real advantage.

So, I’m going to give an awesome example of why things are currently taking a long time, and the sort of deep dive you’re about to do into the Tiger code to make this Proxy thing really work. It’s 2 deep-ish dives. One still into the ezscrape architecture. And another into deleting the proxyused entry from the config tab after every use. I may also add a precautionary config tab entry deleting at the known first part of the proxying procedure.
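The precautionary delete at the start of the proxying procedure could look something like this - begin_proxy_procedure is a hypothetical name, and a dict stands in for the config tab; only the "clear any stale proxyused entry first" idea is from the post:

```python
def begin_proxy_procedure(cfg_config):
    """Precautionary delete at the known first step of proxying:
    if a previous run died mid-flight and left a stale 'proxyused'
    entry behind, clear it before anything else reads it."""
    stale = cfg_config.pop("proxyused", None)
    return stale  # the stale value, or None if the tab was clean
```

Returning the stale value (rather than discarding it silently) makes it easy to log when a crashed run actually left debris behind.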

Had a FANTASTIC meeting on the vendor form. I am so the right person for this prototyping role, it’s not even funny. Take things out of all the other “tasked” resources’ critical path. Work miracles. Be quite expert in a much broader array of problem domains than even your biggest detractors ever dared to imagine. Yep, I’m a SQL guy. I’m a pretty hardcore old-school SQL Server guy who goes back to 6.0, when you couldn’t rename fields.

I’m going to resist doing a “journal cut” today until the work-at-hand is actually complete! That’s my self-discipline trick this morning. I was already distracted responding to things on Facebook and on YouTube. We must give our day-job its complete fair share - even if your effectiveness is a function of your engagement in the community… is it? No, I carry my tools with me. I am a master craftsman with highly portable tools. I do not have such social app dependencies to be productive and get stuff done. And so, on with it!

Okay, architecturally, we have to do clean-up work. It can be run WHENEVER Tiger exits.
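In Python, one way to guarantee cleanup runs whenever the program exits is the standard-library atexit hook. This is only a sketch of the idea, assuming a dict stands in for the config tab; whether Tiger actually uses atexit is my assumption:

```python
import atexit

config = {"proxyused": "10.0.0.1:8080", "keeper": "stay"}


def cleanup():
    # Run whenever Tiger exits: scrub transient proxy state,
    # leaving every other config entry untouched.
    config.pop("proxyused", None)


atexit.register(cleanup)
```

atexit handlers fire on normal interpreter shutdown (though not on a hard kill), which fits "WHENEVER Tiger exits" for the common cases.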

Tue Jul 16 13:32:09 EDT 2013 Not doing a journal cut, but I AM doing a datestamp to capture the progression of time. Just spent some time talking with Tanya about Amazon data capture. It is so clear that Tiger is the cure for so many of the things that ail you.

Okay, so… find a place where we can delete out of the config tab universally.

Found the spot to do this.

Success! I made a deleteconfigvalue function and nailed it on my first try! Commit, and get this journal entry out there. And then go back and do the ezscrape support, which probably won’t justify a journal entry, because the introspective craziness that makes it all work won’t make any sense at all in a public journal.
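For illustration, a dict-backed sketch of what a deleteconfigvalue contract could look like - an assumption on my part, since the real function deletes the entry from the spreadsheet’s config tab rather than from a dict:

```python
def deleteconfigvalue(config, key):
    """Remove a key from the config tab if present, reporting
    whether it existed. Dict-backed sketch of the contract only;
    the real function operates on the spreadsheet config tab."""
    if key in config:
        del config[key]
        return True
    return False
```

Reporting whether the key existed makes the function safe to call "universally" - a second, precautionary delete is a harmless no-op.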