Future-proof your tech-skills with Linux, Python, vim & git as I share with you the most timeless and love-worthy tools in tech — and on staying valuable while machines learn... and beyond.

Data Jockeying 101

by Mike Levin

Thursday, January 05, 2023

SEO? Technical SEO, for sure. Data Scientist? Nope. Data Engineer? Maybe. But Mikey don’t scale and get on the hook for swarms of daemons. I don’t build houses of cards then live life running interference so they don’t topple. No, we seek small but meaningful stupid data tricks that minimize statefulness and tech liability. Get the reward of your last run. Preserve the good bits for an ever-improving library of good bits. Then rapidly reinvent when the next job comes around. The level of API granularity you choose to work at should be one that allows for exactly that.

Beware The Exaggerated Over-Reaction Language

Many movements in tech are exaggerated over-reactions to the state-of-tech shortcomings that preceded them. It’s easy to get fixated on the benefits of meta design, object oriented design, portability, concurrency, functional statelessness, user interfaces and the like. These would map to LISP, C++, Java, Go Lang, Haskell and JavaScript, respectively. If what plagues you in day-to-day activity is what one of these languages is responding to, then by all means use it. What would 3D game engines like Unity be without C++ or Google backend systems without Go?

But overall, choosing a language forged in the cauldron of edge-case problems will force you to engage in convoluted, unintuitive thought patterns as the norm. It will slow you down and give the nature of your work, and indeed your very thought processes, a distinct and not always best “signature”. Languages that force you to think in concurrent or object oriented terms at all times for all things make you scratch your head when facing common everyday problems, asking yourself whether there might be a simpler way. Fewer lines of code? Easier to read means easier to modify later and easier to share with others, and thus a longer potential lifespan for use and reuse.

Of course, anyone who’s read my stuff before knows that what comes next is Python, Python, Python. And of course, Python. Google might make Go trying to be Python. Apple might make Swift trying to be Python. But Go and Swift are not really taking over the world. Python is. Rust, you say? Legit, but it’s displacing the fatally flawed C-languages like C++ and Java, not Python. It’s more likely that anything written in Rust will provide Python API wrappers so Python people can use it. Python is not at odds with the C-esque languages, but rather their best buddy, by nature of such easy integrations resulting in such improved (Pythonic) interfaces.

Python does have a signature. Python does have its weirdness. There is non-Pythonic, non-intuitive crap we must live with seeing a lot, like if __name__ == "__main__". But such idiomatic nonsense can on the whole be dealt with or ignored. The keyword “self” is all over places it doesn’t seem like it should belong in a certain style of programming, but you don’t need to know about that if you’re not doing object oriented design. You can use things built that way and get the benefit, but you don’t need to get how it was written.

On a similar note, counters are a part of a programmer’s life. We must count iterations through a loop. But computers offer us an “internal” or “back-end” counter automatically for their own use, and you should be able to expose that and use it. In Go Lang, those counters are just always there. In Python, you’ve got to wrap the “enumerate” function around the object you’re looping through if you want to expose that sequential numbered index counter, so that you get both a counter and the item of the loop as 2 separate values through implied tuple unpacking. You’ve got to be in pretty deep to understand why you’re “i, item”‘ing when you use enumerate on your iterable object: each pass hands you a tuple, and you’re unpacking it.

On the whole, Python panders to user expectations and makes you do weird special stuff only on edge cases.
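Here’s a minimal sketch of that “i, item” business. The list of tools is just an illustrative stand-in; the only real API here is Python’s built-in enumerate, which yields (counter, item) tuples that the for statement unpacks.

```python
# Illustrative data only; any iterable works the same way.
tools = ["Linux", "Python", "vim", "git"]

# enumerate() yields (counter, item) tuples; "i, item" unpacks each one.
for i, item in enumerate(tools):
    print(i, item)

# The same thing without unpacking: each element is a 2-tuple.
# start=1 is optional and only changes where the counter begins.
pairs = list(enumerate(tools, start=1))
```

If you never need the counter, a plain “for item in tools” is still the common case, which is the point about weird stuff living only at the edges.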

What I leave should be internalizable tools, for problem solving in coding in Python in particular, but also in life in general. Understand the algorithms and apply them. Thus the weekly namedtuples project.

If you’re going to demonstrate to the world during this time of the rise of AI that you’re an original, the one to whom the algorithms that produce economic value get fit, then don’t fit today’s models. Make tomorrow’s models have to fit to you. Be incompressible, but with a pattern or vibe or essential self still recognizable. But not too much of an act. Be yourself. So find how your happiest self expresses itself in a way that produces economic value. Make sure you can do it long-term while staying happy.

Okay. Data gathering and transformation skills are, believe it or not, along these lines. I’m practicing them now, in part because browsers can now be automated on headless servers as a regular part of the data-collecting routine. Wow.

Each week will have a keyword histogram. One for each data source in fact. That way the keyword histogram I form out of this published journal will be different from the one created out of my private journal.
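A rough sketch of what one of those keyword histograms could look like, using only the standard library. The stopword list, the tokenizing regex and the function name keyword_histogram are my own placeholders for illustration, not a settled design.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "you"}


def keyword_histogram(text, top=10):
    """Return the `top` most frequent words in text as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    # Skip stopwords and very short words before counting.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top)


# One histogram per data source keeps published and private text apart.
sources = {
    "published": "Python python data the data Python",
    "private": "photos journal photos",
}
histograms = {name: keyword_histogram(text) for name, text in sources.items()}
```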

My journals are data. Google Photos is data. Both will fit nicely into weekly units. If I went with daily there may be days without new photos or journal entries. But I almost certainly did at least one of each every week, especially if you take both personal and private journals into account. You’d need ways to keep them apart in the output, especially if you plan on publishing anything LOL.
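One way to get those weekly units, sketched under the assumption that every journal entry or photo arrives as a (source, date, payload) triple. That record shape and the names week_key and bucket_by_week are hypothetical. The week label comes from the standard library’s date.isocalendar(), so weeks run Monday to Sunday and a date in early January can belong to the previous ISO year.

```python
from collections import defaultdict
from datetime import date


def week_key(day):
    """Label a date with its ISO year and week number, e.g. '2023-W01'."""
    iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return f"{iso[0]}-W{iso[1]:02d}"


def bucket_by_week(items):
    """Group (source, date, payload) triples by week, then by source."""
    weeks = defaultdict(lambda: defaultdict(list))
    for source, day, payload in items:
        weeks[week_key(day)][source].append(payload)
    return weeks


# Made-up sample records, only to show the shape of the output.
items = [
    ("published", date(2023, 1, 5), "Data Jockeying 101"),
    ("private", date(2023, 1, 3), "private note"),
    ("photos", date(2023, 1, 7), "IMG_0001.jpg"),
]
weeks = bucket_by_week(items)
```

Keying by source inside each week is what keeps the private material separable when it’s time to publish.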

Defer hierarchy and row & column discussions for later. It’s tempting to do data transforms at the moment of data-collection. Resist the urge. There’ll be time ‘nuff for counting when the dealing’s done.

You just made an API request, right? Hey I know, why don’t we make the API args the keys to a persistent dict? Wouldn’t that go a long way towards the goal of capturing the data fast? You can transform it later. Save the API response, Requests or httpx response object and all. You can think of it as a local cache. You can always reuse that file name later after the extract and transform.
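Here’s a sketch of the args-as-keys idea using the standard library’s shelve module as the persistent dict. That’s a stand-in I’m choosing for illustration, and the names cache_key and fetch_cached, along with the fetch callable you pass in, are hypothetical. shelve keys must be strings, so the args get serialized to sorted JSON, and whatever fetch returns has to be picklable.

```python
import json
import shelve


def cache_key(**api_args):
    """Turn API arguments into a stable string key (sorted, so order doesn't matter)."""
    return json.dumps(api_args, sort_keys=True)


def fetch_cached(db_path, fetch, **api_args):
    """Return the stored response for these args, calling fetch() only on a miss.

    `fetch` is any callable you supply (hypothetical here) that makes the
    real API request and returns something picklable.
    """
    key = cache_key(**api_args)
    with shelve.open(db_path) as db:
        if key not in db:
            db[key] = fetch(**api_args)  # capture raw now, transform later
        return db[key]
```

Usage would look something like fetch_cached("raw_cache", my_fetch, site="example.com", start="2023-01-01"), where my_fetch wraps the actual request. Nothing gets transformed on the way in, and a second call with the same args never touches the network.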

Okay, get your ducks in a row for today’s work. Is there a sqldict of namedtuples awaiting processing? Maybe not when each day just replaces a whole grid of data. If there’s not a long list of input arguments, as in a series of dates to process, then forget a database of namedtuple keys, as pleasant as they are to use.

Okay, so this latest project. I need to create a couple of date-ranges. They are from the beginning of the year to current (minus a few days) and beginning of the year to a couple of days ago.
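A minimal sketch of those two date ranges with the standard datetime module. The lags of 3 and 2 days are my guesses at “a few” and “a couple”, and the function name year_to_date is made up for the example.

```python
from datetime import date, timedelta


def year_to_date(today, lag_days):
    """Return (start, end): January 1 of today's year through today minus lag_days."""
    start = date(today.year, 1, 1)
    end = today - timedelta(days=lag_days)
    # In the first days of January the lag can reach into last year;
    # clamp so the range never ends before it starts.
    if end < start:
        end = start
    return start, end


today = date(2023, 1, 5)  # fixed date for illustration; use date.today() for real runs
range_few = year_to_date(today, lag_days=3)     # "minus a few days" (assumed 3)
range_couple = year_to_date(today, lag_days=2)  # "a couple of days ago" (assumed 2)
```

Passing today in as an argument instead of calling date.today() inside the function keeps the function testable and makes reruns for a past date trivial.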