Mike Levin SEO

Future-proof your technology-skills with Linux, Python, vim & git... and me!


Write Python Webcrawler! Toil & Talk w/Audience

by Mike Levin

Thursday, May 26, 2022

The time has come, the walrus said, to write a web crawler… AGAIN! This is one of my all-time best livecasts. It’s a 3-hour gift of awesome that you should watch from beginning to end if you’re in the field of SEO or anything else where data is important. Of course, you can ignore me and remain dependent on vendor products.

Okay, so it’s time to code a webcrawler. This is not my first time. I know precisely what pitfalls I wish to avoid, namely recursion.

Avoid Recursion… The Infinite Pitfall

Recursion is tempting. It’s awesome. But it’s not kind to your computer’s memory, nor to your ability to know what’s going on.

So what is recursion, for the uninitiated? The alluring way to write a crawler is to have a function call itself, like this:

def foo(bar):
    foo(bar)

foo('value')

This solves some of the challenges of writing a crawler, but creates others. So it seems like a good idea at first, but simple iteration is better:

for x in range(1000):
    foo(bar)
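To make the pitfall concrete, here is a minimal sketch comparing the two approaches on a tiny made-up site. The `SITE` dictionary and both function names are hypothetical stand-ins for illustration, not part of the crawler built later in this post.

```python
# Hypothetical in-memory "site": each page maps to the pages it links to.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/"],
    "/blog/post-1": ["/blog", "/about"],
}


def crawl_recursive(page):
    """Follow every link by calling itself; nothing stops it on link cycles."""
    for link in SITE[page]:
        crawl_recursive(link)


def crawl_iterative(start, max_pages=1000):
    """Visit each page once, using a plain list as a to-do queue."""
    seen = {start}
    queue = [start]
    order = []
    while queue and len(order) < max_pages:
        page = queue.pop(0)
        order.append(page)
        for link in SITE[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# "/" links to "/about", which links back to "/", so recursion never ends.
try:
    crawl_recursive("/")
    recursion_failed = False
except RecursionError:
    recursion_failed = True

visited = crawl_iterative("/")
print(recursion_failed, visited)
```

The iterative version always knows exactly where it is: the queue is the to-do list and `seen` is the memory, both of which you can inspect (or cap) at any moment.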

Onto Coding The Crawler

This session follows on from yesterday’s video on high-performance persistence of Python dictionaries, thanks to Sqlite3. It’s built into Python, but not generally accessible through a FAST key/value-pair (NoSQL) interface. It’s usually row & column, which slows you down mentally. Key/value data-capture lets you work fast and furious. In my opinion, a key/value data-store is the first step in any project involving fetching data because no matter what you fetch, there are details of the request that become the key, and the fetched data that becomes the value. Fetch, store, fetch, store, fetch, store. You can always transform the data later.
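For a feel of what a key/value layer over Sqlite3 involves, here is a minimal sketch using only the standard library. The `KeyValueStore` class and its `kv` table are hypothetical illustrations; the crawler below uses the sqlitedict package instead, and this is not a claim about how that package is implemented.

```python
import pickle
import sqlite3


class KeyValueStore:
    """Hypothetical dict-like wrapper over a single two-column SQLite table."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __setitem__(self, key, value):
        # Pickling lets any picklable Python object be stored as the value.
        self.conn.execute(
            "REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, pickle.dumps(value)),
        )
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def __contains__(self, key):
        row = self.conn.execute(
            "SELECT 1 FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return row is not None


# Fetch, store: the request detail (URL) is the key, the result is the value.
db = KeyValueStore(":memory:")
db["https://example.com/"] = {"status": 200, "html": "<h1>Hello</h1>"}
db["https://example.com/missing"] = None  # known URL, not fetched yet
print(db["https://example.com/"]["status"])
```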

It’s 3:30 AM

To toil or not to toil? That is the question. Sleep is good, but once you’ve got enough, uninterrupted focus-time that is “out-of-time” for getting into the zone or the flow is good too. One must balance the two.

1, 2, 3… 1? Notebook!

To have a “real” database in the picture is usually a tech liability. It’s not with Sqlite.

A Reminder, It’s About SEO & Keywords

We have to get a handle on the “issues” going on in a website.

Crawl Strategy

Pull the whole site down quickly into Sqlite3.

While we COULD do asynchronous concurrent web crawling for some real fast site slurping, we’re going to do sequential page-fetching for simplicity’s sake and to make a point about starting small on the crawl. Don’t go for a million pages initially, especially when getting started learning Python & crawlers. Time enough for x10 complexity later.

Using URLs as Database keys is key.
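One wrinkle with URLs as keys is that the same page can be written several ways (relative, absolute, with a #fragment). Below is a minimal sketch of one way to settle on a single key per page, assuming you want fragments dropped. The `url_key` helper is a hypothetical name, and this step is not part of the crawler code further down.

```python
from urllib.parse import urldefrag, urljoin


def url_key(base_url, href):
    """Turn a raw href into an absolute URL without its #fragment."""
    absolute = urljoin(base_url, href)
    key, _fragment = urldefrag(absolute)
    return key


page = "https://example.com/blog/"
hrefs = [
    "post-1/",
    "/blog/post-1/#comments",
    "https://example.com/blog/post-1/",
]
# Three spellings of the same page collapse into one database key.
keys = {url_key(page, href) for href in hrefs}
print(keys)
```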

Don’t be beholden to the “echo chamber” of SEO or any other field… go to the original sources of information.

Python Context Manager

“With” indicates the Python “context manager” (which avoids explicit opening and closing of files). It’s NOT a loop.
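To see that “with” is about setup and cleanup rather than looping, here is a minimal sketch with a made-up `Resource` class (hypothetical, for illustration only) that records what happens to it:

```python
class Resource:
    """Hypothetical resource that records when it is opened and closed."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("open")
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on the way out, whether or not the body raised.
        self.events.append("close")
        return False  # do not swallow exceptions


res = Resource()
with res as r:
    r.events.append("work")  # the body runs exactly once

failing = Resource()
try:
    with failing:
        raise ValueError("boom")
except ValueError:
    pass  # the resource was still closed before we got here

print(res.events, failing.events)
```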

Don’t always follow the common wisdom. Those folks are talking on forums and Twitter because they don’t have work to do. I’m doing THIS at 4:20 AM because I have no time, and I DO have MOTIVATION… AND SO, AT 4:20, I show you stuff.

Thanks, Jer… there’s stuff I’m not covering, like HTTP SESSIONS. Huh? It’s the same as in a browser: you’re still logged in on your 2nd page-load, otherwise you’d be logging in an awful lot on websites. A session is the persistence of certain data (often cookie-based) between page loads.

Also skipping setting user_agent.
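For the curious, here is a minimal sketch of both skipped topics using only the standard library, so it runs without touching the internet. The `DemoSite` server, the `make_session` helper and the `demo-crawler/0.1` user agent are all hypothetical; with requests you would reach for its own session object instead, which is not shown here.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoSite(BaseHTTPRequestHandler):
    """Hypothetical site: hands out a cookie, then echoes what it receives."""

    def do_GET(self):
        cookie = self.headers.get("Cookie", "")
        agent = self.headers.get("User-Agent", "")
        body = f"cookie={cookie} agent={agent}".encode()
        self.send_response(200)
        if not cookie:
            # First visit: "log the visitor in" by setting a session cookie.
            self.send_header("Set-Cookie", "sessionid=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def make_session(user_agent):
    """Build an opener that remembers cookies and sends a custom User-Agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # talk to the local server directly
        urllib.request.HTTPCookieProcessor(jar),
    )
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


server = HTTPServer(("127.0.0.1", 0), DemoSite)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/"

session = make_session("demo-crawler/0.1")
first = session.open(base).read().decode()             # no cookie yet
second = session.open(base + "page2").read().decode()  # cookie sent back

server.shutdown()
server.server_close()
print(first)
print(second)
```

The second page-load carries the cookie from the first, which is all a “session” really is from the crawler’s point of view.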

DRY Don’t Repeat Yourself… not even once!!! If you do, they will be all over you like someone who learned Java in CompSci 101. They’re wrong.

Finished Web Crawler Code

from sqlitedict import SqliteDict as sqldict
import requests
from requests.models import Response
from mlseo import *

url = 'https://mikelev.in/'
resp = requests.get(url)

# We Initialize First Data Write of a Crawl
with sqldict('crawl.db') as db:
    db[url] = resp
    links = extract_links(resp.text)
    for link in links:
        if link not in db:
            db[link] = None
    db.commit()

for x in range(10):
    with sqldict('crawl.db', timeout=20) as db:
        for i, key in enumerate(db):
            resp = db[key]
            if resp is None:
                print(i, key)
                resp = requests.get(key)
                if isinstance(resp, Response):
                    if resp.status_code == 200:
                        db[key] = resp
                        links = extract_links(resp.text)
                        for link in links:
                            if link not in db:
                                db[link] = None
                    db.commit()
    h2(f'LOOP {x}')
h1('Done')
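The `extract_links` function comes from the mlseo package and its source isn’t shown here. As a rough idea of what such a helper has to do, here is a hypothetical stand-in built on the standard library. Note the assumptions: it takes a `base_url` argument (the real call above only passes the page text), keeps only same-site links, and drops #fragments. The real function may well behave differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links_sketch(html, base_url):
    """Return absolute same-site links in page order, without duplicates."""
    parser = _LinkParser()
    parser.feed(html)
    site = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href).split("#")[0]
        if urlparse(absolute).netloc == site and absolute not in links:
            links.append(absolute)
    return links


sample = (
    '<a href="/about/">About</a> '
    '<a href="post/#top">Post</a> '
    '<a href="https://other.example/">Elsewhere</a> '
    '<a href="/about/">About again</a>'
)
found = extract_links_sketch(sample, "https://example.com/blog/")
print(found)
```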

I took a little time after the livecast to make sure all the URLs get recorded, even if they’re 404s (page not found). They can always be queried later to find the bad pages. Finding the pages that linked to them will be a different project, because this version of the crawler basically throws out the “link graph”.

from sqlitedict import SqliteDict as sqldict
from requests.models import Response
from mlseo import *
import requests

url = "https://mikelev.in/"
resp = requests.get(url)

# We Initialize First Data Write of a Crawl
with sqldict("crawl.db") as db:
    db[url] = resp
    links = extract_links(resp.text)
    for link in links:
        if link not in db:
            db[link] = None
    db.commit()

for x in range(10):
    h2(f"LOOP {x + 1}")
    with sqldict("crawl.db", timeout=20) as db:
        for i, key in enumerate(db):
            resp = db[key]
            if resp is None:
                print(i, key)
                resp = requests.get(key)
                db[key] = resp
                db.commit()
                if isinstance(resp, Response) and resp.status_code == 200:
                    links = extract_links(resp.text)
                    for link in links:
                        if link not in db:
                            db[link] = None
                    db.commit()
h1("Done!")
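Querying for the bad pages afterwards could look something like the sketch below. To keep it runnable without a crawl, a plain dict of stand-in objects plays the role of the database. The assumption is that each stored value is either None or something with a `status_code` attribute, as in the crawler above; the `bad_pages` helper and the example URLs are hypothetical.

```python
from types import SimpleNamespace

# Stand-in for the crawl database: URL -> response-like object,
# or None if the page has not been fetched yet.
crawl = {
    "https://example.com/": SimpleNamespace(status_code=200),
    "https://example.com/old-page/": SimpleNamespace(status_code=404),
    "https://example.com/broken/": SimpleNamespace(status_code=500),
    "https://example.com/not-yet/": None,
}


def bad_pages(db):
    """Map each fetched URL whose status is not 200 to its status code."""
    return {
        url: resp.status_code
        for url, resp in db.items()
        if resp is not None and resp.status_code != 200
    }


report = bad_pages(crawl)
for url, status in report.items():
    print(status, url)
```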

Follow-on Discussion With Audience

Wow, what an epic 3-hour session!

Once again I throw it open to discussion with the folks on the livecast chat. And once again, the discussion comes back to my love for a particular set of tools:

“Full Web Stack”… rat race… hamster wheel… fire & motion

Joel Spolsky Fire & Motion Article

What’s fire and motion, you ask? Well, read Joel Spolsky’s article on the topic (a topic on which he wrongly backtracked… Microsoft put a horse-head in your bed, Joel?). But to put it in a nutshell:

What I say is simply don’t engage in trench warfare (fire & motion / constant re-training). Choose better tools. Choose free and open source software (FOSS) tools that are serious about not changing the APIs on you every 2 to 5 years. Look at the grief caused in the Python community going from Python 2.x to Python 3.x. It tore the Python community apart, and those were small API changes compared to what happens every year in the JavaScript world with the ECMAScript standard.

Python is Less of a Moving Target

Yes, but Python changes too. Like recently, it’s Python 3.8 to Python 3.10. Did it change a lot? No, it changed very little. And the changes were mostly additive, with no older APIs breaking and little to no retraining necessary. We got the walrus operator (in 3.8) and a new switch-like match statement (in 3.10). Yawn! Not at all like:

ReactJS vs VueJS

Boy, those folks are really screwed. And where are you all going to run to after VueJS goes out of style, my friends? Sure, JavaScript is the pee in the pool of tech just like Python and Perl now, but it’s tied to a time and place. It’s tied to the state of browser platforms and mobile devices of the early 2020s… just like PostScript was tied to the laser printers of the 1990s. A platform-specific language does not a timeless tool make.

Sorry, JavaScript bros.

Django in 2005… you could still Django today… Maybe switch from mako templates to jinja2… maybe.

Are timeless, obsolescence-proof (future-proof), resist disruption.

LISP & Emacs Users are Real Wizards

I covet… I am jealous of the true “meta” wizards of tech who use as their equivalent of LPvg:

This is both a:

Python is For Cheaters & Shortcut-Takers

Specifically, Python is for shortcut-takers and finish-line racers. Python lets you get stuff done fast, building on many, many shoulders of many, many domain-specialists. You name the field, I can pip install the package, from gene splicing to goat herding.

This is not a bad thing. We all stand on the shoulders of giants. Python people just have a big box of giants to choose from. Python is a bit less plagued by NIH-syndrome than some other languages. NIH stands for “Not Invented Here”. Many programmers want to do all the work themselves for the experience. While that can be good, wouldn’t it also be nice to satisfy your boss (or yourself) today? I mean right now? I mean you could be done with the project already.

If I were talking about a child’s game of red-light, green-light, the question is who is better off when RED LIGHT is shouted out? It’s the Python user, of course. It’s because Python makes all the right compromises (for productivity) that most people want to make most of the time. If you want to go off the beaten track, you want LISP or something like it.

Am I Saying Don’t Use Rust?

But what about other languages that are gangbusters in popularity today, like Rust? Actually, specifically Rust. I mean Rust is everywhere now, isn’t it? Yes, and it’s great. There’s always going to be an “almost C” language. In the past it was Java. Then it was Google’s Go. Today’s C-like language du jour is Rust.

Not only is there room for Rust & Go, they SHOULD be used where appropriate. I love Python because I’m not a professional developer. I don’t write drivers, OSes, or native apps. Those things are usually best with some language that’s “like” C, but fixes its biggest headaches.

Not Computer Literacy… It’s Just Plain Literacy!

Programming, coding… whatever… is just plain literacy. It’s not computer literacy. It’s not really even tech. It’s just communicating without many of the shortcomings of human spoken languages. Autistic? Maybe. Want to know my vibe? Want to know Python’s vibe? Go to a Jupyter Notebook and run:

import this

I Am Not A Genie

So why am I not (or at least why don’t I consider myself) a “professional programmer”? It’s because professional programmers make products and are held to very high standards and are on the hook for things that make me cringe. It makes you much like an indentured servant. In the old days, you’d be “on a beeper”. Today, it’s just 24x7 availability through the Net. If your product fails anyone anywhere, they’re looking for you.

Web Development: Web Server… like a genie. POOF What do you need?

WebDev be like:

*POOF* What do you need?
*POOF* What do you need?
*POOF* What do you need?
*POOF* What do you need?
*POOF* What do you need?

Not my vibe.

Linux services (automation after Notebooks) be like:

This is my vibe.

Even when it’s not automated

Aside from having no good name, this is my vibe too.

Samurai Warriors of The Information Age

Samurai Warriors of the Information Age are not professional developers. Those are the farmers or herders. Some may be plain soldiers. None of that is for me. I am a generalist in the world of marketing who carries secret weapons.

It’s easy to think of my vibe as a Samurai warrior of tech. It isn’t the same as working you like a WebDev or App Developer workhorse. Someone cracks the whip at you, you can just walk away. Your skills are valuable, rare and applicable under any conditions for any employer. You’re not tied to a particular code-base or project. You just carry your universally applicable samurai sword around inside of you. It is an internal asset that’s never going bad or going away.

They’re standing by ready to slice off a head.

Huh? By slicing off a head, I mean getting data. Prepping data. Transforming data. Making it pretty. Delivering it to people who need it… all without a webserver even in the picture. If it has to be automated, you put it on a Linux daemon under systemd (a Linux service). The value is just like that of ye olde Samurai slicing off a head.
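The Python side of such a service can be tiny. Here is a minimal sketch of a loop that runs a job on an interval and exits cleanly when asked to stop; systemd stops a service by sending it SIGTERM by default. The `run_service` function, its `max_runs` parameter (only there so the demo ends) and the job itself are hypothetical, and the systemd unit file is not shown.

```python
import signal
import threading

stop = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the loop to finish after the current job."""
    stop.set()


def run_service(job, interval_seconds, max_runs=None):
    """Run job repeatedly until a stop is requested (or max_runs is hit)."""
    runs = 0
    while not stop.is_set():
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Sleep between runs, but wake immediately if a stop arrives.
        stop.wait(interval_seconds)
    return runs


# Signal handlers can only be installed from the main thread.
if threading.current_thread() is threading.main_thread():
    signal.signal(signal.SIGTERM, request_stop)

results = []
completed = run_service(
    lambda: results.append("report built"),
    interval_seconds=0.01,
    max_runs=3,
)
print(completed, results)
```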

Gotta Fix MikeLev.in… Eventually

Next steps: using the crawl discoveries TO FIX MIKELEV.IN!!!

Especially the title tags!

Github Pages Jekyll system… and my blog slice & dice system.