
Planning Big Crawl Jobs On a Little Laptop (Chunking)

by Mike Levin

Wednesday, January 11, 2023

Wow, okay. Yesterday’s late-night project was to do the HTTP fetching against a list of URLs. I captured all that data into a SQLite3 database properly, using the URL field as the primary key and putting the “raw” crawled HTML into a BLOB field by passing the insert command as an insert “pattern” plus a Python data tuple. That’s the mechanism for getting unencoded binary blobs into a SQLite table, and it was a nice little breakthrough (a minimal sketch of that insert pattern follows the list below). The notebook now resides in Pipulate under the practice folder as crawl_and_extract.ipynb (renamed from sequential_crawl.ipynb). I now want to do two quickies to expand the system:

  1. Create a keyword histogram per page (also sketched below)
  2. Group the pages by keyword themes (using the page’s full text)
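
Not the notebook’s exact code, but a minimal sketch of that insert-pattern-plus-tuple mechanism; the table and column names here (crawl, url, html) are placeholders, not Pipulate’s actual schema.

```python
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS crawl (url TEXT PRIMARY KEY, html BLOB)"
)

def save_page(url, raw_bytes):
    """Upsert one fetched page, keeping the raw response as an unencoded BLOB."""
    # The INSERT is a "pattern" with ? placeholders; the data rides along as a
    # tuple, so the bytes never get string-encoded on the way into the BLOB column.
    conn.execute(
        "INSERT OR REPLACE INTO crawl (url, html) VALUES (?, ?)",
        (url, sqlite3.Binary(raw_bytes)),
    )
    conn.commit()

save_page("https://example.com/", b"<html>...</html>")
```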
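For the first quickie, a collections.Counter over a page’s visible text is probably enough to get started. This sketch assumes BeautifulSoup is on hand for stripping tags and ignores stop words entirely.

```python
import re
from collections import Counter

from bs4 import BeautifulSoup  # assumed available; any tag-stripper would do

def keyword_histogram(raw_html, top_n=20):
    """Return the top_n most frequent words in a page's visible text."""
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words).most_common(top_n)
```

Grouping pages into keyword themes could then be a second pass over the same extracted text (TF-IDF vectors fed into a clustering step, say), but that deserves its own session.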

Fast-forward to 10:00 PM. Wow, intense few days at work. Covering a lot of ground. Showed my latest crawler tech. Well-received. Got a million-page crawl to perform. Gonna look at some good data management techniques. I like to crawl on the local machine I’m working on, but I’d also like to be able to farm work out to other machines. Always local, though. Maybe Windows laptops with WSL. Maybe NAS-hosted Linux instances. Think it through!

Deciding to do it all on one machine, but with momentarily unthrottled concurrency… in chunks. Get your list of URLs beforehand, crawl the data in N-sized chunks, and spew the results right onto the drive. Process chunk by chunk, and optionally even move finished chunks off your local drive and onto your NAS as you go. Nice, sequential, finite jobs. Even concurrency gets mapped onto sequential chunks, haha!
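
A rough sketch of that plan, not the real job: slice the full URL list into N-sized chunks, turn the concurrency loose only inside a chunk, then land everything on disk before starting the next one. The fetch() helper, chunk size, and worker count here are all placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

CHUNK_SIZE = 1000   # placeholder: tune N to what the laptop can handle
MAX_WORKERS = 20    # concurrency lives only inside a single chunk

def chunked(seq, size):
    """Yield successive size-length slices of a list."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def fetch(url):
    """Placeholder fetcher; the real job would reuse the notebook's HTTP client."""
    with urlopen(url, timeout=30) as resp:
        return url, resp.read()

def crawl_in_chunks(urls, save_page):
    """Run the whole crawl as a series of finite, sequential chunk jobs."""
    for n, chunk in enumerate(chunked(urls, CHUNK_SIZE), start=1):
        # Unthrottled concurrency within the chunk; chunks themselves run one at a time.
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
            for url, raw in pool.map(fetch, chunk):
                save_page(url, raw)  # e.g. the SQLite upsert sketched earlier
        print(f"chunk {n} done ({len(chunk)} URLs)")  # natural point to ship files to the NAS
```

Each pass through the loop is a self-contained job, so a crash or a pause only costs you the chunk in flight, and the per-chunk boundary is the obvious place to sweep finished data off to the NAS.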