Planning Big Crawl Jobs On a Little Laptop (Chunking)
I recently had a breakthrough in creating a SQLite3 database to capture data from a list of URLs, and now I'm creating a keyword histogram and grouping the pages by keyword themes. I've been assigned a million-page crawl, and I'm tackling it all on one machine, in chunks, with momentarily unthrottled concurrency inside each chunk. Come read about my journey and how I'm processing the data in finite jobs and moving the chunks off my local drive.
Tackling a Million-Page Crawl on One Machine with Momentary Unthrottled Concurrency
By Michael Levin
Wednesday, January 11, 2023
Wow, okay. Yesterday’s late-night project was to do the HTTP fetching against a list of URLs. I capture all that data in a SQLite3 database, using the URL field as the primary key and putting the “raw” crawled HTML data into a BLOB field by passing the insert command as an insert “pattern” (a SQL statement with placeholders) plus Python data tuples. That’s the mechanism for getting unencoded binary blobs into a SQLite table. It was a nice little breakthrough. It now resides in Pipulate under the practice folder as crawl_and_extract.ipynb (no longer sequential_crawl.ipynb). I now want to do 2 quickies to expand the system:
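For reference, the insert “pattern” plus data tuple trick looks roughly like this. It's a minimal sketch, not the actual notebook code: the table name `pages`, the column names `url` and `html`, and the `save_page` helper are all made up for illustration.

```python
import sqlite3


def save_page(conn, url, raw_html):
    """Store one crawled page. raw_html is bytes and goes in untouched."""
    # The "?" placeholders are the insert pattern; the tuple carries the data.
    # sqlite3 binds a Python bytes object as a BLOB, so no encoding step is needed.
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
        (url, raw_html),
    )
    conn.commit()


# Hypothetical schema: URL as the primary key, raw HTML as a BLOB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html BLOB)")

save_page(conn, "https://example.com/", b"<html>\xff\xfe raw bytes</html>")
row = conn.execute(
    "SELECT html FROM pages WHERE url = ?", ("https://example.com/",)
).fetchone()
```

Because the URL is the primary key, `INSERT OR REPLACE` means re-crawling a page overwrites the old row instead of raising an error. Whether that's the behavior you want is a judgment call.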
- Create a keyword histogram per page
- Group the pages by keyword themes (using the page’s full text)
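Both quickies can be roughed out with nothing but the standard library. This is only a sketch under some loud assumptions: the tiny `STOPWORDS` set, the "words are 3+ letters" rule, and using each page's single most frequent word as its "theme" are all placeholders I picked for illustration, not a real clustering method.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Assumption: a toy stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "was"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def keyword_histogram(html):
    """Return a Counter of lowercase words (3+ letters) in the page's full text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]{3,}", " ".join(parser.parts).lower())
    return Counter(w for w in words if w not in STOPWORDS)


def group_by_top_keyword(pages):
    """Group URLs by each page's most frequent keyword (a crude 'theme')."""
    groups = defaultdict(list)
    for url, html in pages.items():
        top = keyword_histogram(html).most_common(1)
        groups[top[0][0] if top else ""].append(url)
    return dict(groups)
```

Feeding it `{"a": "<p>sqlite sqlite blob</p>", "c": "<p>crawl crawl chunk</p>"}` puts page `a` under `sqlite` and page `c` under `crawl`. Anything smarter, like comparing whole histograms between pages, would build on the same per-page `Counter`.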
Fast-forward to 10:00 PM. Wow, intense few days at work. Covering a lot of ground. Showed my latest crawler tech. Well-received. Got a million-page crawl to perform. Gonna look at some good data management techniques. I like to crawl on my local machine that I’m working on, but would also like to farm out to other machines. Always local. May be Windows laptops with WSL. May be NAS-hosted Linux instances. Think it through!
Deciding to do it all on one machine, but with momentarily unthrottled concurrency… in chunks. Get your list of URLs beforehand, crawl them in N-sized chunks, and spew each chunk right onto the drive. Process chunk-by-chunk and optionally even move the finished chunks off your local drive and onto your NAS as you go. Nice, sequential finite jobs. Even concurrency gets mapped onto sequential chunks, haha!
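Here's one way the chunk plan could look in code. It's a sketch, not a finished crawler: it uses threads and `urllib` from the standard library purely to stay self-contained, and the names `chunked`, `fetch`, `crawl_in_chunks`, the `chunk_00000.db` file naming, and the `archive_dir` option (standing in for a mounted NAS path) are all hypothetical.

```python
import shutil
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch(url):
    """Return (url, raw bytes), or (url, None) if the request fails."""
    try:
        with urlopen(url, timeout=30) as response:
            return url, response.read()
    except Exception:  # network errors, timeouts, malformed URLs
        return url, None


def crawl_in_chunks(urls, chunk_size, out_dir, fetcher=fetch, archive_dir=None):
    """Crawl chunk by chunk; each chunk lands in its own SQLite file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, chunk in enumerate(chunked(urls, chunk_size)):
        # "Momentarily unthrottled": one worker per URL, but only within this chunk.
        with ThreadPoolExecutor(max_workers=len(chunk)) as pool:
            results = list(pool.map(fetcher, chunk))

        db_path = out_dir / f"chunk_{number:05d}.db"
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html BLOB)")
        conn.executemany("INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", results)
        conn.commit()
        conn.close()

        # Optionally move the finished chunk off the local drive (e.g. to a NAS mount).
        if archive_dir is not None:
            Path(archive_dir).mkdir(parents=True, exist_ok=True)
            db_path = Path(shutil.move(str(db_path), str(archive_dir)))
        written.append(db_path)
    return written
```

The chunk size is the only throttle here: the loop doesn't start chunk N+1 until every fetch in chunk N has finished and been written to disk, so a crash costs you at most one chunk. Something like `crawl_in_chunks(urls, 1000, "chunks", archive_dir="/mnt/nas/crawl")` would be the million-page version, with that path and chunk size being guesses you'd tune for your own machine.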