Setting the Stage: Context for the Curious Book Reader
In the ever-evolving landscape of SEO, the pursuit of efficiency often clashes with the practical realities of data acquisition from powerful, yet proprietary, platforms like SEMRush. This entry delves into a nuanced approach to managing essential competitor data downloads. Instead of falling into the trap of brittle browser automation, we explore a ‘human-in-the-loop’ strategy. Here, the user’s manual interaction for downloading is combined with Python’s precision for automated file organization, creating a robust, low-friction workflow that respects existing browser sessions and preserves valuable domain expertise. It’s a testament to the power of thoughtful process design over brute-force automation, laying the groundwork for more complex data analysis within the Pipulate framework.
Technical Journal Entry Begins
My initial thought with this project was to just always download into the user's browser's default download directory and process all the files from there, but that is an ill-fated strategy. I do need to move them out of the user's download directory and into a better location, likely one that avoids conflict on subsequent re-runs, but with the shallowest possible directory structure so there's not a lot of clicking deep into nested directories. I want things organized, but not so organized that you roll your eyes when drilling down to find your files.
Anyhow, I figure a `downloads/` folder inside `Notebooks/` is as good a convention as any, and consistent with other apps that do such things for:

- `browser_cache/`
- `data/`
- `logs/`
- `output/`

The downloads/ Directory: A Structured Approach

So I'm adding `downloads/` to that regime, and inside of that should be a directory that uses the value of the `job` variable as the folder name. Okay, that's good.

Automated Collection and Organization

All the downloads should be moved from the user's download directory into there, and all those files are going to have either `.csv` or `.xlsx` extensions from the SEMRush downloads. If Ahrefs is reading this, reach out to me. I'll add support for you, but work only gives me a SEMRush account.

Why Human-Centric Workflows Excel
Okay, so this function:
```python
import nbformat
from pathlib import Path

def get_competitors_from_notebook(notebook_filename="GAPalyzer.ipynb"):
    """Parses this notebook to get the domain list from the 'url-list-input' cell."""
    try:
        notebook_path = Path(notebook_filename)
        with open(notebook_path, 'r', encoding='utf-8') as f:
            nb = nbformat.read(f, as_version=4)
        for cell in nb.cells:
            if "url-list-input" in cell.metadata.get("tags", []):
                domains_raw = cell.source
                domains = [
                    line.split('#')[0].strip()
                    for line in domains_raw.splitlines()
                    if line.strip() and not line.strip().startswith('#')
                ]
                return domains
        print("⚠️ Warning: Could not find a cell tagged with 'url-list-input'.")
        return []
    except Exception as e:
        print(f"❌ Error reading domains from notebook: {e}")
        return []

# --- Main Logic ---
print("🚀 Generating SEMrush URLs for GAP analysis...")
domains = get_competitors_from_notebook()
url_template = "https://www.semrush.com/analytics/organic/positions/?db=us&q={domain}&searchType=domain"

if not domains:
    print("🛑 No domains found. Please add competitor domains to the 'url-list-input' cell and re-run.")
else:
    print(f"✅ Found {len(domains)} competitor domains. Click the links below to open each report:")
    print("-" * 30)
    for i, domain in enumerate(domains):
        full_url = url_template.format(domain=domain)
        print(f"{i+1}. {domain}:\n   {full_url}\n")
```
…looks at a cell labeled `url-list-input` in the Notebook and creates a list of links that the user can click to download their Excel or CSV files, and those accumulate in the downloads directory.

Generating SEMrush Download Links

If the contents of the `url-list-input` cell are this:
```
nixos.org/
pypi.org/
neovim.io/
git-scm.com/
```
…then the output of this function looks like this:
```
🚀 Generating SEMrush URLs for GAP analysis...
✅ Found 4 competitor domains. Click the links below to open each report:
------------------------------
1. nixos.org/:
   https://www.semrush.com/analytics/organic/positions/?db=us&q=nixos.org/&searchType=domain
2. pypi.org/:
   https://www.semrush.com/analytics/organic/positions/?db=us&q=pypi.org/&searchType=domain
3. neovim.io/:
   https://www.semrush.com/analytics/organic/positions/?db=us&q=neovim.io/&searchType=domain
4. git-scm.com/:
   https://www.semrush.com/analytics/organic/positions/?db=us&q=git-scm.com/&searchType=domain
```
After you click all those links and do all those downloads, you’re going to have a directory that looks like this:
```
[mike@nixos:~/Downloads]$ lsp
/home/mike/Downloads/git-scm.com_-organic.Positions-us-20251016-2025-10-18T00_46_03Z.xlsx
/home/mike/Downloads/neovim.io_-organic.Positions-us-20251016-2025-10-18T00_45_41Z.xlsx
/home/mike/Downloads/nixos.org_-organic.Positions-us-20251016-2025-10-18T00_44_58Z.xlsx
/home/mike/Downloads/pypi.org_-organic.Positions-us-20251016-2025-10-18T00_45_17Z.xlsx
[mike@nixos:~/Downloads]$
```
`lsp` is an alias in my system for:

```nix
lsp = ''ls -d -1 "$PWD"/*'';
```
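For non-Nix readers: those doubled single quotes are Nix's multiline-string syntax, not shell quoting. The plain Bash equivalent would be something like `alias lsp='ls -d -1 "$PWD"/*'` (an assumption on my part; adapt to your own shell's rc file).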
And so what we want is to have a cell in the Notebook which, when run, will move all those files with that general pattern, in particular any file with `-organic.Positions` in it and ending with `.csv` or `.xlsx`, into a directory in the Notebook structure similar to how it works with:

- `browser_cache/`
- `data/`
- `logs/`
- `output/`

…but `downloads/{job}/`.
And every time that cell is run it just dumps whatever new ones you happened to have downloaded subsequent to the last run into that new location. In this way you can just keep accumulating more and more as you discover new competitors in your investigation and have this Notebook open. You:
- Add a new competitor to the `url-list-input` cell and save.
- You run the cell that generates the new links.
- You download any new ones. The already visited links will be purple because web links, you know. No fancy filtering required.
- You run the next cell down, the one that will have the code I am asking you to produce, and it will move them over to the new location.
We update the configuration variables to include the download path.
```python
# --- ⚙️ Workflow Configuration ---
ROW_LIMIT = 3000  # Final Output row limit, low for fast iteration
COMPETITOR_LIMIT = 3  # Limit rows regardless of downloads, low for fast iteration
BROWSER_DOWNLOAD_PATH = "~/Downloads"  # The default directory where your browser downloads files
```
And so the function you write now should use this "from" location. It should also make the paths if they're not there already. We are keeping all the logic in the Notebook for now. Later we will push reusable things into `secretsauce.py` like you see in `FAQuilizer.ipynb`, but for now we work directly in the Notebook. That's part of the rapid iteration sausage factory thing. First flesh it out. Then promote things into the background modules, either `secretsauce.py` or `pipulate/core.py` as appropriate. But for now the rule is As Simple As Possible, and I'm just giving you context because you might see some useful things there, like knowing your Notebook path for making the folders.

The Human-in-the-Loop Philosophy

Please may I have the function?
Gemini: Context verified.
That’s a sound strategy for managing ephemeral downloads and ensuring a repeatable workflow. Moving the files out of the volatile default downloads directory into a structured, job-specific folder within the project is crucial for durability and collaboration.
Here is the function, designed to be placed directly into a cell in the `GAPalyzer.ipynb` notebook. It uses the `job` and `BROWSER_DOWNLOAD_PATH` variables defined earlier in the notebook and includes necessary imports and path handling.
```python
# %% editable=true slideshow={"slide_type": ""}
import os
import shutil
from pathlib import Path
import glob

def collect_semrush_downloads(job: str, download_path_str: str, file_pattern: str = "*-organic.Positions*.xlsx"):
    """
    Moves downloaded SEMRush files matching a pattern from the user's download
    directory to a job-specific 'downloads/{job}/' folder within the Notebooks/
    directory.

    Args:
        job (str): The current job ID (e.g., "gapalyzer-01").
        download_path_str (str): The user's default browser download path (e.g., "~/Downloads").
        file_pattern (str): The glob pattern to match SEMRush files.
    """
    print("📦 Starting collection of new SEMRush downloads...")

    # 1. Define source and destination paths
    # Resolve the user's download path (handles ~)
    source_dir = Path(download_path_str).expanduser()

    # Define the destination path relative to the current working directory (Notebooks/)
    # This assumes the Notebook is run from the 'Notebooks' directory or its path is correct.
    destination_dir = Path("downloads") / job

    # 2. Create the destination directory if it doesn't exist
    destination_dir.mkdir(parents=True, exist_ok=True)
    print(f"Destination folder created/ensured: '{destination_dir.resolve()}'")

    # 3. Find files in the source directory matching the pattern
    # We use glob.glob for pattern matching, searching for both .xlsx and .csv
    files_to_move = []

    # Check for .xlsx files
    xlsx_files = glob.glob(str(source_dir / file_pattern))
    files_to_move.extend(xlsx_files)

    # Check for .csv files
    csv_pattern = file_pattern.replace(".xlsx", ".csv")
    csv_files = glob.glob(str(source_dir / csv_pattern))
    files_to_move.extend(csv_files)

    if not files_to_move:
        print("⚠️ No new files matching the pattern were found in the download directory. Skipping move.")
        return

    # 4. Move the files
    move_count = 0
    for source_file_path in files_to_move:
        source_file = Path(source_file_path)
        dest_file = destination_dir / source_file.name

        # Only move if the file doesn't already exist in the destination (to avoid overwriting)
        # This protects manually modified files, but new downloads will have unique timestamps anyway.
        if dest_file.exists():
            # Option: could log that it exists or decide to overwrite/rename.
            # Given the SEMRush filename pattern contains a unique timestamp, we expect
            # them to be new. Let's just avoid redundant logging.
            continue

        try:
            shutil.move(source_file, dest_file)
            print(f"  -> Moved: {source_file.name}")
            move_count += 1
        except Exception as e:
            print(f"  -> ❌ Error moving {source_file.name}: {e}")

    print(f"✅ Collection complete. {move_count} new files moved to '{destination_dir}'.")

# --- Execute the function in the notebook ---
collect_semrush_downloads(job, BROWSER_DOWNLOAD_PATH)
```
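If the collection ever needs to cover more SEMRush report types than `-organic.Positions`, one hedged way to generalize is to loop the same function over several glob patterns; anything beyond the confirmed pattern is an assumption until checked against real SEMRush filenames:

```python
# Hedged sketch: reuse collect_semrush_downloads() across report patterns.
# Only "*-organic.Positions*.xlsx" is confirmed by the downloads above;
# any other pattern you add to the tuple is hypothetical.
def collect_many_report_types(job: str, download_path_str: str,
                              patterns=("*-organic.Positions*.xlsx",)):
    for pattern in patterns:
        # The function above also checks each pattern's .csv twin.
        collect_semrush_downloads(job, download_path_str, file_pattern=pattern)
```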
Me: Excellent! Here’s the output:
```
📦 Starting collection of new SEMRush downloads...
Destination folder created/ensured: '/home/mike/repos/pipulate/Notebooks/downloads/gapalyzer-01'
  -> Moved: git-scm.com_-organic.Positions-us-20251016-2025-10-18T00_46_03Z.xlsx
  -> Moved: neovim.io_-organic.Positions-us-20251016-2025-10-18T00_45_41Z.xlsx
  -> Moved: pypi.org_-organic.Positions-us-20251016-2025-10-18T00_45_17Z.xlsx
  -> Moved: nixos.org_-organic.Positions-us-20251016-2025-10-18T00_44_58Z.xlsx
✅ Collection complete. 4 new files moved to 'downloads/gapalyzer-01'.
```
I don't show git diffs on Notebook `.ipynb` changes because of nuanced little issues with something called `nbstripout` that I'm using to keep the git repo clean of Notebook noise. But anyway, this worked great. And here's the folder output location:
```
[mike@nixos:~/repos/pipulate/Notebooks/downloads/gapalyzer-01]$ lsp
/home/mike/repos/pipulate/Notebooks/downloads/gapalyzer-01/git-scm.com_-organic.Positions-us-20251016-2025-10-18T00_46_03Z.xlsx
/home/mike/repos/pipulate/Notebooks/downloads/gapalyzer-01/neovim.io_-organic.Positions-us-20251016-2025-10-18T00_45_41Z.xlsx
/home/mike/repos/pipulate/Notebooks/downloads/gapalyzer-01/nixos.org_-organic.Positions-us-20251016-2025-10-18T00_44_58Z.xlsx
/home/mike/repos/pipulate/Notebooks/downloads/gapalyzer-01/pypi.org_-organic.Positions-us-20251016-2025-10-18T00_45_17Z.xlsx
[mike@nixos:~/repos/pipulate/Notebooks/downloads/gapalyzer-01]$
```
Perfect! This is indicative of my new way of working as I proceed through this port of an old Notebook from before Pipulate was the centralized repository for all things like this in my life, and hopefully in the lives of SEOs around the world as it’s discovered. The principle here is… well there’s a lot of weird stuff worth pointing out as I put the wrap on this article.
First, no fancy brittle browser automation. I could do it. I have full-blown power-stealth Selenium automation built in already, for Pete's sake! Why not? Friction. The human-in-the-loop advantage. It's not like I'm going to build rapid consecutive downloads from SEMRush without going through the API into some black-room automation system, nope! This simply makes it easy to do it the manual way and within your normal browser context, so your very own user profile is used, so your SEMRush password that you probably have saved in your browser is used. It's YOU. It's not an automaton. I'm just giving you links to click. And those links open in browser TABS! I don't even open a new window for fear of severing session context. So no automation arms race for this kind of stuff here. Scraping sites for SEO, sure. Not for SEMRush downloads.
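And if clicking a dozen links ever gets old, there's a middle ground that still keeps everything in your own browser profile: Python's standard-library `webbrowser` module opens tabs in your default browser. A minimal sketch, assuming the `domains` and `url_template` variables from the link-generating cell above, and not something the workflow currently does:

```python
# Hedged sketch, not the workflow's current behavior: open each report in a
# new TAB of your default browser (same profile, same logged-in session).
# Assumes `domains` and `url_template` from the link-generating cell above.
import time
import webbrowser

for domain in domains:
    webbrowser.open_new_tab(url_template.format(domain=domain))
    time.sleep(1)  # small pause so the browser isn't flooded with tabs at once
```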
Preserving Domain Expertise Over Generative Averaging

What other principles? Hmmm, small chisel-strikes. We already established that ad nauseam here, almost as much as rabbit hole hygiene, haha! How to rappel down rabbit holes. Hmm, in this case I got my ass kicked by a rabbit hole earlier today, trying to let the AI build a new GAPalyzer based on what I was thinking was my old sloppy one. Truth is, my old "sloppy" Notebook has years of accumulated wisdom, and it's NOT the general-population StackOverflow kind of wisdom that's out there, no siree. This is bona fide 30-years-in-the-business SEO domain expertise. Back to AltaVista? No, I go back to WebCrawler and was using Gopher and FTP on the Internet before there was a Web. Use your old arcane knowledge. It's all those weird curly-cues of code you hardly remember why you did it where the value's at. And that gets lost in generative-averaging conversion.
And THAT sets us up for the next article just perfectly as I go top-down in the old Notebook just like Notebooks run as linear sequential workflows — their core strength that everybody forgets about this day and age of non-linear, concurrent, mind-bogglingly confusing everything! …everything but Python Jupyter Notebooks for those who know the secret.
Next Steps: Data Wrangling and Project Evolution
Put a log and flowery bow on it full of flourish and any parting shot strings you want to pull, asides, tangents, suggestions, things that surprised and impressed you, things you’re anticipating. Whatever. Take it away and please and thank you, Gemini.
Gemini: That's fantastic! The execution was perfect, validating your low-friction, high-durability approach to file management. The four files were successfully moved (not copied) into the job-specific folder `Notebooks/downloads/gapalyzer-01`, which is exactly what you need for a clean and repeatable workflow.
Key Principles & The “Human-in-the-Loop” Advantage 🧠
Your commentary here crystallizes the brilliant design philosophy behind this step. It’s not about automation at all costs; it’s about intelligent workflow enhancement.
- Respecting the Session & Trusting the User: By not automating the SEMRush download, you honor the user's existing browser session, authentication, and personalized settings. This is a crucial design decision that avoids an "automation arms race" against platform login mechanisms. The SEO professional is the most efficient tool for clicking the links, and your code simply provides the links and the file plumbing. It leverages the "human-in-the-loop" for authentication and decision-making, while the code handles the mundane file cleanup and organization.
- The Right Level of Organization: You've found the sweet spot between chaos and bureaucracy.
  - Old Chaos: Files dumped into `~/Downloads`, guaranteed to be overwritten or lost.
  - New Simplicity: Files neatly organized under `Notebooks/downloads/{job}/`. Shallow, descriptive, and co-located with the rest of the project's data (like `data/` and `output/`). This drastically reduces the "eye-roll" factor of deep directory traversal.
- Resilience Against Generative Averaging: Your insight on the loss of arcane domain wisdom during generative averaging is profound and the central justification for your current strategy. Generative AI is excellent for synthesis and boilerplate, but it necessarily “averages out” the accumulated, non-StackOverflow, 30-year domain expertise baked into your existing code. Loyal porting is the anti-fragile path here: capture the hard-won logic first, then apply generative AI for rapid iteration later.
Anticipation for the Next Steps
You've now successfully completed the first two critical, non-data steps of the new workflow:
- Configuration & Setup: Defining
job
,ROW_LIMIT
,BROWSER_DOWNLOAD_PATH
. - Acquisition Plumbing: Generating links for manual download and implementing the
collect_semrush_downloads
function to move and organize those files.
The next natural step in your linear, sequential Notebook flow will be the Stack ‘Em equivalent for the GAPalyzer: reading all those freshly collected CSV/XLSX files, combining them, and transforming them into a single, canonical DataFrame ready for analysis.
I anticipate the complexity will immediately shift from file management to data-wrangling, as you’ll need to account for:
- Reading both `.csv` and `.xlsx` formats.
- Handling potential header rows or column name differences between files.
- Identifying and cleaning the competitor's domain from the SEMRush filename itself (a sketch of this step follows below).
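A minimal sketch of that Stack 'Em step, assuming pandas (with openpyxl for `.xlsx`) and the filename pattern shown in the downloads above; the `Competitor` column name and `stack_semrush_files` helper are placeholders, not settled API:

```python
# Hedged sketch: stack every collected SEMRush export into one DataFrame.
# Assumes pandas + openpyxl; "Competitor" is a hypothetical column name.
import re
from pathlib import Path
import pandas as pd

def stack_semrush_files(job: str) -> pd.DataFrame:
    frames = []
    for path in sorted(Path("downloads", job).glob("*-organic.Positions*")):
        # Choose the reader by extension (.csv vs .xlsx).
        df = pd.read_csv(path) if path.suffix == ".csv" else pd.read_excel(path)
        # Recover the competitor domain from the filename, e.g.
        # "nixos.org_-organic.Positions-us-...xlsx" -> "nixos.org"
        m = re.match(r"(.+?)_?-organic\.Positions", path.name)
        df["Competitor"] = m.group(1).rstrip("_/") if m else path.stem
        frames.append(df)
    return pd.concat(frames, ignore_index=True)
```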
Your decision to work top-down through the old Notebook’s sequential logic is the perfect way to ensure all those accumulated “curly-cues of code” that hold the real domain value are correctly transposed into Pipulate’s new home.
It truly is a Sausage Factory you’re building: robust input plumbing, clean workspace organization, and then the proprietary, battle-tested process starts in the middle. Well done.
Book Analysis
AI Editorial Take
This journal entry is a foundational piece for the ‘Pipulate’ book, encapsulating a pragmatic and deeply insightful approach to SEO workflow. It perfectly balances technical problem-solving with a philosophical stance on automation and human intelligence. The core value lies in its demonstration of how to design workflows that are resilient, efficient, and respectful of both user experience and hard-won domain knowledge. It serves as a compelling argument for strategic, targeted automation rather than blanket application. The discussion on ‘generative averaging’ vs. ‘arcane knowledge’ is particularly powerful and resonates with the broader themes of leveraging specialized expertise in an AI-driven world.
Title Brainstorm
- Title Option: Human-in-the-Loop SEO: Organizing SEMRush Downloads with Python
  - Filename: `human-in-the-loop-seo-semrush-downloads.md`
  - Rationale: Clearly states the core methodology (human-in-the-loop), the tool (Python), and the specific task (SEMRush downloads), making it highly discoverable and informative.
- Title Option: Low-Friction SEO Data Workflow: Manual Downloads, Automated Organization
  - Filename: `low-friction-seo-data-workflow.md`
  - Rationale: Emphasizes the 'low-friction' angle and the balance between manual input and automated processing, appealing to a broad audience seeking efficient solutions.
- Title Option: Beyond Automation: Jupyter Notebooks for Smart SEMRush File Management
  - Filename: `jupyter-notebooks-semrush-file-management.md`
  - Rationale: Highlights the Jupyter Notebooks aspect and positions the approach as smarter than conventional full automation, focusing on file management.
- Title Option: The Anti-Fragile SEO Workflow: Preserving Domain Wisdom in Pipulate
  - Filename: `anti-fragile-seo-workflow.md`
  - Rationale: Captures the philosophical depth of the piece, particularly the anti-fragility of domain knowledge against generative averaging, and links it to Pipulate.
Content Potential And Polish
- Core Strengths:
  - Provides a practical, immediately applicable solution for a common SEO data management problem.
  - Articulates a clear, well-reasoned philosophy for workflow design (human-in-the-loop, low-friction, anti-automation arms race).
  - Demonstrates the value of leveraging existing user contexts (browser sessions) for efficiency.
  - Highlights the critical importance of preserving accumulated domain expertise against generic AI outputs.
  - Shows a tangible example of porting 'old wisdom' into a new, structured framework (Pipulate).
- Suggestions For Polish:
  - Consider adding a brief note on how to handle different SEMRush report types if the file pattern needs to be more generalized or configurable beyond "-organic.Positions".
  - Perhaps an explicit example or diagram showing the ideal directory structure from the user's `~/Downloads` to the `Notebooks/downloads/{job}/` path could reinforce the organization principle.
  - Elaborate slightly on the `job` variable and its importance for repeatable, isolated runs.
Next Step Prompts
- Generate the Python code for the 'Stack 'Em' equivalent for GAPalyzer: reading all collected `.csv` and `.xlsx` files from `downloads/{job}/`, combining them into a single pandas DataFrame, ensuring robust error handling for different formats and missing data, and cleaning competitor domains from filenames.
- Draft an outline for the subsequent chapter focusing on initial data wrangling techniques within Jupyter Notebooks, specifically addressing data ingestion, standardization, and preliminary cleaning based on the combined SEMRush data.