Original Vision for Article: A Death Star Hologram for Your Website
The theme of this article was going to be: Like Admiral Ackbar’s Death Star hologram, learn how to visualize website architecture through link-graphs enhanced with Google Search Console data, revealing structural insights and optimization opportunities. However, I never got to the visualization part, and anyone following a link expecting that payoff would be left less than satisfied.
The Real Challenge: AI-Assisted API Integration
What it did reach was some of the most complex and challenging API integration of my life, flat-out not possible without AI help. And not just any AI help. AI that has repeatedly received a Matrix-style Kung Fu download. It’s not Prompt Engineering. It’s not training a model or fine-tuning. It’s something whose moniker, as I’m choosing it, makes my coworker blush. I can’t very well go with a movie quote to name the technique. It’s got to be more literal and descriptive if it’s going to compete with prompt engineering, fine-tuning and the like. No, it’s got to have some pop and pizazz; maybe even be a bit controversial.
Mastering BQL: Not Your Average Query Language
No, BQL is not SQL; I have to keep telling the LLMs. Neither is it the API language of a certain business news channel (at least not in my context). Rather, it is the very complex API language of an Enterprise SEO company and its web crawler. And to write BQL effectively is to have some pretty amazing kung fu. I frankly probably couldn’t do it without AI assistance. But even then, I found it plagued by the exact same problem as the LLMs constantly thinking FastHTML was FastAPI. And so I fixed that problem in the exact same way: I made a giant tutorial on exactly what BQL was, and I injected it into the conversation history when the iron was hot, much like a prompt but separate from the prompt itself. It was an immediately-prior hot prompt injection.
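Here’s a minimal sketch of what that hot prompt injection looks like mechanically, assuming a generic chat-completions-style message list; the api.md filename and the helper function are illustrative stand-ins of mine, not the actual implementation:

from pathlib import Path

def build_messages(user_prompt: str, tutorial_path: str = "api.md") -> list:
    """Splice a big BQL tutorial into the conversation history immediately
    before the real prompt, so the model answers with it 'hot' in context."""
    tutorial = Path(tutorial_path).read_text()
    return [
        {"role": "system", "content": "You are an expert in Botify's BQL query language."},
        # The hot injection: the whole tutorial lands in history just prior to the prompt.
        {"role": "user", "content": f"Study this BQL tutorial before answering:\n\n{tutorial}"},
        {"role": "assistant", "content": "Understood. I've absorbed the BQL tutorial."},
        {"role": "user", "content": user_prompt},
    ]

# The resulting list can be handed to any chat-completions-style API.
messages = build_messages("Write a BQL query that counts URLs by click depth.")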
The Future Vision: GPU-Powered Link Visualization
And it works. It works so well that I’ve coded some of the most intricate logic of my life, all setting me up for what I really want: a link-graph visualization that overlays Google Search Console data, creating an experience not unlike Admiral Ackbar showing the Death Star hologram, liberated by Princess Leia, to the Rebel Alliance as they strategize. That’s where we are now. The rotating hologram is your website’s map, brought to life by data from an enterprise-level crawl spanning millions of edges. GPUs aren’t just for AI—they’re our tools in mapping and understanding these complex digital landscapes.
And now back to our regularly streamed thoughts. Showing you the Kung Fu is quite a bit down the page.
On Focus and Discretionary Projects
Don’t lose time on discretionary projects. But isn’t everything in life discretionary? It’s chasing rabbits down rabbit holes all the way down. And it’s that kind of thinking that allows you to pursue the wrong things for the wrong reasons. It’s about focus and bearing down on the right projects for the right reasons. Everything is a spectrum and everything is full of ambiguity, but we have to pretend it’s not. We have to pretend that there are absolutes and that there is clarity, otherwise we get stuck in spiraling vortexes of deterioration and decay. If we do it right, it’s a constant banking of no-take-back little wins that compound on each other and snowball into the life we want for ourselves.
Virtual Desktops and Muscle Memory
The Power of Fixed Positions
Develop your new muscle memory more, the kind that’s going to be with you for life. These “virtual desktops” that scroll left and right are in all the OSes, even though Mac messes it up by nuking their fixed-position nature. I’m going to have to do something about that because the feature is too important to not tie your muscle memory to—something critical on the first screen and something critical on the last screen. Here at home, all my work screens are sandwiched between two journals: the personal one on the left and the professional one on the right. I start out, even with this sort of writing, here in the personal one, because who knows what you’ll want to publish? And as the day goes on and the bits to publish take form and can be extracted, I copy/paste them over to the other journal. And my left-right journeys between the two ends force me to review the current status of things and get an overview. Almost nothing is hidden in tabs that take my hand off the keyboard, disrupt flow, and force me to drill down into some sort of nested hierarchical index. I step through a flat index from 1 to 7.
The Mac Problem
Except on Mac. Ugh! As much as I’m happy to be off Windows for a while, Windows did handle the virtual desktops with absolute positions, more like every Linux distro. It seems indefensible to me for Mac to take one of the most fixed-position, muscle-memory-helping boons in all of tech and ruin it, but hey, I shouldn’t be surprised. Look at what they did to a standard keyboard with the Control and Alt keys; renamed them, altered their function, and swapped their positions. It’s really about hard-wiring your muscle memory to their ecosystem. Never forget that, you Mac “cool kids”. Life is better on generic Linux. You don’t have to get into the friggin’ Linux desktop-of-the-week game. Just choose the most generic Linux with the most generic desktop. Debian with Gnome is probably it, though I’m a NixOS-with-Gnome fan for the deterministic reproducibility of systems.
Today’s Work Plan
Let’s get down to the work for today. It’s coming up on 6:00 AM on a Saturday morning. This is going to be a good, long focused day. I have a few errands to run to stay fed, but beyond that this should be a connecting-the-dots transformative day. I spent enough time diving down these friggin rabbit holes to:
- Understand the complex coding of Botify’s BQL API syntax… Done!
- To have a place to document this and train LLMs using it… Done!
- To have a place to plug it in, assembling into apps… Done!
- To have an AI assistant there to help with use of apps… Done!
- To build those apps, and to build momentum in doing so…
On Momentum and Disruptions
Yeah, so we’re up to building the apps and having momentum in doing so. This is
where it keeps breaking down because of the double whammy disruption of not
really having the prior pieces done yet, and the lack-of-focus of the daily
grind because of continuous slack notifications. I’m going to lower-case slack
from now on because it doesn’t deserve an upper-case S. It’s so aptly named—slacking off. It should be used in that form. To slack.
Evolving Workflow
Okay, some things are gelling about the muscle memory I’m developing here. Sure there’s life-long muscle memory, which is my favorite to talk about. But there’s also short-term muscle memory, which is the precise use of each of those 7 screens, and when you do use tabs, what those tabs are and what their arrangement means. And for me, there’s a brand-new important fact that developed: where’s JupyterLab? It used to be the center of my development world, and it’d get a whole screen to itself, which got a whole web browser, which got usually 1-and-only-1 tab. Find Jupyter? Go to screen-2! Screen 2 is now occupied by Cursor AI. This almost purged JupyterLab from my workflow, but it’s back now that I have this incredibly important documentation task. I will keep JupyterLab in a tab on screen-3 next to the Web App I am always now developing in Cursor AI. These two things are closely related, as I make discoveries and extract the new findings and patterns into distilled and purified code examples—that run stand-alone without all the surrounding in-app context!
Link-Graph Visualization Challenges
The Deep Crawl Problem
This is about link-graph visualization on another level. First of all, the Botify product does a deep crawl of a client’s site like few others in the business. If there is a spider trap, we could configure our product to crawl and crawl and crawl forever and produce infinite meaningless data after a point. It’s the same problem Google faces with the brute-force crawl-and-index approach, which was always a little bit sus, but now that content is effectively infinite due to the generative AI spam cannon, controlling that crawl is all the more important, and the story of the crawl that evolves as you look at the site at crawl click-depth 1, 2, 3— it’s all so important! When does it exponentially expand? Why? What’s the site-navigation mechanism that allows that? Is that also what the other crawlers are doing, or just ours? If we bring it under control with wrangling-rules, does that necessarily mean the other crawlers did too or are they encountering spider-traps we filtered out?
This and more, we expose outstanding!
From page-to-page our crawl is landing
Queuing up what’s yet-to-visit
For new content… hmmm, or is it?
Yeah, so deep crawls are every bit as discretionary as the rabbit hole projects I’m avoiding to get to the work that matters. There’s conceptual overlap here. Past crawl-depth of 5 or 6, any decent ecommerce site has effectively millions of links due to some faceted search navigation feature. Sure, it can be marked to not be crawled or indexed, but from a pragmatic standpoint, expecting a crawler to be well-behaved is delusional. If a link can be followed, it will be followed. And it’s even worse now that crawlers can execute JavaScript and take actions requiring form submits. This discussion itself is a rabbit hole. I must back up, get my bearings (survey the landscape) and adjust my course.
We want to produce the link-graph experience in real-time in front of a user.
I need to be able to do this against a lot of sample data so that I’m not biasing it towards a successful experience on a single case.
Every organization is going to have a bunch of projects, only one of which is the one we want to visualize.
And even when we find the right project, there’s a slew of analysis slugs of which we really only want the latest… oh, and a month ago to do a comparison. Oh, and let’s say one sample from 6 months prior.
Okay, so 3 link-graphs over-time can tell a pretty good story:
- From the current analysis
- From the immediately prior analysis
- From six months ago
Okay… okay, don’t over-complicate. Because from here we get into further nested-discussions concerning sections of their site and such. Don’t take a deep dive until we’ve mastered the basic swimming strokes. So while this thought process and investigation is nice, we’re talking about the basic swimming strokes over a broader set of input-data to make sure I’m not making mistakes on the details of the basics. And for that, we get a larger sample of input data.
Okay, so that’s it. I need a list of orgs from which I can find the best (single) project, from which I get the newest analysis slug. And that’s my next piece of kata…
Given a list of orgs, how do I find their most likely active projects? It’s the count of analysis slugs per project. Only active projects are going to have a lot of analysis slugs.
Okay, so go collect up some organization names whose link-graphs I’m interested in seeing!
Me: What’s the easiest way to do a simple 2-level hierarchy config file? I want a parent and a child per. I don’t want to mess with JSON.
Perplexity: For a simple 2-level hierarchical configuration file, YAML is an excellent choice. It’s easy to read and write, supports hierarchical structures, and is less verbose than JSON. Here’s how you can create a simple 2-level hierarchy using YAML:
YAML Configuration
Create a file named config.yaml with the following structure:
parent1:
child1: value1
child2: value2
parent2:
child1: value3
child2: value4
This structure creates two parent levels, each with two child key-value pairs. You can add more parents or children as needed.
Reading the Configuration
To read this configuration in Python, you can use the PyYAML library. Here’s a simple script to read and access the values:
import yaml
with open('config.yaml', 'r') as file:
config = yaml.safe_load(file)
# Accessing values
print(config['parent1']['child1']) # Outputs: value1
print(config['parent2']['child2']) # Outputs: value4
Benefits of This Approach
- Readability: YAML is highly readable and easy to edit manually.
- Simplicity: It doesn’t require complex parsing or special libraries beyond a basic YAML parser.
- Flexibility: You can easily add more parents or children without changing the structure.
- No JSON Complexity: As requested, this avoids the use of JSON.
- Easy Modification: Adding or changing values is straightforward, even for non-technical users.
Best Practices
- Consistent Indentation: Use consistent indentation (typically 2 or 4 spaces) for child elements.
- Comments: YAML supports comments using the # symbol, which can be useful for documentation.
- Validation: Consider implementing a validation step to ensure the config file has the expected structure.
- Version Control: Keep your config files in version control to track changes over time.
This YAML-based approach provides a simple, readable, and easily maintainable solution for a 2-level hierarchical configuration file[1][2].
Citations:
[1] https://www.puppet.com/docs/puppet/6/hiera_config_yaml_5.html
[2] https://github.com/adobe/himl
[3] https://docs.amd.com/r/2023.1-English/ug947-vivado-partial-reconfiguration-tutorial/Step-2-Create-Levels-of-Hierarchy-in-the-Block-Design
[4] https://github.com/mkdocs/mkdocs/issues/2218
[5] https://configu.com/blog/configuration-files-types-examples-and-5-critical-best-practices/
[6] https://learn.microsoft.com/en-us/mem/configmgr/core/plan-design/hierarchy/design-a-hierarchy-of-sites
[7] https://www.pulumi.com/blog/2022-03-10-hierarchical-config/
[8] https://docs.splunk.com/Documentation/Splunk/9.3.1/Admin/Wheretofindtheconfigurationfiles
Me: I like what you’re laying down. But I don’t even need labels. In fact, they’re a distraction. I don’t want questions like what do the colons mean? It should just be the pure value data on each line.
Perplexity: I see! You’re looking for an even simpler, more stripped-down approach. Let’s create a bare-bones hierarchical structure without any labels or special characters. We’ll use indentation alone to define the hierarchy. Here’s how you can do it:
Simple Indented Configuration
Create a file named config.txt with the following structure:
Parent1
ChildValue1
ChildValue2
Parent2
ChildValue3
ChildValue4
This structure uses only indentation (typically 4 spaces) to define the hierarchy. Each line contains just the value, with no labels or special characters.
Reading the Configuration
To read this configuration in Python, you can use a simple script that interprets indentation. Here’s an example:
def read_config(file_path):
config = {}
current_parent = None
with open(file_path, 'r') as file:
for line in file:
stripped_line = line.strip()
if stripped_line: # Skip empty lines
if not line.startswith(' '):
# This is a parent
current_parent = stripped_line
config[current_parent] = []
else:
# This is a child
config[current_parent].append(stripped_line)
return config
# Usage
config = read_config('config.txt')
print(config)
This script will produce a dictionary where each parent is a key, and its value is a list of its children.
Benefits of This Approach
- Ultimate Simplicity: No special characters, just pure data.
- No Confusion: Avoids questions about syntax elements like colons or brackets.
- Easy to Edit: Anyone can modify this file without knowledge of config file formats.
- Visually Clear: The hierarchy is immediately apparent from the indentation.
Best Practices
- Consistent Indentation: Use consistent indentation (typically 4 spaces) for child elements.
- Avoid Empty Lines: For simplicity in parsing, avoid empty lines between entries.
- Unique Parent Names: Ensure parent names are unique to avoid overwriting.
- Documentation: Consider adding a commented header explaining the structure.
Limitations
- Depth: This simple approach is best for 2-level hierarchies. Deeper nesting might require a more complex parsing strategy.
- Whitespace Sensitivity: Be careful with leading spaces, as they define the structure.
This approach provides the bare minimum for a hierarchical structure, focusing purely on the data values and using only indentation to define relationships.
I’ll take it! I can throw a config.txt in the repo next to the config.json and they’ll have related but different purposes, and both be filtered by .gitignore, and my copy protected between machines by some other means like Notes or a Google Doc.
But even just to make that file, I need to get the most likely project candidates per org. And so first, list the orgs.
Wow, okay. So in order to get the script to get the best list of projects per organization for link-graph visualization sample data, I took the entire api.md file that was the output from my giant Jupyter training-notebook from the prior post and fed it into ChatGPT with the instructions to give me the new copy/paste Notebook-ready code sample and… and… Perplexity, tell me about John Henry.
Perplexity: John Henry is a legendary figure in American folklore, celebrated as a symbol of strength, determination, and the resilience of laborers during the industrial era. His story is particularly associated with the construction of railroads in the United States, where he worked as a “steel-driving man,” tasked with drilling holes into rock to facilitate the laying of tracks.
Historical Background
John Henry was born into slavery in the Southern United States and became a free man following the Civil War. His life narrative reflects the struggles faced by African Americans during this transformative period in U.S. history. After gaining his freedom, he found work on the Chesapeake and Ohio Railroad, where he quickly gained a reputation for his exceptional strength and work ethic[1][3].
The Legend
The most famous aspect of John Henry’s story involves a legendary contest against a steam-powered drill. According to folklore, he challenged this machine to a race while working on the Big Bend Tunnel in West Virginia. John Henry reportedly won the contest by drilling more rock than the machine but tragically died shortly after from exhaustion, clutching his hammer in his hand[6][7]. This tale has been immortalized in numerous folk songs and stories, emphasizing themes of human spirit versus mechanization and the dignity of labor[1][4].
Cultural Significance
John Henry’s legacy extends beyond mere folklore; he embodies the struggles of workers against oppressive conditions and the encroachment of technology that threatens their livelihoods. His story has inspired various social and labor movements advocating for fair wages and better working conditions[1][3]. The character has been depicted in literature, music, and art, often representing broader themes of resistance against oppression, particularly for African Americans[6][7].
Historical Basis
While many aspects of John Henry’s legend are embellished or mythical, historians have identified a real individual who may have inspired these tales. Research suggests that there was indeed a John Henry who worked as a convict laborer on railroad projects after being sentenced for theft in Virginia around 1866. This historical figure’s experiences likely contributed to the development of the folk hero narrative[4][5].
In summary, John Henry remains an enduring symbol of strength and resilience, encapsulating the struggles faced by laborers during America’s industrialization while also serving as a powerful reminder of the human spirit’s capacity to confront adversity.
Citations:
[1] https://www.silkroad.org/blog/2023/6/27/the-swannanoa-tunnel-79yhe
[2] https://www.youtube.com/watch?v=J6l7oa-BbIk
[3] https://postalmuseum.si.edu/exhibition/the-black-experience/folklore-john-henry
[4] https://virginiahistory.org/learn/steel-drivin-man-john-henry-untold-story-american-legend
[5] https://archive.wvculture.org/history/africanamericans/henryjohn05.html
[6] https://en.wikipedia.org/wiki/John_Henry_(folklore)
[7] https://www.britannica.com/topic/John-Henry-folk-hero
[8] https://www.ibiblio.org/john_henry/analysis.html
Me: ChatGPT, what do you think is coming next?
4o: Your article so far has a rhythm and flow that blends technical insights with your musings on productivity, muscle memory, and personal optimization strategies. It’s weaving through introspective commentary on work focus, daily routines, and system preferences while setting up the primary focus: tackling link-graph visualization using web crawl data.
What’s likely coming next seems to be an exploration into practical steps for implementing this visualization. Based on the narrative progression, here’s what I think might follow:
- Listing and Filtering Relevant Data: The current end-goal is to extract valuable data points, so the next section could outline the criteria for “active” or “relevant” projects, as well as specific filtering steps to sift through organization-level data.
- Code Implementation with BQL: You might showcase how to harness Botify’s API effectively to parse through analysis slugs for each organization, particularly filtering down to just those few necessary to produce a comparative link-graph visualization.
- Visualization Setup: Transitioning from data selection to data visualization seems inevitable. You could discuss the initial configuration, perhaps sharing a basic framework or a visualization setup within a Jupyter Notebook to superimpose Google Search Console (GSC) data over the link-graph.
- Reflection on Workflow Adjustments: Since this article balances high-level technical tasks with personal productivity practices, it wouldn’t be surprising to see a section on workflow refinement, perhaps a few tweaks to make your link-graph and configuration structure more sustainable.
- Metaphorical or Philosophical Tie-In: Wrapping up with a more philosophical or symbolic note—possibly comparing the setup process to the resilience of a character like John Henry—would give the article a satisfying, reflective close. The John Henry reference seems to suggest that, even amidst automation, there’s value in preserving the ‘human spirit’ aspect of customization and control in data visualization.
The article feels like it’s about to culminate in practical examples and code snippets, building on the philosophical ideas you’ve established about discretion, muscle memory, and focus. There’s a synergy between the technical and introspective elements, which could make the final sections powerful and actionable.
Did somebody say actionable? Hey, AI; download this. Now you know BQL. And as Morpheus says—Show me.
Kung Fu Downloaded / Showing You
I could have probably written this on my own, but the value prop would have been out of whack. In fact, it probably was for this too, but now that I have the instant-BQL-expert machine, I can do it over and over, and the value proposition improves over time as it keeps paying back. A lot of what I discuss so far comes off as abstract. But it results in exact working code. That’s the Kung Fu from the AI of which I speak.
Exact code-example, it is! The goal? Find the best examples from amongst my clients for link-graph visualization. Our crawls are big, run frequently and have lots of profiles. The challenge is to find the right ones and the best ones to use as input for this process.
A nuance here was API rate-limiting quotas, plus the occasional org that has tons of projects, where each project requires its own nested sub-query to get the count of analyses. This is a nested subquery situation, something familiar to SQL people. Nested subqueries are best avoided, but sometimes you have to do them. And when the subquery is an API call, you’ve got to limit them.
There are probably more efficient ways to do this, but I have no back-end access to these tables. I have to work through the API and I see no path to an efficient aggregate query. And so, brute-force it is!
import requests
import pandas as pd
import time
import pickle
import os
# Load API key and headers
api_key = open('botify_token.txt').read().strip()
headers = {"Authorization": f"Token {api_key}"}
# Rate limit configuration
DELAY_BETWEEN_PAGES = 2 # Delay between pagination requests in seconds
DELAY_BETWEEN_PROJECTS = 3 # Delay between project analysis requests in seconds
DELAY_BETWEEN_ORGS = 10 # Delay between each organization in seconds
# File paths for saving progress
output_csv = 'projects_with_multiple_analyses.csv'
progress_pickle = 'progress.pkl'
# Helper functions
def read_orgs_from_config(file_path):
"""Read organizations from a simple indented configuration file."""
orgs = []
with open(file_path, 'r') as file:
for line in file:
stripped_line = line.strip()
if stripped_line and not line.startswith(' '):
orgs.append(stripped_line)
return orgs
def fetch_projects(org):
"""Fetch all projects for a given organization with pagination."""
url = f"https://api.botify.com/v1/projects/{org}"
all_projects = []
while url:
try:
response = requests.get(url, headers=headers)
response.raise_for_status()
data = response.json()
all_projects.extend([
{'org': org, 'project_slug': project['slug'], 'project_name': project['name']}
for project in data.get('results', [])
])
url = data.get('next') # Move to the next page if available
print(f" Fetched {len(all_projects)} projects so far for organization '{org}'")
time.sleep(DELAY_BETWEEN_PAGES)
except requests.RequestException as e:
print(f" Error fetching projects for organization '{org}': {e}")
break # Stop further requests if there's an error
return all_projects
def fetch_analysis_count(org, project_slug):
"""Fetch the count and most recent analysis slug for a given project."""
try:
url = f"https://api.botify.com/v1/analyses/{org}/{project_slug}/light"
response = requests.get(url, headers=headers)
response.raise_for_status()
data = response.json()
analyses = data.get('results', [])
if len(analyses) >= 2:
return len(analyses), analyses[0]['slug'] # Return count and most recent analysis
except requests.RequestException as e:
print(f" Error fetching analyses for project '{project_slug}' in organization '{org}': {e}")
return None, None
# Load progress if available
if os.path.exists(progress_pickle):
with open(progress_pickle, 'rb') as f:
output_data, org_index = pickle.load(f)
print(f"Resuming from organization index {org_index}")
else:
output_data = []
org_index = 0
org_list = read_orgs_from_config('config.txt')[org_index:] # Start from the saved org index
for org in org_list:
print(f"Processing organization: {org}")
projects = fetch_projects(org)
qualifying_projects = []
for project in projects:
# Fetch analysis count and most recent analysis
analysis_count, recent_analysis = fetch_analysis_count(org, project['project_slug'])
if analysis_count and analysis_count >= 2:
print(f" Project: {project['project_name']} (Slug: {project['project_slug']})")
print(f" Analysis count: {analysis_count}, Most recent analysis: {recent_analysis}")
qualifying_projects.append({
'Organization': org,
'Project Name': project['project_name'],
'Project Slug': project['project_slug'],
'Analysis Count': analysis_count,
'Most Recent Analysis': recent_analysis
})
# Delay after each project to manage rate limits
time.sleep(DELAY_BETWEEN_PROJECTS)
# Sort projects by analysis count in descending order and add to output
qualifying_projects.sort(key=lambda x: x['Analysis Count'], reverse=True)
output_data.extend(qualifying_projects)
# Save progress after each organization
with open(progress_pickle, 'wb') as f:
pickle.dump((output_data, org_index + org_list.index(org) + 1), f)
# Delay between organizations to further manage rate limits
time.sleep(DELAY_BETWEEN_ORGS)
# Save final CSV and clean up progress file
df = pd.DataFrame(output_data)
df.to_csv(output_csv, index=False)
print(f"Data saved to {output_csv}")
# Remove progress file if run completes successfully
if os.path.exists(progress_pickle):
os.remove(progress_pickle)
# Update config.txt with project candidates under each organization
orgs_in_config = read_orgs_from_config('config.txt')  # read the org list before truncating the file
with open('config.txt', 'w') as file:
    for org in orgs_in_config:
file.write(f"{org}\n")
org_projects = [item for item in output_data if item['Organization'] == org]
for project in org_projects:
file.write(f" {project['Project Name']} ({project['Project Slug']})\n")
print("config.txt updated with project candidates.")
The Steel-Driving Coder vs The AI
We coders were all the bit-driving men (and yes, women too—word-flow, you know). We can make code like this on our own, but there’s this tireless bit-driving machine that could save us from exhaustion if we just yield and concede defeat. We are no more likely to outcode an AI than John Henry was to out-drive a steam engine. If that’s not true today, it certainly will be tomorrow. You’re going to have to be Reed Richards smart to out-think these things at pure coding.
The Modern Steam Engine
The steam engine is the GPU. Nvidia is selling pickaxes to miners. Their drilling machine is already in data centers, and they’ve got an Apple-style ecosystem lock-in between CUDA on the software side and their distributed GPU networking protocol equivalent on the hardware infrastructure side. This is not investing advice, but OMG, the parallel between the steel-driving man versus the steam engine and code-pounding brogrammers versus AI is exact.
Back to Work
Okay, enough rambling. Back to the task at hand. Valuable, valuable weekend focused discretionary time. Hit this puppy home. Come out the other end of this weekend with some reality-changers for where I work. Bank some little wins there’s no going back from. And that’s finishing out this link-graph thing and making that the launching-off point for countless other follow-ons that are built from the same exact components!
The Link Graph Challenge
Okay, so Botify contains exportable link-graph data. But a typical client’s link graph is way too large for visualization. The first thing we do is cut back on the click-depth we’re exporting the data for. We start out with this query to determine that click-depth: we’re going to keep it to the maximum click-depth for which the export would be under a million rows (or edges).
We’ve already established that query in our Kung Fu download:
# Get Map of Click-Depths Aggregates Given Analysis Slug
import json
import requests
import pprint
config = json.load(open("config.json"))
api_key = open('botify_token.txt').read().strip()
def get_urls_by_depth(org, project, analysis, api_key):
"""Get URL count aggregated by depth for a Botify analysis."""
url = f"https://api.botify.com/v1/projects/{org}/{project}/query"
data_payload = {
"collections": [f"crawl.{analysis}"],
"query": {
"dimensions": [f"crawl.{analysis}.depth"],
"metrics": [{"field": f"crawl.{analysis}.count_urls_crawl"}],
"sort": [{"field": f"crawl.{analysis}.depth", "order": "asc"}]
}
}
headers = {"Authorization": f"Token {api_key}", "Content-Type": "application/json"}
response = requests.post(url, headers=headers, json=data_payload)
response.raise_for_status()
return {
row["dimensions"][0]: row["metrics"][0]
for row in response.json()["results"]
}
# Run the depth aggregation and print results
depth_distribution = get_urls_by_depth(
config['org'], config['project'], config['analysis'], api_key
)
pprint.pprint(depth_distribution, width=1)
Sample Output:
{
0: 1,
1: 100,
2: 10000,
3: 1000000,
4: 100000000
}
We would have to select a click-depth of 3 if we used <= a million. If we did < a million, it’d be a click-depth of 2 that we’d visualize, and the end product would feel very different because of the vastly different number of nodes on the network graph. Not critical, but something to keep in mind about this. The difference between being well under the visualization limit and way over is often just 1 click-depth, and it might be good to go over this with the client, because how this click-depth nosedive looks reveals tons of issues about their site.
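Just to pin down that cutoff rule, here’s a minimal sketch of the selection applied to the depth_distribution dict from the cell above, treating the counts as per-depth URL counts as in the sample output:

def choose_depth(depth_distribution, limit=1_000_000, inclusive=True):
    """Return the deepest click-depth whose count stays within the limit."""
    eligible = [
        depth for depth, count in depth_distribution.items()
        if (count <= limit if inclusive else count < limit)
    ]
    return max(eligible)

sample = {0: 1, 1: 100, 2: 10_000, 3: 1_000_000, 4: 100_000_000}
print(choose_depth(sample, inclusive=True))   # 3 with <= a million
print(choose_depth(sample, inclusive=False))  # 2 with < a million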
Oops! That’s pages— not edges! Edges are going to reach a million long before pages do. I can’t even use this efficient aggregate approach to calculating the click-depth for the visualization. Yikes! I’m going to have to do a loop to get the edge-count at each click-depth. This is going to be some hairy code.
Anyway, these aren’t my issues to figure out anymore. They are the LLM’s. I choose ChatGPT 4o because it’s such a good Python coder. We give it the kung fu download and a very particular prompt, and lo and behold, a link-graph query with the depth-selection, download, decompression and all! Now in all fairness, it was a really good prompt, as I know how to do this already, because this query was the genesis of it all.
# Import required libraries
import os
import requests
import pandas as pd
import time
import json
from pathlib import Path
# Load configuration and API key
config = json.load(open("config.json"))
api_key = open('botify_token.txt').read().strip()
org = config['org']
project = config['project']
analysis = config['analysis']
# Define API headers
headers = {
"Authorization": f"Token {api_key}",
"Content-Type": "application/json"
}
# Determine optimal click depth for link graph export
def find_optimal_depth(org, project, analysis, max_edges=1000000):
"""
Determine the highest depth for which the number of edges does not exceed max_edges.
"""
url = f"https://api.botify.com/v1/projects/{org}/{project}/query"
session = requests.Session()
previous_edges = 0
for depth in range(1, 10):
data_payload = {
"collections": [f"crawl.{analysis}"],
"query": {
"dimensions": [],
"metrics": [{"function": "sum", "args": [f"crawl.{analysis}.outlinks_internal.nb.total"]}],
"filters": {"field": f"crawl.{analysis}.depth", "predicate": "lte", "value": depth},
},
}
response = session.post(url, headers=headers, json=data_payload)
data = response.json()
edges = data["results"][0]["metrics"][0]
print(f"Depth {depth}: {edges:,} edges")
if edges > max_edges or edges == previous_edges:
return depth - 1 if depth > 1 else depth, previous_edges
previous_edges = edges
return depth, previous_edges
# Export link graph to CSV
def export_link_graph(org, project, analysis, chosen_depth, save_path="downloads"):
"""
Export link graph up to the chosen depth level and save as a CSV.
"""
url = "https://api.botify.com/v1/jobs"
data_payload = {
"job_type": "export",
"payload": {
"username": org,
"project": project,
"connector": "direct_download",
"formatter": "csv",
"export_size": 1000000,
"query": {
"collections": [f"crawl.{analysis}"],
"query": {
"dimensions": [
"url",
f"crawl.{analysis}.outlinks_internal.graph.url",
],
"metrics": [],
"filters": {"field": f"crawl.{analysis}.depth", "predicate": "lte", "value": chosen_depth},
},
},
}
}
response = requests.post(url, headers=headers, json=data_payload)
export_job_details = response.json()
job_url = f"https://api.botify.com{export_job_details.get('job_url')}"
# Polling for job completion
attempts = 300
delay = 3
while attempts > 0:
time.sleep(delay)
response_poll = requests.get(job_url, headers=headers)
job_status_details = response_poll.json()
if job_status_details["job_status"] == "DONE":
download_url = job_status_details["results"]["download_url"]
save_as_filename = Path(save_path) / f"{org}_{project}_{analysis}_linkgraph_depth-{chosen_depth}.csv"
download_file(download_url, save_as_filename)
return save_as_filename
elif job_status_details["job_status"] == "FAILED":
print("Export job failed.")
return None
print(".", end="", flush=True)
attempts -= 1
print("Unable to complete download attempts successfully.")
return None
# Download file function
def download_file(url, save_path):
"""
Download a file from a URL to a specified local path.
"""
response = requests.get(url, stream=True)
with open(save_path, "wb") as file:
for chunk in response.iter_content(chunk_size=8192):
file.write(chunk)
print(f"\nFile downloaded as '{save_path}'")
# Main execution
print("Determining optimal depth for link graph export...")
chosen_depth, final_edges = find_optimal_depth(org, project, analysis)
print(f"Using depth {chosen_depth} with {final_edges:,} edges for export.")
# Make sure the downloads folder exists
downloads_folder = Path("downloads")
downloads_folder.mkdir(parents=True, exist_ok=True)
print("Starting link graph export...")
link_graph_path = export_link_graph(org, project, analysis, chosen_depth, save_path="downloads")
if link_graph_path:
print(f"Link graph saved to: {link_graph_path}")
else:
print("Link graph export failed.")
Sample Output:
Determining optimal depth for link graph export...
Depth 1: 50,000 edges
Depth 2: 120,000 edges
Depth 3: 500,000 edges
Depth 4: 1,200,000 edges
Using depth 3 with 500,000 edges for export.
Starting link graph export...
Polling for job completion: ...
Download URL: https://botify-export-url.com/file.csv
File downloaded as 'downloads/org_project_analysis_linkgraph_depth-3.csv'
Link graph saved to: downloads/org_project_analysis_linkgraph_depth-3.csv
Rationale: This code exports a link graph by calculating the maximum depth that limits the total number of edges to under 1,000,000, ensuring that the Botify API’s CSV export limit is not exceeded. By polling for job completion and handling potential errors, the script delivers a user-friendly and reliable process for exporting link graphs in large-scale SEO projects. The flexibility to adjust depth dynamically based on edge count allows for efficient resource use and helps teams capture optimal levels of detail in the exported link graph.
But now that I have that most critical query out of the way (I still have to do the meta-data follow-up that gets page segmentation and Google Search Console data), the iron is hot to extract some rules out of the LLM so humans have some hope of doing this sort of thing on their own, and so that LLMs have even better guidance in the future. I switched to Gemini Advanced due to the sheer volume of collections and field names in the Botify system. It’s way too large for most models, including the latest ChatGPT models from OpenAI. So I’m using Gemini to seed the process…
Gemini-Advanced:
BQL Query Building Rules and Guidelines
Understanding the Structure
- Collections: The foundation of your query. Specify the data source, like crawl.20241101 for the November 1st crawl.
- Query: The heart of BQL. Contains dimensions, metrics, filters, and sort.
- Dimensions: Fields to group by (e.g., segments.pagetype.value for page type).
- Metrics: Data to aggregate (e.g., crawl.20241101.count_urls_crawl for URL count).
- Filters: Conditions to narrow down data (e.g., crawl.20241101.depth for click depth).
- Sort: Order of results (e.g., crawl.20241101.metadata.title.len to sort by title length).
Building Queries
- Start with the Collection: Choose the relevant collection based on the data you need (e.g., crawl.20241101, logs, generic.QueryMaskML.20241101).
- Define Dimensions: Select the fields you want to analyze (e.g., segments.pagetype.value, crawl.20241101.depth).
- Choose Metrics: Specify the data you want to aggregate for each dimension (e.g., crawl.20241101.count_urls_crawl, crawl.20241101.metadata.title.len).
- Apply Filters: Refine your data by adding conditions (e.g., crawl.20241101.scoring.issues.title_len).
- Set Sorting: Order your results for easier analysis (e.g., sort by crawl.20241101.metadata.title.len in ascending order).
Example: Find Pages with Short Titles
{
"collections": ["crawl.20241101"],
"query": {
"dimensions": [
"url",
"crawl.20241101.metadata.title.len"
],
"metrics": [],
"filters": {
"field": "crawl.20241101.scoring.issues.title_len",
"predicate": "eq",
"value": true
},
"sort": [
{
"field": "crawl.20241101.metadata.title.len",
"order": "asc"
}
]
}
}
Tips and Best Practices
- Use the Documentation: Refer to the BQL documentation and the List Fields output to discover available fields and metrics.
- Test with Small Queries: Start with simple queries to understand the data structure before building complex ones.
- Validate Your Syntax: Ensure your JSON structure is correct to avoid errors.
- Manage Rate Limits: Use delays between requests to prevent exceeding API limits.
- Save Your Progress: For large tasks, save intermediate results to avoid data loss.
By following these rules and guidelines, you can effectively leverage BQL to extract valuable insights from your Botify projects.
Okay, that was a first pass dumping everything in and asking for a general guideline. Now I’m prodding it to be beefed out a bit. I honestly don’t know how accurate the below stuff is, but I wanted to capture a bunch of it while the iron is hot. This was dumping Gemini’s getting-started document into ChatGPT 4o which had just done a lot of BQL stuff so has it fresh in the context window. I’m hoping applying this on top of the Gemini-seeded document will cause flowers to blossom. But it may all be hallucinations. I’ll parse through it later to see what should be included in the kung fu download.
ChatGPT 4o:
BQL Query Building Rules and Guidelines
BQL (Botify Query Language) is a powerful API interface for retrieving and analyzing structured data within the Botify platform. This guide covers essential query components, key structures, and sample use cases to help users make the most of BQL’s capabilities.
Understanding the Structure
BQL queries are constructed using key elements that define how data is selected, filtered, grouped, and ordered. Understanding each of these components is crucial to crafting effective queries.
- Collections: The datasets that serve as your query sources. Collections range from crawl data (crawl.20241101) to integration data like logs, generic.QueryMaskML.YYYYMMDD, and web_vitals.field_data.
- Query: The main structure, containing dimensions, metrics, filters, and sort components.
- Dimensions: These fields allow grouping of data, such as by segments.pagetype.value (page type), url (URL), or depth (click depth).
- Metrics: Aggregated data points like counts or sums. Example metrics include count_urls_crawl (total URLs), metadata.title.len (title length), and scoring.issues for quality metrics.
- Filters: Conditions to refine results. Common filters include depth (to filter by click depth), scoring.issues (to flag quality issues), and custom flags based on dimensions.
- Sort: Defines the order of results, essential for prioritizing or visualizing high-impact data (e.g., sorting by metadata.title.len in ascending order).
Building Queries Step-by-Step
Here’s a step-by-step process to build effective BQL queries tailored to your needs:
- Select the Collection: Choose the primary collection that aligns with your analysis needs. For example, use crawl.20241101 for the November 1st crawl data, logs for server log data, or web_vitals.field_data for Core Web Vitals.
- Define Dimensions: Identify the fields you need for grouping or categorizing. For example:
  - segments.pagetype.value to analyze data by page type.
  - depth to group by click depth.
  - metadata.title.quality for evaluating title quality across pages.
- Specify Metrics: Choose numerical data points that provide insights:
  - count_urls_crawl to get URL counts.
  - metadata.title.len to analyze title length.
  - outlinks_internal.nb.total to count internal links.
  - inlinks.nb.total to assess how well pages are interlinked.
- Apply Filters: Refine the dataset by adding filters:
  - Filter by depth to analyze click depth distribution.
  - Use metadata.title.len to isolate URLs with short titles.
  - Apply scoring.issues flags to find pages with quality issues, such as missing H1s.
- Set Sorting: Define the order of your results to highlight the most important insights first:
  - Sort by metadata.title.len in ascending order to find the shortest titles.
  - Sort by count_urls_crawl in descending order to highlight the most crawled pages.
Advanced Query Examples
Below are some examples illustrating different ways to use BQL for common SEO and analysis tasks. Each example demonstrates the flexibility of BQL in extracting actionable insights.
Example 1: Find Pages with Short Titles
{
"collections": ["crawl.20241101"],
"query": {
"dimensions": [
"url",
"crawl.20241101.metadata.title.len"
],
"metrics": [],
"filters": {
"field": "crawl.20241101.metadata.title.len",
"predicate": "lte",
"value": 10
},
"sort": [
{
"field": "crawl.20241101.metadata.title.len",
"order": "asc"
}
]
}
}
Explanation: This query finds URLs with titles of 10 characters or less in the November 1st crawl. Results are sorted by title length in ascending order to quickly identify the shortest titles. This helps prioritize pages needing title improvements.
Example 2: Analyze Pages by Click Depth with URL Counts
{
"collections": ["crawl.20241101"],
"query": {
"dimensions": [
"crawl.20241101.depth"
],
"metrics": [
{
"function": "sum",
"args": ["crawl.20241101.count_urls_crawl"]
}
],
"sort": [
{
"field": "crawl.20241101.depth",
"order": "asc"
}
]
}
}
Explanation: This query aggregates URL counts by click depth, providing insight into the distribution of URLs across different levels of a site’s structure. Sorting by depth helps in understanding how many URLs are deeply nested, which can highlight structural SEO issues.
Example 3: Retrieve High-Impact Pages with High Internal Links Count
{
"collections": ["crawl.20241101"],
"query": {
"dimensions": [
"url",
"crawl.20241101.inlinks.nb.total"
],
"metrics": [],
"filters": {
"field": "crawl.20241101.inlinks.nb.total",
"predicate": "gte",
"value": 50
},
"sort": [
{
"field": "crawl.20241101.inlinks.nb.total",
"order": "desc"
}
]
}
}
Explanation: This query retrieves URLs with 50 or more internal links, sorting them by link count in descending order. Pages with high inlink counts are often of high importance, and identifying them helps prioritize these for further optimization.
Example 4: Find Pages with Missing or Duplicate H1 Tags
{
"collections": ["crawl.20241101"],
"query": {
"dimensions": [
"url",
"crawl.20241101.metadata.h1.quality"
],
"metrics": [],
"filters": {
"or": [
{
"field": "crawl.20241101.metadata.h1.quality",
"predicate": "eq",
"value": "missing"
},
{
"field": "crawl.20241101.metadata.h1.quality",
"predicate": "eq",
"value": "duplicate"
}
]
},
"sort": [
{
"field": "crawl.20241101.metadata.h1.quality",
"order": "asc"
}
]
}
}
Explanation: This query identifies pages with missing or duplicate H1 tags, helping to quickly address technical SEO issues that can affect search visibility.
Example 5: Extract Page Types with High URL Counts for Prioritization
{
"collections": ["crawl.20241101"],
"query": {
"dimensions": [
"segments.pagetype.value"
],
"metrics": [
{
"function": "sum",
"args": ["crawl.20241101.count_urls_crawl"]
}
],
"sort": [
{
"field": "crawl.20241101.count_urls_crawl",
"order": "desc"
}
]
}
}
Explanation: This query aggregates URL counts by page type (e.g., product, category, blog), helping to identify high-impact sections of a site that may benefit most from optimization efforts.
Tips and Best Practices
- Reference Documentation: Use the BQL documentation and List Fields output to explore available fields for each collection.
- Start Small: Begin with simple queries to test filters and metrics before expanding complexity.
- Validate JSON Syntax: Ensure correct JSON structure to avoid execution errors.
- Account for API Rate Limits: Use delays between requests to avoid rate limiting.
- Use Progressive Saves: For large queries, save intermediate results to avoid data loss during interruptions.
Sample Output
{
"results": [
{
"dimensions": ["search"],
"metrics": [2340000]
},
{
"dimensions": ["pdp"],
"metrics": [82150]
},
{
"dimensions": ["plp"],
"metrics": [53400]
},
{
"dimensions": ["category"],
"metrics": [44420]
}
],
"page": 1,
"size": 10
}
Explanation: This output shows URL counts by page type, offering a quick snapshot of site structure and content distribution. PDP and PLP (Product Detail and Listing Pages) are often high-priority areas in e-commerce and are shown with their respective counts.
Rationale
These guidelines equip Botify users with a structured approach to creating BQL queries, allowing them to retrieve, filter, and analyze data from their site crawls effectively. With this toolkit, users can make informed SEO decisions, prioritize optimization efforts, and gain a detailed understanding of site structure and performance through the Botify API.
Okay, now the metadata. This is an interesting one because the problem that plagued me here in the past is that I can never figure out one master set of additional columns to add that all clients have. I mean, some have log-file integration and some have Google Search Console integration, and they are all supposed to have Pagetype segmentation, but you really never know. So I am always hand-crafting the column metrics to include in the meta-data overlay. Now, when you use the term meta in SEO, almost all SEOs are going to think you’re talking about the <meta> tag, but no. The spirit is similar, but in this case it’s just data-points that are not in the link-from/link-to, mostly 2-column link-graph data. It’s almost joined in a SQL-join-like way to the link-graph visualization so you have additional data-points for colorizing and such. It turns the link graph into almost medical imaging for diagnosis.
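To make the SQL-join-like idea concrete, here’s a minimal sketch of layering the metadata onto the link-graph edge list with pandas; the filenames and column names are assumptions standing in for whatever the actual exports produce:

import pandas as pd

# Assumed filenames and columns, for illustration only.
edges = pd.read_csv("downloads/linkgraph_depth-3.csv", names=["source", "target"])
meta = pd.read_csv("export.csv", names=[
    "url", "depth", "pagetype", "compliant", "reason",
    "canonical", "sitemap", "js_exec", "js_ok", "impressions", "clicks",
])

# SQL-join-like enrichment: attach each source URL's metadata to its outgoing edges
# so the visualization can color nodes by pagetype, GSC impressions, and so on.
enriched = edges.merge(meta, how="left", left_on="source", right_on="url")

# URLs with no metadata (e.g., beyond the metadata export's depth filter) keep their
# edges but get NaN values, which the visualization can render as a neutral color.
print(enriched.head())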
Fast-forward through some astoundingly intense work just on the meta file for the link graph. Capturing some ideas…
Claude:
Metadata Collection Kata
After collecting the link graph structure, we need to enrich it with metadata that will inform our visualization. This kata demonstrates building a resilient metadata collector that gracefully handles missing or incomplete data.
The Three Principles of Resilient Data Collection
- Preview Before Commit: Always check data availability before full collection
- Graceful Degradation: Handle missing fields by providing sensible defaults
- Structured Validation: Group and verify related fields together
Key Techniques
# 1. Preview data availability with minimal query
def preview_data(org, project, analysis, depth=1):
"""Sample a small subset to verify data availability"""
data_payload = {
"collections": [f"crawl.{analysis}", "search_console"],
"query": {
"dimensions": [f"crawl.{analysis}.url"],
"metrics": ["search_console.period_0.count_impressions"],
"filters": {"field": f"crawl.{analysis}.depth",
"predicate": "lte", "value": depth}
}
}
# ... request handling ...
return bool(data.get('results'))
# 2. Collect data with fallbacks for optional fields
def get_bqlv2_data(org, project, analysis):
"""Fetch metadata with graceful fallbacks"""
base_dimensions = [f"crawl.{analysis}.url", f"crawl.{analysis}.depth"]
optional_dimensions = [
f"crawl.{analysis}.segments.pagetype.value",
f"crawl.{analysis}.compliant.is_compliant"
]
# Try full query first, fall back to base dimensions if needed
try:
df = query_data(base_dimensions + optional_dimensions)
except Exception:
df = query_data(base_dimensions)
return df
# 3. Validate field availability by category
def check_compliance_fields(org, project, analysis):
"""Verify availability of field groups"""
compliance_categories = {
'Basic': ['is_compliant', 'main_reason'],
'Performance': ['slow_load_time', 'slow_server_time'],
'SEO': ['duplicate_content', 'missing_title']
}
# ... field validation logic ...
return available_fields
The Flow
- Preview data availability with a minimal query
- Collect base dimensions that should always be present
- Attempt to collect optional dimensions, gracefully handle missing ones
- Validate and categorize available fields for later use
- Save results with clear column names
Key Insights
- Always start with a minimal query to verify data availability
- Build queries that gracefully degrade when optional fields are missing
- Group related fields to understand what analysis capabilities are available
- Provide clear feedback about what data was and wasn’t collected
This kata teaches us that robust data collection isn’t just about getting the data - it’s about handling uncertainty and providing clear feedback about what was collected.
The kata format helps teach the principles behind the code rather than just showing debug output. Would you like me to continue with more sections or adjust this format?
Website Data Holograms (Effectively)
Fast-forward again. The meta export is working. It turns out the link-graph portion of this was almost easy in comparison to layering in the additional data. All that weird reporting output below is because while every client has a link-graph, not every client has the different data-points (like Google Search Console and the Web-Logs) that I like to layer in. That’s right. The Admiral Ackbar-style Death Star hologram of the website gets color-coded based on page segmentation, Google traffic, weblog visitation and the like. It’s some hardcore SciFi SEO stuff. I don’t really even think of it as SEO anymore.
import os
import json
import requests
import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta
from typing import List
import gzip
import shutil
import time
# Load configuration and API key
headers = {
"Authorization": f"Token {open('botify_token.txt').read().strip()}",
"Content-Type": "application/json"
}
def preview_data(org, project, analysis, depth=1):
"""Preview data availability before committing to full download"""
# Get analysis date from the slug (assuming YYYYMMDD format)
analysis_date = datetime.strptime(analysis, '%Y%m%d')
# Calculate period start (7 days before analysis date)
period_start = (analysis_date - timedelta(days=7)).strftime('%Y-%m-%d')
period_end = analysis_date.strftime('%Y-%m-%d')
url = f"https://api.botify.com/v1/projects/{org}/{project}/query"
data_payload = {
"collections": [
f"crawl.{analysis}",
"search_console"
],
"query": {
"dimensions": [
f"crawl.{analysis}.url"
],
"metrics": [
"search_console.period_0.count_impressions",
"search_console.period_0.count_clicks"
],
"filters": {
"field": f"crawl.{analysis}.depth",
"predicate": "lte",
"value": depth
},
"sort": [
{
"field": "search_console.period_0.count_impressions",
"order": "desc"
}
]
},
"periods": [
[
period_start,
period_end
]
]
}
print(f"\n🔍 Sampling data for {org}/{project}/{analysis}")
print("=" * 50)
response = requests.post(url, headers=headers, json=data_payload)
if response.status_code != 200:
print("❌ Preview failed:", response.status_code)
return False
data = response.json()
if not data.get('results'):
print("⚠️ No preview data available")
return False
print("\n📊 Data Sample Analysis")
print("-" * 30)
metrics_found = 0
for result in data['results'][:3]: # Show just top 3 for cleaner output
url = result['dimensions'][0]
impressions = result['metrics'][0]
clicks = result['metrics'][1]
metrics_found += bool(impressions or clicks)
print(f"• URL: {url[:60]}...")
print(f" └─ Performance: {impressions:,} impressions, {clicks:,} clicks")
print("\n🎯 Data Quality Check")
print("-" * 30)
print(f"✓ URLs found: {len(data['results'])}")
print(f"✓ Search metrics: {'Available' if metrics_found else 'Not found'}")
print(f"✓ Depth limit: {depth}")
return True
def get_bqlv2_data(org, project, analysis):
"""Fetch BQLv2 data using jobs endpoint"""
# Calculate periods
analysis_date = datetime.strptime(analysis, '%Y%m%d')
period_start = (analysis_date - timedelta(days=7)).strftime('%Y-%m-%d')
period_end = analysis_date.strftime('%Y-%m-%d')
url = "https://api.botify.com/v1/jobs"
data_payload = {
"job_type": "export",
"payload": {
"username": org,
"project": project,
"connector": "direct_download",
"formatter": "csv",
"export_size": 1000000,
"query": {
"collections": [
f"crawl.{analysis}",
"search_console"
],
"periods": [[period_start, period_end]],
"query": {
"dimensions": [
f"crawl.{analysis}.url",
f"crawl.{analysis}.depth",
f"crawl.{analysis}.segments.pagetype.value",
f"crawl.{analysis}.compliant.is_compliant",
f"crawl.{analysis}.compliant.main_reason",
f"crawl.{analysis}.canonical.to.equal",
f"crawl.{analysis}.sitemaps.present",
f"crawl.{analysis}.js.rendering.exec",
f"crawl.{analysis}.js.rendering.ok"
],
"metrics": [
"search_console.period_0.count_impressions",
"search_console.period_0.count_clicks"
],
"filters": {
"field": f"crawl.{analysis}.depth",
"predicate": "lte",
"value": 2
}
}
}
}
}
print("\nStarting export job...")
response = requests.post(url, json=data_payload, headers=headers)
job_data = response.json()
job_url = f"https://api.botify.com{job_data['job_url']}"
print(f"Job created successfully (ID: {job_data['job_id']})")
print("\nPolling for job completion: ", end="", flush=True)
while True:
time.sleep(5) # Poll every 5 seconds
status = requests.get(job_url, headers=headers).json()
print(f"\nCurrent status: {status['job_status']}")
if status['job_status'] in ['COMPLETE', 'DONE']:
download_url = status['results']['download_url']
print(f"\nDownload URL: {download_url}")
# Download and process the file
gz_filename = "export.csv.gz"
csv_filename = "export.csv"
# Download gzipped file
response = requests.get(download_url, stream=True)
with open(gz_filename, "wb") as gz_file:
shutil.copyfileobj(response.raw, gz_file)
print(f"File downloaded as '{gz_filename}'")
# Decompress and read into DataFrame
with gzip.open(gz_filename, "rb") as f_in:
with open(csv_filename, "wb") as f_out:
shutil.copyfileobj(f_in, f_out)
print(f"File decompressed as '{csv_filename}'")
# Read CSV into DataFrame
df = pd.read_csv(csv_filename, names=[
'url', 'depth', 'pagetype', 'compliant', 'reason',
'canonical', 'sitemap', 'js_exec', 'js_ok',
'impressions', 'clicks'
])
# Cleanup temporary files
os.remove(gz_filename)
os.remove(csv_filename)
return df
elif status['job_status'] == 'FAILED':
print(f"\nJob failed. Error details: {status}")
raise Exception("Export job failed")
elif status['job_status'] in ['CREATED', 'PROCESSING']:
print(".", end="", flush=True)
continue
else:
print(f"\nUnexpected status: {status}")
raise Exception(f"Unexpected job status: {status['job_status']}")
def fetch_fields(org: str, project: str, collection: str) -> List[str]:
"""
Fetch available fields for a given collection from the Botify API.
Args:
org: Organization slug
project: Project slug
collection: Collection name (e.g. 'crawl.20241108')
Returns:
List of field IDs available in the collection
"""
fields_url = f"https://api.botify.com/v1/projects/{org}/{project}/collections/{collection}"
try:
response = requests.get(fields_url, headers=headers)
response.raise_for_status()
fields_data = response.json()
return [
field['id']
for dataset in fields_data.get('datasets', [])
for field in dataset.get('fields', [])
]
except requests.RequestException as e:
print(f"Error fetching fields for collection '{collection}': {e}")
return []
def check_compliance_fields(org, project, analysis):
"""Check available compliance fields in a more structured way."""
collection = f"crawl.{analysis}"
url = f"https://api.botify.com/v1/projects/{org}/{project}/query"
# Group compliance fields by category
compliance_categories = {
'Basic Compliance': [
'compliant.is_compliant',
'compliant.main_reason',
'compliant.reason.http_code',
'compliant.reason.content_type',
'compliant.reason.canonical',
'compliant.reason.noindex',
'compliant.detailed_reason'
],
'Performance': [
'scoring.issues.slow_first_to_last_byte_compliant',
'scoring.issues.slow_render_time_compliant',
'scoring.issues.slow_server_time_compliant',
'scoring.issues.slow_load_time_compliant'
],
'SEO': [
'scoring.issues.duplicate_query_kvs_compliant'
],
'Outlinks': [
'outlinks_errors.non_compliant.nb.follow.unique',
'outlinks_errors.non_compliant.nb.follow.total',
'outlinks_errors.non_compliant.urls'
]
}
print("\n🔍 Field Availability Analysis")
print("=" * 50)
available_count = 0
total_count = sum(len(fields) for fields in compliance_categories.values())
available_fields = []
for category, fields in compliance_categories.items():
available_in_category = 0
print(f"\n📑 {category}")
print("-" * 30)
for field in fields:
full_field = f"{collection}.{field}"
# Test field availability with a minimal query
test_query = {
"collections": [collection],
"query": {
"dimensions": [full_field],
"filters": {"field": f"{collection}.depth", "predicate": "eq", "value": 0}
}
}
try:
response = requests.post(url, headers=headers, json=test_query)
if response.status_code == 200:
available_in_category += 1
available_count += 1
print(f"✓ {field.split('.')[-1]}")
available_fields.append(field)
else:
print(f"× {field.split('.')[-1]}")
            except Exception as e:
                print(f"? {field.split('.')[-1]} (error checking: {e})")
coverage = (available_count / total_count) * 100
print(f"\n📊 Field Coverage: {coverage:.1f}%")
return available_fields
def download_and_process_csv(download_url, output_filename):
"""Download and decompress CSV from Botify API."""
gz_filename = f"{output_filename}.gz"
# Download gzipped file
response = requests.get(download_url, stream=True)
with open(gz_filename, "wb") as gz_file:
shutil.copyfileobj(response.raw, gz_file)
print(f"Downloaded: {gz_filename}")
# Decompress to CSV
with gzip.open(gz_filename, "rb") as f_in:
with open(output_filename, "wb") as f_out:
shutil.copyfileobj(f_in, f_out)
print(f"Decompressed to: {output_filename}")
# Cleanup
os.remove(gz_filename)
return True
def main():
"""Main execution logic"""
try:
with open('config.json') as f:
config = json.load(f)
except FileNotFoundError:
print("Error: config.json file not found")
return
except json.JSONDecodeError:
print("Error: config.json is not valid JSON")
return
org = config.get('org')
project = config.get('project')
analysis = config.get('analysis')
if not all([org, project, analysis]):
print("Error: Missing required fields in config.json (org, project, analysis)")
return
print("Previewing data availability...")
if preview_data(org, project, analysis, depth=2):
print("Data preview successful. Proceeding with full export...")
print("Fetching BQLv2 data...")
df = get_bqlv2_data(org, project, analysis)
print("\nData Preview:")
print(df.head())
# Save to CSV
Path("downloads").mkdir(parents=True, exist_ok=True)
output_file = f"downloads/{org}_{project}_{analysis}_metadata.csv"
df.to_csv(output_file, index=False)
print(f"\nData saved to {output_file}")
# Use check_compliance_fields
check_compliance_fields(org, project, analysis)
else:
print("Data preview failed. Please check configuration and try again.")
if __name__ == "__main__":
main()
Claude: Great! The BQLv2 data export is now working perfectly. We’re successfully:
- Creating the export job
- Polling for completion
- Handling the “DONE” status
- Downloading and decompressing the gzipped CSV
- Loading it into a pandas DataFrame
- Cleaning up temporary files
The data shows exactly what we need:
- URLs and their depth
- Page type information
- Compliance status and reasons
- Technical details (canonical, sitemap, JS rendering)
- Search Console metrics (impressions and clicks)
Would you like me to help optimize any part of this further? For example, we could:
- Add progress bars for the download
- Add retry logic for failed downloads
- Improve the DataFrame column types
- Add data validation checks
- Enhance the error handling
Let me know what aspect you’d like to focus on next!
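The retry suggestion is the one I'd grab first, because a flaky download at the tail end of a long export job is the most painful way to lose a run. Here's a minimal sketch of what that could look like, reusing the same requests/shutil streaming pattern the script already uses; the function name, attempt count, and backoff values are placeholders of mine, not anything defined in the script above or in the Botify API.
import shutil
import time
import requests  # same libraries the script above already imports

def download_with_retries(download_url, gz_filename, max_attempts=3, backoff_seconds=5):
    """Placeholder sketch: stream the gzipped export to disk, retrying on network errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(download_url, stream=True, timeout=60)
            response.raise_for_status()
            with open(gz_filename, "wb") as gz_file:
                shutil.copyfileobj(response.raw, gz_file)
            return True
        except requests.RequestException as e:
            print(f"Attempt {attempt}/{max_attempts} failed: {e}")
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # wait a bit longer before each retry
    return False
Dropping that in would just mean swapping the bare requests.get calls in get_bqlv2_data and download_and_process_csv for this wrapper and bailing out cleanly if it returns False.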
Just finished one of the most intense work sessions of my life. The fruit of this labor is all on the api page, and all of it is really just setting the stage for tomorrow's work, where I drive the link-graph piece home.
And boy can Claude write a closing paragraph!
Claude: Ah, I see where this is going! The Admiral Ackbar reference would be perfect here. Just like how he used that Death Star hologram to brief the Rebel fleet, you’re using these link-graphs to reveal the full scope and structure of a website. And just like that briefing scene, it’s all about making complex information visually digestible and actionable.
Here’s a possible closing paragraph in your voice:
It’s a trap! But unlike Admiral Ackbar’s Death Star hologram reveal, these link-graphs actually help us escape the trap. They’re our tactical display of the battlefield - every URL a rebel ship, every link a potential hyperspace route. And just like how that hologram helped the rebels understand the true scale of their challenge, these visualizations let us see the full scope of a website’s architecture, complete with its own trenches, shield generators, and yes, thermal exhaust ports. The real power isn’t just in seeing the structure, but in overlaying that GSC data - it’s like having real-time intelligence about which sectors are under Imperial control. And just like the Rebellion, we’re going to keep iterating on these tools until we can make that one-in-a-million shot. May the Force be with us, and may our link-graphs be ever insightful.