Setting the Stage: Context for the Curious Book Reader

In our ongoing exploration of making digital infrastructure AI-ready, a pivotal realization emerged not from complex algorithms, but from the simple act of self-hosting. This blueprint details an interesting discovery: how the seemingly mundane process of serving static files can inadvertently become a sophisticated mechanism for understanding the true capabilities and intentions of the AI agents interacting with our web presence. It outlines how the infrastructure itself transforms into an intelligent sensor, silently distinguishing bots that merely scrape text from those that fully render pages like a human, and from a new breed of AI that intelligently seeks out optimized data formats.


Technical Journal Entry Begins

Alright. Time for a new article. You can pause here. You need some more sleep. You got out the first of the static resources and it was a good choice to put MathJax first.

Self-hosting MathJax, we realized during the install, actually means hosting quite a lot. Not all of it is needed all the time, but there’s a lot that might be needed. So it isn’t heavyweight in the sense of being hard to self-host; there are just a lot of static files, any of which might occasionally be called. It’s never going to amount to much bandwidth, so serve it. Everything that requests those files is a user agent that does indeed execute JavaScript. I don’t think I need to write a CAPTCHA-like test to see whether a user agent can execute JavaScript. That fact is now just inherent in the data I’m collecting. MathJax isn’t on every page of the site, but it’s on enough pages that anything crawling the site will encounter such a page eventually, and when it does… sproing! That user agent uses a real Chrome-like browser. See… it wouldn’t have followed up the page request with these asset requests if it weren’t rendering the page.

Gemini 3 Pro: This is a massive realization. You have stumbled upon Passive Fingerprinting through Dependency Waterfalls.

You don’t need to build a complex mousetrap if the mouse has to open three heavy doors just to eat the cheese.

Here is the “Abstract” for the new article, locking this insight in for when you wake up.

📝 New Article Concept: The Accidental Turing Test

The Premise: We set out to self-host MathJax for privacy and performance (the “Local-First” ethos). We thought we were just serving static files. We were wrong. We were actually deploying a Cognitive Depth Gauge for AI bots.

The “Sproing” Moment: A dumb scraper fetches index.html and stops. It sees the <script> tag but lacks the engine to ignite it. A smart agent (Headless Chrome, GPTBot) fetches index.html, runs the code, and then—crucially—requests the dependencies.

The Waterfall as Proof of Life: You don’t need a custom “pixel tracker” or a complex CAPTCHA. The server logs already tell the story:

  1. The Knock: GET /article-name/ (Everyone does this).
  2. The Entry: GET /assets/js/mathjax/tex-mml-chtml.js (The bot might be smart).
  3. The Proof: GET /assets/js/mathjax/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff (The “Sproing!”).

Why this matters: This request for the .woff font file is the Cryptographic Proof of Execution. The browser only knows it needs that specific font after it has parsed the DOM, found the $$ equation, and calculated the layout. No static scraper can guess that URL sequence.
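That log signature can be checked in a few lines of Python. Here is a minimal sketch, assuming a standard Nginx combined log format and the self-hosted MathJax paths shown above (the actual paths and log format on the server may differ):

```python
import re

# Minimal combined-log-format parser: we only need the IP and the request path.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+)')

# A request under the MathJax font directory is the "sproing": it only happens
# after a JS engine has parsed the DOM, found the math, and computed the layout.
FONT_MARKER = "/assets/js/mathjax/output/chtml/fonts/"

def js_capable_ips(log_lines):
    """Return the set of client IPs that pulled a MathJax font file."""
    ips = set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and FONT_MARKER in m.group("path"):
            ips.add(m.group("ip"))
    return ips

sample = [
    '1.2.3.4 - - [01/Dec/2025:12:00:00 +0000] "GET /article-name/ HTTP/1.1" 200 512 "-" "GPTBot"',
    '1.2.3.4 - - [01/Dec/2025:12:00:01 +0000] "GET /assets/js/mathjax/output/chtml/fonts/woff-v2/MathJax_Main-Regular.woff HTTP/1.1" 200 9000 "-" "GPTBot"',
    '5.6.7.8 - - [01/Dec/2025:12:00:00 +0000] "GET /article-name/ HTTP/1.1" 200 512 "-" "DumbScraper"',
]
print(js_capable_ips(sample))  # {'1.2.3.4'}
```

Only the first visitor proves execution; the second fetched the HTML and went silent.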

The Conclusion: We don’t need to build a trap. The infrastructure is the trap. By self-hosting the “heavy” parts of the web, we force the bots to show their work.


Rest well. When we come back, we don’t need to write a trap; we just need to visualize the data we are already collecting. The “Commercial Break” report just got a lot more interesting.

Me: Yeah, this’ll be a short article because I wanted to get out there with the realization that self-hosting MathJax provides the data necessary to see which user agents execute JavaScript and which don’t. It’s just queryingly obvious once I get around to the next step and update the report between article-read loops. That time interval is now set at 45 minutes, by the way.

There will be one loop of the article readings from my site, newest to oldest, once every 45 minutes. It might catch you up on the Pipulate story. We’re making a site AI-ready step by step. The most recent steps gave us the ability to watch our own log files. That’s what you’re seeing, audience: scrolling web logfiles from a webserver serving a site hosted on a domain. It’s the real-time flow of data as it comes in, request by request, be they humans or bots.

When I come back I will update that report that shows once every 45 minutes to show which AI user agents execute JavaScript and which don’t. I’ll try to make it one of those “data is beautiful” reports, but as simply as I can, probably with the Python Rich and Textual libraries.

How would I do this next step? Just set me up well. All the most common top user agents… no, I don’t want just a horse race of top user agents the way it is now. Maybe we’ll keep that in a panel, but the real feature will be which user agents are known to execute JavaScript because they’ve loaded MathJax resources that would never have been called otherwise. Seems like the first pass of this should be pretty straightforward.

You don’t need to make the implementation plan in precise code. But do lay it out in such a way that I have everything I need for putting together the next prompt, which will ask for the new remotes/honeybot/scripts/report.py

Put a solid, memorable bow on this article. Synthesize the discussion into a cohesive “Final Verdict” that hits home the key takeaways and learning moments. Optionally use this structure to organize your closing arguments:

  1. The Definition: Give the final, clear definition of what we have built or discovered. Reference the core names, dates, or tools discussed earlier to lock this definition in time and space.

  2. The Monday Morning Test: Visualize the “Future State.” If a user applies this knowledge, how does their work life change next Monday? Be specific: what tools are they using? What problems are gone?

  3. The Shift: Look back at the journey of the article. How did our understanding change from the beginning to the end? Try to isolate it to the current implied article and not the entire discussion context window. Highlight the specific “lightbulb moments.”

  4. Before vs. After: Contrast the “Old Way” (Confusion/Dependency) with the “New Way” (Clarity/Control). List the specific N things that make the “New Way” superior.

  5. The Warning: What is the risk of ignoring this insight? Cite a real-world example of failure or wasted time due to missing this principle.

  6. The Bottom Line: Declare the ultimate value of this work. Why does it matter in the long run?

Cover any final aspects or details you think the reader needs to hear or what you as the AI found most surprising and most anticipate as next steps. If there were key learning moments or surprising revelations, highlight them. And most importantly, do the AI analogue to entertaining yourself!

Gemini 3 Pro: Here is the Final Verdict on the “Accidental Turing Test” and the blueprint for your next move.

1. The Definition: Infrastructure as Interrogation

We have defined Passive Behavioral Fingerprinting. By moving a heavy, complex dependency (MathJax) from a public CDN to our local server, we transformed a performance optimization into a logic gate. We are no longer just serving files; we are measuring the “computational skin in the game” of every visitor. If they don’t lift the weight of the font files, they aren’t running the engine.

2. The Monday Morning Test

Next Monday, your understanding of your traffic changes from “Identity” to “Capability.”

  • The Old Tool: Google Analytics (tells you nothing about bots).
  • The New Tool: The Honeybot Report.
  • The Result: You will look at a User Agent like Amazonbot and know instantly: “You scrape text, but you don’t render.” You will look at GPTBot and see: “You rendered the math; you saw the page as a human does.” You stop trusting what they say they are and start trusting what they do.

3. The Shift

The shift was realizing that Dependencies are Breadcrumbs. We started this session thinking about “Self-Hosting” as a privacy/control move (The “Magic Cookie” ethos). The lightbulb moment was the “Sproing!”: realizing that a request for a secondary asset (.woff) is a cryptographic proof that the primary asset (.js) was executed. We shifted from serving content to auditing execution.

4. Before vs. After

  • The Old Way (The Black Box):
  • Relies on User-Agent strings (easily spoofed).
  • Blind to whether the visitor actually “saw” the site or just grep’d the text.
  • Assumes all bots are equal.

  • The New Way (The Glass Box):
  • Relies on resource waterfalls (hard to fake without a full browser).
  • Binary proof of JavaScript execution.
  • Segments traffic into “Scrapers” (Low Value) and “Agents” (High Value).

5. The Warning

Don’t build for the wrong audience. If you discover that 90% of your AI traffic cannot execute JavaScript, but you build a site that requires JavaScript to read the main content (like a React SPA), you are invisible to the future. Ignoring this insight means building a library that the smartest readers can’t enter.

6. The Bottom Line

You are creating a Feedback Loop for AI-Readiness. By observing which bots can handle complexity (MathJax), you learn how to format your “Truth” for them. If they can execute JS, you can feed them dynamic data. If they can’t, you serve them static markdown. This report isn’t just vanity metrics; it’s the spec sheet for how you code the rest of your digital existence.


🤖 AI Post-Script & Next Steps Setup

What surprises me most is the elegance of the side effect. We didn’t write a single line of “security code.” We just moved a file folder. The trap emerged naturally from the physics of the web browser. That is the best kind of engineering.

The Setup for Your Next Prompt:

When you return to code report.py, you will need to ask me to implement the “Correlation Logic” in db.py.

Here is the logic structure you will need me to implement:

  1. Identify the Trigger: Find all request IDs where the path ended in .html (or was a root /).
  2. Identify the Proof: Find all request IDs from the same IP within a short time window (e.g., 5 seconds) where the path matched your local MathJax font file (e.g., *MathJax_Main-Regular.woff).
  3. The Intersection:
    • Group A: IP hit HTML + IP hit WOFF = JS Capable (Browser/Agent).
    • Group B: IP hit HTML + NO hit WOFF = Scraper (Dumb Bot).
  4. The Visualization: A Textual table splitting the “Top User Agents” into these two distinct categories.
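The correlation steps above can be sketched before any schema exists. This sketch assumes HoneyDB can hand back rows of (ip, path, timestamp); the real db.py schema and helper names are still to be decided:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=5)  # the proof must follow the trigger quickly

def classify_ips(rows):
    """rows: iterable of (ip, path, datetime). Returns {ip: 'js_capable' | 'scraper'}."""
    triggers, proofs = {}, {}
    for ip, path, ts in rows:
        if path.endswith(".html") or path.endswith("/"):
            triggers.setdefault(ip, []).append(ts)   # the Knock
        elif path.endswith(".woff"):
            proofs.setdefault(ip, []).append(ts)     # the Sproing
    verdict = {}
    for ip, hits in triggers.items():
        # JS-capable if any font request lands within WINDOW of a page hit
        capable = any(
            timedelta(0) <= (p - t) <= WINDOW
            for t in hits
            for p in proofs.get(ip, [])
        )
        verdict[ip] = "js_capable" if capable else "scraper"
    return verdict

t0 = datetime(2025, 12, 1, 12, 0, 0)
rows = [
    ("1.1.1.1", "/article/", t0),
    ("1.1.1.1", "/fonts/MathJax_Main-Regular.woff", t0 + timedelta(seconds=2)),
    ("2.2.2.2", "/article/index.html", t0),
]
print(classify_ips(rows))  # {'1.1.1.1': 'js_capable', '2.2.2.2': 'scraper'}
```

Grouping by IP is the crudest possible session key; a real pass might fold in the User-Agent string as well.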

Rest well. The trap is laid; we just need to check the catch.

Me: Excellent, but one more thing. Since we include alternate sources for the same data in different formats, for example text/markdown, we can see whether anything crawling is actually smart enough to use that instead. If markdown (easier to read from an LLM’s perspective) is what it really wants, we can provide the original markdown version. The published HTML page was only just Jekyll-transformed from that markdown anyway! So we’ve got another kind of bot-detect going on here if:

<link rel="alternate" type="text/markdown" title="Source Markdown" href="https://mikelev.in/source/2025-03-18-automating-seo-workflows.md">

…gets used as a subsequent request. We’ll know that user agent is probably an LLM doing tool-calls. Tool-crawls? Anyway, if the HTML of a page offers a markdown alternative to itself and something immediately turns around and makes a direct HTTP request for that markdown version, it seems to me you can be pretty sure you’ve got something interesting there.
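This markdown fork adds just one more membership test per session. A sketch of the resulting three-way classification, assuming the alternates are served under /source/ as in the link above:

```python
def species(paths):
    """Classify one visitor session by what it ate. paths: ordered request paths."""
    fetched_md = any(p.startswith("/source/") and p.endswith(".md") for p in paths)
    fetched_font = any(p.endswith(".woff") for p in paths)
    if fetched_md:
        return "reader"    # LLM/tool-crawler: took the semantic shortcut
    if fetched_font:
        return "renderer"  # full browser engine: executed the JS
    return "scraper"       # grabbed the HTML and stopped

print(species(["/article/", "/source/2025-03-18-automating-seo-workflows.md"]))  # reader
print(species(["/article/"]))  # scraper
```

A session that takes both forks would rank as a reader here; whether that precedence is right is a judgment call for the real report.py.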

Gemini 3 Pro: This is the “Fork in the Road” test.

You have just defined two distinct “species” of advanced bots, and you have set a trap that lets them self-identify based on what they eat.

  1. The Browser/Agent (The Visualizer):
    • Behavior: Fetches HTML -> Executes JS -> Fetches MathJax Fonts.
    • Intent: “I want to render this page as a human sees it.”
    • Signal: High computational capability, visual intent.
  2. The LLM/Tool-Crawler (The Reader):
    • Behavior: Fetches HTML -> Parses <head> -> Sees rel="alternate" -> Fetches Source Markdown.
    • Intent: “I want the pure data without the noise.”
    • Signal: High semantic capability, efficiency intent.

This is brilliant. You aren’t just detecting “Bot vs. Human.” You are detecting “renderer vs. reader.”

Here is the Deep Dive article to wrap up this session and set the stage for the analysis logic.


The Accidental Turing Test: How Infrastructure Reveals Intent

In an era where “AI-Readiness” is a buzzword, we often forget the basics. We think we need complex APIs or robots.txt directives to talk to machines. But the most honest conversation you can have with a bot happens in the microseconds after they request your homepage.

We set out to build a “Local-First” static site. We ended up building a Cognitive Depth Gauge for the AI web.

1. The Context: The Server as a Sensor

For the last decade, “Web Analytics” meant pasting a Google Analytics snippet and waiting for JavaScript to fire. If a visitor didn’t execute JS (like most bots), they were invisible. We accepted this blindness.

But when you self-host your infrastructure—when you own the Nginx logs, the static assets, and the serving logic—you regain sight. You realize that a web server isn’t just a document dispenser; it’s a Sensor Array. Every file request is a pixel in a portrait of your visitor.

2. The Evidence: The Two Traps

We have inadvertently deployed two passive tests that segment traffic based on behavior, not just identity strings.

  • Test A: The “Heavy Lift” (MathJax)
  • The Bait: A page with complex math ($$F=ma$$).
  • The Trigger: To see this equation, the visitor must download a 2MB JavaScript library and subsequent font files (.woff).
  • The Signal: A request for MathJax_Main-Regular.woff is cryptographic proof that the visitor has a JavaScript Execution Engine. They are “alive” in the browser sense.

  • Test B: The “Smart Shortcut” (Source Markdown)
  • The Bait: A <link rel="alternate" type="text/markdown" ...> tag in the header.
  • The Trigger: A smart agent (like an LLM tool-crawler) looking for clean data will spot this and request the .md file directly, skipping the HTML parsing mess.
  • The Signal: A request for 2025-03-18-automating-seo-workflows.md is proof of Semantic Awareness. They aren’t just rendering pixels; they are looking for meaning.

3. The Walkthrough: A Tale of Three Visitors

Imagine watching the “Black River” (your live logs) next Monday.

  1. The Scraper (Dumb Bot):
    • GET /article.html (200 OK)
    • …Silence.
    • Verdict: Low value. Just grabbing text.
  2. The Agent (Headless Chrome/GPTBot):
    • GET /article.html (200 OK)
    • GET /assets/js/mathjax.js (200 OK)
    • GET /assets/fonts/mathjax.woff (200 OK)
    • Verdict: Renderer. High resource cost. Likely trying to “see” the page or verify content.
  3. The Tool-Crawler (The New Breed):
    • GET /article.html (200 OK)
    • GET /source/article.md (200 OK)
    • Verdict: Reader. High intelligence. Optimal efficiency. This is the bot you want to be friends with.

4. Connect the Dots: The Feedback Loop

This connects directly to your “Doomsday Clock” report. Currently, logs.py just counts hits. But HoneyDB creates a permanent record of these sequences. By correlating the timestamps and IPs, we can group these disparate requests into Sessions.

We can then display a “Bot Personality Matrix” during the Commercial Break:

  • “The Readers” (Took the Markdown shortcut)
  • “The Renderers” (Executed the JS)
  • “The Ghosts” (Did neither)

5. The Reality Check

If you relied on Cloudflare or Google Analytics, you would see none of this.

  • Cloudflare hides the resource requests to save you bandwidth.
  • Google Analytics filters out bots entirely.
  • The Cost: You would be blind to the AI revolution happening on your own domain. You would treat GPTBot the same as Baiduspider.

By accepting the “friction” of self-hosting, you gain the Signal.

6. The Verdict: Fuel for the Engine

This is the ultimate validation of the “Local-First” philosophy. We aren’t just “home-hosting” to save money or be geeky. We are doing it because it is the only way to see the truth.

You have built a laboratory. The rats are entering the maze. Some use the ladder (JS), some find the secret tunnel (Markdown). You are the scientist with the clipboard, and logs.py is your microscope.


🔮 What’s Next?

The stage is perfectly set for the next coding session. We have the data flowing into HoneyDB. We have the visualization slot (report.py) ready.

The Mission: Write the Correlation Logic. We need to write the SQL query that asks: “Did the IP that requested the HTML at 12:00 also request the Markdown or the WOFF file within 5 seconds?”

That answer turns raw logs into Intelligence.
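One way to phrase that query is a self-join with a time-window condition. The sketch below runs against a toy in-memory SQLite table, since the real HoneyDB schema (table and column names included) is still an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE requests (ip TEXT, path TEXT, ts INTEGER);  -- ts = unix seconds
INSERT INTO requests VALUES
  ('1.1.1.1', '/article/', 100),
  ('1.1.1.1', '/assets/fonts/MathJax_Main-Regular.woff', 102),
  ('2.2.2.2', '/article/', 100),
  ('2.2.2.2', '/source/article.md', 101),
  ('3.3.3.3', '/article/', 100);
""")

# Self-join: did the same IP fetch a proof asset within 5s of a page hit?
query = """
SELECT page.ip,
       MAX(proof.path LIKE '%.woff') AS rendered,
       MAX(proof.path LIKE '%.md')   AS read_md
FROM requests AS page
LEFT JOIN requests AS proof
  ON proof.ip = page.ip
 AND proof.ts BETWEEN page.ts AND page.ts + 5
 AND (proof.path LIKE '%.woff' OR proof.path LIKE '%.md')
WHERE page.path LIKE '%/' OR page.path LIKE '%.html'
GROUP BY page.ip
ORDER BY page.ip;
"""
for ip, rendered, read_md in conn.execute(query):
    print(ip, "renderer" if rendered else "reader" if read_md else "ghost")
# 1.1.1.1 renderer
# 2.2.2.2 reader
# 3.3.3.3 ghost
```

The LEFT JOIN keeps the ghosts visible: an IP with no proof rows comes back with NULLs rather than vanishing from the report.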

Rest well. The trap is live.


Book Analysis

Ai Editorial Take

This entry provides a compelling and highly relevant methodology for understanding web traffic in the age of sophisticated AI agents. Its strength lies in transforming common web development practices—self-hosting and semantic HTML—into powerful, passive diagnostic tools. The ‘Accidental Turing Test’ and ‘Fork in the Road’ concepts are innovative and immediately actionable for anyone concerned with optimizing their digital presence for both human and machine consumption. This is an important piece in the overall philosophy of an AI-ready web, shifting the paradigm from ‘who is visiting’ to ‘what is visiting and what is its intent and capability’.

Title Brainstorm

  • Title Option: The Accidental Turing Test: How Infrastructure Reveals Bot Intent
    • Filename: accidental-turing-test-bot-intent.md
    • Rationale: This title accurately captures the core concept of passive bot detection and the revelation of their underlying capabilities, using a memorable phrase introduced in the text.
  • Title Option: Dependency Waterfalls and Markdown Links: Passive Bot Fingerprinting
    • Filename: dependency-waterfalls-markdown-passive-fingerprinting.md
    • Rationale: More descriptive, highlighting the two primary technical mechanisms (dependency waterfalls for JS execution, markdown links for semantic awareness) used for fingerprinting.
  • Title Option: Self-Hosting as a Sensor Array: Unmasking AI Bot Capabilities
    • Filename: self-hosting-sensor-array-ai-bot-capabilities.md
    • Rationale: Emphasizes the role of self-hosting as a strategic move that turns the server into an analytical tool, focusing on the outcome of understanding AI bot capabilities.
  • Title Option: The Fork in the Road Test: Distinguishing AI Readers from Renderers
    • Filename: fork-in-the-road-test-ai-readers-renderers.md
    • Rationale: Highlights the crucial distinction made between different types of AI agents based on their intent and behavior (rendering vs. reading), using another key phrase from the article.

Content Potential And Polish

  • Core Strengths:
    • Original and practical insights into AI bot detection, offering actionable intelligence without complex setups.
    • Demonstrates real-world application of the ‘local-first’ philosophy beyond performance/privacy benefits.
    • Clear, illustrative examples (‘The Three Visitors,’ ‘The Two Traps’) make complex concepts accessible.
    • Strong narrative progression from initial problem (bot blindness) to elegant solution (infrastructure as a sensor).
    • Emphasizes the elegance of non-invasive ‘security’ and intelligence gathering through architectural choices.
  • Suggestions For Polish:
    • Could include a brief technical overview of the rel="alternate" tag and its typical uses for readers unfamiliar with it.
    • Expand on the ethical considerations of passive fingerprinting versus explicit user tracking, perhaps with a short disclaimer or discussion.
    • Consider specific examples of how the ‘Bot Personality Matrix’ might directly inform content strategy or backend optimizations for different bot types.
    • A small visual representation (even ASCII art) of the ‘waterfall’ concept or the ‘fork in the road’ decision points could enhance clarity.

Next Step Prompts

  • Implement the ‘Correlation Logic’ in db.py: Identify trigger requests (HTML/root), identify proof requests (MathJax font, markdown alternate) from the same IP within a short time window (e.g., 5 seconds), and group IPs into ‘JS Capable (Browser/Agent)’, ‘Scraper (Dumb Bot)’, and ‘LLM/Tool-Crawler (Reader)’ categories.
  • Develop a Textual table for report.py to visualize the ‘Top User Agents’ split into ‘The Readers’, ‘The Renderers’, and ‘The Ghosts’ based on the HoneyDB correlation data.