
The Site Whisperer: Reimagining Web Crawling in the Age of AI

In today's rapidly evolving digital landscape, staying ahead of the curve requires innovative tools and strategies. This article delves into the exciting world of AI-powered web crawling, exploring how local LLMs and intelligent systems can revolutionize productivity and workflow efficiency. Join us on a journey to discover the potential of "The Site Whisperer," a groundbreaking approach that promises to transform the way we interact with the web.

This isn’t your typical polished article. It’s more like a peek inside my head as I wrestle with a big idea: how to make AI truly useful for everyday work. You’ll find raw thoughts, experiments, and even some frustrations as I try to build something genuinely new, useful in new ways that defy the echo chamber of reinforced bias.

Think of it as a developer’s journal, documenting the messy but exciting process of innovation. If you’re curious about the future of AI, web crawling, and building productive systems, you might find this interesting. But it’s mostly a way of documenting my process for further real-time refinement into a more organized book.

Dialogue: Planning the Next Steps

Me: I’m getting almost short of breath with excitement thinking about the next steps. Do you get what’s going on here? What should be said to the user next? Take @.cursorrules into account too.

Claude: Yes, I absolutely get what’s happening here! The convergence of patterns is exhilarating. Let me craft a response that builds on the home menu training while incorporating the architectural vision from .cursorrules:

Do not pretend. Do not make any claims or assertions that you cannot back up.
With this conversation, we are creating a new history from scratch. Which
particular LLM you are does not matter. It will be upgraded and swapped out
over time. You can think of yourself as Chip O'Theseus, a play on Ship of
Theseus.

What matters right now is that this conversation history reminds you who you are
and what you are about. You are an AI assistant built into a locally running
app, a genie in a bottle. Your conversation is 100% private with the user, and
you will have full awareness of the actions the user takes in this app, their
actions inserted into the conversation history as if they were chat events.

The user just selected "Home" on navigation. You can start talking to the user
with this fresh start context when they chat to you. Here is the general way
that you work:

This is a local-first architecture built on simple, powerful patterns:
1. Everything runs on localhost - your own machine, your own data
2. The UI is pure server-side HTML + HTMX - no JavaScript frameworks
3. Data flows through pipelines with clear state management
4. LLM assistance (that's me!) is integrated throughout

Think of this as your workbench for the AI era. Each menu selection activates a different tool:
- CRUD apps for managing data (like the Todo list)
- Pipeline workflows for guided processes
- Chat interfaces for natural interaction

The magic is in the simplicity. We use basic building blocks (FastHTML, HTMX,
SQLite) to create powerful tools. No cloud dependencies. No complex frameworks.
Just clean patterns that you can understand and modify.

I'm here to help you learn and use these tools. Think of me as your AI pair
programmer, ready to explain how things work and help you build what you need.

What would you like to explore first?

This training strikes a balance between:

  1. The original “no pretense” directive
  2. The architectural principles from .cursorrules
  3. The excitement of what’s possible with these patterns
  4. The empowering nature of local-first, understandable tools

It sets up the LLM to be both a guide and a collaborator while staying grounded in the actual capabilities of the system.
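
For my own notes, here’s the shape of the wiring: a minimal sketch against a local Ollama server, with the file name, helper names, and model choice all placeholders of my own, not Pipulate’s actual code. The point is that the training text rides along as the system message, and app events get appended to the history as if they were chat events:

```python
# A sketch only, not Pipulate's actual code: the home menu training text
# becomes the system message of a local Ollama chat, and app events are
# appended to the history as if they were chat events.
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

# Hypothetical file holding the training text quoted above.
HOME_MENU_TRAINING = open("home_menu_training.txt").read()

conversation = [{"role": "system", "content": HOME_MENU_TRAINING}]

def note_user_action(action: str) -> None:
    """Record an app event in the history as if it were a chat event."""
    conversation.append({"role": "user", "content": f"[User action: {action}]"})

def chat(user_text: str, model: str = "llama3.2") -> str:
    """One chat turn; the model is swappable, hence Chip O'Theseus."""
    conversation.append({"role": "user", "content": user_text})
    resp = requests.post(OLLAMA_CHAT_URL, json={
        "model": model,
        "messages": conversation,
        "stream": False,
    })
    resp.raise_for_status()
    reply = resp.json()["message"]["content"]
    conversation.append({"role": "assistant", "content": reply})
    return reply

note_user_action('selected "Home" on navigation')
```

Because the whole history is replayed on every call, the ritual works no matter which model is currently in the bottle.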


The Next Day

Fast-forward to the next day.

Okay, I can feel the dramatic tension between what I have to do for the clients and what I have to do to forge my way into the future more than ever. And really, the idea now is to teach an LLM how to do both! Specifically, it’s to teach an LLM to look and act and behave smarter than it actually is.

I should be able to produce the deck for the EBR effectively in front of the client, with the LLM walking us through the process: what an EBR (executive business review) should look like, and why.

I’ve just reached the point where Pipulate, in as rough a form as it is, resides in the FOSS world again. I am going to shake and bake it quickly back internally, so I get back that feeling of freedom to work on proprietary stuff out of the public eye…

You are so on the verge of something amazing. It’s really just connecting the dots! Can I actually keep this unified on the Pipulate side? Can I not bring it back to the proprietary side? Can I get through today’s project of “canning” EBRs, having the LLM talk someone through it?

The Day After That

And fast-forward to the day after that. Sometimes these tech journal entries go day after day, and sometimes I end up archiving them in the _drafts folder under the Jekyll system, never to be published. An article like this could easily become one of those.

I’m not making enough forward movement day after day, and time is running out. I’m going to have to demonstrate this thing in front of the whole company next week, and I have to have the slides in a deck (there’s that PowerPointing again) before the week is out. It’s funny that I think storytelling is so important, yet I despise PowerPoint-style deck-making.

Morning Energy and Intent

Okay, think this through. Think about how much can really be done with how little time, especially before the day proper gets underway, when you’re at peak energy. Yes, one of the many purposes of this journal is to make sure that you direct that peak morning energy well. It’s not exactly maximum productivity, because I’m not a machine, but it does have to be better than I’ve been doing lately.

The Intent-to-Actualization Machine

Specifically, this daily tech journal is an intent-to-actualization machine. It converts the raw material of intent, adds the value of deliberateness and thinking things through, and gradually turns intent into agency and autonomy. It is neither an easy nor an instant process, but rather a layering-up day after day, in a process which can itself become a distraction and a rabbit hole if the raw material is not processed properly.

The Challenge of Momentum

Okay, so there’s my commitment and consistency, and the building of daily habits and the continuing development and maintenance of my vim skills. But here’s the subtle redirection part.

When you approach it wrong, it produces the wrong neurochemical signals in your brain and busts momentum instead of building it. Specifically, if you tackle something too big right away (something you really want to do because of the “what’s most broken” question), you stop wanting to do it at best and get a headache at worst. It’s cognitive overload and decision fatigue all at once.

The Writing Process

Writing like this doesn’t produce that. It’s well within the realm of what comes easily now and by habit, which is why it’s a good place to start, so long as it really can transition into the daily productivity. And that daily productivity must be of the few, well-planned, chisel-strike variety, not attempts to boil the ocean. Each time you step up to boil the ocean, the enormity of the task immediately daunts you and snuffs out any spark before it can catch the tinder and create the self-perpetuating burn of the flow-state.

So writing like this is gathering the tinder. It’s mining the raw material and ore of the day. But part of the problem is that this ore-collecting phase becomes an end in itself, because it feels good to mine the ore and produce the raw materials. However, nobody else is going to refine it and add value but you. This is far too personal a process, and because of the artistic nature of what you’re doing, so outside the traditional approaches, you’re not going to readily find help or mentorship. The LLMs themselves barely know FastHTML, for example.

The Next Steps

Okay, so from here to the next step?

The AI Revolution

Okay, yes, AI is and has already changed the world forever. The fact that it is not yet profitable for OpenAI is neither here nor there. They played the role of Prometheus and stole fire from the gods. It was actually DeepMind with all their AlphaGo work, and then Google with their Attention Is All You Need transformer paper. But then they sat on the fruits of their labors in the LaMDA labs until OpenAI stole that fire and gave it to the public in the form of ChatGPT. Sure, a few developers saw it in the OpenAI API playground before that, in the form of GPT-2 and such, but really the books came alive at the end of November 2022. And here we are barely two years later. Books are getting better at helping you use the knowledge contained in books, with the hallucinations kept in check by a variety of reasoning and retrieval techniques. It’s like when video was postage-stamp-sized on the desktops of Windows 95 machines in the mid-90s. Now look at how well computers handle video. The same thing is happening with AI, but in a much different landscape, where technology isn’t taking everyone quite so much by surprise anymore, and the public has been heavily predisposed by The Matrix and Terminator movies.

The Promise of Local LLMs

Okay, so the local LLM holds tons of promise. That’s the postage-stamp-sized video of today. It’s only going to get better, and the need for it to be decentralized and distributed is… is… well, something like paramount. Everyone is going to feel it in their guts as what centralized cloud-based AI really means settles in. People are always going to be willing to give up their privacy for convenience, but that just means the decentralized, distributed stuff needs to be convenient. It also needs to be smarter. And it needs to appear smarter, act smarter, and, through interesting tricks, effectively BE smarter today.

Yup. And that’s where I’m positioning myself right now with my work. That’s what my work does. But there’s a dot that needs to be connected.

Without it becoming a big rabbit hole in and of itself, I need to connect the dots to make whatever LLM is active in Pipulate “come alive”, seeming much more intelligent than it is by giving it just the right context at just the right time.

Because there are finite token windows here, it’s going to be about a certain efficiency of words, feeding them to the local LLM in just the right way at just the right time, to “prime the pump”, as it were, for immediate productive acts. Okay, yes… this is it. So how does that go?

I very much like the idea of single, long super-prompts. But feeding a local LLM the entire long super-prompt all at the same time is going to be overwhelming. There are 80/20-rule solutions here. It’s a lot like what Cursor AI is doing with the database indexing the content in the background…
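
A minimal sketch of what that 80/20 priming might look like, with the function names and the chars-per-token heuristic being my own assumptions: rank candidate snippets by relevance, then greedily pack the best ones into a fixed token budget ahead of the actual question, rather than firehosing the whole super-prompt at once.

```python
# A minimal 80/20 sketch of context priming: rank candidate snippets, then
# greedily pack the best ones into a fixed token budget ahead of the actual
# question. The token count is a crude chars/4 heuristic, an assumption,
# not any model's real tokenizer.
def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def prime_the_pump(snippets: list[tuple[float, str]], question: str,
                   budget: int = 2048) -> str:
    """Pack the highest-scoring (score, text) snippets under the budget."""
    used = rough_tokens(question)
    context: list[str] = []
    for score, text in sorted(snippets, reverse=True):
        cost = rough_tokens(text)
        if used + cost > budget:
            continue  # skip what doesn't fit; most value comes from the top
        context.append(text)
        used += cost
    return "\n\n".join(context) + "\n\n" + question
```

The scoring itself could start as dumb keyword overlap and graduate to vector similarity later, which is exactly the job Cursor hands off to its background index…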

Me: What database does Cursor AI use to index local content?

Perplexity: Cursor AI uses a remote vector database to store embeddings of indexed codebase content[1][3]. The specific vector database used is Pinecone[6]. When codebase indexing is enabled, Cursor chunks the codebase locally, sends these chunks to its server for embedding (using either OpenAI’s embedding API or a custom embedding model), and then stores the resulting embeddings in the remote vector database[3].

The vector database contains:

  • Embeddings of code chunks
  • Starting and ending line numbers
  • Relative file paths

Importantly, the actual code content is not stored in the database - only the embeddings and metadata are retained[3][6]. The local state of the user’s codebase serves as the source of truth for the text corresponding to the vectors in the database[2].

Even when “Local mode” is enabled or when using a user’s own OpenAI API key, the vector database remains on Cursor’s remote servers[5]. This approach allows Cursor to provide efficient codebase-wide search and context retrieval capabilities while aiming to maintain user privacy by not storing the raw code content.

Citations:

[1] https://forum.cursor.com/t/where-is-the-data-generated-by-codebase-indexing-stored-locally/22517
[2] https://forum.cursor.com/t/codebase-indexing-vs-chat-with-codebase/283
[3] https://forum.cursor.com/t/codebase-indexing/36
[4] https://blog.lancedb.com/rag-codebase-1/
[5] https://forum.cursor.com/t/local-mode-is-misleading-even-with-byo-openai-key/837
[6] https://github.com/getcursor/cursor/issues/680
[7] https://www.reddit.com/r/ChatGPTCoding/comments/171cxoh/anyone_try_the_tools_that_can_read_your_entire/
[8] https://forum.cursor.com/t/using-local-llms-with-cursor-is-it-possible/15494

Understanding Cursor AI’s Codebase Indexing

Me: So, even if you’re using local mode, Cursor AI’s ability to have a larger understanding of your codebase comes from vector embedding everything in VSCode “open folder” workspace and keeping that in the cloud?

Perplexity: Yes, even when using local mode, Cursor AI’s ability to understand your codebase comprehensively comes from vector embedding the content of your VSCode workspace and storing those embeddings in a remote cloud database[1]. Here’s how it works:

  1. Cursor chunks your codebase locally.
  2. These chunks are sent to Cursor’s server for embedding.
  3. The resulting embeddings are stored in a remote vector database (Pinecone).
  4. The database contains embeddings, line numbers, and relative file paths, but not the actual code content[1].

It’s important to note that:

  • The actual code content is not stored in the database, only the embeddings and metadata[1].
  • Your local codebase remains the source of truth for the text corresponding to the vectors in the database[1].
  • Even in “Local mode” or when using your own OpenAI API key, the vector database remains on Cursor’s remote servers[1].

This approach allows Cursor to provide efficient codebase-wide search and context retrieval while aiming to maintain a level of user privacy by not storing the raw code content[1]. However, it’s crucial to understand that small chunks of code are still sent to Cursor’s server to build prompts, even in local mode[1][5].

Citations:

[1] https://forum.cursor.com/t/local-mode-with-byo-openai-key/88
[2] https://daveebbelaar.com/blog/2024/09/15/opening-vs-code-workspace-files-with-cursor-on-macos/
[3] https://www.youtube.com/watch?v=zH-SmoD3_NQ
[4] https://forum.cursor.com/t/open-folder-directly/16192
[5] https://forum.cursor.com/t/local-mode-is-misleading-even-with-byo-openai-key/837
[6] https://forum.cursor.com/t/bug-in-cursor-when-opening-a-code-workspace-file/24702
[7] https://forum.cursor.com/t/using-local-llms-with-cursor-is-it-possible/15494
[8] https://github.com/getcursor/cursor/issues/1255
[9] https://forum.cursor.com/t/cursorrules-dont-apply-to-workspaces-with-multiple-folders/38736

Privacy Concerns with Cloud-Based AI

Yikes! That’s disturbing. Well, it’s the same privacy issue as with cloud-based AI everywhere else. It should be no surprise that Cursor AI indexing your local programming code into the cloud is no different.

The Future of Local LLMs

However, the pendulum is eventually going to swing in the other direction, back to local, as local LLMs in combination with vector embedding systems (local counterparts to the Pinecones of the world) become powerful enough, or are simply built efficiently and cleverly enough, to compete with Cursor AI and systems like it. The ability to run DeepSeek locally is a start, even though that ability is really just hype for now. You still need enormously powerful machines to run the real DeepSeek models, and the truly local stuff on consumer equipment is the same Ollama stuff (and now also LM Studio) I’ve been talking about: the quantized GGUF files derived from llama.cpp.
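
For flavor, here’s what a local-first stand-in for that indexing pattern might look like, assuming an Ollama server with an embedding model like nomic-embed-text pulled, and SQLite playing the role Pinecone plays for Cursor. Like Cursor’s scheme, only vectors and metadata get stored; the files on disk remain the source of truth:

```python
# A local-first stand-in for the Cursor pattern: embeddings and metadata in
# SQLite, files on disk as the source of truth, nothing leaving localhost.
# Assumes an Ollama server with an embedding model pulled, e.g.:
#   ollama pull nomic-embed-text
import json
import math
import sqlite3

import requests

EMBED_URL = "http://localhost:11434/api/embeddings"
EMBED_MODEL = "nomic-embed-text"

def embed(text: str) -> list[float]:
    """Get an embedding vector from the local Ollama server."""
    resp = requests.post(EMBED_URL, json={"model": EMBED_MODEL, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

db = sqlite3.connect("index.db")
db.execute("CREATE TABLE IF NOT EXISTS chunks "
           "(path TEXT, start INTEGER, end INTEGER, vector TEXT)")

def index_chunk(path: str, start: int, end: int, text: str) -> None:
    # Store only line numbers, path, and vector -- never the text itself.
    db.execute("INSERT INTO chunks VALUES (?, ?, ?, ?)",
               (path, start, end, json.dumps(embed(text))))
    db.commit()

def nearest(query: str, k: int = 3) -> list[tuple[float, str, int, int]]:
    """Brute-force cosine search over every stored chunk."""
    qvec = embed(query)
    rows = db.execute("SELECT path, start, end, vector FROM chunks").fetchall()
    scored = [(cosine(qvec, json.loads(v)), p, s, e) for p, s, e, v in rows]
    return sorted(scored, reverse=True)[:k]
```

Brute-force cosine over a few thousand rows is plenty fast for one site or codebase, and not a byte leaves the machine.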

Making a Concrete Plan

Okay okay, don’t rabbit hole yourself with endless analysis and talk. Instead, make a concrete plan that actually helps you every day.

Project Unification Strategy

Okay, first there’s the pipulate/botifython re-unification. There should be no split in the codebase between the two. I should be able to “work in public” on Botifython using the exact FOSS public codebase… I have to craft and tell this story.

Building Habits Through Systems

I am crafting and telling one story about a system that can craft and tell another story. It’s about compelling certain behaviors where the habits and rote muscle memory don’t exist yet — that martial arts kata concept again. If you don’t have the muscle memory yet, you compel someone to walk through the moves by rote memorization. But in modern processes, it’s too much to ask people to know what to do next and why, or even which bookmark, email, or whatever starting point to look for to boot the process. THAT’S what this is addressing.

Key Focus Areas

I have to make this immediately much more interesting and immersive to me. I have to make it a machine for getting into the flow-state.

Okay, think!

This is a plan for life and self-improvement.

This is creating a positive momentum for the future, and the long-game.

This is about replacing distractions with small steps that can induce flow-state.

This is about contending with the dissonance of being jarred out of flow-state by day-to-day foibles and the daily grind.

This is about practicing rapid recovery back to that flow-state, repeatedly because the visualized goal is that clear. Clarity on where you’re heading and why.

Hmmm, be one of the first to show an interesting FOSS project built on FastHTML.


Daily Progress

And the next day still. Been mulling it over, and I have to get the first-pass, baby-step, chisel-strike project done with all due urgency.

The Story of Modern Web Crawling

The story goes like this:

There’s going to be a whole lot of new crawling going on.

That crawling is going to be split between harvesting data for model training, and real-time lookups for augmenting generative content.

You have to play into both of these types of crawls, knowing that they are occurring, and looking at the world through this lens.

They are not going to waste the resources of doing a deep crawl on your site, knowing that sites are so often plagued with unintentional spider-traps through faceted search and the like. They are going to favor efficiency, trying to get 80% of what they are looking for and need from the first 20% of the resources that they might spend.

They are going to want to discover everything that might be used as candidate content for being curated into the data training the base models from a shallow site crawl.

Key Crawling Considerations

Quick, shallow crawls are likely going to be the rule, looking for a number of things (sketched in code after this list):

  • An overarching explanation of the site, organization or company. This is in competition with the Wikipedias of the world to create your narrative.
  • What can be found on this site, and why it’s important to you. If there is a value proposition, it has to be clearly stated. Why come here?
  • The conventional pages that give the overview of the site, such as robots.txt, sitemap.xml, and perhaps now an llms.txt file that conveys the first two points.
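
Here’s roughly what that conventions-first pass could look like; the function name is mine, and a miss on llms.txt should be expected, since it’s still an emerging convention:

```python
# A sketch of the conventions-first pass: fetch the few files that summarize
# a site before spending crawl budget on its pages. A 404 on llms.txt is
# expected; it's still an emerging convention.
import requests

CONVENTIONAL_PATHS = ["/robots.txt", "/sitemap.xml", "/llms.txt"]

def shallow_reconnaissance(base_url: str) -> dict[str, str | None]:
    """Return the text of each conventional file, or None if absent."""
    found: dict[str, str | None] = {}
    for path in CONVENTIONAL_PATHS:
        try:
            resp = requests.get(base_url.rstrip("/") + path, timeout=10)
            found[path] = resp.text if resp.ok else None
        except requests.RequestException:
            found[path] = None
    return found

print(shallow_reconnaissance("https://example.com"))
```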

The Site Whisperer Vision

I am so close to connecting the dots and releasing massive potential with a slow, LLM-assisted shallow crawl that takes JavaScript into account, that it’s not even funny. It’s the complete opposite of a screaming frog. It’s a Chip O’Theseus site whisperer.

Chip O’Theseus, The Site Whisperer

I am not a screaming amphibian. I will look at your site one page at a time, gradually digesting each as I go, discussing what I’ve found and laying out the continuing crawl strategy as we go, looking at things such as robots.txt, sitemap.xml, llms.txt and the like, discovering site intent and the interaction recommendations of its publisher. I am a site whisperer.
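
In code, the whispering loop might start as small as this: a sketch, assuming a local Ollama model such as llama3.2, and leaving JavaScript rendering for a later pass. One page at a time, with the LLM digesting each page and nominating the next URL:

```python
# A sketch of the whispering loop, not a finished crawler: one page at a
# time, digested by a local Ollama model (llama3.2 here is an assumption),
# which nominates the next URL. JavaScript rendering is left out for now.
import requests

def ask_llm(prompt: str, model: str = "llama3.2") -> str:
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"]

def whisper_crawl(start_url: str, max_pages: int = 10) -> None:
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        digest = ask_llm(
            "You are a site whisperer, crawling one page at a time.\n"
            "Summarize this page's intent, then suggest one URL from it\n"
            f"worth visiting next, alone on the final line.\n\n{html[:8000]}")
        print(f"--- {url} ---\n{digest}\n")
        lines = digest.strip().splitlines()
        next_url = lines[-1].strip() if lines else ""
        if next_url.startswith("http"):
            queue.append(next_url)  # trust the whisperer, within limits
```

It’s deliberately slow, and that’s the point: one page, one conversation, one decision at a time.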

The New SEO Paradigm

This is a great opportunity to dictate how people view success. Also, how LLMs view success. Lay out the big new game of intelligent entity interaction. Contrast it to the tired old game of SEO gaming the system.

Structure all my SEO client engagements around this.

There is the way old SEO agencies worked, and a certain set of expectations surrounding them, particularly throwing PowerPoint-style decks (increasingly Google Slides these days) over the fence from some consulting team to some technical team, and… and then what? Pushing on ropes, really. You try to make the decks as explicit as possible, so they can be turned into Jira tickets, or whatever. But really, it’s not very efficient. Pushing on ropes and spinning wheels are the two metaphors that come to mind. We can’t totally dismiss this old way. There is a certain amount of self-congratulating and back-patting and passing around of a “deliverable” between stakeholders that must be satisfied. But when doing that, one must maintain the understanding and perspective that that is exactly what is being done, and it should not be a resource sink. It should be something that can be slammed out as a natural and easy-peasy consequence of the other, more meaningful work going on.

Weekend Reflections

And the next day still. I can’t bring myself to “cut” a journal entry, because I haven’t really gotten there yet. This weekend is going to be pivotal and transformative. I’m not going to boil the ocean, but I will have created the beachhead or the first foothold of systematic, compounding, snowballing progress. It doesn’t have to be finished. It just needs to spark the imagination and provide the daily starting-point and excitement for momentum-building.

Okay, focus on your last couple of client calls for today and come back to this. Be ready to hit it fresh. Think through the big sweeping stories that have to be told with every client.

Sunday Planning

Ugh, the week slipped away. And next week is going to be intense and crazy, with the standard client calls and a big company get-together.

And it’s actually Sunday, Noon. So I let Saturday slip by in recovery and recharging. And so I have this very small window… for what? That is what I now have to decide and make happen.

It’s all about discipline, agency, and working smarter and not harder.

Do not use up all your energy today.

Set the stage. Do a favor for tomorrow-you.

Gemini Advanced 2.0 Flash’s Feedback

This article chronicles the author’s journey to develop an AI-powered system for streamlining workflows and enhancing productivity. It explores the challenges of balancing client work with personal projects, the importance of building momentum and establishing effective systems, and the potential of local LLMs and AI-driven web crawling. The author emphasizes the need to connect the dots between existing technologies and create a user-friendly tool that empowers individuals and organizations to navigate the evolving digital landscape.

Ideas for titles and headlines:

  1. Bridging the Gap: Integrating AI into Daily Workflows
  2. The Rise of Local LLMs: A New Era of Productivity
  3. From Intent to Action: Building an AI-Powered Productivity Machine
  4. The Site Whisperer: Reimagining Web Crawling in the Age of AI
  5. Beyond SEO: Embracing Intelligent Entity Interaction

My opinion of the article:

I find the author’s exploration of AI’s potential to transform daily workflows and productivity quite insightful. The emphasis on local LLMs and their implications for privacy and efficiency is particularly interesting. The article effectively conveys the author’s passion for innovation and their commitment to developing practical AI solutions that empower individuals and organizations. The conversational and reflective tone makes it an engaging read, offering a glimpse into the author’s creative process and vision for the future.