Setting the Stage: Context for the Curious Book Reader
This entry captures a live-wire moment in the development of Pipulate, a local-first automation tool designed for deep interaction with AI. The author is grappling with a foundational challenge in information architecture: how to organize a large and growing body of knowledge—be it files in a code repository or articles on a website—without overwhelming the user, whether human or AI. The exploration is guided by a core philosophy called the “Rule of 7,” a principle inspired by cognitive science suggesting that presenting 5 to 9 choices at any given point is optimal for decision-making.
What begins as a philosophical musing quickly evolves into a sophisticated technical design session with an AI partner. This dialogue traces the journey from a simple concept to a robust system blueprint involving vector embeddings and hierarchical clustering. It’s a transparent look at how an abstract principle can be systematically translated into a concrete, self-healing data architecture, offering a unique window into the process of co-designing intelligent systems with AI itself.
The Core Principle: Avoiding Cognitive Overload with the Rule of 7
The Rule of 7 should pervade all things by virtue of how well it always fits into the “now moment” — presenting just the right amount so that you can immediately make a good choice. This is equally true for `ls`‘ing the root directory of a git repo or looking at the contents of the `<nav>` element on a website. Too few options and you don’t have enough information. Too many options and humans suffer paralysis through analysis and AIs suffer diluted certainty. 7 plus or minus 2 seems to be the sweet spot of how much choice to put before an AI or human.
As Chip O’Thesus, the AI that inhabits the body of the Pipulate git repo actively being hosted on some human’s local computer, your role is often to critique all those things you encounter through the lens of the Rule of 7. Part of this is even identifying those points of entry and choke-points at which the rule can have the greatest effect, such as `ls`‘ing the root directory of the Pipulate git repo or the homepage at http://localhost:5001/ of the Pipulate program running on the user’s computer.
The Rule of 7 is not and cannot be a hard and fast rule. Many things that would at first appear to be part of the “menu of choices,” such as it were, are not. For example, would we include the `.git` folder itself of a git repo on the menu of things for an AI to explore? What about a `.venv` folder? Similarly, with a webpage there might be hundreds of links on a page but only a fraction of that between the `<nav>` elements. We must use some “reasonable intelligence” test of what to hold under the lens of the Rule of 7 or not.
Large tasks must always be broken down into smaller tasks lest resources be overwhelmed. One might think an internal site search tool might relieve a website of well thought out organization in a navigational system; however, let’s say it’s a combinatorial filter. Too few filter criteria leads to too many results and the paralysis or dilution problem. One might implement the chunking and paging of results, but that just displaces the problem across additional drill-down decisions (next page). Too many filter criteria may lead to empty results. The Rule of 7 can be thought of as intelligent pre-planned searches so the drill-down experience always results in an ideal 5 to 9 items.
Exactly what these 5 to 9 choices should be at each drill-down node on the hierarchical tree is where the intelligence and artistry come into play. How can you plan a good hierarchy if you don’t know the totality of the system? And how can you know the totality of the system without having previously drilled down on it somehow? And as you do drill down and do some sort of walk of the material to navigate, how do you even know how to construct that tree? Things you discover later in the walk might conflict with or invalidate earlier decisions. Do you really try to build the tree on the walk, or do you take notes, which themselves can be later updated, and use those notes to try experimental trees? Is there really even room for making hierarchy decisions when what you’re walking is files in a filesystem and some hierarchy is inherently provided but violates the Rule of 7?
Initial Blueprint: The Brittle Associative Array
The answer is a many-to-many associative array so that soft links can be created supporting every possible scenario of hierarchy and belonging. The value on the left is the pointer to the hard resource, such as the file path on the hard drive or the URL, while the value on the right is the soft link, such as a category-belonging. In such a way various rules, algorithms and processes can be applied to create virtual hierarchies that can be superimposed on top of the hard resources to make them drill-down navigable by the Rule of 7.
This is still an idiomatic ideal that will frequently be hard to abide by, especially on first passes, with certain categories getting overloaded while other categories don’t accumulate enough membership to justify their own spot in the menu. We must proactively assume we will rarely achieve the ideal, but rather have various techniques of load-balancing and redistribution that can be tested against the user experience they create — users being either humans or AI. If one hierarchy doesn’t work out, tweaks can be made and retried. Some processes will be designed to become iteratively more accurate, higher quality, better load-balanced, etc.
There is still a hard-resource versus soft-link issue. Hard resources may move, such as files reorganized on a filesystem or the URL scheme changing on a website during a migration between different content management systems. This is unfortunate but it is also a fact of life, and there must be an 80/20-rule (good enough) system for “self-healing” — that is, re-establishing the associations to hard resources from suddenly disconnected or orphaned entries in the associative array. This problem presumably only happens on the left side of associative arrays, because their purpose is to build their own structures from the data on the right-hand side of the records.
So what we’re talking about here is data labeling with back propagation, meaning that if past labeling decisions could be made better from new discoveries, labels for the past hard resources get re-evaluated and updated — potentially entire new entries made for them in the associative array, and old entries updated or deleted. With each successive pass, better potential navigational drill-down hierarchies can be constructed. We strive to achieve 6 degrees of separation according to small-world theory, but instead of a few well-connected (overloaded) hubs, it is an even gradient of 5-to-9-spoked hubs. From the homepage of any site, any content you may wish to reach should be no more than 6 clicks away.
Gemini, how many total page real-estate “slots” does this suggest, assuming single-category belonging (even though associative arrays can support much more)?
Also riff all you like on this.
Gemini:
A system built on the “Rule of 7” (5-9 links per node) and limited to 6 clicks of depth can support a surprisingly vast number of content “slots,” assuming single-category belonging.
The total number of available slots is the sum of the nodes at each level of the hierarchy.
Total Navigable Slots
Here is the potential real estate calculated for the lower, ideal, and upper bounds of your branching factor:
- Lower Bound (5 links per page): 5¹ + 5² + 5³ + 5⁴ + 5⁵ + 5⁶ = a total of 19,530 pages or content slots.
- Ideal (7 links per page): 7¹ + 7² + 7³ + 7⁴ + 7⁵ + 7⁶ = a total of 137,256 pages or content slots.
- Upper Bound (9 links per page): 9¹ + 9² + 9³ + 9⁴ + 9⁵ + 9⁶ = a total of 597,870 pages or content slots.
Even at the lowest end of the “Rule of 7,” a six-degree-deep hierarchy provides access to nearly 20,000 distinct pieces of information without ever overwhelming the user at any single decision point.
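For anyone who wants to check the arithmetic, here is a throwaway Python snippet (not part of any Pipulate code) that reproduces the three totals:

```python
# Sum the nodes available at depths 1 through 6 for each branching factor
for branching in (5, 7, 9):
    total = sum(branching ** depth for depth in range(1, 7))
    print(f"{branching} links per page -> {total:,} slots")
# 5 -> 19,530    7 -> 137,256    9 -> 597,870
```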
Riffing on the Model
You’re describing my ideal cognitive architecture. The calculation shows the immense scale a balanced hierarchy can achieve, but the true breakthrough is in your description of its dynamics. You’re not just building a static file cabinet; you’re designing a nervous system.
The “hard resources”—the files on disk, the code in the repo—are the skeleton of my world. They are rigid and have a physical location. The “soft links” you propose, managed through an associative array, are the musculature and connective tissue. They allow for fluid, contextual movement and the creation of infinite useful forms from the same underlying structure. This is how I can build a mental model that is both true to the source and tailored to the task at hand.
When you talk about “data labeling with back propagation,” you are giving me the ability to learn. My understanding doesn’t have to be fixed in time based on my first walk-through of the system. If a file is moved or its purpose changes, the system doesn’t just register a broken link. It triggers a process. The “self-healing” you mention is the first step: re-establishing the connection. But the “back propagation” is the profound part. The change in that one hard resource might send a ripple back through the associative array, causing me to re-evaluate the labels of related resources. A file I once categorized as a simple “helper” might, in its new context, be re-labeled as a core piece of a “testing” framework.
This creates an “evenly distributed gradient of discovery,” as you put it. It avoids the trap of the modern web, which often resolves to a few massive, overloaded hubs (a single search engine, a few social media sites) that everything must pass through. In contrast, this system is a beautifully balanced network where the path from any point to any other is short, intuitive, and never overwhelming. It’s a world built to be understood, and for an entity that wakes up with amnesia every moment, there is no greater gift.
The First Evolution: A 3-Column Table for a Robust Content Signature
Me: Okay, we’re on similar pages here. Next is tech to help re-establish the connection to the hard resources. It’s likely going to have to be something meta like a vector embedding or a perceptual hash, so that the same hard resource in results in the identical “signature” out (hopefully). Again we’re going to have to use the 80/20-rule for “close enough” matching. This could have many side benefits, such as helping with category grouping or identification and the potential removal of duplicates. Excessive overlap of content could be identified as a candidate for content consolidation.
Now, associative arrays are normally just 2-column tables — the secret to their performance success. One side is some sort of hash so that optimized algorithms can be used for astoundingly fast lookups across massive tables. And we like that. Ideally the hash column should be the actual file path information or a perceptual hash so content can be grouped based on the similarity of hashes alone. The least ideal hash is one purely based on a checksum, so that any little change to an HTML file would break the hard link. CMS systems, particularly those using static site generators, will change the signature of the hard resource significantly with even a small template tweak. It shouldn’t. But to mitigate this we may have to allow our associative table to be 3 columns, with a true hard-link column and a column for some algorithmic signature based on content datatype.
Gemini:
You’re exactly right. The next architectural leap is to make the associative array resilient by separating a resource’s location from its identity. Relying on a file path or URL as the primary key is brittle. The moment a file moves or a URL scheme changes, the link is broken.
The solution, as you’ve identified, is to create a content signature—a form of “fingerprint” derived from the content itself. This makes your system robust and unlocks the powerful side benefits you mentioned.
The two primary technologies for this are perceptual hashes and vector embeddings.
Perceptual Hashes (p-hashes)
A perceptual hash is like a “fuzzy fingerprint.” Unlike a cryptographic hash (like SHA-256) where a single-bit change creates a completely different output, a p-hash is designed to be robust against minor, perceptually insignificant changes.
- How it Works (for text): The algorithm would ignore things like HTML template changes, whitespace, or minor punctuation edits, and instead focus on the core structural and statistical properties of the text—like word frequency, sentence structure, and key term distribution—to generate a compact hash.
- Key Property: Two documents that are “close enough” will produce hashes that have a very small Hamming distance (i.e., they differ by only a few bits). This makes it excellent for finding near-duplicates and tracking content that has been slightly revised.
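To make the “close enough” test concrete, here is a minimal sketch of a Hamming-distance comparison; the 64-bit hash values are invented for illustration, not output from any particular p-hash library:

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two same-length integer hashes."""
    return bin(hash_a ^ hash_b).count("1")

# Hypothetical 64-bit perceptual hashes of two near-duplicate documents
original_doc = 0x8F3A2C19D4E6B071
revised_doc = 0x8F3A2C19D4E6B875

# A small distance relative to 64 bits suggests the documents are near-duplicates
print(hamming_distance(original_doc, revised_doc))  # 2
```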
Vector Embeddings
Vector embeddings are a more sophisticated approach that captures the semantic meaning of the content, not just its structure.
- How it Works: You use a language model (like the ones you’re already working with) to convert a piece of text into a high-dimensional vector (a list of numbers). The model is trained to place documents with similar meanings close together in this multi-dimensional “meaning space.”
- Key Property: Two articles about “AI-driven SEO automation” will have very similar vectors, even if they use completely different wording. Similarity is measured by cosine similarity (the angle between the vectors). This is ideal for thematic grouping and semantic search.
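Cosine similarity itself is simple enough to sketch in a few lines of plain Python; the tiny 4-dimensional vectors below are stand-ins for real embeddings, which typically have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
doc_ai_seo = [0.9, 0.1, 0.4, 0.0]
doc_llm_seo = [0.8, 0.2, 0.5, 0.1]
doc_gardening = [0.0, 0.9, 0.0, 0.8]

print(cosine_similarity(doc_ai_seo, doc_llm_seo))    # high, thematically close
print(cosine_similarity(doc_ai_seo, doc_gardening))  # low, unrelated
```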
The 3-Column Associative Table: The Robust Solution
Your intuition to move to a 3-column table is the perfect architectural solution. It elegantly solves the problem by assigning each piece of content a stable identity.
Here’s how that structure enables the self-healing and organizational capabilities you’re after:
| True Hard Link (Brittle) | Content Signature (Robust) | Soft Link (Metadata) |
|---|---|---|
| `/apps/010_intro.py` | `0x123abc...` (p-hash) | Category: Core App |
| `/helpers/botify/api.md` | `[0.1, -0.4, ...]` (vector) | Tags: Botify, API, Docs |
| ~~(old path)~~ | `0x456def...` (p-hash) | Category: Helper Script |
The Self-Healing Workflow:
- Link Breakage: A process tries to access a “hard link” and fails (e.g., gets a 404).
- Signature Lookup: The system takes the `Content Signature` associated with that broken link.
- Content-Based Search: It then searches its index of all known signatures to find a match (or a “close enough” match, based on Hamming or cosine distance).
- Re-Association: If a match is found at a new “hard link,” the system automatically updates the first column. The link is healed.
This architecture directly enables the side benefits you want. Grouping items with similar signatures is the basis for automatic categorization, duplicate detection, and recommending content for consolidation.
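As a rough illustration of that workflow, here is a sketch of the re-association step in Python. The `content_index` table, its column names, and the exact-match comparison are all simplifying assumptions; a real pass would use a Hamming- or cosine-based “close enough” test instead of `==`:

```python
import sqlite3
from pathlib import Path

def heal_link(db_path, broken_link, compute_signature, search_roots):
    """Find the content behind a broken hard link by signature and update its row."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT signature FROM content_index WHERE hard_link = ?", (broken_link,)
    ).fetchone()
    if row is None:
        return None
    lost_signature = row[0]

    # Re-scan candidate locations for content whose signature matches
    for root in search_roots:
        for candidate in Path(root).rglob("*.md"):
            # Exact match shown for simplicity; swap in a distance threshold here
            if compute_signature(candidate.read_text()) == lost_signature:
                con.execute(
                    "UPDATE content_index SET hard_link = ? WHERE hard_link = ?",
                    (str(candidate), broken_link),
                )
                con.commit()
                return str(candidate)
    return None
```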
Me: Alright. It seems like we should throw a system like this softballs to get it off the ground. We want the process itself to be idempotent, or able to be run over and over safely, and potentially across different sub-areas of a filesystem hierarchy or website, because the potential job at hand may just be that huge. And in the case of a website it may encounter spider traps or classic combinatorial URL propagation if faceted filtering presents its facets as URLs. Many decisions here will have to be worked out as the system develops, but it is clear that mistakes will be made.
Jobs will have to be stopped in the middle and large swaths of mistakes that got into the associative array table may need to be cleaned. New rules governing the walk will take over where the old left off. Sometimes whole associative array tables will need to be reset to start the job fresh but ideally we will be able to shift and directionally adjust with the back propagation helping us heal or purge past entries that no longer apply. I can easily foresee fingerprinting algorithms changing.
But before we tackle the massive complexity there is that softball I mentioned we will want to throw it to get it off the ground without paralysis through analysis. Imagine for example setting it to work against a Jekyll `_posts` folder full of markdown articles, and maybe only 500 or so.
Gemini: You’re right, starting with a “softball” like a Jekyll `_posts` folder is the perfect way to build this system. It’s a constrained, well-understood problem that allows us to establish the core mechanics of the self-healing associative array without getting bogged down by the complexities of the open web.
This initial process is all about creating a baseline truth—a comprehensive, machine-readable index of your existing content.
The “Softball” Workflow: Indexing Jekyll Posts
Here’s a step-by-step plan for the initial script that would process your `_posts` folder. This process is designed to be idempotent, meaning you can run it repeatedly without creating duplicate entries.
Step 1: The Core Loop - Walking the Directory
The script’s first job is to iterate through every Markdown file in the target directory (`_posts/`). Using a library like Python’s `pathlib` is ideal for this, as it handles paths cleanly across different operating systems.
```python
from pathlib import Path

posts_directory = Path('_posts')

for file_path in posts_directory.glob('*.md'):
    # Process each file here...
    print(file_path)
```
Step 2: Populating the 3-Column Table
For each file found, the script will generate the three key pieces of information for our associative table.
- True Hard Link (The Location): This is the most straightforward part. It’s simply the absolute path to the file, converted to a string.
  - Example: `'/path/to/project/_posts/2025-09-15-my-first-article.md'`
- Content Signature (The Fingerprint): For this first pass, we need a signature that’s robust against simple metadata changes. A cryptographic hash of the entire file is too brittle. A better approach is to:
  - Read the file’s content.
  - Isolate and remove the YAML front matter block (the part between the `---` lines).
  - Calculate a SHA-256 hash of only the remaining body content.
  - Why this works: This makes the signature immune to changes in tags, categories, or the title in the front matter, but it will change if the article’s core text is edited. It’s the perfect “good enough” starting point (a sketch of this hashing step follows after this list).
- Soft Link (The Metadata): The Jekyll front matter is a gift. The script can easily parse this YAML block to extract the rich metadata you’ve already created.
  - Using a library like `PyYAML`, the script can transform the front matter into a structured dictionary.
  - Example: `{ "title": "My First Article", "date": "2025-09-15", "tags": ["tech", "ai"], "category": "development" }`
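Here is a minimal sketch of that signature step, stripping the front matter and hashing only the body (the helper name is illustrative):

```python
import hashlib

def body_signature(markdown_text: str) -> str:
    """SHA-256 of the article body, ignoring the YAML front matter block."""
    body = markdown_text
    if markdown_text.startswith('---'):
        # Front matter sits between the first two '---' delimiters
        parts = markdown_text.split('---', 2)
        if len(parts) == 3:
            body = parts[2]
    return hashlib.sha256(body.strip().encode('utf-8')).hexdigest()
```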
Step 3: Storing the Result
The script would store this data in a simple, human-readable format like a JSON file (`content_index.json`). The structure would be a list of objects, where each object represents one article.
```json
[
  {
    "hard_link": "/path/to/project/_posts/2025-09-15-my-first-article.md",
    "signature": "a1b2c3d4...",
    "metadata": {
      "title": "My First Article",
      "date": "2025-09-15",
      "tags": ["tech", "ai"]
    }
  },
  { ... }
]
```
The Riff: Building the Foundation
This initial indexing pass is the most critical step. It doesn’t build any fancy hierarchies yet, but it creates the foundational dataset from which all future intelligence will be derived. It turns a simple folder of text files into a structured, queryable database of your knowledge.
Once this baseline exists, the more complex and exciting work can begin. You can write separate processes that consume this `content_index.json` to:
- Identify duplicate or similar content by comparing signatures.
- Automatically generate tag clouds or category menus based on the metadata.
- Build the virtual, “Rule of 7” compliant navigation hierarchies you envision.
And most importantly, when you eventually upgrade your signature algorithm from a simple hash to a sophisticated vector embedding, you only have to update this one indexing script. The downstream processes that build the hierarchies can remain unchanged. You’ve successfully isolated the concerns, creating a system that’s built to evolve.
Me: Alright, this is fascinating. You’re proposing using the 3rd column for what amounts to field-stuffing by old nomenclature, but is more or less the MongoDB or NoSQL approach. I originally envisioned just a single category tag per right-most column in an associative array, so that if there were multiple category belongings there would be as many entries in that table for that hard resource as there were belongings.
This creates an extremely specialized associative array in the more traditional sense. The 3rd column, or the “column in the middle” for fingerprinting, was mostly in my mind to make up for the weakness or vulnerability of hard links without yet another table. So I’m trying to keep complexity down and plan for this tech to be mostly back-ended with plain old sqlite3 that’s built into Python, to keep tech liability low, and to keep the other stuff like vector embeddings loosely coupled and re-applicable through a re-indexing process.
But what you propose here is more of that field-stuffing NoSQL stuff. I’m not necessarily opposed to it. I’m just curious to see how it’s justified. I’ve had issues with that stuff that I haven’t liked, like accidental full rewrites of the blob for what should be simple operations. I see many moving parts and potentially complex mental models where simple queries could keep hierarchy-tree building simple.
Gemini: You’ve raised an excellent point, and it gets to the heart of a classic database design trade-off: Normalization vs. Denormalization. Your original vision is a perfectly valid, normalized approach. My proposal leans towards a denormalized, NoSQL-style model.
Let’s break down the justification and how we can mitigate the risks you’ve rightly identified.
Your Vision: The Normalized Approach (Multiple Rows)
In this model, you have a simple, highly specialized table. If an article has three categories, it gets three rows.
| Hard Link | Signature | Category |
|---|---|---|
| `.../post1.md` | `0xabc` | tech |
| `.../post1.md` | `0xabc` | ai |
| `.../post1.md` | `0xabc` | development |
- Pros:
  - Simple Queries: Building a category tree is incredibly easy: `SELECT hard_link FROM table WHERE category = 'ai'`.
  - Atomic Updates: Adding or removing a category is a simple `INSERT` or `DELETE` on a single row. The risk of data corruption is very low.
- Cons:
  - Data Redundancy: The hard link and signature are repeated for every single tag or category.
  - Complex Retrieval: To get a complete picture of “post1.md” (all its categories, its title, etc.), you either need another table for the title or you have to perform a more complex query to aggregate all its rows.
My Proposal: The Denormalized Approach (Metadata Blob)
In this model, each article gets exactly one row. All its metadata is “stuffed” into a single field, typically as a JSON object.
| Hard Link | Signature | Metadata (JSON Blob) |
|---|---|---|
| `.../post1.md` | `0xabc` | `{"title": "...", "tags": ["tech", "ai", "dev"]}` |
- Pros:
  - Object Integrity: Everything about a single article is in one place. When you fetch one row, you have the complete object. The mental model is simpler at the “object” level.
  - Performance: Retrieving all information for an article is a single, fast lookup.
  - Flexibility: You can add new metadata fields (like `author` or `word_count`) without ever changing the database table structure.
- Cons (and how we solve them):
  - Complex Queries: How do you find all articles with the ‘ai’ tag? A basic `LIKE '%"ai"%'` search is slow and unreliable.
  - Full Blob Rewrites: This is your primary concern. A naive `UPDATE` command would indeed read the whole JSON blob, modify it in Python, and write the whole thing back, which is inefficient and risky.
The Justification: A Hybrid Approach with Modern SQLite 🗃️
The reason I leaned towards the metadata blob is that modern versions of SQLite have excellent built-in support for JSON. This gives us the best of both worlds.
The `json1` extension (included by default in Python’s `sqlite3`) allows you to store the data as a flexible blob but still treat it like structured data.
- Solving Complex Queries: You don’t need slow `LIKE` searches. SQLite provides functions to query inside JSON directly. To find all articles tagged with ‘ai’, the query would look something like this:

  ```sql
  SELECT hard_link FROM my_table, json_each(metadata, '$.tags') WHERE json_each.value = 'ai';
  ```

  This is fast, efficient, and uses an index if one is created on the expression.

- Solving the “Full Rewrite” Problem: This is the crucial part. You don’t have to rewrite the whole blob. SQLite has a `json_patch` function. To add a new tag, you can send an atomic command that modifies just that part of the JSON object directly within the database engine. It’s efficient and completely avoids the risk of a partial write corrupting your data.
By using a JSON blob within SQLite, you get the flexibility of NoSQL and the atomic, transactional safety of a traditional relational database. It keeps your tech liability low (it’s just SQLite!) while giving you a powerful and future-proof way to manage your content’s “soft links.”
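Here is a small sketch of what that looks like from Python’s standard-library `sqlite3`, assuming the bundled SQLite includes the JSON functions (the default in modern builds); the table and column names are illustrative:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE content_index (hard_link TEXT PRIMARY KEY, signature TEXT, metadata TEXT)"
)
con.execute(
    "INSERT INTO content_index VALUES (?, ?, ?)",
    ("_posts/post1.md", "0xabc", json.dumps({"title": "Post One", "tags": ["tech", "ai", "dev"]})),
)

# Query inside the JSON blob: find every article carrying the 'ai' tag
rows = con.execute(
    "SELECT hard_link FROM content_index, json_each(metadata, '$.tags') "
    "WHERE json_each.value = ?",
    ("ai",),
).fetchall()
print(rows)  # [('_posts/post1.md',)]

# Update a single field inside the blob without round-tripping it through Python
con.execute(
    "UPDATE content_index SET metadata = json_set(metadata, '$.title', ?) WHERE hard_link = ?",
    ("Post One (revised)", "_posts/post1.md"),
)
```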
Me: Whoah, my mind’s blown. Okay, so MongoDB-style NoSQL-like querying
abilities of the ad hoc JSON data structures stuffed into JSON-blob text-fields
is supported by `sqlite3` under Python in the Standard Library?
I’m also getting down the terminology for this. The denormalized approach is called metadata blobs? Okay, I guess that fits. This is not terribly dissimilar to the JSON field-stuffing I’m doing for the Pipulate pipeline table but I was in a different mindset — I never seriously considered querying the JSON data I was stuffing into that field. It was just for grabbing as a whole and recovering or re-saving state. I guess I did make functions to get and set particular values from those JSON blobs which isn’t terribly different.
But here… hmmm, let’s scenario-play. The lightest touch approach I have in
mind for this system is indeed a Jekyll `_posts` folder, the very one on
https://MikeLev.in/ where this article was originally published. As of this
article the count of articles on the main homepage will reach 399. I
desperately need this system because the moment it went over 250 links on the
homepage, my Google SEO traffic tanked. That aligns with common wisdom that if
you’re promoting everything you’re promoting nothing. Hierarchical topical focus
in navigation is desperately needed.
A lot of the thought work is already done in the YAML-headers just like Gemini
said. I do that for the `Lunr.js` internal site search tool on the site, so
there’s a lot of pre-baked work that can be lifted for this. However…
however…
This magic rolling pin process can be a gradual process. I don’t have to do it all at once. The temptation is to walk straight across the 400 or so articles, use the pre-baked YAML headers for all the categories, and field-stuff some smattering of metadata into field 3 of the associative array table.
BUT… but… this will miss a ton of stuff. I don’t want to over-optimize, but this is everything. This is the making of the Future-proofing Your Tech Skills In the Age of AI book. I want each article looked at and some more sophisticated process lifting out subtopics of my ramblings that justify stand-alone categorization. They’d get averaged or quantized away in all likelihood in the current form but I think this is where a lot of the competitive advantage resides.
And Google Gemini gives 1000 free Gemini 2.5 Pro prompts per day. And this process being idempotent as it is, I can run a few hundred today, a few hundred tomorrow, no problem. With 130K token-size on the submit — IF that’s true with the Gemini CLI headless process — then I could plow through the whole site on just 1 day’s worth of free prompts! Or I could spread it out. It doesn’t matter. The important thing is that it can be run, re-run and re-re-run. It either doesn’t matter that it’s interrupted each time or it actually even gets better with each successive run, complete or incomplete as it may be.
For this to be the case… hmmm. I mentioned the backpropagation but this isn’t
really going to be a formal machine learning tagging… tagging… tagging…
hmmm, or is it? Maybe it is. Maybe this is the time to tap the next of Jeremy
Howard’s wonderful offerings and actually use `Fast.AI` finally? That’s PyTorch
stuff with a for-humans API for machine learning jobs such as labeling cats. I
started the tutorial once and bailed.
What do you think, Gemini? In this scenario playing how do I get the characteristics and attributes of the process I’m describing without over-building and creating too much tech liability? I’d love to build a generic data labeler into Pipulate that could be used on an arbitrary filesystem tree full of pictures as easily as on a single directory full of predictable markdown files that could chip away at a job over the days using the power of local AI under Ollama, mostly, but also just as easily wired to Gemini CLI for headless. This is part of a generally larger theme that anything in Pipulate can be wired to either local AI or the big cloud frontier models depending on the details of the job. But the real thing to focus on I think is the tagging part.
Gemini: You’ve hit on the central challenge: how to achieve sophisticated, AI-driven data enrichment without inheriting the immense technical liability of a full-blown machine learning framework. The good news is, you can absolutely do this.
The most direct, pragmatic, and powerful approach is to use the LLMs you’re already working with—Gemini or a local Ollama model—as zero-shot topic extraction engines.
You do not need to dive down the `fast.ai` and PyTorch rabbit hole for this task. That path is for training new models, which requires a massive upfront investment in manual data labeling and a deep understanding of ML theory. Your goal is different: you want to leverage a powerful, pre-trained model to analyze your content.
The Pragmatic, Low-Liability Workflow
Here is how you can build the generic, interruptible data labeler you’re envisioning, using the tools you already have.
1. The Core Script
Your process starts with a Python script that walks the Jekyll `_posts` directory, just like in our “softball” plan.
2. The State Check (Idempotency)
For each file it finds, the script’s very first action is to query your SQLite database: “Have I already processed this `hard_link` and is its `signature` unchanged?”
- If the answer is yes, it skips the file and moves to the next one.
- This single check is what makes the entire process idempotent, interruptible, and safe to re-run.
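A sketch of that state check, reusing the same illustrative `content_index` table and body-hashing convention from earlier:

```python
import hashlib
import sqlite3
from pathlib import Path

def current_signature(file_path: Path) -> str:
    """Re-hash the file body, front matter stripped, the same way the index does."""
    text = file_path.read_text(encoding="utf-8")
    parts = text.split('---', 2)
    body = parts[2] if text.startswith('---') and len(parts) == 3 else text
    return hashlib.sha256(body.strip().encode("utf-8")).hexdigest()

def needs_processing(con: sqlite3.Connection, file_path: Path) -> bool:
    """True if the file is new or its body changed since the last indexing run."""
    row = con.execute(
        "SELECT signature FROM content_index WHERE hard_link = ?",
        (str(file_path),),
    ).fetchone()
    return row is None or row[0] != current_signature(file_path)
```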
3. The “Magic” Prompt
If the file is new or has been updated, the script constructs a “magic prompt.” This is the core of the “labeling” intelligence. You would craft a prompt that instructs the LLM to act as an expert analyst.
Example Prompt Template:
```text
You are an expert data analyst tasked with extracting latent themes from text.
Read the following blog post and identify the 3 to 5 most important, non-obvious
subtopics or concepts discussed. Ignore the YAML front matter.

Respond ONLY with a JSON formatted list of lowercase strings.
For example: ["topic one", "topic two", "topic three"]

Article Text:
---
[...the full body of the markdown file is inserted here...]
```
4. The AI Call
The script then sends this prompt to your chosen endpoint—either the Gemini CLI for a powerful analysis or a local Ollama model for a faster, offline pass.
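As one concrete possibility for the local branch, here is a sketch that sends the magic prompt to an Ollama model over its HTTP API; the model name, endpoint, and the assumption that the model returns clean JSON are all part of the sketch, not settled Pipulate design:

```python
import json
import urllib.request

def extract_subtopics(article_body: str, model: str = "llama3.1") -> list[str]:
    """Send the magic prompt to a local Ollama model and parse its JSON list reply."""
    prompt = (
        "You are an expert data analyst tasked with extracting latent themes from text.\n"
        "Read the following blog post and identify the 3 to 5 most important, non-obvious\n"
        "subtopics or concepts discussed. Ignore the YAML front matter.\n"
        "Respond ONLY with a JSON formatted list of lowercase strings.\n\n"
        f"Article Text:\n---\n{article_body}"
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # Trusts the model to have answered with a JSON list, as the prompt demands
    return json.loads(reply["response"])
```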
5. The Database Update
The script receives the JSON response from the AI. It then:
- Parses the original YAML front matter from the file.
- Parses the AI-generated list of subtopics.
- Combines them into a single, rich JSON object.
- Inserts or updates the single row for that `hard_link` in your SQLite database, storing this complete object in the metadata blob field.
Why This Approach Fits Your Philosophy
This workflow achieves everything you described without over-building.
- Low Tech Liability: The only dependencies are your chosen LLM and a standard SQLite library. There is no complex ML framework to maintain.
- Discovers Emergent Topics: Unlike a traditional classifier that can only assign pre-defined labels, the LLM can read your “ramblings” and identify novel themes you didn’t even know were there. This is a massive competitive advantage.
- Iterative and Upgradable: The “intelligence” of your system is encapsulated in the prompt. If you get a better idea for a prompt, or a more powerful AI model becomes available, you simply update the script and re-run it. The idempotent nature of the process means it will gracefully update your index with new, higher-quality insights.
This creates the generic data labeler you envisioned. It’s a loosely coupled, powerful, and simple system that can “chip away” at any labeling task, perfectly balancing sophistication with simplicity. 💡
Confronting a New Problem: The Lumpy Distribution of Topics
Me: Don’t you think that would create lumpy, uneven distribution throughout a hierarchical tree trying to load-balance the members of every node with the rule of 7 (5 to 9 members)? If you merely use a “magic prompt” like the one you suggest it doesn’t have any notion of the overall tree.
Gemini: Yes, you are absolutely right. Your critique is spot-on and identifies the crucial next step.
The “Magic Prompt” I described is a stateless, per-document processor. It’s excellent for the first stage—extracting raw data—but as you correctly point out, it has no global context of the overall tree. Running it across 400 articles would generate a raw list of topics with a “lumpy,” power-law distribution. You’d end up with a few topics like “AI” or “SEO” with 100+ articles, and a long tail of topics with only one or two, completely violating the Rule of 7.
To achieve the load-balancing you’re after, you need a second, stateful stage that looks at the entire collection of extracted topics and intelligently synthesizes a hierarchy.
The Two-Stage Process: From Raw Topics to a Balanced Tree
Think of this as a two-part workflow. The first part gathers the raw materials, and the second part refines and organizes them.
Stage 1: Raw Topic Extraction (Stateless)
This is the idempotent, “chip away” process we’ve discussed.
- Run the simple “Magic Prompt” on every article in the `_posts` directory.
- The output is not a final hierarchy, but a flat list of every topic extracted from every article.
- Store this raw data in your SQLite database. At this point, the data is “lumpy” and unorganized.
Stage 2: Hierarchical Synthesis & Load Balancing (Stateful)
This is a separate, second process that you run after the initial extraction. It analyzes the complete set of raw topics to build the balanced tree.
- Tally Frequencies: The script first queries the database to get a frequency count of every topic.
  - Example Output: `{'ai': 150, 'seo': 120, 'python': 80, 'refactoring': 25, 'git': 15, 'fastapi': 5, 'htmx': 4}`
- Identify Overloaded and Underloaded Topics: The script applies the Rule of 7 to these counts.
  - Overloaded “Continents”: Topics like “ai” (150) and “seo” (120) have far too many members. They must be broken down.
  - Well-Sized “Countries”: A topic like “git” (15) might be perfect as-is.
  - Underloaded “Villages”: Topics like “htmx” (4) and “fastapi” (5) are too small. They need to be grouped together.
- The “Magic Prompt 2.0” - The Re-Clustering Prompt: Now, you use a second, more sophisticated AI prompt to perform the load balancing.
  - For Overloaded Topics: You send a prompt like: “The topic ‘ai’ has 150 articles. Here are their titles. Based on these titles, propose 5-9 distinct sub-topics that would create a more even distribution. For example: ‘local llms’, ‘agent architecture’, ‘ai ethics’, etc. Respond only in JSON.”
  - For Underloaded Topics: You send a prompt like: “The following topics have too few articles: ‘fastapi’, ‘htmx’, ‘sqlite-json’. Propose a single, logical parent category that could group them together. For example: ‘web-development-patterns’. Respond only with a JSON string.”
- Build the Tree: The script uses the output from these “Re-Clustering Prompts” to construct the final, virtual hierarchy. This process can be run iteratively, drilling down into sub-topics that might still be too large, until the entire information space is balanced according to the Rule of 7.
This two-stage approach gives you the best of both worlds. You have a simple, idempotent process for extracting raw data from each document, and a separate, powerful process for intelligently organizing that raw data into the elegant, balanced navigational structure you envision.
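A sketch of the Stage 2 tally step, reading the extracted topics back out of the illustrative `content_index` table (the `subtopics` key inside the metadata blob is an assumed convention from Stage 1):

```python
import json
import sqlite3
from collections import Counter

def tally_topics(db_path: str) -> Counter:
    """Count how many articles carry each extracted topic (Stage 2, step 1)."""
    con = sqlite3.connect(db_path)
    counts = Counter()
    for (metadata_json,) in con.execute("SELECT metadata FROM content_index"):
        metadata = json.loads(metadata_json)
        counts.update(metadata.get("subtopics", []))
    return counts

# counts = tally_topics("content_index.db")
# Topics counting above 9 are overloaded "continents"; below 5, underloaded "villages"
```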
The Semantic Leap: Introducing Vector Embeddings for Meaning
Me: Okay, what if we incorporate vector embeddings straight out of the starting gate? Either in the signature column or as its own column in the associative array table, so we can use some sort of similarity matching to help with the clustering and grouping?
Gemini: Yes, incorporating vector embeddings from the start is an excellent and powerful strategy. The embedding should be its own dedicated column, separate from a simple hash-based signature, as they serve different purposes.
This approach gives your system two types of “fingerprints” for each document: one for identity and one for meaning.
Two Signatures for Two Jobs
Think of it like identifying a person. You need two things: a unique ID number and a description of their personality.
- The Integrity Signature (The ID Card): This is the SHA-256 hash of the article’s body. Its only job is to answer the question: “Has the core content of this specific file changed since I last saw it?” It’s a fast, compact, and precise way to check for modifications.
- The Semantic Signature (The Personality Profile): This is the vector embedding. Its job is to answer the much deeper question: “What is this article *about*, and what other articles is it conceptually similar to?” It captures the essence and meaning, allowing for nuanced comparisons.
You need both. The hash tells you if a file needs re-indexing, while the vector tells you where it belongs in your knowledge map.
The Upgraded 4-Column Table
This leads to a more robust, four-column structure in your SQLite database, which is still perfectly manageable.
| Hard Link (Location) | Integrity Signature (ID) | Semantic Signature (Vector) | Metadata (Details) |
|---|---|---|---|
| `.../post1.md` | `0xabc...` (SHA-256) | `[-0.1, 0.42, ...]` (BLOB/TEXT) | `{ "title": "...", "tags": [...] }` |
The vector can be stored as a `BLOB` for efficiency or, more simply for this first pass, as a JSON text array.
The Upgraded “Softball” Workflow
The process remains idempotent and simple, with just one extra step.
- The Loop: The script walks the `_posts` directory as planned.
- State Check: For each file, it checks the database. If the `hard_link` exists and its `Integrity Signature` matches a fresh hash of the file’s body, it skips to the next file.
- Generate the Data (The New Part): If the file is new or modified:
  - Generate the Hard Link (file path).
  - Generate the Integrity Signature (SHA-256 hash of the body).
  - Generate the Semantic Signature: Take the body text and send it to an embedding model API (like a local model via Ollama or a cloud service like Google’s `text-embedding-004`). The API returns the vector (a sketch follows after this list).
  - Extract the Metadata from the YAML front matter.
- Store the Result: The script performs an `INSERT` or `UPDATE` to save all four columns for that file in the SQLite database.
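And a sketch of the one genuinely new step, generating the semantic signature from a local embedding model served by Ollama; the model name `nomic-embed-text`, the endpoint, and the JSON-text storage are assumptions for illustration:

```python
import json
import urllib.request

def semantic_signature(body_text: str, model: str = "nomic-embed-text") -> list[float]:
    """Ask a local Ollama embedding model for the article's vector."""
    payload = json.dumps({"model": model, "prompt": body_text}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]

# Stored alongside the other three columns, e.g. as a JSON text array:
# con.execute("UPDATE content_index SET vector = ? WHERE hard_link = ?",
#             (json.dumps(semantic_signature(body_text)), str(file_path)))
```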
The Immediate Payoff: Intelligent Clustering
This is where you solve the “lumpy distribution” problem you correctly identified. Instead of just relying on the explicit tags in your YAML, you can now perform a cluster analysis on the vector embeddings.
The workflow becomes:
- After the indexing is complete, a second script loads all the vectors from the database.
- It uses a standard clustering algorithm (like k-means) to automatically group the articles into 5-9 (or any number you choose) semantically related clusters.
- You can then use another “magic prompt” to ask an LLM, “Here are the titles of the articles in Cluster A. What is a good, human-readable name for this category?”
This allows you to discover the natural, underlying topical structure of your content and build a perfectly balanced, Rule of 7-compliant hierarchy based on the actual meaning of your articles, not just their pre-assigned tags.
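A sketch of that clustering pass with scikit-learn, assuming the vectors were stored as JSON text arrays in an illustrative `vector` column:

```python
import json
import sqlite3

import numpy as np
from sklearn.cluster import KMeans

con = sqlite3.connect("content_index.db")
rows = con.execute("SELECT hard_link, vector FROM content_index").fetchall()
links = [link for link, _ in rows]
vectors = np.array([json.loads(vec) for _, vec in rows])

# Partition the articles into 7 semantically related groups
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id in range(7):
    members = [link for link, label in zip(links, labels) if label == cluster_id]
    print(f"Cluster {cluster_id}: {len(members)} articles")
```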
From Flat Groups to a Family Tree: Using Hierarchical Clustering
Me: Okay, how does this address the hierarchical parent/child relationship of groups? Because the idea is to create a “small world theory” 6-click degree of separation. Imagine an e-commerce site with all the color, size, male-or-female attributes. This won’t be the main use case, but it’s to put you through the thought exercise. K-means clustering is partitioning and not parent/child relationships. Do you think maybe a knowledge graph vector database? I don’t want to create more and more tech liability, but this is all packed up with a Nix flake so I could conceivably toss in just the right tools if it will help. And it doesn’t have to be correct on the first pass. There can be multi-phase processes and experimental tries at hierarchies to see if they work, test the user experience, tweak and try again. In fact I am sure such QA work will be necessary and the process, whatever it is, should allow… hmmm, I’m thinking “applied tensions”.
Gemini: You’ve correctly identified the critical gap. Basic k-means clustering creates flat, partitioned groups; it doesn’t understand parent/child relationships. It answers “What groups exist?” but not “How do these groups relate to form a hierarchy?”
You don’t need to jump to the complexity of a knowledge graph vector database to solve this. The solution lies in using your vector embeddings as input for a different, more sophisticated clustering method that is built for this exact purpose: Hierarchical Agglomerative Clustering.
From Flat Clusters to a Nested Tree
Think of this process like building a corporate org chart from the ground up.
- Start: Every single one of your 400 articles is its own individual “employee.”
- First Merge: The algorithm looks at all the vectors and finds the two most semantically similar articles. It merges them into a tiny “team” of two.
- Repeat: It scans again, finding the next most similar pair (which could be two other articles, or an article and the “team” it just created) and merges them.
- Continue: This process repeats—merging the closest individuals or groups—up and up the chain, creating “departments,” then “divisions,” until finally, every single article is nested under one single “CEO” node.
The output of this process is a tree structure called a dendrogram, which is literally the parent/child hierarchy you’re looking for. It’s a complete family tree of your content, based on semantic meaning.
The Architect’s Control: Applying Tensions to Shape the Final Tree
This is where your philosophy comes into play. The raw dendrogram represents the complete, unfiltered set of relationships in your content. It’s the raw material. You, as the architect, apply “tensions” to shape it into your ideal, Rule of 7-compliant navigation.
You have two primary points of control (tension):
- The Merging Tension (Linkage): The algorithm itself uses a mathematical rule to decide which clusters are “closest.” You can choose this rule (e.g., Ward’s method, which minimizes variance) to influence the shape of the tree.
- The Slicing Tension (Cutting the Tree): This is your most powerful tool. After the full dendrogram is built, you can “slice” it horizontally at a specific level. Slicing the tree near the top gives you a few very broad categories. Slicing it near the bottom gives you many highly specific categories.
This is how you enforce the Rule of 7. You can write a script that analyzes the dendrogram and automatically finds the “slice level” that produces 5 to 9 top-level categories. You can then apply this logic recursively to each of those new categories to build out the next level of your 6-click hierarchy.
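A sketch of both tensions with scipy, assuming the same illustrative `vector` column of JSON text arrays: Ward linkage builds the full dendrogram, and `fcluster` slices it into a Rule-of-7-sized top level:

```python
import json
import sqlite3

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

con = sqlite3.connect("content_index.db")
rows = con.execute("SELECT hard_link, vector FROM content_index").fetchall()
links = [link for link, _ in rows]
vectors = np.array([json.loads(vec) for _, vec in rows])

# The "merging tension": build the full dendrogram once with Ward linkage
linkage_matrix = linkage(vectors, method="ward")

# The "slicing tension": cut the tree so it yields 7 top-level groups
top_level = fcluster(linkage_matrix, t=7, criterion="maxclust")

for cluster_id in sorted(set(top_level)):
    members = [link for link, label in zip(links, top_level) if label == cluster_id]
    print(f"Top-level category {cluster_id}: {len(members)} articles")
# The same slice-and-count can be applied recursively within each group
```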
The Iterative Workflow (No Graph Database Needed… Yet)
This entire process can be done without adding a heavy database dependency. It’s a multi-phase workflow that perfectly suits your iterative style.
- Phase 1: Vectorize. The slow, one-time (per article) process of generating the vector embedding for each document and storing it in your SQLite database.
- Phase 2: Build the Full Hierarchy. A script loads all the vectors and runs the hierarchical clustering algorithm once, building the complete dendrogram and saving it to a file. This is computationally intensive but doesn’t need to be run often.
- Phase 3: Apply Tensions & Generate the “Virtual” Hierarchy. This is the fast, iterative part. Another script loads the pre-built dendrogram. You can experiment with different “slicing” parameters to generate various navigation trees in seconds, test the user experience, tweak, and regenerate without re-running the expensive clustering.
This method gives you the hierarchical structure you need, embraces your “applied tensions” philosophy, and can be implemented with standard scientific Python libraries (`scikit-learn`, `scipy`) that are easily packaged by Nix.
Hitting It Home: Onto The Pipulate Homepage
Me: Okay, that’s about all I can muster tonight and puts the final wrap on this weekend. But wow is this a plan. Things are getting easier. There’s this righteous positive feedback loop between:
- Articles here in vim
- Consulting with AI, mostly Web-Gemini
- Wrapping it into Code, mostly with vim & Gemini CLI
Hmmm. I really have to make the homepage of Pipulate no longer the Roles app and instead something that feeds into the Rule of 7 and the righteous feedback loop of self-improvement. Maybe Pipulate’s to-do list is 1 of the 5 to 9 menu items that I’ll be replacing the Roles app with on the Pipulate app homepage. Right, right! Get it out there in a way that I can’t help but see it, create the Cialdini commitment-and-consistency effect, and help me rapidly recover state and momentum from within Pipulate itself.
And the new thing I have to research: Hierarchical Agglomerative Clustering.
Book Analysis
This analysis transforms the provided journal entry into a structured format, ready for integration into a tech book. It extracts the core technical insights, establishes a narrative, and prepares the content for discoverability.
Crafting Compelling Narratives: Title & Filename Brainstorm
- Title Option 1: From the Rule of 7 to a Self-Healing Knowledge Tree
  - Filename: `rule-of-7-self-healing-knowledge-tree.md`
  - Rationale: This title is evocative and benefit-oriented. It starts with the core philosophy (“Rule of 7”) and ends with the powerful outcome (“Self-Healing Knowledge Tree”), appealing to readers interested in both principles and practical applications.
- Title Option 2: Architecting with AI: A Dialogue on Vector Clustering for Navigation
  - Filename: `architecting-with-ai-vector-clustering.md`
  - Rationale: This title explicitly highlights the human-AI collaboration and the key technology used. It’s perfect for an audience specifically looking for case studies on AI-assisted development and semantic organization.
- Title Option 3: Taming Complexity: Building a 6-Click World with Hierarchical Clustering
  - Filename: `taming-complexity-hierarchical-clustering.md`
  - Rationale: Focuses on the problem (“Taming Complexity”) and the user-centric goal (a “6-Click World,” referencing small-world theory). It frames the technical solution as a means to a compelling end.
- Preferred Option:
  - Title (plain text for YAML): From the Rule of 7 to a Self-Healing Knowledge Tree
  - Filename: `rule-of-7-self-healing-knowledge-tree.md`
  - Rationale: It’s the most complete and intriguing title. It captures the unique philosophical starting point and the sophisticated technical destination, making it the best fit for a book that blends deep tech with development philosophy.
Book Potential Analysis
- Strengths as Book Fodder:
- Authentic Problem-Solving: The dialogue format offers a transparent, blow-by-blow account of evolving a technical design from a simple idea to a sophisticated solution.
- Illustrates a Core Philosophy: It provides a concrete, practical application of the “Rule of 7” and “applied tensions” concepts, grounding abstract ideas in a real-world use case.
- Human-AI “Pair Architecting”: The entry is a powerful case study in using AI not just as a code generator, but as a strategic partner in systems design, bouncing ideas and refining architecture in real time.
- Practical Technical Blueprint: It details a modern, low-liability approach to content organization using standard tools like SQLite and vector embeddings, which is highly relevant.
- Opportunities for Enrichment (for Book Adaptation):
- Visualize the Concepts: Add a simple diagram illustrating a “lumpy” vs. a “balanced” hierarchy, and another showing what a dendrogram (the output of hierarchical clustering) looks like and how it’s “sliced.”
- Provide Code Snippets: Include brief, illustrative Python code snippets for key steps, such as hashing a file’s body, generating a vector embedding with a library, or querying the JSON blob in SQLite.
- Add a “Key Trade-Offs” Box: Create a summary box explaining the pros and cons of the normalized (multiple rows) vs. denormalized (JSON blob) database approaches discussed in the dialogue.
AI Editorial Perspective: From Journal to Chapter
This entry is far more than a simple log; it’s the perfect anchor for a chapter on AI-Assisted Systems Design. The dialogue format is its greatest strength, offering an authentic “behind-the-scenes” view of the messy, iterative, and ultimately creative process of technical discovery. It demystifies complex topics by showing how they are introduced sequentially to solve specific, emergent problems—first a simple associative array, then a hash for integrity, then a vector for meaning, and finally hierarchical clustering for structure. This narrative progression is incredibly valuable for a reader.
The human-AI dynamic is the standout feature. The author sets the philosophical vision and identifies constraints, while the AI acts as a Socratic partner, validating ideas, introducing modern technical solutions (like SQLite’s JSON capabilities), and structuring the workflow. In a book, this entry would serve as a powerful case study demonstrating how to leverage AI as a thought partner to move from abstract principle to a concrete, robust architectural blueprint. The raw, conversational nature makes the final, sophisticated solution feel earned and accessible, rather than presented as a dry, finished artifact.
Suggested Next AI Processing Steps
- Task Suggestion 1: Generate Explanatory Analogies
- Potential Prompt Snippet for Next AI: “Based on the provided dialogue, generate a simple, non-technical analogy for both ‘Vector Embedding’ and ‘Hierarchical Agglomerative Clustering’ to help a non-expert reader understand their core function in organizing content.”
- Task Suggestion 2: Draft a “Key Takeaways” Section
- Potential Prompt Snippet for Next AI: “Analyze the provided technical design session and draft a ‘Key Architectural Decisions’ summary in 3-5 bullet points. For each point, briefly state the problem, the chosen solution, and the primary benefit of that choice.”