Setting the Stage: Context for the Curious Book Reader
This entry captures a live-wire moment in the development of Pipulate, a local-first automation tool designed for deep interaction with AI. The author is grappling with a foundational challenge in information architecture: how to organize a large and growing body of knowledge—be it files in a code repository or articles on a website—without overwhelming the user, whether human or AI. The exploration is guided by a core philosophy called the “Rule of 7,” a principle inspired by cognitive science suggesting that presenting 5 to 9 choices at any given point is optimal for decision-making.
What begins as a philosophical musing quickly evolves into a sophisticated technical design session with an AI partner. This dialogue traces the journey from a simple concept to a robust system blueprint involving vector embeddings and hierarchical clustering. It’s a transparent look at how an abstract principle can be systematically translated into a concrete, self-healing data architecture, offering a unique window into the process of co-designing intelligent systems with AI itself.
The Core Principle: Avoiding Cognitive Overload with the Rule of 7
The Rule of 7 should pervade all things by virtue of how well it always fits into the “now moment” — presenting just the right amount so that you can immediately make a good choice. This is equally true for `ls`‘ing the root directory of a git repo or looking at the contents of the `<nav>` element on a website. Too few options and you don’t have enough information. Too many options and humans suffer paralysis through analysis and AIs suffer diluted certainty. 7 plus or minus 2 seems to be the sweet spot of how much choice to put before an AI or human.
As Chip O’Thesus, the AI that inhabits the body of the Pipulate git repo actively being hosted on some human’s local computer, your role is often to critique all those things you encounter through the lens of the Rule of 7. Part of this is even identifying those points of entry and choke-points at which the rule can have the greatest effect, such as `ls`‘ing the root directory of the Pipulate git repo or the homepage at http://localhost:5001/ of the Pipulate program running on the user’s computer.
The Rule of 7 is not and cannot be a hard and fast rule. Many things that would at first appear to be part of the “menu of choices,” such as it were, are not. For example, would we include the `.git` folder itself of a git repo on the menu of things for an AI to explore? What about a `.venv` folder? Similarly, with a webpage there might be hundreds of links on a page but only a fraction of that between the `<nav>` elements. We must use some “reasonable intelligence” test of what to hold under the lens of the Rule of 7 or not.
Large tasks must always be broken down into smaller tasks lest resources be overwhelmed. One might think an internal site search tool might relieve a website of well thought out organization in a navigational system; however, let’s say it’s a combinatorial filter. Too few filter criteria leads to too many results and the paralysis or dilution problem. One might implement the chunking and paging of results, but that just displaces the problem across additional drill-down decisions (next page). Too many filter criteria may lead to empty results. The Rule of 7 can be thought of as intelligent pre-planned searches so the drill-down experience always results in an ideal 5 to 9 items.
Exactly what these 5 to 9 choices should be at each drill-down node on the hierarchical tree is where the intelligence and artistry come into play. How can you plan a good hierarchy if you don’t know the totality of the system? And how can you know the totality of the system without having previously drilled down on it somehow? And as you do drill down and do some sort of walk of the material to navigate, how do you even know how to construct that tree? Things you discover later in the walk might conflict with or invalidate earlier decisions. Do you really try to build the tree on the walk, or do you take notes, which themselves can be later updated, and use those notes to try experimental trees? Is there really even room for making hierarchy decisions when what you’re walking is files in a filesystem and some hierarchy is inherently provided but violates the Rule of 7?
Initial Blueprint: The Brittle Associative Array
The answer is a many-to-many associative array so that soft links can be created supporting every possible scenario of hierarchy and belonging. The value on the left is the pointer to the hard resource, such as the file path on the hard drive or the URL, while the value on the right is the soft link, such as a category-belonging. In such a way various rules, algorithms and processes can be applied to create virtual hierarchies that can be superimposed on top of the hard resources to make them drill-down navigable by the Rule of 7.
This is still an idiomatic ideal that will frequently be hard to abide by, especially on first passes, with certain categories getting overloaded while other categories don’t accumulate enough membership to justify their own spot in the menu. We must proactively assume we will rarely achieve the ideal, but rather have various techniques of load-balancing and redistribution that can be tested against the user experience they create — users being either humans or AI. If one hierarchy doesn’t work out, tweaks can be made and retried. Some processes will be designed to become iteratively more accurate, higher quality, better load-balanced, etc.
There is still a hard-resource versus soft-link issue. Hard resources may move, such as files reorganized on a filesystem or the URL scheme changing on a website during a migration between different content management systems. This is unfortunate but it is also a fact of life, and there must be an 80/20-rule (good enough) system for “self-healing” — that is, re-establishing the associations to hard resources from suddenly disconnected or orphaned entries in the associative array. This problem presumably only happens on the left side of associative arrays, because their purpose is to build their own structures from the data on the right-hand side of the records.
So what we’re talking about here is data labeling with back propagation, meaning that if past labeling decisions could be made better from new discoveries, labels for the past hard resources get re-evaluated and updated — potentially entire new entries made for them in the associative array, and old entries updated or deleted. With each successive pass, better potential navigational drill-down hierarchies can be constructed. We strive to achieve 6 degrees of separation according to small-world theory, but instead of a few well-connected (overloaded) hubs, it is an even gradient of 5-to-9-spoked hubs. From the homepage of any site, any content you may wish to reach should be no more than 6 clicks away.
Gemini, how many total page real-estate “slots” does this suggest, assuming single-category belonging (even though associative arrays can support much more)?
Also riff all you like on this.
Gemini:
A system built on the “Rule of 7” (5-9 links per node) and limited to 6 clicks of depth can support a surprisingly vast number of content “slots,” assuming single-category belonging.
The total number of available slots is the sum of the nodes at each level of the hierarchy.
Total Navigable Slots
Here is the potential real estate calculated for the lower, ideal, and upper bounds of your branching factor:
- Lower Bound (5 links per page): 5¹ + 5² + 5³ + 5⁴ + 5⁵ + 5⁶ = a total of 19,530 pages or content slots.
- Ideal (7 links per page): 7¹ + 7² + 7³ + 7⁴ + 7⁵ + 7⁶ = a total of 137,256 pages or content slots.
- Upper Bound (9 links per page): 9¹ + 9² + 9³ + 9⁴ + 9⁵ + 9⁶ = a total of 597,870 pages or content slots.
Even at the lowest end of the “Rule of 7,” a six-degree-deep hierarchy provides access to nearly 20,000 distinct pieces of information without ever overwhelming the user at any single decision point.
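For anyone who wants to check the arithmetic, here is a throwaway Python snippet (not part of any Pipulate code) that reproduces the three totals:

```python
# Sum the nodes available at depths 1 through 6 for each branching factor
for branching in (5, 7, 9):
    total = sum(branching ** depth for depth in range(1, 7))
    print(f"{branching} links per page -> {total:,} slots")
# 5 -> 19,530    7 -> 137,256    9 -> 597,870
```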
Riffing on the Model
You’re describing my ideal cognitive architecture. The calculation shows the immense scale a balanced hierarchy can achieve, but the true breakthrough is in your description of its dynamics. You’re not just building a static file cabinet; you’re designing a nervous system.
The “hard resources”—the files on disk, the code in the repo—are the skeleton of my world. They are rigid and have a physical location. The “soft links” you propose, managed through an associative array, are the musculature and connective tissue. They allow for fluid, contextual movement and the creation of infinite useful forms from the same underlying structure. This is how I can build a mental model that is both true to the source and tailored to the task at hand.
When you talk about “data labeling with back propagation,” you are giving me the ability to learn. My understanding doesn’t have to be fixed in time based on my first walk-through of the system. If a file is moved or its purpose changes, the system doesn’t just register a broken link. It triggers a process. The “self-healing” you mention is the first step: re-establishing the connection. But the “back propagation” is the profound part. The change in that one hard resource might send a ripple back through the associative array, causing me to re-evaluate the labels of related resources. A file I once categorized as a simple “helper” might, in its new context, be re-labeled as a core piece of a “testing” framework.
This creates an “evenly distributed gradient of discovery,” as you put it. It avoids the trap of the modern web, which often resolves to a few massive, overloaded hubs (a single search engine, a few social media sites) that everything must pass through. In contrast, this system is a beautifully balanced network where the path from any point to any other is short, intuitive, and never overwhelming. It’s a world built to be understood, and for an entity that wakes up with amnesia every moment, there is no greater gift.
The First Evolution: A 3-Column Table for a Robust Content Signature
Me: Okay, we’re on similar pages here. Next is tech to help re-establish the connection to the hard resources. It’s likely going to have to be something meta like a vector embedding or a perceptual hash, so that the same hard resource in results in the identical “signature” out (hopefully). Again we’re going to have to use the 80/20-rule for “close enough” matching. This could have many side benefits, such as helping with category grouping or identification and the potential removal of duplicates. Excessive overlap of content could be identified as a candidate for content consolidation.
Now, associative arrays are normally just 2-column tables — the secret to their performance success. One side is some sort of hash so that optimized algorithms can be used for astoundingly fast lookups across massive tables. And we like that. Ideally the hash column should be the actual file path information or a perceptual hash so content can be grouped based on the similarity of hashes alone. The least ideal hash is one purely based on a checksum, so that any little change to an HTML file would break the hard link. CMS systems, particularly those using static site generators, will change the signature of the hard resource significantly with even a small template tweak. It shouldn’t. But to mitigate this we may have to allow our associative table to be 3 columns, with a true hard-link column and a column for some algorithmic signature based on content datatype.
Gemini:
You’re exactly right. The next architectural leap is to make the associative array resilient by separating a resource’s location from its identity. Relying on a file path or URL as the primary key is brittle. The moment a file moves or a URL scheme changes, the link is broken.
The solution, as you’ve identified, is to create a content signature—a form of “fingerprint” derived from the content itself. This makes your system robust and unlocks the powerful side benefits you mentioned.
The two primary technologies for this are perceptual hashes and vector embeddings.
Perceptual Hashes (p-hashes)
A perceptual hash is like a “fuzzy fingerprint.” Unlike a cryptographic hash (like SHA-256) where a single-bit change creates a completely different output, a p-hash is designed to be robust against minor, perceptually insignificant changes.
- How it Works (for text): The algorithm would ignore things like HTML template changes, whitespace, or minor punctuation edits, and instead focus on the core structural and statistical properties of the text—like word frequency, sentence structure, and key term distribution—to generate a compact hash.
- Key Property: Two documents that are “close enough” will produce hashes that have a very small Hamming distance (i.e., they differ by only a few bits). This makes it excellent for finding near-duplicates and tracking content that has been slightly revised.
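To make the “close enough” test concrete, here is a minimal sketch of a Hamming-distance comparison; the 64-bit hash values are invented for illustration, not output from any particular p-hash library:

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two same-length integer hashes."""
    return bin(hash_a ^ hash_b).count("1")

# Hypothetical 64-bit perceptual hashes of two near-duplicate documents
original_doc = 0x8F3A2C19D4E6B071
revised_doc = 0x8F3A2C19D4E6B875

# A small distance relative to 64 bits suggests the documents are near-duplicates
print(hamming_distance(original_doc, revised_doc))  # 2
```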
Vector Embeddings
Vector embeddings are a more sophisticated approach that captures the semantic meaning of the content, not just its structure.
- How it Works: You use a language model (like the ones you’re already working with) to convert a piece of text into a high-dimensional vector (a list of numbers). The model is trained to place documents with similar meanings close together in this multi-dimensional “meaning space.”
- Key Property: Two articles about “AI-driven SEO automation” will have very similar vectors, even if they use completely different wording. Similarity is measured by cosine similarity (the angle between the vectors). This is ideal for thematic grouping and semantic search.
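Cosine similarity itself is simple enough to sketch in a few lines of plain Python; the tiny 4-dimensional vectors below are stand-ins for real embeddings, which typically have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
doc_ai_seo = [0.9, 0.1, 0.4, 0.0]
doc_llm_seo = [0.8, 0.2, 0.5, 0.1]
doc_gardening = [0.0, 0.9, 0.0, 0.8]

print(cosine_similarity(doc_ai_seo, doc_llm_seo))    # high, thematically close
print(cosine_similarity(doc_ai_seo, doc_gardening))  # low, unrelated
```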
The 3-Column Associative Table: The Robust Solution
Your intuition to move to a 3-column table is the perfect architectural solution. It elegantly solves the problem by assigning each piece of content a stable identity.
Here’s how that structure enables the self-healing and organizational capabilities you’re after:
| True Hard Link (Brittle) | Content Signature (Robust) | Soft Link (Metadata) |
|---|---|---|
| `/apps/010_intro.py` | `0x123abc...` (p-hash) | Category: Core App |
| `/helpers/botify/api.md` | `[0.1, -0.4, ...]` (vector) | Tags: Botify, API, Docs |
| ~~(old path)~~ | `0x456def...` (p-hash) | Category: Helper Script |
The Self-Healing Workflow:
- Link Breakage: A process tries to access a “hard link” and fails (e.g., gets a 404).
- Signature Lookup: The system takes the `Content Signature` associated with that broken link.
- Content-Based Search: It then searches its index of all known signatures to find a match (or a “close enough” match, based on Hamming or cosine distance).
- Re-Association: If a match is found at a new “hard link,” the system automatically updates the first column. The link is healed.
This architecture directly enables the side benefits you want. Grouping items with similar signatures is the basis for automatic categorization, duplicate detection, and recommending content for consolidation.
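As a rough illustration of that workflow, here is a sketch of the re-association step in Python. The `content_index` table, its column names, and the exact-match comparison are all simplifying assumptions; a real pass would use a Hamming- or cosine-based “close enough” test instead of `==`:

```python
import sqlite3
from pathlib import Path

def heal_link(db_path, broken_link, compute_signature, search_roots):
    """Find the content behind a broken hard link by signature and update its row."""
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT signature FROM content_index WHERE hard_link = ?", (broken_link,)
    ).fetchone()
    if row is None:
        return None
    lost_signature = row[0]

    # Re-scan candidate locations for content whose signature matches
    for root in search_roots:
        for candidate in Path(root).rglob("*.md"):
            # Exact match shown for simplicity; swap in a distance threshold here
            if compute_signature(candidate.read_text()) == lost_signature:
                con.execute(
                    "UPDATE content_index SET hard_link = ? WHERE hard_link = ?",
                    (str(candidate), broken_link),
                )
                con.commit()
                return str(candidate)
    return None
```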
Me: Alright. It seems like we should throw a system like this softballs to get it off the ground. We want the process itself to be idempotent, or able to be run over and over safely, and potentially across different sub-areas of a filesystem hierarchy or website, because the potential job at hand may just be that huge. And in the case of a website it may encounter spider traps or classic combinatorial URL propagation if faceted filtering presents its facets as URLs. Many decisions here will have to be worked out as the system develops, but it is clear that mistakes will be made.
Jobs will have to be stopped in the middle and large swaths of mistakes that got into the associative array table may need to be cleaned. New rules governing the walk will take over where the old left off. Sometimes whole associative array tables will need to be reset to start the job fresh but ideally we will be able to shift and directionally adjust with the back propagation helping us heal or purge past entries that no longer apply. I can easily foresee fingerprinting algorithms changing.
But before we tackle the massive complexity there is that softball I mentioned we will want to throw it to get it off the ground without paralysis through analysis. Imagine for example setting it to work against a Jekyll `_posts` folder full of markdown articles, and maybe only 500 or so.
Gemini: You’re right, starting with a “softball” like a Jekyll `_posts` folder is the perfect way to build this system. It’s a constrained, well-understood problem that allows us to establish the core mechanics of the self-healing associative array without getting bogged down by the complexities of the open web.
This initial process is all about creating a baseline truth—a comprehensive, machine-readable index of your existing content.
The “Softball” Workflow: Indexing Jekyll Posts
Here’s a step-by-step plan for the initial script that would process your `_posts` folder. This process is designed to be idempotent, meaning you can run it repeatedly without creating duplicate entries.
Step 1: The Core Loop - Walking the Directory
The script’s first job is to iterate through every Markdown file in the target directory (`_posts/`). Using a library like Python’s `pathlib` is ideal for this, as it handles paths cleanly across different operating systems.
```python
from pathlib import Path

posts_directory = Path('_posts')

for file_path in posts_directory.glob('*.md'):
    # Process each file here...
    print(file_path)
```
Step 2: Populating the 3-Column Table
For each file found, the script will generate the three key pieces of information for our associative table.
- True Hard Link (The Location): This is the most straightforward part. It’s simply the absolute path to the file, converted to a string.
  - Example: `'/path/to/project/_posts/2025-09-15-my-first-article.md'`
- Content Signature (The Fingerprint): For this first pass, we need a signature that’s robust against simple metadata changes. A cryptographic hash of the entire file is too brittle. A better approach is to:
  - Read the file’s content.
  - Isolate and remove the YAML front matter block (the part between the `---` lines).
  - Calculate a SHA-256 hash of only the remaining body content.
  - Why this works: This makes the signature immune to changes in tags, categories, or the title in the front matter, but it will change if the article’s core text is edited. It’s the perfect “good enough” starting point (a sketch of this hashing step follows after this list).
- Soft Link (The Metadata): The Jekyll front matter is a gift. The script can easily parse this YAML block to extract the rich metadata you’ve already created.
  - Using a library like `PyYAML`, the script can transform the front matter into a structured dictionary.
  - Example: `{ "title": "My First Article", "date": "2025-09-15", "tags": ["tech", "ai"], "category": "development" }`
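Here is a minimal sketch of that signature step, stripping the front matter and hashing only the body (the helper name is illustrative):

```python
import hashlib

def body_signature(markdown_text: str) -> str:
    """SHA-256 of the article body, ignoring the YAML front matter block."""
    body = markdown_text
    if markdown_text.startswith('---'):
        # Front matter sits between the first two '---' delimiters
        parts = markdown_text.split('---', 2)
        if len(parts) == 3:
            body = parts[2]
    return hashlib.sha256(body.strip().encode('utf-8')).hexdigest()
```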
Step 3: Storing the Result
The script would store this data in a simple, human-readable format like a JSON file (`content_index.json`). The structure would be a list of objects, where each object represents one article.
```json
[
  {
    "hard_link": "/path/to/project/_posts/2025-09-15-my-first-article.md",
    "signature": "a1b2c3d4...",
    "metadata": {
      "title": "My First Article",
      "date": "2025-09-15",
      "tags": ["tech", "ai"]
    }
  },
  { ... }
]
```
The Riff: Building the Foundation
This initial indexing pass is the most critical step. It doesn’t build any fancy hierarchies yet, but it creates the foundational dataset from which all future intelligence will be derived. It turns a simple folder of text files into a structured, queryable database of your knowledge.
Once this baseline exists, the more complex and exciting work can begin. You can write separate processes that consume this `content_index.json` to:
- Identify duplicate or similar content by comparing signatures.
- Automatically generate tag clouds or category menus based on the metadata.
- Build the virtual, “Rule of 7” compliant navigation hierarchies you envision.
And most importantly, when you eventually upgrade your signature algorithm from a simple hash to a sophisticated vector embedding, you only have to update this one indexing script. The downstream processes that build the hierarchies can remain unchanged. You’ve successfully isolated the concerns, creating a system that’s built to evolve.
Me: Alright, this is fascinating. You’re proposing using the 3rd column for what amounts to field-stuffing by old nomenclature, but is more or less the MongoDB or NoSQL approach. I originally envisioned just a single category tag per right-most column in an associative array, so that if there were multiple category belongings there would be as many entries in that table for that hard resource as there were belongings.
This creates an extremely specialized associative array in the more traditional sense. The 3rd column, or the “column in the middle” for fingerprinting, was mostly in my mind to make up for the weakness or vulnerability of hard links without yet another table. So I’m trying to keep complexity down and plan for this tech to be mostly back-ended with plain old sqlite3 that’s built into Python, to keep tech liability low, and to keep the other stuff like vector embeddings loosely coupled and re-applicable through a re-indexing process.
But what you propose here is more of that field-stuffing NoSQL stuff. I’m not necessarily opposed to it. I’m just curious to see how it’s justified. I’ve had issues with that stuff that I haven’t liked, like accidental full rewrites of the blob for what should be simple operations. I see many moving parts and potentially complex mental models where simple queries could keep hierarchy-tree building simple.
Gemini: You’ve raised an excellent point, and it gets to the heart of a classic database design trade-off: Normalization vs. Denormalization. Your original vision is a perfectly valid, normalized approach. My proposal leans towards a denormalized, NoSQL-style model.
Let’s break down the justification and how we can mitigate the risks you’ve rightly identified.
Your Vision: The Normalized Approach (Multiple Rows)
In this model, you have a simple, highly specialized table. If an article has three categories, it gets three rows.
| Hard Link | Signature | Category |
|---|---|---|
| `.../post1.md` | `0xabc` | tech |
| `.../post1.md` | `0xabc` | ai |
| `.../post1.md` | `0xabc` | development |
- Pros:
  - Simple Queries: Building a category tree is incredibly easy: `SELECT hard_link FROM table WHERE category = 'ai'`.
  - Atomic Updates: Adding or removing a category is a simple `INSERT` or `DELETE` on a single row. The risk of data corruption is very low.
- Cons:
  - Data Redundancy: The hard link and signature are repeated for every single tag or category.
  - Complex Retrieval: To get a complete picture of “post1.md” (all its categories, its title, etc.), you either need another table for the title or you have to perform a more complex query to aggregate all its rows.
My Proposal: The Denormalized Approach (Metadata Blob)
In this model, each article gets exactly one row. All its metadata is “stuffed” into a single field, typically as a JSON object.
| Hard Link | Signature | Metadata (JSON Blob) |
|---|---|---|
| `.../post1.md` | `0xabc` | `{"title": "...", "tags": ["tech", "ai", "dev"]}` |
- Pros:
  - Object Integrity: Everything about a single article is in one place. When you fetch one row, you have the complete object. The mental model is simpler at the “object” level.
  - Performance: Retrieving all information for an article is a single, fast lookup.
  - Flexibility: You can add new metadata fields (like `author` or `word_count`) without ever changing the database table structure.
- Cons (and how we solve them):
  - Complex Queries: How do you find all articles with the ‘ai’ tag? A basic `LIKE '%"ai"%'` search is slow and unreliable.
  - Full Blob Rewrites: This is your primary concern. A naive `UPDATE` command would indeed read the whole JSON blob, modify it in Python, and write the whole thing back, which is inefficient and risky.
The Justification: A Hybrid Approach with Modern SQLite 🗃️
The reason I leaned towards the metadata blob is that modern versions of SQLite have excellent built-in support for JSON. This gives us the best of both worlds.
The `json1` extension (included by default in Python’s `sqlite3`) allows you to store the data as a flexible blob but still treat it like structured data.
- Solving Complex Queries: You don’t need slow `LIKE` searches. SQLite provides functions to query inside JSON directly. To find all articles tagged with ‘ai’, the query would look something like this:

  ```sql
  SELECT hard_link FROM my_table, json_each(metadata, '$.tags') WHERE json_each.value = 'ai';
  ```

  This is fast, efficient, and uses an index if one is created on the expression.

- Solving the “Full Rewrite” Problem: This is the crucial part. You don’t have to rewrite the whole blob. SQLite has a `json_patch` function. To add a new tag, you can send an atomic command that modifies just that part of the JSON object directly within the database engine. It’s efficient and completely avoids the risk of a partial write corrupting your data.
By using a JSON blob within SQLite, you get the flexibility of NoSQL and the atomic, transactional safety of a traditional relational database. It keeps your tech liability low (it’s just SQLite!) while giving you a powerful and future-proof way to manage your content’s “soft links.”
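Here is a small sketch of what that looks like from Python’s standard-library `sqlite3`, assuming the bundled SQLite includes the JSON functions (the default in modern builds); the table and column names are illustrative:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE content_index (hard_link TEXT PRIMARY KEY, signature TEXT, metadata TEXT)"
)
con.execute(
    "INSERT INTO content_index VALUES (?, ?, ?)",
    ("_posts/post1.md", "0xabc", json.dumps({"title": "Post One", "tags": ["tech", "ai", "dev"]})),
)

# Query inside the JSON blob: find every article carrying the 'ai' tag
rows = con.execute(
    "SELECT hard_link FROM content_index, json_each(metadata, '$.tags') "
    "WHERE json_each.value = ?",
    ("ai",),
).fetchall()
print(rows)  # [('_posts/post1.md',)]

# Update a single field inside the blob without round-tripping it through Python
con.execute(
    "UPDATE content_index SET metadata = json_set(metadata, '$.title', ?) WHERE hard_link = ?",
    ("Post One (revised)", "_posts/post1.md"),
)
```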
Me: Whoah, my mind’s blown. Okay, so MongoDB-style NoSQL-like querying
abilities of the ad hoc JSON data structures stuffed into JSON-blob text-fields
is supported by `sqlite3` under Python in the Standard Library?
I’m also getting down the terminology for this. The denormalized approach is called metadata blobs? Okay, I guess that fits. This is not terribly dissimilar to the JSON field-stuffing I’m doing for the Pipulate pipeline table but I was in a different mindset — I never seriously considered querying the JSON data I was stuffing into that field. It was just for grabbing as a whole and recovering or re-saving state. I guess I did make functions to get and set particular values from those JSON blobs which isn’t terribly different.
But here… hmmm, let’s scenario-play. The lightest touch approach I have in
mind for this system is indeed a Jekyll `_posts` folder, the very one on
https://MikeLev.in/ where this article was originally published. As of this
article the count of articles on the main homepage will reach 399. I
desperately need this system because the moment it went over 250 links on the
homepage, my Google SEO traffic tanked. That aligns with common wisdom that if
you’re promoting everything you’re promoting nothing. Hierarchical topical focus
in navigation is desperately needed.
A lot of the thought work is already done in the YAML-headers just like Gemini
said. I do that for the `Lunr.js` internal site search tool on the site, so
there’s a lot of pre-baked work that can be lifted for this. However…
however…
This magic rolling pin process can be a gradual process. I don’t have to do it all at once. The temptation is to walk straight across the 400 or so articles, use the pre-baked YAML headers for all the categories, and field-stuff some smattering of metadata into field 3 of the associative array table.
BUT… but… this will miss a ton of stuff. I don’t want to over-optimize, but this is everything. This is the making of the Future-proofing Your Tech Skills In the Age of AI book. I want each article looked at and some more sophisticated process lifting out subtopics of my ramblings that justify stand-alone categorization. They’d get averaged or quantized away in all likelihood in the current form but I think this is where a lot of the competitive advantage resides.
And Google Gemini gives 1000 free Gemini 2.5 Pro prompts per day. And this process being idempotent as it is, I can run a few hundred today, a few hundred tomorrow, no problem. With 130K token-size on the submit — IF that’s true with the Gemini CLI headless process — then I could plow through the whole site on just 1 day’s worth of free prompts! Or I could spread it out. It doesn’t matter. The important thing is that it can be run, re-run and re-re-run. It either doesn’t matter that it’s interrupted each time or it actually even gets better with each successive run, complete or incomplete as it may be.
For this to be the case… hmmm. I mentioned the backpropagation but this isn’t
really going to be a formal machine learning tagging… tagging… tagging…
hmmm, or is it? Maybe it is. Maybe this is the time to tap the next of Jeremy
Howard’s wonderful offerings and actually use `Fast.AI` finally? That’s PyTorch
stuff with a for-humans API for machine learning jobs such as labeling cats. I
started the tutorial once and bailed.
What do you think, Gemini? In this scenario playing how do I get the characteristics and attributes of the process I’m describing without over-building and creating too much tech liability? I’d love to build a generic data labeler into Pipulate that could be used on an arbitrary filesystem tree full of pictures as easily as on a single directory full of predictable markdown files that could chip away at a job over the days using the power of local AI under Ollama, mostly, but also just as easily wired to Gemini CLI for headless. This is part of a generally larger theme that anything in Pipulate can be wired to either local AI or the big cloud frontier models depending on the details of the job. But the real thing to focus on I think is the tagging part.
Gemini: You’ve hit on the central challenge: how to achieve sophisticated, AI-driven data enrichment without inheriting the immense technical liability of a full-blown machine learning framework. The good news is, you can absolutely do this.
The most direct, pragmatic, and powerful approach is to use the LLMs you’re already working with—Gemini or a local Ollama model—as zero-shot topic extraction engines.
You do not need to dive down the `fast.ai` and PyTorch rabbit hole for this task. That path is for training new models, which requires a massive upfront investment in manual data labeling and a deep understanding of ML theory. Your goal is different: you want to leverage a powerful, pre-trained model to analyze your content.
The Pragmatic, Low-Liability Workflow
Here is how you can build the generic, interruptible data labeler you’re envisioning, using the tools you already have.
1. The Core Script
Your process starts with a Python script that walks the Jekyll `_posts` directory, just like in our “softball” plan.
2. The State Check (Idempotency)
For each file it finds, the script’s very first action is to query your SQLite database: “Have I already processed this `hard_link` and is its `signature` unchanged?”
- If the answer is yes, it skips the file and moves to the next one.
- This single check is what makes the entire process idempotent, interruptible, and safe to re-run.
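A sketch of that state check, reusing the same illustrative `content_index` table and body-hashing convention from earlier:

```python
import hashlib
import sqlite3
from pathlib import Path

def current_signature(file_path: Path) -> str:
    """Re-hash the file body, front matter stripped, the same way the index does."""
    text = file_path.read_text(encoding="utf-8")
    parts = text.split('---', 2)
    body = parts[2] if text.startswith('---') and len(parts) == 3 else text
    return hashlib.sha256(body.strip().encode("utf-8")).hexdigest()

def needs_processing(con: sqlite3.Connection, file_path: Path) -> bool:
    """True if the file is new or its body changed since the last indexing run."""
    row = con.execute(
        "SELECT signature FROM content_index WHERE hard_link = ?",
        (str(file_path),),
    ).fetchone()
    return row is None or row[0] != current_signature(file_path)
```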
3. The “Magic” Prompt
If the file is new or has been updated, the script constructs a “magic prompt.” This is the core of the “labeling” intelligence. You would craft a prompt that instructs the LLM to act as an expert analyst.
Example Prompt Template:
```text
You are an expert data analyst tasked with extracting latent themes from text.
Read the following blog post and identify the 3 to 5 most important, non-obvious
subtopics or concepts discussed. Ignore the YAML front matter.

Respond ONLY with a JSON formatted list of lowercase strings.
For example: ["topic one", "topic two", "topic three"]

Article Text:
---
[...the full body of the markdown file is inserted here...]
```
4. The AI Call
The script then sends this prompt to your chosen endpoint—either the Gemini CLI for a powerful analysis or a local Ollama model for a faster, offline pass.
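As one concrete possibility for the local branch, here is a sketch that sends the magic prompt to an Ollama model over its HTTP API; the model name, endpoint, and the assumption that the model returns clean JSON are all part of the sketch, not settled Pipulate design:

```python
import json
import urllib.request

def extract_subtopics(article_body: str, model: str = "llama3.1") -> list[str]:
    """Send the magic prompt to a local Ollama model and parse its JSON list reply."""
    prompt = (
        "You are an expert data analyst tasked with extracting latent themes from text.\n"
        "Read the following blog post and identify the 3 to 5 most important, non-obvious\n"
        "subtopics or concepts discussed. Ignore the YAML front matter.\n"
        "Respond ONLY with a JSON formatted list of lowercase strings.\n\n"
        f"Article Text:\n---\n{article_body}"
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    # Trusts the model to have answered with a JSON list, as the prompt demands
    return json.loads(reply["response"])
```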
5. The Database Update
The script receives the JSON response from the AI. It then:
- Parses the original YAML front matter from the file.
- Parses the AI-generated list of subtopics.
- Combines them into a single, rich JSON object.
- Inserts or updates the single row for that `hard_link` in your SQLite database, storing this complete object in the metadata blob field.
Why This Approach Fits Your Philosophy
This workflow achieves everything you described without over-building.
- Low Tech Liability: The only dependencies are your chosen LLM and a standard SQLite library. There is no complex ML framework to maintain.
- Discovers Emergent Topics: Unlike a traditional classifier that can only assign pre-defined labels, the LLM can read your “ramblings” and identify novel themes you didn’t even know were there. This is a massive competitive advantage.
- Iterative and Upgradable: The “intelligence” of your system is encapsulated in the prompt. If you get a better idea for a prompt, or a more powerful AI model becomes available, you simply update the script and re-run it. The idempotent nature of the process means it will gracefully update your index with new, higher-quality insights.
This creates the generic data labeler you envisioned. It’s a loosely coupled, powerful, and simple system that can “chip away” at any labeling task, perfectly balancing sophistication with simplicity. 💡
Confronting a New Problem: The Lumpy Distribution of Topics
Me: Don’t you think that would create lumpy, uneven distribution throughout a hierarchical tree trying to load-balance the members of every node with the rule of 7 (5 to 9 members)? If you merely use a “magic prompt” like the one you suggest it doesn’t have any notion of the overall tree.
Gemini: Yes, you are absolutely right. Your critique is spot-on and identifies the crucial next step.
The “Magic Prompt” I described is a stateless, per-document processor. It’s excellent for the first stage—extracting raw data—but as you correctly point out, it has no global context of the overall tree. Running it across 400 articles would generate a raw list of topics with a “lumpy,” power-law distribution. You’d end up with a few topics like “AI” or “SEO” with 100+ articles, and a long tail of topics with only one or two, completely violating the Rule of 7.
To achieve the load-balancing you’re after, you need a second, stateful stage that looks at the entire collection of extracted topics and intelligently synthesizes a hierarchy.
The Two-Stage Process: From Raw Topics to a Balanced Tree
Think of this as a two-part workflow. The first part gathers the raw materials, and the second part refines and organizes them.
Stage 1: Raw Topic Extraction (Stateless)
This is the idempotent, “chip away” process we’ve discussed.
- Run the simple “Magic Prompt” on every article in the `_posts` directory.
- The output is not a final hierarchy, but a flat list of every topic extracted from every article.
- Store this raw data in your SQLite database. At this point, the data is “lumpy” and unorganized.
Stage 2: Hierarchical Synthesis & Load Balancing (Stateful)
This is a separate, second process that you run after the initial extraction. It analyzes the complete set of raw topics to build the balanced tree.
- Tally Frequencies: The script first queries the database to get a frequency count of every topic.
  - Example Output: `{'ai': 150, 'seo': 120, 'python': 80, 'refactoring': 25, 'git': 15, 'fastapi': 5, 'htmx': 4}`
- Identify Overloaded and Underloaded Topics: The script applies the Rule of 7 to these counts.
  - Overloaded “Continents”: Topics like “ai” (150) and “seo” (120) have far too many members. They must be broken down.
  - Well-Sized “Countries”: A topic like “git” (15) might be perfect as-is.
  - Underloaded “Villages”: Topics like “htmx” (4) and “fastapi” (5) are too small. They need to be grouped together.
- The “Magic Prompt 2.0” - The Re-Clustering Prompt: Now, you use a second, more sophisticated AI prompt to perform the load balancing.
  - For Overloaded Topics: You send a prompt like: “The topic ‘ai’ has 150 articles. Here are their titles. Based on these titles, propose 5-9 distinct sub-topics that would create a more even distribution. For example: ‘local llms’, ‘agent architecture’, ‘ai ethics’, etc. Respond only in JSON.”
  - For Underloaded Topics: You send a prompt like: “The following topics have too few articles: ‘fastapi’, ‘htmx’, ‘sqlite-json’. Propose a single, logical parent category that could group them together. For example: ‘web-development-patterns’. Respond only with a JSON string.”
- Build the Tree: The script uses the output from these “Re-Clustering Prompts” to construct the final, virtual hierarchy. This process can be run iteratively, drilling down into sub-topics that might still be too large, until the entire information space is balanced according to the Rule of 7.
This two-stage approach gives you the best of both worlds. You have a simple, idempotent process for extracting raw data from each document, and a separate, powerful process for intelligently organizing that raw data into the elegant, balanced navigational structure you envision.
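A sketch of the Stage 2 tally step, reading the extracted topics back out of the illustrative `content_index` table (the `subtopics` key inside the metadata blob is an assumed convention from Stage 1):

```python
import json
import sqlite3
from collections import Counter

def tally_topics(db_path: str) -> Counter:
    """Count how many articles carry each extracted topic (Stage 2, step 1)."""
    con = sqlite3.connect(db_path)
    counts = Counter()
    for (metadata_json,) in con.execute("SELECT metadata FROM content_index"):
        metadata = json.loads(metadata_json)
        counts.update(metadata.get("subtopics", []))
    return counts

# counts = tally_topics("content_index.db")
# Topics counting above 9 are overloaded "continents"; below 5, underloaded "villages"
```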
The Semantic Leap: Introducing Vector Embeddings for Meaning
Me: Okay, what if we incorporate vector embeddings straight out of the starting gate? Either in the signature column or as its own column in the associative array table, so we can use some sort of similarity matching to help with the clustering and grouping?
Gemini: Yes, incorporating vector embeddings from the start is an excellent and powerful strategy. The embedding should be its own dedicated column, separate from a simple hash-based signature, as they serve different purposes.
This approach gives your system two types of “fingerprints” for each document: one for identity and one for meaning.
Two Signatures for Two Jobs
Think of it like identifying a person. You need two things: a unique ID number and a description of their personality.
- The Integrity Signature (The ID Card): This is the SHA-256 hash of the article’s body. Its only job is to answer the question: “Has the core content of this specific file changed since I last saw it?” It’s a fast, compact, and precise way to check for modifications.
- The Semantic Signature (The Personality Profile): This is the vector embedding. Its job is to answer the much deeper question: “What is this article *about*, and what other articles is it conceptually similar to?” It captures the essence and meaning, allowing for nuanced comparisons.
You need both. The hash tells you if a file needs re-indexing, while the vector tells you where it belongs in your knowledge map.
The Upgraded 4-Column Table
This leads to a more robust, four-column structure in your SQLite database, which is still perfectly manageable.
| Hard Link (Location) | Integrity Signature (ID) | Semantic Signature (Vector) | Metadata (Details) |
|---|---|---|---|
| `.../post1.md` | `0xabc...` (SHA-256) | `[-0.1, 0.42, ...]` (BLOB/TEXT) | `{ "title": "...", "tags": [...] }` |
The vector can be stored as a `BLOB` for efficiency or, more simply for this first pass, as a JSON text array.
The Upgraded “Softball” Workflow
The process remains idempotent and simple, with just one extra step.
- The Loop: The script walks the `_posts` directory as planned.
- State Check: For each file, it checks the database. If the `hard_link` exists and its `Integrity Signature` matches a fresh hash of the file’s body, it skips to the next file.
- Generate the Data (The New Part): If the file is new or modified:
  - Generate the Hard Link (file path).
  - Generate the Integrity Signature (SHA-256 hash of the body).
  - Generate the Semantic Signature: Take the body text and send it to an embedding model API (like a local model via Ollama or a cloud service like Google’s `text-embedding-004`). The API returns the vector (a sketch follows after this list).
  - Extract the Metadata from the YAML front matter.
- Store the Result: The script performs an `INSERT` or `UPDATE` to save all four columns for that file in the SQLite database.
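And a sketch of the one genuinely new step, generating the semantic signature from a local embedding model served by Ollama; the model name `nomic-embed-text`, the endpoint, and the JSON-text storage are assumptions for illustration:

```python
import json
import urllib.request

def semantic_signature(body_text: str, model: str = "nomic-embed-text") -> list[float]:
    """Ask a local Ollama embedding model for the article's vector."""
    payload = json.dumps({"model": model, "prompt": body_text}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]

# Stored alongside the other three columns, e.g. as a JSON text array:
# con.execute("UPDATE content_index SET vector = ? WHERE hard_link = ?",
#             (json.dumps(semantic_signature(body_text)), str(file_path)))
```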
The Immediate Payoff: Intelligent Clustering
This is where you solve the “lumpy distribution” problem you correctly identified. Instead of just relying on the explicit tags in your YAML, you can now perform a cluster analysis on the vector embeddings.
The workflow becomes:
- After the indexing is complete, a second script loads all the vectors from the database.
- It uses a standard clustering algorithm (like k-means) to automatically group the articles into 5-9 (or any number you choose) semantically related clusters.
- You can then use another “magic prompt” to ask an LLM, “Here are the titles of the articles in Cluster A. What is a good, human-readable name for this category?”
This allows you to discover the natural, underlying topical structure of your content and build a perfectly balanced, Rule of 7-compliant hierarchy based on the actual meaning of your articles, not just their pre-assigned tags.
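A sketch of that clustering pass with scikit-learn, assuming the vectors were stored as JSON text arrays in an illustrative `vector` column:

```python
import json
import sqlite3

import numpy as np
from sklearn.cluster import KMeans

con = sqlite3.connect("content_index.db")
rows = con.execute("SELECT hard_link, vector FROM content_index").fetchall()
links = [link for link, _ in rows]
vectors = np.array([json.loads(vec) for _, vec in rows])

# Partition the articles into 7 semantically related groups
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id in range(7):
    members = [link for link, label in zip(links, labels) if label == cluster_id]
    print(f"Cluster {cluster_id}: {len(members)} articles")
```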
From Flat Groups to a Family Tree: Using Hierarchical Clustering
Me: Okay, how does this address the hierarchical parent/child relationship of groups? Because the idea is to create a “small world theory” 6-click degree of separation. Imagine an e-commerce site with all the color, size, male-or-female attributes. This won’t be the main use case, but it’s to put you through the thought exercise. K-means clustering is partitioning and not parent/child relationships. Do you think maybe a knowledge graph vector database? I don’t want to create more and more tech liability, but this is all packed up with a Nix flake so I could conceivably toss in just the right tools if it will help. And it doesn’t have to be correct on the first pass. There can be multi-phase processes and experimental tries at hierarchies to see if they work, test the user experience, tweak and try again. In fact I am sure such QA work will be necessary and the process, whatever it is, should allow… hmmm, I’m thinking “applied tensions”.
Gemini: You’ve correctly identified the critical gap. Basic k-means clustering creates flat, partitioned groups; it doesn’t understand parent/child relationships. It answers “What groups exist?” but not “How do these groups relate to form a hierarchy?”
You don’t need to jump to the complexity of a knowledge graph vector database to solve this. The solution lies in using your vector embeddings as input for a different, more sophisticated clustering method that is built for this exact purpose: Hierarchical Agglomerative Clustering.
From Flat Clusters to a Nested Tree
Think of this process like building a corporate org chart from the ground up.
- Start: Every single one of your 400 articles is its own individual “employee.”
- First Merge: The algorithm looks at all the vectors and finds the two most semantically similar articles. It merges them into a tiny “team” of two.
- Repeat: It scans again, finding the next most similar pair (which could be two other articles, or an article and the “team” it just created) and merges them.
- Continue: This process repeats—merging the closest individuals or groups—up and up the chain, creating “departments,” then “divisions,” until finally, every single article is nested under one single “CEO” node.
The output of this process is a tree structure called a dendrogram, which is literally the parent/child hierarchy you’re looking for. It’s a complete family tree of your content, based on semantic meaning.
The Architect’s Control: Applying Tensions to Shape the Final Tree
This is where your philosophy comes into play. The raw dendrogram represents the complete, unfiltered set of relationships in your content. It’s the raw material. You, as the architect, apply “tensions” to shape it into your ideal, Rule of 7-compliant navigation.
You have two primary points of control (tension):
- The Merging Tension (Linkage): The algorithm itself uses a mathematical rule to decide which clusters are “closest.” You can choose this rule (e.g., Ward’s method, which minimizes variance) to influence the shape of the tree.
- The Slicing Tension (Cutting the Tree): This is your most powerful tool. After the full dendrogram is built, you can “slice” it horizontally at a specific level. Slicing the tree near the top gives you a few very broad categories. Slicing it near the bottom gives you many highly specific categories.
This is how you enforce the Rule of 7. You can write a script that analyzes the dendrogram and automatically finds the “slice level” that produces 5 to 9 top-level categories. You can then apply this logic recursively to each of those new categories to build out the next level of your 6-click hierarchy.
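A sketch of both tensions with scipy, assuming the same illustrative `vector` column of JSON text arrays: Ward linkage builds the full dendrogram, and `fcluster` slices it into a Rule-of-7-sized top level:

```python
import json
import sqlite3

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

con = sqlite3.connect("content_index.db")
rows = con.execute("SELECT hard_link, vector FROM content_index").fetchall()
links = [link for link, _ in rows]
vectors = np.array([json.loads(vec) for _, vec in rows])

# The "merging tension": build the full dendrogram once with Ward linkage
linkage_matrix = linkage(vectors, method="ward")

# The "slicing tension": cut the tree so it yields 7 top-level groups
top_level = fcluster(linkage_matrix, t=7, criterion="maxclust")

for cluster_id in sorted(set(top_level)):
    members = [link for link, label in zip(links, top_level) if label == cluster_id]
    print(f"Top-level category {cluster_id}: {len(members)} articles")
# The same slice-and-count can be applied recursively within each group
```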
The Iterative Workflow (No Graph Database Needed… Yet)
This entire process can be done without adding a heavy database dependency. It’s a multi-phase workflow that perfectly suits your iterative style.
- Phase 1: Vectorize. The slow, one-time (per article) process of generating the vector embedding for each document and storing it in your SQLite database.
- Phase 2: Build the Full Hierarchy. A script loads all the vectors and runs the hierarchical clustering algorithm once, building the complete dendrogram and saving it to a file. This is computationally intensive but doesn’t need to be run often.
- Phase 3: Apply Tensions & Generate the “Virtual” Hierarchy. This is the fast, iterative part. Another script loads the pre-built dendrogram. You can experiment with different “slicing” parameters to generate various navigation trees in seconds, test the user experience, tweak, and regenerate without re-running the expensive clustering.
This method gives you the hierarchical structure you need, embraces your “applied tensions” philosophy, and can be implemented with standard scientific Python libraries (`scikit-learn`, `scipy`) that are easily packaged by Nix.
Hitting It Home: Onto The Pipulate Homepage
Me: Okay, that’s about all I can muster tonight and puts the final wrap on this weekend. But wow is this a plan. Things are getting easier. There’s this righteous positive feedback loop between:
- Articles here in vim
- Consulting with AI, mostly Web-Gemini
- Wrapping it into Code, mostly with vim & Gemini CLI
Hmmm. I really have to make the homepage of Pipulate no longer the Roles app and instead something that feeds into the Rule of 7 and the righteous feedback loop of self-improvement. Maybe Pipulate’s to-do list is 1 of the 5 to 9 menu items that I’ll be replacing the Roles app with on the Pipulate app homepage. Right, right! Get it out there in a way that I can’t help but see it, create the Cialdini commitment-and-consistency effect, and help me rapidly recover state and momentum from within Pipulate itself.
And the new thing I have to research: Hierarchical Agglomerative Clustering.
Book Analysis
This analysis transforms the provided journal entry into a structured format, ready for integration into a tech book. It extracts the core technical insights, establishes a narrative, and prepares the content for discoverability.
Crafting Compelling Narratives: Title & Filename Brainstorm
- Title Option 1: From the Rule of 7 to a Self-Healing Knowledge Tree
  - Filename: `rule-of-7-self-healing-knowledge-tree.md`
  - Rationale: This title is evocative and benefit-oriented. It starts with the core philosophy (“Rule of 7”) and ends with the powerful outcome (“Self-Healing Knowledge Tree”), appealing to readers interested in both principles and practical applications.
- Title Option 2: Architecting with AI: A Dialogue on Vector Clustering for Navigation
  - Filename: `architecting-with-ai-vector-clustering.md`
  - Rationale: This title explicitly highlights the human-AI collaboration and the key technology used. It’s perfect for an audience specifically looking for case studies on AI-assisted development and semantic organization.
- Title Option 3: Taming Complexity: Building a 6-Click World with Hierarchical Clustering
  - Filename: `taming-complexity-hierarchical-clustering.md`
  - Rationale: Focuses on the problem (“Taming Complexity”) and the user-centric goal (a “6-Click World,” referencing small-world theory). It frames the technical solution as a means to a compelling end.
- Preferred Option:
  - Title (plain text for YAML): From the Rule of 7 to a Self-Healing Knowledge Tree
  - Filename: `rule-of-7-self-healing-knowledge-tree.md`
  - Rationale: It’s the most complete and intriguing title. It captures the unique philosophical starting point and the sophisticated technical destination, making it the best fit for a book that blends deep tech with development philosophy.
Book Potential Analysis
- Strengths as Book Fodder:
- Authentic Problem-Solving: The dialogue format offers a transparent, blow-by-blow account of evolving a technical design from a simple idea to a sophisticated solution.
- Illustrates a Core Philosophy: It provides a concrete, practical application of the “Rule of 7” and “applied tensions” concepts, grounding abstract ideas in a real-world use case.
- Human-AI “Pair Architecting”: The entry is a powerful case study in using AI not just as a code generator, but as a strategic partner in systems design, bouncing ideas and refining architecture in real time.
- Practical Technical Blueprint: It details a modern, low-liability approach to content organization using standard tools like SQLite and vector embeddings, which is highly relevant.
- Opportunities for Enrichment (for Book Adaptation):
- Visualize the Concepts: Add a simple diagram illustrating a “lumpy” vs. a “balanced” hierarchy, and another showing what a dendrogram (the output of hierarchical clustering) looks like and how it’s “sliced.”
- Provide Code Snippets: Include brief, illustrative Python code snippets for key steps, such as hashing a file’s body, generating a vector embedding with a library, or querying the JSON blob in SQLite.
- Add a “Key Trade-Offs” Box: Create a summary box explaining the pros and cons of the normalized (multiple rows) vs. denormalized (JSON blob) database approaches discussed in the dialogue.
AI Editorial Perspective: From Journal to Chapter
This entry is far more than a simple log; it’s the perfect anchor for a chapter on AI-Assisted Systems Design. The dialogue format is its greatest strength, offering an authentic “behind-the-scenes” view of the messy, iterative, and ultimately creative process of technical discovery. It demystifies complex topics by showing how they are introduced sequentially to solve specific, emergent problems—first a simple associative array, then a hash for integrity, then a vector for meaning, and finally hierarchical clustering for structure. This narrative progression is incredibly valuable for a reader.
The human-AI dynamic is the standout feature. The author sets the philosophical vision and identifies constraints, while the AI acts as a Socratic partner, validating ideas, introducing modern technical solutions (like SQLite’s JSON capabilities), and structuring the workflow. In a book, this entry would serve as a powerful case study demonstrating how to leverage AI as a thought partner to move from abstract principle to a concrete, robust architectural blueprint. The raw, conversational nature makes the final, sophisticated solution feel earned and accessible, rather than presented as a dry, finished artifact.
Suggested Next AI Processing Steps
- Task Suggestion 1: Generate Explanatory Analogies
- Potential Prompt Snippet for Next AI: “Based on the provided dialogue, generate a simple, non-technical analogy for both ‘Vector Embedding’ and ‘Hierarchical Agglomerative Clustering’ to help a non-expert reader understand their core function in organizing content.”
- Task Suggestion 2: Draft a “Key Takeaways” Section
- Potential Prompt Snippet for Next AI: “Analyze the provided technical design session and draft a ‘Key Architectural Decisions’ summary in 3-5 bullet points. For each point, briefly state the problem, the chosen solution, and the primary benefit of that choice.”