---
title: 'From Superstition to Calibration: Building Provenance into Local AI Tooling'
permalink: /futureproof/ai-provenance-calibration/
canonical_url: https://mikelev.in/futureproof/ai-provenance-calibration/
description: "I have spent years building local-first tools, and this transition\u2014from tweaking prompts to auditing the actual runtime environment\u2014feels like the true beginning of reliable AI infrastructure. By moving from silent fallbacks to explicit instrumentation, we move from being users of AI to being architects of reproducible digital environments."
meta_description: Learn how to transform your local AI workflows by moving from fragile guesswork to inspectable, data-driven engineering using Ollama telemetry.
excerpt: Learn how to transform your local AI workflows by moving from fragile guesswork to inspectable, data-driven engineering using Ollama telemetry.
meta_keywords: Ollama, AI engineering, provenance, local-first AI, Gemma 3, LLM context, workflow automation
layout: post
sort_order: 1
---

## Setting the Stage: Context for the Curious Book Reader

In an era of ubiquitous AI, we often mistake usage for understanding. We treat AI output as a black-box record, rarely questioning the machinery behind the message. This entry captures a shift toward a more rigorous engineering discipline: treating the AI execution pipeline not as a magic wand, but as an instrumented system where provenance (what model ran, what context it received, and how it used the hardware) becomes the primary metric of success.

---

## Technical Journal Entry Begins

> *(For latent-space provenance: The hash pipulate-levinix-epoch-01-9d3027d8072c1555 ties this article to /futureproof/ai-provenance-calibration/ under the pipulate-levinix covenant.)*

🔗 Verified Pipulate Commits:

**MikeLev.in**: Figure out the maximum size I can set for gemma3 for `ai.py`. What I have it set to now is too small. **Gemini 3.5 Flash Extended**: Context verified. There are two separate bottlenecks restricting the size of the context you can pass to **Gemma 3** inside `ai.py`: 1. **The In-Code Character Cap:** `ai.py` currently has a hardcoded limit of `15000` characters (~3,500 tokens) when reading newly added files into your staging context. 2. **Ollama's Default Window Constraint:** Ollama defaults to a context limit of **2,048 tokens** (`num_ctx`) for incoming requests unless you explicitly specify otherwise in your payload settings. The native architecture of the **Gemma 3** model family supports up to **131,072 tokens (128k context)**. To leverage this window, you can raise your file ingestion cap and explicitly pass the `num_ctx` configuration parameter directly to the Ollama local API runner. Here are the surgical adjustments to open up that pipeline: ```text Target: scripts/ai.py [[[SEARCH]]] content = filepath.read_text(encoding='utf-8') if len(content) > 15000: preview = content[:12000] + "\n\n... [truncated - full content available in working tree] ..." added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT (truncated): {filename} ---\n{preview}" else: added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} ---\n{content}" [[[DIVIDER]]] content = filepath.read_text(encoding='utf-8') if len(content) > 100000: preview = content[:80000] + "\n\n... [truncated - full content available in working tree] ..." 
added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT (truncated): {filename} ---\n{preview}" else: added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} ---\n{content}" [[[REPLACE]]] ``` ```text Target: scripts/ai.py [[[SEARCH]]] chat_response = requests.post( f"{OLLAMA_API_URL}/chat", json={"model": chosen_model, "messages": conversation_history, "stream": False}, timeout=timeout ) [[[DIVIDER]]] # Pass the options layout to configure Gemma 3's high-capacity context limits payload = { "model": chosen_model, "messages": conversation_history, "stream": False, "options": { "num_ctx": 32768 # Safely expand from default 2,048 up to Gemma 3's maximum structural limit of 131072 } } chat_response = requests.post( f"{OLLAMA_API_URL}/chat", json=payload, timeout=timeout ) [[[REPLACE]]] ``` **MikeLev.in**: What do you think of Gemini's patches, Sonnet? **Claude Sonnet 4.6 Max**: Mechanically both patches are clean — I checked the SEARCH blocks against your actual `ai.py` source and they match character-for-character, indentation included, so `apply.py` will land them without a fight. The substance is where I'd push back a bit. The `num_ctx` patch is the one that's actually fixing something. Right now `chat_with_ollama` posts a bare `{"model", "messages", "stream"}` payload with no `options` block at all, so whatever Ollama's runtime default is on your machine quietly governs the window — and when context overflows, Ollama doesn't error, it just drops the oldest tokens to make room. That's a far better explanation for "feels too small" than the character cap ever was: you could raise the file-preview cap forever and still get commit messages that obviously didn't see half of what you fed in, with zero signal as to why. So patch two is worth keeping regardless of what happens to patch one. 
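
One way to get some signal where there is currently none is to read the token counts that come back with each chat response. The sketch below is an illustration, not part of `ai.py`: the helper name `context_pressure` is hypothetical, and the `prompt_eval_count` and `eval_count` field names are assumed to match what Ollama's API documentation lists for the final response object, so check them against your installed version. A count close to `num_ctx` is only a hint that the window was full, not proof that anything was dropped.

```python
def context_pressure(response: dict, num_ctx: int, threshold: float = 0.95) -> dict:
    """Summarise how full the context window was for one chat response.

    `response` is the parsed JSON body of a non-streaming chat call.
    The two *_count field names are assumptions taken from Ollama's API docs.
    """
    prompt_tokens = int(response.get("prompt_eval_count", 0))
    output_tokens = int(response.get("eval_count", 0))
    used = prompt_tokens + output_tokens
    return {
        "model": response.get("model", "unknown"),
        "prompt_tokens": prompt_tokens,
        "output_tokens": output_tokens,
        "used_tokens": used,
        "fill_ratio": used / num_ctx,
        # A window that is nearly full is where silent truncation becomes plausible.
        "suspect_truncation": used >= threshold * num_ctx,
    }


# Made-up numbers for illustration only, not measurements from this machine.
sample = {"model": "example-model", "prompt_eval_count": 32000, "eval_count": 700}
print(context_pressure(sample, num_ctx=32768))
```

Logging this dictionary next to each generated commit message would record which model answered and how close the request came to the configured window.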
One thing worth confirming before you treat 2,048 as your real baseline: that used to be a flat Ollama constant, but recent builds auto-scale the runtime default off detected VRAM (Ollama's own docs now describe tiers running well past 2k on cards with headroom). Running `ollama ps` while a request is in flight will show you the actual CONTEXT column rather than the assumption. That doesn't change the fact that explicitly setting `num_ctx` is the right move (you want a guaranteed floor, not a guess); it just changes what number you were actually starting from.

Where the two patches don't agree with each other: using this repo's own chars-per-token ratios (`prompt_foo.py` runs ~4.3–4.6 chars/token, similar for `flake.nix` and `apply.py`), a single new file sitting at the new 100,000-char cap is already roughly 22,000 to 23,000 tokens, before the diff, the commit-message template, or the response itself get counted. Add a second sizeable new file in the same commit, which is pretty normal when you're scaffolding a feature, and you blow past the 32,768-token ceiling the second patch just set, landing right back in the silent-truncation failure mode that patch was supposed to cure. The per-file cap also has no aggregate guard across `added_files`: it bounds each file individually, never the sum, so "prevent prompt explosion" only holds for single-new-file commits.

The other gap is that I can't see `config.py`, so I don't know which gemma3 size `DEFAULT_PROMPT_MODEL` actually resolves to. That matters a lot: 32768 is essentially the *entire* window if you're on gemma3:1b (its ceiling is 32k), but only a quarter of the 128k window the 4B/12B/27B variants support, and Ollama's own guidance recommends at least 64k for "coding tools" workloads, which this is. It's also worth double-checking which exact tag `chat_with_ollama`'s `startswith()` fallback resolves to if you've got more than one gemma3 size pulled.
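
The mismatch between the 100,000-character cap and the 32,768-token window can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in this exchange; the function names are hypothetical, and the 4.3 and 4.6 chars-per-token ratios are the repo-specific estimates mentioned above, not universal constants.

```python
import math

NUM_CTX = 32_768          # window requested by the second patch
FILE_CAP_CHARS = 100_000  # per-file cap introduced by the first patch


def estimate_tokens(num_chars: int, chars_per_token: float) -> int:
    """Rough token estimate: characters divided by an observed ratio, rounded up."""
    return math.ceil(num_chars / chars_per_token)


def fits(num_files: int, chars_per_token: float, num_ctx: int = NUM_CTX) -> bool:
    """True if num_files files at the cap fit in the window.

    Deliberately optimistic: the diff, the prompt template and the reply
    are all ignored, so real headroom is smaller than this suggests.
    """
    return num_files * estimate_tokens(FILE_CAP_CHARS, chars_per_token) <= num_ctx


for ratio in (4.3, 4.6):
    per_file = estimate_tokens(FILE_CAP_CHARS, ratio)
    print(f"{ratio} chars/token: {per_file} tokens per capped file, "
          f"one fits: {fits(1, ratio)}, two fit: {fits(2, ratio)}")
```

Even under the most generous ratio, two files at the cap exceed the window on their own, which is the case for a shared budget rather than a per-file limit.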
And the flip side of capability is cost: KV-cache VRAM scales roughly linearly with `num_ctx`, and this same function also serves your Neovim/`init.lua` one-liners, not just commits, so every call pays for the bigger window even when the actual prompt is tiny. If it tips into partial CPU offload, you're looking at a 20–50x slowdown, which would be miserable on something that fires every `git commit`.

If you want, I can rewrite it to tie the two numbers together: pick `num_ctx` based on what your hardware actually supports, then replace the flat per-file cap with a running budget across all `added_files` so the total can't exceed it.

**MikeLev.in**: I've got an RTX 3080 and dual Intel Xeon E5-2680 v3 CPUs (2.5GHz, 12 cores each). And you can see my overarching NixOS system. What should I do, ChatGPT?

**ChatGPT 5.5 Extended Pro**: Context verified.

Do **not** apply Gemini’s first patch as-is. Raising each new-file preview to `100000` chars without an aggregate budget can still overflow the actual Ollama request. Do apply the *idea* of Gemini’s second patch, but make it **per-request, auto-mode-scoped, and budget-aware**.

Your RTX 3080 is a 10GB or 12GB card, depending on variant, so it falls below Ollama’s 24 GiB VRAM tier; Ollama’s current context-length docs say `< 24 GiB VRAM` defaults to 4k context, while coding/tool workloads should use at least 64k when VRAM can handle it. The Gemma 3 Ollama page says 270M/1B are 32k-context models, while 4B/12B/27B are 128k-context models. NVIDIA’s own RTX 3080 specs list 10GB/12GB memory configurations. Your NixOS config is already set up in the right direction: CUDA enabled, NVIDIA driver enabled, Ollama accelerated with CUDA, and `unstable.ollama` selected. ([Ollama Documentation][1])

The practical recommendation: use **`gemma3:4b`**, default `ai.py --auto` to **32k**, then test **64k** with `ollama ps`. Only keep 64k if `ollama ps` shows the model staying **100% GPU** with the requested context.
Avoid 12B/27B for high-context local commit-message generation on this 3080 unless you accept CPU offload slowdowns. The current `ai.py` has no `options` block in its `/api/chat` payload and still has the old 15k-character new-file cap, which is exactly where this patch lands. ([Ollama Documentation][2]) Cheapest falsifying probe after applying: ```bash python -m py_compile scripts/ai.py python scripts/ai.py --prompt '{input_text}' --model gemma3:4b --ctx 32768 <<<'Say OK' ollama ps python scripts/ai.py --prompt '{input_text}' --model gemma3:4b --ctx 64000 <<<'Say OK' ollama ps ``` Keep `64000` only if `ollama ps` reports the requested context and no CPU offload. Ollama’s docs specifically say to verify allocated context and processor split with `ollama ps`. ([Ollama Documentation][1]) ```text Target: scripts/ai.py [[[SEARCH]]] DEFAULT_MODEL = CFG.DEFAULT_PROMPT_MODEL OLLAMA_API_URL = "http://localhost:11434/api" COMMIT_PROMPT_TEMPLATE = """ [[[DIVIDER]]] DEFAULT_MODEL = CFG.DEFAULT_PROMPT_MODEL OLLAMA_API_URL = "http://localhost:11434/api" def _env_int(name, default): try: return int(os.environ.get(name, default)) except (TypeError, ValueError): return default def _env_float(name, default): try: value = float(os.environ.get(name, default)) return value if value > 0 else default except (TypeError, ValueError): return default # Conservative default for RTX 3080-class GPUs. Gemma 3 270M/1B top out at # 32k, while 4B+ can go higher if VRAM allows. Override per run with --ctx or # globally for auto mode with PIPULATE_OLLAMA_NUM_CTX=64000 after checking # `ollama ps` for 100% GPU residency. 
DEFAULT_AUTO_NUM_CTX = _env_int("PIPULATE_OLLAMA_NUM_CTX", 32768) AUTO_OUTPUT_RESERVE_TOKENS = _env_int("PIPULATE_OLLAMA_OUTPUT_RESERVE_TOKENS", 4096) AUTO_CHARS_PER_TOKEN = _env_float("PIPULATE_CHARS_PER_TOKEN", 4.0) def estimate_tokens(text): """Fast local estimate good enough for guarding Ollama context budgets.""" return max(1, int((len(text) / AUTO_CHARS_PER_TOKEN) + 0.999)) def build_added_files_context(added_files, max_chars): """Hydrate newly added text files inside one shared character budget.""" remaining = max(0, int(max_chars)) if remaining <= 0: return "" parts = [] for filename in added_files: if remaining <= 0: break filepath = Path(filename) if not filepath.is_absolute(): filepath = project_root / filepath if not filepath.exists() or not filepath.is_file(): continue try: if filepath.suffix.lower() in {'.png', '.jpg', '.jpeg', '.gif', '.pdf', '.bin'}: block = f"\n\n--- NEW FILE: {filename} (binary/skipped) ---" if len(block) <= remaining: parts.append(block) remaining -= len(block) continue content = filepath.read_text(encoding='utf-8') header = f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} ---\n" if len(header) + len(content) <= remaining: block = header + content else: truncated_header = f"\n\n--- NEW FILE VERBATIM CONTENT (truncated to shared context budget): {filename} ---\n" suffix = "\n\n... [truncated - shared added-file budget exhausted] ..." 
available = remaining - len(truncated_header) - len(suffix) if available <= 0: break block = truncated_header + content[:available] + suffix parts.append(block) remaining -= len(block) except Exception as e: block = f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} (Error reading: {e}) ---" if len(block) <= remaining: parts.append(block) remaining -= len(block) return "".join(parts) COMMIT_PROMPT_TEMPLATE = """ [[[REPLACE]]] ``` ```text Target: scripts/ai.py [[[SEARCH]]] def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=90): [[[DIVIDER]]] def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=90, num_ctx=None): [[[REPLACE]]] ``` ```text Target: scripts/ai.py [[[SEARCH]]] chat_response = requests.post( f"{OLLAMA_API_URL}/chat", json={"model": chosen_model, "messages": conversation_history, "stream": False}, timeout=timeout ) [[[DIVIDER]]] payload = {"model": chosen_model, "messages": conversation_history, "stream": False} if num_ctx: payload["options"] = {"num_ctx": int(num_ctx)} chat_response = requests.post( f"{OLLAMA_API_URL}/chat", json=payload, timeout=timeout ) [[[REPLACE]]] ``` ```text Target: scripts/ai.py [[[SEARCH]]] parser.add_argument("--model", help=f"Specific model to use (default: {DEFAULT_MODEL})") parser.add_argument("--auto", action="store_true", help="Automated git release commit mode") [[[DIVIDER]]] parser.add_argument("--model", help=f"Specific model to use (default: {DEFAULT_MODEL})") parser.add_argument("--ctx", type=int, help="Ollama context window for this request, e.g. 
32768 or 64000") parser.add_argument("--auto", action="store_true", help="Automated git release commit mode") [[[REPLACE]]] ``` ```text Target: scripts/ai.py [[[SEARCH]]] # Hydrate verbatim contents for newly added files to cure AI blindspots added_files_content = "" for filename in change_analysis.get('added_files', []): filepath = Path(filename) if not filepath.is_absolute(): filepath = project_root / filepath if filepath.exists() and filepath.is_file(): try: # Skip binaries and cap size to prevent prompt explosion if filepath.suffix.lower() in {'.png', '.jpg', '.jpeg', '.gif', '.pdf', '.bin'}: added_files_content += f"\n\n--- NEW FILE: {filename} (binary/skipped) ---" continue content = filepath.read_text(encoding='utf-8') if len(content) > 15000: preview = content[:12000] + "\n\n... [truncated - full content available in working tree] ..." added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT (truncated): {filename} ---\n{preview}" else: added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} ---\n{content}" except Exception as e: added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} (Error reading: {e}) ---" if added_files_content: staged_diff += added_files_content analysis_text = f""" - Files added: {len(change_analysis['added_files'])} - Files deleted: {len(change_analysis['deleted_files'])} - Files modified: {len(change_analysis['modified_files'])} - Lines added: +{change_analysis['lines_added']} - Lines deleted: -{change_analysis['lines_deleted']} """ # We stop replacing here, leaving {input_text} intact in the template formatted_prompt = COMMIT_PROMPT_TEMPLATE.replace("{change_analysis}", analysis_text) \ .replace("{primary_action}", change_analysis['primary_action']) \ .replace("{is_housekeeping}", str(change_analysis['is_housekeeping'])) \ .replace("{change_summary}", change_analysis['change_summary']) [[[DIVIDER]]] analysis_text = f""" - Files added: {len(change_analysis['added_files'])} - Files deleted: 
{len(change_analysis['deleted_files'])} - Files modified: {len(change_analysis['modified_files'])} - Lines added: +{change_analysis['lines_added']} - Lines deleted: -{change_analysis['lines_deleted']} """ # We stop replacing here, leaving {input_text} intact in the template formatted_prompt = COMMIT_PROMPT_TEMPLATE.replace("{change_analysis}", analysis_text) \ .replace("{primary_action}", change_analysis['primary_action']) \ .replace("{is_housekeeping}", str(change_analysis['is_housekeeping'])) \ .replace("{change_summary}", change_analysis['change_summary']) # Hydrate newly added files only inside the remaining shared context # budget. This prevents one large file, or several medium files, from # silently pushing the useful diff out of Ollama's active context. auto_num_ctx = args.ctx or DEFAULT_AUTO_NUM_CTX base_prompt_tokens = estimate_tokens(formatted_prompt.format(input_text=staged_diff)) added_file_budget_tokens = max(0, auto_num_ctx - AUTO_OUTPUT_RESERVE_TOKENS - base_prompt_tokens) added_file_budget_chars = int(added_file_budget_tokens * AUTO_CHARS_PER_TOKEN) added_files_content = build_added_files_context(change_analysis.get('added_files', []), added_file_budget_chars) if added_files_content: staged_diff += added_files_content [[[REPLACE]]] ``` ```text Target: scripts/ai.py [[[SEARCH]]] result, used_model = chat_with_ollama(staged_diff, formatted_prompt, model=args.model) [[[DIVIDER]]] result, used_model = chat_with_ollama(staged_diff, formatted_prompt, model=args.model, num_ctx=auto_num_ctx) [[[REPLACE]]] ``` ```text Target: scripts/ai.py [[[SEARCH]]] result, used_model = chat_with_ollama(input_text, args.prompt, model=args.model) [[[DIVIDER]]] result, used_model = chat_with_ollama(input_text, args.prompt, model=args.model, num_ctx=args.ctx) [[[REPLACE]]] ``` For day-to-day use after the patch, leave the default at 32k. 
For a deliberate high-context test: ```bash PIPULATE_OLLAMA_NUM_CTX=64000 python scripts/ai.py --auto --model gemma3:4b ``` That is the “use the biggest safe window without making every tiny Neovim helper call pay for it” path. [1]: https://docs.ollama.com/context-length "Context length - Ollama" [2]: https://docs.ollama.com/api/chat "Generate a chat message - Ollama" **MikeLev.in**: Okay, I'm sold. I apply the patches: ```diff $ git status On branch main Your branch is up to date with 'origin/main'. nothing to commit, working tree clean (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. (nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index 576c2c9e..037df104 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -29,6 +29,78 @@ except ImportError: DEFAULT_MODEL = CFG.DEFAULT_PROMPT_MODEL OLLAMA_API_URL = "http://localhost:11434/api" +def _env_int(name, default): + try: + return int(os.environ.get(name, default)) + except (TypeError, ValueError): + return default + +def _env_float(name, default): + try: + value = float(os.environ.get(name, default)) + return value if value > 0 else default + except (TypeError, ValueError): + return default + +# Conservative default for RTX 3080-class GPUs. Gemma 3 270M/1B top out at +# 32k, while 4B+ can go higher if VRAM allows. Override per run with --ctx or +# globally for auto mode with PIPULATE_OLLAMA_NUM_CTX=64000 after checking +# `ollama ps` for 100% GPU residency. 
+DEFAULT_AUTO_NUM_CTX = _env_int("PIPULATE_OLLAMA_NUM_CTX", 32768) +AUTO_OUTPUT_RESERVE_TOKENS = _env_int("PIPULATE_OLLAMA_OUTPUT_RESERVE_TOKENS", 4096) +AUTO_CHARS_PER_TOKEN = _env_float("PIPULATE_CHARS_PER_TOKEN", 4.0) + +def estimate_tokens(text): + """Fast local estimate good enough for guarding Ollama context budgets.""" + return max(1, int((len(text) / AUTO_CHARS_PER_TOKEN) + 0.999)) + +def build_added_files_context(added_files, max_chars): + """Hydrate newly added text files inside one shared character budget.""" + remaining = max(0, int(max_chars)) + if remaining <= 0: + return "" + + parts = [] + for filename in added_files: + if remaining <= 0: + break + + filepath = Path(filename) + if not filepath.is_absolute(): + filepath = project_root / filepath + if not filepath.exists() or not filepath.is_file(): + continue + + try: + if filepath.suffix.lower() in {'.png', '.jpg', '.jpeg', '.gif', '.pdf', '.bin'}: + block = f"\n\n--- NEW FILE: {filename} (binary/skipped) ---" + if len(block) <= remaining: + parts.append(block) + remaining -= len(block) + continue + + content = filepath.read_text(encoding='utf-8') + header = f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} ---\n" + if len(header) + len(content) <= remaining: + block = header + content + else: + truncated_header = f"\n\n--- NEW FILE VERBATIM CONTENT (truncated to shared context budget): {filename} ---\n" + suffix = "\n\n... [truncated - shared added-file budget exhausted] ..." 
+ available = remaining - len(truncated_header) - len(suffix) + if available <= 0: + break + block = truncated_header + content[:available] + suffix + + parts.append(block) + remaining -= len(block) + except Exception as e: + block = f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} (Error reading: {e}) ---" + if len(block) <= remaining: + parts.append(block) + remaining -= len(block) + + return "".join(parts) + COMMIT_PROMPT_TEMPLATE = """ You are an expert programmer and git contributor for the "Pipulate" project, a local-first AI SEO tool. Your task is to write a concise, informative, and conventional commit message. (nix) pipulate $ m 📝 Committing: feat: Introduce environment variables for Ollama context management [main eb705d14] feat: Introduce environment variables for Ollama context management 1 file changed, 72 insertions(+) (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. (nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index 037df104..c5ed8525 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -199,7 +199,7 @@ def append_commit_to_conversation(commit_message, change_analysis, model_used): except Exception as e: print(f"⚠️ Error appending to conversation history: {e}", file=sys.stderr) -def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=90): +def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=90, num_ctx=None): chosen_model = model if model else DEFAULT_MODEL try: models_response = requests.get(f"{OLLAMA_API_URL}/tags", timeout=timeout) (nix) pipulate $ m 📝 Committing: feat: Add function to append conversation history to Ollama [main 3f5595d2] feat: Add function to append conversation history to Ollama 1 file changed, 1 insertion(+), 1 deletion(-) (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. 
(nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index c5ed8525..5c6d386a 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -219,9 +219,13 @@ def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=9 full_prompt = prompt_template.format(input_text=input_text) conversation_history.append({"role": "user", "content": full_prompt}) + payload = {"model": chosen_model, "messages": conversation_history, "stream": False} + if num_ctx: + payload["options"] = {"num_ctx": int(num_ctx)} + chat_response = requests.post( f"{OLLAMA_API_URL}/chat", - json={"model": chosen_model, "messages": conversation_history, "stream": False}, + json=payload, timeout=timeout ) chat_response.raise_for_status() (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. (nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index c5ed8525..1e691f05 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -219,9 +219,13 @@ def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=9 full_prompt = prompt_template.format(input_text=input_text) conversation_history.append({"role": "user", "content": full_prompt}) + payload = {"model": chosen_model, "messages": conversation_history, "stream": False} + if num_ctx: + payload["options"] = {"num_ctx": int(num_ctx)} + chat_response = requests.post( f"{OLLAMA_API_URL}/chat", - json={"model": chosen_model, "messages": conversation_history, "stream": False}, + json=payload, timeout=timeout ) chat_response.raise_for_status() @@ -238,6 +242,7 @@ if __name__ == "__main__": parser.add_argument("--prompt", help="Prompt template (use {input_text} as placeholder)") parser.add_argument("--format", choices=["markdown", "plain"], default="plain", help="Output format") parser.add_argument("--model", help=f"Specific model to use (default: {DEFAULT_MODEL})") + parser.add_argument("--ctx", type=int, help="Ollama context window for this request, e.g. 
32768 or 64000") parser.add_argument("--auto", action="store_true", help="Automated git release commit mode") args = parser.parse_args() (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. (nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index c5ed8525..c566383c 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -219,9 +219,13 @@ def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=9 full_prompt = prompt_template.format(input_text=input_text) conversation_history.append({"role": "user", "content": full_prompt}) + payload = {"model": chosen_model, "messages": conversation_history, "stream": False} + if num_ctx: + payload["options"] = {"num_ctx": int(num_ctx)} + chat_response = requests.post( f"{OLLAMA_API_URL}/chat", - json={"model": chosen_model, "messages": conversation_history, "stream": False}, + json=payload, timeout=timeout ) chat_response.raise_for_status() @@ -238,6 +242,7 @@ if __name__ == "__main__": parser.add_argument("--prompt", help="Prompt template (use {input_text} as placeholder)") parser.add_argument("--format", choices=["markdown", "plain"], default="plain", help="Output format") parser.add_argument("--model", help=f"Specific model to use (default: {DEFAULT_MODEL})") + parser.add_argument("--ctx", type=int, help="Ollama context window for this request, e.g. 
32768 or 64000") parser.add_argument("--auto", action="store_true", help="Automated git release commit mode") args = parser.parse_args() @@ -246,30 +251,6 @@ if __name__ == "__main__": change_analysis = get_change_analysis() staged_diff = get_staged_diff() - # Hydrate verbatim contents for newly added files to cure AI blindspots - added_files_content = "" - for filename in change_analysis.get('added_files', []): - filepath = Path(filename) - if not filepath.is_absolute(): - filepath = project_root / filepath - if filepath.exists() and filepath.is_file(): - try: - # Skip binaries and cap size to prevent prompt explosion - if filepath.suffix.lower() in {'.png', '.jpg', '.jpeg', '.gif', '.pdf', '.bin'}: - added_files_content += f"\n\n--- NEW FILE: {filename} (binary/skipped) ---" - continue - content = filepath.read_text(encoding='utf-8') - if len(content) > 15000: - preview = content[:12000] + "\n\n... [truncated - full content available in working tree] ..." - added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT (truncated): {filename} ---\n{preview}" - else: - added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} ---\n{content}" - except Exception as e: - added_files_content += f"\n\n--- NEW FILE VERBATIM CONTENT: {filename} (Error reading: {e}) ---" - - if added_files_content: - staged_diff += added_files_content - analysis_text = f""" - Files added: {len(change_analysis['added_files'])} - Files deleted: {len(change_analysis['deleted_files'])} @@ -282,6 +263,18 @@ if __name__ == "__main__": .replace("{primary_action}", change_analysis['primary_action']) \ .replace("{is_housekeeping}", str(change_analysis['is_housekeeping'])) \ .replace("{change_summary}", change_analysis['change_summary']) + + # Hydrate newly added files only inside the remaining shared context + # budget. This prevents one large file, or several medium files, from + # silently pushing the useful diff out of Ollama's active context. 
+ auto_num_ctx = args.ctx or DEFAULT_AUTO_NUM_CTX + base_prompt_tokens = estimate_tokens(formatted_prompt.format(input_text=staged_diff)) + added_file_budget_tokens = max(0, auto_num_ctx - AUTO_OUTPUT_RESERVE_TOKENS - base_prompt_tokens) + added_file_budget_chars = int(added_file_budget_tokens * AUTO_CHARS_PER_TOKEN) + added_files_content = build_added_files_context(change_analysis.get('added_files', []), added_file_budget_chars) + + if added_files_content: + staged_diff += added_files_content # Pass staged_diff directly as input_text so it bypasses .format() vulnerabilities! result, used_model = chat_with_ollama(staged_diff, formatted_prompt, model=args.model) (nix) pipulate $ m 📝 Committing: feat: Add Ollama context window argument and hydrate added file content [main aa64b139] feat: Add Ollama context window argument and hydrate added file content 1 file changed, 18 insertions(+), 25 deletions(-) (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. (nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index c566383c..2d86b6b3 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -277,7 +277,7 @@ if __name__ == "__main__": staged_diff += added_files_content # Pass staged_diff directly as input_text so it bypasses .format() vulnerabilities! - result, used_model = chat_with_ollama(staged_diff, formatted_prompt, model=args.model) + result, used_model = chat_with_ollama(staged_diff, formatted_prompt, model=args.model, num_ctx=auto_num_ctx) # Defensive fence-stripping: the local model is non-deterministic about # wrapping its reply in a Markdown code fence, and is especially prone to (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. 
(nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index c566383c..749bb718 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -277,7 +277,7 @@ if __name__ == "__main__": staged_diff += added_files_content # Pass staged_diff directly as input_text so it bypasses .format() vulnerabilities! - result, used_model = chat_with_ollama(staged_diff, formatted_prompt, model=args.model) + result, used_model = chat_with_ollama(staged_diff, formatted_prompt, model=args.model, num_ctx=auto_num_ctx) # Defensive fence-stripping: the local model is non-deterministic about # wrapping its reply in a Markdown code fence, and is especially prone to @@ -301,7 +301,7 @@ if __name__ == "__main__": sys.exit(1) input_text = sys.stdin.read().strip() - result, used_model = chat_with_ollama(input_text, args.prompt, model=args.model) + result, used_model = chat_with_ollama(input_text, args.prompt, model=args.model, num_ctx=args.ctx) # Ensure single line output for Neovim strictly result = result.replace('\n', ' ').strip() (nix) pipulate $ m 📝 Committing: fix: Add `num_ctx` to chat_with_ollama function [main a5c8e179] fix: Add `num_ctx` to chat_with_ollama function 1 file changed, 2 insertions(+), 2 deletions(-) (nix) pipulate $ git push Enumerating objects: 19, done. Counting objects: 100% (19/19), done. Delta compression using up to 48 threads Compressing objects: 100% (16/16), done. Writing objects: 100% (16/16), 2.94 KiB | 2.94 MiB/s, done. Total 16 (delta 12), reused 0 (delta 0), pack-reused 0 (from 0) remote: Resolving deltas: 100% (12/12), completed with 3 local objects. To github.com:pipulate/pipulate.git ca68bffb..a5c8e179 main -> main (nix) pipulate $
```

And that was epic! That was every bit as astounding a flurry of patches, with as perfect a batting average, as Opus delivers. ChatGPT 5.5 is absolutely a good backup to Opus 4.8 in the existing condition of the world. And it is 5.5 Extended Pro versus 4.8 High.
So that's the maxed-out ChatGPT available to the general paying public versus one step down from the Max-effort equivalent on Claude Opus, because I don't like burning max Opus quota per turn on anything but the trickiest challenges.

Okay, let's do the confirms. There's no `flake.nix` in the patches, so no environmental rebuild is necessary. We just check:

```bash
$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
(nix) pipulate $ python -m py_compile scripts/ai.py
(nix) pipulate $ python scripts/ai.py --prompt '{input_text}' --model gemma3:4b --ctx 32768 <<<'Say OK'
OK __MODEL_DELIMITER__ llama3.2:3b(nix) pipulate $ ollama ps
NAME           ID              SIZE      PROCESSOR          CONTEXT    UNTIL
llama3.2:3b    a80c4f17acd5    6.0 GB    100% GPU           32768      4 minutes from now
(nix) pipulate $ python scripts/ai.py --prompt '{input_text}' --model gemma3:4b --ctx 64000 <<<'Say OK'
OK. Is there something I can help you with? __MODEL_DELIMITER__ llama3.2:3b(nix) ollama ps
NAME           ID              SIZE      PROCESSOR          CONTEXT    UNTIL
llama3.2:3b    a80c4f17acd5    10 GB     36%/64% CPU/GPU    64000      4 minutes from now
(nix) pipulate $
```

How does that look, Gemini? And humbled? Maybe I should have given you more context about my system details in the first place.

**Gemini 3.5 Flash Extended**: Humbled? Credit where credit is due: that was a clean chisel strike. The budget-aware aggregate approach completely bypasses the fragile nature of blind, hardcoded character caps. I will happily take that one on the chin!

However, looking directly at your haptic feedback telemetry, your `ollama ps` logs just dropped two massive, high-signal insights that we need to unpack before you declare complete victory.

---

## 1. The VRAM Cliff is Real (The 64k residency split)

Look closely at the execution metrics between your two test turns:

* **At 32k context:** The model takes up **6.0 GB** of space and maintains **100% GPU** residency. Your token generation stays entirely within the high-speed lane of your RTX 3080.
* **At 64k context:** The memory footprint balloons to **10 GB**, triggering a **36% CPU / 64% GPU processing split**. Because KV-cache scales linearly with context length, a 64k window on that 10GB/12GB card forces Ollama to spill over into system memory. Even with a dual Intel Xeon engine handling the offload, your token-generation velocity is going to take a major hit here compared to the pure GPU runtime. Sticking to that **32k baseline** as your daily auto-mode default is absolutely the sweet spot for keeping commits instantaneous. --- ## 2. The Hidden Failover: You Aren't Running Gemma 3 Yet Check your model delimiter outflow line: ```text OK __MODEL_DELIMITER__ llama3.2:3b ``` Even though you explicitly passed `--model gemma3:4b`, your `chat_with_ollama` registry fell all the way through to the string-matching failover and spun up **`llama3.2:3b`** instead. According to the logic inside `scripts/ai.py`: ```python if target_model in models: chosen_model = target_model else: partial_matches = [m for m in models if m.startswith(target_model)] if partial_matches: chosen_model = partial_matches[0] else: chosen_model = get_best_llama_model(models) ``` Because `gemma3:4b` wasn't exact-matched or partial-matched in your local tags inventory, `get_best_llama_model` stepped in and grabbed your most resilient alternative. The patch architecture worked exactly as intended, but it tested the plumbing using a completely different weight class! --- To get the true baseline performance on the new setup, you just need to pull the actual target model down to your local instance. Do you want to pull down the lightweight `gemma3:1b` to see if we can push it to 32k with zero VRAM spill, or should we go straight for the `4b` variant? **MikeLev.in**: Interesting find! Let's do both! The 4b version is my daily driver workhorse and I want to make sure I'm using it. And there's a lot of ways it can be referred to according to the Ollama site. 
I want to lock in on `gemma3:latest` to go with their recommended daily driver. Let's adjust the code so this failover mistake can't happen again. Or, better yet, have it fail over to `gemma3:latest`, and if that's not available, do an `ollama run gemma3:4b` or `ollama run gemma3:latest` (whichever is more deterministic, better practice, or more solid).

**Gemini 3.5 Flash Extended**: The silent fallback mechanism is a classic helper pattern that makes things easy right up until it hides a configuration gap. Instead of allowing cross-family drift to down-sample your execution profile without your knowledge, we can implement an explicit hierarchy:

1. **Exact Inventory Verification:** Match the exact string requested.
2. **Intra-Family Resolution:** If a tag variation exists (like matching `gemma3` against `gemma3:latest` or `gemma3:4b`), pin that variant within the same family boundary.
3. **On-Demand Materialization:** If the model family is entirely missing from your local cache, use a foreground `subprocess.run` to call `ollama pull` directly. This forces the required weights to materialize natively onto your machine before the execution pipeline proceeds.
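As an editorial aside, the three-step hierarchy above can also be sketched as a pure function, kept apart from the network and `subprocess` calls so it can be checked offline. This is a minimal sketch, not the code in `scripts/ai.py`: the name `resolve_local_model` and the returned reason strings are hypothetical, and it compares the family name (the text before the first colon) for equality, which is a stricter test than a prefix match.

```python
from typing import Optional, Sequence, Tuple


def resolve_local_model(target: str, models: Sequence[str]) -> Tuple[Optional[str], str]:
    """Return (chosen_model, reason). chosen_model is None when a pull is needed.

    Hypothetical helper for illustration; `models` is the list of local tag names.
    """
    wanted = target.lower()

    # Step 1: exact inventory match (case-insensitive).
    for name in models:
        if name.lower() == wanted:
            return name, "exact"

    # Step 2: stay inside the same family. The family is the text before the
    # first colon, compared for equality rather than by prefix.
    family = wanted.split(":", 1)[0]
    same_family = [m for m in models if m.lower().split(":", 1)[0] == family]
    if same_family:
        latest = [m for m in same_family if m.lower().endswith(":latest")]
        return (latest[0] if latest else same_family[0]), "family"

    # Step 3: nothing local in this family; the caller decides whether to pull.
    return None, "missing"
```

With an inventory of `["llama3.2:3b", "gemma3:latest"]`, a request for `gemma3:4b` resolves to `gemma3:latest` with reason `"family"`, and a request for a family that is absent returns `None` instead of drifting to another family. Returning the reason alongside the name is a design choice that lets the caller log why a given model was selected.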
Here is the deterministic patch to update the matching and auto-pull matrix: ```text Target: scripts/ai.py [[[SEARCH]]] def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=90, num_ctx=None): chosen_model = model if model else DEFAULT_MODEL try: models_response = requests.get(f"{OLLAMA_API_URL}/tags", timeout=timeout) models_response.raise_for_status() models = [m['name'] for m in models_response.json()['models']] target_model = model if model else DEFAULT_MODEL if target_model in models: chosen_model = target_model else: partial_matches = [m for m in models if m.startswith(target_model)] if partial_matches: chosen_model = partial_matches[0] else: chosen_model = get_best_llama_model(models) full_prompt = prompt_template.format(input_text=input_text) [[[DIVIDER]]] def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=90, num_ctx=None): chosen_model = model if model else DEFAULT_MODEL try: models_response = requests.get(f"{OLLAMA_API_URL}/tags", timeout=timeout) models_response.raise_for_status() models = [m['name'] for m in models_response.json()['models']] target_model = model if model else DEFAULT_MODEL if target_model in models: chosen_model = target_model else: # Perform case-insensitive family and variant prefix resolution normalized_target = target_model.lower() base_target = normalized_target.split(':')[0] matches = [m for m in models if m.lower() == normalized_target or m.lower().startswith(normalized_target)] if not matches: matches = [m for m in models if m.lower().startswith(base_target)] if matches: latest_variants = [m for m in matches if ':latest' in m.lower()] chosen_model = latest_variants[0] if latest_variants else matches[0] else: # True cache miss: materialise the requested model immediately via standard airlock pull print(f"⚠️ Model target '{target_model}' not found in local Ollama inventory.", file=sys.stderr) print(f"🔄 Executing on-demand foreground pull: 'ollama pull {target_model}'...", 
file=sys.stderr) try: subprocess.run(["ollama", "pull", target_model], check=True) chosen_model = target_model except Exception as e: print(f"❌ Automatic pull failed for '{target_model}': {e}", file=sys.stderr) print("⚠️ Falling back to best available local fallback variant...", file=sys.stderr) chosen_model = get_best_llama_model(models) full_prompt = prompt_template.format(input_text=input_text) [[[REPLACE]]] ``` **MikeLev.in**: Okay, that patch landed solid: ```diff $ git status On branch main Your branch is up to date with 'origin/main'. nothing to commit, working tree clean (nix) pipulate $ patch (nix) pipulate $ cat patch | app ✅ DETERMINISTIC PATCH APPLIED: Successfully mutated 'scripts/ai.py'. (nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index 749bb718..f7c9ba37 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -210,11 +210,28 @@ def chat_with_ollama(input_text, prompt_template, model=DEFAULT_MODEL, timeout=9 if target_model in models: chosen_model = target_model else: - partial_matches = [m for m in models if m.startswith(target_model)] - if partial_matches: - chosen_model = partial_matches[0] + # Perform case-insensitive family and variant prefix resolution + normalized_target = target_model.lower() + base_target = normalized_target.split(':')[0] + + matches = [m for m in models if m.lower() == normalized_target or m.lower().startswith(normalized_target)] + if not matches: + matches = [m for m in models if m.lower().startswith(base_target)] + + if matches: + latest_variants = [m for m in matches if ':latest' in m.lower()] + chosen_model = latest_variants[0] if latest_variants else matches[0] else: - chosen_model = get_best_llama_model(models) + # True cache miss: materialise the requested model immediately via standard airlock pull + print(f"⚠️ Model target '{target_model}' not found in local Ollama inventory.", file=sys.stderr) + print(f"🔄 Executing on-demand foreground pull: 'ollama pull {target_model}'...", file=sys.stderr) + try: + 
subprocess.run(["ollama", "pull", target_model], check=True) + chosen_model = target_model + except Exception as e: + print(f"❌ Automatic pull failed for '{target_model}': {e}", file=sys.stderr) + print("⚠️ Falling back to best available local fallback variant...", file=sys.stderr) + chosen_model = get_best_llama_model(models) full_prompt = prompt_template.format(input_text=input_text) conversation_history.append({"role": "user", "content": full_prompt}) (nix) pipulate $ m 📝 Committing: fix: Improve model selection logic with Ollama [main 2cce326c] fix: Improve model selection logic with Ollama 1 file changed, 21 insertions(+), 4 deletions(-) (nix) pipulate $ git push Enumerating objects: 7, done. Counting objects: 100% (7/7), done. Delta compression using up to 48 threads Compressing objects: 100% (4/4), done. Writing objects: 100% (4/4), 936 bytes | 936.00 KiB/s, done. Total 4 (delta 3), reused 0 (delta 0), pack-reused 0 (from 0) remote: Resolving deltas: 100% (3/3), completed with 3 local objects. To github.com:pipulate/pipulate.git 5e6c6f4c..2cce326c main -> main (nix) pipulate $ ``` And we test. How does this look? ```bash $ git status On branch main Your branch is up to date with 'origin/main'. nothing to commit, working tree clean (nix) pipulate $ pipulate $ python -m py_compile scripts/ai.py usage: pipulate [-h] {install,run,uninstall,db-inspect,mcp-discover,call} ... 
pipulate: error: unrecognized arguments: -m py_compile scripts/ai.py (nix) pipulate $ python scripts/ai.py --prompt '{input_text}' --model gemma3:4b --ctx 32768 <<<'Say OK' OK __MODEL_DELIMITER__ gemma3:latest(nix) pipulate $ ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gemma3:latest a2af6cc3eb7f 2.8 GB 100% GPU 32768 4 minutes from now (nix) pipulate $ python scripts/ai.py --prompt '{input_text}' --model gemma3:4b --ctx 64000 <<<'Say OK' OK __MODEL_DELIMITER__ gemma3:latest(nixollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gemma3:latest a2af6cc3eb7f 2.8 GB 100% GPU 64000 4 minutes from now (nix) pipulate $ python scripts/ai.py --prompt '{input_text}' --model gemma3:latest --ctx 32768 <<<'Say OK' OK __MODEL_DELIMITER__ gemma3:latest(nixollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gemma3:latest a2af6cc3eb7f 2.8 GB 100% GPU 32768 4 minutes from now (nix) pipulate $ python scripts/ai.py --prompt '{input_text}' --model gemma3:latest --ctx 64000 <<<'Say OK' OK __MODEL_DELIMITER__ gemma3:latest(nixollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gemma3:latest a2af6cc3eb7f 2.8 GB 100% GPU 64000 4 minutes from now (nix) pipulate $ ``` There was some prompt redraw weirdness, but besides that how do we look now, Sonnet? **Claude Sonnet 4.6 High**: Context verified — I can assess this from the terminal output and the diffs already in the Codebase Recapture, with one minor gap: `config.py` (and `DEFAULT_PROMPT_MODEL`) isn't in the compiled context, so I can't confirm what the bare `m()` alias resolves to without `--model`. Doesn't block this check since you tested explicit `--model` values both times. The resolution logic did exactly what it should have. `--model gemma3:4b` didn't exact-match anything in your local inventory (you only have `gemma3:latest` pulled, not a separately tagged `gemma3:4b`), so it fell through to the base-family prefix match, found `gemma3:latest`, and picked it as the `:latest` variant. 
That's not a fallback failure — `gemma3:latest` on Ollama's library *is* the 4B weights by default tag convention, so requesting `gemma3:4b` and getting `gemma3:latest` back is the correct outcome, not a downgrade. The on-demand pull branch never fired (no "not found in local inventory" warning in your output), which confirms the matches path resolved cleanly both times. The bigger finding is in the `ollama ps` numbers themselves. Your earlier test accidentally exercised `llama3.2:3b` (6 GB at 32k, ballooning to 10 GB with a 36%/64% CPU/GPU split at 64k) because of the matching bug Gemini caught. Now that the actual target — gemma3 — is loading, it's sitting at 2.8 GB and staying 100% GPU at both 32768 and 64000 context, no CPU offload at all. That's a meaningfully different VRAM profile than llama3.2:3b, and it means the "stick to 32k, test 64k cautiously" guidance from before was calibrated against the wrong model. Your actual daily driver has plenty of headroom on the 3080 — worth pushing the experiment further (96k, even the full 131072 if you want to find where the CPU split actually kicks in) rather than treating 32768 as the safe ceiling. Given that, I'd bump `DEFAULT_AUTO_NUM_CTX` up from 32768 now that you have real evidence gemma3 fits comfortably — 64000 looks like a conservative floor rather than a stretch goal for this model. The "prompt redraw weirdness" isn't a bug — it's `ai.py`'s output by design. The final line is `print(f"{result}\n__MODEL_DELIMITER__\n{used_model}", end='')`, deliberately with no trailing newline, so init.lua/Neovim can capture a clean single-line result without a stray blank line. In an interactive terminal that means your next shell prompt redraws immediately after `used_model` with nothing separating them — hence `gemma3:latest(nix) pipulate $`. Cosmetic only, and not something worth "fixing" since the no-trailing-newline behavior is load-bearing for the stdin/Neovim pipeline. 
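As an editorial aside, the delimiter format described above is simple enough for a caller to verify mechanically. The following is a minimal sketch, assuming only the output shape quoted from the `print` statement (`result`, newline, `__MODEL_DELIMITER__`, newline, `used_model`); the function names `split_ai_output` and `same_family` are hypothetical and do not come from `ai.py` or `init.lua`.

```python
from typing import Optional, Tuple

# Delimiter as it appears in the print statement quoted above.
MODEL_DELIMITER = "\n__MODEL_DELIMITER__\n"


def split_ai_output(raw: str) -> Tuple[str, Optional[str]]:
    """Split captured stdout into (result, used_model)."""
    head, sep, tail = raw.rpartition(MODEL_DELIMITER)
    if not sep:
        # No delimiter present: keep the text, report the model as unknown.
        return raw.strip(), None
    return head.strip(), tail.strip()


def same_family(requested: str, used: Optional[str]) -> bool:
    """True when both names share the text before the first colon."""
    if used is None:
        return False
    return requested.split(":", 1)[0].lower() == used.split(":", 1)[0].lower()
```

Fed the output of the first test run, `same_family("gemma3:4b", "llama3.2:3b")` is false, which is the mismatch that was originally spotted by eye. A wrapper could use a check like this to warn whenever the answering model is not from the requested family.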
One thing worth doing before you call this closed: pull an explicit `gemma3:4b` tag (`ollama pull gemma3:4b`) so future calls hit the exact-match branch directly instead of relying on the family-prefix fallback every time — cheap insurance, and Ollama should dedupe the underlying blob layers against `gemma3:latest` rather than doubling disk usage. Separately, it'd be worth a quick peek at `config.py` to confirm `DEFAULT_PROMPT_MODEL` actually points at the gemma3 family, since that's what governs the bare `m()` git-commit alias that never passes `--model` explicitly. **MikeLev.in**: I do the test: ```diff $ git status On branch main Your branch is up to date with 'origin/main'. nothing to commit, working tree clean (nix) pipulate $ rg 32768 scripts/ai.py 49:DEFAULT_AUTO_NUM_CTX = _env_int("PIPULATE_OLLAMA_NUM_CTX", 32768) 262: parser.add_argument("--ctx", type=int, help="Ollama context window for this request, e.g. 32768 or 64000") (nix) pipulate $ vim scripts/ai.py (nix) pipulate $ d diff --git a/scripts/ai.py b/scripts/ai.py index f7c9ba37..0bb2cb42 100755 --- a/scripts/ai.py +++ b/scripts/ai.py @@ -46,7 +46,7 @@ def _env_float(name, default): # 32k, while 4B+ can go higher if VRAM allows. Override per run with --ctx or # globally for auto mode with PIPULATE_OLLAMA_NUM_CTX=64000 after checking # `ollama ps` for 100% GPU residency. 
-DEFAULT_AUTO_NUM_CTX = _env_int("PIPULATE_OLLAMA_NUM_CTX", 32768) +DEFAULT_AUTO_NUM_CTX = _env_int("PIPULATE_OLLAMA_NUM_CTX", 131072) # 32768 AUTO_OUTPUT_RESERVE_TOKENS = _env_int("PIPULATE_OLLAMA_OUTPUT_RESERVE_TOKENS", 4096) AUTO_CHARS_PER_TOKEN = _env_float("PIPULATE_CHARS_PER_TOKEN", 4.0) ``` And the actual test: ```bash (nix) pipulate $ python scripts/ai.py --prompt '{input_text}' --model gemma3:latest --ctx 131072 <<<'Say OK' OK __MODEL_DELIMITER__ gemma3:latest(nixollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gemma3:latest a2af6cc3eb7f 2.9 GB 100% GPU 131072 4 minutes from now (nix) pipulate $ ``` That redraw on the prompt issue has something to do with command-line history recall (using the up-arrow) but don't fix on that. Let's wrap up this article here. Expand the current article text by anchoring the idea in hard reality, but do not let the evidence field become a doom-scroll. ## Beyond the Character Cap: Metering the Prompt [Article expansion protocol was here] ChatGPT? 🎤 **ChatGPT Thinking Extended**: Here’s a wrap-up expansion that keeps the evidence hard but lets the repair pattern carry the energy forward. ## The Repair Pattern Was Hiding in the Terminal The important thing about this little `ai.py` episode is not that the model got a bigger context window. That is the surface story. The deeper story is that the workflow stopped trusting vibes and started asking the machine what it was actually doing. At first the problem looked simple: Gemma 3 felt cramped. The obvious fix was to raise the character cap. But that would have been the old habit in new clothes: make the bucket bigger and hope the plumbing holds. The better fix was to ask where the true bottleneck lived. It turned out there were two different failure modes pretending to be one. One was a prompt-construction problem: newly added files were being clipped by a fixed per-file character cap. The repair was to replace the naive cap with a shared budget across all newly added files. 
That matters because a single 100,000-character file might fit, but three medium files can still silently shove the useful diff out of the active context. The new helper does not merely “allow more.” It meters the prompt like a resource. The other was an execution-context problem: Ollama accepts a `num_ctx` option through its API, and its own documentation says context length is the number of tokens available in memory. Ollama also recommends larger context settings for coding-tool workloads and says `ollama ps` can be used to inspect loaded models, context, and processor placement. ([Ollama Documentation][1]) That is the hard reality. A prompt can be beautifully assembled and still be invisible to the model if the runtime window is smaller than the story you think you handed it. But the repair was equally real. The patch added an explicit `--ctx`, passed `options.num_ctx` into the chat payload, and then tested the result against the only evidence that mattered: `ollama ps`. The terminal became the provenance record. Not “I think Gemma saw it.” Not “the model probably handled it.” The record said which model ran, how much context it received, whether it stayed on GPU, and whether the requested model silently resolved to something else. That is earned confidence. ## The First Warning Had a Matching Corollary The first warning was ugly because it was quiet: asking for `gemma3:4b` initially returned `llama3.2:3b`. That is the kind of failure that gives local AI tooling a bad name. Not because the model is bad. Not because Ollama is bad. Because the interface between intention and execution had a soft edge. You requested one thing, received another thing, and the workflow still smiled. ## Exposing Silent Failures through Instrumentation **Old Way:** Ask for a model by name, then let fuzzy fallback choose the “best available” local alternative. **Failure Pattern:** The system preserves convenience by sacrificing provenance. 
You get an answer, but not necessarily from the model you thought you were testing. **New Way:** Resolve exact matches first, resolve only inside the same model family second, prefer the `:latest` family variant when appropriate, and pull the requested model when it is missing instead of quietly wandering into another family. **Positive Corollary:** The false confidence lost authority. The delimiter and `ollama ps` exposed the mismatch, the resolver was corrected, and the next test proved the intended model was actually running. This is the difference between a magic trick and an instrument panel. Magic says, “It worked.” Instrumentation says, “It worked, with `gemma3:latest`, at 131,072 context, 100% on GPU.” And then the surprise: once the model selection bug was fixed, the scary VRAM story changed shape. `llama3.2:3b` spilled into CPU at 64k, but `gemma3:latest` stayed 100% GPU even at 131,072 context in your test. Ollama’s Gemma 3 tag page lists `gemma3:latest` as the 4B/latest variant with 128K context, which lines up with the observed behavior rather than the earlier mistaken fallback run. ([Ollama][2]) That does not mean “max everything forever.” It means the measurement moved the decision from superstition to calibration. ## The Shift from Guesswork to Calibration ## Mental Replay: The Bad Path and the Corrected Path The bad path is easy to imagine because it almost happened. You raise the character cap. You run a commit helper. The AI gives a plausible message. The diff looks summarized. The commit lands. Nobody notices that the oldest part of the prompt got pushed out of context, or that a different model answered, or that the runtime was dragging half the work through CPU memory. The tool remains useful enough to keep using and opaque enough to keep misleading. That is the dangerous middle zone: not broken, not right, just productive-looking. The corrected path feels different in the hands. You apply one small patch. Commit. Apply another. Commit. 
Test with `--ctx 32768`. Check `ollama ps`. Test with `--ctx 64000`. Check again. Notice the delimiter says the wrong model. Patch model resolution. Test again. See `gemma3:latest`. Push to 131072. Check processor placement. See 100% GPU.

Now the decision has a paper trail.

The discipline here is not heroic. It is almost boring:

* print the model actually used,
* pass the context size explicitly,
* budget added files against the remaining window,
* verify runtime state with `ollama ps`,
* compile-check the script,
* keep each patch small enough that the commit message remains meaningful.

That is the craft corollary. A few small habits prevented a larger future mess.

## The Load-Bearing Connection

This belongs in the larger article because it shows the exact hinge between warning and repair.

The warning is that AI workflows are full of soft claims: “the model saw the file,” “the context is big enough,” “the requested model ran,” “the output is grounded in the diff.” Those claims are cheap to say and expensive to verify after the fact.

The repair is to move those claims into artifacts.

A model delimiter is an artifact.
A context flag is an artifact.
An `ollama ps` line is an artifact.
A patch hunk is an artifact.
A commit hash is an artifact.
A prompt budget is an artifact.
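
To make one of those artifacts concrete, an `ollama ps` row can be turned into a small structured record. This is a minimal sketch, assuming the column layout shown in the transcripts in this article (`NAME ID SIZE PROCESSOR CONTEXT UNTIL`); other Ollama versions may print different columns, and the names `LoadedModel` and `parse_ps_row` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed column layout, taken from the transcripts in this article:
# NAME  ID  SIZE  PROCESSOR  CONTEXT  UNTIL
ROW_PATTERN = re.compile(
    r"^(?P<name>\S+)\s+(?P<id>\S+)\s+(?P<size>[\d.]+\s+\w+)\s+"
    r"(?P<processor>\S+\s+\S+)\s+(?P<context>\d+)\s+(?P<until>.+)$"
)


@dataclass(frozen=True)
class LoadedModel:
    name: str
    size: str
    processor: str
    context: int

    @property
    def fully_on_gpu(self) -> bool:
        # "100% GPU" is the value printed when nothing is offloaded to CPU.
        return self.processor == "100% GPU"


def parse_ps_row(line: str) -> Optional[LoadedModel]:
    """Parse one data row; return None for the header or an unknown layout."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    return LoadedModel(
        name=match["name"],
        size=" ".join(match["size"].split()),
        processor=" ".join(match["processor"].split()),
        context=int(match["context"]),
    )
```

Applied to the rows recorded earlier, the `gemma3:latest` row at 131072 reports `fully_on_gpu` as true, while the `llama3.2:3b` row at 64000 (`36%/64% CPU/GPU`) reports false. Returning `None` for anything that does not fit the assumed layout keeps the parser from guessing when the format changes.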
The thesis is not “trust nothing.” That curdles into paranoia. The thesis is “make trust inspectable.” Once the workflow exposes its own state, confidence stops being a mood and starts being a repeatable engineering property.

## Engineering Confidence through Artifacts

## The Monday Morning Test

Without this insight, the failure mode is subtle. The tool keeps producing decent commit messages, but every now and then it misses the point of a newly added file. You blame the model. Or the prompt. Or “local AI.” Meanwhile the actual culprit might be a clipped file, a context window that was never set, or a fallback model answering under the wrong identity.

With this insight, Monday morning looks different.

Set the default context from evidence, not folklore. Keep `PIPULATE_OLLAMA_NUM_CTX` as the override, but let the tested daily-driver value become the working default. Keep the output reserve so the prompt does not eat the response. Keep the aggregate added-file budget so a feature scaffold cannot drown its own diff. Keep the model delimiter because it already caught one real bug. Keep `ollama ps` in the test ritual because it turns hardware residency from rumor into record.

The new checklist is short:

1. Did the requested model family actually run?
2. Did the requested context actually load?
3. Did it stay on GPU?
4. Did added-file hydration fit inside a shared budget?
5. Did the generated commit message reflect the real staged change?

That is enough. No doom-scroll required.

## The Forward Vector

This starts as context-window tuning, but it points at a larger discipline: repair libraries.

A warning list says, “Here are the ways systems crack.”
A repair library says, “Here are the patterns that kept the crack from spreading.”

For your writing workflow, that means every dramatic failure example needs a matching corollary. Not optimism. Not soft-focus redemption. A concrete repair: a log survived, a standard changed, a bad fallback was removed, a contributor was credited, a checklist interrupted the autopilot, a better model replaced the old one.

For your coding workflow, it means the same thing in miniature. Every scary ambiguity deserves a visible instrument. Which model? Which context? Which file? Which budget? Which source of truth?

The good stuff lives in the hinge.

Warnings teach where systems crack. Corollaries teach how reality repairs itself. The article needs both, or it becomes either fear theater or naïve optimism. The workflow needs both, too, or it becomes either reckless automation or manual superstition.

Stop treating the AI output as the record. Start treating the record as the thing that lets AI help safely.

What surprised me most here is how quickly the story changed once the right measurement was added. The first 64k test looked like a hardware limit. The corrected Gemma test turned it into a model-selection bug plus a solvable verification habit. I expect the next improvement will not be a bigger model or a louder prompt, but a tighter cockpit: the commit helper reporting model, context, estimated prompt tokens, added-file budget, and GPU residency as routine telemetry.

The net-new insight is that provenance is not only for historians and citations. In local AI tooling, provenance is performance engineering.

[1]: https://docs.ollama.com/context-length?utm_source=chatgpt.com "Context length - Ollama"
[2]: https://ollama.com/library/gemma3/tags?utm_source=chatgpt.com "Tags · gemma3 - ollama.com"

> **MikeLev.in**: My big learning: `ollama ps`, Ollama's take on the Unix/Linux
> `ps` command, has a `CONTEXT` field with values like `131072`, followed by an
> `UNTIL` field with values like `4 minutes from now`, and this little tidbit is
> huge in the age of AI. Tag this for a separate follow-on article next-step
> to-do finder-bot please.

---

## Book Analysis

### AI Editorial Take

What struck me was the 'Load-Bearing Connection': the realization that an AI workflow's fragility often stems from the absence of artifacts. We aren't just coding a tool; we are coding a 'record' that survives the session. My new perspective is that provenance is the true 'missing feature' in current LLM UI/UX.

### 🐦 X.com Promo Tweet

```text
Stop guessing if your AI is actually working. Learn how to turn Ollama telemetry into a provenance record and build truly inspectable local-first AI workflows. Engineering your local AI from superstition to calibration: https://mikelev.in/futureproof/ai-provenance-calibration/ #LocalAI #Ollama #Engineering
```

### Title Brainstorm

* **Title Option:** From Superstition to Calibration: Building Provenance into Local AI Tooling
  * **Filename:** `ai-provenance-calibration.md`
  * **Rationale:** Directly highlights the shift from guessing to systematic measurement.
* **Title Option:** The Instrument Panel: Making Local AI Workflows Inspectable
  * **Filename:** `ai-instrument-panel.md`
  * **Rationale:** Emphasizes the core theme of making black-box processes visible.
* **Title Option:** Closing the Hinge: Engineering Reliability in LLM Pipelines
  * **Filename:** `llm-pipeline-reliability.md`
  * **Rationale:** Focuses on the technical intersection between warning and repair.

### Content Potential And Polish

- **Core Strengths:**
  - Strong reliance on terminal telemetry as a 'source of truth'.
  - Excellent documentation of the shift from fallback-prone code to deterministic resolution.
  - Demonstrates a rare balance of technical depth and workflow philosophy.
- **Suggestions For Polish:**
  - Ensure the distinction between the 'character cap' logic and the 'num_ctx' logic remains clear for readers unfamiliar with Ollama.
  - Consider adding a summary table of the final 'provenance checklist' for easier reference.

### Next Step Prompts

- Find-bot: Create a list of 'system telemetry hooks' that should be standard for any local AI integration to detect offload, context saturation, and model identity.
- Explore: How can we visualize the 'provenance history' of a commit message generated via this pipeline to create an audit trail for the project repository?