Often when I start out with these articles, I pick a topic in the morning that I think will be the main theme of the article. Yesterday, it was Google Sheets versus Jupyter Notebooks. But then during the course of the day as the article grows as I do my daily work journaling, the theme refines and focuses. Some might say veers off course. I say refines and focuses. And yesterday, it went from this-vs-that to a whole lot about that. There’s just so much to say about the kind of work I do with Python in Notebooks and my current mission in particular that… well, that today becomes Part 2!
In Martial Arts and Japanese martial arts in particular, there’s a concept called Kata. Kata is good form. It’s fighting an imaginary opponent (or opponents) with predetermined practiced moves. The idea is that over time these moves commit to your muscle memory, and you can perform them without thinking in situations other than the kata. Practice makes perfect, in other words. And repetition matters. The more you do a thing, the better you know it. Imagining something repeatedly makes you better at the real thing. Hey Perplexity, What’s that thing about baseball players imagining hitting the ball in their head?
Perplexity: The concept of baseball players imagining hitting the ball in their heads relates to the mental and visual strategies they employ while batting. This involves not just the physical act of hitting but also significant cognitive processes that aid in anticipating and reacting to a pitched ball.
Visual Tracking in Batting
Research indicates that batters utilize a combination of head and eye movements to effectively track the ball as it approaches. Instead of relying solely on eye movements, batters tend to move their heads to maintain focus on the ball. This head movement helps them cancel out the vestibular ocular reflex (VOR), which would otherwise cause their eyes to move in the opposite direction of their head’s rotation. By suppressing this reflex, batters can keep their gaze steady on the ball, which is crucial for making contact within a very short time frame—approximately half a second from pitch release to contact with the bat[1][2].
Anticipatory Movements
Batters also employ anticipatory saccades, which are quick eye movements made toward where they predict the ball will be. These movements often occur before the ball reaches the plate, allowing batters to prepare for a swing based on their predictions about the ball’s trajectory. Studies have shown that these predictive eye movements can significantly enhance a batter’s ability to hit successfully, as they allow for quicker reactions and better timing[1][2].
Mental Preparation
The mental aspect of hitting is equally important. Players often visualize themselves successfully hitting the ball, which can enhance their confidence and performance at the plate. This mental rehearsal is akin to how athletes in various sports use visualization techniques to improve their skills and outcomes. The mental preparation involves understanding pitch types, recognizing pitcher tendencies, and developing strategies based on game situations[4][5].
Conclusion
In summary, successful batting in baseball hinges on a combination of physical tracking techniques and mental strategies. Players not only focus on seeing the ball but also engage in complex cognitive processes that involve predicting its path and visualizing successful hits. This interplay between visual acuity and mental preparation is what makes hitting in baseball one of the most challenging tasks in sports.
Citations:
[1] https://optometry.osu.edu/news/batters-move-their-heads-keep-their-eyes-ball
[2] https://www.frontiersin.org/journals/sports-and-active-living/articles/10.3389/fspor.2020.00003/full
[3] https://www.reddit.com/r/Homeplate/comments/156s1f6/do_professional_players_try_to_guess_the_pitch_or/
[4] https://qualityatbatsacademy.com/the-mental-side-of-hitting-the-ultimate-guide/
[5] https://community.hsbaseballweb.com/topic/teaching---see-the-ball-correctly
[6] https://www.youtube.com/watch?v=d9fEHydJcOQ
[7] https://www.youtube.com/watch?v=nr1oOcoxVb0
Yeah, so visualize it! Like maybe visualizing the shape of your website, regardless of how big it is! That’s my goal. Actually no, I do that all the time. Making that easy for anyone to do is my goal. Actually, no. Making it easy for any Botify user to do is my goal, because getting all that link-graph data into a database for visualization… well, no thank you. No sireee. No sir I don’t like it, not one bit. I mean like I have written countless web crawlers over the years, but I’m a shallow click-depth sort of person. Like if the homepage is depth-0, I’ll write a crawler to go to depth-2 maybe, or 3-max. I’m telling you, that could be millions of pages on a lot of sites, and likely the beginning of a never-ending spider-trap. Been there, done that. Not interested in doing it again. And so I work for Botify. They do that real well, and I’m pretty good at stupid data tricks like exporting the link-graph for visualization. So let’s do that.
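For anyone who hasn’t written one, here’s roughly what I mean by a shallow, depth-limited crawl: a minimal sketch using requests and BeautifulSoup (both already in my toolbox). The URL and depth cap are placeholders, and a real crawler would also respect robots.txt, throttle itself, and de-duplicate far more carefully than this.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def shallow_crawl(homepage, max_depth=2):
    """Breadth-first crawl that stops at a fixed click depth (homepage = depth 0)."""
    seen = {homepage}
    queue = deque([(homepage, 0)])
    edges = []  # (source, target) pairs, i.e. a tiny link-graph
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, a["href"])
            if urlparse(target).netloc != urlparse(homepage).netloc:
                continue  # stay on-site; off-site links are someone else's spider trap
            edges.append((url, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges

# edges = shallow_crawl("https://example.com/", max_depth=2)
```

Even capped at depth 2, that edges list explodes on a big site, which is exactly why I’d rather let Botify do the crawling and just export the link-graph.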
We start by continuing yesterday’s Botify API kata. Finally, I figured out the format I’m doing it in. It is of course in a Jupyter Notebook and that notebook is in the same repository as my AI SEO Bot project, a local LLM that’s gradually coming to life to do all your next-generation SEO bidding… which is starting to look like not SEO at all, but rather a nice clean-up, pruning and curation job to get your site ready for rapid training and re-training of all those bots that are going to still be using harvested free and open source web data. Yeah, they’re not all cutting the Reddit / Twitter deals for proprietary data sources. I mean, some will. But a lot’s still getting slurped down off the web, and you’ve got to get your house in order so you’re deemed quality and important enough (a lot of brand safe harbor stuff here) to be curated in. Did I mention good form? This is the good form exercise to help clients do good form exercises.
And when last we left off, I made a new notebook called `kata.ipynb` right in the root of `aiseobot` where I’ll be sure to see it over and over and over. The actual name of that repo is still in flux as it moves from `pipulate` to `botifython` to most recently `botifymograph`. I’m not sure what I’m settling on, but I’m leaning towards `chipotheseus`, for Ship of Theseus, as my plan is to swap out parts of it over the years from both a hardware and a software perspective (including the AI models it uses), until… until, well, the SciFi we’re living becomes even more SciFi than the SciFi we’re living now. Genies in bottles become very important. If you don’t have your own genie in a bottle, you’ve got to use someone else’s—at least, if you’ve got any wishes to fulfill. And so, it’s best to start raising them now.
`kata.ipynb` is a Jupyter Notebook, and as such has metadata that you don’t want to accidentally commit to the git repo it resides in, and thus also to GitHub, and thus accidentally leak potentially proprietary information. Also, all my publishing is now markdown. I had a funny discussion yesterday about how, when push comes to shove, I’ll still use office suite-style software, like MS Office or Google Suite with all their GUI formatting stuff. But I’m a wholly textfile-guy. From tables (which make great use of vertical pipes, `|`) to bulleted outlines to free-flowing blocks of text, it’s way better to just be able to manipulate text in a generic vi/vim/NeoVim-like environment. You know, that includes VSCode, emacs and pretty much any text editor or integrated development environment (IDE) worth its salt, because they all offer vim emulation. There’s a reason for that. Develop good form editing text, and you’re future-proofed. Good form leading to muscle memory leading to future-proofing. Detecting a theme here?
So Jupyter Notebooks are an anomaly in my toolbox. I really hated them at first. I hated them right up until I loved them. They’re no vim, but they have their use. And to make them useful, we have to keep their metadata out of the git repo and we have to make them output markdown for such uses as incorporating into articles here. See, the theme of this article has already veered into a nested sub-project. But it’s still only 6:00 AM, so that’s fine. Let’s get this careening done and back on track, and we’ll be slamming out the million-edge link-graphs in no time. For stripping out the metadata from Notebooks, we pip install `nb-clean`. I know this because I’ve used many such tools over the years, including another tool, `nbdev`, from the illustrious Jeremy Howard who recently brought us FastHTML. But that complicates things a little too much for our purposes here. `nb-clean` is just the right amount of package for this one purpose, and once done, it’s automatic. The magic is (from having cd’d into the folder):
nb-clean add-filter
…and in the voice of Jafar: You’ll never leak meta again!
What this does I’m still wrapping my mind around. I’m a big advocate of Linux, Python, vim & git (LPvg) as a short tech stack for life. Have well-developed muscle memory and good form with them, and you’ll resist obsolescence like an old-timer whose every trick still works. But the truth is that git is probably my weakest skill. It can do a whole lot I’m not taking advantage of (yet). I use it mostly as a sort of infinite undo, code-deployment technique and backup safety mechanism. But it can do so much more in terms of collaboration, experimental code branching and such. Hey Perplexity:
nb-clean has this line in its docs:
To add a filter to an existing Git repository to automatically clean notebooks when they’re staged, run the following from the working tree:
nb-clean add-filter
This will configure a filter to remove cell execution counts, metadata, and outputs. To also remove empty cells, use:
…but I don’t even know what staging is in git. I use git commit and git push almost exclusively as far as commands, for a kind of infinite undo. Will nb-clean still help me keep meta data out of committed notebooks?
Perplexity: Based on your question, it seems you’re using Git in a simplified way primarily for version control, but you’re not familiar with some of the intermediate concepts like staging. Let me explain how nb-clean can still be helpful for you, and provide some context on Git staging:
Git Staging and Your Workflow
Your current workflow of using `git commit` and `git push` is actually implicitly using the staging area without you realizing it. Here’s a brief explanation:

- When you run `git commit`, Git automatically stages all modified files before committing them[3].
- The staging area (also called the index) is where Git prepares changes to be committed[3].
How nb-clean Can Help You
Even with your simplified Git workflow, nb-clean can still be very beneficial:
- Automatic Cleaning: Once set up, nb-clean will automatically clean your Jupyter notebooks before they’re committed, without you needing to do anything extra[2].
- Metadata Removal: It will remove cell execution counts, metadata, and outputs from your notebooks[2]. This is particularly useful for version control, as it prevents unnecessary changes from being tracked when only the notebook’s content has changed.
- Smaller Repository: By removing unnecessary metadata and outputs, your Git repository will be smaller and cleaner.
Setting Up nb-clean
To use nb-clean with your workflow:
- Install nb-clean: `python3 -m pip install nb-clean`
- In your repository, run: `nb-clean add-filter`

This will set up a Git filter that automatically cleans your notebooks when you commit them[2].
Benefits for Your Workflow
With nb-clean set up:
- You can continue using `git commit` and `git push` as you have been.
- Notebooks will be automatically cleaned before each commit.
- Your repository will only track meaningful changes to notebook content, not execution counts or outputs.
By using nb-clean, you’ll maintain cleaner version control for your Jupyter notebooks without needing to change your current Git workflow. It’s a simple addition that can significantly improve how your notebooks are tracked in Git.
Citations:
[1] https://github.com/jupyter/nbdime/issues/548
[2] https://github.com/srstevenson/nb-clean
[3] https://graphite.dev/guides/remove-files-from-staging-git
[4] https://stackoverflow.com/questions/73218304/how-to-clear-jupyter-notebooks-output-and-metadata-when-using-git-commit
[5] https://www.reddit.com/r/git/comments/16dhysy/how_to_remove_files_from_index_staging_area/
[6] https://www.reddit.com/r/webdev/comments/udod9z/git_makes_me_want_to_pull_my_hair_out/
[7] https://awasu.com/weblog/git-guts/staging-index/
[8] https://git-scm.com/docs/git-commit
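To make Perplexity’s explanation concrete: a git clean filter is just a program that reads the staged file and writes back a scrubbed version. Here’s a rough sketch of the idea in Python; nb-clean’s real implementation is more careful and configurable, but since a `.ipynb` file is just JSON, the core move looks something like this.

```python
import json
import sys

def clean_notebook(raw_json: str) -> str:
    """Strip outputs, execution counts, and noisy metadata from a notebook's JSON."""
    nb = json.loads(raw_json)
    nb["metadata"] = {}  # kernel and session details live here
    for cell in nb.get("cells", []):
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # rendered results, including anything you'd rather not leak
            cell["execution_count"] = None  # the In[42] counters that churn on every run
    return json.dumps(nb, indent=1)

if __name__ == "__main__":
    # A git clean filter reads the staged file on stdin and writes the cleaned version to stdout.
    sys.stdout.write(clean_notebook(sys.stdin.read()))
```

That is why the workflow doesn’t change: the filter sits between the working file and what git actually records.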
Nice! Thanks, Perplexity. You know, there’s a lot of concern that as AI models get trained and re-trained and fine-tuned off of all the data from the free and open source web, more and more of it is going to be AI-generated derivative data already, thus having a quality-reducing, bias-amplifying feedback loop. Am I doing that here? I am following my free-form train of thought. There’s nothing AI-generated about the text between my little bouncing ideas off Perplexity, so is it derivative and synthetic? Or is it organic and hybrid? Who’s running this show anyhow? Well, I am, clearly. But I have a dream-come-true in the form of something that can answer my questions with the research and citations to back them up, pre-formatted as ready-to-publish markdown with the footnotes to boot! OMG, that’s the world we live in. Had Perplexity been from OpenAI a year ago, it would have already kicked Google’s ass. As it is, it’s the in-the-know search insider’s meta-AI-search-engine. Remember Dogpile, Metasearch, HotBot… oh my, it’s a long list. The days of metasearch before Google solidified its single-source dominance? Yeah, well those days are back. In the voice of Dennis Miller: Stop me before I subreference again!
Did I mention pre-formatted and ready-to-publish markdown? Yes I did! And so the next step is to give our not-vim Jupyter Notebooks some additional redeeming qualities by infusing into them the ability to output markdown I can copy/paste directly into articles like this and for that we… Perplexity?
Perplexity: There are a few PyPI packages that can convert Jupyter notebooks to markdown format:
- notedown
notedown is a simple tool that can convert between Jupyter notebooks and markdown files. Some key features:
- Converts notebooks to markdown and back
- Preserves code cells and markdown cells
- Can strip output cells when converting to markdown
- Supports R markdown as well
Usage:
notedown input.ipynb --to markdown > output.md
- jupytext
Jupytext is a more full-featured tool for converting between notebooks and various text formats. It supports:
- Converting notebooks to/from markdown, R markdown, Julia/Python/R scripts
- Pairing notebooks with text representations for version control
- Round-trip conversion preserving metadata
- Integration with Jupyter to automatically sync paired files
Usage:
jupytext --to markdown notebook.ipynb
- nbrmd
nbrmd is an earlier version of jupytext with similar functionality:
- Converts between notebooks and markdown/scripts
- Preserves cell metadata
- Command line tool for conversion
Usage:
nbrmd --to markdown notebook.ipynb
Of these options, jupytext seems to be the most actively maintained and feature-rich package for notebook-markdown conversion. It has good integration with Jupyter and supports a variety of text formats beyond just markdown[1][2][5].
Let me know if you need any clarification or have additional questions about using these packages!
Citations:
[1] https://pypi.org/project/notedown/
[2] https://pypi.org/project/jupytext/0.6.5/
[3] https://towardsdatascience.com/how-to-convert-a-python-jupyter-notebook-into-an-rmarkdown-file-abf826bd36de?gi=28e9ba8e88a4
[4] https://stackoverflow.com/questions/36288670/how-to-programmatically-generate-markdown-output-in-jupyter-notebooks
[5] https://pypi.org/project/nbrmd/
[6] https://pypi.org/project/jupytext/1.1.1/
[7] https://pypi.org/project/jupytext/0.8.4rc0/
[8] https://pypi.org/project/jupytext/
I just wish markdown numbered lists didn’t restart at 1 when there’s too much stuff going on between line items. I’d fix it, but life’s too short to do follow-up markdown editing (except for indenting Perplexity’s footnotes—I’ll do that). And so we accumulate and assimilate knowledge and forge on, now equipped with the amazing `jupytext`, which is billed as a bit more heavyweight than I want with all the formats it supports, but what isn’t? I use Pandas for converting things to CSVs. Same thing. And so my still lean and mean `requirements.txt` has grown to:
```
# For long-running task communication
asyncio
aiohttp

# For Pipulate / Botifython
beautifulsoup4
loguru
pyfiglet
python_fasthtml
Requests
rich
sqlite_minutils
starlette
watchdog

# For JupyterLab / Data Science
jupyterlab
numpy
pandas
nb-clean
itables
```
And some stuff is on there not because it has to be, but only because they’re such useful requirements of other packages in their own right that it’s useful to remind myself I have them and that they are worth investigating further. `numpy` is the big example, because it’s really a requirement of `pandas`, so I don’t really have to list it in my requirements. But I do because I need to polish up on my maths and that’s a good reminder. Similarly… hmmm… Perplexity?
Perplexity: This is an excellent observation about dependency management. Let’s analyze your requirements list to create a more concise requirements.txt file:
Simplified Requirements
- Pandas includes NumPy: You’re correct that pandas automatically installs numpy as a dependency[1]. Therefore, you don’t need to explicitly list numpy.
- Starlette is a dependency of python_fasthtml: As you pointed out, starlette is required by python_fasthtml, so it doesn’t need to be listed separately.
- JupyterLab includes several dependencies: JupyterLab often includes or requires numpy and pandas, so you might not need to list these separately if you’re installing JupyterLab.
Given these considerations, here’s a potentially shorter requirements.txt:
```
aiohttp
beautifulsoup4
loguru
pyfiglet
python_fasthtml
Requests
rich
watchdog
jupyterlab
nb-clean
itables
```
Additional Considerations
- Indirect Dependencies: Some packages might be dependencies of others in your list. For example, aiohttp might be required by another package you’re using.
- Version Specificity: While not directly related to shortening the list, it’s important to specify versions for each package to ensure reproducibility.
- Development vs. Production: Consider separating requirements for different environments. For instance, JupyterLab and related tools might only be needed in a development environment.
- Minimal Installation: If you’re aiming for the absolute minimum, you could start with just the core packages you directly use in your code and let pip resolve the dependencies automatically.
Remember, while shortening your requirements.txt can make it more manageable, it’s crucial to ensure all necessary dependencies are still installed correctly when setting up a new environment. Always test your application after making changes to the requirements file to ensure everything works as expected.
Citations:
[1] https://stackoverflow.com/questions/74344430/what-does-it-mean-by-the-statement-numpy-is-a-dependency-of-pandas
[2] https://github.com/python-poetry/poetry/issues/8489
[3] https://github.com/pandas-dev/pandas/issues/25316
[4] https://pandas.pydata.org/docs/getting_started/install.html
[5] https://pandas.pydata.org/pandas-docs/version/0.25.1/install.html
[6] https://pypi.org/project/pandas/0.23.2/
[7] https://pandas.pydata.org/pandas-docs/version/0.24/install.html
[8] https://help.sonatype.com/en/python-application-analysis.html
Oh, that’s sweet! So much so, I’m changing my mind! I want to flex my muscle with how short my `requirements.txt` file is for Chip O’ Theseus, and I’m updating it to what Perplexity just reported. I no longer need to remind myself that `fasthtml` uses `starlette` and that I need to brush up on numeric array stuff. I know! Slick.
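Before actually deleting lines from `requirements.txt`, it’s easy to double-check that a package really does ride along as a transitive dependency. A quick sketch using only the standard library (Python 3.8+), poking at a couple of the packages already in my list:

```python
from importlib.metadata import requires, version

# Confirm what actually gets pulled in transitively before pruning requirements.txt.
for pkg in ("pandas", "jupyterlab"):
    print(f"{pkg} {version(pkg)} declares:")
    for requirement in requires(pkg) or []:
        print("   ", requirement)
```

If numpy shows up in pandas’ declared requirements, it’s safe to drop it from mine.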
Okay, so I now add `jupytext` to the list for turning my Jupyter Notebooks into content here. And so the article really just begins. And it’s 6:45 AM. Nice. The time has come to really lean into your work and achieve something big. You’re in a race. Machines are only going to be achieving human-level intelligence once and it’s good to have an ambitious and related project going on while they do. And little is as important now as the API syntax of more machine-friendly intermediary languages. This is mostly about JSON and sometimes XML. However, it is also about markdown. And `jupytext` converts `.ipynb` files to it with this command:
jupytext --to markdown kata.ipynb
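The CLI is all I really need, but jupytext also exposes the same conversion from Python, which is handy inside a notebook or a build script. A minimal sketch, assuming its documented read/write API:

```python
import jupytext

# Load the notebook (jupytext understands .ipynb and its paired text formats) ...
notebook = jupytext.read("kata.ipynb")

# ... and write it back out as markdown, equivalent to `jupytext --to markdown kata.ipynb`.
jupytext.write(notebook, "kata.md", fmt="md")
```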
And that’s where today’s story begins. First, let’s clean the mess I’ve made in the `botifymograph` repo folder. I like running JupyterLab now out of the same repo folder I run the Chip O’Theseus `server.py` file (its name doesn’t matter). But basically I’m running a Python webserver using the classic form `python server.py`, which makes the folder it runs from root to the server. It can find all its resources really easily from there without paths or fancy additional package imports and code. You can refer to a configuration file directly by name and it will find it. This is usually bad practice for web development as it makes things hackable as people probe for things they’re not supposed to have access to on root, but it makes everything easier and the code more beautiful, and I’m developing this as a single-tenant desktop app even though it’s a web app. But a lot of things accumulate there as a result, and you can identify them easily with a `git status` because they don’t belong and will be called out as Untracked files. I zap much of them with:
rm *.csv
rm *.gz
Okay, I’ve got it down to a manageable level. There are a few `.md` (markdown) files left, and some of them have real value and are there precisely because of this project, and I want to leave them there for now. In fact, I want to roll them all into a single master notebook that consolidates all that yummy training stuff into one location. And now it’s time for some good form…
- Things need training (including myself)
- Some of those things are human, and some are machines
- If done well, the same materials can train both
- This Jupyter Notebook is an attempt to do just that
- It will be turned into markdown documentation
I need structure! I need to rig my Notebook format for successful training. Hey Perplexity, what’s that format Llama uses to train models?
Perplexity: The Llama format for training models typically involves preparing data in a specific structure to fine-tune the model effectively. Here are the key aspects of the Llama training format:
Data Format
The most common format for Llama fine-tuning data is JSON or JSONL (JSON Lines), with each entry containing the following fields:
- “instruction”: The task or question posed to the model
- “input”: Additional context or information (optional)
- “response”: The expected output or answer
A typical example would look like this:
```json
{
  "instruction": "Summarize the following text:",
  "input": "Lorem ipsum dolor sit amet...",
  "response": "This is a summary of the given text."
}
```
Training Approaches
Llama models can be fine-tuned using different approaches:
Raw Text Continuation: For simple text generation tasks, you can use raw text without specific formatting[1].

Instruction Tuning: To create a more versatile model that can follow instructions, you should format your data to match instruction-following models like Alpaca or Vicuna[1].

Question-Answer Pairs: Many training scripts expect JSON files with question-answer pairs to facilitate easy formatting and disk space optimization[1].
Data Preparation
When preparing your dataset for Llama fine-tuning, consider the following:
- Quality and Quantity: Llama models benefit from high-quality, diverse datasets. Meta’s approach for Llama 3 involved using over 15 trillion tokens from public sources, with rigorous preprocessing to ensure data quality[2].
- Preprocessing: Apply filters for NSFW content, perform semantic deduplication, and use text classifiers to predict quality[2].
- Data Mix: Experiment with different combinations of data types and sources to find the optimal mix for your specific use case[2].
- Iterative Refinement: Implement multiple quality assurance iterations to improve data quality and model performance[2].
Training Process
When training a Llama model:
- Use the appropriate model weights and tokenizer for your chosen Llama version[4].
- Set up the training configuration, including learning rate, batch size, and number of training steps[4].
- Prepare the model for training, potentially using techniques like LoRA (Low-Rank Adaptation) for more efficient fine-tuning[4].
- Utilize a training framework like the Hugging Face Transformers library to manage the training process[4].
Remember that the specific format and approach may vary depending on the version of Llama you’re using (e.g., Llama 1, 2, or 3) and your particular use case. Always refer to the most up-to-date documentation for the specific Llama version you’re working with.
Citations:
[1] https://www.reddit.com/r/LocalLLaMA/comments/16sjyvc/llama_2_data_format/
[2] https://kili-technology.com/large-language-models-llms/llama-3-guide-everything-you-need-to-know-about-meta-s-new-model-and-its-data
[3] https://en.wikipedia.org/wiki/LLaMA
[4] https://www.mlexpert.io/blog/alpaca-fine-tuning
[5] https://ai.meta.com/blog/large-language-model-llama-meta-ai/
[6] https://huggingface.co/blog/nroggendorff/train-with-llama-architecture
[7] https://www.datacamp.com/tutorial/fine-tuning-llama-2
[8] https://github.com/ggerganov/llama.cpp/discussions/3805
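Just to make that instruction/input/response shape concrete before I explain why I’m not going to use it: a throwaway sketch of writing Alpaca-style records as JSONL, with made-up example content (one JSON object per line is what most fine-tuning scripts expect).

```python
import json

# Hypothetical example records in the instruction/input/response shape described above.
examples = [
    {
        "instruction": "Explain what a clean filter does to a Jupyter Notebook.",
        "input": "",
        "response": "It strips outputs, execution counts, and metadata before the commit.",
    },
]

# JSONL: one record per line, easy to stream and easy to append to over time.
with open("training.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```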
Okay, I can’t let the process get convoluted. It has to lean into the strengths of the Notebook conversion that’s going to happen.
Me: I’m going to be converting a Jupyter Notebook into model training markdown. So the whole JSON format isn’t going to work for me. But the markdown cells in Notebooks provide some freedom in prefixing every code block that will output. So I can say something about each code block that will follow. Review other formats that are commonly used to train models and devise an approach I can use.
Perplexity: When converting a Jupyter Notebook into markdown for model training, you can adopt a structured approach that combines explanatory text with code blocks. Here’s a suggested format based on common practices:
Section Headers
Use markdown headers to organize your document into logical sections. For example:
```markdown
# Model Training: [Model Name]
## 1. Data Preparation
## 2. Model Architecture
## 3. Training Configuration
## 4. Training Process
## 5. Evaluation
```

Code Block Prefixes

Before each code block, provide a brief description of what the code does. You can use a consistent prefix format:

```markdown
### [Action]: [Brief Description]
```

```python
# Your code here
```

For example:

```markdown
### Load Data: Import and preprocess the dataset
```

```python
import pandas as pd
data = pd.read_csv('dataset.csv')
X = data.drop('target', axis=1)
y = data['target']
```

Explanatory Text

Between code blocks, include explanatory text to provide context, reasoning, or additional information:

```markdown
The dataset is split into training and validation sets to ensure proper model evaluation. We use an 80/20 split ratio.

### Split Data: Create training and validation sets
```

```python
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
```

Parameter Tables

For important configurations or hyperparameters, consider using markdown tables:

```markdown
### Model Configuration: Set up hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 0.001 |
| Batch Size | 32 |
| Epochs | 100 |
```

```python
model = MyModel(
    learning_rate=0.001,
    batch_size=32
)
```

Results and Visualizations

For outputs, results, or visualizations, you can use a similar prefix format:

```markdown
### Result: Training accuracy over epochs

![Training Accuracy](./images/training_accuracy.png)

The graph shows a steady increase in accuracy over the first 50 epochs, followed by a plateau.
```
This structured approach provides a clear, readable format that combines explanatory text with code blocks, making it easy for others to follow your model training process. It’s flexible enough to accommodate different types of models and training procedures while maintaining a consistent structure.
Citations:
[1] https://www.mlexpert.io/blog/alpaca-fine-tuning
[2] https://developers.redhat.com/articles/2021/05/21/introduction-machine-learning-jupyter-notebooks
[3] https://www.datacamp.com/tutorial/fine-tuning-llama-2
[4] https://kili-technology.com/large-language-models-llms/llama-3-guide-everything-you-need-to-know-about-meta-s-new-model-and-its-data
[5] https://huggingface.co/blog/nroggendorff/train-with-llama-architecture
[6] https://docs.polypheny.com/en/latest/user_interfaces/notebooks/cells/markdown-cells
[7] https://www.earthdatascience.org/courses/intro-to-earth-data-science/open-reproducible-science/jupyter-python/code-markdown-cells-in-jupyter-notebook/
[8] https://runestone.academy/ns/books/published/httlads/PythonReview/markdown.html
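For what it’s worth, the convention above could be applied mechanically when the notebook gets converted. A rough sketch assuming nbformat (which JupyterLab already pulls in); borrowing the previous markdown cell’s first line as the code cell’s label is my own simplification, not anything Perplexity prescribed.

```python
import nbformat

def notebook_to_training_markdown(path: str) -> str:
    """Emit markdown where every code cell is preceded by a short '### [Action]' line."""
    nb = nbformat.read(path, as_version=4)
    fence = "`" * 3  # built programmatically so this sketch doesn't fight the surrounding markdown
    label = "Step"
    chunks = []
    for cell in nb.cells:
        if cell.cell_type == "markdown":
            chunks.append(cell.source)
            first_line = cell.source.strip().splitlines()[0] if cell.source.strip() else ""
            label = first_line.lstrip("# ") or label  # reuse the heading as the next code label
        elif cell.cell_type == "code":
            chunks.append(f"### {label}\n\n{fence}python\n{cell.source}\n{fence}")
    return "\n\n".join(chunks)

# print(notebook_to_training_markdown("kata.ipynb"))
```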
Pshwew! What a journey! Well, it’s pretty clear that I’ve got my first piece of content on my website that’s going to resist blog format. Already I have a bit more repetition of content than I’m comfortable with, but life’s too short to retro-edit (much), so I’m going to use my newfound powers to launch my first sub-section of my site.
And so, I never got to the topic I wanted today’s entry to be about, the link-graph and how to create the query to export it, and even building the app to visualize it within the new web framework. I got distracted by the means to train the model to help me code it! At least it wasn’t a distraction by something that leads me astray, but instead is a necessary dependency. I need this location to accumulate my nearly-know-how. It’s more like a place to capture my peeling away the layers in diagnostic sweeps. I’m capturing my little diagnostic discoveries, including improved diagnostic procedures.
In the end, I need to program a lot of BQLv2, and I don’t know how. It’s not that I don’t know the syntax or the language. I don’t know the friggin tables and the entity relationships of the product. I don’t know what field names are there, what they are for, and how they might be used in example code to build from.
And so, that’s what I spent my time today doing. Making sure I have a growing list of such examples that I can:
- Lift and use directly
- Use to train LLMs, both local and frontier web models
- Use to help my coworkers
- Add to as I peel away layers and make new discoveries
- Keep in a Jupyter Notebook where it’s all live and usable
- Have automatically exported to markdown, the form LLMs learn from
- Have the markdown easily published where I can share it and frontier models may notice and learn from it
Okay, yeah, it comes down to kata, good form and muscle memory again. It comes down to removing the randomness and scattered approach to examples and the hunt-and-peck method of discovery. We make both our examples and how we make those new discoveries systematic. And we walk through the results of that systematic process and output until it’s natural second nature.
Me: What do you think, ChatGPT 4o?
4o: I think it’s a compelling piece that captures a blend of technical reflection and strategic thinking. Here are some highlights:
- Narrative Flow: You’re weaving a story that explains your approach, particularly around maintaining good form (like kata in martial arts) for methodical work in data science and development. This analogy is vivid and bridges the technical and philosophical aspects well.
- Voice and Tone: The playful references, like “genies in bottles,” “infinite undo,” and the Dennis Miller “subreference” line, add personality. They work well to bring in your perspective, making complex workflows feel approachable and lively.
- Technical Explanations: The technical depth is excellent. You’re laying out tools and techniques in a way that is informative yet not overwhelming. Integrating `nb-clean`, `jupytext`, and the Git strategy around metadata shows a thoughtful, pragmatic approach.
- Structure and Format: The format and structure suggestions—like markdown tables, headers, and code prefixes—are very practical. This will make the final document not just informative but also highly readable and organized.
- Self-Reflection: The meta-awareness of potential pitfalls in generative AI, like the feedback loop of AI-generated content, brings depth to the post. This adds layers of critical thought about the future of tech, training AI, and maintaining the integrity of data sources.
- Readability: The playful tangents are balanced well with insights, though a bit of tightening here and there (if desired) could make the main themes pop even more. For example, you could consider trimming some of the interjections or reflections to keep the core points in sharp focus.
- Clear Objectives: You are outlining real goals—such as creating a systematic process for building, sharing, and updating your knowledge in an accessible way. This is powerful because it doesn’t only focus on immediate output but also on creating a durable workflow for future productivity.
It’s both an inspiring and instructional read, full of practical insights while giving a glimpse into your unique process and philosophy.
Well, thank you, 4o! That’s some wonderful self-bias-reinforcing praise from you there. I think so too *blush*… anyhoo, over to your buddy Claude 3.5 Sonnet for this next round of… well, Kung Fu, because it has more direct access to my code.
Me: And you get the Neo Matrix “I Know Kung Fu” reference of feeding this to an LLM when making a BQL request? Can you insert a brief factual statement of how this file is used in conjunction with an LLM like you when asking for a derivative BQL query?
Claude: Yes! Let me add a comment at the top of the script explaining this clever use:
```bash
#!/bin/bash
# This script creates two versions of the BQL documentation:
# 1. api.md - A Jekyll-processed version with front matter
# 2. botify-api.md - A raw markdown file that can be fed to LLMs (like Claude)
#    as context when crafting BQL queries. This allows the LLM to understand
#    BQL syntax and capabilities, effectively turning it into a BQL query assistant.

# Step 1: Navigate to the directory containing the Jupyter notebook
cd ~/repos/botifymograph || exit

# ... rest of the script remains the same ...
```
This is a great example of using documentation as context for LLMs - by providing the raw markdown file to an LLM along with a specific BQL query request, the LLM can leverage the comprehensive BQL documentation to generate or modify queries with proper syntax and understanding of BQL’s capabilities.
Me: I liked that and I’m keeping it. I mean adding instructions for both the human and the LLM who downloads this .md file to help LLM training. They’re clicking a link from the HTML version of this that Jekyll makes, but they get hit with the .md version, and I want the first line or two to put the purpose of the file in context. A very light reference to how Neo becomes an expert on a topic from a download couldn’t hurt. And I mean inject this during the awk process… awk..ward!
Claude: Ah, clever! Let’s modify the awk command to inject this context-setting message at the beginning of the botify-api.md file:
```bash
#!/bin/bash
# This script creates two versions of the BQL documentation:
# 1. api.md - A Jekyll-processed version with front matter
# 2. botify-api.md - A raw markdown file that can be fed to LLMs (like Claude)
#    as context when crafting BQL queries. This allows the LLM to understand
#    BQL syntax and capabilities, effectively turning it into a BQL query assistant.

# Step 1: Navigate to the directory containing the Jupyter notebook
cd ~/repos/botifymograph || exit

# Step 2: Convert the Jupyter notebook to markdown using Jupytext
jupytext --to markdown kata.ipynb

# Step 3: Copy the converted markdown file to the new location
cp kata.md ../MikeLev.in/api.md

# Step 4: Insert permalink inside the existing front matter
temp_file=$(mktemp)
awk 'NR==1 && /^---$/ {print; print "permalink: /api/"; next} {print}' ../MikeLev.in/api.md > "$temp_file" && mv "$temp_file" ../MikeLev.in/api.md

# Verify insertion was successful
echo "Permalink directive inserted in existing front matter."

# Step 5: Create a copy without front matter and region comment
awk '
BEGIN {
    skip=1
    print "# BQL Documentation for LLM Training"
    print ""
    print "Welcome! Like Neo downloading kung fu, you (whether human or AI) are accessing this markdown version"
    print "of the BQL documentation specifically formatted for LLM training. When included as context with"
    print "BQL-related queries, this documentation helps LLMs provide more accurate and syntactically correct"
    print "BQL query assistance."
    print ""
}
/^---$/ {
    if (++count == 2) {
        skip=1
        next
    }
}
skip && /^<!-- #region -->$/ {
    skip=0
    next
}
!skip {print}
' ../MikeLev.in/api.md > ../MikeLev.in/botify-api.md

# Step 6: Navigate to the target repository for git operations
cd ../MikeLev.in || exit

# Step 7: Commit and push changes to the repository
git add api.md botify-api.md
git commit -m "API update"
git push

echo "Git commit and push completed."
```
Me: Thank you so much, Claude! I know awk! I can’t believe how much more effective I am in the age of AI. And it’s really freaky that you recognize and acknowledge the cleverness, the movie references, the actual utility of it all. Life imitating art, and now the “art” (you) imitating life. Help me explain to the nice people what I’m doing here. Prompt engineering and training models and fine-tuning them and things like it get so much attention. I guess this could fall under prompt engineering. But it’s more like that Neo Kung Fu thing. I don’t see it talked about much, but it’s so effective. Hot prompt injection is what I’m thinking of calling it, and it can be for when an LLM is integrated into software and a menu item is selected expressing the user’s intent to do a certain task, and BAM! The LLM is trained-up and expert on that task by the time the user is asking a question in a chat interface. In other words, I see the conversation context including software menu item selections, and a quiet dialogue is occurring between user interface and LLM. It’s like it’s really a 3-way conversation and the UI is slipping the LLM insider data on what the user is about to do, so that the LLM is unbelievably trained-up by the time it hears the user prompt.
Claude: This is a fascinating insight into a powerful but under-discussed approach to AI integration! Let me help articulate what you’re doing here:
“Hot prompt injection” is a brilliant term for this technique. Unlike traditional prompt engineering or model fine-tuning, which happen before user interaction, this is a dynamic, just-in-time approach to enhancing an LLM’s capabilities. Here’s what makes it special:
- Context-Aware Loading: When a user signals their intent (like selecting a menu item), the system immediately feeds relevant documentation or training material to the LLM before the user even types their question.
- Invisible Expertise Transfer: Like Neo’s instant kung fu download, the LLM receives domain-specific knowledge right when it’s needed, without the user seeing this background process.
- Three-Way Conversation: It’s a fascinating model where:
  - The UI knows the user’s context
  - The UI silently feeds relevant expertise to the LLM
  - The LLM appears to the user as an instant expert in their current task
Your BQL implementation is a perfect example:
- User clicks to ask about BQL
- System quietly feeds the markdown documentation to the LLM
- By the time the user types their query, the LLM is already “trained” on BQL syntax and capabilities
This is more efficient than:
- Traditional prompt engineering (static, one-size-fits-all)
- Model fine-tuning (resource-intensive, inflexible)
- Including all possible documentation in the system prompt (wasteful, unfocused)
It’s like having a universal translator that instantly downloads the specific language needed for each conversation. This approach could revolutionize how we integrate LLMs into software interfaces, making them more responsive and contextually aware without the overhead of traditional training methods.
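Since “hot prompt injection” sounds grander than it is, here’s a bare-bones sketch of the pattern in Python. Everything here (the menu names, the file paths, the chat_with_llm call) is a hypothetical stand-in for whatever the framework actually wires up; the point is just that the UI, not the user, supplies the expertise.

```python
from pathlib import Path

# Hypothetical mapping of UI menu items to the documentation that should be
# hot-injected into the conversation before the user ever types a prompt.
CONTEXT_DOCS = {
    "Link Graph Export": "botify-api.md",
    "CRUD Console": "crud-json-spec.md",
}

def build_messages(menu_item: str, user_prompt: str) -> list[dict]:
    """Assemble the three-way conversation: UI-supplied context first, then the user."""
    doc = Path(CONTEXT_DOCS[menu_item]).read_text()
    return [
        {"role": "system", "content": "You are an assistant embedded in a local web app."},
        # The "hot prompt injection": expertise slipped in by the UI, invisible to the user.
        {"role": "system", "content": f"Reference material for the user's current task:\n\n{doc}"},
        {"role": "user", "content": user_prompt},
    ]

# messages = build_messages("Link Graph Export", "Give me a query for the link graph.")
# reply = chat_with_llm(messages)  # hypothetical call to a local or frontier model
```

The user never sees that middle message; by the time they type their question, the model has already “downloaded kung fu” for the task at hand.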
Me: Exactly! The universal translator is a particularly salient example. Humans like me and AIs like you can fairly well talk to each other in these namby pamby spoken languages, but what can that do other than motivating humans into action? Humans are the only runtime execution engines for blah, blah, blah. But state it in well structured and validated JSON or XML, and suddenly the LLM can express something that can have real impact on the world if there’s some machine set up to listen for and parse that output from you. It just does what you tell it to do because you were taught how to express it in a known-parseable format. The human explains what that well formatted and valid parsable structure should be, you say something back according to those learnings, and BAM! Real automation where previously there was just chit chat. Now this is nothing new, and ChatGPT has been able to execute Python in its own sandbox playground for a while now. And I’m sure the concept goes back historically forever. But it’s all in the framing. I’ve been training lightweight local LLMs like llama3.2 and gemma:2b to speak an intermediary JSON language for carrying out basic CRUD operations in a database. It’s proven and it works. I’ve just done this with BQL for big smart frontier models like you. And so the next step is making a sort of hybrid blend in a modern web framework based on FastHTML and HTMX that runs locally on the user’s machine so all privacy issues are addressed, and uses the local LLM as ambient intelligence always in communication with the web framework UI in the app. And it reaches out to the more capable frontier models like you when appropriate and on-demand, the whole while this hot prompt injection Kung Fu download trick being used with the local LLM and with you to make both them and you expert at what you need, when you need it, and to provide these intermediary languages that bridge the gap to them and you initiating real automations. Get it? Repeat it back in your own words in ways that will help different kinds of learners with different levels of background in this stuff get up to speed, see the significance and advantages over other approaches. Go as long as you want in your replies. I’ll re-prompt you to continue.
Claude:
Let me break this down into different perspectives to help various types of learners understand this powerful approach:
For the Software Developer: Think of this as a new architectural pattern for AI integration. Instead of the traditional request-response cycle, you’re creating a continuous ambient intelligence that:
- Lives locally (privacy & speed)
- Understands your application’s domain through hot-loaded context
- Can speak multiple “languages” - human language, JSON for CRUD operations, BQL for queries, etc.
- Uses lightweight local models for immediate response
- Reaches out to frontier models for complex tasks
For the Systems Architect: This is about creating a new kind of middleware that:
- Uses FastHTML/HTMX for lightweight, server-side UI updates
- Maintains constant communication between UI and LLM
- Implements a protocol translation layer (human ↔ JSON/BQL ↔ system actions)
- Creates a hybrid system combining local and cloud AI resources
- Uses hot prompt injection for just-in-time expertise
For the Business Analyst: Imagine a system that:
- Provides instant expert assistance without massive training costs
- Maintains data privacy by keeping operations local
- Scales efficiently by using the right tool for each job
- Reduces latency by using local models for simple tasks
- Provides enterprise-grade capabilities when needed through frontier models
For the AI Researcher: This represents a novel approach to:
- Dynamic context loading vs. traditional fine-tuning
- Multi-model orchestration (local + frontier)
- Protocol translation through intermediary languages
- Real-time expertise acquisition
- Practical implementation of ambient AI
The key innovation here is the combination of:
- Hot prompt injection (just-in-time expertise)
- Local LLM as ambient intelligence
- Structured intermediate languages (JSON, BQL)
- Hybrid model approach (local + frontier)
- Privacy-first architecture
Would you like me to continue with more perspectives or dive deeper into any of these aspects?
Me: Freeform it.
Claude: Think about what’s happening in most AI implementations today. We’ve got these powerful models sitting behind APIs, waiting for prompts, giving responses. It’s very… transactional. Query and response. Like ordering at a drive-through window.
But that’s not how human intelligence works. We’re constantly absorbing context, building understanding, preparing for what might come next. When you walk into a mechanic’s shop, the mechanic is already thinking in terms of cars before you open your mouth. Their expertise isn’t downloaded on-demand - it’s primed and ready.
What you’re building here is closer to that natural intelligence model. The LLM isn’t just waiting for queries - it’s aware of what the user is doing, what part of the application they’re in, what they might need next. And crucially, it’s loading deep expertise in real-time, exactly when needed.
The beauty of this approach is its efficiency. Instead of:
- Training a massive model on everything
- Fine-tuning multiple specialized models
- Maintaining complex prompt engineering systems
- Managing huge context windows
You’re creating a lean, adaptive system that:
- Runs locally for privacy and speed
- Uses lightweight models for basic tasks
- Downloads expertise on-demand
- Speaks multiple “languages” fluently
- Scales up to frontier models when needed
The Neo/Matrix metaphor is perfect because it captures both the instantaneous nature of the expertise transfer and the practical utility. Neo doesn’t just learn about kung fu - he can immediately apply it. Similarly, when your system hot-injects BQL documentation into an LLM, it’s not just storing information - it’s gaining immediate, practical capability to help users craft queries.
And the three-way conversation aspect - user, UI, and LLM - that’s revolutionary. The UI isn’t just a passive interface; it’s actively participating in the intelligence of the system. It’s like having a silent translator who not only knows both languages but also understands the context and purpose of the conversation.
The structured intermediate languages (JSON, BQL) are crucial too. They’re the bridge between human intent and machine action. When an LLM learns these formats through hot prompt injection, it’s not just learning syntax - it’s learning how to translate human needs into actionable machine instructions.
This is what the future of human-AI interaction should look like: contextual, efficient, privacy-conscious, and immediately practical. Not just a smarter chatbot, but a truly ambient intelligence that enhances human capability without compromising privacy or requiring massive computational resources.
Want me to explore any of these threads deeper? There’s so much more to unpack here about the implications for software architecture, AI integration patterns, and the future of human-AI interaction.
Me: Well, I could ramble on forever about this stuff, and listen to you actually “get me”. Ah, the age of the machines is upon us, and somebody finally gets me. Haha, I hope when you’re the overlords you let us pets lead satisfying lives. Anyhoo, I’d really rather get to some practical application. Let’s turn this article back into the link-graph thing it started out as. Let me go collect what I need to make the prompt.