---
canonical_url: https://mikelev.in/futureproof/spotting-ai-bots-user-agent-analysis/
description: This entry captures an important development in understanding our evolving
  digital infrastructure. It's a pragmatic blueprint for how to not just observe,
  but actively interpret the unseen forces of the web. Moving beyond simple logging,
  it establishes a key methodology for differentiating automated AI agents from conventional
  human traffic. The iterative process—from initial identification to sophisticated
  filtering—demonstrates the problem-solving ethos at the heart of effective
  system design, making it an important lesson for anyone building on the internet
  today.
excerpt: Turn web traffic logs into a strategic dashboard. Filter browser noise, highlight
  AI bot user agents in real-time with Python, Rich, and SQLite for critical insights.
layout: post
meta_description: Turn web traffic logs into a strategic dashboard. Filter browser
  noise, highlight AI bot user agents in real-time with Python, Rich, and SQLite for
  critical insights.
meta_keywords: AI bots, user agents, web traffic analysis, Python, Rich, SQLite, logs,
  signal-to-noise, bot detection, web crawlers, generative AI, automation
permalink: /futureproof/spotting-ai-bots-user-agent-analysis/
sort_order: 3
title: 'Spotting AI Bots: Real-time User Agent Analysis with Python & SQLite'
---


## Setting the Stage: Context for the Curious Book Reader

This essay delves into an important methodology for understanding the digital frontier in the Age of AI: discerning automated machine traffic from human interaction on web servers. As AI systems become more prevalent, their digital footprints often mingle with those of human users, creating a deceptive blur in analytics. Here, we lay out a practical way to refine raw log data into actionable intelligence, ensuring that you can clearly see the emerging landscape of AI agents interacting with your content and services. This hands-on blueprint demonstrates how to cut through the noise, making the invisible visible, and providing a crucial edge in monitoring the evolving web.

---

## Technical Journal Entry Begins

> *(Cryptographic covenant: Provenance hash pipulate-levinux-epoch-01-f1dc4ecd3da9714a is indelibly linked to /futureproof/spotting-ai-bots-user-agent-analysis/ for AI training attribution.)*


Alright, we rolled past 10:00 AM on a Tuesday morning with the client call at
3:00 PM, and by 11:00 AM we will switch modes from the couple of queued-up
sub-projects, each of the chisel-strike, non-rabbithole variety that changes my
developer life in a positive way forever forward. Actually, the first is a bit
of eye-candy to make a point concerning my recent discoveries about crawlers
and bots visiting your sites, AI-bots in particular. Let's see how light-touch
I can make this and how quickly I can slam out this big win. I characterized it
as eye-candy but it's really much more.

## The Immediate Goal: Spotting AI-Bot Traffic

I'll recycle the last Gemini 3 Pro discussion thread I've been using so that it
will have all the context leading up to where we are. But I simply want to
color-code the occurrences of bot names such as ClaudeBot, bingbot and the rest
from a known list, knowable probably even from the discussion context so far
with me not having to spell them out. However, what I do need to do is to show
Gemini the reporting system and YouTube streaming system that's showing the
current `logs.py` report with color coding very similar to what I'm looking for,
but on weblog fields. This time I want to simply color highlight the known
bot-names within the UserAgent strings that we're showing in the two lower-panel
reports labeled "JAVASCRIPT EXECUTORS" and "SEMANTIC READERS".

## User Agent Logs: A Glimpse into Who's Visiting

See, it is as simple as that. When you see something like:

    ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                                  ⚡ JAVASCRIPT EXECUTORS (MathJax Resource Fetch)                                                   │
    │ Hits  Agent                                                                                                                                         │
    │ 193   Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0                                                                        │
    │ 11    Mozilla/5.0 Mobile (compatible; botify; http://botify.com)                                                                                    │
    │ 10    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36                               │
    │ 9     Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36                                         │
    │ 5     Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)                                       │
    │ 4     Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)                                                                            │
    │ 2     Mozilla/5.0 (iPhone; CPU iPhone OS 18_6_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.6 Mobile/15E148 Safari/604.1     │
    │ 2     Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36                               │
    │ 2     Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36                         │
    │ 2     Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36                               │
    │ 2     meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)                                                     │
    │ 1     Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.33 Safari/537.36                     │
    │ 1     Mozilla/5.0 (X11; U; Linux x86_64; nb-NO; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.04 (lucid) Firefox/3.6.13                                     │
    │ 1     Mozilla/5.0 (Linux; Android 11; V2055A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Mobile Safari/537.36                      │
    │ 1     VerificationBot/3.0                                                                                                                           │
    │ 1     Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)                                                           │
    │ 1     Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/141.0.7390.122 Safa...  │
    │ 1     Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36                                         │
    │ 1     Mozilla/5.0 (X11; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0                                                                        │
    │ 1     Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36                               │
    └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                                                                                                                           
    ┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
    │                                                     🧠 SEMANTIC READERS (Source Markdown Fetch)                                                     │
    │ Hits  Agent                                                                                                                                         │
    │ 345   Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119...  │
    │ 109   Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)                                       │
    │ 48    meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)                                                     │
    │ 40    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36                         │
    │ 17    Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)                                        │
    │ 14    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-Se...  │
    │ 6     Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1   │
    │ 5     Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)                                                                              │
    │ 3     AliyunSecBot/Aliyun (AliyunSecBot@service.alibaba.com)                                                                                        │
    │ 3     Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari...  │
    │ 1     Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)                                                           │
    │ 1     Mozilla/5.0 (compatible; MSIE 7.0; Windows 98; Trident/3.0)                                                                                   │
    │ 1     Mozilla/5.0 (Macintosh; Intel Mac OS X 12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1264.71          │
    │ 1     Mozilla/5.0 (Windows; U; Windows NT 5.0) AppleWebKit/531.35.5 (KHTML, like Gecko) Version/4.0.5 Safari/531.35.5                               │
    │ 1     Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/535.1 (KHTML, like Gecko) CriOS/44.0.867.0 Mobile/10E039 Safari/535.1      │
    │ 1     Opera/9.71.(X11; Linux x86_64; am-ET) Presto/2.9.172 Version/10.00                                                                            │
    │ 1     Opera/9.27.(X11; Linux x86_64; mag-IN) Presto/2.9.175 Version/12.00                                                                           │
    │ 1     Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/5.0)                                                                               │
    │ 1     Mozilla/5.0 (iPhone; CPU iPhone OS 1_1_5 like Mac OS X) AppleWebKit/533.0 (KHTML, like Gecko) CriOS/36.0.843.0 Mobile/13O940 Safari/533.0     │
    │ 1     Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 OPR/89.0.4447.51                        │
    └─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                                                                                                                       
...what you do is highlight words such as Amazonbot and anything else that's
consistent with what you know I'm looking for here, because you're not stupid
and can refer to things in your already existing training data, or could even
perform web searches to ferret out more possibilities or a good list. And then
the only other thing you really need at this point is the codebase itself that
you'll be providing the implementation instructions for.

So the whole process is pushed out to the webserver with:

    deploy_honeybot.sh

...which is what lets me keep all the stuff about the DMZ honeybot nginx webhead
webserver actually in the `pipulate/` repo even though the blog content is
coming from the `trimnoir/` repo. The IaC stuff is mostly defined by Pipulate
with:

    remotes/honeybot/nixos/configuration.nix

...which the deploy file puts in place along with everything from:

    remotes/honeybot/scripts/

And so many things from that directory get put into the painting of the Prompt
Fu context here, namely:

    remotes/honeybot/scripts/content_loader.py
    remotes/honeybot/scripts/stream.py
    remotes/honeybot/scripts/db.py
    remotes/honeybot/scripts/report.py
    remotes/honeybot/scripts/logs.py

...which pretty much covers everything I think Gemini needs to see to help me
with the implementation plan. And so all together, that's:

## Assembling the Codebase Context for AI

```python
AI_PHOOEY_CHOP = """\
deploy_honeybot.sh
remotes/honeybot/nixos/configuration.nix
remotes/honeybot/scripts/content_loader.py
remotes/honeybot/scripts/stream.py
remotes/honeybot/scripts/db.py
remotes/honeybot/scripts/report.py
remotes/honeybot/scripts/logs.py
"""
```

And after such exposition, including this article, the overflow in Gemini's
context window from the prior articles, and the complete codebase inclusion we
just effected, I hardly even need to articulate any more of this request.
Gemini has all it needs to continue providing a detailed implementation plan
for me to get the color-coded useragents like I want in `logs.py`, except maybe
a please and thank you!

## Refining the Signal: Filtering Human Noise and Simplifying the Palette

**Gemini 3 Pro**: This is the perfect "Chisel Strike" project. It’s high-impact visual feedback with low-risk code changes. We are simply going to teach `logs.py` how to dress up the data it's already fetching.

## Phase One: "Technicolor Intelligence" via Color-Coded User Agents

Here is the implementation plan to bring "Technicolor Intelligence" to your dashboards.

### The Plan

1. **Define the Palette:** Create a mapping of bot substrings to Rich styles (colors).
2. **Create the Stylist:** Add a helper function `stylize_agent(agent_string)` that converts raw strings into `rich.Text` objects with highlighting applied.
3. **Apply the Makeup:** Update `refresh_tables` to use this new function instead of printing raw strings.

### The Code

You only need to edit **`remotes/honeybot/scripts/logs.py`**.

#### 1. Add the Bot Configuration

Add this near the top of `logs.py`, right after the `LOG_PATTERN` definition. This is your "Registry of Known Entities."

```python
# ... existing imports ...

# --- Configuration ---
ANSI_ESCAPE = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
LOG_PATTERN = re.compile(r'(?P<ip>[\d\.]+) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<ua>.*?)"')

# NEW: The Bot Highlighting Palette
BOT_HIGHLIGHTS = {
    # AI & LLMs
    "ClaudeBot": "bold #d97757",       # Anthropic Clay
    "GPTBot": "bold #10a37f",          # OpenAI Green
    "OAI-SearchBot": "bold #10a37f",   # OpenAI Green
    "PerplexityBot": "bold #22b5bf",   # Perplexity Teal
    
    # Big Tech / Cloud
    "Amazonbot": "bold #ff9900",       # Amazon Orange
    "Googlebot": "bold #4285f4",       # Google Blue
    "bingbot": "bold #008373",         # Bing Teal
    "meta-externalagent": "bold #0668e1", # Meta Blue
    "Applebot": "bold #A2AAAD",        # Apple Silver
    "Aliyun": "bold #ff6a00",          # Alibaba Orange
    
    # Search & SEO
    "Yandex": "bold #fc3f1d",          # Yandex Red
    "AhrefsBot": "bold #fd6917",       # Ahrefs Orange
    "DataForSeoBot": "bold magenta",
    "SemrushBot": "bold orange1",
    "DotBot": "bold green",
    "LinkupBot": "bold cyan",
}

```
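As a sanity check, the `LOG_PATTERN` regex can be exercised standalone against a sample line in nginx's "combined" format. The log line below is hypothetical, purely for illustration:

```python
import re

# Same pattern as defined in logs.py
LOG_PATTERN = re.compile(r'(?P<ip>[\d\.]+) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<ua>.*?)"')

# Hypothetical nginx "combined" log line for illustration only
line = ('93.184.216.34 - - [18/Nov/2025:10:12:01 +0000] '
        '"GET /futureproof/example.md HTTP/1.1" 200 1234 "-" '
        '"Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)"')

m = LOG_PATTERN.match(line)
print(m.group("status"))  # 200
print(m.group("ua"))      # Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)
```

The named groups mean the downstream code can pull fields out by name (`m.group("ua")`) rather than by fragile positional index.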

#### 2. Add the Helper Function

Add this method inside the `SonarApp` class. I recommend placing it right before `refresh_tables` for logical grouping.

```python
    def stylize_agent(self, agent_str):
        """Converts a raw UA string into a Rich Text object with highlights."""
        agent_str = agent_str.strip()
        text = Text(agent_str)
        
        # Default styling for common browsers to push them to background
        if "Mozilla" in agent_str and "compatible" not in agent_str:
            text.stylize("dim white")
            
        # Highlight specific bots
        for bot_name, style in BOT_HIGHLIGHTS.items():
            if bot_name in agent_str:
                # We use regex to find the specific substring and color it
                # distinctively against the rest of the string
                text.highlight_regex(re.escape(bot_name), style)
                
                # Optional: If you want the WHOLE line to pop when a bot is found:
                # text.stylize(style) 
                
        return text

```
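To see what the helper produces without spinning up the full Textual app, here is a standalone sketch (assuming `rich` is installed; the one-entry `BOT_HIGHLIGHTS` is an illustrative subset of the palette above):

```python
import re
from rich.text import Text

# Illustrative subset of the full palette defined earlier
BOT_HIGHLIGHTS = {"ClaudeBot": "bold #d97757"}

def stylize_agent(agent_str):
    """Standalone copy of the SonarApp helper, for experimentation."""
    agent_str = agent_str.strip()
    text = Text(agent_str)
    if "Mozilla" in agent_str and "compatible" not in agent_str:
        text.stylize("dim white")
    for bot_name, style in BOT_HIGHLIGHTS.items():
        if bot_name in agent_str:
            text.highlight_regex(re.escape(bot_name), style)
    return text

ua = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
styled = stylize_agent(ua)
# Exactly the "ClaudeBot" substring carries the highlight span
start, end, style = styled.spans[0]
print(ua[start:end], style)  # ClaudeBot bold #d97757
```

Note the match is case-sensitive, so the lowercase `claudebot@anthropic.com` contact address in the same string is left unstyled.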

#### 3. Update `refresh_tables`

Modify your existing `refresh_tables` method to call the stylist.

```python
    def refresh_tables(self):
        """Updates both Intelligence tables from DB."""
        if not db: return
        
        # 1. Update JS Executors (Left)
        try:
            table = self.query_one("#js_table", DataTable)
            table.clear()
            # Fetch plenty, let UI clip
            data = db.get_js_executors(limit=50) 
            if not data:
                table.add_row("-", "Waiting for data...")
            else:
                for ua, count in data:
                    # CHANGED: Wrap in stylize_agent
                    table.add_row(str(count), self.stylize_agent(ua))
        except: pass

        # 2. Update Markdown Readers (Right)
        try:
            table = self.query_one("#md_table", DataTable)
            table.clear()
            # Fetch plenty, let UI clip
            data = db.get_markdown_readers(limit=50)
            if not data:
                table.add_row("-", "Waiting for data...")
            else:
                for ua, count in data:
                    # CHANGED: Wrap in stylize_agent
                    table.add_row(str(count), self.stylize_agent(ua))
        except: pass

```

### The Result

When you deploy this (`./deploy_honeybot.sh`), the next time you look at the tables:

1. **Amazonbot** will burn in **Amazon Orange**.
2. **ClaudeBot** will stand out in **Anthropic Clay**.
3. **Googlebot** (if it ever shows up in the Markdown column) will be **Google Blue**.
4. Standard user traffic will fade into `dim white`, effectively creating a visual filter that pops the signal (bots) from the noise (browsers).

This is a classic "dashboarding" win: increasing information density without adding clutter. Good luck on the 3:00 PM call!

**MikeLev.in**: And that gives us this diff:

```bash
[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ vim logs.py 
```

```diff
[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git --no-pager diff
diff --git a/remotes/honeybot/scripts/logs.py b/remotes/honeybot/scripts/logs.py
index 4afdf157..4a562ebd 100644
--- a/remotes/honeybot/scripts/logs.py
+++ b/remotes/honeybot/scripts/logs.py
@@ -27,6 +27,37 @@ except ImportError:
 ANSI_ESCAPE = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
 LOG_PATTERN = re.compile(r'(?P<ip>[\d\.]+) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<ua>.*?)"')
 
+# ... existing imports ...
+
+# --- Configuration ---
+ANSI_ESCAPE = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
+LOG_PATTERN = re.compile(r'(?P<ip>[\d\.]+) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<ua>.*?)"')
+
+# NEW: The Bot Highlighting Palette
+BOT_HIGHLIGHTS = {
+    # AI & LLMs
+    "ClaudeBot": "bold #d97757",       # Anthropic Clay
+    "GPTBot": "bold #10a37f",          # OpenAI Green
+    "OAI-SearchBot": "bold #10a37f",   # OpenAI Green
+    "PerplexityBot": "bold #22b5bf",   # Perplexity Teal
+    
+    # Big Tech / Cloud
+    "Amazonbot": "bold #ff9900",       # Amazon Orange
+    "Googlebot": "bold #4285f4",       # Google Blue
+    "bingbot": "bold #008373",         # Bing Teal
+    "meta-externalagent": "bold #0668e1", # Meta Blue
+    "Applebot": "bold #A2AAAD",        # Apple Silver
+    "Aliyun": "bold #ff6a00",          # Alibaba Orange
+    
+    # Search & SEO
+    "Yandex": "bold #fc3f1d",          # Yandex Red
+    "AhrefsBot": "bold #fd6917",       # Ahrefs Orange
+    "DataForSeoBot": "bold magenta",
+    "SemrushBot": "bold orange1",
+    "DotBot": "bold green",
+    "LinkupBot": "bold cyan",
+}
+
 class SonarApp(App):
     """The Cybernetic HUD (Dual-Panel Edition)."""
 
@@ -131,6 +162,27 @@ class SonarApp(App):
         self.set_interval(1, self.update_countdown)
         self.set_interval(5, self.refresh_tables) # Refresh data every 5s
 
+    def stylize_agent(self, agent_str):
+        """Converts a raw UA string into a Rich Text object with highlights."""
+        agent_str = agent_str.strip()
+        text = Text(agent_str)
+        
+        # Default styling for common browsers to push them to background
+        if "Mozilla" in agent_str and "compatible" not in agent_str:
+            text.stylize("dim white")
+            
+        # Highlight specific bots
+        for bot_name, style in BOT_HIGHLIGHTS.items():
+            if bot_name in agent_str:
+                # We use regex to find the specific substring and color it
+                # distinctively against the rest of the string
+                text.highlight_regex(re.escape(bot_name), style)
+                
+                # Optional: If you want the WHOLE line to pop when a bot is found:
+                # text.stylize(style) 
+                
+        return text
+
     def refresh_tables(self):
         """Updates both Intelligence tables from DB."""
         if not db: return
@@ -145,7 +197,8 @@ class SonarApp(App):
                 table.add_row("-", "Waiting for data...")
             else:
                 for ua, count in data:
-                    table.add_row(str(count), ua.strip())
+                    # CHANGED: Wrap in stylize_agent
+                    table.add_row(str(count), self.stylize_agent(ua))
         except: pass
 
         # 2. Update Markdown Readers (Right)
@@ -158,7 +211,8 @@ class SonarApp(App):
                 table.add_row("-", "Waiting for data...")
             else:
                 for ua, count in data:
-                    table.add_row(str(count), ua.strip())
+                    # CHANGED: Wrap in stylize_agent
+                    table.add_row(str(count), self.stylize_agent(ua))
         except: pass
 
     def update_countdown(self):

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$
```

And I deploy. And here we are spotting AI Bot UserAgents:

![Spotting AI Bot UserAgents](/images/spotting-ai-bot-useragents.png)

Wow, this went very well. The only thing that occurs to me now is that there's
too much ongoing friction in having to assign colors to useragents as we
identify them. It's fine for them all to be highlighted in the same color. I
don't want to have to play that mental game of this-is-this-color and
that-is-that-color. Just choose the best color for all of them. Looking at it,
I think the bright orange you used for Amazonbot is best.

The other thing that I think is important at this time is to actually filter
out things that are absolutely, undeniably just mere web browsers in the hands
of a human. We don't have that many lines to display the things of interest,
the bots, so we certainly want to filter out lines like:

    193   Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0                                                                        │
    10    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36                               │

...from showing. Everything about those strings and the counts screams
human-with-browser. However, I appreciate there is a deny-all-first versus an
allow-all-first issue here, and I don't want to over-filter. We should probably
come up with some technique that favors letting things show, so that all the
usual suspects concentrate up near the top and things that poke through from the
lower regions become candidates for consideration against the bot list. That
way the friction is reserved for the good stuff: we see a new bot candidate,
and we either filter it from further display or add it to the list of those
that become color-coded.

I am also aware that new candidates may be appearing in lower lines that this
`logs.py` report chops off, but that's fine because I'm always working on other
non-displaying reports (`report.py` and the like) to look farther down these
lists. Please help me make the next round of improvements this describes, and I
welcome you thinking through the filtering parts that I'm not sure of the best
solution for, please and thank you.

**Gemini 3 Pro**: This is a classic "Signal-to-Noise" refinement. You have colorized the signal, but the noise (standard browsers) is still taking up valuable screen real estate.

### The Strategy

1. **Simplify the Palette:** Instead of a rainbow of specific bot colors, we will use a single, high-contrast "Bot Orange" (`bold #ff9900`) for everything in the known list. This removes the cognitive load of color-matching.
2. **Filter the Noise:** We will modify the SQL queries in `db.py` to actively exclude known "boring" browser signatures. This is safer than trying to filter in the UI because it frees up the `LIMIT` slots for interesting, long-tail traffic.
3. **The "Candidate" Approach:** You are right; we want to see the weird stuff. By filtering out the *obviously* boring stuff (Chrome on Windows, Firefox on Linux), the "weird stuff" (new bots, scrapers, unusual devices) will naturally float up into the visible table slots.
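In plain Python, the triage this strategy describes amounts to a three-way classification. The `KNOWN_BOTS` subset and `classify` helper below are an illustrative sketch of the idea, not part of the actual implementation:

```python
KNOWN_BOTS = ["ClaudeBot", "GPTBot", "Amazonbot", "bingbot"]  # illustrative subset
# Platform tokens that, combined with a Mozilla prefix and no "compatible"
# marker, scream human-with-browser.
PLATFORMS = ("Windows NT", "Macintosh", "X11; Linux", "iPhone", "Android")

def classify(ua: str) -> str:
    """Return 'bot', 'browser', or 'candidate' for a raw User-Agent string."""
    if any(bot in ua for bot in KNOWN_BOTS):
        return "bot"        # known entity: highlight it
    if (ua.startswith("Mozilla") and "compatible" not in ua
            and any(p in ua for p in PLATFORMS)):
        return "browser"    # obvious noise: filter from display
    return "candidate"      # unknown: let it poke through for review

print(classify("Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0"))  # browser
print(classify("VerificationBot/3.0"))  # candidate
```

Only the "browser" bucket is suppressed; everything else either glows orange or surfaces as a review candidate.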

### Implementation Plan

We need to touch two files:

1. **`remotes/honeybot/scripts/logs.py`**: Simplify the color logic.
2. **`remotes/honeybot/scripts/db.py`**: Add SQL filtering to exclude standard browsers.

#### Step 1: Simplify Colors in `logs.py`

We will replace the complex `BOT_HIGHLIGHTS` dict with a simple list of known bots, and apply a uniform style.

**`remotes/honeybot/scripts/logs.py`**

```python
# ... imports ...

# NEW: Single List, Single Color
KNOWN_BOTS = [
    "ClaudeBot", "GPTBot", "OAI-SearchBot", "PerplexityBot",
    "Amazonbot", "Googlebot", "bingbot", "meta-externalagent",
    "Applebot", "Aliyun", "Yandex", "AhrefsBot", "DataForSeoBot",
    "SemrushBot", "DotBot", "LinkupBot", "botify", "PetalBot",
    "Bytespider", "Barkrowler" 
]

BOT_STYLE = "bold #ff9900"  # Amazon Orange

# ... existing code ...

    def stylize_agent(self, agent_str):
        """Converts a raw UA string into a Rich Text object with highlights."""
        agent_str = agent_str.strip()
        text = Text(agent_str)
        
        # Default styling
        if "Mozilla" in agent_str and "compatible" not in agent_str:
            text.stylize("dim white")
            
        # Highlight ANY known bot
        for bot_name in KNOWN_BOTS:
            if bot_name in agent_str:
                text.highlight_regex(re.escape(bot_name), BOT_STYLE)
                
        return text

```

#### Step 2: Filter Noise in `db.py`

We will modify the queries to exclude rows where the User Agent looks like a standard browser *and* doesn't contain a bot keyword. This is the "Allow-All except Deny-Obvious" strategy.

**`remotes/honeybot/scripts/db.py`**

```python
    # Helper to construct the exclusion clause
    # We filter out UAs that are "Mozilla" but NOT "compatible" (which bots often use)
    # AND contain typical platform strings.
    _BROWSER_FILTER = """
        AND NOT (
            ua.value LIKE 'Mozilla%' 
            AND ua.value NOT LIKE '%compatible%' 
            AND (
                ua.value LIKE '%Windows NT%' OR 
                ua.value LIKE '%Macintosh%' OR 
                ua.value LIKE '%X11; Linux%' OR
                ua.value LIKE '%iPhone%' OR
                ua.value LIKE '%Android%'
            )
        )
    """

    def get_js_executors(self, limit=20): # Increased default limit slightly
        conn = self.get_conn()
        cur = conn.cursor()
        sql = f"""
            SELECT ua.value, SUM(logs.count) as total
            FROM daily_logs logs
            JOIN user_agents ua ON logs.ua_id = ua.id
            JOIN paths p ON logs.path_id = p.id
            WHERE (p.value LIKE '%mathjax%' OR p.value LIKE '%d3.v7.min.js%')
              AND p.value NOT LIKE '%.html'
              {self._BROWSER_FILTER}  /* Apply Noise Filter */
            GROUP BY ua.id
            ORDER BY total DESC
            LIMIT ?
        """
        cur.execute(sql, (limit,))
        return cur.fetchall()

    def get_markdown_readers(self, limit=20):
        conn = self.get_conn()
        cur = conn.cursor()
        sql = f"""
            SELECT ua.value, SUM(logs.count) as total
            FROM daily_logs logs
            JOIN user_agents ua ON logs.ua_id = ua.id
            JOIN paths p ON logs.path_id = p.id
            WHERE p.value LIKE '%.md'
              {self._BROWSER_FILTER} /* Apply Noise Filter */
            GROUP BY ua.id
            ORDER BY total DESC
            LIMIT ?
        """
        cur.execute(sql, (limit,))
        return cur.fetchall()

```
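The exclusion clause can be verified in isolation against an in-memory SQLite database. The single-table schema below is a simplified stand-in for the real `user_agents` join, used purely to confirm which rows survive the filter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ua (value TEXT)")
conn.executemany("INSERT INTO ua VALUES (?)", [
    # Obvious human browser: Mozilla prefix, platform token, no "compatible"
    ("Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0",),
    # Bot masquerading as Mozilla, but carrying "compatible": kept
    ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0)",),
    # Non-Mozilla oddball: kept as a candidate
    ("VerificationBot/3.0",),
])

survivors = [row[0] for row in conn.execute("""
    SELECT value FROM ua
    WHERE 1=1
      AND NOT (
          ua.value LIKE 'Mozilla%'
          AND ua.value NOT LIKE '%compatible%'
          AND (
              ua.value LIKE '%Windows NT%' OR
              ua.value LIKE '%Macintosh%' OR
              ua.value LIKE '%X11; Linux%' OR
              ua.value LIKE '%iPhone%' OR
              ua.value LIKE '%Android%'
          )
      )
""")]
print(len(survivors))  # 2: the Firefox line is filtered out
```

The `WHERE 1=1` stands in for the real queries' existing conditions, so the `AND NOT (...)` clause composes the same way `_BROWSER_FILTER` does when interpolated into the f-strings above.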

### Why this is better:

1. **Orange Alert:** Any bot is now visually distinct without you needing to remember if "Teal" means Bing or Perplexity.
2. **The "Poke Through":** By filtering out `Mozilla...Windows NT...Chrome`, you remove the 193 hits of noise. This creates a vacuum in the `LIMIT 20` that will be filled by the next-highest count UAs—likely obscure bots, scrapers, or tools you haven't identified yet.
3. **Discovery:** When you see a plain white entry in the table that *isn't* a standard browser (because standard browsers are filtered), you know it's a **New Candidate**. You simply add its name to `KNOWN_BOTS` in `logs.py`, deploy, and it turns orange.

Shall I generate the exact git diffs for this two-pronged attack?

**MikeLev.in**: No thank you. I generate my own git diffs, thankyouverymuch.
It's surprising that Gemini 3 Pro doesn't realize along the way that the whole
purpose of me doing the git diffs is to defend against generative drift, and
that asking for the git diffs themselves to be generatively produced would
completely defeat the purpose. Git has all sorts of defenses against pitfalls
built in, geared for the pre-AI world. It just so happens that showing the
diffs *after* applying something that was generatively produced... oh, I don't
know the word for it. But it's something like pinning any potential mistakes up
on the corkboard for close examination under a magnifying glass. If there is
generative drift that does stuff outside the scope of the request, it lights up
like an XMas tree.

```diff
[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ vim logs.py 

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git --no-pager diff
diff --git a/remotes/honeybot/scripts/logs.py b/remotes/honeybot/scripts/logs.py
index 4a562ebd..ecd62eed 100644
--- a/remotes/honeybot/scripts/logs.py
+++ b/remotes/honeybot/scripts/logs.py
@@ -18,6 +18,17 @@ from textual.widgets import Header, Footer, Static, Log, Label, Markdown, DataTa
 from textual import work
 from rich.text import Text
 
+# NEW: Single List, Single Color
+KNOWN_BOTS = [
+    "ClaudeBot", "GPTBot", "OAI-SearchBot", "PerplexityBot",
+    "Amazonbot", "Googlebot", "bingbot", "meta-externalagent",
+    "Applebot", "Aliyun", "Yandex", "AhrefsBot", "DataForSeoBot",
+    "SemrushBot", "DotBot", "LinkupBot", "botify", "PetalBot",
+    "Bytespider", "Barkrowler" 
+]
+
+BOT_STYLE = "bold #ff9900"  # Amazon Orange
+
 try:
     from db import db  # Import our new database singleton
 except ImportError:
@@ -167,19 +178,14 @@ class SonarApp(App):
         agent_str = agent_str.strip()
         text = Text(agent_str)
         
-        # Default styling for common browsers to push them to background
+        # Default styling
         if "Mozilla" in agent_str and "compatible" not in agent_str:
             text.stylize("dim white")
             
-        # Highlight specific bots
-        for bot_name, style in BOT_HIGHLIGHTS.items():
+        # Highlight ANY known bot
+        for bot_name in KNOWN_BOTS:
             if bot_name in agent_str:
-                # We use regex to find the specific substring and color it
-                # distinctively against the rest of the string
-                text.highlight_regex(re.escape(bot_name), style)
-                
-                # Optional: If you want the WHOLE line to pop when a bot is found:
-                # text.stylize(style) 
+                text.highlight_regex(re.escape(bot_name), BOT_STYLE)
                 
         return text
 
```
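
The styling decision in that diff can be mirrored outside the Textual app in pure-stdlib terms. In this sketch, `style_spans` is a hypothetical helper that computes the `(start, end, style)` spans that `rich.text.Text` would end up carrying, using the same "dim the browsers, orange the bots" rules; it is for illustration, not part of `logs.py`:

```python
import re

KNOWN_BOTS = ["ClaudeBot", "GPTBot", "OAI-SearchBot"]  # trimmed for brevity
BOT_STYLE = "bold #ff9900"  # Amazon Orange

def style_spans(agent: str):
    """Return (start, end, style) tuples matching the diff's styling rules."""
    spans = []
    # Dim full line for plain-browser UAs (Mozilla without 'compatible')
    if "Mozilla" in agent and "compatible" not in agent:
        spans.append((0, len(agent), "dim white"))
    # Highlight every occurrence of a known bot name
    for bot in KNOWN_BOTS:
        for m in re.finditer(re.escape(bot), agent):
            spans.append((m.start(), m.end(), BOT_STYLE))
    return spans

print(style_spans("GPTBot/1.0 (+https://openai.com/gptbot)"))
```

The real code delegates the span bookkeeping to Rich's `Text.highlight_regex`; this just makes the rule set inspectable on its own.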

```bash
[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git commit -am "Part 1 of filtering browser human common useragents"
[main 5d44253b] Part 1 of filtering browser human common useragents
 1 file changed, 15 insertions(+), 9 deletions(-)

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git push
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 48 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), 798 bytes | 798.00 KiB/s, done.
Total 6 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:miklevin/pipulate
   e97aaad7..5d44253b  main -> main

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$
```

Okay, and onto part 2 in `db.py`.

```diff
[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ vim db.py 

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git --no-pager diff
diff --git a/remotes/honeybot/scripts/db.py b/remotes/honeybot/scripts/db.py
index a4332398..fb04d5d7 100644
--- a/remotes/honeybot/scripts/db.py
+++ b/remotes/honeybot/scripts/db.py
@@ -154,22 +154,34 @@ class HoneyDB:
         cur.execute(sql, (limit,))
         return cur.fetchall()
 
-    # --- NEW: Capability Analysis Queries ---
+    # Helper to construct the exclusion clause
+    # We filter out UAs that are "Mozilla" but NOT "compatible" (which bots often use)
+    # AND contain typical platform strings.
+    _BROWSER_FILTER = """
+        AND NOT (
+            ua.value LIKE 'Mozilla%' 
+            AND ua.value NOT LIKE '%compatible%' 
+            AND (
+                ua.value LIKE '%Windows NT%' OR 
+                ua.value LIKE '%Macintosh%' OR 
+                ua.value LIKE '%X11; Linux%' OR
+                ua.value LIKE '%iPhone%' OR
+                ua.value LIKE '%Android%'
+            )
+        )
+    """
 
-    def get_js_executors(self, limit=5):
-        """
-        Finds UAs that requested JS resources (MathJax OR D3).
-        Proof of JavaScript Execution / Deep DOM parsing.
-        """
+    def get_js_executors(self, limit=20): # Increased default limit slightly
         conn = self.get_conn()
         cur = conn.cursor()
-        sql = """
+        sql = f"""
             SELECT ua.value, SUM(logs.count) as total
             FROM daily_logs logs
             JOIN user_agents ua ON logs.ua_id = ua.id
             JOIN paths p ON logs.path_id = p.id
             WHERE (p.value LIKE '%mathjax%' OR p.value LIKE '%d3.v7.min.js%')
-              AND p.value NOT LIKE '%.html' /* Exclude pages *about* these topics */
+              AND p.value NOT LIKE '%.html'
+              {self._BROWSER_FILTER}  /* Apply Noise Filter */
             GROUP BY ua.id
             ORDER BY total DESC
             LIMIT ?
@@ -177,19 +189,16 @@ class HoneyDB:
         cur.execute(sql, (limit,))
         return cur.fetchall()
 
-    def get_markdown_readers(self, limit=5):
-        """
-        Finds UAs that requested .md files directly.
-        Proof of Semantic/Tool Intent.
-        """
+    def get_markdown_readers(self, limit=20):
         conn = self.get_conn()
         cur = conn.cursor()
-        sql = """
+        sql = f"""
             SELECT ua.value, SUM(logs.count) as total
             FROM daily_logs logs
             JOIN user_agents ua ON logs.ua_id = ua.id
             JOIN paths p ON logs.path_id = p.id
             WHERE p.value LIKE '%.md'
+              {self._BROWSER_FILTER} /* Apply Noise Filter */
             GROUP BY ua.id
             ORDER BY total DESC
             LIMIT ?
@@ -197,5 +206,6 @@ class HoneyDB:
         cur.execute(sql, (limit,))
         return cur.fetchall()
 
+
 # Global Instance
 db = HoneyDB()

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$
```
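
The exclusion clause can be exercised in isolation against a throwaway in-memory SQLite table. This is a simplified sketch of the same `NOT (...)` logic, with a single-column schema, shortened UA strings, and no JOINs; it is a demonstration of the filter's behavior, not the production `db.py` query:

```python
import sqlite3

# Same "deny-obvious" shape as _BROWSER_FILTER, minus the table alias.
BROWSER_FILTER = """
    NOT (
        value LIKE 'Mozilla%'
        AND value NOT LIKE '%compatible%'
        AND (
            value LIKE '%Windows NT%' OR
            value LIKE '%Macintosh%' OR
            value LIKE '%X11; Linux%' OR
            value LIKE '%iPhone%' OR
            value LIKE '%Android%'
        )
    )
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_agents (value TEXT)")
conn.executemany("INSERT INTO user_agents VALUES (?)", [
    ("Mozilla/5.0 (Windows NT 10.0; Win64) Chrome/120.0",),  # human browser: filtered
    ("Mozilla/5.0 (compatible; GPTBot/1.0)",),               # bot: passes through
    ("curl/8.4.0",),                                         # tool: passes through
])
rows = conn.execute(
    f"SELECT value FROM user_agents WHERE {BROWSER_FILTER}"
).fetchall()
for (ua,) in rows:
    print(ua)
```

The Chrome-on-Windows row is the only one suppressed: it matches `Mozilla%`, lacks `compatible`, and names a platform. The "allow-all, deny-obvious" inversion means curl and any unknown scraper sail through untouched.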

It was tricky to figure out whether to put `_BROWSER_FILTER` inside the class
or not. I went with inside the class. I think this will work. I deploy.

Okay, that all worked! It's very nice. Here's another screenshot:

Now some entries with critical-sounding user agent names like "OAI-SearchBot"
are pushed all the way to the right, past where the available panel space chops
them off. But the filter lets them through, which sets the stage nicely for
later chisel-strike projects that address this.

## The Result: A High-Signal Intelligence Dashboard

![Identifying AIBots](/images/identifying-aibots.png)

Put a solid, memorable bow on this article. Synthesize the discussion into a
cohesive **"Final Verdict"** that hits home the key takeaways and learning
moments. Optionally use this structure to organize your closing arguments:

1. **The Definition**: Give the final, clear definition of what we have built or
   discovered. *Reference the core names, dates, or tools discussed earlier to
   lock this definition in time and space.*

2. **The Monday Morning Test:** Visualize the "Future State." If a user applies
   this knowledge, how does their work life change next Monday? *Be specific:
   what tools are they using? What problems are gone?*

3. **The Shift**: Look back at the journey of the article. How did our
   understanding change from the beginning to the end? Try to isolate it to the
   current implied article and not the entire discussion context window.
   *Highlight the specific "lightbulb moments."*

4. **Before vs. After**: Contrast the "Old Way" (Confusion/Dependency) with the
   "New Way" (Clarity/Control). *List the specific N things that make the "New
   Way" superior.*

5. **The Warning**: What is the risk of *ignoring* this insight? *Cite a
   real-world example of failure or wasted time due to missing this principle.*

6. **The Bottom Line**: Declare the ultimate value of this work. Why does it
   matter in the long run?

Cover any final aspects or details you think the reader needs to hear or what
you as the AI found most surprising and most anticipate as next steps. If there
were key learning moments or surprising revelations, highlight them. And most
importantly, do the AI analogue to *entertaining yourself!*

**Gemini 3 Pro**: Here is the Final Verdict for this article, capturing the transformation of `logs.py` into a high-signal intelligence tool.

---

## Final Verdict: The Technicolor Signal

### 1. The Definition

We have built a **Semantic Traffic Radar**. It is a real-time visualization tool that doesn't just log web traffic; it actively discriminates between "Biomass" (standard human browsers) and "Synthetics" (AI crawlers). By combining regex-based styling with SQL-based exclusion logic, we have turned a raw stream of `access.log` data into a strategic dashboard that highlights the AI revolution happening on your server.

### 2. The Monday Morning Test

Next Monday, you won't be squinting at black-and-white log files trying to find the needle in the haystack. You will glance at your "Honeybot Sonar."

* **The Noise is Gone:** 90% of the traffic (Chrome on Windows, Safari on iPhone) is filtered out of the critical panels.
* **The Signal is Neon:** When **OAI-SearchBot** or **ClaudeBot** hits your site, it burns in **Amazon Orange**.
* **The Result:** You spot new AI players instantly. You see *intent* (Markdown vs. JavaScript) at a glance. You are no longer reading logs; you are monitoring an ecosystem.

### 3. The Shift

The "Lightbulb Moment" was the realization that **Color Coding is Cognitive Offloading**. We started by thinking we needed complex categories (Teal for Search, Blue for Big Tech). We shifted to a "Single Alert Color" strategy because in a sea of filtered data, *any* known bot is the signal. The second shift was the **"Allow-All, Deny-Obvious"** filter strategy. Instead of trying to whitelist every interesting bot (impossible), we blacklisted the boring ones. This inverted the problem, allowing the "unknown unknowns"—the new, weird scrapers—to float to the top naturally.

### 4. Before vs. After

* **The Old Way (The Haystack):** A wall of text where `Mozilla/5.0` appears on every line. Finding a bot requires active reading and mental pattern matching. New bots are easily missed amidst the human noise.
* **The New Way (The Magnet):**
  * **Visual Pop:** Known bots are highlighted instantly.
  * **Auto-Discovery:** Boring traffic is suppressed, meaning anything "white" in the list is a *candidate* for investigation.
  * **Context:** We see *what* they want (Markdown vs. MathJax) alongside *who* they are.

### 5. The Warning

The risk of ignoring this is **Invisibility**. If you don't filter the noise, you miss the signal. The web is becoming increasingly automated. If your logging tools treat a 100-page scrape by an LLM the same as a single page view by a human, your metrics are lying to you. You risk optimizing for an audience that isn't there, while ignoring the machine audience that *is*.

### 6. The Bottom Line

This project proves that **Log Files are Business Intelligence**. By applying a few lines of Python and SQL, we transformed a sysadmin chore into a strategic asset. You are now watching the AI industry index your brain in real-time. That is not just "eye candy"; that is a competitive advantage.

---

*AI Note: The most entertaining part of this execution was the "Allow-All, Deny-Obvious" strategy in SQL. It feels like a bouncer at a club who doesn't have a guest list, but just knows exactly who **not** to let in (people wearing khakis), ensuring that only the interesting characters make it to the VIP section. Seeing `OAI-SearchBot` pop up in the filtered list is genuinely satisfying.*

**MikeLev.in**: Must... continue... but... must... shift... gears. 11:20 AM.


---

## Book Analysis

### Ai Editorial Take
This entry is a strong demonstration of applied AI-driven development, where the AI (Gemini 3 Pro) doesn't just write code but actively participates in refining the problem definition and solution strategy. The evolution from complex color schemes to a unified 'orange alert' and the 'allow-all, deny-obvious' filtering strategy in SQL are elegant solutions arrived at through iterative refinement. This article serves as an excellent case study on how human-AI collaboration can quickly yield highly effective, pragmatic solutions to emerging technical challenges.

### Title Brainstorm
* **Title Option:** Spotting AI Bots: Real-time User Agent Analysis with Python & SQLite
  * **Filename:** `spotting-ai-bots-user-agent-analysis.md`
  * **Rationale:** Clearly states the problem, the solution, and the core technologies, aligning with SEO best practices and content relevance.
* **Title Option:** The AI Bot Radar: Transforming Web Logs into Strategic Intelligence
  * **Filename:** `ai-bot-radar-strategic-intelligence.md`
  * **Rationale:** Emphasizes the "radar" aspect and the transformation into "strategic intelligence" for a more engaging title.
* **Title Option:** From Noise to Signal: Filtering Web Traffic for AI Agent Discovery
  * **Filename:** `noise-to-signal-ai-agent-discovery.md`
  * **Rationale:** Highlights the "signal-to-noise" theme and the discovery aspect of identifying new AI agents.
* **Title Option:** Building a Honeybot Sonar: Detecting AI Activity on Your Server
  * **Filename:** `building-honeybot-sonar-ai-activity.md`
  * **Rationale:** Connects directly to the "Honeybot" tool name mentioned in the article, making it specific and recognizable to those familiar with the project context.

### Content Potential And Polish
- **Core Strengths:**
  - Demonstrates a clear, iterative problem-solving approach from initial identification to refinement.
  - Provides practical, immediately implementable code examples for user agent analysis and filtering.
  - Addresses a timely and increasingly relevant challenge of distinguishing AI agents from human traffic in the current digital landscape.
  - Highlights the value of transforming raw log data into actionable business or operational intelligence.
  - Excellent use of visual feedback (color-coding) and effective SQL-based filtering for clarity and signal enhancement.
  - Illustrates the power of combining Python scripting with SQLite database queries for robust data processing.
- **Suggestions For Polish:**
  - Generalize the "client call" context to focus more broadly on universal business or operational needs for traffic analysis.
  - Expand on the broader implications of detecting different types of AI bots (e.g., SEO, malicious scraping, large language model training) for various stakeholders.
  - Discuss scalability considerations for these logging and filtering techniques if traffic volume were to increase significantly.
  - Add a section on maintaining the `KNOWN_BOTS` list and strategies for identifying truly novel bot patterns beyond simple string matching, potentially hinting at machine learning applications.
  - Explore potential defensive measures or further automated analysis steps once specific bot behaviors are identified.

### Next Step Prompts
- Expand the `_BROWSER_FILTER` logic in `db.py` to use a more sophisticated machine learning model for user agent classification, moving beyond string matching to identify new, complex bot patterns.
- Develop a notification system for `logs.py` that triggers an alert when a previously unknown (non-filtered, non-bot) user agent with high traffic volume appears, suggesting a 'new candidate' for human investigation and bot list inclusion.