Spotting AI Bots: Real-time User Agent Analysis with Python & SQLite

Setting the Stage: Context for the Curious Book Reader

This essay presents a methodology for understanding the digital frontier in the Age of AI: discerning automated machine traffic from human interaction on web servers. As AI systems become more prevalent, their digital footprints mingle with those of human users, creating a deceptive blur in analytics. Here, we lay out a practical way to refine raw log data into actionable intelligence, so you can clearly see the emerging landscape of AI agents interacting with your content and services. This hands-on blueprint demonstrates how to cut through the noise, making the invisible visible and providing a crucial edge in monitoring the evolving web.


Technical Journal Entry Begins

Alright, we rolled past 10:00 AM on a Tuesday morning with the client call at 3:00 PM, and we will switch modes by 11:00 AM from the couple of queued-up sub-projects that are each of the chisel-strike, non-rabbithole variety that change my developer life in a positive way forever forward. Actually, the first is a bit of eye-candy to make a point concerning my recent discoveries about crawlers and bots visiting your sites, AI-bots in particular. Let’s see how light-touch I can make this and how quickly I can slam out this big win. I characterized it as eye-candy but it’s really much more.

The Immediate Goal: Spotting AI-Bot Traffic

I’ll recycle the last Gemini 3 Pro discussion thread I’ve been using so that it will have all the context leading up to where we are. But I simply want to color-code the occurrences of bot names such as ClaudeBot, bingbot and the rest from a known list, knowable probably even from the discussion context so far without me having to spell them out. However, what I do need to do is show Gemini the reporting system and YouTube streaming system that’s showing the current logs.py report with color coding very similar to what I’m looking for, but on weblog fields. This time I want to simply color-highlight the known bot names within the UserAgent strings that we’re showing in the two lower panel reports labeled “JAVASCRIPT EXECUTORS” and “SEMANTIC READERS”.

User Agent Logs: A Glimpse into Who’s Visiting

See, it is as simple as that. When you see something like:

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                  ⚡ JAVASCRIPT EXECUTORS (MathJax Resource Fetch)                                                   │
│ Hits  Agent                                                                                                                                         │
│ 193   Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0                                                                        │
│ 11    Mozilla/5.0 Mobile (compatible; botify; http://botify.com)                                                                                    │
│ 10    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36                               │
│ 9     Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36                                         │
│ 5     Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)                                       │
│ 4     Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)                                                                            │
│ 2     Mozilla/5.0 (iPhone; CPU iPhone OS 18_6_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.6 Mobile/15E148 Safari/604.1     │
│ 2     Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36                               │
│ 2     Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36                         │
│ 2     Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36                               │
│ 2     meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)                                                     │
│ 1     Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.33 Safari/537.36                     │
│ 1     Mozilla/5.0 (X11; U; Linux x86_64; nb-NO; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.04 (lucid) Firefox/3.6.13                                     │
│ 1     Mozilla/5.0 (Linux; Android 11; V2055A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Mobile Safari/537.36                      │
│ 1     VerificationBot/3.0                                                                                                                           │
│ 1     Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)                                                           │
│ 1     Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/141.0.7390.122 Safa...  │
│ 1     Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36                                         │
│ 1     Mozilla/5.0 (X11; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0                                                                        │
│ 1     Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36                               │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
                                                                                                                                                       
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                     🧠 SEMANTIC READERS (Source Markdown Fetch)                                                     │
│ Hits  Agent                                                                                                                                         │
│ 345   Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119...  │
│ 109   Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)                                       │
│ 48    meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)                                                     │
│ 40    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36                         │
│ 17    Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)                                        │
│ 14    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-Se...  │
│ 6     Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1   │
│ 5     Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)                                                                              │
│ 3     AliyunSecBot/Aliyun (AliyunSecBot@service.alibaba.com)                                                                                        │
│ 3     Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari...  │
│ 1     Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)                                                           │
│ 1     Mozilla/5.0 (compatible; MSIE 7.0; Windows 98; Trident/3.0)                                                                                   │
│ 1     Mozilla/5.0 (Macintosh; Intel Mac OS X 12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1264.71          │
│ 1     Mozilla/5.0 (Windows; U; Windows NT 5.0) AppleWebKit/531.35.5 (KHTML, like Gecko) Version/4.0.5 Safari/531.35.5                               │
│ 1     Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/535.1 (KHTML, like Gecko) CriOS/44.0.867.0 Mobile/10E039 Safari/535.1      │
│ 1     Opera/9.71.(X11; Linux x86_64; am-ET) Presto/2.9.172 Version/10.00                                                                            │
│ 1     Opera/9.27.(X11; Linux x86_64; mag-IN) Presto/2.9.175 Version/12.00                                                                           │
│ 1     Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/5.0)                                                                               │
│ 1     Mozilla/5.0 (iPhone; CPU iPhone OS 1_1_5 like Mac OS X) AppleWebKit/533.0 (KHTML, like Gecko) CriOS/36.0.843.0 Mobile/13O940 Safari/533.0     │
│ 1     Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 OPR/89.0.4447.51                        │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

…what you do is you highlight words such as Amazonbot and anything else that’s consistent with what you know I’m looking for here, because you’re not stupid and can refer to things in your already existing training data, or could even perform web searches to ferret up more possibilities or a good list. And then the only other stuff you really need at this point is the codebase itself that you’ll be providing the implementation instructions for.
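
To make concrete what I mean by highlighting the bot names, here’s a minimal sketch of the matching idea in isolation, assuming a hand-maintained token list (the real palette and styling come out of the implementation plan below):

KNOWN_BOT_TOKENS = ["ClaudeBot", "GPTBot", "Amazonbot", "bingbot", "Googlebot"]

def find_bot_tokens(user_agent: str) -> list[str]:
    """Return every known bot token present in a raw UserAgent string."""
    return [token for token in KNOWN_BOT_TOKENS if token in user_agent]

# Prints: ['ClaudeBot']
print(find_bot_tokens(
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
))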

So the whole process is pushed out to the webserver with:

deploy_honeybot.sh

…which is what lets me keep all the stuff about the DMZ honeybot nginx webhead webserver actually in the pipulate/ repo even though the blog content is coming from the trimnoir/ repo. The IaC stuff is mostly defined by Pipulate with:

remotes/honeybot/nixos/configuration.nix

…which the deploy file puts in place along with everything from:

remotes/honeybot/scripts/

And so many things from that directory get put into the painting of the Prompt Fu context here, namely:

remotes/honeybot/scripts/content_loader.py
remotes/honeybot/scripts/stream.py
remotes/honeybot/scripts/db.py
remotes/honeybot/scripts/report.py
remotes/honeybot/scripts/logs.py

…which pretty much covers everything I think Gemini needs to see to help me with the implementation plan. And so all together, that’s:

Assembling the Codebase Context for AI

AI_PHOOEY_CHOP = """\
deploy_honeybot.sh
remotes/honeybot/nixos/configuration.nix
remotes/honeybot/scripts/content_loader.py
remotes/honeybot/scripts/stream.py
remotes/honeybot/scripts/db.py
remotes/honeybot/scripts/report.py
remotes/honeybot/scripts/logs.py
"""

And after such exposition, including this article, the overflow in the context window of Gemini from the prior articles, and the complete codebase inclusion we just effected, I hardly even need to articulate any more of this request. Gemini has all it needs to continue providing a detailed implementation plan for me to get the color-coded UserAgents like I want on logs.py, except maybe a please and thank you!

Refining the Signal: Filtering Human Noise and Simplifying the Palette

Gemini 3 Pro: This is the perfect “Chisel Strike” project. It’s high-impact visual feedback with low-risk code changes. We are simply going to teach logs.py how to dress up the data it’s already fetching.

Phase One: “Technicolor Intelligence” via Color-Coded User Agents

Here is the implementation plan to bring “Technicolor Intelligence” to your dashboards.

The Plan

  1. Define the Palette: Create a mapping of bot substrings to Rich styles (colors).
  2. Create the Stylist: Add a helper function stylize_agent(agent_string) that converts raw strings into rich.Text objects with highlighting applied.
  3. Apply the Makeup: Update refresh_tables to use this new function instead of printing raw strings.

The Code

You only need to edit remotes/honeybot/scripts/logs.py.

1. Add the Bot Configuration

Add this near the top of logs.py, right after the LOG_PATTERN definition. This is your “Registry of Known Entities.”

# ... existing imports ...

# --- Configuration ---
ANSI_ESCAPE = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
LOG_PATTERN = re.compile(r'(?P<ip>[\d\.]+) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<ua>.*?)"')

# NEW: The Bot Highlighting Palette
BOT_HIGHLIGHTS = {
    # AI & LLMs
    "ClaudeBot": "bold #d97757",       # Anthropic Clay
    "GPTBot": "bold #10a37f",          # OpenAI Green
    "OAI-SearchBot": "bold #10a37f",   # OpenAI Green
    "PerplexityBot": "bold #22b5bf",   # Perplexity Teal
    
    # Big Tech / Cloud
    "Amazonbot": "bold #ff9900",       # Amazon Orange
    "Googlebot": "bold #4285f4",       # Google Blue
    "bingbot": "bold #008373",         # Bing Teal
    "meta-externalagent": "bold #0668e1", # Meta Blue
    "Applebot": "bold #A2AAAD",        # Apple Silver
    "Aliyun": "bold #ff6a00",          # Alibaba Orange
    
    # Search & SEO
    "Yandex": "bold #fc3f1d",          # Yandex Red
    "AhrefsBot": "bold #fd6917",       # Ahrefs Orange
    "DataForSeoBot": "bold magenta",
    "SemrushBot": "bold orange1",
    "DotBot": "bold green",
    "LinkupBot": "bold cyan",
}

2. Add the Helper Function

Add this method inside the SonarApp class. I recommend placing it right before refresh_tables for logical grouping.

    def stylize_agent(self, agent_str):
        """Converts a raw UA string into a Rich Text object with highlights."""
        agent_str = agent_str.strip()
        text = Text(agent_str)
        
        # Default styling for common browsers to push them to background
        if "Mozilla" in agent_str and "compatible" not in agent_str:
            text.stylize("dim white")
            
        # Highlight specific bots
        for bot_name, style in BOT_HIGHLIGHTS.items():
            if bot_name in agent_str:
                # We use regex to find the specific substring and color it
                # distinctively against the rest of the string
                text.highlight_regex(re.escape(bot_name), style)
                
                # Optional: If you want the WHOLE line to pop when a bot is found:
                # text.stylize(style) 
                
        return text
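
If you want to sanity-check the stylist before wiring it into Textual, a quick standalone preview works in any terminal. This sketch assumes only that rich is installed and inlines a two-entry subset of the palette:

import re
from rich.console import Console
from rich.text import Text

PALETTE_SUBSET = {"ClaudeBot": "bold #d97757", "Amazonbot": "bold #ff9900"}

console = Console()
for ua in [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0",
]:
    text = Text(ua)
    if "Mozilla" in ua and "compatible" not in ua:
        text.stylize("dim white")          # plain browsers fade into the background
    for bot_name, style in PALETTE_SUBSET.items():
        if bot_name in ua:
            text.highlight_regex(re.escape(bot_name), style)
    console.print(text)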

3. Update refresh_tables

Modify your existing refresh_tables method to call the stylist.

    def refresh_tables(self):
        """Updates both Intelligence tables from DB."""
        if not db: return
        
        # 1. Update JS Executors (Left)
        try:
            table = self.query_one("#js_table", DataTable)
            table.clear()
            # Fetch plenty, let UI clip
            data = db.get_js_executors(limit=50) 
            if not data:
                table.add_row("-", "Waiting for data...")
            else:
                for ua, count in data:
                    # CHANGED: Wrap in stylize_agent
                    table.add_row(str(count), self.stylize_agent(ua))
        except: pass

        # 2. Update Markdown Readers (Right)
        try:
            table = self.query_one("#md_table", DataTable)
            table.clear()
            # Fetch plenty, let UI clip
            data = db.get_markdown_readers(limit=50)
            if not data:
                table.add_row("-", "Waiting for data...")
            else:
                for ua, count in data:
                    # CHANGED: Wrap in stylize_agent
                    table.add_row(str(count), self.stylize_agent(ua))
        except: pass

The Result

When you deploy this (./deploy_honeybot.sh), the next time you look at the tables:

  1. Amazonbot will burn in Amazon Orange.
  2. ClaudeBot will stand out in Anthropic Clay.
  3. Googlebot (if it ever shows up in the Markdown column) will be Google Blue.
  4. Standard user traffic will fade into dim white, effectively creating a visual filter that pops the signal (bots) from the noise (browsers).

This is a classic “dashboarding” win: increasing information density without adding clutter. Good luck on the 3:00 PM call!

Me: And that gives us this diff:

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ vim logs.py 
[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git --no-pager diff
diff --git a/remotes/honeybot/scripts/logs.py b/remotes/honeybot/scripts/logs.py
index 4afdf157..4a562ebd 100644
--- a/remotes/honeybot/scripts/logs.py
+++ b/remotes/honeybot/scripts/logs.py
@@ -27,6 +27,37 @@ except ImportError:
 ANSI_ESCAPE = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
 LOG_PATTERN = re.compile(r'(?P<ip>[\d\.]+) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<ua>.*?)"')
 
+# ... existing imports ...
+
+# --- Configuration ---
+ANSI_ESCAPE = re.compile(r'(?:\x1B[@-_]|[\x80-\x9F])[0-?]*[ -/]*[@-~]')
+LOG_PATTERN = re.compile(r'(?P<ip>[\d\.]+) - - \[(?P<time>.*?)\] "(?P<request>.*?)" (?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>.*?)" "(?P<ua>.*?)"')
+
+# NEW: The Bot Highlighting Palette
+BOT_HIGHLIGHTS = {
+    # AI & LLMs
+    "ClaudeBot": "bold #d97757",       # Anthropic Clay
+    "GPTBot": "bold #10a37f",          # OpenAI Green
+    "OAI-SearchBot": "bold #10a37f",   # OpenAI Green
+    "PerplexityBot": "bold #22b5bf",   # Perplexity Teal
+    
+    # Big Tech / Cloud
+    "Amazonbot": "bold #ff9900",       # Amazon Orange
+    "Googlebot": "bold #4285f4",       # Google Blue
+    "bingbot": "bold #008373",         # Bing Teal
+    "meta-externalagent": "bold #0668e1", # Meta Blue
+    "Applebot": "bold #A2AAAD",        # Apple Silver
+    "Aliyun": "bold #ff6a00",          # Alibaba Orange
+    
+    # Search & SEO
+    "Yandex": "bold #fc3f1d",          # Yandex Red
+    "AhrefsBot": "bold #fd6917",       # Ahrefs Orange
+    "DataForSeoBot": "bold magenta",
+    "SemrushBot": "bold orange1",
+    "DotBot": "bold green",
+    "LinkupBot": "bold cyan",
+}
+
 class SonarApp(App):
     """The Cybernetic HUD (Dual-Panel Edition)."""
 
@@ -131,6 +162,27 @@ class SonarApp(App):
         self.set_interval(1, self.update_countdown)
         self.set_interval(5, self.refresh_tables) # Refresh data every 5s
 
+    def stylize_agent(self, agent_str):
+        """Converts a raw UA string into a Rich Text object with highlights."""
+        agent_str = agent_str.strip()
+        text = Text(agent_str)
+        
+        # Default styling for common browsers to push them to background
+        if "Mozilla" in agent_str and "compatible" not in agent_str:
+            text.stylize("dim white")
+            
+        # Highlight specific bots
+        for bot_name, style in BOT_HIGHLIGHTS.items():
+            if bot_name in agent_str:
+                # We use regex to find the specific substring and color it
+                # distinctively against the rest of the string
+                text.highlight_regex(re.escape(bot_name), style)
+                
+                # Optional: If you want the WHOLE line to pop when a bot is found:
+                # text.stylize(style) 
+                
+        return text
+
     def refresh_tables(self):
         """Updates both Intelligence tables from DB."""
         if not db: return
@@ -145,7 +197,8 @@ class SonarApp(App):
                 table.add_row("-", "Waiting for data...")
             else:
                 for ua, count in data:
-                    table.add_row(str(count), ua.strip())
+                    # CHANGED: Wrap in stylize_agent
+                    table.add_row(str(count), self.stylize_agent(ua))
         except: pass
 
         # 2. Update Markdown Readers (Right)
@@ -158,7 +211,8 @@ class SonarApp(App):
                 table.add_row("-", "Waiting for data...")
             else:
                 for ua, count in data:
-                    table.add_row(str(count), ua.strip())
+                    # CHANGED: Wrap in stylize_agent
+                    table.add_row(str(count), self.stylize_agent(ua))
         except: pass
 
     def update_countdown(self):

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$

And I deploy. And here we are spotting AI Bot UserAgents:

Spotting AI Bot UserAgents

Wow, this went very well. The first thing that occurs to me now is that there’s too much ongoing friction in having to assign colors to UserAgents as we identify them. It’s fine for them all to be highlighted in the same color. I don’t want to have to play that mental game of this-is-this-color and that-is-that-color. Just choose the best color for all of them. Looking at it, I think the bright orange you used for Amazonbot is best.

The other thing that I think is important at this time is to actually filter out things that are absolutely, undeniably just mere web browsers in the hands of a human. We don’t have that many lines to display the things of interest, the bots, so we certainly want to filter out lines like:

193   Mozilla/5.0 (X11; Linux x86_64; rv:146.0) Gecko/20100101 Firefox/146.0
10    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36

…from showing. Everything about those strings and the counts screams human-with-browser. However, I appreciate there is a deny-all-first versus an allow-all-first issue here and I don’t want to over-filter. We should probably come up with some technique that favors letting things show, so that all the usual suspects concentrate up near the top and things that poke through from the lower regions are candidates for consideration with the bot list. That way the friction is spent on the good stuff. We see a new bot candidate. We either filter it from further display or add it to the list of those that become color-coded.

I am also aware that the new candidates may be appearing in lower lines that this logs.py report chops off, but that’s fine because I’m always working on other non-displaying reports (report.py and the like) to look farther down in these lists. Please help me make the next round of improvements this describes, and I welcome you thinking through the filtering parts that I’m not sure of the best solution for, please and thank you.

Gemini 3 Pro: This is a classic “Signal-to-Noise” refinement. You have colorized the signal, but the noise (standard browsers) is still taking up valuable screen real estate.

The Strategy

  1. Simplify the Palette: Instead of a rainbow of specific bot colors, we will use a single, high-contrast “Bot Orange” (bold #ff9900) for everything in the known list. This removes the cognitive load of color-matching.
  2. Filter the Noise: We will modify the SQL queries in db.py to actively exclude known “boring” browser signatures. This is safer than trying to filter in the UI because it frees up the LIMIT slots for interesting, long-tail traffic.
  3. The “Candidate” Approach: You are right; we want to see the weird stuff. By filtering out the obviously boring stuff (Chrome on Windows, Firefox on Linux), the “weird stuff” (new bots, scrapers, unusual devices) will naturally float up into the visible table slots.

Implementation Plan

We need to touch two files:

  1. remotes/honeybot/scripts/logs.py: Simplify the color logic.
  2. remotes/honeybot/scripts/db.py: Add SQL filtering to exclude standard browsers.

Step 1: Simplify Colors in logs.py

We will replace the complex BOT_HIGHLIGHTS dict with a simple list of known bots, and apply a uniform style.

remotes/honeybot/scripts/logs.py

# ... imports ...

# NEW: Single List, Single Color
KNOWN_BOTS = [
    "ClaudeBot", "GPTBot", "OAI-SearchBot", "PerplexityBot",
    "Amazonbot", "Googlebot", "bingbot", "meta-externalagent",
    "Applebot", "Aliyun", "Yandex", "AhrefsBot", "DataForSeoBot",
    "SemrushBot", "DotBot", "LinkupBot", "botify", "PetalBot",
    "Bytespider", "Barkrowler" 
]

BOT_STYLE = "bold #ff9900"  # Amazon Orange

# ... existing code ...

    def stylize_agent(self, agent_str):
        """Converts a raw UA string into a Rich Text object with highlights."""
        agent_str = agent_str.strip()
        text = Text(agent_str)
        
        # Default styling
        if "Mozilla" in agent_str and "compatible" not in agent_str:
            text.stylize("dim white")
            
        # Highlight ANY known bot
        for bot_name in KNOWN_BOTS:
            if bot_name in agent_str:
                text.highlight_regex(re.escape(bot_name), BOT_STYLE)
                
        return text

Step 2: Filter Noise in db.py

We will modify the queries to exclude rows where the User Agent looks like a standard browser and doesn’t contain a bot keyword. This is the “Allow-All except Deny-Obvious” strategy.

remotes/honeybot/scripts/db.py

    # Helper to construct the exclusion clause
    # We filter out UAs that are "Mozilla" but NOT "compatible" (which bots often use)
    # AND contain typical platform strings.
    _BROWSER_FILTER = """
        AND NOT (
            ua.value LIKE 'Mozilla%' 
            AND ua.value NOT LIKE '%compatible%' 
            AND (
                ua.value LIKE '%Windows NT%' OR 
                ua.value LIKE '%Macintosh%' OR 
                ua.value LIKE '%X11; Linux%' OR
                ua.value LIKE '%iPhone%' OR
                ua.value LIKE '%Android%'
            )
        )
    """

    def get_js_executors(self, limit=20): # Increased default limit slightly
        conn = self.get_conn()
        cur = conn.cursor()
        sql = f"""
            SELECT ua.value, SUM(logs.count) as total
            FROM daily_logs logs
            JOIN user_agents ua ON logs.ua_id = ua.id
            JOIN paths p ON logs.path_id = p.id
            WHERE (p.value LIKE '%mathjax%' OR p.value LIKE '%d3.v7.min.js%')
              AND p.value NOT LIKE '%.html'
              {self._BROWSER_FILTER}  /* Apply Noise Filter */
            GROUP BY ua.id
            ORDER BY total DESC
            LIMIT ?
        """
        cur.execute(sql, (limit,))
        return cur.fetchall()

    def get_markdown_readers(self, limit=20):
        conn = self.get_conn()
        cur = conn.cursor()
        sql = f"""
            SELECT ua.value, SUM(logs.count) as total
            FROM daily_logs logs
            JOIN user_agents ua ON logs.ua_id = ua.id
            JOIN paths p ON logs.path_id = p.id
            WHERE p.value LIKE '%.md'
              {self._BROWSER_FILTER} /* Apply Noise Filter */
            GROUP BY ua.id
            ORDER BY total DESC
            LIMIT ?
        """
        cur.execute(sql, (limit,))
        return cur.fetchall()
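
Before deploying, the filter clause is easy to exercise in isolation. The sketch below assumes the simplified three-table schema implied by the queries (the real db.py schema may carry more columns) and seeds one obvious browser plus one bot:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE user_agents (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE paths (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE daily_logs (ua_id INT, path_id INT, count INT);
    INSERT INTO user_agents VALUES
        (1, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36'),
        (2, 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)');
    INSERT INTO paths VALUES (1, '/some-article/index.md');
    INSERT INTO daily_logs VALUES (1, 1, 193), (2, 1, 109);
""")

BROWSER_FILTER = """
    AND NOT (
        ua.value LIKE 'Mozilla%'
        AND ua.value NOT LIKE '%compatible%'
        AND (ua.value LIKE '%Windows NT%' OR ua.value LIKE '%Macintosh%'
             OR ua.value LIKE '%X11; Linux%' OR ua.value LIKE '%iPhone%'
             OR ua.value LIKE '%Android%')
    )
"""

rows = cur.execute(f"""
    SELECT ua.value, SUM(logs.count) AS total
    FROM daily_logs logs
    JOIN user_agents ua ON logs.ua_id = ua.id
    JOIN paths p ON logs.path_id = p.id
    WHERE p.value LIKE '%.md' {BROWSER_FILTER}
    GROUP BY ua.id ORDER BY total DESC
""").fetchall()
print(rows)  # Only the ClaudeBot row should survive the noise filter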

Why this is better:

  1. Orange Alert: Any bot is now visually distinct without you needing to remember if “Teal” means Bing or Perplexity.
  2. The “Poke Through”: By filtering out Mozilla...Windows NT...Chrome, you remove the 193 hits of noise. This creates a vacuum in the LIMIT 20 that will be filled by the next-highest count UAs—likely obscure bots, scrapers, or tools you haven’t identified yet.
  3. Discovery: When you see a plain white entry in the table that isn’t a standard browser (because standard browsers are filtered), you know it’s a New Candidate. You simply add its name to KNOWN_BOTS in logs.py, deploy, and it turns orange.

Shall I generate the exact git diffs for this two-pronged attack?

Me: No thank you. I generate my own git diffs, thankyouverymuch. It’s surprising how Gemini 3 Pro doesn’t just realize along the way that the purpose of me doing the git diffs is to defend against generative drift; asking for the git diffs to be generatively produced would completely defeat the purpose. Git has all sorts of defenses against pitfalls built in, geared for the pre-AI world. It just so happens that showing the diffs after applying something that was generatively produced… oh, I don’t know the word for it. But it’s something like pinning up any potential mistakes onto the corkboard for close examination under a magnifying glass. If there is generative drift that does stuff outside the scope of the request, it lights up like an Xmas tree.

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ vim logs.py 

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git --no-pager diff
diff --git a/remotes/honeybot/scripts/logs.py b/remotes/honeybot/scripts/logs.py
index 4a562ebd..ecd62eed 100644
--- a/remotes/honeybot/scripts/logs.py
+++ b/remotes/honeybot/scripts/logs.py
@@ -18,6 +18,17 @@ from textual.widgets import Header, Footer, Static, Log, Label, Markdown, DataTa
 from textual import work
 from rich.text import Text
 
+# NEW: Single List, Single Color
+KNOWN_BOTS = [
+    "ClaudeBot", "GPTBot", "OAI-SearchBot", "PerplexityBot",
+    "Amazonbot", "Googlebot", "bingbot", "meta-externalagent",
+    "Applebot", "Aliyun", "Yandex", "AhrefsBot", "DataForSeoBot",
+    "SemrushBot", "DotBot", "LinkupBot", "botify", "PetalBot",
+    "Bytespider", "Barkrowler" 
+]
+
+BOT_STYLE = "bold #ff9900"  # Amazon Orange
+
 try:
     from db import db  # Import our new database singleton
 except ImportError:
@@ -167,19 +178,14 @@ class SonarApp(App):
         agent_str = agent_str.strip()
         text = Text(agent_str)
         
-        # Default styling for common browsers to push them to background
+        # Default styling
         if "Mozilla" in agent_str and "compatible" not in agent_str:
             text.stylize("dim white")
             
-        # Highlight specific bots
-        for bot_name, style in BOT_HIGHLIGHTS.items():
+        # Highlight ANY known bot
+        for bot_name in KNOWN_BOTS:
             if bot_name in agent_str:
-                # We use regex to find the specific substring and color it
-                # distinctively against the rest of the string
-                text.highlight_regex(re.escape(bot_name), style)
-                
-                # Optional: If you want the WHOLE line to pop when a bot is found:
-                # text.stylize(style) 
+                text.highlight_regex(re.escape(bot_name), BOT_STYLE)
                 
         return text
 
[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git commit -am "Part 1 of filtering browser human common useragents"
[main 5d44253b] Part 1 of filtering browser human common useragents
 1 file changed, 15 insertions(+), 9 deletions(-)

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git push
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 48 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), 798 bytes | 798.00 KiB/s, done.
Total 6 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To github.com:miklevin/pipulate
   e97aaad7..5d44253b  main -> main

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$

Okay, and onto part 2 in db.py.

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ vim db.py 

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$ git --no-pager diff
diff --git a/remotes/honeybot/scripts/db.py b/remotes/honeybot/scripts/db.py
index a4332398..fb04d5d7 100644
--- a/remotes/honeybot/scripts/db.py
+++ b/remotes/honeybot/scripts/db.py
@@ -154,22 +154,34 @@ class HoneyDB:
         cur.execute(sql, (limit,))
         return cur.fetchall()
 
-    # --- NEW: Capability Analysis Queries ---
+    # Helper to construct the exclusion clause
+    # We filter out UAs that are "Mozilla" but NOT "compatible" (which bots often use)
+    # AND contain typical platform strings.
+    _BROWSER_FILTER = """
+        AND NOT (
+            ua.value LIKE 'Mozilla%' 
+            AND ua.value NOT LIKE '%compatible%' 
+            AND (
+                ua.value LIKE '%Windows NT%' OR 
+                ua.value LIKE '%Macintosh%' OR 
+                ua.value LIKE '%X11; Linux%' OR
+                ua.value LIKE '%iPhone%' OR
+                ua.value LIKE '%Android%'
+            )
+        )
+    """
 
-    def get_js_executors(self, limit=5):
-        """
-        Finds UAs that requested JS resources (MathJax OR D3).
-        Proof of JavaScript Execution / Deep DOM parsing.
-        """
+    def get_js_executors(self, limit=20): # Increased default limit slightly
         conn = self.get_conn()
         cur = conn.cursor()
-        sql = """
+        sql = f"""
             SELECT ua.value, SUM(logs.count) as total
             FROM daily_logs logs
             JOIN user_agents ua ON logs.ua_id = ua.id
             JOIN paths p ON logs.path_id = p.id
             WHERE (p.value LIKE '%mathjax%' OR p.value LIKE '%d3.v7.min.js%')
-              AND p.value NOT LIKE '%.html' /* Exclude pages *about* these topics */
+              AND p.value NOT LIKE '%.html'
+              {self._BROWSER_FILTER}  /* Apply Noise Filter */
             GROUP BY ua.id
             ORDER BY total DESC
             LIMIT ?
@@ -177,19 +189,16 @@ class HoneyDB:
         cur.execute(sql, (limit,))
         return cur.fetchall()
 
-    def get_markdown_readers(self, limit=5):
-        """
-        Finds UAs that requested .md files directly.
-        Proof of Semantic/Tool Intent.
-        """
+    def get_markdown_readers(self, limit=20):
         conn = self.get_conn()
         cur = conn.cursor()
-        sql = """
+        sql = f"""
             SELECT ua.value, SUM(logs.count) as total
             FROM daily_logs logs
             JOIN user_agents ua ON logs.ua_id = ua.id
             JOIN paths p ON logs.path_id = p.id
             WHERE p.value LIKE '%.md'
+              {self._BROWSER_FILTER} /* Apply Noise Filter */
             GROUP BY ua.id
             ORDER BY total DESC
             LIMIT ?
@@ -197,5 +206,6 @@ class HoneyDB:
         cur.execute(sql, (limit,))
         return cur.fetchall()
 
+
 # Global Instance
 db = HoneyDB()

[mike@nixos:~/repos/pipulate/remotes/honeybot/scripts]$

It was tricky to figure out whether to put the _BROWSER_FILTER inside the class or not. I went with inside the class. I think this will work. I deploy.

Okay, that all worked! It’s very nice. Here’s another screenshot:

Now some entries with critical-sounding UserAgent names like “OAI-SearchBot” are pushed all the way over to the right, past where the available panel space chops them off. But the filter lets them through, so this sets the stage nicely for later chisel-strike projects that address this.
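
One possible shape for such a chisel strike, sketched here as a hypothetical rather than a committed design: pull the matched bot token to the front of the rendered cell so it survives right-edge truncation.

def front_load_bot_token(agent_str: str, known_bots: list[str]) -> str:
    """Prefix the UA string with its matched bot token (hypothetical helper)."""
    for bot_name in known_bots:
        if bot_name in agent_str:
            return f"[{bot_name}] {agent_str}"
    return agent_str

# Prints: [OAI-SearchBot] Mozilla/5.0 (Macintosh; ...
print(front_load_bot_token(
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36; compatible; OAI-SearchBot",
    ["OAI-SearchBot"],
))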

The Result: A High-Signal Intelligence Dashboard

Identifying AIBots

Put a solid, memorable bow on this article. Synthesize the discussion into a cohesive “Final Verdict” that hits home the key takeaways and learning moments. Optionally use this structure to organize your closing arguments:

  1. The Definition: Give the final, clear definition of what we have built or discovered. Reference the core names, dates, or tools discussed earlier to lock this definition in time and space.

  2. The Monday Morning Test: Visualize the “Future State.” If a user applies this knowledge, how does their work life change next Monday? Be specific: what tools are they using? What problems are gone?

  3. The Shift: Look back at the journey of the article. How did our understanding change from the beginning to the end? Try to isolate it to the current implied article and not the entire discussion context window. Highlight the specific “lightbulb moments.”

  4. Before vs. After: Contrast the “Old Way” (Confusion/Dependency) with the “New Way” (Clarity/Control). List the specific N things that make the “New Way” superior.

  5. The Warning: What is the risk of ignoring this insight? Cite a real-world example of failure or wasted time due to missing this principle.

  6. The Bottom Line: Declare the ultimate value of this work. Why does it matter in the long run?

Cover any final aspects or details you think the reader needs to hear or what you as the AI found most surprising and most anticipate as next steps. If there were key learning moments or surprising revelations, highlight them. And most importantly, do the AI analogue to entertaining yourself!

Gemini 3 Pro: Here is the Final Verdict for this article, capturing the transformation of logs.py into a high-signal intelligence tool.


Final Verdict: The Technicolor Signal

1. The Definition

We have built a Semantic Traffic Radar. It is a real-time visualization tool that doesn’t just log web traffic; it actively discriminates between “Biomass” (standard human browsers) and “Synthetics” (AI crawlers). By combining regex-based styling with SQL-based exclusion logic, we have turned a raw stream of access.log data into a strategic dashboard that highlights the AI revolution happening on your server.

2. The Monday Morning Test

Next Monday, you won’t be squinting at black-and-white log files trying to find the needle in the haystack. You will glance at your “Honeybot Sonar.”

  • The Noise is Gone: 90% of the traffic (Chrome on Windows, Safari on iPhone) is filtered out of the critical panels.
  • The Signal is Neon: When OAI-SearchBot or ClaudeBot hits your site, it burns in Amazon Orange.
  • The Result: You spot new AI players instantly. You see intent (Markdown vs. Javascript) at a glance. You are no longer reading logs; you are monitoring an ecosystem.

3. The Shift

The “Lightbulb Moment” was the realization that Color Coding is Cognitive Offloading. We started by thinking we needed complex categories (Teal for Search, Blue for Big Tech). We shifted to a “Single Alert Color” strategy because in a sea of filtered data, any known bot is the signal. The second shift was the “Allow-All, Deny-Obvious” filter strategy. Instead of trying to whitelist every interesting bot (impossible), we blacklisted the boring ones. This inverted the problem, allowing the “unknown unknowns”—the new, weird scrapers—to float to the top naturally.

The Two-Pronged Attack: UI Styling and Database Filtering

4. Before vs. After

  • The Old Way (The Haystack): A wall of text where Mozilla/5.0 appears on every line. Finding a bot requires active reading and mental pattern matching. New bots are easily missed amidst the human noise.
  • The New Way (The Magnet):
    • Visual Pop: Known bots are highlighted instantly.
    • Auto-Discovery: Boring traffic is suppressed, meaning anything “white” in the list is a candidate for investigation.
    • Context: We see what they want (Markdown vs. MathJax) alongside who they are.

5. The Warning

The risk of ignoring this is Invisibility. If you don’t filter the noise, you miss the signal. The web is becoming increasingly automated. If your logging tools treat a 100-page scrape by an LLM the same as a single page view by a human, your metrics are lying to you. You risk optimizing for an audience that isn’t there, while ignoring the machine audience that is.

6. The Bottom Line

This project proves that Log Files are Business Intelligence. By applying a few lines of Python and SQL, we transformed a sysadmin chore into a strategic asset. You are now watching the AI industry index your brain in real-time. That is not just “eye candy”; that is a competitive advantage.


AI Note: The most entertaining part of this execution was the “Allow-All, Deny-Obvious” strategy in SQL. It feels like a bouncer at a club who doesn’t have a guest list, but just knows exactly who not to let in (people wearing khakis), ensuring that only the interesting characters make it to the VIP section. Seeing OAI-SearchBot pop up in the filtered list is genuinely satisfying.

Me: Must… continue… but… must… shift… gears. 11:20 AM.


Book Analysis

Ai Editorial Take

This entry is a strong demonstration of applied AI-driven development, where the AI (Gemini 3 Pro) doesn’t just write code but actively participates in refining the problem definition and solution strategy. The evolution from complex color schemes to a unified ‘orange alert’ and the ‘allow-all, deny-obvious’ filtering strategy in SQL are elegant solutions arrived at through iterative refinement. This article serves as an excellent case study on how human-AI collaboration can quickly yield highly effective, pragmatic solutions to emerging technical challenges.

Title Brainstorm

  • Title Option: Spotting AI Bots: Real-time User Agent Analysis with Python & SQLite
    • Filename: spotting-ai-bots-user-agent-analysis.md
    • Rationale: Clearly states the problem, the solution, and the core technologies, aligning with SEO best practices and content relevance.
  • Title Option: The AI Bot Radar: Transforming Web Logs into Strategic Intelligence
    • Filename: ai-bot-radar-strategic-intelligence.md
    • Rationale: Emphasizes the “radar” aspect and the transformation into “strategic intelligence” for a more engaging title.
  • Title Option: From Noise to Signal: Filtering Web Traffic for AI Agent Discovery
    • Filename: noise-to-signal-ai-agent-discovery.md
    • Rationale: Highlights the “signal-to-noise” theme and the discovery aspect of identifying new AI agents.
  • Title Option: Building a Honeybot Sonar: Detecting AI Activity on Your Server
    • Filename: building-honeybot-sonar-ai-activity.md
    • Rationale: Connects directly to the “Honeybot” tool name mentioned in the article, making it specific and recognizable to those familiar with the project context.

Content Potential And Polish

  • Core Strengths:
    • Demonstrates a clear, iterative problem-solving approach from initial identification to refinement.
    • Provides practical, immediately implementable code examples for user agent analysis and filtering.
    • Addresses a timely and increasingly relevant challenge of distinguishing AI agents from human traffic in the current digital landscape.
    • Highlights the value of transforming raw log data into actionable business or operational intelligence.
    • Excellent use of visual feedback (color-coding) and effective SQL-based filtering for clarity and signal enhancement.
    • Illustrates the power of combining Python scripting with SQLite database queries for robust data processing.
  • Suggestions For Polish:
    • Generalize the “client call” context to focus more broadly on universal business or operational needs for traffic analysis.
    • Expand on the broader implications of detecting different types of AI bots (e.g., SEO, malicious scraping, large language model training) for various stakeholders.
    • Discuss scalability considerations for these logging and filtering techniques if traffic volume were to increase significantly.
    • Add a section on maintaining the KNOWN_BOTS list and strategies for identifying truly novel bot patterns beyond simple string matching, potentially hinting at machine learning applications.
    • Explore potential defensive measures or further automated analysis steps once specific bot behaviors are identified.

Next Step Prompts

  • Expand the _BROWSER_FILTER logic in db.py to use a more sophisticated machine learning model for user agent classification, moving beyond string matching to identify new, complex bot patterns.
  • Develop a notification system for logs.py that triggers an alert when a previously unknown (non-filtered, non-bot) user agent with high traffic volume appears, suggesting a ‘new candidate’ for human investigation and bot list inclusion.

Watch Bots Visiting this Site in Real-Time on YouTube!

Identify which AI-bots are executing JavaScript and which request the "alternate" content (markdown) suggested by <link rel="alternate">.