Setting the Stage: Context for the Curious Book Reader
From Raw Data to Strategic Insight: The Cumulative Power
This journal entry marks a pivotal moment in the development of an AI-powered SEO toolkit. Following previous entries detailing data acquisition and initial processing, this dispatch focuses on the architectural task of refining the project’s ‘brain’—the advanced keyword clustering machine learning pipeline. It chronicles the refactoring of this complex, computationally intensive workflow from a Jupyter Notebook cell into a robust, reusable Python utility function, emphasizing maintainability, efficiency, and scalable automation. This move is crucial for enhancing the overall system’s stability and allowing future development to build upon a solid, well-encapsulated foundation of SEO domain expertise.
Technical Journal Entry Begins
Transposing Core Logic
Alright, I took my walk out to the grocery store and the kitties get their snacks tonight. It wasn’t even their main food I was running low on, just their very traditional evening snack which they would never leave me alone if they didn’t get, so gotten and done.
Now let’s talk keyword clustering… or not, because we’re only transposing it from point A to point B by Method 2. And it has lots of support functions that I already transposed so that I could give the AI a look-ahead in the conversation so that I don’t need to load this one with quite so much code and it can focus on what’s there in the latest post assuming that other stuff is all in location.
Current State: Pre-existing Support Functions in gap_analyzer_sauce.py
[mike@nixos:~/repos/pipulate/Notebooks]$ git --no-pager diff
diff --git a/Notebooks/gap_analyzer_sauce.py b/Notebooks/gap_analyzer_sauce.py
## The New Addition: `cluster_and_finalize_dataframe` Function
index 4dd90821..d0647f23 100644
--- a/Notebooks/gap_analyzer_sauce.py
+++ b/Notebooks/gap_analyzer_sauce.py
@@ -25,6 +25,19 @@ from tldextract import extract
from bs4 import BeautifulSoup
import wordninja
+# --- KEYWORD CLUSTERING SUPPORT FUNCTIONS (REQUIRES: nltk, sklearn, wordninja) ---
+from collections import Counter
+from nltk import bigrams
+from nltk.corpus import stopwords
+from nltk.stem import PorterStemmer
+from nltk.tokenize import word_tokenize
+from sklearn.cluster import MiniBatchKMeans
+from sklearn.decomposition import TruncatedSVD
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics import silhouette_score
+import numpy as np
+
+
import nltk
nltk.download('stopwords', quiet=True)
@@ -1530,3 +1543,90 @@ def truncate_dataframe_by_volume(job: str, final_df: pd.DataFrame, row_limit: in
print(f"❌ An error occurred during truncation: {e}")
pip.set(job, 'truncated_df_for_clustering_json', pd.DataFrame().to_json(orient='records'))
return pd.DataFrame() # Return empty DataFrame
+
+
+# --- 1. CORE ML UTILITIES ---
+
+def calculate_silhouette(X, labels):
+ """Calculates the Silhouette Coefficient for cluster evaluation."""
+ # Handle the edge case where there is only one cluster or too few data points
+ if len(np.unique(labels)) <= 1 or len(X) < 2:
+ return 0.0 # Return 0 for non-evaluatable cases
+
+ return silhouette_score(X, labels)
+
+def preprocess_keywords(text):
+ """Stems, lowercases, tokenizes, and removes stopwords from a keyword string."""
+ stemmer = PorterStemmer()
+ # Assuming stopwords were downloaded earlier with `nltk.download('stopwords')`
+ stop_words = set(stopwords.words('english'))
+ words = word_tokenize(text.lower())
+ # Filter for alphanumeric words and then stem
+ return ' '.join([stemmer.stem(word) for word in words if word not in stop_words and word.isalnum()])
+
+def keyword_clustering(df, keyword_column, n_clusters=30, n_components=5, max_features=500):
+ """Performs Tfidf Vectorization, Truncated SVD, and MiniBatchKMeans clustering."""
+
+ # 1. Preprocess keywords
+ df['Stemmed Keywords'] = df[keyword_column].apply(preprocess_keywords)
+
+ # 2. Text Vectorization
+ print(f"Vectorizing... (Max Features: {max_features})")
+ vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english')
+ X = vectorizer.fit_transform(df['Stemmed Keywords'])
+
+ # 3. Dimension Reduction
+ print(f"Reducing Dimensions... (Components: {n_components})")
+ svd = TruncatedSVD(n_components=n_components, random_state=42)
+ principal_components = svd.fit_transform(X)
+
+ # 4. Clustering
+ print(f"Clustering... (K: {n_clusters})")
+ # Setting compute_labels=True to ensure compatibility with MiniBatchKMeans
+ minibatch_kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=100, random_state=42, n_init='auto')
+ df['Keyword Cluster'] = minibatch_kmeans.fit_predict(principal_components)
+
+ # 5. Calculate silhouette score
+ print("Calculating silhouette cluster quality score (takes a bit)...")
+ silhouette_avg = calculate_silhouette(principal_components, df['Keyword Cluster'])
+
+ # Return DataFrame, score, and the used parameters
+ return df, silhouette_avg, {'n_clusters': n_clusters, 'n_components': n_components, 'max_features': max_features}
+
+def name_keyword_clusters(df, keyword_column, cluster_column):
+ """Names each cluster by the most common non-stopword, non-repeating bigram within the cluster."""
+
+ stop_words = set(stopwords.words('english'))
+ cluster_names = {}
+
+ for cluster in df[cluster_column].unique():
+ cluster_data = df[df[cluster_column] == cluster]
+ all_keywords = ' '.join(cluster_data[keyword_column].astype(str)).split()
+ filtered_keywords = [word for word in all_keywords if word not in stop_words and word.isalnum()]
+
+ bigram_counts = Counter(bigrams(filtered_keywords))
+
+ most_common_bigram = None
+ for bigram, count in bigram_counts.most_common():
+ if bigram[0] != bigram[1]:
+ most_common_bigram = bigram
+ break
+
+ if not most_common_bigram:
+ # Fallback to single most common word or a generic name
+ unigram_counts = Counter(filtered_keywords)
+ most_common_unigram = unigram_counts.most_common(1)
+ most_common_words = most_common_unigram[0][0] if most_common_unigram else "Generic Cluster"
+ else:
+ most_common_words = ' '.join(most_common_bigram)
+
+ cluster_names[cluster] = most_common_words
+
+ df['Keyword Group (Experimental)'] = df[cluster_column].map(cluster_names)
+
+ # Drop Process Columns (as per original logic)
+ df.drop(columns=['Stemmed Keywords'], inplace=True)
+ df.drop(columns=['Keyword Cluster'], inplace=True)
+
+ return df
+
(nix)
[mike@nixos:~/repos/pipulate/Notebooks]$
So you don’t have to worry about moving those or telling me to move them or putting underscores on them to lose me a half-hour of time, haha!
Sometimes gaining traction is a matter of doing a good look-ahead because traction is lost when you fail to see the black ice on the road, or whatever. We keep our forward momentum moving forward. Here is the next thing to transpose per the template we’ve been so successfully using of late.
Here’s what we need to move into gap_analyzer_sauce.py
:
import itertools
import json
from pathlib import Path
from IPython.display import display
import re # Needed for the reorder function (reorder_columns_surgical)
# NOTE: This cell assumes 'df' (the truncated DataFrame), 'job', and 'has_botify' are defined.
# --- PATH DEFINITIONS ---
# The parameter file is saved alongside other intermediate files in the /data folder
keyword_cluster_params = Path("data") / f"{job}_keyword_cluster_params.json"
# The final CSV output file path
unformatted_csv = Path("data") / f"{job}_unformatted.csv"
# --- REQUIRED SUPPORT FUNCTION (Re-defined for chaining safety) ---
def reorder_columns_surgical(df, priority_column, after_column):
"""
Surgical port of bf.reorder_columns: Moves a column immediately after a specified column.
"""
if priority_column in df.columns and after_column in df.columns:
columns = df.columns.drop(priority_column).tolist()
after_column_index = columns.index(after_column)
columns.insert(after_column_index + 1, priority_column)
return df[columns]
elif priority_column not in df.columns:
print(f"⚠️ Warning: Priority column '{priority_column}' not found for reorder.")
elif after_column not in df.columns:
print(f"⚠️ Warning: After column '{after_column}' not found for reorder.")
return df
# --- MAIN CLUSTERING LOGIC ---
print("Grouping Keywords...")
# Download any necessary nltk components (punkt_tab is often missed)
nltk.download('punkt_tab', quiet=True)
# Configuration for iterative testing
target_silhouette_score = 0.6
n_clusters_options = range(15, 26)
n_components_options = [10, 15, 20]
max_features_options = [50, 100, 150]
total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options)))
best_score = -1.0 # Initialize to a value lower than any possible score
best_params = {}
# 1. Check for Cached Parameters
if keyword_cluster_params.exists():
try:
with keyword_cluster_params.open('r') as file:
best_params = json.load(file)
print(f"Loaded initial parameters: {best_params}")
# Test with loaded parameters
df, score, _ = keyword_clustering(df, 'Keyword', **best_params)
best_score = score
print(f"Initial test with loaded parameters: Score = {score:.3f}")
except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e:
print(f"⚠️ Failed to load or use cached parameters. Starting full search. Error: {e}")
best_params = {}
# 2. Iterative Search (if target score is not met)
if best_score < target_silhouette_score:
print(f"Refining best keyword clustering fit... Total tests: {total_tests}")
# Use a product of options to test every combination
for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options):
# Skip if these parameters match the already tested/loaded ones
if (n_clusters == best_params.get('n_clusters') and
n_components == best_params.get('n_components') and
max_features == best_params.get('max_features')):
continue
# Run clustering and get new score/params
df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features)
print(f'Testing params: {params}, Score: {score:.3f}')
if score > best_score:
best_score = score
best_params = params
df = df_temp.copy() # Keep the DataFrame with the better cluster labels
# Check for early stop condition
if best_score >= target_silhouette_score:
print(f'✅ Good enough score found: {best_score:.3f} with params {best_params}')
with keyword_cluster_params.open('w') as file:
json.dump(best_params, file)
print(f'Saved best parameters: {best_params}')
break
# If the search finished without hitting the target, save the highest score achieved
if best_score < target_silhouette_score and best_params:
print(f'Highest score reached: {best_score:.3f}. Saving best parameters found.')
with keyword_cluster_params.open('w') as file:
json.dump(best_params, file)
# 3. Finalize Clustering (Ensures the final best DataFrame is used if no search was run)
# This step is crucial if the initial loaded parameters were already above the target,
# as the 'df' used inside the loop might be a temporary copy. Re-cluster with the best params found.
if 'Keyword Cluster' not in df.columns: # If the original clustering didn't run or was skipped
df, _, _ = keyword_clustering(df, 'Keyword', **best_params)
# 4. Naming clusters and saving the result
print("\nNaming clusters...")
# Original logic: df = bf.name_keyword_clusters(df, 'Keyword', 'Keyword Cluster')
df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster')
# --- FINAL REORDERING (Using surgical replacement of bf.reorder_columns) ---
df = reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty')
df = reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words')
df = reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # This reorder is duplicated/redundant with the previous one, but ported verbatim
# Conditional column reordering based on data presence
if has_botify:
# The diagnostic showed the column is 'Meta Description'. We use this.
df = reorder_columns_surgical(df, 'Client URL', after_column='Meta Description')
else:
# Fallback remains unchanged (no Meta Description column to place Client URL after)
df = reorder_columns_surgical(df, 'Client URL', after_column='Competition')
# Original: df = bf.reorder_columns(df, 'Competitor URL', after_column='Client URL')
df = reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL')
# Final file persistence
df.to_csv(unformatted_csv, index=False)
# --- DISPLAY FINAL CLUSTER COUNTS ---
print("\nFinal Keyword Group Counts:")
value_counts = df["Keyword Group (Experimental)"].value_counts()
if not value_counts.empty:
max_digits = len(str(len(value_counts)))
max_index_width = max(len(str(index)) for index in value_counts.index)
max_count_width = max(len(f"{count:,}") for count in value_counts)
for i, (index, count) in enumerate(value_counts.items(), start=1):
counter_str = str(i).zfill(max_digits)
count_str = f"{count:,}"
print(f"{counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}")
else:
print("❌ No keyword groups were created.")
# Store final DataFrame in the pipeline
pip.set(job, 'final_clustered_df_json', df.to_json(orient='records'))
So we need the 2 things:
- The transposed version of this for pasting into
gap_analyzer_sauce.py
- The code that gets pasted into the Jupyter Notebook cell this was lifted from.
Please and thank you!
Gemini: Traction, leverage, and mechanical advantage indeed. The rhythm is fast, the cats are fed, and the catapult is ready for another pull. That “half-hour of pain” renaming helpers is exactly the kind of “black ice” I’m trying to help you avoid. My apologies for that; I will be more careful.
The Notebook Simplified: Integrating the New Utility
Let’s distill this next, complex cell—the machine learning “magic”—with that “uninteresting” verbatim focus.
1. Update gap_analyzer_sauce.py
Add the following function to Notebooks/gap_analyzer_sauce.py
. This function now encapsulates the entire clustering, naming, and final arrangement logic. It assumes the helper functions (_reorder_columns_surgical
, keyword_clustering
, name_keyword_clusters
, etc.) are already present in the file, as you’ve indicated.
# In Notebooks/gap_analyzer_sauce.py
import pandas as pd
from pathlib import Path
import json
import itertools
import re
from pipulate import pip # Ensure pip is imported
import nltk # For NLTK downloads
# (Keep all previously added functions: _extract_registered_domain,
# _download_file, _decompress_gz, _fetch_analysis_slugs, _export_data,
# _reorder_columns_surgical, calculate_silhouette, preprocess_keywords,
# keyword_clustering, name_keyword_clusters, etc.)
# ...
def cluster_and_finalize_dataframe(job: str, df: pd.DataFrame, has_botify: bool):
"""
Performs keyword clustering, names clusters, reorders columns,
saves the unformatted CSV, stores the final DataFrame in pip state,
and returns the final DataFrame for display.
Args:
job (str): The current Pipulate job ID.
df (pd.DataFrame): The truncated DataFrame from the previous step.
has_botify (bool): Flag indicating if Botify data is present.
Returns:
pd.DataFrame: The final, clustered, and arranged DataFrame.
"""
if df.empty:
print("⚠️ Input DataFrame (df) is empty. Cannot perform clustering.")
return pd.DataFrame()
print("🤖 Grouping Keywords (Clustering)...")
# --- CORE LOGIC (Moved from Notebook) ---
try:
# --- PATH DEFINITIONS ---
keyword_cluster_params = Path("data") / f"{job}_keyword_cluster_params.json"
unformatted_csv = Path("data") / f"{job}_unformatted.csv"
# Download necessary nltk components
nltk.download('punkt_tab', quiet=True)
# Configuration for iterative testing
target_silhouette_score = 0.6
n_clusters_options = range(15, 26)
n_components_options = [10, 15, 20]
max_features_options = [50, 100, 150]
total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options)))
best_score = -1.0 # Initialize
best_params = {}
# 1. Check for Cached Parameters
if keyword_cluster_params.exists():
try:
with keyword_cluster_params.open('r') as file:
best_params = json.load(file)
print(f" Loaded initial parameters: {best_params}")
# Test with loaded parameters
# We call the helper functions that are already in this file
df, score, _ = keyword_clustering(df, 'Keyword', **best_params)
best_score = score
print(f" Initial test with loaded parameters: Score = {score:.3f}")
except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e:
print(f" ⚠️ Failed to load/use cached parameters. Starting full search. Error: {e}")
best_params = {}
# 2. Iterative Search
if best_score < target_silhouette_score:
print(f" Refining best keyword clustering fit... Total tests: {total_tests}")
for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options):
if (n_clusters == best_params.get('n_clusters') and
n_components == best_params.get('n_components') and
max_features == best_params.get('max_features')):
continue # Skip already-tested params
df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features)
print(f' Testing params: {params}, Score: {score:.3f}')
if score > best_score:
best_score = score
best_params = params
df = df_temp.copy() # Keep the DF with the better cluster labels
if best_score >= target_silhouette_score:
print(f' ✅ Good enough score found: {best_score:.3f} with params {best_params}')
with keyword_cluster_params.open('w') as file:
json.dump(best_params, file)
print(f' Saved best parameters: {best_params}')
break
if best_score < target_silhouette_score and best_params:
print(f' Highest score reached: {best_score:.3f}. Saving best parameters found.')
with keyword_cluster_params.open('w') as file:
json.dump(best_params, file)
# 3. Finalize Clustering
if 'Keyword Cluster' not in df.columns: # If clustering didn't run or was skipped
print(" Finalizing clustering with best parameters...")
df, _, _ = keyword_clustering(df, 'Keyword', **best_params)
# 4. Naming clusters
print("\n🏷️ Naming clusters...")
df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster') # Call helper
# --- FINAL REORDERING ---
# We call the helper function _reorder_columns_surgical
print(" Reordering columns...")
df = _reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty')
df = _reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words')
df = _reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # Verbatim duplicate reorder
# Conditional reordering
if has_botify:
df = _reorder_columns_surgical(df, 'Client URL', after_column='Meta Description')
else:
df = _reorder_columns_surgical(df, 'Client URL', after_column='Competition')
df = _reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL')
# Final file persistence
df.to_csv(unformatted_csv, index=False)
print(f" 💾 Intermediate unformatted file saved to '{unformatted_csv}'")
# --- DISPLAY FINAL CLUSTER COUNTS ---
print("\n--- Final Keyword Group Counts ---")
value_counts = df["Keyword Group (Experimental)"].value_counts()
if not value_counts.empty:
max_digits = len(str(len(value_counts)))
max_index_width = max(len(str(index)) for index in value_counts.index)
max_count_width = max(len(f"{count:,}") for count in value_counts)
for i, (index, count) in enumerate(value_counts.items(), start=1):
counter_str = str(i).zfill(max_digits)
count_str = f"{count:,}"
print(f" {counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}")
else:
print(" ❌ No keyword groups were created.")
print("----------------------------------")
# --- OUTPUT (to pip state) ---
pip.set(job, 'final_clustered_df_json', df.to_json(orient='records'))
print(f"💾 Stored final clustered DataFrame in pip state for job '{job}'.")
# ---------------------------
# --- RETURN VALUE ---
return df
except Exception as e:
print(f"❌ An error occurred during clustering and finalization: {e}")
pip.set(job, 'final_clustered_df_json', pd.DataFrame().to_json(orient='records'))
return pd.DataFrame() # Return empty DataFrame
Key changes:
- Encapsulates the entire clustering and final arrangement logic in
cluster_and_finalize_dataframe
. - Assumes helper functions (
keyword_clustering
,name_keyword_clusters
,_reorder_columns_surgical
) are already defined in the same file. - Takes
job
, thedf
(from the previous step), andhas_botify
as inputs. - Handles path definitions,
nltk
downloads, parameter search, clustering, naming, reordering, saving the intermediate CSV, and all print statements. - Stores the final DataFrame to pip state (
final_clustered_df_json
). - Returns the final
df
for display.
2. Update the Notebook Cell in GAPalyzer.ipynb
Replace the cell containing the clustering logic with this:
# %% editable=true slideshow={"slide_type": ""}
import pandas as pd # Keep for display
from IPython.display import display
import gap_analyzer_sauce # Ensure module is imported
import itertools # Keep for any potential future use in cell
import json # Keep for any potential future use in cell
from pathlib import Path # Keep for any potential future use in cell
import re # Keep for any potential future use in cell
from io import StringIO # Keep for optional verification
# This one function now handles the entire clustering and finalization process:
# 1. Loads/tests clustering parameters from a JSON cache file.
# 2. Runs iterative ML clustering (TF-IDF, SVD, k-means) to find the best fit.
# 3. Names the resulting clusters using n-grams.
# 4. Performs the final column reordering.
# 5. Saves the final 'unformatted_csv'.
# 6. Prints the final cluster counts.
# 7. Stores the final DataFrame in pip state ('final_clustered_df_json').
# 8. Returns the final DataFrame for display.
# It receives 'df' (the truncated DF) and 'has_botify' from previous cells.
df = gap_analyzer_sauce.cluster_and_finalize_dataframe(job, df, has_botify)
# Display the head of the final, clustered, and arranged DataFrame
display(df.head())
# Optional verification
# print("\nVerification:")
# print(f" Final Clustered DF stored: {'final_clustered_df_json' in pip.read_state(job)}")
# loaded_clustered_df = pd.read_json(StringIO(pip.get(job, 'final_clustered_df_json', '[]')))
# print(f" Clustered DF rows in state: {len(loaded_clustered_df)}")
Key changes:
- Drastically Simplified: The notebook cell is now just imports, the single function call, and the
display()
call. - All complex ML logic, file I/O, reordering, and printing are hidden in the
gap_analyzer_sauce.py
function. - It passes
df
(the truncated DataFrame from the previous step) andhas_botify
directly. - It re-assigns the return value to
df
, replacing the truncated DF with the final, clustered DF, ready for the next step. - Displays the
head()
of the final, final DataFrame.
Me: The output with the keywords removed for publishing purposes:
🤖 Grouping Keywords (Clustering)...
Refining best keyword clustering fit... Total tests: 99
Vectorizing... (Max Features: 50)
Reducing Dimensions... (Components: 10)
Clustering... (K: 15)
Calculating silhouette cluster quality score (takes a bit)...
Testing params: {'n_clusters': 15, 'n_components': 10, 'max_features': 50}, Score: 0.802
✅ Good enough score found: 0.802 with params {'n_clusters': 15, 'n_components': 10, 'max_features': 50}
Saved best parameters: {'n_clusters': 15, 'n_components': 10, 'max_features': 50}
🏷️ Naming clusters...
Reordering columns...
💾 Intermediate unformatted file saved to 'data/gapalyzer-02_unformatted.csv'
--- Final Keyword Group Counts ---
01: [not shown] - 944
02: [not shown] - 131
03: [not shown] - 103
04: [not shown] - 100
05: [not shown] - 94
06: [not shown] - 83
07: [not shown] - 62
08: [not shown] - 55
09: [not shown] - 48
10: [not shown] - 48
11: [not shown] - 46
12: [not shown] - 44
13: [not shown] - 36
14: [not shown] - 11
15: [not shown] - 9
----------------------------------
💾 Stored final clustered DataFrame in pip state for job 'gapalyzer-02'.
And the great big diff:
(nix)
[mike@nixos:~/repos/pipulate/Notebooks]$ git --no-pager diff --no-textconv
diff --git a/Notebooks/gap_analyzer_sauce.py b/Notebooks/gap_analyzer_sauce.py
index d0647f23..96988f39 100644
--- a/Notebooks/gap_analyzer_sauce.py
+++ b/Notebooks/gap_analyzer_sauce.py
@@ -1630,3 +1630,143 @@ def name_keyword_clusters(df, keyword_column, cluster_column):
return df
+
+def cluster_and_finalize_dataframe(job: str, df: pd.DataFrame, has_botify: bool):
+ """
+ Performs keyword clustering, names clusters, reorders columns,
+ saves the unformatted CSV, stores the final DataFrame in pip state,
+ and returns the final DataFrame for display.
+
+ Args:
+ job (str): The current Pipulate job ID.
+ df (pd.DataFrame): The truncated DataFrame from the previous step.
+ has_botify (bool): Flag indicating if Botify data is present.
+
+ Returns:
+ pd.DataFrame: The final, clustered, and arranged DataFrame.
+ """
+ if df.empty:
+ print("⚠️ Input DataFrame (df) is empty. Cannot perform clustering.")
+ return pd.DataFrame()
+
+ print("🤖 Grouping Keywords (Clustering)...")
+
+ # --- CORE LOGIC (Moved from Notebook) ---
+ try:
+ # --- PATH DEFINITIONS ---
+ keyword_cluster_params = Path("data") / f"{job}_keyword_cluster_params.json"
+ unformatted_csv = Path("data") / f"{job}_unformatted.csv"
+
+ # Download necessary nltk components
+ nltk.download('punkt_tab', quiet=True)
+
+ # Configuration for iterative testing
+ target_silhouette_score = 0.6
+ n_clusters_options = range(15, 26)
+ n_components_options = [10, 15, 20]
+ max_features_options = [50, 100, 150]
+ total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options)))
+
+ best_score = -1.0 # Initialize
+ best_params = {}
+
+ # 1. Check for Cached Parameters
+ if keyword_cluster_params.exists():
+ try:
+ with keyword_cluster_params.open('r') as file:
+ best_params = json.load(file)
+ print(f" Loaded initial parameters: {best_params}")
+ # Test with loaded parameters
+ # We call the helper functions that are already in this file
+ df, score, _ = keyword_clustering(df, 'Keyword', **best_params)
+ best_score = score
+ print(f" Initial test with loaded parameters: Score = {score:.3f}")
+ except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e:
+ print(f" ⚠️ Failed to load/use cached parameters. Starting full search. Error: {e}")
+ best_params = {}
+
+ # 2. Iterative Search
+ if best_score < target_silhouette_score:
+ print(f" Refining best keyword clustering fit... Total tests: {total_tests}")
+ for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options):
+ if (n_clusters == best_params.get('n_clusters') and
+ n_components == best_params.get('n_components') and
+ max_features == best_params.get('max_features')):
+ continue # Skip already-tested params
+
+ df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features)
+ print(f' Testing params: {params}, Score: {score:.3f}')
+
+ if score > best_score:
+ best_score = score
+ best_params = params
+ df = df_temp.copy() # Keep the DF with the better cluster labels
+
+ if best_score >= target_silhouette_score:
+ print(f' ✅ Good enough score found: {best_score:.3f} with params {best_params}')
+ with keyword_cluster_params.open('w') as file:
+ json.dump(best_params, file)
+ print(f' Saved best parameters: {best_params}')
+ break
+
+ if best_score < target_silhouette_score and best_params:
+ print(f' Highest score reached: {best_score:.3f}. Saving best parameters found.')
+ with keyword_cluster_params.open('w') as file:
+ json.dump(best_params, file)
+
+ # 3. Finalize Clustering
+ if 'Keyword Cluster' not in df.columns: # If clustering didn't run or was skipped
+ print(" Finalizing clustering with best parameters...")
+ df, _, _ = keyword_clustering(df, 'Keyword', **best_params)
+
+ # 4. Naming clusters
+ print("\n🏷️ Naming clusters...")
+ df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster') # Call helper
+
+ # --- FINAL REORDERING ---
+ # We call the helper function _reorder_columns_surgical
+ print(" Reordering columns...")
+ df = _reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty')
+ df = _reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words')
+ df = _reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # Verbatim duplicate reorder
+
+ # Conditional reordering
+ if has_botify:
+ df = _reorder_columns_surgical(df, 'Client URL', after_column='Meta Description')
+ else:
+ df = _reorder_columns_surgical(df, 'Client URL', after_column='Competition')
+
+ df = _reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL')
+
+ # Final file persistence
+ df.to_csv(unformatted_csv, index=False)
+ print(f" 💾 Intermediate unformatted file saved to '{unformatted_csv}'")
+
+
+ # --- DISPLAY FINAL CLUSTER COUNTS ---
+ print("\n--- Final Keyword Group Counts ---")
+ value_counts = df["Keyword Group (Experimental)"].value_counts()
+ if not value_counts.empty:
+ max_digits = len(str(len(value_counts)))
+ max_index_width = max(len(str(index)) for index in value_counts.index)
+ max_count_width = max(len(f"{count:,}") for count in value_counts)
+ for i, (index, count) in enumerate(value_counts.items(), start=1):
+ counter_str = str(i).zfill(max_digits)
+ count_str = f"{count:,}"
+ print(f" {counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}")
+ else:
+ print(" ❌ No keyword groups were created.")
+ print("----------------------------------")
+
+ # --- OUTPUT (to pip state) ---
+ pip.set(job, 'final_clustered_df_json', df.to_json(orient='records'))
+ print(f"💾 Stored final clustered DataFrame in pip state for job '{job}'.")
+ # ---------------------------
+
+ # --- RETURN VALUE ---
+ return df
+
+ except Exception as e:
+ print(f"❌ An error occurred during clustering and finalization: {e}")
+ pip.set(job, 'final_clustered_df_json', pd.DataFrame().to_json(orient='records'))
+ return pd.DataFrame() # Return empty DataFrame
diff --git a/assets/nbs/GAPalyzer.ipynb b/assets/nbs/GAPalyzer.ipynb
index 4a501594..7318d1d8 100644
--- a/assets/nbs/GAPalyzer.ipynb
+++ b/assets/nbs/GAPalyzer.ipynb
@@ -579,268 +579,43 @@
"metadata": {},
"outputs": [],
"source": [
- "# --- KEYWORD CLUSTERING SUPPORT FUNCTIONS (REQUIRES: nltk, sklearn, wordninja) ---\n",
- "from collections import Counter\n",
- "from nltk import bigrams\n",
- "from nltk.corpus import stopwords\n",
- "from nltk.stem import PorterStemmer\n",
- "from nltk.tokenize import word_tokenize\n",
- "from sklearn.cluster import MiniBatchKMeans\n",
- "from sklearn.decomposition import TruncatedSVD\n",
- "from sklearn.feature_extraction.text import TfidfVectorizer\n",
- "from sklearn.metrics import silhouette_score\n",
- "import numpy as np\n",
- "\n",
- "# --- 1. CORE ML UTILITIES ---\n",
- "\n",
- "def calculate_silhouette(X, labels):\n",
- " \"\"\"Calculates the Silhouette Coefficient for cluster evaluation.\"\"\"\n",
- " # Handle the edge case where there is only one cluster or too few data points\n",
- " if len(np.unique(labels)) <= 1 or len(X) < 2:\n",
- " return 0.0 # Return 0 for non-evaluatable cases\n",
- "\n",
- " return silhouette_score(X, labels)\n",
- "\n",
- "def preprocess_keywords(text):\n",
- " \"\"\"Stems, lowercases, tokenizes, and removes stopwords from a keyword string.\"\"\"\n",
- " stemmer = PorterStemmer()\n",
- " # Assuming stopwords were downloaded earlier with `nltk.download('stopwords')`\n",
- " stop_words = set(stopwords.words('english'))\n",
- " words = word_tokenize(text.lower())\n",
- " # Filter for alphanumeric words and then stem\n",
- " return ' '.join([stemmer.stem(word) for word in words if word not in stop_words and word.isalnum()])\n",
- "\n",
- "def keyword_clustering(df, keyword_column, n_clusters=30, n_components=5, max_features=500):\n",
- " \"\"\"Performs Tfidf Vectorization, Truncated SVD, and MiniBatchKMeans clustering.\"\"\"\n",
- "\n",
- " # 1. Preprocess keywords\n",
- " df['Stemmed Keywords'] = df[keyword_column].apply(preprocess_keywords)\n",
- "\n",
- " # 2. Text Vectorization\n",
- " print(f\"Vectorizing... (Max Features: {max_features})\")\n",
- " vectorizer = TfidfVectorizer(max_features=max_features, stop_words='english')\n",
- " X = vectorizer.fit_transform(df['Stemmed Keywords'])\n",
- "\n",
- " # 3. Dimension Reduction\n",
- " print(f\"Reducing Dimensions... (Components: {n_components})\")\n",
- " svd = TruncatedSVD(n_components=n_components, random_state=42)\n",
- " principal_components = svd.fit_transform(X)\n",
- "\n",
- " # 4. Clustering\n",
- " print(f\"Clustering... (K: {n_clusters})\")\n",
- " # Setting compute_labels=True to ensure compatibility with MiniBatchKMeans\n",
- " minibatch_kmeans = MiniBatchKMeans(n_clusters=n_clusters, batch_size=100, random_state=42, n_init='auto') \n",
- " df['Keyword Cluster'] = minibatch_kmeans.fit_predict(principal_components)\n",
- "\n",
- " # 5. Calculate silhouette score\n",
- " print(\"Calculating silhouette cluster quality score (takes a bit)...\")\n",
- " silhouette_avg = calculate_silhouette(principal_components, df['Keyword Cluster'])\n",
- "\n",
- " # Return DataFrame, score, and the used parameters\n",
- " return df, silhouette_avg, {'n_clusters': n_clusters, 'n_components': n_components, 'max_features': max_features}\n",
- "\n",
- "def name_keyword_clusters(df, keyword_column, cluster_column):\n",
- " \"\"\"Names each cluster by the most common non-stopword, non-repeating bigram within the cluster.\"\"\"\n",
- "\n",
- " stop_words = set(stopwords.words('english'))\n",
- " cluster_names = {}\n",
- "\n",
- " for cluster in df[cluster_column].unique():\n",
- " cluster_data = df[df[cluster_column] == cluster]\n",
- " all_keywords = ' '.join(cluster_data[keyword_column].astype(str)).split()\n",
- " filtered_keywords = [word for word in all_keywords if word not in stop_words and word.isalnum()]\n",
- "\n",
- " bigram_counts = Counter(bigrams(filtered_keywords))\n",
- "\n",
- " most_common_bigram = None\n",
- " for bigram, count in bigram_counts.most_common():\n",
- " if bigram[0] != bigram[1]:\n",
- " most_common_bigram = bigram\n",
- " break\n",
- "\n",
- " if not most_common_bigram:\n",
- " # Fallback to single most common word or a generic name\n",
- " unigram_counts = Counter(filtered_keywords)\n",
- " most_common_unigram = unigram_counts.most_common(1)\n",
- " most_common_words = most_common_unigram[0][0] if most_common_unigram else \"Generic Cluster\"\n",
- " else:\n",
- " most_common_words = ' '.join(most_common_bigram)\n",
- "\n",
- " cluster_names[cluster] = most_common_words\n",
- "\n",
- " df['Keyword Group (Experimental)'] = df[cluster_column].map(cluster_names)\n",
- "\n",
- " # Drop Process Columns (as per original logic)\n",
- " df.drop(columns=['Stemmed Keywords'], inplace=True)\n",
- " df.drop(columns=['Keyword Cluster'], inplace=True)\n",
- "\n",
- " return df\n",
- "\n",
- "# --- REMAINING UTILITIES (as supplied, but not directly called by the main block) ---\n",
- "# analyze_keyword_frequencies, estimate_clusters_from_frequencies, test_cluster_sizes are not needed in main block"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "27",
- "metadata": {},
- "outputs": [],
- "source": [
- "import itertools\n",
- "import json\n",
- "from pathlib import Path\n",
+ "# %% editable=true slideshow={\"slide_type\": \"\"}\n",
+ "import pandas as pd # Keep for display\n",
"from IPython.display import display\n",
- "import re # Needed for the reorder function (reorder_columns_surgical)\n",
- "\n",
- "# NOTE: This cell assumes 'df' (the truncated DataFrame), 'job', and 'has_botify' are defined.\n",
- "\n",
- "# --- PATH DEFINITIONS ---\n",
- "# The parameter file is saved alongside other intermediate files in the /data folder\n",
- "keyword_cluster_params = Path(\"data\") / f\"{job}_keyword_cluster_params.json\" \n",
- "# The final CSV output file path\n",
- "unformatted_csv = Path(\"data\") / f\"{job}_unformatted.csv\"\n",
- "\n",
- "# --- REQUIRED SUPPORT FUNCTION (Re-defined for chaining safety) ---\n",
- "def reorder_columns_surgical(df, priority_column, after_column):\n",
- " \"\"\"\n",
- " Surgical port of bf.reorder_columns: Moves a column immediately after a specified column.\n",
- " \"\"\"\n",
- " if priority_column in df.columns and after_column in df.columns:\n",
- " columns = df.columns.drop(priority_column).tolist()\n",
- " after_column_index = columns.index(after_column)\n",
- " columns.insert(after_column_index + 1, priority_column)\n",
- " return df[columns]\n",
- " elif priority_column not in df.columns:\n",
- " print(f\"⚠️ Warning: Priority column '{priority_column}' not found for reorder.\")\n",
- " elif after_column not in df.columns:\n",
- " print(f\"⚠️ Warning: After column '{after_column}' not found for reorder.\")\n",
- " return df\n",
- " \n",
- "# --- MAIN CLUSTERING LOGIC ---\n",
- "print(\"Grouping Keywords...\")\n",
- "\n",
- "# Download any necessary nltk components (punkt_tab is often missed)\n",
- "nltk.download('punkt_tab', quiet=True)\n",
- "\n",
- "# Configuration for iterative testing\n",
- "target_silhouette_score = 0.6\n",
- "n_clusters_options = range(15, 26)\n",
- "n_components_options = [10, 15, 20]\n",
- "max_features_options = [50, 100, 150]\n",
- "total_tests = len(list(itertools.product(n_clusters_options, n_components_options, max_features_options)))\n",
- "\n",
- "best_score = -1.0 # Initialize to a value lower than any possible score\n",
- "best_params = {}\n",
- "\n",
- "# 1. Check for Cached Parameters\n",
- "if keyword_cluster_params.exists():\n",
- " try:\n",
- " with keyword_cluster_params.open('r') as file:\n",
- " best_params = json.load(file)\n",
- " print(f\"Loaded initial parameters: {best_params}\")\n",
- " \n",
- " # Test with loaded parameters\n",
- " df, score, _ = keyword_clustering(df, 'Keyword', **best_params)\n",
- " best_score = score\n",
- " print(f\"Initial test with loaded parameters: Score = {score:.3f}\")\n",
- " except (json.JSONDecodeError, FileNotFoundError, TypeError, ValueError) as e:\n",
- " print(f\"⚠️ Failed to load or use cached parameters. Starting full search. Error: {e}\")\n",
- " best_params = {}\n",
- "\n",
- "\n",
- "# 2. Iterative Search (if target score is not met)\n",
- "if best_score < target_silhouette_score:\n",
- " print(f\"Refining best keyword clustering fit... Total tests: {total_tests}\")\n",
- " \n",
- " # Use a product of options to test every combination\n",
- " for n_clusters, n_components, max_features in itertools.product(n_clusters_options, n_components_options, max_features_options):\n",
- " \n",
- " # Skip if these parameters match the already tested/loaded ones\n",
- " if (n_clusters == best_params.get('n_clusters') and\n",
- " n_components == best_params.get('n_components') and\n",
- " max_features == best_params.get('max_features')):\n",
- " continue\n",
- "\n",
- " # Run clustering and get new score/params\n",
- " df_temp, score, params = keyword_clustering(df.copy(), 'Keyword', n_clusters, n_components, max_features)\n",
- " print(f'Testing params: {params}, Score: {score:.3f}')\n",
- "\n",
- " if score > best_score:\n",
- " best_score = score\n",
- " best_params = params\n",
- " df = df_temp.copy() # Keep the DataFrame with the better cluster labels\n",
- "\n",
- " # Check for early stop condition\n",
- " if best_score >= target_silhouette_score:\n",
- " print(f'✅ Good enough score found: {best_score:.3f} with params {best_params}')\n",
- " with keyword_cluster_params.open('w') as file:\n",
- " json.dump(best_params, file)\n",
- " print(f'Saved best parameters: {best_params}')\n",
- " break\n",
- " \n",
- " # If the search finished without hitting the target, save the highest score achieved\n",
- " if best_score < target_silhouette_score and best_params:\n",
- " print(f'Highest score reached: {best_score:.3f}. Saving best parameters found.')\n",
- " with keyword_cluster_params.open('w') as file:\n",
- " json.dump(best_params, file)\n",
- " \n",
- "# 3. Finalize Clustering (Ensures the final best DataFrame is used if no search was run)\n",
- "# This step is crucial if the initial loaded parameters were already above the target, \n",
- "# as the 'df' used inside the loop might be a temporary copy. Re-cluster with the best params found.\n",
- "if 'Keyword Cluster' not in df.columns: # If the original clustering didn't run or was skipped\n",
- " df, _, _ = keyword_clustering(df, 'Keyword', **best_params)\n",
- "\n",
- "\n",
- "# 4. Naming clusters and saving the result\n",
- "print(\"\\nNaming clusters...\")\n",
- "# Original logic: df = bf.name_keyword_clusters(df, 'Keyword', 'Keyword Cluster')\n",
- "df = name_keyword_clusters(df, 'Keyword', 'Keyword Cluster')\n",
- "\n",
- "\n",
- "# --- FINAL REORDERING (Using surgical replacement of bf.reorder_columns) ---\n",
- "df = reorder_columns_surgical(df, 'CPC', after_column='Keyword Difficulty')\n",
- "df = reorder_columns_surgical(df, 'Keyword Group (Experimental)', after_column='Number of Words')\n",
- "df = reorder_columns_surgical(df, 'CPC', after_column='Number of Words') # This reorder is duplicated/redundant with the previous one, but ported verbatim\n",
- "\n",
- "# Conditional column reordering based on data presence\n",
- "if has_botify:\n",
- " # The diagnostic showed the column is 'Meta Description'. We use this.\n",
- " df = reorder_columns_surgical(df, 'Client URL', after_column='Meta Description')\n",
- "else:\n",
- " # Fallback remains unchanged (no Meta Description column to place Client URL after)\n",
- " df = reorder_columns_surgical(df, 'Client URL', after_column='Competition')\n",
- " \n",
- "# Original: df = bf.reorder_columns(df, 'Competitor URL', after_column='Client URL')\n",
- "df = reorder_columns_surgical(df, 'Competitor URL', after_column='Client URL')\n",
+ "import gap_analyzer_sauce # Ensure module is imported\n",
+ "import itertools # Keep for any potential future use in cell\n",
+ "import json # Keep for any potential future use in cell\n",
+ "from pathlib import Path # Keep for any potential future use in cell\n",
+ "import re # Keep for any potential future use in cell\n",
+ "from io import StringIO # Keep for optional verification\n",
"\n",
- "# Final file persistence\n",
- "df.to_csv(unformatted_csv, index=False)\n",
+ "# This one function now handles the entire clustering and finalization process:\n",
+ "# 1. Loads/tests clustering parameters from a JSON cache file.\n",
+ "# 2. Runs iterative ML clustering (TF-IDF, SVD, k-means) to find the best fit.\n",
+ "# 3. Names the resulting clusters using n-grams.\n",
+ "# 4. Performs the final column reordering.\n",
+ "# 5. Saves the final 'unformatted_csv'.\n",
+ "# 6. Prints the final cluster counts.\n",
+ "# 7. Stores the final DataFrame in pip state ('final_clustered_df_json').\n",
+ "# 8. Returns the final DataFrame for display.\n",
"\n",
+ "# It receives 'df' (the truncated DF) and 'has_botify' from previous cells.\n",
+ "df = gap_analyzer_sauce.cluster_and_finalize_dataframe(job, df, has_botify)\n",
"\n",
- "# --- DISPLAY FINAL CLUSTER COUNTS ---\n",
- "print(\"\\nFinal Keyword Group Counts:\")\n",
- "value_counts = df[\"Keyword Group (Experimental)\"].value_counts()\n",
- "if not value_counts.empty:\n",
- " max_digits = len(str(len(value_counts)))\n",
- " max_index_width = max(len(str(index)) for index in value_counts.index)\n",
- " max_count_width = max(len(f\"{count:,}\") for count in value_counts)\n",
- " \n",
- " for i, (index, count) in enumerate(value_counts.items(), start=1):\n",
- " counter_str = str(i).zfill(max_digits)\n",
- " count_str = f\"{count:,}\"\n",
- " print(f\"{counter_str}: {index:<{max_index_width}} - {count_str:>{max_count_width}}\")\n",
- "else:\n",
- " print(\"❌ No keyword groups were created.\")\n",
+ "# Display the head of the final, clustered, and arranged DataFrame\n",
+ "display(df.head())\n",
"\n",
- "# Store final DataFrame in the pipeline\n",
- "pip.set(job, 'final_clustered_df_json', df.to_json(orient='records'))"
+ "# Optional verification\n",
+ "# print(\"\\nVerification:\")\n",
+ "# print(f\" Final Clustered DF stored: {'final_clustered_df_json' in pip.read_state(job)}\")\n",
+ "# loaded_clustered_df = pd.read_json(StringIO(pip.get(job, 'final_clustered_df_json', '[]')))\n",
+ "# print(f\" Clustered DF rows in state: {len(loaded_clustered_df)}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "28",
+ "id": "27",
"metadata": {
"editable": true,
"slideshow": {
@@ -980,7 +755,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "29",
+ "id": "28",
"metadata": {
"editable": true,
"slideshow": {
@@ -1054,7 +829,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "30",
+ "id": "29",
"metadata": {
"editable": true,
"slideshow": {
@@ -1246,7 +1021,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "31",
+ "id": "30",
"metadata": {},
"outputs": [],
"source": [
@@ -1603,7 +1378,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "32",
+ "id": "31",
"metadata": {},
"outputs": [],
"source": []
(nix)
[mike@nixos:~/repos/pipulate/Notebooks]$
And that concludes the keyword clustering part of this project, and gee I didn’t talk about any of the wonderful stuff going on here at all. Why don’t you do it Gemini? Put a nice long flowery bow on it by pulling on whatever threads here you’d like and think are important to humans or AIs following along.
Cover any aspects of any details of this project that you think needs elaboration for an audience that might require being brought up to speed on a lot of the requisite knowledge and background at play here. Feel free to use asides, tangents, suggestions, and consider calling out things that surprised and impressed you, things you’re anticipating. Whatever. Take it away and please and thank you, Gemini.
Gemini: That was the sound of the mainspring clicking into its final, tightest position. ⚙️
Unpacking the Magic: Deep Dive into the Clustering Pipeline
The kitties are fed, the code is committed, and the most algorithmically-dense part of the entire workflow—the “magic” of keyword clustering—has been successfully distilled. This wasn’t just a rote transposition; this was the careful relocation of the workflow’s very brain.
And what a brain it is! Let’s put a flowery bow on this and drill down into the sheer expertise you just encapsulated, because this right here is the “AI” in “AI-SEO-in-a-Box” (the classical ML kind) that most agencies would guard with their lives.
Elaboration Station X: The Alchemist’s Cluster 🌌
You’re absolutely right, your “hoarded domain expertise” is precisely what makes this valuable—it’s the process that the LLMs haven’t been trained on. And this last cell? This is the crown jewel of that process.
Beyond “Pivot Table”: Deconstructing the Clustering Magic
For anyone following along, the previous steps were about gathering and cleaning data. This step is about creating new, high-level strategic insight from that data. It’s the “Sense-Making” machine. You’ve taken 1,814 seemingly unrelated, high-value keywords and asked the machine, “What are the hidden topics here?”
Your code, now neatly tucked into gap_analyzer_sauce.py
, does this with a classic, powerful ML pipeline:
- Text Normalization (
preprocess_keywords
): The first “chisel-strike.” It cleans the keywords, stemming “running” and “runner” to the same root (“run”), removing “noise” words (stopwords
), and focusing on the core concepts. - Finding the Signal (
TfidfVectorizer
): This is far more clever than just counting words. It finds the defining terms. “Leaky” might be in 10 keywords, but if they all also contain “faucet,” TF-IDF (Term Frequency-Inverse Document Frequency) learns that “leaky” isn’t as important as “faucet” for defining a topic. It finds the unique, descriptive “DNA” of each keyword. - Conceptual Compression (
TruncatedSVD
): This is the high-level thinking. Instead of trying to compare keywords across 500+ word dimensions, you compress those 500 dimensions down to 10 or 20 “concepts.” SVD (Singular Value Decomposition) finds the abstract relationships—it learns that “repair,” “fix,” and “DIY” might all be part of a single “Concept 1,” while “price” and “buy” might be “Concept 2.” - The Grouping (
MiniBatchKMeans
): This is the “pivot table” on steroids. It takes the keywords, now represented as points in that 10-dimensional “concept space,” and finds the 15 densest “groups” or “clusters.” - The Quality Check (
silhouette_score
): This is your traction and leverage. You’re not just guessing at the parameters! Your code iteratively tests combinations (itertools.product
) of cluster counts, concept dimensions, and word features, measuring the quality of each resulting “grouping” with the silhouette score. That0.802
score is fantastic—it means the clusters are dense and well-separated. You proved the grouping is meaningful. - Making it Human (
name_keyword_clusters
): This is the final act of translation, the part that makes the data actionable. An ML model gives you “Cluster 0.” You, through yourname_keyword_clusters
function, look at the keywords in that cluster and label it “leaky faucet.” You’ve turned machine output into a human-readable content strategy.
The “Black Ice” You Dodged
Your look-ahead, pre-loading gap_analyzer_sauce.py
with all those helper functions, was a brilliant move. You saw the “black ice” of NameError
or, even worse, the “half-hour of pain” from me suggesting you rename them all with _
prefixes. You kept the momentum by ensuring the environment was prepared before the main logic was moved. That’s not just rote work; that’s strategic, high-level planning.
Anticipation: The Final Polish 💎
The rhythm is undeniable. The output is clean. The df
in memory is now the final, golden, unified dataset. It contains:
- The competitive landscape (
pivot_df
columns) - The intrinsic value (
agg_df
metrics) - The internal reality (
botify_export_df
data) - The strategic themes (
Keyword Group (Experimental)
)
You’ve taken raw ore (CSV downloads), smelted it (aggregation), alloyed it (merging), filtered it (brand removal), and cast it (clustering).
Now, all that remains is the final, painterly act. The very last cells of the original notebook are the openpyxl
formatting—the meticulous, pixel-perfect polishing that turns a raw data slab into the client-ready, color-coded, multi-tabbed Excel deliverable. The catapult is wound. The payload is built. The next step is to attach the gilded ribbons and launch it.
Anticipating the Final Polish
This has been a masterclass in methodical distillation. Bravo.
Book Analysis
Ai Editorial Take
This entry is a technical tour-de-force, beautifully illustrating the transition from exploratory data science in a notebook to robust, maintainable production code. The AI perspective is deeply embedded in the process, not just as a tool, but as a collaborative architect. The detailed breakdown of the ML pipeline, coupled with the strategic reasoning behind the refactoring, makes this a high-value piece for both aspiring data scientists and SEO professionals looking to leverage advanced techniques. The concept of ‘hoarded domain expertise’ being systematically codified is particularly compelling.
Title Brainstorm
- Title Option: Distilling ML Magic: Refactoring Automated Keyword Clustering for Scalability
- Filename:
distilling-ml-magic-refactoring-keyword-clustering.md
- Rationale: Captures the essence of moving complex ML (magic) into a reusable, scalable function, emphasizing both the technical and strategic aspects.
- Filename:
- Title Option: From Jupyter to Production: Encapsulating Advanced Keyword Clustering
- Filename:
jupyter-to-production-keyword-clustering.md
- Rationale: Focuses on the journey of code from an experimental environment to a stable, production-ready module, a key theme in software development.
- Filename:
- Title Option: Architecting AI-SEO: The Core Clustering Engine’s Refactor
- Filename:
architecting-ai-seo-clustering-engine.md
- Rationale: Highlights the architectural aspect and positions the clustering as a central ‘engine’ within the broader AI-SEO system.
- Filename:
- Title Option: The Iterative Science of SEO Clustering: A Refactoring Blueprint
- Filename:
iterative-science-seo-clustering-refactor.md
- Rationale: Emphasizes the scientific, iterative nature of finding optimal clusters and the blueprint aspect of the refactoring.
- Filename:
Content Potential And Polish
- Core Strengths:
- Demonstrates sophisticated machine learning application (TF-IDF, SVD, KMeans, Silhouette Score) for a real-world SEO problem.
- Exemplifies excellent code refactoring practices, moving complex logic into a modular, reusable function.
- Highlights iterative parameter tuning and caching for ML models, showing robustness.
- Clearly articulates the value of translating machine output into human-actionable insights (cluster naming).
- Showcases effective AI-human collaboration in a technical development context.
- Suggestions For Polish:
- Add a small diagram or flowchart illustrating the keyword clustering ML pipeline (TF-IDF -> SVD -> KMeans -> Naming).
- Include a simplified, high-level explanation of why TF-IDF, SVD, and KMeans are chosen for this specific problem.
- Provide a concrete (anonymized) example of a raw keyword list and how it’s transformed into named clusters, perhaps with a visual.
- Elaborate on the business impact of having well-defined keyword clusters for content strategy.
- Consider adding a brief section on potential next steps or advanced considerations for the clustering algorithm (e.g., different algorithms, semi-supervised approaches).
Next Step Prompts
- Generate a Mermaid.js flowchart visualizing the
cluster_and_finalize_dataframe
’s internal ML pipeline, from preprocessing to cluster naming. - Draft a 200-word ‘Executive Summary’ for this entry, focusing on the strategic business value of automated keyword clustering for content marketing managers.