- Jump to Gemini GSC Research
- Jump to Grok 3 GSC Research
- Jump to ChatGPT GSC Research
The Value of Staying Open to Serendipity
It’s nice to have particular subject-matter in mind when you begin typing, but it’s also nice to keep yourself open to happy accidents. This is a high exposure point for a black swan event meeting a butterfly effect event. You never know exactly what is going to be the spark that ignites some tinder you hardly knew you were sitting on top of.
Leveraging AI Tools for Content Enhancement
The last article I wrote, I kept open-ended. It is still open ended. I did though get Claude 3.7 Sonnet to layer-in my headlines, subheads and such. It’s a wonderful non-programming use for Cursor AI. Claude has more smarts than just coding, and even though VSCode/Cursor are not the most wonderful writing environments (I’m writing this in nvim, which is), it is a good stop for your markdown text-file on its way to the public.
Gemini 2.5’s Role in Content Refinement
After such headlining (sometimes before) a great stop is Gemini 2.5 Experimental. I started using the Gemini family of LLMs for this purpose because it was the first to reach that 128K token level which was large enough for my articles, so it was the first able to provide title and headline suggestions, along with a summary and its opinion on the article for revision and refinement purposes. But the habit stuck and I’m glad it did, because Gemini 2.5’s ability to do this sort of work is outstanding. All 3 things I ask it for, the descriptions, title & headlines, and AI-opinion are surprisingly impressing me. It is not AI-schlock. It is making my own dense, rambling writing more accessible to others and giving me all sorts of ideas for the forthcoming distillation-to-book projects.
Building a Workflow Automation System
But before I get to that project, I have to get to this project. It’s a long time in coming and could change my workflow forever. Let’s break it down, banking this magical bit of code for myself and others once and for all.
These are project that will be bottled as Pipulate workflows, where nobody using it will really ever have to look at the Python code. They’ll just run the workflow and do what the prompts tell them to do. This is me making one. It is also one of those super-prompts for the AI coding assistants that will be helping me. And we begin with a bit of storytelling panache and flair (haven’t we already?).
The Writing Process and Data Integration
I like to write. I write a lot. I try to drudge out all I’ve got. I dump the thoughts, the easy part. To implement? That is the art! So here we are, the headlines in. But there’s no title. No, don’t begin! We need more data, don’t you see? The place to start is GSC. It’s got impressions and clicks too, plus avg pos to name a few. There will be some we can compute with great upside - low-hanging fruit! I failed to target with a page more to that point. We need it sage! But only if it aligns well with this here text. How can we tell?
Technical Implementation Guide
Okay, enough of that. Let’s be explicit. The first step will be walking through the use of the Google Cloud Platform (GCP) or whatever it’s called these days for the best way to get the easiest authentication method possible for something running as a Python script locally. It’s probably going to be a client secret JSON file or possible a service account. There’s also 1-off app passwords. I’m open to possibilities. Better still, enumerate the pluses and minuses and give explicit instructions on doing the JSON file approach.
Setting Up GSC Access Through GCP
After that, there’s activating access to GSC from GCP. It’s admin stuff. Maybe it’s part of the above step, I don’t know. Organize it however you like. But do acknowledge that this is easy if you’re doing it for your own site that you are the admin owner for, and the nuance if you’re added to someone else’s GSC for that site, or it’s under Google Suite where the admin will have to specifically enable GSC API access for you. Don’t just repeat what I say here. Do the due diligence research and really cover the nuance.
Understanding the GSC API Structure
The next step is understanding the API. There’s a lot of parts here like the fact that downloading GSC data like impressions and such is different sitemaps and site management. Talk about the API as if to a newb, talking about the different kinds of things you can do with the API, but zeroing in on the fact that we’re in this to exceed the 1000-row download limit from the web user interface.
Authentication and Initial Testing
Next, we want to blast out the pipes and make sure we get over the authentication hurdle. This means the equivalent of a Hello World for Google OAuth – no easy thing in its own right. Assume the user has the JSON secret file and GSC API access turned on through GCP. Provide the minimal script to demonstrate to them that they have access, perhaps by merely enumerating the site domains accessible to them with their login access.
Data Dimensions and Metrics Explained
Following that, we need to explain the issues of dimension and metrics to the user. They probably don’t understand that metrics add columns and domains force fewer or greater rows to be returned, and the aggregation of the numbers per row. Explain as if to a newb. For example, you can ask for metrics per keyword. You can also ask for metrics per URLs. And in an amazing twist, you can ask for metrics per URL per keyword! In this way you can have extremely granular knowledge of which keywords lead to which URLs…
Time Period Specifications and Limitations
…and for what time-periods! We have to cover the way time-periods are specificity in the API, including those special words that cover common time-periods and the actual patterns when specifying specific time-periods. When talking about these time-periods, we have to explain how to get a good newest starting date, because all the data isn’t necessarily available up to today. It’s going to only be available as of 2 or 3 or 4 days ago. We don’t know! So there has to be a method of figuring out the most recent date we can use in our date ranges. Likewise, we need an oldest data that data is available, which is 16 months ago. There’s probably language for “everything available” and there’s also techniques for knowing what those dates actually are. Cover both.
Understanding Data Structure and Query Options
Did I mention platforms and regions? Are they metrics? Are they dimensions? They certainly change the “shape” of data we can request. Is the returned data actually hierarchical like all this implies, or does it come back in simple row and column format? Speaking of which, also cover the issue of aggregate queries that get everything you want in one pass (plus paging), versus the ability to fire off entirely separate queries if you wanted a more day-by-day or week-by-week approach. Just talk about options here, don’t use up your tokens giving code examples. We are going to focus on a single query with paging.
Executing the Core Query
Okay, finally the query we want to execute. We want to give a begin date, an end date, and both keyword (maybe it’s query) and url (maybe it’s page) fields. This will result in more overall rows than just asking for query or page, because queries will likely have lead to more than one page. So each page will have a list of the queries that led to it.
Managing API Limitations and Data Storage
Last time I did this, we could download up to 50,000 rows per API-query with paging. And there’s a number-of-pages parameter in there somewhere. I think with the number of rows set to 1,000, it’s 50 requests to “empty it out”. But based on the activity on the site, it might become empty before you reach that. So the end goal here is a “banked” (saved to the hard-drive) .csv (probably using Pandas) for each paging step of the query. We want a nice naming convention that we can use in subsequent steps to load them back in, concatenate them and do further analysis on them as a single Pandas dataframe. But if it fails in the middle, we don’t want to have to re-execute any of our already successful queries. We’re going to be mindful of quotas.
Implementing Smart File Management
In fact on re-running the script, it should look to see if those CSVs that it would be asking for and saving are already there, making it safe to re-run without re-executing the same queries and wasting the daily quota. The file naming convention should use any of the parameters that are variable, such as site, start-date and end-date. Things like our URL, keyword, platform and region sections are generally going to be static (consistently set one way and not changed, or not part of the API-call at all), and so not part of the filename.
All files should be stored locally relative to where the script is run. I don’t want to see anything filesystem-dependent or fragile. Make no assumptions about file layouts, data directories, or any of that nonsense. This will be run as a Jupyter Notebook under Nix Flakes. The repo folder, the script folder and data folder are all the same thing, so paths should all be relative, and frankly non-existent. There is no appending path parts through either os or pathlib Path, though they can be used to check for file existence to prevent wasteful API calls.
Technical Requirements and Constraints
We do not need asynchronous behavior, so I don’t want to see any async/await. We are keeping the Python simple. You may use the super-popular Requests package, or alternatively, httpx. This is not a web app. I don’t want to see FastAPI under any circumstances. asyncio is acceptable if you think it best, but I don’t want to see it complicating the code for a newb.
The visual feedback while the script runs should be fun to watch for the newb. I the steps to be “narrated” by the output of the script. You don’t have to go overboard, but there is a definite educational and entertainment aspect to all this. We are making this entire process more clear and accessible than it probably ever has been on the web before.
Analyzing Keyword Opportunities
Once we have this data, we are going to look for keywords that qualify as striking distance or low hanging fruit with significant traffic upside potential. So once we have this dataframe, we will find 2-word, 3-word, 4-word and 5-word combinations that are in position >=11. In old SEO terms, that’s not in the top-10. Things aren’t really that way anymore, but for the sake of an easy exercise, that’s how we’re defining striking distance. The only qualified candidates are ones that are not performing any better than this on any other URL. Remember we have keywords per URL, so keywords and their respective positions may appear more than once on the list. So our striking distance candidates are only those that don’t appear better than position 11 on any URLs.
Integrating SEMRush Data
We are then going to take the resulting list of candidate striking distance keywords as a 1-column list and bring it over to SEMRush where we will use their keyword tool to get a detailed keyword report that will include search volumes and other keyword suggestions. We will bring that back into our Python Notebook as a new CSV turned into a Pandas DataFrame and join it so that the original clicks, impressions and position data is on the combined list. That means there will be as many rows on the resulting joined dataframe as there were in the SEMRush CSV, and there will conditionally be additional data on those rows that matched GSC data. The fact that keywords appear more than once on the GSC side might have to be dealt with using an aggregate query to collapse it down to just keywords for an intermediate table before the join to SEMRush. Any SEMRush rows not matching a GSC keyword will have blank cells on the extended GSC rows.
Final Content Optimization Process
The result should be an excellent selection of striking distance (aka low-hanging fruit) keyword options for the site as a whole. We then upload that entire list to the LLM along with an article such as this that already has lots of good keyword ideas layered in with the Claude provided headlines and the Gemini description, title/headline ideas and opinion of the article. At that point, the LLM can be asked to list the most likely 2-grams, 3-grams, 4-grams and 5-grams for an article (like this), and we can make the best picks for matches (if any) from the striking distance list.
Implementing Content Changes in Jekyll
The choice if we end up going that way after this long exercise, which may actually not be the case if there’s not a good match, will be worked into the article headline, permalink and filename, which through the Jekyll SSG system manifests as the URL, title tag and headline.
Project Implementation Steps
Is that all clear? I know it’s a lot, but we can do this with multiple passes if necessary. The most important thing to get this underway is the GSC access stuff at the beginning. So, we start there and go for as long as you can with enough detail to do this project justice. FYI, all your instructions will be published immediately below this prompt-text as an article on the URL we select for this article using this process.
Expected Growth Trajectory
For your entertainment, the working filename before I change it is hockeystick-curve.md, representing the hockeystick-shaped curve that’s going to result in the GSC clicks and impressions line (up and to the right) after I actually lather, rinse and repeat this project a few times on a few articles.
Google Gemini 2.0 Experimental’s Take (So Far)
Okay, here are the components you requested:
Author’s First-Person Summary (Long Meta Description Style):
In this article, I’m sharing my journey and technical plan for building a powerful content optimization workflow; I detail how I intend to use Python to tap into the Google Search Console API, moving beyond its interface limits to find ‘striking distance’ keywords, integrate SEMRush data for richer insights, and leverage AI tools like Claude and Gemini 2.5 to refine articles, ultimately aiming to systematically improve SEO performance and drive significant traffic growth by optimizing content based on this data-driven process.
Title & Headline Ideas:
- Titles:
- Automating SEO Growth: My Python & GSC API Workflow
- Building a Data-Driven Content Engine with Python, GSC, and AI
- Unlocking ‘Striking Distance’ Keywords via the GSC API
- From GSC Data to Optimized Content: A Detailed Workflow
- My SEO Automation Project: Python, APIs, and LLMs
- Leveraging GSC API & SEMRush for Advanced Content Optimization
- Headlines (Sub-sections):
- Why Automate SEO Data Analysis?
- Harnessing AI for Initial Content Structuring (Claude & Gemini)
- The Core Goal: A Reusable Python Workflow for GSC Data
- Step 1: Setting Up GCP for GSC API Access
- Understanding GSC API Authentication Methods
- Navigating GSC API Nuances: Dimensions, Metrics, and Time Periods
- Querying GSC Beyond 1000 Rows: The Python Approach
- Managing API Quotas and Smart Data Storage
- Identifying ‘Striking Distance’ Keywords in Your Data
- Enhancing Insights with SEMRush Integration
- The Final Step: AI-Assisted Content & Keyword Optimization
- Implementing SEO Changes for Tangible Growth (Jekyll Example)
AI Opinion of the Article:
This article provides a highly detailed and ambitious blueprint for a sophisticated SEO content optimization workflow. Its strength lies in its comprehensive scope, outlining a practical integration of various powerful tools: Python scripting, the Google Search Console API, SEMRush, and modern LLMs (Claude 3.7 Sonnet, Gemini 2.5 Experimental). The author clearly articulates the rationale behind each step, from overcoming GSC UI limitations to identifying specific keyword opportunities and leveraging AI for refinement.
The piece functions as both a personal project log and a technical guide. The step-by-step breakdown of accessing and utilizing the GSC API, including considerations for authentication, data structure, quotas, and error handling, is particularly valuable for readers looking to implement similar systems. The explicit instructions requested for the Python implementation, focusing on clarity and avoiding unnecessary complexity (like async unless essential), demonstrate a practical approach aimed at creating a robust and reusable tool.
However, the sheer density of information and the breadth of topics covered (GCP setup, API specifics, Python coding, data analysis, SEMRush integration, LLM prompting, CMS implementation) might be overwhelming for a true beginner, despite the author’s intention to explain concepts clearly. It bridges intermediate-to-advanced technical skills with strategic SEO thinking.
Overall, it’s an impressive and valuable piece outlining a powerful, data-driven approach to content optimization. It serves as an excellent guide for anyone serious about automating and scaling their SEO efforts through custom tooling and data analysis. The project itself, if executed as described, promises a significant enhancement to the author’s content strategy.
Super-Prompt to Research-Request Distillation
I didn’t expect this at first, but given the response Google 2.5 Experimental gave to this prompt, I am no longer going to feed the entire article to Google Deep Research. I think it’s better without the full context.
Me: Excellent! Now add the additional commands to take the intent that you see expressed here and turn it into the research instructions to carry out the work. I plan on feeding it to the Deep Research AIs so that it can really dig into the API documentation and such before attempting to put together the code. It’s sort of a one-shot coding test given a sufficiently detailed prompt.
Google 2.5 Experimental: Okay, here are the detailed research instructions, formulated as commands for a Deep Research AI, based on the intent and technical requirements expressed in the article. The goal is to gather all necessary information before attempting to write the Python code.
Overall Goal: Collect comprehensive information and examples necessary to build a Python script that retrieves detailed performance data from the Google Search Console (GSC) API, processes it to find “striking distance” SEO keyword opportunities, manages data efficiently, and integrates with external tools/processes like SEMRush (via manual export/import) and LLMs. The final script should be robust, user-friendly for a “newb” running it locally, and adhere to specific technical constraints (Python, standard libraries, no async unless vital, relative paths, mindful of quotas).
Research Instructions:
Phase 1: Authentication and API Access
- GCP Setup & Authentication Methods:
- Research and document the current, recommended methods for authenticating a locally-run Python script to access Google APIs, specifically the GSC API, via Google Cloud Platform (GCP).
- Focus on Service Accounts (using JSON key files) and OAuth 2.0 Client IDs (using client secret JSON files).
- Provide step-by-step instructions for setting up each method within the GCP console, assuming the user has a GCP project. Include creating credentials and downloading the necessary JSON files.
- Analyze and list the pros and cons of each authentication method specifically for this use case (local, potentially long-running script, user-initiated).
- Detail the exact steps required to enable the Google Search Console API within a GCP project.
- Identify and specify the minimum required IAM roles and permissions needed for the chosen credential type (Service Account or OAuth Client) to successfully read GSC performance data.
- GSC Access Nuances:
- Investigate and clearly explain the differences and required steps for enabling API access under varying GSC ownership/permission scenarios:
- Scenario A: The user running the script is the verified owner of the website property in GSC.
- Scenario B: The user has been granted delegated access (e.g., Full or Restricted user) to the property via the GSC interface by the owner.
- Scenario C: The website property is managed under a Google Workspace (formerly G Suite) account, potentially requiring administrator actions to enable API access for specific users or service accounts.
- Provide verified information, referencing official Google documentation where possible.
- Investigate and clearly explain the differences and required steps for enabling API access under varying GSC ownership/permission scenarios:
Phase 2: Understanding the GSC API for Performance Data
- API Overview & Endpoints:
- Provide a summary of the current Google Search Console API version and its capabilities.
- Identify the exact API endpoint(s) used to query website performance data (clicks, impressions, CTR, position).
- Document the request method (e.g., POST).
- Dimensions & Metrics:
- List all available
dimensions
for the performance data endpoint (e.g.,query
,page
,date
,country
,device
,searchAppearance
). - List all available
metrics
(e.g.,clicks
,impressions
,ctr
,position
). - Explain the concept of
dimensionFilterGroups
andaggregationType
. - Critically, explain how combining dimensions (specifically
query
ANDpage
) affects the data granularity and the structure of the rows returned. Confirm this returns keyword-per-URL data.
- List all available
- Request Structure & Parameters:
- Detail the structure of a typical API request payload (JSON) for performance data.
- List and explain all key parameters, including:
siteUrl
(or equivalent property identifier).startDate
,endDate
(required formats, including relative terms if available, e.g., “last 7 days”).dimensions
(how to specify multiple).rowLimit
(explain its function).startRow
(explain its function for pagination).- Any other relevant parameters (e.g.,
dataState
,type
).
- API Limitations & Constraints:
- Determine and document the maximum historical data lookback period (confirm if it is still 16 months).
- Research the typical data latency: how many days old is the most recent data usually available via the API?
- Provide a reliable method (ideally an API call, if one exists, or a documented best practice) to programmatically determine the actual latest date for which data is available before making a query.
- Document the maximum
rowLimit
allowed per API request for performance data. - Research and document the current daily API quotas (queries per day per project/user). Provide links to official Google documentation on GSC API quotas and limits.
- Explain the API’s pagination mechanism using
startRow
androwLimit
. Confirm how to detect when all data for a query has been retrieved.
Phase 3: Python Implementation Details (Pre-Code Research)
- Google API Client Library:
- Identify the official Google API Client Library for Python.
- Provide basic, working Python code examples using this library for:
- Authentication: Snippets demonstrating how to load credentials and authenticate using both the Service Account JSON file method and the OAuth Client Secret JSON file method.
- “Hello World” Test: A minimal script to list the website properties accessible to the authenticated user (verifies authentication and basic API communication).
- Core Data Query: Constructing and executing a performance data query requesting
query
andpage
dimensions for a specified date range. - Pagination: Implementing a loop structure to handle pagination correctly using
startRow
androwLimit
to retrieve all results for a query that exceeds the single-request limit. - Basic Error Handling: Show how to wrap API calls in try/except blocks to catch common errors (e.g.,
googleapiclient.errors.HttpError
) and identify potential quota or permission issues from the error response.
- Secure Credential Handling & File Management:
- Research and recommend best practices for storing and accessing the API credential JSON files securely in a local script environment. Emphasize avoiding hardcoding credentials.
- Provide Python code examples using
os
orpathlib
to check for the existence of potential output CSV files based on relative paths. Ensure the method is platform-independent and adheres to the constraint of not needing complex path manipulations (assume script, data files are in the same directory). - Propose a clear and robust file naming convention for the output CSV files (saved per page of API results). The convention must include variable parameters like
siteUrl
(sanitized for filename use),startDate
, andendDate
to allow caching and prevent re-running identical queries.
Phase 4: Data Processing and Analysis (Pandas)
- Pandas Operations:
- Provide standard Python code snippets using the Pandas library for:
- Reading multiple CSV files (matching the proposed naming convention) from a directory into a single DataFrame.
- Filtering the combined DataFrame to isolate rows where
position
is >= 11. - Performing analysis (e.g., using
groupby()
andfilter()
or other methods) to identifyquery
values that only appear withposition >= 11
across all pages they are associated with in the dataset. - Aggregating this filtered data by
query
, potentially calculating average position, total clicks/impressions for these “striking distance” keywords. - Exporting the final list of unique “striking distance”
query
values to a simple format (e.g., single-column CSV or Python list) suitable for external use (like manual input into SEMRush). - Demonstrating how to perform a left join between the GSC DataFrame (potentially the aggregated one) and a hypothetical external DataFrame (loaded from a CSV simulating the SEMRush export), using the
query
/keyword column as the key.
- Provide standard Python code snippets using the Pandas library for:
Phase 5: Integration with External Tools (Conceptual Outline)
- SEMRush Data:
- Briefly describe the typical data fields found in a keyword export from tools like SEMRush’s Keyword Magic Tool or Keyword Overview (e.g., Keyword, Search Volume, Keyword Difficulty, CPC, Competition). (No API research needed, assume manual CSV import).
- LLM Prompting Strategy:
- Outline general strategies for prompting an LLM (like Gemini 2.5):
- Provide the LLM with the target article’s text.
- Provide the LLM with the curated list of “striking distance” keywords (potentially enriched with SEMRush data like volume).
- Task the LLM to identify the most relevant n-grams (2-word, 3-word, 4-word, 5-word phrases) already present in the article text.
- Task the LLM to compare the article’s existing n-grams/themes with the provided “striking distance” keyword list and suggest which keywords (if any) are a strong thematic match and could be naturally integrated or emphasized further in the article’s title, headlines, or body.
- Outline general strategies for prompting an LLM (like Gemini 2.5):
Output Format: Please provide the research findings in a structured report. Include code examples directly where requested (Phase 3 and 4). Reference official documentation links wherever possible, especially for API limits, quotas, and setup instructions.
Google Deep Research
I am definitely going with Google Deep Research for this today. I’m hoping for the lightning bruiser advantage. Google can dedicate massive resources to this and may have the insider’s advantage on the data itself. I also changed my mind about feeding the whole above article as the prompt and gave just what Google 2.5 Experimental structured and suggested.
Looking at the amount of sites Google checks in the “Researching websites” step is staggering. I am quite certain they do not need to send a browser out like everyone else, because that part’s already done thanks to their crawl & index infrastructure. It’s at over 500 sites checked, and that’s just on the Google OAuth issues alone, HAHAHA! Even LLMs think it’s difficult.
Google Deep Research:
Building a Python Script for Google Search Console Performance Data Analysis
This report details the comprehensive information and examples necessary to develop a Python script for retrieving and analyzing performance data from the Google Search Console (GSC) API. The script aims to identify “striking distance” SEO keyword opportunities, manage data efficiently, and integrate with external tools such as SEMRush and Large Language Models (LLMs). The final solution is designed to be robust and user-friendly for individuals with limited technical expertise running it locally, while adhering to specific technical constraints.
Authenticating with the Google Search Console API
Establishing secure and reliable authentication is the foundational step for interacting with the Google Search Console API. Two primary methods are recommended for locally run Python scripts: Service Accounts and OAuth 2.0 Client IDs 1. Both approaches leverage the Google Cloud Platform (GCP) for credential management.
Service Accounts
Service Accounts are non-human Google accounts that can be used by applications to access Google APIs programmatically 2. This method utilizes a JSON key file downloaded from GCP for authentication.
GCP Setup Steps:
- Create a GCP Project: Navigate to the Google Cloud Console and either select an existing project or create a new one 1.
- Enable the Google Search Console API: Within your chosen project, go to “APIs & Services” and then “Library.” Search for “Google Search Console API” and click “Enable” 1.
- Create a Service Account: Go to “IAM & Admin” and then “Service Accounts.” Click “Create service account” and provide a name and description 5.
- Grant IAM Roles: Assign the necessary IAM roles to the service account. The minimum required roles are detailed in a subsequent section 5.
- Create and Download JSON Key: Navigate to the “Keys” tab of the newly created service account. Click “Add Key” and then “Create new key.” Select “JSON” as the key type and click “Create.” This will download a JSON file containing the private key to your local machine 5.
Pros and Cons:
- Pros: Service accounts offer a straightforward authentication mechanism for automated scripts as they do not require interactive user consent for each execution once configured 8. Many online tutorials utilize this method, providing ample resources for implementation 10.
- Cons: Google recommends OAuth 2.0 Client IDs as a more secure alternative 10. Service account keys are long-lived credentials, and if compromised, they can pose a significant security risk 10. Storing the JSON key file locally requires careful management to prevent unauthorized access 10.
OAuth 2.0 Client IDs
OAuth 2.0 Client IDs represent a specific application that requests access to Google APIs on behalf of a user 12. This method involves an initial authorization flow where the user grants the script permission to access their Google Search Console data.
GCP Setup Steps:
- Create a GCP Project: As with Service Accounts, begin by selecting or creating a project in the Google Cloud Console 1.
- Enable the Google Search Console API: Enable the API within your project by following the same steps as outlined for Service Accounts 1.
- Configure the OAuth Consent Screen: Go to “APIs & Services” and then “OAuth consent screen.” Select the “External” user type and fill in the required application details, such as the app name, support email, and developer contact information 1.
- Create OAuth 2.0 Credentials: Navigate to “APIs & Services” and then “Credentials.” Click “Create credentials” and select “OAuth client ID.” Choose “Desktop app” as the application type and click “Create.” This will generate a client ID and client secret, which can be downloaded as a JSON file 1.
Pros and Cons:
- Pros: Google recommends OAuth 2.0 as the more secure method for accessing user data 10. It requires explicit user consent, and the generated access tokens are typically short-lived, limiting the potential for misuse if compromised 10. Users can also revoke the granted permissions at any time 10.
- Cons: The initial setup can be slightly more complex than using Service Accounts, involving the configuration of the OAuth consent screen and the authorization flow 10. The script will require an initial manual authorization step from the user unless refresh tokens are implemented for long-term access 16.
Enabling the Google Search Console API in GCP
Regardless of the chosen authentication method, the Google Search Console API must be explicitly enabled within the GCP project. This can be done through the Google Cloud Console by navigating to “APIs & Services” and enabling the API from the library 1. Alternatively, the Google Cloud CLI can be used with commands such as gcloud services enable searchconsole.googleapis.com 4.
Minimum Required IAM Roles and Permissions
To successfully read Google Search Console performance data, specific IAM roles and permissions are necessary for both Service Accounts and OAuth Client IDs.
Service Accounts
The service account needs to be granted permissions to access the Google Search Console data. This is typically achieved by adding the service account as a user to the Google Search Console property with appropriate permissions 18. Within the Google Search Console interface, under “Settings” and then “Users and permissions,” the service account’s email address (found in the downloaded JSON key file) should be added with at least “Full” or “Owner” permissions to allow data retrieval via the API 19. On the GCP side, the service account might need a basic “Viewer” role on the project itself, although the primary authorization happens at the Google Search Console property level 5.
OAuth Client IDs
With OAuth 2.0, the permissions are granted by the user during the authorization flow. The script will request specific scopes, which define the level of access it needs. For read-only access to Google Search Console performance data, the scope https://www.googleapis.com/auth/webmasters.readonly is sufficient 21. The user will be prompted to grant the application permission to access their Search Console data with this scope 16. No specific IAM roles need to be assigned to the OAuth Client ID within GCP for accessing the Google Search Console API, as the authorization is user-centric 22.
Enabling API Access Under Different GSC Ownership Scenarios
The steps for enabling API access can vary depending on how the user owns or has access to the Google Search Console property.
- User is the Verified Owner: If the user has verified ownership of the website property in Google Search Console, they have full control and can directly add the service account as a user with the necessary permissions 19. For OAuth 2.0, the user can directly authorize the script to access their data 21.
- User has Delegated Access: If the user has been granted delegated access (e.g., as a “Full user” or “Owner”) by the verified owner, they typically have the same capabilities to add a service account user or authorize an OAuth application 19. However, only an “Owner” can manage other users and their permissions 19.
- The Property is Managed Under a Google Workspace Account: When the Google Search Console property is associated with a Google Workspace account, the process remains similar. The user with the appropriate permissions within the Workspace-managed property can add a service account or authorize an OAuth application 24. Domain-wide delegation can also be configured by a Google Workspace super administrator to authorize service accounts to access data on behalf of users without requiring individual consent for each user 24. To use this, the service account’s client ID needs to be authorized in the Google Admin console, and the necessary OAuth scopes for the Google Search Console API need to be specified 6. The Google Search Console API needs to be enabled in the associated GCP project 25.
Google Search Console API Endpoint and Parameters
The current version of the Google Search Console API is v1 26. The primary endpoint used to query website performance data is https://www.googleapis.com/webmasters/v3/sites/{siteUrl}/searchAnalytics/query 1.
Available Dimensions and Metrics
For the performance data endpoint, the following dimensions and metrics are available:
Dimensions:
- date: The date of the data.
- query: The search query that triggered impressions or clicks.
- page: The URL of the page on your site that appeared in the search results.
- country: The country where the search originated.
- device: The type of device used for the search (desktop, mobile, tablet).
- searchAppearance: The type of search result appearance (e.g., AMP, Rich Results).
Metrics:
- clicks: The total number of clicks your site received from Google Search results for the specified dimensions.
- impressions: The total number of times any URL from your site appeared in Google Search results for the specified dimensions.
- ctr: The click-through rate, calculated as clicks / impressions.
- position: The average position of your site’s URLs in Google Search results for the specified dimensions.
dimensionFilterGroups and aggregationType
The dimensionFilterGroups parameter allows you to filter the performance data based on specific dimension values. You can specify multiple filters within a group, and you can also have multiple filter groups with different logical operators (AND or OR) between them, enabling complex data filtering 1. The aggregationType parameter is used to specify how the data should be aggregated. For the performance report, common values include byDate, byPage, byQuery, etc., allowing you to group the results based on the chosen dimensions 1.
Combining query and page Dimensions
When the query and page dimensions are combined in a performance data query, the API returns data where each row represents a unique combination of a search query and a specific URL on your website that appeared in the search results for that query 1. This provides keyword-per-URL data, which is crucial for identifying which keywords are driving traffic to specific pages and understanding their performance metrics (clicks, impressions, CTR, position) for that particular URL.
Structure of a Typical JSON Request Payload
A typical JSON request payload for retrieving performance data from the Google Search Console API includes several key parameters:
JSON
{
"siteUrl": "your-website-url.com",
"startDate": "YYYY-MM-DD",
"endDate": "YYYY-MM-DD",
"dimensions": ["query", "page"],
"rowLimit": 1000,
"startRow": 0,
"dimensionFilterGroups":
}
],
"aggregationType": "byQuery"
}
Key Parameters:
- siteUrl: Specifies the website property in Google Search Console for which to retrieve data 1. This should be the exact URL as listed in the Search Console account.
- startDate: The starting date for the data range, in YYYY-MM-DD format 1.
- endDate: The ending date for the data range, also in YYYY-MM-DD format 1.
- dimensions: An array of dimensions to segment the data by (e.g., [“query”, “page”]) 1.
- rowLimit: The maximum number of rows to return in the response. The maximum value is typically 1000 1.
- startRow: The zero-based index of the first row to return. Used for pagination 1.
- dimensionFilterGroups (Optional): An array of filter groups to apply to the data 1.
- aggregationType (Optional): Specifies how the data should be aggregated 1.
Data Lookback, Latency, and Availability
The Google Search Console API typically allows access to historical data for the past 16 months 1. Data latency is usually around 1-2 days, meaning that the most recent data available might be from the previous day or two 1. There is no direct method within the API to programmatically determine the absolute latest available data. However, a practical approach is to query data up to the current date and handle potential empty results for the very last day.
rowLimit and Daily API Quotas
The maximum value for the rowLimit parameter in a single query is generally 1000 1. The current daily API quotas vary depending on the specific API method and usage patterns. It is advisable to consult the official Google Search Console API documentation for the most up-to-date information on quotas and limits 1. Exceeding these quotas can lead to temporary throttling or errors.
API Pagination Mechanism
To retrieve more than the rowLimit in a single response, the Google Search Console API employs pagination. This is managed using the startRow parameter. The initial request should set startRow to 0. The API response will indicate if more rows are available. To retrieve the next set of rows, you need to make subsequent requests, incrementing the startRow parameter by the rowLimit used in the previous request. This process continues until no more rows are returned in the response, indicating that all data has been retrieved 16.
Python Code Examples Using the Google API Client Library
The official Google API Client Library for Python (google-api-python-client) provides convenient tools for interacting with Google APIs, including the Search Console API. Ensure this library is installed (pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib).
Authentication Using Service Account JSON File
Python
from google.oauth2 import service_account
from googleapiclient.discovery import build
# Path to your downloaded service account JSON key file
SERVICE_ACCOUNT_FILE = 'path/to/your/service_account_key.json'
# Define the scope for Google Search Console API (read-only)
SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
def authenticate_service_account():
"""Authenticates with Google Search Console API using a service account."""
creds = service_account.Credentials.from_service_account_file(
SERVICE_ACCOUNT_FILE, scopes=SCOPES)
service = build('webmasters', 'v3', credentials=creds)
return service
if __name__ == '__main__':
gsc_service = authenticate_service_account()
# You can now use gsc_service to make API calls
print("Successfully authenticated using service account.")
Authentication Using OAuth Client Secret JSON File
Python
import os
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from google.auth.transport.requests import Request
# Path to your downloaded OAuth client secret JSON file
CLIENT_SECRET_FILE = 'path/to/your/client_secret.json'
# Define the scope for Google Search Console API (read-only)
SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
def authenticate_oauth():
"""Authenticates with Google Search Console API using OAuth 2.0."""
creds = None
# The 'token.json' file stores the user's access and refresh tokens, and is
# created automatically when the authorization flow completes for the first
# time.
if os.path.exists('token.json'):
creds = Credentials.from_authorized_user_file('token.json', SCOPES)
# If there are no (valid) credentials available, let the user log in.
if not creds or not creds.valid:
if creds and creds.expired and creds.refresh_token:
creds.refresh(Request())
else:
flow = InstalledAppFlow.from_client_secrets_file(
CLIENT_SECRET_FILE, SCOPES)
creds = flow.run_local_server(port=0)
# Save the credentials for the next run
with open('token.json', 'w') as token:
token.write(creds.to_json())
service = build('webmasters', 'v3', credentials=creds)
return service
if __name__ == '__main__':
gsc_service = authenticate_oauth()
# You can now use gsc_service to make API calls
print("Successfully authenticated using OAuth 2.0.")
Minimal Script to List Accessible Website Properties
Python
# Assuming you have a gsc_service object from either authentication method
def list_properties(service):
"""Lists the Search Console properties accessible to the authenticated user."""
try:
response = service.sites().list().execute()
if 'siteEntry' in response:
print("Accessible Search Console Properties:")
for site in response['siteEntry']:
print(f"- {site['siteUrl']}")
else:
print("No accessible properties found.")
except Exception as e:
print(f"An error occurred: {e}")
if __name__ == '__main__':
# Use either authenticate_service_account() or authenticate_oauth()
gsc_service = authenticate_oauth()
list_properties(gsc_service)
Code for Constructing and Executing a Performance Data Query
Python
# Assuming you have a gsc_service object and a site_url
def get_performance_data(service, site_url, start_date, end_date, dimensions):
"""Retrieves performance data from Google Search Console."""
request_body = {
'startDate': start_date,
'endDate': end_date,
'dimensions': dimensions,
'rowLimit': 1000 # Start with a reasonable limit
}
try:
response = service.searchanalytics().query(
siteUrl=site_url, body=request_body).execute()
return response.get('rows',)
except Exception as e:
print(f"An error occurred: {e}")
return
if __name__ == '__main__':
gsc_service = authenticate_oauth()
site = "your-website-url.com" # Replace with your website URL
start = '2024-01-01'
end = '2024-01-31'
dims = ['query', 'page']
data = get_performance_data(gsc_service, site, start, end, dims)
if data:
print("Retrieved performance data:")
for row in data:
print(row)
else:
print("No performance data found.")
Implementing Pagination to Retrieve All Results
Python
def get_all_performance_data(service, site_url, start_date, end_date, dimensions):
"""Retrieves all performance data using pagination."""
all_rows =
start_row = 0
row_limit = 1000
while True:
request_body = {
'startDate': start_date,
'endDate': end_date,
'dimensions': dimensions,
'rowLimit': row_limit,
'startRow': start_row
}
try:
response = service.searchanalytics().query(
siteUrl=site_url, body=request_body).execute()
rows = response.get('rows',)
if not rows:
break # No more data
all_rows.extend(rows)
start_row += row_limit
except Exception as e:
print(f"An error occurred during pagination: {e}")
break
return all_rows
Basic Error Handling for API Calls
The try…except blocks in the provided code examples demonstrate basic error handling. For more robust error management, it is recommended to specifically catch HttpError exceptions from the googleapiclient.errors module to access detailed error information returned by the API. This allows for more specific handling of different types of API errors, such as authentication failures, quota issues, or invalid requests.
Ensuring Security and Efficient Data Management
Implementing security measures and efficient data management practices is crucial for a robust and reliable script.
Best Practices for Securely Storing and Accessing API Credential JSON Files
To prevent unauthorized access to sensitive API credentials, it is essential to follow secure storage and access practices 10.
- Avoid Hardcoding: Never directly embed the contents of your JSON credential files within the Python script. This exposes the credentials in plain text within the code.
- Environment Variables: Consider storing the absolute paths to your credential files as environment variables on your operating system. Your script can then retrieve these paths from the environment variables, keeping the actual file locations out of the code itself.
- Relative Paths: For better portability and easier sharing of the script, utilize relative paths to your credential files. Store the credential files in a dedicated, non-public directory within your project’s file structure and refer to them using paths relative to the script’s location.
- .gitignore: If you are using a version control system like Git, ensure that your credential JSON files are included in the .gitignore file. This prevents them from being accidentally committed to your repository, where they could be exposed to a wider audience.
- Google Cloud Secret Manager (Advanced): For more advanced use cases and production environments, explore using Google Cloud Secret Manager. This service provides a secure and centralized platform for storing, managing, and accessing sensitive data like API credentials, offering features like versioning, access control, and audit logging.
Python Code Examples for Checking Existence of Output CSV Files Using Relative Paths
Before writing data to a CSV file, it’s often useful to check if a file with the same name already exists. This can prevent accidental overwriting of data. The os and pathlib modules in Python provide convenient ways to perform this check.
Python
import os
from pathlib import Path
def check_output_file_exists_os(filepath):
"""Checks if a file exists using os.path."""
if os.path.exists(filepath):
print(f"Output file '{filepath}' already exists.")
return True
else:
print(f"Output file '{filepath}' does not exist.")
return False
def check_output_file_exists_pathlib(filepath):
"""Checks if a file exists using pathlib."""
file_path = Path(filepath)
if file_path.exists():
print(f"Output file '{filepath}' already exists (using pathlib).")
return True
else:
print(f"Output file '{filepath}' does not exist (using pathlib).")
return False
if __name__ == '__main__':
output_file = 'gsc_performance_data.csv' # Relative path
check_output_file_exists_os(output_file)
check_output_file_exists_pathlib(output_file)
Clear File Naming Convention for Output CSV Files
Adopting a consistent and informative file naming convention for your output CSV files is essential for organization and easy identification of the data contained within each file. A recommended convention is to include key parameters in the filename, such as the website URL, the start and end dates of the data being exported, and a descriptive label indicating the type of data. For example:
- example.com_2023-01-01_2023-12-31_gsc_performance.csv
- my-blog_2024-01-01_2024-02-29_gsc_keywords.csv
- domain_example_2023-06-01_2023-06-30_gsc_page_data.csv
This convention allows for quick identification of the website, time period, and data type without needing to open the file. When implementing your script, ensure that the filename is generated programmatically using these relevant parameters.
Analyzing and Processing Data with Pandas
The Pandas library in Python is a powerful tool for data manipulation and analysis. It provides efficient data structures and functions for working with structured data, such as the performance data retrieved from the Google Search Console API.
Standard Python Code Snippets Using Pandas for Reading Multiple CSV Files
If you have exported Google Search Console data for different time periods or website properties into multiple CSV files, Pandas can easily read and combine these files into a single DataFrame for unified analysis.
Python
import pandas as pd
import glob
import os
def read_multiple_csv_pandas(file_pattern):
"""Reads multiple CSV files matching a pattern into a single Pandas DataFrame."""
all_files = glob.glob(file_pattern)
all_data =
for filename in all_files:
try:
df = pd.read_csv(filename)
all_data.append(df)
print(f"Successfully read: {filename}")
except Exception as e:
print(f"Error reading '{filename}': {e}")
if all_data:
combined_df = pd.concat(all_data, ignore_index=True)
return combined_df
else:
return pd.DataFrame()
if __name__ == '__main__':
# Assuming your files are named like example.com_YYYY-MM-DD_YYYY-MM-DD_gsc_performance.csv
file_pattern = '*_gsc_performance.csv' # Adjust the pattern as needed
combined_data = read_multiple_csv_pandas(file_pattern)
if not combined_data.empty:
print("nCombined DataFrame:")
print(combined_data.head())
else:
print("No CSV files found matching the pattern.")
Filtering for Rows Where position is Greater Than or Equal to 11
The concept of “striking distance” keywords refers to keywords for which your website already ranks, but not in the top 10 positions (typically positions 1-10). These keywords represent opportunities for improvement as even a small increase in ranking can lead to a significant increase in traffic. To identify these keywords, you need to filter the performance data for rows where the position metric is greater than or equal to 11.
Python
def filter_striking_distance(df):
"""Filters a DataFrame to include rows where position is >= 11."""
if 'position' in df.columns:
striking_distance_df = df[df['position'] >= 11]
return striking_distance_df
else:
print("Error: 'position' column not found in DataFrame.")
return pd.DataFrame()
if __name__ == '__main__':
# Assuming you have a Pandas DataFrame called combined_data
if not combined_data.empty:
striking_distance_data = filter_striking_distance(combined_data)
print("nDataFrame with position >= 11:")
print(striking_distance_data.head())
Identifying query Values That Only Appear With This Position Across All Associated Pages
To further refine the identification of truly “striking distance” keywords, it’s valuable to find queries that only appear with a position of 11 or greater across all the pages they rank for. This indicates keywords where you consistently miss the top 10.
Python
def identify_consistently_outside_top_10(df):
"""Identifies queries that only appear with position >= 11."""
if 'query' in df.columns and 'position' in df.columns:
grouped = df.groupby('query')['position'].agg(['min', 'max'])
consistently_outside = grouped[(grouped['min'] >= 11) & (grouped['max'] >= 11)]
return consistently_outside.index.tolist()
else:
print("Error: 'query' or 'position' column not found in DataFrame.")
return
if __name__ == '__main__':
# Assuming you have a Pandas DataFrame called striking_distance_data
if not striking_distance_data.empty:
consistent_queries = identify_consistently_outside_top_10(striking_distance_data)
print("nQueries consistently outside the top 10:")
print(consistent_queries)
Aggregating This Filtered Data by query
Once the “striking distance” keywords are identified, it’s useful to aggregate the associated performance data by query to get an overview of their total clicks, impressions, and average position.
Python
def aggregate_striking_distance_data(df):
"""Aggregates striking distance data by query."""
if 'query' in df.columns and 'clicks' in df.columns and 'impressions' in df.columns:
aggregated_data = df.groupby('query').agg(
total_clicks=('clicks', 'sum'),
total_impressions=('impressions', 'sum'),
average_position=('position', 'mean')
).reset_index()
return aggregated_data
else:
print("Error: Required columns not found in DataFrame.")
return pd.DataFrame()
if __name__ == '__main__':
# Assuming you have a Pandas DataFrame called striking_distance_data
if not striking_distance_data.empty:
aggregated_striking_data = aggregate_striking_distance_data(striking_distance_data)
print("nAggregated striking distance data by query:")
print(aggregated_striking_data)
Exporting the Unique “Striking Distance” Queries
After identifying the unique “striking distance” queries, you’ll likely want to export this list for further analysis or integration with other tools.
Python
def export_striking_distance_queries(queries, filename='striking_distance_queries.csv'):
"""Exports a list of striking distance queries to a CSV file."""
df = pd.DataFrame({'query': queries})
try:
df.to_csv(filename, index=False)
print(f"nStriking distance queries exported to '{filename}'")
except Exception as e:
print(f"Error exporting queries to CSV: {e}")
if __name__ == '__main__':
# Assuming you have a list of consistent_queries
if consistent_queries:
export_striking_distance_queries(consistent_queries)
Performing a Left Join with a Hypothetical External DataFrame Representing a SEMRush Keyword Export
To enrich the “striking distance” keyword analysis, you can integrate data from external tools like SEMRush. This often involves manually exporting keyword data from SEMRush and then joining it with your GSC data using Pandas.
Python
def join_with_semrush_data(gsc_df, semrush_file):
"""Performs a left join with hypothetical SEMRush data."""
try:
semrush_df = pd.read_csv(semrush_file)
merged_df = pd.merge(gsc_df, semrush_df, on='query', how='left')
print("nMerged DataFrame with SEMRush data:")
print(merged_df.head())
return merged_df
except FileNotFoundError:
print(f"Error: SEMRush file '{semrush_file}' not found.")
return gsc_df
except Exception as e:
print(f"Error joining with SEMRush data: {e}")
return gsc_df
if __name__ == '__main__':
# Assuming you have aggregated_striking_data and a SEMRush CSV file
semrush_data_file = 'semrush_keyword_export.csv' # Replace with your file path
final_df = join_with_semrush_data(aggregated_striking_data, semrush_data_file)
Leveraging LLMs for Content Enhancement
Large Language Models (LLMs) can be valuable in the process of optimizing content for “striking distance” keywords. By analyzing existing article text and the identified keywords (potentially enriched with SEMRush data), LLMs can help identify opportunities for improvement.
Description of Typical Data Fields Found in a Keyword Export from SEMRush
A typical keyword export from SEMRush often includes fields such as:
- Keyword: The specific search term.
- Search Volume: The estimated average monthly number of searches for the keyword.
- Keyword Difficulty: A score indicating how challenging it is to rank in the top positions for this keyword.
- CPC (Cost Per Click): The average cost advertisers pay for a click on an ad for this keyword.
- Competition: A metric reflecting the level of competition among advertisers for the keyword.
- Trend: Historical search volume trends for the keyword over time.
- SERP Features: Information about the types of search results appearing for the keyword (e.g., featured snippets, image packs).
This data provides crucial context for prioritizing “striking distance” keywords based on their search demand and the competitive landscape.
General Strategies for Prompting an LLM to Identify Relevant N-grams Already Present in a Given Article Text
To leverage an LLM to find instances of your “striking distance” keywords within your content, you can use prompts with the following structure:
- Provide the Article Text: Input the complete text of the article you want to analyze.
- Provide the List of Keywords: Include the list of “striking distance” keywords you have identified.
- Specify the Task: Clearly instruct the LLM to identify all occurrences of these keywords within the provided text.
- Request Output: Specify the desired output format. For instance, you might ask for a list of the found keywords along with the sentences or paragraphs where they appear.
Example Prompt: “Analyze the following article text and identify all instances of the keywords provided below. Please list each keyword found and the sentence in which it appears.
Article Text: [Insert your article text here]
Keywords: [Insert your list of striking distance keywords here]”
Suggestions for Prompting an LLM to Recommend Keywords for Natural Integration or Emphasis Within the Article
To prompt an LLM to suggest ways to naturally integrate or emphasize your “striking distance” keywords within an article, consider the following strategies:
- Provide Context: Give the LLM the article text and the list of “striking distance” keywords (ideally with search volume data from SEMRush).
- State the Goal: Clearly tell the LLM that you want suggestions on how to incorporate these keywords into the article to improve its search engine ranking without making the content sound unnatural or overly optimized.
- Request Integration Points: Ask the LLM to suggest specific sections or paragraphs where the keywords could be added or emphasized.
- Consider Semantic Relevance: Encourage the LLM to suggest semantically related terms or phrases that could also be beneficial for SEO and user engagement.
- Request Examples: Ask for specific examples of how the keywords could be naturally integrated into the existing text.
Example Prompt: “Based on the following article text and the list of ‘striking distance’ keywords with their corresponding search volume, suggest specific ways to naturally integrate or emphasize these keywords within the article to improve its search engine ranking. Ensure that the suggestions are contextually relevant and do not disrupt the natural flow of the content.
Article Text: [Insert your article text here]
Keywords and Search Volume: [Insert your list of keywords and their search volumes]”
Conclusion
This report has provided a comprehensive overview of the steps and information required to build a Python script for analyzing Google Search Console performance data. By utilizing the GSC API, the script can automate the identification of “striking distance” keyword opportunities, offering valuable insights for SEO strategy. The outlined authentication methods, API parameters, and Python code examples provide a solid foundation for developing this tool. Furthermore, the integration strategies with SEMRush and LLMs demonstrate how to enhance the analysis process and inform content optimization efforts.
The developed script offers significant value by automating a time-consuming process, allowing for more frequent and in-depth analysis of SEO performance. This can lead to better-informed content decisions and ultimately contribute to improved search engine rankings and increased organic traffic.
Potential next steps and further enhancements include:
- Automating the export and import process with SEMRush using their API if available, streamlining the data integration workflow.
- Implementing more advanced data analysis techniques with Pandas, such as identifying keyword trends over time or comparing performance across different website sections.
- Developing a user-friendly graphical interface (GUI) for the script to make it accessible to users without extensive coding knowledge.
- Exploring other Google Search Console API endpoints to retrieve additional data, such as URL inspection results or sitemap information, for a more holistic SEO analysis.
- Adding logging functionality to the script to track its execution, record any errors, and facilitate debugging.
Works cited
- How to Leverage Google Search Console API for Custom SEO Dashboards, accessed March 28, 2025, https://blog.coupler.io/google-search-console-api/
- Service accounts | Content API for Shopping - Google for Developers, accessed March 28, 2025, https://developers.google.com/shopping-content/guides/how-tos/service-accounts
- Create a service account - Google Workspace Admin Help, accessed March 28, 2025, https://support.google.com/a/answer/7378726?hl=en
- Enabling and Disabling Services | Service Usage Documentation - Google Cloud, accessed March 28, 2025, https://cloud.google.com/service-usage/docs/enable-disable
- How to Create a GCP Service Account for use with Google Drive …, accessed March 28, 2025, https://docs.vectorize.io/tutorials-and-how-to-guides/how-to-guides/create-a-gcp-service-account-for-google-drive
- Create access credentials | Google Workspace - Google for Developers, accessed March 28, 2025, https://developers.google.com/workspace/guides/create-credentials
- Create service accounts | IAM Documentation - Google Cloud, accessed March 28, 2025, https://cloud.google.com/iam/docs/service-accounts-create
- Using OAuth 2.0 for Server to Server Applications | Authorization - Google for Developers, accessed March 28, 2025, https://developers.google.com/identity/protocols/oauth2/service-account
- google cloud platform - GCP Authentication OAuth vs Service …, accessed March 28, 2025, https://stackoverflow.com/questions/73214596/gcp-authentication-oauth-vs-service-accounts
- Why is OAuth 2.0 Client IDs considered more secure than service …, accessed March 28, 2025, https://www.reddit.com/r/googlecloud/comments/1adw3mf/why_is_oauth_20_client_ids_considered_more_secure/
- Why You Should Migrate to OAuth 2.0 From API Keys - Auth0, accessed March 28, 2025, https://auth0.com/blog/why-migrate-from-api-keys-to-oauth2-access-tokens/
- Authenticate to Google Workspace APIs by using OAuth 2.0 client credentials | SAP, accessed March 28, 2025, https://cloud.google.com/solutions/sap/docs/abap-sdk/on-premises-or-any-cloud/latest/authentication-oauth-client-credentials
- Manage OAuth Clients - Google Cloud Platform Console Help, accessed March 28, 2025, https://support.google.com/cloud/answer/15549257?hl=en
- Google OAuth2 How to setup a client ID for use in desktop software?, accessed March 28, 2025, https://www.auronsoftware.com/kb/general/miscellaneous/google-oauth2-how-to-setup-a-client-id-for-use-in-desktop-software/
- OAuth 2.0 Simplified | What is Oauth and How Does it Work | FusionAuth, accessed March 28, 2025, https://fusionauth.io/articles/oauth/modern-guide-to-oauth
- Tutorial : Google Search Console API using Python with Google Colab - YouTube, accessed March 28, 2025, https://www.youtube.com/watch?v=ptWJkrd0vqc
- Enable and disable APIs - API Console Help - Google Help, accessed March 28, 2025, https://support.google.com/googleapi/answer/6158841?hl=en
- Google Search Console and Indexing APIs - Mirasvit, accessed March 28, 2025, https://mirasvit.com/docs/module-search-console/current/configuration/auth_file
- Managing owners, users, and permissions - Search Console Help, accessed March 28, 2025, https://support.google.com/webmasters/answer/7687615?hl=en
- IAM basic and predefined roles reference | IAM Documentation - Google Cloud, accessed March 28, 2025, https://cloud.google.com/iam/docs/understanding-roles
- Authorize Requests | Search Console API - Google for Developers, accessed March 28, 2025, https://developers.google.com/webmaster-tools/v1/how-tos/authorizing
- Using OAuth 2.0 to Access Google APIs | Authorization, accessed March 28, 2025, https://developers.google.com/identity/protocols/oauth2
- How to Add a User to Google Search Console - SEOmator, accessed March 28, 2025, https://seomator.com/blog/how-to-add-a-user-to-google-search-console
- Control API access with domain-wide delegation - Google Workspace Admin Help, accessed March 28, 2025, https://support.google.com/a/answer/162106?hl=en
- Enable Google Workspace APIs - Google for Developers, accessed March 28, 2025, https://developers.google.com/workspace/guides/enable-apis
- Permissions Reference for Google Cloud IAM | gcp.permissions.cloud, accessed March 28, 2025, https://gcp.permissions.cloud/iam/searchconsole
- Google Search Console API | Documentation | Postman API Network, accessed March 28, 2025, https://www.postman.com/api-evangelist/google/api/accf578b-a6bc-4e62-91f2-85d80f303d07/documentation/35240-70ca8b5e-a1a9-486c-8de5-437988443686
Notes on Gemini Deep Research
Wow, okay. Can you say Epic! I am very tempted to stop right here. But do you know what’s better than having one opinion on the way to do a thing? Having two! (ended up with going for 3)
In the end, Gemini only ended up citing 34 sites in the actual report, but it looked at 109 total sites. 74 of them were not cited in the final paper. The number it was reporting while it was researching was way higher, going way over 500 (I looked away), but I think that was from each time it looked at a site and was incrementing the counter even if it looked at the same page again later.
Formatting was again a challenge. I had to export it to Google Docs and download the markdown from there, with takes forever and gives no user feedback, almost like it didn’t work. And then about 5 minutes later, BAM! The markdown is in the download directory.
And with the formatting, there was the now-standard replacement of the vertical bar pipe character that’s used in title tags a lot and shows up in the Works cited section but messes up markdown formatting (as if table formatting), though it’s probably a Ruby Jekyll Kramdown thing.
And lastly, it inserted backslashes all over the place in the Python and JSON code in the markdown downloaded from Google Docs, and didn’t put the triple backtick language fencing around the Python or JSON blocks, so I had to add the language color syntax markdown by hand and remove all the backslashes.
Okay, I’m going to tap the $200/mo ChatGPT Pro, the little research engine that can… or so I thought.
Grok 3 Deep search
Newsflash! While I was waiting for ChatGPT 4.5 Deep research to finish, I also sent the prompt on over to Grok 3 to slam out its version while I was waiting:
Grok 3:
Key Points
- Research suggests that using a Service Account for Google Search Console (GSC) API authentication is suitable for local, automated scripts, avoiding interactive Oauth flows.
- It seems likely that the GSC API v1, with the
searchanalytics.query
endpoint, is used to fetch performance data like clicks, impressions, and positions. - The evidence leans toward a 16-month historical data lookback and a 2-3 day data latency for the most recent GSC API data.
- Combining
query
andpage
dimensions in API requests returns keyword-per-URL data, essential for identifying “striking distance” keywords (positions ≥ 11).
Authentication Setup
To set up authentication for your local Python script, create a GCP project at Google Cloud Console, enable the GSC API, and set up a Service Account. Download the JSON key file and add the Service Account email to your GSC property with “Full” permissions. This method is user-friendly for beginners and avoids repeated authorization.
API Data Retrieval
Use the GSC API to query performance data, focusing on dimensions like query
and page
, and metrics like position
. The API has a daily quota (check at API Console), with a maximum of 25,000 rows per request, requiring pagination for larger datasets.
Data Processing and Integration
Process the data with Pandas to filter for positions ≥ 11, identify keywords consistently in this range, and integrate with SEMRush data via CSV imports. For LLMs, outline a strategy to analyze article text against these keywords for thematic matches, suggesting content enhancements.
Comprehensive Report on Building a Python Script for GSC API Integration
This report provides a detailed guide for creating a Python script to retrieve and analyze Google Search Console (GSC) API data, focusing on “striking distance” SEO keyword opportunities, data management, and integration with external tools like SEMRush and LLMs. The script is designed to be robust, user-friendly for beginners, and adhere to specified technical constraints (Python, standard libraries, no async unless vital, relative paths, mindful of quotas).
Phase 1: Authentication and API Access
GCP Setup & Authentication Methods
For a locally-run Python script, the recommended authentication method is using a Service Account with a JSON key file, given its suitability for automation and avoiding interactive Oauth flows. Alternatively, Oauth 2.0 with Client IDs is possible but requires user interaction for initial authorization.
Step-by-Step Instructions for Service Account Setup:
- Go to Google Cloud Console and sign in.
- Create or select a project.
- Enable the GSC API under “APIs & Services” > “Library”.
- Navigate to “IAM & Admin” > “Service Accounts”, click “Create Service Account”, name it (e.g., “gsc-access”), and create.
- In the Service Account details, go to “Keys”, click “Add Key” > “Create new key”, select JSON, and download the file (e.g.,
gsc-service-account-key.json
). - Note the Service Account email from the list or the JSON file’s
client_email
field. - In GSC at search.google.com/search-console, add the Service Account email as a user with “Full” permissions under the property’s “Settings” > “Users and permissions”.
Pros and Cons of Service Account vs. Oauth 2.0:
- Service Account:
- Pros: No user interaction needed, ideal for automation, secure for long-running scripts.
- Cons: Requires property owner to add the email, potentially complex for “newbs” if not the owner.
- Oauth 2.0:
- Pros: Suitable for user-initiated scripts, standard for user authentication.
- Cons: Requires initial authorization flow, less convenient for repeated runs.
Enabling GSC API:
- In the GCP project, search for “Google Search Console API” in the API Library and click “Enable”.
Minimum IAM Roles and Permissions:
- The Service Account needs no specific GCP IAM role for GSC API access beyond being enabled in the project; permissions are managed via GSC property user roles (“Full” for read access).
GSC Access Nuances
- Scenario A: User is Verified Owner
- The user can directly add the Service Account email to the GSC property with “Full” permissions.
- Scenario B: User has Delegated Access
- If the user is a “Full” user (not owner), they cannot add new users; the property owner must add the Service Account email with “Full” permissions.
- Scenario C: Google Workspace Managed Property
- Similar process, but domain admins may need to manage access via Google Workspace admin console. The core steps remain adding the Service Account email in GSC “Users and permissions”.
Refer to Managing owners, users, and permissions for official guidance.
Phase 2: Understanding the GSC API for Performance Data
API Overview & Endpoints
- Current version: GSC API v1, providing access to search analytics, sitemaps, sites, and URL inspection.
- Performance data endpoint:
searchanalytics.query
, a POST request to fetch data like clicks, impressions, CTR, and position.
Dimensions & Metrics
- Dimensions:
date
,country
,device
,query
,page
,searchAppearance
,type
. - Metrics:
clicks
,impressions
,ctr
,position
. - DimensionFilterGroups: Used to filter data based on dimension values, e.g., filter by country.
- AggregationType: Specifies how to aggregate data when multiple rows match (default is by day).
- Combining
query
ANDpage
dimensions returns keyword-per-URL data, with each row representing a unique query-page pair, essential for granular analysis.
Request Structure & Parameters
- Typical request payload (JSON):
{ "startDate": "2023-01-01", "endDate": "2023-01-31", "dimensions": ["query", "page"], "metrics": ["clicks", "impressions", "ctr", "position"], "rowLimit": 10000, "startRow": 0 }
- Key Parameters:
siteUrl
: Property URL (e.g.,https://example.com
).startDate
,endDate
: Date range in YYYY-MM-DD format, relative terms like “last 7 days” not supported directly.dimensions
: List of dimensions, multiple can be specified (e.g.,["query", "page"]
).rowLimit
: Maximum rows per request (up to 25,000).startRow
: Starting row for pagination (0-based).- Other parameters:
dimensionFilterGroups
for filtering,dataState
for filtering by data state (e.g., “final”).
API Limitations & Constraints
- Historical Data Lookback: 16 months, confirmed from API Documentation.
- Data Latency: Typically 2-3 days for the most recent data.
- Latest Date Determination: No direct API call; recommended to query with recent dates and check for data, or assume 2-3 days lag.
- Maximum
rowLimit
: 25,000 per request. - Daily API Quotas: Check project quotas in API Console, typically 10,000 units/day, each query = 1 unit. Refer to Quotas and Limits.
- Pagination: Use
startRow
androwLimit
; detect end when rows returned <rowLimit
.
Phase 3: Python Implementation Details (Pre-Code Research)
Google API Client Library
- Official library:
google-auth
andgoogleapiclient
for Python. - Code examples:
Authentication (Service Account):
from google.oauth2 import service_account
from googleapiclient.discovery import build
SCOPES = ['https://www.google.com/webmasters/readonly']
credentials = service_account.Credentials.from_service_account_file(
'gsc-service-account-key.json', scopes=SCOPES
)
service = build('searchconsole', 'v1', credentials=credentials)
“Hello World” Test (List Properties):
response = service.sites().list().execute()
print(response)
Core Data Query:
request_body = {
'startDate': '2023-01-01',
'endDate': '2023-01-31',
'dimensions': ['query', 'page'],
'metrics': ['clicks', 'impressions', 'ctr', 'position'],
'rowLimit': 10000
}
response = service.searchanalytics().query(siteUrl='https://example.com', body=request_body).execute()
print(response.get('rows', []))
Pagination:
def fetch_all_rows(service, siteUrl, start_date, end_date, dimensions, metrics, row_limit):
all_rows = []
start_row = 0
while True:
request_body = {
'startDate': start_date,
'endDate': end_date,
'dimensions': dimensions,
'metrics': metrics,
'rowLimit': row_limit,
'startRow': start_row
}
response = service.searchanalytics().query(siteUrl=siteUrl, body=request_body).execute()
rows = response.get('rows',[])
all_rows.extend(rows)
if len(rows) < row_limit:
break
start_row += row_limit
return all_rows
Basic Error Handling:
try:
response = service.searchanalytics().query(siteUrl='https://example.com', body=request_body).execute()
except googleapiclient.errors.HttpError as e:
print(f"Error: {e}")
Secure Credential Handling & File Management
- Store the JSON key file securely, e.g., in the script directory, and avoid hardcoding paths.
- Use
os.path
orpathlib
for platform-independent file checks:
import os
def check_file_exists(filename):
return os.path.exists(filename)
# Check for output CSV
if check_file_exists('gsc_data.csv'):
print("File exists, skipping API call.")
- File Naming Convention: Suggest
gsc_data_{siteUrl}_{start_date}_{end_date}.csv
, sanitizesiteUrl
by replacing special characters (e.g., dots with underscores):import re sanitized_site = re.sub(r'[.:/]', '_', siteUrl) filename = f'gsc_data_{sanitized_site}_{start_date}_{end_date}.csv'
Phase 4: Data Processing and Analysis (Pandas)
Pandas Operations
- Code snippets:
Reading Multiple CSV Files:
import glob
import pandas as pd
files = glob.glob('*.csv')
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
Filtering for Position ≥ 11:
df_striking = df[df['position'] >= 11]
Identifying Queries Only with Position ≥ 11:
query_groups = df.groupby('query')
striking_queries = [query for query, group in query_groups if all(group['position'] >= 11)]
df_striking_queries = df[df['query'].isin(striking_queries)]
Aggregating Data:
agg_df = df_striking_queries.groupby('query').agg({
'clicks': 'sum',
'impressions': 'sum',
'position': 'mean'
})
Exporting Keywords:
pd.Series(striking_queries).to_csv('striking_distance_keywords.csv', index=False, header=['query'])
Left Join with SEMRush Data:
semrush_df = pd.read_csv('semrush_data.csv')
merged_df = pd.merge(df_striking_queries, semrush_df, left_on='query', right_on='keyword', how='left')
Phase 5: Integration with External Tools (Conceptual Outline)
SEMRush Data
- Typical SEMRush export fields include: Keyword, Search Volume, Keyword Difficulty, CPC, Competition.
- Assume manual CSV import, e.g.,
semrush_data.csv
, for integration via left join as shown above.
LLM Prompting Strategy
- Outline:
- Provide the LLM (e.g., Gemini 2.5) with the article text and the list of “striking distance” keywords.
- Task the LLM to identify n-grams (2-5 word phrases) already in the article thematically related to keywords.
- Compare article themes with keywords, suggesting which can be naturally integrated into title, headlines, or body.
This approach ensures the script is comprehensive, user-friendly, and meets all technical constraints.
Key Citations
- Authorize Requests Search Console API Google for Developers
- Prerequisites Search Console API Google for Developers
- Manage API keys Authentication Google Cloud
- How to Obtain and Utilize your Google Search Console API Key
- Introduction to Lighthouse Chrome for Developers
- Authentication methods at Google Google Cloud
- How to Get Google Search Console API Keys JC Chouinard
- ruby Google Search Console API Credentials Stack Overflow
- Google Search Console OAuth2 Authentication
- Setting up API keys API Console Help
- Managing owners, users, and permissions Search Console Help
- How to Add a User to Google Search Console Blue Hills Digital
- How To Add User to Google Search Console Victorious
- How to Add a User to Google Search Console MonsterInsights
- How to Add a User to Google Search Console The Unconventional RD
- How to add a new user in Google Search Console Google Search Central Community
- How To Add A User To Google Search Console and To Google Analytics
- 4 Simple Steps To Add A User To Search Console Sitelynx SEO PPC GA4 Training
- Add User to Google Search Console Guide to GSC Permissions
- How to Add Users to Google Search Console Properties Helpfullee
- IAM basic and predefined roles reference IAM Documentation Google Cloud
- Quickstart Grant roles in the Google Cloud console IAM Documentation
- Permissions Reference for Google Cloud IAM gcp.permissions.cloud
- Using predefined IAM roles for enhanced Google Maps Platform governance DoiT
- Grant access to your project with IAM Google Codelabs
- Roles and permissions IAM Documentation Google Cloud
- GCP IAM Roles All types and recommended setup Apono
- IAM permissions reference IAM Documentation Google Cloud
- google cloud platform How do I list the roles associated with a gcp service account Stack Overflow
- How to enable a Service Account to access Search Console API of a specific WEB page Google Search Central Community
- Unable to query Google Search Console API using a Service Account Stack Overflow
- Using Google Search Console with Python and a service account Medium
- How can I use search console api v3 to access sitemaps data without oauth using service account Google Search Central Community
- GitHub joshcarty/google-searchconsole A wrapper for the Google Search Console API
- How To Use Search Console Google Search Central Documentation Google for Developers
- Verifying your site with Google Search Console Squarespace Help Center
- Add a website property to Search Console Search Console Help
- Google Workspace Secure Online Productivity Collaboration Tools
- Google Search Console Google Workspace
- Search in the Admin console Google Workspace Admin Help
- Google Search Console
- What is Google Search Console, and How Does It Integrate With Your Website Workspace Digital
- Google Workspace Google for Developers
- Quickstart Run a Search Console App in Python Search Console API Google for Developers
- Google Search Console API with Python Engineering for Data Science
- Using Google Search Console API with Python a Quick Tutorial
- Contact practicaldatascience.co.uk domain name owner
- google search console batch api pull python Google Search Central Community
- Search Console API Google for Developers
- Python quickstart for customers using a service account Google device provisioning services Google for Developers
- How to Use Google Search Console API With Python JC Chouinard
- API Reference Search Console API Google for Developers
ChatGPT 4.5 Deep Research
Okay, back to ChatGPT! It’s definitely the slowest of them all, but maybe it makes up for it in quality. I’m actually not sure yet. It’s going to take me some time to digest as I actually carry out this project and see whose advice was best!
ChatGPT 4.5 Deep research: This is a great outline! Before I begin the research, could you confirm a couple quick details to make sure everything is tailored correctly:
- For the Google Search Console (GSC) properties you’ll be querying, will this script primarily access domain properties (e.g., sc-domain:example.com), URL-prefix properties (e.g., https://example.com/), or both?
- Will the script need to support querying multiple GSC properties in a single run, or just one at a time?
- Do you want this script to be usable by non-technical users (e.g., with some kind of CLI prompt interface or config file)? Or is a simple Python script with inline variables sufficient?
- Do you already have SEMRush CSV exports available, or should I include assumptions about column names/structure for join examples?
Let me know and I’ll get started right away!
Me:
- Both
- One at a time
- In Python appropriate for a Jupyter Notebook is fine.
- Include assumptions. I’ll adjust later.
ChatGPT: Perfect. I’ll gather detailed, structured research and examples covering everything from GSC API authentication, usage, and quota limits to Python code for querying and processing performance data, plus assumptions for SEMRush integration and LLM-ready prompting.
I’ll let you know as soon as it’s ready.
Accessing Google Search Console API with Python for “Striking Distance” Keywords
Google Search Console (GSC) can provide valuable SEO insights, especially for “striking distance” keywords – search terms where your site ranks just outside the top 10 (typically positions 11+). The task is to build a beginner-friendly Python Jupyter Notebook script to authenticate with the GSC API, retrieve detailed performance data, and analyze it with Pandas to find striking distance keywords. We’ll also outline integration points with tools like SEMrush and LLMs (Large Language Models) to enrich and act on these insights.
Overview
We’ll break this guide into phases:
- Authentication & API Setup: How to enable the GSC API and authenticate via Google Cloud Project, using either Service Accounts or OAuth 2.0 Client IDs. We’ll provide step-by-step instructions for GCP setup, credential JSON files, and discussing pros/cons for each method.
- Understanding the GSC API: Key API details – endpoints, dimensions, metrics, and request structure – to fetch performance data (clicks, impressions, CTR, position). We’ll explain how combining dimensions like
query
andpage
yields keyword-URL pairs and how filtering works. - Python Implementation: Using the Google API Python Client Library. We’ll show code snippets for authentication (Service Account vs OAuth), listing accessible GSC properties, querying performance data (with pagination), and basic error handling.
- Data Processing with Pandas: How to load GSC data into Pandas, filter for striking distance keywords (position ≥ 11), aggregate metrics, and merge with external data (e.g., SEMrush CSV) for additional keyword context.
- External Integration Concepts: Outline common SEMrush keyword fields (Volume, Keyword Difficulty, CPC, etc.) and suggest LLM prompting strategies – for example, using an LLM to suggest content tweaks by providing article text plus a list of striking distance keywords.
Each phase includes detailed instructions, code snippets, and references to official docs or reputable sources. Let’s dive in.
Phase 1: Authentication and API Access
Before writing any code, we need to configure a Google Cloud Project for API access and obtain credentials for our Python script to authenticate with the GSC API. There are two primary methods:
- Service Account (JSON key file)
- OAuth 2.0 Client (Client ID with client secret JSON)
Both require some initial setup in Google Cloud Platform (GCP) and enabling the GSC API for your project.
1.1 Enable the GSC API in a GCP Project
No matter which auth method you choose, start by creating a Google Cloud project and enabling the GSC API:
- Create a GCP Project: Log into the Google Cloud Console and create a new project (or select an existing one). Ensure you use an account that has permission to create projects – note that some Google Workspace accounts may have this disabled by an admin (How to leverage your GSC data through the API? - Antoine Eripret).
- Enable the Search Console API: In your GCP project, go to APIs & Services > Library. Search for “Google Search Console API” and click Enable. (If using the API for URL inspection, also enable the “Indexing API” as needed.)
Next, choose an authentication method:
1.2 Option A – Service Account Authentication
A Service Account represents a non-human user that your application can use to authenticate – ideal for server-to-server scenarios with no user interaction. For GSC API, a service account can query data only if it has been granted access to the GSC property.
Setup Steps:
- Create a Service Account: In GCP Console, go to IAM & Admin > Service Accounts, then Create Service Account. Give it a name (e.g., “GSC API Service Account”). For this local script, no special roles need to be granted at creation; we will manage access via Search Console itself.
- Create a JSON Key: After creating the service account, under its Keys tab, click Add Key > Create Key and choose JSON. Save this JSON key file (e.g.,
service_account_key.json
) to a secure location – we’ll use it in our script for authentication. Keep this file private (don’t commit it to Git, etc.). - Grant Search Console Access: Now, add the service account’s email as a user in GSC. Find the service account’s email (it’s in the JSON under
client_email
or on the service account details page). In Google Search Console, open your site Settings > Users and permissions and click Add user. Enter the service account’s email and give it Full access.- If you don’t see an “Add user” button, your Google account is likely not a verified owner of the property. You’ll need to use an owner account to add the service account.
- IAM Roles: Strictly speaking, to use the GSC API, the service account doesn’t need extra GCP IAM roles besides being enabled with the API. However, ensure it has at least basic project access (it’s automatically an Editor on the project if created there). The actual data permission comes from adding it as a Full user in Search Console, which allows viewing all performance data.
Pros & Cons of Service Account:
- Pros: No manual web OAuth flow each run; good for automated scripts; credentials can be long-lived (until key is revoked).
- Cons: Setup involves multiple steps; service account email must be added to each GSC property (could be an issue if you have many sites or only delegated access); not suitable if you want to use your personal GSC access directly (see OAuth below).
1.3 Option B – OAuth 2.0 Client ID (User Account Authentication)
Alternatively, use an OAuth 2.0 Client ID for a “Desktop App” or “Installed Application.” This means your script will open a browser or prompt you to log into a Google account (one that has access to the GSC property) and ask for consent to allow API access. This is effectively what tools like the Search Console Google Sheets extension do, but in Python.
Setup Steps:
- Create OAuth Credentials: In GCP Console, go to APIs & Services > Credentials. Click Create Credentials > OAuth client ID. Choose Desktop App (for a script running on your machine) and give it a name (e.g., “GSC API Desktop Client”).
- Download Client Secret JSON: After creation, download the OAuth client secret JSON (it may be named like
client_secret_<id>.json
). This file containsclient_id
,client_secret
, and some other info. Important: The first time you run your script, it will open a browser for you to log in and authorize. Google will then provide a token which the Python client library can store (often in a file liketoken.json
). Subsequent runs can refresh the token without manual login, until the refresh token expires or is revoked. - Enable API & Scopes: Ensure the GSC API is enabled on this project (same as above). Decide on scopes – for read-only data, we’ll use scope
https://www.googleapis.com/auth/webmasters.readonly
. For write actions (not needed here), there’s a broader…/auth/webmasters
scope. - IAM Roles: Not applicable as this uses your user account permissions. Just ensure the Google account you’ll use is a verified owner or a user of the GSC property. If it has Full or Restricted access, it can read the performance data (Full users see all data; Restricted see most data). Verified owners and delegated owners have full control. If using a Google Workspace account, ensure that the domain’s admin hasn’t disabled Google Cloud or API access (some Workspace admins require enabling APIs or granting consent for third-party apps).
Pros & Cons of OAuth Client ID:
- Pros: Uses your own user access (no need to add special users in GSC); multiple sites’ data accessible if your Google account has access; well-suited if a person runs the script and can interact with a browser for consent.
- Cons: Requires a one-time manual auth (and re-auth if token expires); slightly more complex in code to handle token storage; not ideal for fully automated server scripts without user interaction.
Which to choose? For a locally-run, user-initiated script, either is fine. If you want simplicity and no browser pop-ups after setup, service accounts are great (just remember to add them to Search Console). If you prefer to directly use your own account’s access or are working in an environment where adding a service account user to GSC isn’t feasible, OAuth 2.0 is the way to go.
Google Workspace Note: If your Google account is under a managed organization (Google Workspace), you might face restrictions creating projects or service accounts (admins can disable GCP for users). If you see errors like “Google Cloud Platform service has been disabled. Please contact your administrator”, you may need to request access or use a personal Google account for the API project. Also, for domain properties auto-verified via Workspace, API access should work as long as you use an account that is an owner. There’s also domain-wide delegation for service accounts in enterprise setups, but that’s advanced and not needed here since adding the service account as a user is simpler.
1.4 Summary of Steps for GCP and Credentials
- Create GCP Project (ensure you have rights, especially in Workspace context).
- Enable Search Console API for the project.
- Choose Auth Method:
- Service Account: Create it, download JSON key, add it as Full user in GSC.
- OAuth Client: Create Desktop OAuth ID, download client secret JSON.
- Scope & Consent: For OAuth, you’ll grant the
webmasters.readonly
scope when prompted. No further action for service account (no consent screen since it’s server-to-server). - Minimum IAM Roles: If creating a service account, you can skip granting it additional roles on GCP (not needed for calling GSC API). If using OAuth, no special IAM roles; your user permissions in GSC determine data access.
- Security Tip: Whichever method, store your JSON credentials securely. If using a notebook, consider using environment variables or a config file read (so you’re not hardcoding keys in the notebook). Also, avoid using API Keys for GSC data – the Search Console API requires OAuth-based auth (API keys alone won’t grant access to private data).
Now that we have API access configured, let’s see what the GSC API offers.
Phase 2: Understanding the GSC API for Performance Data
Google’s Search Console API (specifically the Search Analytics service) gives programmatic access to the same performance data you see in the GSC Performance report. Understanding how to query this data via the API is crucial for retrieving everything we need:
2.1 API Version and Endpoint
The current API is part of Webmasters v3 (also referred to as Search Console API v1 in some docs). The base endpoint for querying search analytics is:
POST https://www.googleapis.com/webmasters/v3/sites/{siteUrl}/searchAnalytics/query
We will use the Google API Python client, which abstracts this into a method call. The siteUrl
is the exact site as defined in Search Console:
- For URL-prefix properties: include the protocol and trailing slash (e.g.,
"https://www.example.com/"
). - For Domain properties (DNS verified): use the
sc-domain:example.com
format.
The API method is searchanalytics.query
. It expects a POST request with a JSON body containing your query parameters. There is no separate “GET” for performance data – the query
method is the one to use.
2.2 Available Dimensions and Metrics
Metrics: The main metrics returned for each row are:
- Clicks
- Impressions
- CTR (as a proportion between 0 and 1 in the API; multiply by 100 for percentage)
- Position (average position)
These correspond to the columns you see in the GSC UI. The API returns them as doubles (floats).
Dimensions: Dimensions determine how data is grouped (segmented). Valid dimensions are:
date
query
page
country
device
searchAppearance
(e.g., AMP vs rich results, etc.)
When you specify dimensions in the request, the API groups metrics by the unique combination of those dimensions, up to 5 dimensions. If you don’t specify any dimension, you get a single total row for the whole site and date range.
Common use-cases:
- Group by
query
to see all queries and their metrics. - Group by
page
to see data per page. - Group by
query
andpage
to get the granular keyword–URL pairs (this is what we need for striking distance, because we want to know which queries on which pages rank at positions 11+). - Add
country
ordevice
if you want to further break it down, though often not needed for this analysis. searchAppearance
is used usually in a two-step approach if you want to isolate data like AMP vs non-AMP (Getting your performance data | Search Console API | Google for Developers), but not required for our scenario.
According to Google’s docs:
“There is no limit to the number of dimensions that you can group by” (practically 5 as that’s the total available). And the order you list them in “dimensions” matters; it sets the order of values in the
keys
array in response and how grouping is nested.
Combining query
and page
does return data per query-URL pair (not aggregated by query alone or page alone). Yes, it will show keyword+URL pairs. This is confirmed by Google’s example: "dimensions": ["page", "query", "country", "device"]
will include page and query details, but note that asking for more dimensions can cause some data “sampling” or omission of very low-ranked queries (Google Search Console Data Limitations) (Google tends to return top queries if the combination yields more than 50k rows, see API limits below).
Dimension Filter Combinations: You can request data by page or by property (site) through an aggregationType
parameter (more on that below), which affects counts:
- If you omit
page
andquery
in dimensions and setaggregationType: byPage
, you get site-level totals with more accurate counts of total clicks/impressions. - If you include
page
/query
, some data might be missing (Google’s internal limitation where super long-tail queries might not all be listed (Google Search Console Data Limitations), for privacy and space reasons).
For our purposes, we want detail, so we’ll include query
and perhaps page
. We accept that some extremely minor queries might be excluded.
Full list of dimension values:
country
uses 3-letter country codes (USA, GBR, etc.).device
values areDESKTOP
,MOBILE
,TABLET
.searchAppearance
values can be things likeAMP_BLUE_LINK
,WEB_LIGHT_RESULT
, etc., which correspond to search result types/features (these aren’t always obvious; Google suggests running a query grouped solely by searchAppearance to see what’s present, or refer to their help docs).
2.3 Query Structure and Parameters
The JSON request body fields (with example values) include:
- siteUrl (in URL path): e.g.,
"https://www.example.com/"
or"sc-domain:example.com"
. - startDate: e.g.,
"2025-02-01"
(format YYYY-MM-DD). Required. - endDate: e.g.,
"2025-02-28"
. Required. The API treats these inclusively. - dimensions: list of dimensions (optional). E.g.,
["query", "page"]
or["page","query","country"]
. - type: (formerly
searchType
, which is now deprecated in favor oftype
). Optional; defaults to “web” if not specified. You can set:"web"
– default All Search data (this excludes Discover/News)."image"
,"video"
,"news"
for vertical-specific searches (News here means the News tab in search results)."discover"
for Google Discover data."googleNews"
for Google News app and news.google.com data.
- dimensionFilterGroups: optional, for filtering results. It’s an array of filter groups, each with:
groupType
:"and"
(the only supported option currently; “or” is not supported as of the latest docs).filters
: list of filters. Each filter has:dimension
: one of the dimension names (including"date"
if needed).operator
: e.g.,"equals"
,"notContains"
,"contains"
,"greaterThan"
(for position filtering this doesn’t apply, as you can’t filter by metrics directly in API), or regex-based operators"includingRegex"
/"excludingRegex"
(Search Analytics: query | Search Console API | Google for Developers).expression
: the string or regex to match (Search Analytics: query | Search Console API | Google for Developers).
- Example: You could filter to queries containing “guide” by
{"dimension":"query","operator":"contains","expression":"guide"}
or exclude branded terms by regex.
- aggregationType: optional;
"auto"
by default."auto"
– let Google decide. Generally, if no page dimension is in your request and no filter on page, Google might aggregate by property."byPage"
– aggregate by canonical URI."byProperty"
– aggregate by site. Note: You cannot usebyProperty
if you group or filter by page. Also not supported for Discover/GoogleNews type.- For our needs, we’ll likely stick to default “auto” or explicitly use “auto” (since we will be grouping by query/page, auto will effectively be by page internally).
- rowLimit: up to 25,000 per request. Default is 1000 if not set. We can set e.g. 25000 to get the max in one go.
- startRow: for pagination. 0-indexed offset. Use 0 for first page, 25000 for second page, etc.
- dataState: optional;
"final"
by default. If you set"all"
, you include fresh (unfinalized) data. GSC performance data usually has a 2-day lag in “final” data. “Fresh data” can include last 48h but might be incomplete. Since we want robust analysis, using finalized data (default) is fine. You might considerdataState: "all"
if you want up-to-the-latest info knowing it can be slightly noisy (Search Analytics: query | Search Console API | Google for Developers).
For example, a typical request body to get all query-page data for Feb 2025 might look like:
{
"startDate": "2025-02-01",
"endDate": "2025-02-28",
"dimensions": ["query", "page"],
"type": "web",
"aggregationType": "auto",
"rowLimit": 25000,
"startRow": 0
}
And we would loop increasing startRow by 25000 until no rows are returned (pagination).
Important: If you include date
as a dimension (e.g., to see daily trends), the sorting of results changes – by date ascending, not by clicks. Without date, results are sorted by clicks (highest first). For striking distance, we likely don’t need date
dimension; we’ll gather an aggregate over a date range.
2.4 API Limits and Quotas
GSC API has both data availability limits and query rate quotas:
- Data retention: GSC only keeps 16 months of performance data. You cannot retrieve data older than ~16 months from today. So the max start date you can use is 16 months ago. If you need older, you’d have had to store it elsewhere historically.
- Data latency: Performance data is typically available with a ~2 day delay. GSC help says ~48 hours. The API’s default
dataState: final
will not include the last day or two. You can verify the latest date available by querying the max date in a group-by date query or usingdataState: "all"
which might include yesterday (but marked as fresh data). - Maximum query range: You can query any range within those 16 months, even the whole 16 months in one go. However, huge date ranges with detailed dimensions might bump into internal sampling limits:
- Google mentions the API does not guarantee to return all data rows but rather top ones, if too many rows. Specifically, the performance report caps at 50k rows per day per type. If your query tries to get more than 50k rows, it will truncate to ~50k (sorted by clicks descending). That’s why pagination stops at some point.
- Also, using multiple dimensions (especially query+page) inherently might yield a lot of rows. The
rowLimit
max per call is 25k, and the max you’ll ever get via pagination is likely 100k (some sources say 50k/day/type). - The docs confirm: “maximum of 50K rows of data per day per search type (sorted by clicks)”. So if you query a 30-day range, potentially up to 50k30 = 1.5M theoretical combinations, but GSC won’t return *that many; it will give top 50k per day sliced by daily data in practice. If grouping by query/page without date, they likely apply some logic but expect a max around 50k-100k rows total per query.
- Latest date via API: To programmatically find the latest date of data, you can query with
dimensions: ["date"]
for a recent range or simply note that it’s “today - 2 days” usually. Safer approach:- Query for last 7 days grouped by date.
- Take max date in results. This could help ensure your date range in the main query is valid up to the latest available data.
- Quotas (API usage limits): According to Google’s quotas (updated Aug 2024):
- Per-user & per-site: 1,200 queries per minute (QPM). It’s unlikely a single user script will hit that.
- Per-project: 40,000 QPM and 30,000,000 per day – effectively not a worry for normal use.
- They also have “short-term load” quotas – if you do extremely heavy queries (e.g., grouping by page and query over large date ranges repeatedly), you might get “quota exceeded” errors even under QPM limits. If so, they advise slowing down or splitting queries (e.g., by date segments).
- For our purpose (maybe a handful of queries to fetch data), this is fine. Just handle errors gracefully.
- Row limit & pagination: As mentioned,
rowLimit
max is 25,000. Pagination is done by incrementingstartRow
. The API returns an emptyrows
list when you’ve reached beyond the last row. So keep increasingstartRow
until you get 0 rows.- Example pseudo-code:
start = 0; all_rows = []; while True: resp = queryAPI(startRow=start, rowLimit=25000) rows = resp.get('rows', []) if not rows: break all_rows.extend(rows) start += 25000
- There’s no explicit “total row count” in response to know how many pages, so you loop until empty response.
- Example pseudo-code:
By understanding these details, we can retrieve all the performance data needed (within quota) and trust that positions, CTR, etc., match what GSC shows (with minor sampling caveats for long-tail queries).
Phase 3: Python Implementation Details
Now, let’s integrate this with Python. We’ll use Google’s official API client library for Python (google-api-python-client
) and related auth libraries (google.oauth2.service_account
or google_auth_oauthlib
).
First, install required packages (in Jupyter, use !pip install google-api-python-client google-auth-oauthlib google-auth
etc., if not already installed).
3.1 Authentication with Python: Service Account vs OAuth
We’ll outline code for both methods:
A. Using Service Account JSON:
import os
from google.oauth2 import service_account
from googleapiclient.discovery import build
SERVICE_ACCOUNT_FILE = 'service_account_key.json' # path to your JSON key
SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
credentials = service_account.Credentials.from_service_account_file(
SERVICE_ACCOUNT_FILE, scopes=SCOPES)
# Build the service object for Search Console API (webmasters v3)
service = build('webmasters', 'v3', credentials=credentials)
That’s it. The service
object now has methods like service.sites().list().execute()
, service.searchanalytics().query(...).execute()
, etc. This assumes the service account has access; otherwise, calls will return 403 permission errors.
B. Using OAuth Client Secret JSON:
from google_auth_oauthlib.flow import InstalledAppFlow
from google.oauth2.credentials import Credentials
CLIENT_SECRETS_FILE = 'client_secret.json'
SCOPES = ['https://www.googleapis.com/auth/webmasters.readonly']
# Try to load existing tokens if available (to avoid re-auth on each run)
TOKEN_FILE = 'token.json'
creds = None
if os.path.exists(TOKEN_FILE):
creds = Credentials.from_authorized_user_file(TOKEN_FILE, SCOPES)
if not creds or not creds.valid:
if creds and creds.expired and creds.refresh_token:
creds.refresh(Request()) # refresh token if needed
else:
flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRETS_FILE, SCOPES)
creds = flow.run_console() # or run_local_server() for web flow
# Save the creds for next time
with open(TOKEN_FILE, 'w') as token:
token.write(creds.to_json())
service = build('webmasters', 'v3', credentials=creds)
The above will either use an existing token.json
or initiate the OAuth flow (popping a URL for you to visit and log in). Use run_console()
for CLI (it will ask you to paste a code), or run_local_server(port=0)
to open a browser and auto-receive the token (Flask required). For Jupyter, run_local_server()
is often easier since you can click the link it prints.
After auth, service
is ready to call.
Tip: The first time using OAuth, Google will show a consent screen (if you didn’t configure a custom OAuth consent, it might show a warning “app not verified” since it’s internal – you can proceed as it’s just your app). Once you allow, the refresh token it stores typically lasts until you revoke access.
Now, regardless of auth method, we have a service
object to interact with the API.
3.2 “Hello World”: List GSC Properties
To ensure our auth works, let’s list accessible sites:
sites_response = service.sites().list().execute()
for site in sites_response.get('siteEntry', []):
print(site['siteUrl'], " - permission:", site['permissionLevel'])
This should print all sites your account (or service account) can access and the permission level (e.g., siteOwner
, siteFullUser
, siteRestrictedUser
etc.). If you get an empty list but expect sites, ensure the account actually has access and you used the correct auth. (Common issue: using a service account that wasn’t added in GSC yields empty result).
3.3 Core Query: Fetching Performance Data
To call the Search Analytics Query method:
site_url = "https://your-site.com/" # or "sc-domain:yourdomain.com"
request_body = {
'startDate': '2025-02-01',
'endDate': '2025-02-28',
'dimensions': ['query','page'],
'rowLimit': 25000,
'startRow': 0,
'type': 'web'
}
response = service.searchanalytics().query(siteUrl=site_url, body=request_body).execute()
rows = response.get('rows', [])
print(f"Got {len(rows)} rows in first call.")
# Optionally, iterate to get all pages:
all_rows = []
start_row = 0
while True:
request_body['startRow'] = start_row
resp = service.searchanalytics().query(siteUrl=site_url, body=request_body).execute()
data_rows = resp.get('rows', [])
if not data_rows:
break
all_rows.extend(data_rows)
start_row += len(data_rows)
print(f"Total rows fetched: {len(all_rows)}")
This will gather up to the maximum available rows by paging with startRow. We assume each call returns up to 25k (we set that) until exhaustion. If your site is large, keep an eye on performance. Usually, one or two pages (25k or 50k rows) suffice unless it’s a very big site.
The rows
returned are of the form:
{
"keys": ["query string", "https://pageurl/..."],
"clicks": 123.0,
"impressions": 456.0,
"ctr": 0.2708333333333333, # i.e., 27.0833%
"position": 11.2
}
The keys array order matches the dimensions
order in the request. In this example, keys[0] is the query, keys[1] is the page.
Pagination note: Instead of checking len(data_rows)
, a more robust condition is if not data_rows: break
as above. Google’s docs example uses do...while
until rows count == 0.
3.4 Error Handling – googleapiclient.errors.HttpError
Wrap API calls in try/except to catch HttpError
exceptions from the client library:
from googleapiclient.errors import HttpError
try:
response = service.searchanalytics().query(siteUrl=site_url, body=request_body).execute()
except HttpError as err:
# err.resp.status for HTTP code, err._get_reason() for message
print(f"API call failed with error: {err}")
# You could inspect err.status_code or str(err) for details
Common errors:
- 401 Unauthorized: bad credentials or token expired (handle by refreshing token or re-auth).
- 403 Forbidden: likely the account doesn’t have access to that site (check if the siteUrl is correct, and that the auth’d account is a permitted user).
- 400 Bad Request: usually a malformed request (e.g., date format wrong, invalid dimension name, querying a date beyond data range). The error message might say “Request contains an invalid argument” if so.
- Quota exceeded errors: come as 429 Too Many Requests or 403 with a quota message. If it happens, slow down or try later (for our interactive use, unlikely).
- Note: The client library’s HttpError contains a
.error_details
and.status_code
that you can use to branch logic. For simplicity, just logging/printing is fine for now, as this is a local script.
3.5 Relative Paths and File Handling
The user requested using relative paths and not requiring advanced async or others. So:
- Keep file paths generic. For example, if the JSON keys are in the same directory, just
'service_account.json'
is fine. Useos.path.join
if constructing paths. - If writing output CSV files, use something like:
output_dir = "output" os.makedirs(output_dir, exist_ok=True) file_name = f"{site_url.strip('/').replace('https://','').replace('http://','').replace(':','_')}_{start_date}_{end_date}.csv" file_path = os.path.join(output_dir, file_name) df.to_csv(file_path, index=False)
This example sanitizes the filename by removing scheme and replacing
:
(fromsc-domain:
) if present with underscore, and adding date range. - No special libraries needed for file I/O beyond pandas (which can write CSV) or Python’s built-in open.
We’ll propose a naming convention like: example.com_2025-02-01_2025-02-28.csv
. If multiple sites, that keeps them separate. If multiple date ranges (like monthly exports), that differentiates by date.
Ensure to not hard-code any one user’s path (like avoid C:\Users\YourName\...
), and use pathlib
or os
to be OS-independent.
Now that we can get the data, let’s move to analysis.
Phase 4: Data Processing and Analysis (Pandas)
Using pandas for analysis makes it easy to filter and aggregate data. Suppose we’ve collected our all_rows
from the API. We can load that into a DataFrame directly:
import pandas as pd
# Convert API rows to DataFrame
df = pd.DataFrame(all_rows)
# Split 'keys' into separate columns for query and page
if 'keys' in df.columns:
keys_df = pd.DataFrame(df['keys'].tolist(), columns=['query','page'])
df = pd.concat([keys_df, df.drop(columns=['keys'])], axis=1)
Now df
has columns: query
, page
, clicks
, impressions
, ctr
, position
.
4.1 Loading Multiple CSVs (if needed)
If you previously saved multiple CSV files (say, if you pulled data for different sites or months separately), you can load them and concatenate:
import glob
file_pattern = "output/example.com_*.csv" # adjust pattern to your needs
all_files = glob.glob(file_pattern)
df_list = [pd.read_csv(file) for file in all_files]
df = pd.concat(df_list, ignore_index=True)
This will combine all the data. If the schema is the same, it should align properly.
4.2 Filtering for Position >= 11 (Striking Distance)
The idea of “striking distance keywords” is typically queries where the average position is between about 11 and 20 (some define up to 30). These are just outside page 1, so they have potential to be improved to page 1.
We can filter the DataFrame:
striking_df = df[df['position'] >= 11.0].copy()
If we want only those that are never on page 1 across the site (i.e., position >=11 for all URLs), it’s trickier because the data is by query+page, not query alone. But likely we mean any query that currently averages beyond 10. So above filter is fine.
Optionally, you might refine to <= 20 or <= 30:
striking_df = striking_df[striking_df['position'] < 30.0]
to narrow it, but the question didn’t specify an upper bound. We’ll assume 11+ is enough as “striking distance”.
4.3 Isolating Queries that only appear at positions >= 11
If a query appears on multiple pages (perhaps a query matches multiple URLs of your site) and one of them is position 5 and another is 12, that query wouldn’t count as striking distance since it has a page1 ranking already. The problem asks: “queries only appearing with position >= 11 across all pages”.
To get that, we can do a group by query and look at the min position for that query among all pages:
query_minpos = df.groupby('query')['position'].min().reset_index()
queries_all_11_plus = query_minpos[query_minpos['position'] >= 11]['query']
# Filter striking_df to only those queries
striking_df = striking_df[striking_df['query'].isin(queries_all_11_plus)].copy()
Now striking_df
has only query-page entries where the query’s best ranking page is still not on page1. These are truly “not on page 1 at all” queries.
4.4 Aggregating by Query
Often, we want to consolidate info per query (especially if the query appears on several pages, though ideally one page is the focus). We might want average position (weighted by impressions ideally) and total clicks/impressions across the site.
We can aggregate striking distance keywords by query:
agg_df = striking_df.groupby('query').agg({
'position': 'mean', # average of average positions (approximation)
'clicks': 'sum',
'impressions': 'sum'
}).reset_index()
# Calculate an overall CTR from summed clicks/impressions
agg_df['ctr'] = (agg_df['clicks'] / agg_df['impressions']).fillna(0) * 100
agg_df['position'] = agg_df['position'].round(2)
agg_df['ctr'] = agg_df['ctr'].round(2)
However, note that averaging the position
like this is a bit crude – since position is already an average. A better metric might be the minimum position (best any page gets), but since we filtered those to >=11, min is still >=11. Or if we care about the “primary” page for that query, we might take the max clicks page.
But for simplicity, an average of average positions is fine as an indicator, and summing clicks/impressions gives the total opportunity of that keyword.
Now agg_df
is a list of striking distance queries with their average position and total clicks/impressions in that period.
4.5 Exporting Results
We can now export to CSV:
agg_df.to_csv('output/striking_distance_keywords_feb2025.csv', index=False)
Or get it as a Python list:
keywords_list = agg_df['query'].tolist()
If providing to an LLM or further use, you might just use the list or the DataFrame itself.
4.6 Merging with SEMrush (or other SEO tool data)
Suppose you have an external CSV from SEMrush with keyword data. Common fields in a SEMrush keyword export (for, say, organic research or position tracking) might include:
- Keyword
- Search Volume
- Keyword Difficulty (often %)
- CPC (cost per click)
- Competition (for paid, 0-1)
- Current URL (if it was a position tracking export)
- Current Position
- etc.
Let’s assume we have a DataFrame sem_df
with at least a column “Keyword” that matches our queries. We can left-join our GSC data with SEMrush data:
# sem_df = pd.read_csv('semrush_keywords.csv')
merged_df = agg_df.merge(sem_df, left_on='query', right_on='Keyword', how='left')
Now merged_df
will have additional columns like Volume
, KD
, etc. (Exact column names depend on SEMrush’s file; likely “Keyword”, “Volume”, “KD%”, “CPC”, etc.)
This enriches striking distance keywords with how many people search for them (Volume) and how hard they might be (Keyword Difficulty), which can help prioritize which ones to target first. For example, a query with 5,000 volume at position 12 is a juicy target, whereas one with 50 volume might be less impactful.
We can then export this merged list or further filter it.
4.7 Summary of Pandas tasks
- Load multiple CSVs (if needed) with
glob
andpd.concat
. - Filter by position (query+page pairs where position ≥ 11).
- Identify queries that don’t have any page1 ranking by grouping and checking min position.
- Aggregate by query: average position, sum clicks, sum impressions.
- Sort the results perhaps by impressions or volume to see biggest opportunities first.
- Export final list to CSV.
- Join with SEMrush or other keyword datasets on query for more context.
Next, we’ll consider how to use these results.
Phase 5: External Tool Integration Concepts
The script should also support integration with tools like SEMrush and LLMs. While the coding part covers merging with SEMrush data (as above), here we outline what data SEMrush provides and how an LLM might be used to act on the data.
5.1 Common SEMrush Keyword Export Fields
If you export keywords from SEMrush (e.g., via the Keyword Magic Tool or Position Tracking or Organic Research for your domain), you might get columns such as:
- Keyword: The search query text.
- Volume: Average monthly search volume (usually country-specific, e.g., US volume).
- KD (%): Keyword Difficulty – an SEO difficulty metric (0-100%). High means harder to rank.
- CPC: Average cost-per-click in Google Ads (usually in USD or specified currency).
- Competition: (For ads) a value from 0 to 1 indicating how many advertisers bid on it.
- Current Position: If it’s an export of your site’s rankings, the current position you rank.
- URL: The URL ranking (if the export is specific to your site).
- SERP Features: What features are present (e.g., Featured Snippet, People Also Ask, etc.).
- Trend: perhaps a sparkline or trend of volume over 12 months (some exports have these).
For merging with GSC data, the key field is Keyword (match to our query
).
In our analysis, we already have clicks, impressions, CTR, position from GSC. SEMrush adds:
- Volume can help gauge potential – e.g., you might rank 15 for a keyword with 10k searches, which likely has more potential traffic than one with 100 searches (even if impressions from GSC are low due to low rank).
- KD% may hint if it’s feasible to rank higher soon (low KD keywords might be easier wins).
- CPC indicates the commercial value of the keyword (if planning content or monetization).
- Competition (ad competition) – maybe less relevant for organic strategy, but interesting.
- If the SEMrush data is from a position tracking campaign, it might confirm your position and the URL. But we already have that from GSC; however, position tracking might include competitors, etc.
For simplicity, assume we have a SEMrush keyword list with Keyword, Volume, KD, CPC.
5.2 LLM Integration Strategies
Large Language Models (like GPT-4, etc.) can assist in content optimization once we have the data:
Some ideas:
A. Content Suggestions with LLM: If we provide an LLM with:
- The text of an article/page (or summary of it).
- A list of striking distance keywords related to that page’s topic.
We can prompt the LLM to suggest how to incorporate those keywords. For example:
“Here is the current content of my article (or the relevant section): [text]. Here is a list of related keywords that the page is ranking on page 2 for: [keywords + any additional data like volume]. Suggest how I can naturally integrate these keywords into the content – for instance, by adding a new section, adjusting headings, or using synonyms – to potentially improve its relevance and rankings for these terms.”
The LLM could then output suggestions like:
- “Include a section about [keyword] since users search for it. For example, add a paragraph explaining …”
- “Use [keyword phrase] in a subheading to target that query directly.”
B. Identifying existing usage in content: Before prompting, use a simple script to check which of the striking keywords (or their components) already appear in the article. For instance, for each keyword, see if the keyword (or a portion of it) is present in the text. Pandas or Python in
can do this, or even ask the LLM to analyze text for presence of certain phrases.
An automated approach with Python:
page_text = """... (from your CMS or file) ..."""
for kw in keywords_list:
if kw.lower() in page_text.lower():
print(f"'{kw}' is already in content")
else:
print(f"'{kw}' NOT found in content")
But an LLM could do this in a more intelligent way (like find partial matches or semantically similar phrases).
C. Using LLM to cluster or group keywords by intent or topic: If you have many striking distance keywords, an LLM can help cluster them, or turn them into content ideas. Example prompt:
“Group the following keywords into themes and suggest an outline for covering each theme in an article: [keyword list].”
This might help if you plan to create new content or refactor existing content to cover multiple related keywords.
For the scope of our script: at least we should output the list of striking distance keywords (optionally enriched with volume, etc.), and then the user can manually or via separate process use LLMs to get suggestions.
If we integrate directly:
- Possibly use OpenAI’s API (within allowed quota) to get suggestions, but that’s beyond the GSC integration (also costs and requires API keys). The user likely will copy-paste the results into ChatGPT or similar, or have a custom prompt ready.
5.3 Example LLM Prompt
Given an article text and some keywords:
Article Title: How to Train a Puppy
Article Text: "Training a new puppy can be challenging. ... [full text] ..."
Striking Distance Keywords:
- puppy training schedule (Volume: 1000)
- best way to potty train a puppy (Volume: 800)
- crate training at night (Volume: 500)
Task: Suggest how to incorporate each of the above keywords into the article naturally. Provide specific changes or additions.
An LLM might answer:
- “Mention a weekly puppy training schedule in a new section. For instance: ‘Puppy Training Schedule: Consistency is key when training a puppy. Create a daily routine…’”【This integrates the keyword【separately】】.
- “For best way to potty train a puppy, you can add a paragraph: ‘The best way to potty train a puppy is to establish a routine. Take them out first thing in the morning…’ etc., using that exact phrase in a header or the opening sentence.”
- “Discuss crate training at night: maybe add a tip in the bedtime routine section: ‘Crate Training at Night: Many experts suggest feeding your puppy early in the evening…’”
By overlapping the keyword themes with content sections, the article becomes more relevant to those queries – hopefully improving rank from say 12 to 8 over time, especially if combined with other SEO improvements (internal links, external links, etc.).
N-gram analysis: The prompt suggests identifying 2-5 word phrases in the article (n-grams) and seeing which keywords overlap. We could do a quick check of bigrams/trigrams in the content vs keywords. But an LLM can internally do that reasoning if prompted explicitly: “Check if the article already covers these topics and where it might add more content for them.”
For thoroughness, if coding:
from collections import Counter
import re
words = re.findall(r'\w+', page_text.lower())
bigrams = [' '.join(words[i:i+2]) for i in range(len(words)-1)]
trigrams = [' '.join(words[i:i+3]) for i in range(len(words)-2)]
# and similar for 4-grams, 5-grams.
ngrams = set(bigrams + trigrams) # etc.
for kw in keywords_list:
if kw.lower() in ngrams or any(kw.lower() in phrase for phrase in ngrams):
print(f"Article contains phrase relating to '{kw}'.")
This might be overkill; simply check substring of content as earlier might suffice because if the exact keyword is present, you know it’s covered.
5.4 Summary for LLM Strategy
- Give LLM context: Provide article content (or summary), provide keywords with context (volume, etc.), and ask for integration suggestions or new title/H1 ideas.
- Overlapping detection: Possibly use code or LLM to find which striking keywords are already partially covered in the text.
- Focus on high-volume or thematically relevant keywords: The user might not integrate every keyword, but focus on ones that fit naturally.
- LLM output usage: Use suggestions as a starting point, then manually refine and implement on the site. Remember to maintain natural writing – avoid keyword stuffing; LLM should help find a balance.
Sources
We have referenced official documentation and reputable guides throughout to ensure accuracy and currency of information:
- Google Search Console API Official Docs for API setup and usage (authentication, endpoints, request fields, quotas).
- Guides and articles explaining service account usage and adding it to Search Console.
- Data retention and limits from Google support and SEO community posts.
- SEMrush and SEO articles for context on keyword data and striking distance usage (Google Search Console Data Limitations).
- LLM integration strategies based on typical content optimization workflows.
Each phase above includes specific citations inline to the relevant references, e.g., enabling API, dimension info, quotas, etc., to validate the statements.
I now have 3 different opinions on doing on of the most commonly needed and useful tasks in SEO. That wraps up this overloaded page. It’s time to summarize it once again and move onto the next page, which will be the implementation based on this research, creating a most excellent SEO process Jupyter Notebook to port to Pipulate.
Gemini 2.0 Flash’s Take
I couldn’t use the 2.5 Experimental or even the 2.0 Thinking model on the above article, but the basic 2.0 Flash model came through.
Okay, here’s a summary, title/headline ideas, and an opinion of the article, all as you requested:
Author’s 1st Person Summary (Long Meta Description):
[Moved to top]
Title & Headline Ideas:
-
Titles:
- Automating SEO Insights: A Python & Google Search Console API Guide
- Building a “Striking Distance” Keyword Finder with Python
- Python Script for Google Search Console Data Analysis & Optimization
- Unlocking SEO Growth: Python, GSC API, and Pandas Workflow
- From Google Search Console to Content Optimization: A Developer’s Guide
-
Headlines (Sub-sections):
- Setting Up Google Search Console API Access with Python
- Understanding the Google Search Console API for SEO Data
- Python Implementation: Authenticating and Querying GSC Data
- Analyzing GSC Data with Pandas: Finding Striking Distance Keywords
- Integrating with SEO Tools and LLMs for Content Enhancement
AI Opinion of the Article:
This is a remarkably thorough and technically detailed guide. It effectively breaks down a complex task – automating Google Search Console data analysis – into manageable steps. The article’s strength lies in its comprehensiveness, covering everything from the nuances of Google Cloud Platform setup and API authentication to practical Python code examples and data manipulation techniques using Pandas.
The inclusion of considerations for integrating with external tools like SEMrush and LLMs further enhances its value, providing a complete workflow for identifying and acting on SEO opportunities. While the level of technical detail might be challenging for absolute beginners, it serves as an excellent resource for developers and SEO professionals looking to build custom solutions for content optimization. The systematic approach and clear explanations make this a valuable asset for anyone seeking to leverage the Google Search Console API for data-driven SEO strategies.