Tackling MOZ Browser Automation Until I Get It Right
I'm tackling browser automation with the MOZ Links API and Chrome, and have set up a Playwright context on a Linux server. I'm writing code to automate tasks for MOZ Pro, and running the code from the command line with 'python mozpro.py -k' to analyze keywords and download the results. Join me on my journey as I explore the possibilities of automated browser tasks with MOZ Pro!
Exploring Automated Browser Tasks with MOZ Pro - Join Me on My Journey!
By Michael Levin
Thursday, April 6, 2023
Wow, okay, I just slammed out what I think is a great introduction to the MOZ Links API. It should knock people’s socks off with how effectively I can walk them through the basics of an API and browser automation. The Links API stuff is really just a warm-up, given how simple its issues are compared to browser automation.
I made a few interesting choices yesterday, and now I have to decide whether I’m happy with them. I wanted to push right ahead to nbdev as applied to the Links API and Browser Automation notebooks, but I don’t have all the fundamentals of the browser automation worked out yet.
First, I decided to lock in on Chrome. Even though it would seem that using Firefox or Chromium would be leaning into the strengths of Microsoft Playwright, there are always little nuances that keep steering me back to Chrome. The one big drawback is that the ability to load and save specific state files, the way you can with the --load-storage and --save-storage flags of playwright codegen, does not appear to be available through the playwright.chromium.launch_persistent_context() method.
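For comparison, here is a minimal sketch of how saved state can be loaded and written back when a regular (non-persistent) context is acceptable. It relies on Playwright's `browser.new_context(storage_state=...)` and `context.storage_state(path=...)`; the names `STATE_FILE`, `open_context_with_state`, and `save_state` are hypothetical, and this approach gives up the genuine user profile that a persistent context provides.

```python
import asyncio
from pathlib import Path

# Hypothetical location for the saved state; not part of the original script.
STATE_FILE = Path("assets/state.json")


async def open_context_with_state(playwright, state_file=STATE_FILE, headless=True):
    """Launch Chrome without a persistent profile, reusing saved state if present.

    `playwright` is the object yielded by `async_playwright()`.
    """
    browser = await playwright.chromium.launch(channel="chrome", headless=headless)
    # storage_state accepts the path of a JSON file written by save_state() below.
    state = str(state_file) if Path(state_file).exists() else None
    context = await browser.new_context(storage_state=state)
    return browser, context


async def save_state(context, state_file=STATE_FILE):
    """Write the context's cookies and local storage to a JSON file."""
    Path(state_file).parent.mkdir(parents=True, exist_ok=True)
    await context.storage_state(path=str(state_file))


# Usage sketch (requires Playwright and Chrome to be installed):
#
#   from playwright.async_api import async_playwright
#
#   async def demo():
#       async with async_playwright() as p:
#           browser, context = await open_context_with_state(p)
#           page = await context.new_page()
#           await page.goto("https://moz.com/")
#           await save_state(context)
#           await browser.close()
#
#   asyncio.run(demo())
```

The trade-off is the one weighed in this post: state files are portable and easy to swap, while a persistent context behaves like a real Chrome profile.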
I’m torn between leaning into Playwright’s defaults and doing what my gut tells me is worth the extra work. Let’s keep it Chrome. I think I could easily switch to Edge if I need to. I could also connect it to genuine user profiles, which would not be so easy with Firefox or Chromium.
I renamed the file assets/mozcreds.txt, which is deliberately kept out of the repo, to linksapi.txt so that I can reuse the mozcreds.txt name for the more general website login info. Performing the login on every run while still using persistent sessions under Chrome/Edge seems to be the best way to go.
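As an aside on that credentials file: the scripts in this post read it with `x.strip().split(" ")[1]`, which implies two lines of the form "label value". A minimal sketch of a more forgiving reader, assuming that layout (the function name `read_creds` is hypothetical), might look like this:

```python
from pathlib import Path


def read_creds(path):
    """Return (username, password) from a two-line 'label value' file."""
    values = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, such as a trailing newline
        # Split on the first space only, so a value containing spaces survives.
        _label, _, value = line.partition(" ")
        values.append(value.strip())
    if len(values) != 2:
        raise ValueError(f"expected 2 credential lines in {path}, found {len(values)}")
    return values[0], values[1]
```

Calling `UN, PW = read_creds("assets/mozcreds.txt")` would then stand in for the list comprehension, with a clearer error when the file is malformed.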
Okay, now this bit of code handles login. It has quite a number of things going for it:
- It uses genuine Linux Chrome. Issues regarding User Agents are easier.
- It’s tied to a genuine User Profile which offers similar advantages.
- While this could be an issue when moving to a server, a genuine profile is not required.
- The user_data location can be changed to a temporary directory.
- Operation is otherwise identical because login still occurs.
import nest_asyncio

nest_asyncio.apply()

import asyncio
import IPython
from playwright.async_api import Playwright, async_playwright

pause_to_record = False
# pause_to_record = True
slow_mo = 50
headless = True
moz_creds = "assets/mozcreds.txt"
chrome_exe = "/usr/bin/google-chrome"
downloads_path = "/home/ubuntu/Downloads"
user_data = "/home/ubuntu/.config/google-chrome/"
# user_data = "session"


def in_notebook():
    """Return True if run from a Jupyter Notebook and False if not."""
    try:
        if IPython.get_ipython().__class__.__name__ == "ZMQInteractiveShell":
            return True  # Jupyter notebook or qtconsole
        else:
            return False  # Other type (likely a script)
    except NameError:
        return False  # Probably standard Python interpreter


# Each line of the credentials file is "label value"; keep only the value.
with open(moz_creds) as fh:
    UN, PW = [x.strip().split(" ")[1] for x in fh.readlines()]


async def main():
    async with async_playwright() as playwright:
        context = await playwright.chromium.launch_persistent_context(
            viewport={"width": 1200, "height": 1100},
            downloads_path=downloads_path,
            executable_path=chrome_exe,
            user_data_dir=user_data,
            accept_downloads=True,
            headless=headless,
            channel="chrome",
            slow_mo=slow_mo,
        )
        page = await context.new_page()
        await page.goto("https://moz.com/")

        # Codegen activated
        if pause_to_record:
            await page.pause()  # Edit this line in for codegen and out for automation.

        # -- BEGIN CODEGEN LINES --
        try:
            await page.get_by_role("link", name="Log in").click()
            await page.locator("#email").click()
            await page.locator("#email").fill(UN)
            await page.locator("#email").press("Tab")
            await page.locator("#password").fill(PW)
            await page.locator("#password").press("Enter")
            await page.get_by_role("navigation").get_by_role("link", name="Moz Pro").click()
        except Exception:
            # Best-effort login: ignore failures and carry on to Keyword Explorer.
            pass
        await page.goto("https://analytics.moz.com/pro/keyword-explorer/keyword/suggestions?locale=en-US")
        await page.get_by_placeholder("Enter a term or phrase to get analysis, suggestions, difficulty, and more").fill("linux")
        await page.get_by_role("button", name="analyze").click()
        async with page.expect_download() as download_info:
            await page.get_by_role("button", name="Export CSV").click()
        download = await download_info.value
        await download.save_as("downloads/" + download.suggested_filename)
        # -- END CODEGEN LINES --

        # When done, close the browser.
        await asyncio.sleep(2)
        await context.close()


async def run_main():
    await main()


if in_notebook():
    try:
        asyncio.get_running_loop()
        asyncio.run(run_main())
    except RuntimeError as e:
        if "no running event loop" in str(e):
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
            loop.run_until_complete(run_main())
else:
    asyncio.run(run_main())
Wow, I played around with it a lot and I think I’ve got my next major breakthrough. This script simply goes into MOZ Keyword Explorer and downloads the CSV file for the keyword “linux”.
I want to get as much mileage out of this script as possible. nbdev is rapidly coming up, but before that I want to make the script optionally support arguments.
import nest_asyncio

nest_asyncio.apply()

import asyncio
from playwright.async_api import Playwright, async_playwright


def in_notebook():
    """Return True if run from a Jupyter Notebook and False if not."""
    try:
        import IPython

        if IPython.get_ipython().__class__.__name__ == "ZMQInteractiveShell":
            return True  # Jupyter notebook or qtconsole
        else:
            return False  # Other type (likely a script)
    except (ImportError, NameError):
        return False  # IPython unavailable: standard Python interpreter


if in_notebook():
    keyword = "Foo"  # or set to any default value that you prefer
else:
    import argparse

    parser = argparse.ArgumentParser(description="Example script")
    parser.add_argument("-k", "--keyword", type=str, required=True, help="Value for keyword")
    args = parser.parse_args()
    keyword = args.keyword

pause_to_record = False
# pause_to_record = True
slow_mo = 50
headless = False
moz_creds = "assets/mozcreds.txt"
chrome_exe = "/usr/bin/google-chrome"
downloads_path = "/home/ubuntu/Downloads"
user_data = "/home/ubuntu/.config/google-chrome/"
# user_data = "session"

# Each line of the credentials file is "label value"; keep only the value.
with open(moz_creds) as fh:
    UN, PW = [x.strip().split(" ")[1] for x in fh.readlines()]


async def main():
    async with async_playwright() as playwright:
        context = await playwright.chromium.launch_persistent_context(
            viewport={"width": 1200, "height": 800},
            downloads_path=downloads_path,
            executable_path=chrome_exe,
            user_data_dir=user_data,
            accept_downloads=True,
            headless=headless,
            channel="chrome",
            slow_mo=slow_mo,
        )
        page = await context.new_page()
        await page.goto("https://moz.com/")

        # Codegen activated
        if pause_to_record:
            await page.pause()  # Edit this line in for codegen and out for automation.

        # -- BEGIN CODEGEN LINES --
        try:
            await page.get_by_role("link", name="Log in").click()
            await page.locator("#email").click()
            await page.locator("#email").fill(UN)
            await page.locator("#email").press("Tab")
            await page.locator("#password").fill(PW)
            await page.locator("#password").press("Enter")
            await page.get_by_role("navigation").get_by_role(
                "link", name="Moz Pro"
            ).click()
        except Exception:
            # Best-effort login: ignore failures and carry on to Keyword Explorer.
            pass
        await page.goto(
            "https://analytics.moz.com/pro/keyword-explorer/keyword/suggestions?locale=en-US"
        )
        await page.get_by_placeholder(
            "Enter a term or phrase to get analysis, suggestions, difficulty, and more"
        ).fill(keyword)
        await page.get_by_role("button", name="analyze").click()
        async with page.expect_download() as download_info:
            await page.get_by_role("button", name="Export CSV").click()
        download = await download_info.value
        await download.save_as("downloads/" + download.suggested_filename)
        # -- END CODEGEN LINES --

        # When done, close the browser.
        await asyncio.sleep(2)
        await context.close()


async def run_main():
    await main()


if in_notebook():
    try:
        asyncio.get_running_loop()
        asyncio.run(run_main())
    except RuntimeError as e:
        if "no running event loop" in str(e):
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
            loop.run_until_complete(run_main())
else:
    asyncio.run(run_main())
Wow, this is pretty amazing. It’s tested and working in both Jupyter Notebook and Windows Subsystem for Linux (WSL) running Ubuntu 20.04.1 LTS. I’m going to have to test it on Linux without Windows.
Okay, it’s time to set this up on my NAS LXD container. Making this work on a headless cloud server is part of the pitch and I have to be able to follow through. The first problem I found is that git doesn’t track empty folders. I already have 3 I want to track, so I’ll have to create them and add a .gitkeep file to each:
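The notebook-versus-command-line switch in the script above can also be isolated into one small function, which makes it easy to exercise without launching a browser. This is only a sketch: the helper name `get_keyword` and its `argv` parameter are hypothetical, while the `-k/--keyword` flag and the "Foo" default come from the script.

```python
import argparse


def get_keyword(argv=None, in_notebook=False, default="Foo"):
    """Return the keyword from -k/--keyword, or the default inside a notebook."""
    if in_notebook:
        # Notebooks have no meaningful command line, so fall back to the default.
        return default
    parser = argparse.ArgumentParser(description="Example script")
    parser.add_argument(
        "-k", "--keyword", type=str, required=True, help="Value for keyword"
    )
    # argv=None makes argparse read sys.argv[1:], as in the original script.
    return parser.parse_args(argv).keyword
```

In the script this would be called as `keyword = get_keyword(in_notebook=in_notebook())`, and a quick check from a Python prompt is `get_keyword(["-k", "linux"])`.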
- assets
- downloads
- dbs
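Creating those folders and their placeholder files can be scripted so a fresh clone or a new container ends up in the same shape. A minimal sketch follows; the function name `ensure_tracked_dirs` is hypothetical, and .gitkeep is only a naming convention, since any file inside a folder is enough for git to track it.

```python
from pathlib import Path


def ensure_tracked_dirs(root=".", names=("assets", "downloads", "dbs")):
    """Create each folder under root and drop an empty .gitkeep file in it."""
    keep_files = []
    for name in names:
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        keep = folder / ".gitkeep"
        keep.touch(exist_ok=True)  # empty placeholder so git has a file to track
        keep_files.append(keep)
    return keep_files
```

After running `ensure_tracked_dirs()` from the repo root, the .gitkeep files still need to be added and committed with git.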
Okay, worked like a charm. I had to create the mozcreds.txt file and put my credentials in there. I also had to pip install nest_asyncio. I added nest_asyncio to the drinkme requirements.txt file.
One of my realizations is that I can turn headless on or off now based on the server. Here’s the final form before nbdev-ifying it:
import nest_asyncio

nest_asyncio.apply()

import asyncio
from playwright.async_api import Playwright, async_playwright

pause_to_record = False
# pause_to_record = True
slow_mo = 50
moz_creds = "assets/mozcreds.txt"
chrome_exe = "/usr/bin/google-chrome"
downloads_path = "/home/ubuntu/Downloads"
user_data = "/home/ubuntu/.config/google-chrome/"
# user_data = "session"


def in_notebook():
    """Return True if run from a Jupyter Notebook and False if not."""
    try:
        import IPython

        if IPython.get_ipython().__class__.__name__ == "ZMQInteractiveShell":
            return True  # Jupyter notebook or qtconsole
        else:
            return False  # Other type (likely a script)
    except (ImportError, NameError):
        return False  # IPython unavailable: standard Python interpreter


if in_notebook():
    keyword = "Foo"  # or set to any default value that you prefer
    headless = False  # show the browser when working interactively
else:
    import argparse

    headless = True  # no display is assumed when run as a script on a server
    parser = argparse.ArgumentParser(description="Example script")
    parser.add_argument(
        "-k", "--keyword", type=str, required=True, help="Value for keyword"
    )
    args = parser.parse_args()
    keyword = args.keyword

# Each line of the credentials file is "label value"; keep only the value.
with open(moz_creds) as fh:
    UN, PW = [x.strip().split(" ")[1] for x in fh.readlines()]


async def main():
    async with async_playwright() as playwright:
        context = await playwright.chromium.launch_persistent_context(
            viewport={"width": 1200, "height": 800},
            downloads_path=downloads_path,
            executable_path=chrome_exe,
            user_data_dir=user_data,
            accept_downloads=True,
            headless=headless,
            channel="chrome",
            slow_mo=slow_mo,
        )
        page = await context.new_page()
        await page.goto("https://moz.com/")

        # Codegen activated
        if pause_to_record:
            await page.pause()  # Edit this line in for codegen and out for automation.

        # -- BEGIN CODEGEN LINES --
        try:
            await page.get_by_role("link", name="Log in").click()
            await page.locator("#email").click()
            await page.locator("#email").fill(UN)
            await page.locator("#email").press("Tab")
            await page.locator("#password").fill(PW)
            await page.locator("#password").press("Enter")
            await page.get_by_role("navigation").get_by_role(
                "link", name="Moz Pro"
            ).click()
        except Exception:
            # Best-effort login: ignore failures and carry on to Keyword Explorer.
            pass
        await page.goto(
            "https://analytics.moz.com/pro/keyword-explorer/keyword/suggestions?locale=en-US"
        )
        await page.get_by_placeholder(
            "Enter a term or phrase to get analysis, suggestions, difficulty, and more"
        ).fill(keyword)
        await page.get_by_role("button", name="analyze").click()
        async with page.expect_download() as download_info:
            await page.get_by_role("button", name="Export CSV").click()
        download = await download_info.value
        await download.save_as("downloads/" + download.suggested_filename)
        # -- END CODEGEN LINES --

        # When done, close the browser.
        await asyncio.sleep(2)
        await context.close()


async def run_main():
    await main()


if in_notebook():
    try:
        asyncio.get_running_loop()
        asyncio.run(run_main())
    except RuntimeError as e:
        if "no running event loop" in str(e):
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
            loop.run_until_complete(run_main())
else:
    asyncio.run(run_main())
When called from the command line, it looks like this:
$ python mozpro.py -k "foo"
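The script decides on headless mode purely from whether it runs in a notebook. A possible refinement, sketched here under the assumption that a Linux machine with no `DISPLAY` or `WAYLAND_DISPLAY` environment variable has no screen to show a browser on, is to check for a display as well. The function name `should_run_headless` and the environment check are not part of the original script.

```python
import os


def should_run_headless(in_notebook, env=None):
    """Return True unless a notebook is driving and a display is available.

    Assumption: on Linux, a missing DISPLAY/WAYLAND_DISPLAY means no display.
    """
    env = os.environ if env is None else env
    has_display = bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    # Command-line runs stay headless, matching the original script's behavior.
    return not (in_notebook and has_display)
```

With this, `headless = should_run_headless(in_notebook())` would keep a notebook on a display-less server from trying to open a visible browser window.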
Very interesting. I have some decisions to make regarding how many variations I want for different MOZ Pro “endpoints”. The Web UI is going to have so many parameters! I’m not going to start a holy endeavor to cover everything. I’m just going to provide a few excellent template examples.