How To Crawl a Website For Quick SEO and Social Media Audits
by Mike Levin SEO & Datamaster, 04/03/2012
Update: 360iTiger was something from my long-ago past. If you’re interested in the new stuff, go learn about Pipulate.
What makes 360iTiger, a tool that crawls websites and pulls social counters directly into Google Docs, different from similar products? First, because it's cloud- and bookmarklet-based, there's no software to download. There's almost nothing to learn. It's free and works in an environment many people are already comfortable in. In short, pretty much anyone can do web crawls using stuff they already have and know.
The only way to really appreciate this is to try it and have that “ah-ha!” moment. For me, this way of working just makes so much sense—the tools just sort of fade into the background, and it feels like you’re just using spreadsheets the way they’re supposed to work. Now that I’ve been using my own application for two years, and had it deployed for internal use where I work, it’s hard to imagine doing investigations like this any other way.
Even though things like web crawlers are tough in Google Apps Script, the flip-side is that you can easily pull the data of individual pages natively in a spreadsheet: the importHTML function grabs tables and lists off a page, and the importXML function takes an XPath query so you can grab specific data within a page. Basically, everything but the crawl. Therefore, half the solution is dangled tantalizingly in front of you natively inside of Google Docs. You feel a kind of Kung Fu using native Google Docs capabilities to do SEO. Unfortunately, this is ultimately too restrictive, and blinds you to better approaches that live apart and separate from the spreadsheet.
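For instance, the fetch-and-extract half looks like this right in a spreadsheet cell (the URL and XPath here are placeholder examples, not anything specific to 360iTiger):

```
=IMPORTXML("http://example.com/", "//title")
=IMPORTHTML("http://example.com/", "table", 1)
```

IMPORTXML fetches a page and returns whatever the XPath query matches; IMPORTHTML pulls the nth table or list. What neither can do is follow the links it finds, and that is the crawl half of the problem.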
Fortunately, there is better Kung Fu, and it involves using the Google Data API rather than Google Apps Script. Yes, there are two APIs to Google Docs, and they are not equivalent in capabilities. Crawls and many other tasks are challenging in Apps Script because you are bound by the less-than-full control of code that doesn't run on your own server. For example, you might hit HTTP fetch quotas, or you might like to connect to someone else's APIs but can't install their client libraries in Apps Script. In short, Google Apps Script is not your own server, so you are limited to what Google lets you do.
What's worse, you will have difficulty porting your work if something comes along with a Web API just as powerful as Google Docs'. Do all the processing on your own server instead, and merely use Google Spreadsheets for input and output (I/O), and you can keep your app as lightly coupled to Google Docs as possible. Your work then takes on much broader potential than merely being the modern version of Visual Basic for Applications (VBA).
That's precisely what we have done with 360iTiger. It is back-ended by independent servers, so we can go well beyond the capabilities of Google Apps Script, and manage our code outside the Google Docs environment. The connection between Google Docs, our servers and your web browser is simply made via the same bookmarklet trick first popularized so many years ago by del.icio.us, and more recently by Pinterest. Click the bookmarklet, and our servers start working their magic on your spreadsheet.
The upshot of this is that data collection can be automated into spreadsheets with a single button-press from the user's perspective. There's no need to mess around with the Script Editor or Excel-like functions. Just visit a website, click a button, and you have a spreadsheet auto-populating with the first 25 pages found on that site. Of course, there are additional parameters you might like to control, such as fetching different information off of the page or crawling more than 25 pages. But that will all come in time as we reveal more and more of what lurks beneath the surface of the 360iTiger system.
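The crawl half that the spreadsheet can't do on its own is conceptually simple. Here is a minimal sketch in Python of a same-site, breadth-first crawl that stops at 25 pages. This is a hypothetical illustration only: the function names, the injected `fetch` callable, and the stubbed-out spreadsheet write are my assumptions, not the actual 360iTiger code.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, limit=25):
    """Breadth-first crawl of same-host pages, up to `limit` pages.

    `fetch` is any callable returning the HTML of a URL; in a real
    system it would be an HTTP client, but it is injected here so the
    crawl logic stays self-contained and testable.
    """
    host = urlparse(start_url).netloc
    queue, seen, rows = deque([start_url]), {start_url}, []
    while queue and len(rows) < limit:
        url = queue.popleft()
        html = fetch(url)
        rows.append(url)  # in the real tool, a full row of page data
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Stay on the same site and never queue a page twice.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return rows  # these rows would be written back to the spreadsheet
```

The design point is the one made above: the crawl runs on a server you control, and the spreadsheet is just where the resulting rows land.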