How To Crawl a Website For Quick SEO and Social Media Audits

by Mike Levin SEO & Datamaster, 04/03/2012

Update: 360iTiger was something from my long-ago past. If you’re interested in the new stuff, go learn about Pipulate.

360iTigerWhat makes 360iTiger, a tool that crawls websites and social counters directly into Google Docs, different from similar products? First, because it’s cloud and bookmarklet-based, there’s no software to download. There’s almost nothing to learn. It’s free and works in an environment many people are already comfortable in. In short, pretty much anyone can do web crawls using stuff they already have and know.

The only way to really appreciate this is to try it and have that “ah-ha!” moment. For me, this way of working just makes so much sense—the tools just sort of fade into the background, and it feels like you’re just using spreadsheets the way they’re supposed to work. Now that I’ve been using my own application for two years, and had it deployed for internal use where I work, it’s hard to imagine doing investigations like this any other way.

So then why isn’t this approach of just supercharging spreadsheets more commonly used? Well, there have been a few postings that have tried to show the way, but they keep falling into the same trap—teaching people how to capture data mostly using Google Doc’s built-in capabilities of the importFeed and importXML functions—using something called Google App Script, Google’s JavaScript answer to Microsoft’s Visual Basic for Applications (VBA). The problem is that this approach shifts onto the user too much responsibility to manage programming code to be directly embedded into each spreadsheet. In other words, the incantations to make the magic happen are just too complex.

Embedding all that code into the spreadsheet is just not a “frictionless” process. Case-in-point: you will notice a complete lack of a way to crawl a website using Google App Script. That’s because it’s kinda hard to cram the complex issues of managing a crawl into a sheet-resident JavaScript macro. It’s not impossible, and I investigated this approach. But all the elegance of working in a simple spreadsheet environment breaks down when you stuff it full of automation code.

Even though things like web crawlers are tough in Google App Script, the flip-side is that you can easily pull the data of individual pages with the importHTML function, and do further grabs at the data within using the optional xpath parameter—basically, everything but the crawl. Therefore, half the solution is dangled tantalizingly in front of you natively inside of Google Docs. You feel a kind of Kung Fu using native Google Docs capabilities to do SEO. Unfortunately, this is ultimately too restrictive, and blinds you to better approaches that live apart and separate from the spreadsheet.

Fortunately, there is better Kung Fu—and it involves using the Google Data API rather than Google App Script. Yes, there are two API’s to Google Docs, and they are not equivalent in capabilities. Crawls and many other tasks are challenging in App Script because you are be bound by the less-than-full control of it not being your own server. For example, you might hit http fetch quotas, or you might like to connect to someone else’s APIs but can’t install their client libraries on App Script. In short, Google App Script is not your own server, so you are limited to what Google lets you do.

What’s worse, you will have difficulty porting your work if something comes along with just as powerful a Web-API as Google Docs—as opposed to actually doing all the processing on your own server, and merely using Google Spreadsheets for input and output (I/O)—in which case you can keep your app as lightly coupled to Google Docs as possible, and your work takes on much broader potential than merely the modern version of Visual Basic for Applications (VBA).

That’s precisely what we have done with 360iTiger. It is back-ended by independent servers, so we can go well beyond the capabilities of Google App Script, and manage our code outside the Google Docs environment. The connection between Google Docs, our servers and your web browser is simply made via the same bookmarklet trick first popularized so many years ago my delicio.us, and more recently by Pinterest. Click the bookmarklet, and our servers start working their magic on your spreadsheet.

The upshot of this, is that data collection can be automated into spreadsheets with a single button-press from the user’s perspective. There’s no need to mess around with the Script Editor or Excel-like functions. Just visit a website, click a button, and now you have a spreadsheet auto-populating with the first 25 pages found on that site. Of course, there are additional parameters that you might like to control such as fetching different information off of the page, or crawling more than 25 pages. But that will all come in time as we reveal more and more of what lurks beneath the surface of the 360iTiger system.

Enhanced by Zemanta