Robots—also known as spiders—are programs that retrieve Web resources, using the embedded hyperlinks to automate the process. Robots retrieve data for all sorts of purposes, but they were invented mostly to drive search engines. Herewith a tour through Robot Village.
Shakin’ All Over · I’m going to begin with a war story. The year was 1994 and the Web was a new thing; my access was via Mosaic and a 14.4K modem. At the time, I worked for Open Text, then a commercial search software vendor. At an SGML conference, a fellow named Eric van Herwijnen gave an “Introduction To the Web” speech; he introduced this notion of a “Web Search Engine” and offered the opinion that these would be hot stuff.
Sitting in the audience, I felt like a bomb had gone off in my head; we had a search engine, and it wouldn’t be that hard to write a program to fetch web pages and copy them in and index the text and follow the hyperlinks and then link our search engine to a Web Server and let anybody search the Web. All obvious stuff, now; but at that point maybe 100 people in the world had had that bomb go off in their head. I was so excited that I was physically shaking on and off for the next couple of days. I sold my CEO on the idea and spent six months writing the software, and in April 1995 the Open Text Index of the Web launched, and in January 1996 we did an IPO on NASDAQ, and Open Text is now a big, successful, profitable company (but no longer a search company).
To prepare for the launch of Antarctica, I ended up writing another large-scale hundreds-of-millions-of-pages Web Robot, so I’ve been down this track twice.
How Robots Work · Every search-engine robot ever written uses essentially the same approach, and it’s a simple one:
1. Select one of the URIs you know about.
2. Do a GET on that URI.
3. Decide whether you can index whatever you got back. If not, go to Step 1.
4. Update the search-engine index with what you just fetched.
5. Extract the hyperlinks from what you just fetched.
6. Add the URIs from those hyperlinks to the list you know about.
7. Go to Step 1.
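To make the loop concrete, here’s a minimal sketch in Python. The helpers it calls (fetch, can_index, index_document, extract_links) are hypothetical stand-ins for the machinery discussed in the rest of this piece; everything interesting happens inside them.

```python
# A minimal sketch of the crawl loop above; the helper functions passed in
# are hypothetical stand-ins, not real library calls.
from collections import deque

def crawl(seed_uris, fetch, can_index, index_document, extract_links):
    known = set(seed_uris)        # every URI we have ever seen
    frontier = deque(seed_uris)   # URIs waiting to be fetched

    while frontier:
        uri = frontier.popleft()                    # Step 1: select a URI
        response = fetch(uri)                       # Step 2: GET it
        if response is None or not can_index(response):
            continue                                # Step 3: can't use it, move on
        index_document(uri, response)               # Step 4: update the index
        for link in extract_links(uri, response):   # Step 5: extract hyperlinks
            if link not in known:                   # Step 6: remember new URIs
                known.add(link)
                frontier.append(link)
        # Step 7: loop back and select again
```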
So, robots should be simple. The problem is, when you want to run them on really big hypertexts—like, say, the whole Web—you end up doing each of these things maybe a billion times, so little problems become huge, enormous problems very quickly. I’ll use that list of steps above to structure a walk through the problems.
1. Select A URI · Well, let’s assume you know millions of URIs; what order do you crawl them in? You probably need more than one strategy running in parallel. One of my robots used an approach that’s a lot like what Google seems to do: a “fast crawl” and a “deep crawl.” The Deep Crawl tries to revisit more or less everything, probably in more or less random order. The Fast Crawl keeps going back to places that get updated a lot, looking for new stuff. ongoing gets visited by Google’s fast crawl lots of times every day; sometimes the lag between me publishing something and Google cruising by to grab it is just a few minutes. No, they’re not (I don’t think) reading my RSS feed; they’ve just made it adaptive and it’s learned that there are lots of updates here.
In my first robot, for deep-crawling I had a big pool of URIs and selected from it randomly, so I ended up going to a different server every time. This was problematic (among other things, I burned huge amounts of time doing DNS lookups), so for my second robot I organized the URI pool by server; each thread would line up all the URIs for some server and get them one by one, up to some maximum like a thousand or so. This made way more efficient use of all the network infrastructure, and on balance I recommend it.
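Here’s a rough sketch of that per-server batching, using nothing beyond the Python standard library; the thousand-URI cap is the kind of limit described above, not a tuned recommendation.

```python
# Group a pool of URIs by host so a worker thread can drain one server's
# batch (up to max_per_server URIs) before moving on to the next host.
from urllib.parse import urlsplit

def batches_by_server(uri_pool, max_per_server=1000):
    by_host = {}
    for uri in uri_pool:
        by_host.setdefault(urlsplit(uri).hostname, []).append(uri)
    for host, uris in by_host.items():
        yield host, uris[:max_per_server]
```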
2. GET the URI · Here’s where you start getting into difficulties. This step runs orders of magnitude slower than all the other steps in our robot algorithm; the Web is slow enough that a thread mostly consumed with fetching URIs is going to be spending 99% or so of its time idle, picking up the bytes trickling in off the Net. The solution to this is obvious: run lots of threads. Every serious Web robot is running hundreds or thousands of GET threads in parallel, with a relatively smaller number of threads collecting the completed fetches and working on the rest of the steps.
This means that the code that does the GET needs to have a fairly light memory footprint. In practical terms, you can’t do it with something like Perl, which is memory-hungry (and doesn’t do threads very well); running thousands of Perl processes in parallel is beyond the capabilities of even big beefy computers.
Once you’re sending GETs out into the wilds of the Net, all sorts of weird things start to happen. Lots of servers (or the routes to them) are slow; some just hang up or go offline unpredictably, or turn out to yield gigabytes of data, or find new and surprising ways to misbehave. What you end up having to write is some highly-tuned code that has a routine that might be described as “Go do a GET on this URI, but don’t wait for longer than this many seconds for the data to trickle in and don’t collect more than this many bytes of data, and when you’re done come back and report what happened.”
I’ve always ended up having to do this in C, because you really need low-level access to your OS network primitives to avoid mysterious lockups and blowups. You can test all you want on your private network, but when you send your little software child off into the wilds of the Net, you can be sure that all sorts of weird stuff is going to start happening.
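The real thing, as noted, ends up in C; purely to show the shape of the routine, here’s a rough Python approximation built on the standard library (a sketch, not what I actually ran). Note that the timeout here applies to individual socket operations rather than acting as a total deadline, which is one of the reasons you end up wanting lower-level control.

```python
# Approximate "GET this URI, but give up after so many seconds and so many
# bytes, then come back and report what happened" using urllib. The real
# thing wants raw sockets; this only illustrates the contract.
from urllib.request import urlopen
from urllib.error import URLError

def bounded_get(uri, timeout_seconds=30, max_bytes=1_000_000):
    try:
        with urlopen(uri, timeout=timeout_seconds) as response:
            body = response.read(max_bytes + 1)   # read one extra byte to detect overflow
            return {
                "status": response.getcode(),
                "final_uri": response.geturl(),    # where any redirects landed us
                "content_type": response.headers.get("Content-Type", ""),
                "body": body[:max_bytes],
                "truncated": len(body) > max_bytes,
            }
    except (URLError, OSError, ValueError) as problem:  # URLError is itself an OSError
        return {"error": str(problem)}                   # report, don't kill the thread
```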
Another thing that a robot has to be smart about is redirects—a remarkably high proportion of all the URIs you find out there in the wild redirect to somewhere else; for example http://www.tbray.org redirects to http://www.tbray.org/, with a slash on the end. There’s no way to predict them (it’s often not as simple as a slash), you just have to deal with them when they happen.
Finally, you have to respect the Robot Exclusion Protocol, i.e. whenever you visit a server, read /robots.txt and respect what it says. There are lots of good libraries out there that implement this correctly.
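For what it’s worth, Python’s standard library includes a Robot Exclusion Protocol parser; a sketch of the check, with a made-up user-agent name, might look like this. A real robot would cache the parsed robots.txt per host instead of re-reading it for every URI.

```python
# Check /robots.txt before fetching a URI. The user-agent name is invented
# for illustration; substitute whatever your robot actually announces.
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(uri, user_agent="ExampleBot"):
    parts = urlsplit(uri)
    robots_uri = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rules = RobotFileParser()
    rules.set_url(robots_uri)
    rules.read()       # fetches and parses robots.txt; cache this per host in practice
    return rules.can_fetch(user_agent, uri)
```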
3. Decide Whether You Can Use What You Got · When you do a GET on a URI, the server will send you some headers along with the data; one of them, labeled Content-type, tells you what the data is: HTML, graphics, audio, video, whatever. So what your robot needs is a list of Content-types that it can handle: HTML for sure (including XHTML in its various flavors), then maybe PDF and Word and PowerPoint and so on and so on.
One problem is that a small proportion of server administrators get their Content-type wrong; and when you’re fetching a billion URIs, “a small proportion” means that you’ve got to deal with errors in the millions.
The worst thing that happens is when someone sends you a video feed claiming to be HTML, which will quite likely cause your text parser to blow up messily; but there is an infinite variety of things that can (and will) go wrong here. In my earlier attempts, I wrote fairly aggressive code to peek inside the data and try to guess its type and work around the errors; but this turns out not to be remotely cost-effective. Make your indexing software exit gracefully when the data isn’t what it says it is, and just don’t worry about missing those pages; there are more than enough others out there.
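In code, that gate can be as simple as a whitelist of media types your indexer actually understands; this is a sketch, and the particular list is illustrative rather than a recommendation.

```python
# Decide whether a response is worth indexing, based only on its declared
# Content-type. Anything not on the list is quietly skipped.
INDEXABLE_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "application/pdf",
    "application/msword",
}

def can_index(content_type_header):
    # Drop parameters like "; charset=utf-8" and normalize case before comparing.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in INDEXABLE_TYPES
```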
4. Update Your Index · It turns out that update performance has been a weak link of search engines since the dawn of time. One of the nice things about Lucene is that it claims to have good performance in this area. Still, an index covering a billion or so Web resources is just amazingly difficult to update in real time.
Up until mid-2003, Google couldn’t manage it; hence the well-known monthly phenomenon called the “Googledance,” when they’d be updating their indices and you could do the same query twice in a row and get very different results, depending on whether the server you hit had been updated or not.
I’ll skip lightly over the problems of extracting the words to be indexed from the data you’ve retrieved. This is pretty easy with any form of XML, kind of painful with most HTML which is broken and malformed and kludgy, really difficult with Word Processor files, and very difficult indeed (impossible in some cases) with PDF. You don’t have to write your own software, there are commercial and Open-Source products to address most of these problems.
Of course, you have to be prepared to deal with the fact that the vast majority of the HTML out there is not remotely close to being valid, and in many cases is not even (in XML terms) well-formed; so parsing it gets into rocket science. Fortunately, this problem, while hard, is thoroughly solved.
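As one small illustration of the tolerant-parsing approach (not one of the products alluded to above), Python’s html.parser will chew through tag soup without demanding validity or well-formedness; here’s a sketch of pulling indexable text out of whatever HTML comes back.

```python
# Extract visible text from (possibly broken) HTML, skipping script and style
# content; html.parser is forgiving about unclosed and mis-nested tags.
from html.parser import HTMLParser

class TextScraper(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    scraper = TextScraper()
    scraper.feed(html)
    return " ".join(scraper.chunks)
```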
5, 6. Link/URI Wrangling · Extracting hyperlinks, especially from HTML, is not rocket science. The only real problem here is normalizing them before you stuff them in the database; for example, for robot purposes, all of the following are equivalent:
HTTP://WWW.EXAMPLE.COM/a/b
http://www.example.com:80/a/b
http://www.example.com/a/../a/./b
And there are lots more variations.
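A sketch of that normalization, using only the standard library: lowercase the scheme and host, drop default ports, collapse “.” and “..” path segments, and discard fragments, so all three forms above reduce to http://www.example.com/a/b. (posixpath.normpath is a blunt instrument; it also strips trailing slashes, so treat this as illustrative rather than complete.)

```python
# Normalize a URI to a canonical form before adding it to the crawl database,
# so trivially different spellings of the same resource collapse together.
from posixpath import normpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"             # keep only non-default ports
    path = normpath(parts.path) if parts.path else "/"
    return urlunsplit((scheme, host, path, parts.query, ""))  # fragment dropped
```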
Social Considerations · When you’re running a robot, your code is going to end up interacting with thousands of Web servers, which means that you are going to end up interacting with a few of the people behind them. You’ll get lots of feedback, and most of it falls into three buckets:
Who the #*$^! are you and why are you pestering me?!?!
I updated my site Wednesday and you didn’t crawl me again until Sunday!!!
Your stupid buggy robot tried to fetch 8,351 nonexistent URIs from my site!!!
The first two kinds you can deal with via boilerplate, the first by pointing them at the Robot Exclusion Protocol, the second by ignoring them. The third will require real attention; whether it’s because you really do have a bug, or because of the webmaster’s stupid broken relative URIs, your robot probably did abuse his site, and it’s not his problem, it’s your problem.
Implementation Considerations · Both my robots were in Perl, plus a low-level C module to handle the actual URI fetching. If I were doing it again, I’d probably use Perl again, or maybe Python because it’s so much more hip & fashionable these days. A very high proportion of your code is going to be doing text-wrangling, whether of URIs or page contents; and unless you’re in the top 0.005% of programmers, or like Google can throw a team at it, you’re not going to write C or Java code that’s going to outperform Perl/Python text-wrangling.
Any code that deals with the Internet in the raw and in the large is going to have problems with performance and survivability. A substantial robot is a 24/7, never-ending performance optimization problem with a substantial customer-relations component. Speaking as one who’s intimate with the ongoing server logs, I can testify that there are a lot of very badly written robots out there, many of them paying insufficient attention to public relations.
There was a time when robots were important, glamorous, and new. They’re still important, but there’s little novelty left in the state of the art, and it was never the least bit glamorous.