This one for web-tech aficionados only.
Those of you who watch your webserver logs, go do a fgrep msnbot access_log
(MSNbot got to ongoing today).
Unlike any robot I’ve seen or heard of, MSNbot tells you the referer, so you
can actually watch the trails it takes into and through your online presence.
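If you would rather see the trail as a list of hops than eyeball raw log lines, here is a minimal sketch. It assumes your server writes the common "combined" log format (referer and user-agent as the last two quoted fields); the class name RefererTrail, the method hops() and the sample log line are all made up for illustration, and unlike fgrep it only looks for "msnbot" in the user-agent field rather than anywhere on the line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RefererTrail {
    // Assumed layout (combined log format):
    // host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"(\\S+) (\\S+)[^\"]*\" \\d{3} \\S+ \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns one "referer -> path" string per request whose user-agent mentions msnbot. */
    static List<String> hops(List<String> logLines) {
        List<String> out = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = COMBINED.matcher(line);
            // group 2 = requested path, group 3 = referer, group 4 = user-agent
            if (m.matches() && m.group(4).toLowerCase().contains("msnbot")) {
                out.add(m.group(3) + " -> " + m.group(2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample line, not taken from a real log.
        List<String> sample = List.of(
            "192.0.2.1 - - [01/Jan/2004:10:00:00 -0800] \"GET /b.html HTTP/1.0\" 200 512 "
                + "\"http://example.org/a.html\" \"msnbot/0.0 (made-up agent)\"");
        hops(sample).forEach(System.out::println);
    }
}
```

Lines that do not fit the assumed format are skipped silently, so a different LogFormat setting would need a different pattern.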
Neato. Ordinary people who are well-integrated with the real world can
safely ignore this fascinating discovery and be fairly sure it will not
impair their quality-of-life.
Move along, now; nothing to watch here.
(Update: I’m baffled, this makes no sense.)
Baffled ·
I’ve written two large-scale many-millions-of-pages Web robots, so I
have some experience in this space.
And this referer behavior has me shaking my head.
Every robot I’ve written or known about has more or less the same basic
algorithm; you keep a big pool of URIs to work on, and you have a zillion
parallel threads, each doing:
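while (true)
{
    // take the next URI off the shared pool and fetch it
    URI uri = pool.getUriThatNeedsCrawling();
    Page page = uri.fetch();
    page.addContentToIndex();

    // feed every link found on the page back into the pool
    URI link;
    while ((link = page.nextUriInPage()) != null)
        pool.addUri(link);
}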
That loop never remembers which page a link came from because, of course, the
same URI is going to show up in lots of different pages, so the notion that
there is one referer is just wrong.
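To make the many-referers point concrete, here is a small sketch of a pool that does keep track of where each link was seen. This is not how any particular robot works: the two-argument addUri() and pagesLinkingTo() are hypothetical additions of mine, URIs are plain strings to keep it short, and it is single-threaded, so a real pool shared by a zillion threads would need synchronization.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class UriPool {
    private final Deque<String> todo = new ArrayDeque<>();
    // For each URI, every page it has been seen on so far (hypothetical bookkeeping).
    private final Map<String, Set<String>> foundOn = new LinkedHashMap<>();

    /** Records that link was seen on page; the link is queued only the first time. */
    void addUri(String link, String page) {
        Set<String> pages = foundOn.get(link);
        if (pages == null) {
            pages = new LinkedHashSet<>();
            foundOn.put(link, pages);
            todo.add(link);
        }
        pages.add(page);
    }

    /** Returns the next URI to fetch, or null when the pool is empty. */
    String getUriThatNeedsCrawling() {
        return todo.poll();
    }

    /** Every page known to link to uri; there is usually more than one. */
    Set<String> pagesLinkingTo(String uri) {
        return foundOn.getOrDefault(uri, Set.of());
    }
}
```

Add the same link from two different pages and it is queued once but has two candidate referers, which is the whole problem: by the time a thread gets around to fetching it, there is no single right answer to send.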
Side-trip: One of the really interesting design choices in designing a robot
is what getUriThatNeedsCrawling() does.
In my most recent robot, I had each thread work on a single site, asking for
its pages one by one; other designs give each thread a new random page to
fetch, usually not from the same site.
You can get a heated discussion going among any group of robot designers by
raising this issue (by the way, I’m right). But I digress.
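For the curious, the two camps might be sketched roughly like this. Everything here is an illustrative assumption rather than code from any real robot: the class names PerSite and RandomPick, the claimSite() and next() methods, and the choice to group sites by host name are all mine.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class Schedulers {
    /** Per-site policy: one queue per host, and a thread claims a whole host at a time. */
    static class PerSite {
        private final Map<String, Deque<URI>> byHost = new LinkedHashMap<>();

        void add(URI u) {
            byHost.computeIfAbsent(u.getHost(), h -> new ArrayDeque<>()).add(u);
        }

        /** Hands over all pending URIs for one host, or null when nothing is left. */
        Deque<URI> claimSite() {
            Iterator<Deque<URI>> it = byHost.values().iterator();
            if (!it.hasNext()) return null;
            Deque<URI> queue = it.next();
            it.remove();  // no other thread will be given this host's current queue
            return queue;
        }
    }

    /** Random policy: each request gets an arbitrary pending URI, usually from another site. */
    static class RandomPick {
        private final List<URI> pending = new ArrayList<>();
        private final Random rng;

        RandomPick(long seed) {
            rng = new Random(seed);
        }

        void add(URI u) {
            pending.add(u);
        }

        /** Returns a randomly chosen pending URI, or null when nothing is left. */
        URI next() {
            if (pending.isEmpty()) return null;
            int i = rng.nextInt(pending.size());
            int last = pending.size() - 1;
            URI picked = pending.get(i);
            pending.set(i, pending.get(last));  // swap-remove keeps removal O(1)
            pending.remove(last);
            return picked;
        }
    }
}
```

Neither class is thread-safe as written; the point is only the shape of the choice, with per-site keeping one thread's requests to a host in order and random spreading consecutive requests across hosts.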
But I have no idea what MSNbot is doing; maybe they’ve got a radical new crawler architecture. That would be surprising.