I originally covered this subject back on February 27th, 2003, the same day that I announced ongoing to the world. I think it’s worth revisiting, because it sure would be handy if there were such a thing as a Web Site, as Dave Winer, Sam Ruby, and Jeremy Zawodny have all observed.
Summary: I think the Jeremy/Dave idea of a site feed directory in OPML or some equivalent is just fine. I think, though, that we’re going to have to point to it from the individual pages rather than try to park it in “a well-known spot on the site.” Note that Joe Gregorio has published a rant along the same lines as this one, but covering some useful ground that I don’t.
The problem is that all the Web knows about is URIs, and the Web can’t tell whether a URI points to a home page, a picture of a cute cat, or to one of a dozen daily entries on some blog. On the other hand, there are a lot of things that we’d like to know about a site, including:
What pages are in the site.
What the home page of the site is.
What syndication feeds there are for the site (the problem that got us here today).
What the site-owner’s policies are for crawlers (what we now use robots.txt for).
What little icon should be displayed in the address bar (what we now use favicon.ico for).
What the site’s privacy policies and content ratings are.
Where the site’s sitemap is.
And I bet, down the road, once we really have the notion of a site, we’ll be able to think of all sorts of other useful things to do with it.
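If you want a feel for what I’m gesturing at, here’s a purely hypothetical sketch of a site-description file that ties the items on that list together. Every element name and the namespace URI are invented for illustration; no such format exists today.

<!-- Hypothetical only: a made-up site-description format -->
<site xmlns="http://example.com/ns/site">
  <home href="http://www.tbray.org/ongoing/"/>
  <feeds href="myfeeds.opml"/>
  <crawl-policy href="robots.txt"/>
  <icon href="favicon.ico"/>
  <sitemap href="map.html"/>
</site>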
Historically, things like robots.txt and favicon.ico have been jammed into the URI right after the host name. Unfortunately, there are lots of cases where this just doesn’t work. Anyone who’s run a big corporate website has gotten tired of explaining to, say, the HR group why they can’t have their own robots.txt in the root of their space so they can establish their own crawling policies. They don’t want to be pestering the webmaster for every little change, and the webmaster doesn’t want to be making those changes either. (If your setup allows the use of virtual hosts, that helps, but some don’t.)
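To illustrate, here’s the sort of thing I mean; example.com and the HR path are made up, but the mechanics are real. Crawlers only ever fetch robots.txt from the root of the host, so everybody’s rules have to be funneled through the one file:

# http://example.com/robots.txt (the only copy crawlers will fetch)
User-agent: *
Disallow: /hr/candidates/   # HR had to ask the webmaster for this line
# A robots.txt sitting under http://example.com/hr/ is simply ignored.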
Finding the “Site” Isn’t Simple ·
There’s just no way, as far as I can tell, to look at a URI and figure out what site it’s from. Some sites just aren’t hierarchical, and sometimes the site isn’t rooted at the top level. For example, the root of ongoing is at http://www.tbray.org/ongoing/, but there are things that are part of ongoing that don’t start with http://www.tbray.org/ongoing/, and there are things elsewhere on http://www.tbray.org/ that are part of other web sites.
In particular, some of the big content management systems have URI-space layouts that have nothing to do with hierarchy. (In general, most of them also have URI-space layouts that suck, but that’s another matter).
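For instance (these URIs are invented), a database-backed CMS might serve two different sites’ pages like this, with nothing in the path to tell you which site a page belongs to:

http://example.com/cms/page?id=8842   (the HR site’s home page)
http://example.com/cms/page?id=9917   (a press release on a different site)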
Grabbing Pieces of Namespace Isn’t OK ·
Now, let’s assume that we could somehow find the “root” of a web site by some magic. I just don’t think it’s OK now in 2003, when we’re maybe 1% of the way into the Web’s lifespan, to start gobbling up little bits of the namespace. As it is, the names robots.txt and favicon.ico are stolen forever; nobody will ever be able to use them for their own purposes again.
What To Do? ·
I think that the MyFeeds.opml idea is basically sound; there’s lots of room to argue about the merits or deficiencies of OPML but hey, it’s here and it works and it doesn’t get in the way. So in the short term, I would just arrange for web pages to contain one more link element like so:
<link rel="feed-directory" href="myfeeds.opml" />
It ain’t perfect, but it’ll get the job done. Down the road, if we all manage to agree on how to wire the notion of a “Site” into the Web, one of the things that a “Site File” ought to point to will, of course, be this OPML file. But that’s down the road.
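For concreteness, a minimal myfeeds.opml might look something like this; the feed URIs are invented, but the structure is garden-variety OPML, one outline element per feed:

<?xml version="1.0"?>
<opml version="1.1">
  <head>
    <title>Site feeds</title>
  </head>
  <body>
    <!-- xmlUrl points at the feed itself, htmlUrl at the page it covers -->
    <outline text="Main feed" type="rss"
             xmlUrl="http://example.com/index.rss"
             htmlUrl="http://example.com/"/>
  </body>
</opml>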