ongoing by Tim Bray

This is a stripped-down implementation of the server side of the Atom Publishing Protocol as an Apache module, implemented in C. It felt like something that needed to exist and I am better-qualified for this particular chore than your average geek; having said that, I have no idea if anyone actually needs such a thing. mod_atom activity can be tracked on this blog, for now, here. If any interest develops, then I’ll transfer discussion to a blog at mod-atom.net which will be driven, of course, by mod_atom.

For the moment, I’m going to brain-dump everything about the project right here, if only as a crutch for my own memory. People who care about the Atom protocol, and those who care about Apache internals wrangling, might find it interesting; the intersection of those two groups is, I suspect, me.

What’s an Apache Module? · It’s code that gets linked into httpd, the Web server binary. There are hundreds; a few are included with the server distro, but most aren’t. Code in a module doesn’t have to do anything like CGI; it’s just a C subroutine that gets called with a package of details about the request and the current server state, which can save some cycles. Might those cycles be significant in your application? Maybe, sometimes. If mod_atom is fast, it’s more apt to be fast because of its low-rent flat-file-only approach. On the other hand, being in the server means that you have to code in C and you have to be really careful about concurrency and memory management and all sorts of low-level grunge.

By the way, to be technically correct, whenever I say Apache I should probably be saying “httpd”, since while they used to be synonyms, Apache means much more now; the httpd Web Server is just one piece. But httpd is an ugly little splodge of letters, Apache sounds so much better. And on modern Debian-family systems, httpd is called “apache2” anyhow.

Why Me? · Well, I understand the Atom Protocol pretty well and I’ve already written a couple of Apache modules (for a failed startup), so it’s less work for me than it would be for nearly anyone else.

Also, I think that the protocol is going to be a big enough part of the Web ecosystem that Apache, as perhaps the world’s single most important piece of Web infrastructure, really ought to support it. Think of it as giving PUT something useful to do.

What Does it Do? ·

Implements all of the Atom Protocol, near as I can tell.
There’s no database. Everything is persisted in files. Entry paths look like /blogs/tim/atom/e/entries/2007/06/23/cat-pix
Since it blasts Atom Entries straight into files, it can easily (unlike most Atom protocol implementations) preserve foreign markup.
It should run fine under any MPM, without concurrency issues.
All the atom:id values begin urn:uuid, so you could in principle move a whole publication from one server and directory to another. Those who have memories of me arguing bitterly against URNs in general and atom:id in particular can please restrain your snickering while I’m around.

Configuration · There isn’t much. In your Apache config file, you can define as many “publications” as you want. Each requires one directive, for example:

AtomPub /blogs/joe /z0/pubs/blogs/jb "Joe's Blog" "J. Blow"

The first argument is a prefix; any URI beginning with it is considered to be part of the publication. The second is the filesystem directory where the data is rooted. The filenames are the same as the URIs, only with the directory substituted for the prefix. The title and author are self-explanatory. There are no defaults.
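The prefix-for-directory swap is simple enough to sketch in a few lines of C. This is an illustration only, not code from mod_atom: the function name `pub_uri_to_path` and its return convention are made up for the example, and it does plain "begins with" matching exactly as described above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from mod_atom: build a filename from a
 * request URI by substituting the publication's directory for its prefix.
 * Returns 0 on success, -1 if the URI does not begin with the prefix or
 * the result does not fit in `out`. */
int pub_uri_to_path(const char *prefix, const char *dir,
                    const char *uri, char *out, size_t outlen)
{
    size_t plen = strlen(prefix);
    int n;

    /* "Any URI beginning with it": a plain prefix comparison. */
    if (strncmp(uri, prefix, plen) != 0)
        return -1;

    /* Directory first, then whatever followed the prefix in the URI. */
    n = snprintf(out, outlen, "%s%s", dir, uri + plen);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With the directive above, a request for /blogs/joe/atom/e/entries/2007/06/23/cat-pix would land on /z0/pubs/blogs/jb/atom/e/entries/2007/06/23/cat-pix.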

When Apache starts up, if there’s an AtomPub directive but the directory structure isn’t there, the init code creates it.

mod_atom doesn’t do any other configuration of any kind, for the moment. Yes, I know there are lots of other kinds of configurations you might like to be able to do. People talk about hitting an 80/20 point; this is more like a 60/1 point. Publications have collections, and per RFC4287, the minimum you need is a title and an author; so you really couldn’t do this with any less. And with one line in a config file you get a fully-functional publication.

One thing you can’t configure at all is the directory layout where the data goes. That’s hard-wired way deep into the code.

Right now, a publication comes with two hard-wired collections named “Entries” and “Media”. The code can actually (in theory) handle multiple Entry and Media collections, but I haven’t figured out a cheap enough way to configure them.

After all, haven’t people been saying “Complexion over Commiseration” or something like that recently?

How Much Work Was It To Implement the Atom Protocol? · Not much, actually, for a competent C programmer who understands the protocol and some of Apache. My Apache-module experience was less valuable than I’d expected, because I had written Apache 1.* modules and the 2.* API is quite a bit different.

Anyhow, I started on April 26th and I have enough today to start showing the world. I program fast but I’ve been busy, so it’s a very part-time thing. There are 8400 lines of code, but that includes 2600 lines of Genx (because Apache doesn’t have much of an XML generator) and then 2700 or so of unit-test code (1700 or so being Genx’s). So it’s really no big deal.

Life was immensely easier because of having the Ape available. Being an Apache module imposes some constraints that make unit testing tricky. While the Ape provides functional rather than unit testing, strictly speaking, using it shook out loads of bugs and saved a huge amount of time. The setup was amusingly arcane: the Ape’s Ruby code running under JRuby in a servlet in a Java EE app server talking to my naked hacked Apache server, 8080 to 4444 I think. What with some other things that are there to support ongoing, my little laptop is running more than its share of Web servers.

Rocket Science? · There’s really not much. You suck in XML and bit-bags from the net, you find a place to put ’em, you build feeds describing them, you echo them back on request, you’re careful about concurrency. It’s vanilla infrastructure engineering.

There’s one premature optimization; I worried about someone setting up a few thousand publications on one server (wouldn’t be surprising) and since the way a module works is you have to look at every URI that comes in to see if it’s one of yours, the task of scanning through your list of known pubs for prefix matches could be pretty costly. So, the mod_atom setup code compiles the list of known pubs into a simple little finite automaton which can tell you which if any of your pubs a URI belongs to really fast. Which is pretty silly, YAGNI territory probably. But I’m a sucker for finite automata.
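For the curious, here is roughly what such an automaton can look like. This is a minimal sketch, assuming a byte-at-a-time trie; it is not the mod_atom code, and the names `pub_state`, `pub_matcher_add`, and `pub_matcher_find` are invented for the illustration. The point is that lookup cost depends on the length of the URI, not on how many publications are configured.

```c
#include <stdlib.h>

/* One state of the automaton: a transition for each possible byte, plus
 * the index of the publication whose prefix ends here (-1 if none).
 * Hypothetical layout, chosen for clarity rather than memory use. */
typedef struct pub_state {
    struct pub_state *next[256];
    int pub;
} pub_state;

/* Allocate an empty state with no transitions and no publication. */
pub_state *pub_state_new(void)
{
    pub_state *s = calloc(1, sizeof *s);
    if (s != NULL)
        s->pub = -1;
    return s;
}

/* Compile one prefix into the automaton.
 * Returns 0 on success, -1 if an allocation fails. */
int pub_matcher_add(pub_state *root, const char *prefix, int pub)
{
    pub_state *s = root;
    const unsigned char *p;

    for (p = (const unsigned char *)prefix; *p != '\0'; p++) {
        if (s->next[*p] == NULL) {
            s->next[*p] = pub_state_new();
            if (s->next[*p] == NULL)
                return -1;
        }
        s = s->next[*p];
    }
    s->pub = pub;
    return 0;
}

/* Walk the URI one byte at a time. Returns the publication with the
 * longest prefix matching the start of the URI, or -1 if none does. */
int pub_matcher_find(const pub_state *root, const char *uri)
{
    const pub_state *s = root;
    int found = s->pub;
    const unsigned char *p;

    for (p = (const unsigned char *)uri; *p != '\0'; p++) {
        s = s->next[*p];
        if (s == NULL)
            break;              /* no publication prefix continues this way */
        if (s->pub >= 0)
            found = s->pub;     /* remember the longest match so far */
    }
    return found;
}
```

You would build the automaton once at startup, one `pub_matcher_add` call per publication, and then call `pub_matcher_find` on every incoming URI.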

I tried to avoid mutexing; the only place where you really have to (I think) is when a PUT comes in and you need to lock things down while you check the ETag and, if you accept the PUT, blast it in. I think you should be able to get enough concurrency out of the filesystem for the rest of the protocol. Based on what I hear, if someone took a mod_atom install and started firing PUTs at a few existing URIs from a lot of parallel sources, I bet the apr_global_mutex... calls would start to hurt pretty quick. I have lots more premature-optimization ideas for that situation.
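The check-then-write step looks something like the sketch below. It is a stand-in, not the real thing: a pthread mutex takes the place of the APR global mutex, the entry lives in memory rather than in a file, and the `resource` struct, the version-counter ETag, and the function names are all assumptions made up for the example.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory stand-in for one stored entry. */
typedef struct {
    pthread_mutex_t lock;
    unsigned long version;      /* bumped on every accepted PUT */
    char etag[32];              /* derived from version, e.g. "v1" in quotes */
    char body[1024];
} resource;

void resource_init(resource *r, const char *body)
{
    pthread_mutex_init(&r->lock, NULL);
    r->version = 1;
    snprintf(r->etag, sizeof r->etag, "\"v%lu\"", r->version);
    snprintf(r->body, sizeof r->body, "%s", body);
}

/* Conditional PUT: the ETag comparison and the write happen under one
 * lock, so two clients holding the same ETag cannot both succeed.
 * Returns an HTTP status: 200 if replaced, 412 if the ETag is stale. */
int resource_put(resource *r, const char *if_match, const char *new_body)
{
    int status;

    pthread_mutex_lock(&r->lock);
    if (strcmp(if_match, r->etag) != 0) {
        status = 412;           /* Precondition Failed: someone got there first */
    } else {
        snprintf(r->body, sizeof r->body, "%s", new_body);
        r->version++;
        snprintf(r->etag, sizeof r->etag, "\"v%lu\"", r->version);
        status = 200;
    }
    pthread_mutex_unlock(&r->lock);
    return status;
}
```

Everything outside that one critical section is left unlocked, which is the whole idea: readers and POSTs lean on the filesystem, and only the compare-and-replace is serialized.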

Frankly, the hardest bit was figuring out all the autoconf and libtool voodoo to compile the sucker, and in the end I couldn’t; in the finest open-source tradition I reused code from Josh Rotenberg and did cut/paste/hack till it worked.

I’m assuming that one of these days someone I respect will explain to me why libtool & friends are a good idea and how to use them properly; until then I’m going to ignore them and hope they’re replaced. This technique allowed me to avoid ever learning either imake or C++.

Legal Status · Apache V2 license, copyright Sun Microsystems; if the ASF ever got interested, I have the go-ahead to sign over whatever to whomever. Haven’t figured out where to host yet, but here’s a tarball. If you want to actually try to run it, do please contact me.

Technical Status · It’s not really ready to use, but I’m publishing it because I want to start talking and get some advice and opinions on what I should do about some things, and that’s easier if you can point at source code.

mod_atom passes a few (eighty-odd) unit tests, plus it gets a clean bill of health from the Ape. One of my short-term to-dos is to run Joe Gregorio’s test client against it. I’m pretty sure the basic technical approach to wrangling entries and feeds is sensible and can probably be made to run very efficiently.

It has one big and one small missing piece, and a major enhancement I think would be good. The big missing piece is HTML (see next section). The small missing piece is collection paging; it just isn’t there at the moment; you get the last 20 entries in reverse app:edited order and that’s all you get. No biggie.
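The current no-paging behavior amounts to a sort and a cutoff. A minimal sketch, assuming the entries are already in an array with their app:edited times parsed into `time_t`; the `entry_ref` struct and `feed_select` are hypothetical names, not mod_atom’s.

```c
#include <stdlib.h>
#include <time.h>

#define FEED_LIMIT 20           /* the feed holds at most this many entries */

/* Hypothetical record for one entry in a collection. */
typedef struct {
    const char *path;           /* where the entry file lives */
    time_t edited;              /* parsed app:edited timestamp */
} entry_ref;

/* qsort comparator: most recently edited first. */
static int by_edited_desc(const void *a, const void *b)
{
    const entry_ref *x = a;
    const entry_ref *y = b;

    if (x->edited > y->edited) return -1;
    if (x->edited < y->edited) return 1;
    return 0;
}

/* Sort entries newest-first in place and return how many of them, from
 * the front of the array, belong in the feed. */
size_t feed_select(entry_ref *entries, size_t count)
{
    qsort(entries, count, sizeof *entries, by_edited_desc);
    return count < FEED_LIMIT ? count : FEED_LIMIT;
}
```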

The big enhancement I want to do is non-destructive editing. Right now it implements PUT by replacing the old data with the new, and DELETE by, well, deleting the data. I think it would be better, in all cases, to copy the data aside, uh, somewhere. But I want to talk to people about this one too, because I suspect it may involve weird corners.

To HTML or not to HTML? · For the moment, mod_atom is just an Atom server, not a blog engine. Which is to say that it accepts and stores and updates and deletes the Atom Entries and generates feeds appropriately, but doesn’t actually generate any HTML versions.

I’m not sure what to do about this. It’d be pretty easy to just pull the data out of the Atom Entries, wrap some basic HTML around it, and have a blogging engine. But I think it’s irresponsible to publish HTML from outside without sanitizing it. While I’m betting that it’s appropriate to do the low-level persistence and CRUD in the bowels of httpd, I’m having trouble believing that HTML sanitation and beautification belong in there too. There are tools like TagSoup and Hpricot which are just the thing for the job.

So maybe there is an ancillary “blogging system” that does the necessary with the Atom entries? Or maybe there’s a TagSoup equivalent available for C that could help out?

To-Do · Suggestions welcome.

Try it out on a few other systems; right now I’ve only tested on OS X. I expect breakage in my hacked-up build system, but not much in the actual code. Programs written in C are portable, everyone knows that.
Shake it down with Joe Gregorio’s APP Test Client.
Add a bunch more tests to the Ape for bits of the protocol which, now having implemented them, I realize are tricky. In particular, the Ape never tested sending a PUT to a media resource, so that portion of the mod_atom code is unexercised and likely buggy.
Add collection paging.
See if anyone at ASF might be interested, now or down the road.
Fix up error handling so that client errors get an explanation in the response body, not just an HTTP error code. Apache doesn’t make this as straightforward as you might expect.
Simultaneously, refactor error-handling internally. Some of my routines return apr_status_t and others char *; it’s kind of ad-hoc and not very well thought through.
Figure out how to do some load testing.
Do some evangelism. My eyes have that Ruby gleam these days, and grinding out all this C has been kind of painful so it would be nice if it turned out to be useful for somebody.

Contributions

From: Alex Waterhouse-Hayward (Jun 26 2007, at 22:11)

The above is Greek to me in English and in Spanish I would say, "Es chino." I am intrigued by your constant self-labeling as a geek. That reminds me of my friend, baroque violinist (and violist) Paul Luchkow who says that he can call his violin a fiddle but I may not.

Alexwh