This is a stripped-down implementation of the server side of the Atom Publishing Protocol as an Apache module, implemented in C. It felt like something that needed to exist and I am better-qualified for this particular chore than your average geek; having said that, I have no idea if anyone actually needs such a thing. mod_atom activity can be tracked on this blog, for now, here. If any interest develops, then I’ll transfer discussion to a blog at mod-atom.net which will be driven, of course, by mod_atom.

For the moment, I’m going to brain-dump everything about the project right here, if only as a crutch for my own memory. People who care about the Atom protocol, and those who care about Apache internals wrangling, might find it interesting; the intersection of those two groups is, I suspect, me.

What’s an Apache Module? · It’s code that gets linked into httpd, the Web server binary. There are hundreds; a few are included with the server distro, but most aren’t. Code in a module doesn’t have to do anything like CGI, you’re just a C subroutine that gets called with a package of details about the request and the current server state. Which can save some cycles. Might those cycles be significant in your application? Maybe, sometimes. If mod_atom is fast, it’s more apt to be fast because of its low-rent flat-file-only approach. On the other hand, being in the server means that you have to code in C and you have to be really careful about concurrency and memory management and all sorts of low-level grunge.

By the way, to be technically correct, whenever I say Apache I should probably be saying “httpd”, since while they used to be synonyms, Apache means much more now, the httpd Web Server is just one piece. But httpd is an ugly little splodge of letters, Apache sounds so much better. And on modern Debian-family systems, httpd is called “apache” anyhow.

Why Me? · Well, I understand the Atom Protocol pretty well and I’ve already written a couple of Apache modules (for a failed startup), so it’s less work for me than it would be for nearly anyone else.

Also, I think that the protocol is going to be a big enough part of the Web ecosystem that Apache, as perhaps the world’s single most important piece of Web infrastructure, really ought to support it. Think of it as giving PUT something useful to do.

What Does it Do? ·

  • Implements all of the Atom Protocol, near as I can tell.

  • There’s no database. Everything is persisted in files. Entry paths look like /blogs/tim/atom/e/entries/2007/06/23/cat-pix

  • Since it blasts Atom Entries straight into files, it can easily (unlike most Atom protocol implementations) preserve foreign markup.

  • It should run fine under any MPM, without concurrency issues.

  • All the atom:id values begin urn:uuid, so you could in principle move a whole publication from one server and directory to another. Those who have memories of me arguing bitterly against URNs in general and atom:id in particular can please restrain your snickering while I’m around.
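
To make the entry-path shape in the list above concrete, here is a minimal sketch in C. It assumes the layout is publication prefix, then /atom/e/, then the collection name, the date, and a slug, which is inferred from the single example path shown; entry_path and its parameters are hypothetical names, not taken from the mod_atom source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build an entry URI of the shape
 * <pub>/atom/e/<collection>/YYYY/MM/DD/<slug>.
 * Returns 0 on success, -1 if the slug contains a '/' or the result
 * does not fit in out. */
static int entry_path(const char *pub, const char *collection,
                      int year, int month, int day, const char *slug,
                      char *out, size_t outlen)
{
    int n;

    assert(pub && collection && slug && out);

    /* A slug is a single path segment in this sketch. */
    if (strchr(slug, '/') != NULL)
        return -1;

    n = snprintf(out, outlen, "%s/atom/e/%s/%04d/%02d/%02d/%s",
                 pub, collection, year, month, day, slug);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Called with "/blogs/tim", "entries", 2007, 6, 23 and "cat-pix", this produces the example path from the list.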

Configuration · There isn’t much. In your Apache config file, you can define as many “publications” as you want. Each requires one directive, for example:

AtomPub /blogs/joe /z0/pubs/blogs/jb "Joe's Blog" "J. Blow"

The first argument is a prefix; any URI beginning with it is considered to be part of the publication. The second is the filesystem directory where the data is rooted. The filenames are the same as the URIs, only with the directory substituted for the prefix. The title and author are self-explanatory. There are no defaults.
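
The prefix-for-directory substitution can be sketched in a few lines of C. This is an illustration under assumptions, not the module's code: pub_uri_to_path is a hypothetical name, and the check that the prefix ends on a path-segment boundary is my own addition so that /blogs/joe does not claim /blogs/joey.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from mod_atom: map a request URI to a
 * filename by swapping the publication prefix for its data directory.
 * Returns 0 on success, -1 if the URI is outside the publication or the
 * result does not fit in out. */
static int pub_uri_to_path(const char *prefix, const char *dir,
                           const char *uri, char *out, size_t outlen)
{
    size_t plen;
    int n;

    assert(prefix && dir && uri && out);
    plen = strlen(prefix);

    /* The URI must begin with the prefix... */
    if (strncmp(uri, prefix, plen) != 0)
        return -1;
    /* ...and (an assumption of this sketch) the match must end on a
     * path-segment boundary. */
    if (uri[plen] != '\0' && uri[plen] != '/')
        return -1;

    /* Everything after the prefix is appended to the directory as-is. */
    n = snprintf(out, outlen, "%s%s", dir, uri + plen);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With the directive above, a URI of /blogs/joe/atom/e/entries/2007/06/23/cat-pix would map to /z0/pubs/blogs/jb/atom/e/entries/2007/06/23/cat-pix.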

When Apache starts up, if there’s an AtomPub directive but the directory structure isn’t there, the init code creates it.

mod_atom doesn’t do any other configuration of any kind, for the moment. Yes, I know there are lots of other kinds of configurations you might like to be able to do. People talk about hitting an 80/20 point; this is more like a 60/1 point. Publications have collections, and per RFC4287, the minimum you need is a title and an author; so you really couldn’t do this with any less. And with one line in a config file you get a fully-functional publication.

One thing you can’t configure at all is the directory layout where the data goes. That’s hard-wired way deep into the code.

Right now, a publication comes with two hard-wired collections named “Entries” and “Media”. The code can actually (in theory) handle multiple Entry and Media collections, but I haven’t figured out a cheap enough way to configure them.

After all, haven’t people been saying “Complexion over Commiseration” or something like that recently?

How Much Work Was It To Implement the Atom Protocol? · Not much, actually, for a competent C programmer who understands the protocol and some of Apache. My Apache-module experience was less valuable than I’d expected, because I had written Apache 1.* modules and the 2.* API is quite a bit different.

Anyhow, I started on April 26th and I have enough today to start showing the world. I program fast but I’ve been busy, so it’s a very part-time thing. There are 8400 lines of code, but that includes 2600 lines of Genx (because Apache doesn’t have much of an XML generator) and then 2700 or so of unit-test code (1700 or so being Genx’s). So it’s really no big deal.

Life was immensely easier because of having the Ape available. Being an Apache module imposes some constraints that make unit testing tricky. While the Ape provides functional rather than unit testing, strictly speaking, using it shook out loads of bugs and saved a huge amount of time. The setup was amusingly arcane: the Ape’s Ruby code running under JRuby in a servlet in a Java EE app server talking to my naked hacked Apache server, 8080 to 4444 I think. What with some other things that are there to support ongoing, my little laptop is running more than its share of Web servers.

Rocket Science? · There’s really not much. You suck in XML and bit-bags from the net, you find a place to put ’em, you build feeds describing them, you echo them back on request, you’re careful about concurrency. It’s vanilla infrastructure engineering.

There’s one premature optimization; I worried about someone setting up a few thousand publications on one server (wouldn’t be surprising) and since the way a module works is you have to look at every URI that comes in to see if it’s one of yours, the task of scanning through your list of known pubs for prefix matches could be pretty costly. So, the mod_atom setup code compiles the list of known pubs into a simple little finite automaton which can tell you which if any of your pubs a URI belongs to really fast. Which is pretty silly, YAGNI territory probably. But I’m a sucker for finite automata.
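
For the curious, here is roughly what such an automaton can look like. This is a sketch under assumptions: it uses a plain byte-indexed trie, which is the simplest automaton of this kind and not necessarily what the mod_atom setup code compiles, and pub_node, pub_add and pub_match are hypothetical names. It spends about 2KB per state on a 64-bit machine, which is fine for an illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* One state of the automaton: a transition per byte value, plus the
 * index of the publication whose prefix ends here (-1 if none). */
typedef struct pub_node {
    struct pub_node *next[256];
    int pub;
} pub_node;

static pub_node *pub_node_new(void)
{
    pub_node *n = calloc(1, sizeof *n);
    assert(n != NULL);
    n->pub = -1;
    return n;
}

/* Setup time: thread one publication prefix through the automaton. */
static void pub_add(pub_node *root, const char *prefix, int pub)
{
    pub_node *n = root;
    const unsigned char *p;

    for (p = (const unsigned char *)prefix; *p; p++) {
        if (n->next[*p] == NULL)
            n->next[*p] = pub_node_new();
        n = n->next[*p];
    }
    n->pub = pub;
}

/* Request time: walk the URI once, remembering the longest prefix seen.
 * The cost depends on the URI length, not on how many publications are
 * configured. Returns the publication index, or -1 for "not mine". */
static int pub_match(const pub_node *root, const char *uri)
{
    const pub_node *n = root;
    const unsigned char *p;
    int best = root->pub;

    for (p = (const unsigned char *)uri; *p; p++) {
        n = n->next[*p];
        if (n == NULL)
            break;
        if (n->pub >= 0)
            best = n->pub;
    }
    return best;
}
```

After adding /blogs/joe as publication 0 and /blogs/tim as publication 1, pub_match on /blogs/tim/atom/e/entries/2007/06/23/cat-pix returns 1, and on /index.html it returns -1.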

I tried to avoid mutexing; the only place where you really have to (I think) is when a PUT comes in and you need to lock things down while you check the ETag and, if you accept the PUT, blast it in. I think you should be able to get enough concurrency out of the filesystem for the rest of the protocol. Based on what I hear, if someone took a mod_atom install and started firing PUTs at a few existing URIs from a lot of parallel sources, I bet the apr_global_mutex... calls would start to hurt pretty quick. I have lots more premature-optimization ideas for that situation.
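
The check-then-write step looks something like the following sketch. It is a simplification under loud assumptions: a process-local pthread mutex stands in for the cross-process apr_global_mutex calls, the resource lives in memory rather than in a file, and the resource struct, the "v1"-style ETags and conditional_put are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a stored entry. */
typedef struct {
    char etag[32];
    char body[256];
    unsigned version;
} resource;

/* Process-local lock; the real thing has to work across processes. */
static pthread_mutex_t put_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the HTTP status a conditional PUT would get: 200 if if_match
 * equals the current ETag and the body was replaced, 412 otherwise. */
static int conditional_put(resource *r, const char *if_match,
                           const char *new_body)
{
    int status;

    assert(r && if_match && new_body);

    /* Compare and replace under one lock, so no other writer can slip
     * in between the ETag check and the write. */
    pthread_mutex_lock(&put_lock);
    if (strcmp(r->etag, if_match) != 0) {
        status = 412;   /* Precondition Failed: the client's copy is stale */
    } else {
        snprintf(r->body, sizeof r->body, "%s", new_body);
        r->version++;
        snprintf(r->etag, sizeof r->etag, "\"v%u\"", r->version);
        status = 200;
    }
    pthread_mutex_unlock(&put_lock);
    return status;
}
```

A single lock like this serializes every PUT on the server, which is exactly why parallel PUTs would start to hurt; a lock per publication or per resource would be one obvious refinement.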

Frankly, the hardest bit was figuring out all the autoconf and libtool voodoo to compile the sucker, and in the end I couldn’t; in the finest open-source tradition I reused code from Josh Rotenberg and did cut/paste/hack till it worked.

I’m assuming that one of these days someone I respect will explain to me why libtool & friends are a good idea and how to use them properly; until then I’m going to ignore them and hope they’re replaced. This technique allowed me to avoid ever learning either imake or C++.

Legal Status · Apache V2 license, copyright Sun Microsystems. If the ASF ever got interested, I have the go-ahead to sign over whatever to whomever. Haven’t figured out where to host yet, but here’s a tarball. If you want to actually try to run it, do please contact me.

Technical Status · It’s not really ready to use, but I’m publishing it because I want to start talking and get some advice and opinions on what I should do about some things, and that’s easier if you can point at source code.

mod_atom passes a few (eighty-odd) unit tests, plus it gets a clean bill of health from the Ape. One of my short-term to-dos is to run Joe Gregorio’s test client against it. I’m pretty sure the basic technical approach to wrangling entries and feeds is sensible and can probably be made to run very efficiently.

It has one big and one small missing piece, and a major enhancement I think would be good. The big missing piece is HTML (see next section). The small missing piece is collection paging; it just isn’t there at the moment; you get the last 20 entries in reverse app:edited order and that’s all you get. No biggie.
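
The "last 20 in reverse app:edited order" selection is simple enough to sketch. Assumptions: each entry is represented by a hypothetical entry_ref record pairing a filename with its edited time, and feed_select and newest_first are illustrative names rather than anything in the mod_atom source. Because each array element carries its own timestamp, plain qsort is enough and the comparator needs no global state.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define FEED_MAX 20   /* the feed size mentioned above */

/* Hypothetical record: a filename paired with its app:edited time. */
typedef struct {
    const char *name;
    time_t edited;
} entry_ref;

/* qsort comparator: most recently edited first. */
static int newest_first(const void *a, const void *b)
{
    const entry_ref *x = a, *y = b;

    if (x->edited != y->edited)
        return x->edited < y->edited ? 1 : -1;
    return 0;
}

/* Sorts entries in place, newest first, and returns how many of them
 * belong in the feed (at most FEED_MAX). */
static size_t feed_select(entry_ref *entries, size_t n)
{
    assert(entries != NULL || n == 0);
    if (n > 1)
        qsort(entries, n, sizeof *entries, newest_first);
    return n < FEED_MAX ? n : FEED_MAX;
}
```

Paging would then be a matter of returning a later window of the same sorted array instead of always the first one.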

The big enhancement I want to do is non-destructive editing. Right now it implements PUT by replacing the old data with the new, and DELETE by, well, deleting the data. I think it would be better, in all cases, to copy the data aside, uh, somewhere. But I want to talk to people about this one too, because I suspect it may involve weird corners.

To HTML or not to HTML? · For the moment, mod_atom is just an Atom server, not a blog engine. Which is to say that it accepts and stores and updates and deletes the Atom Entries and generates feeds appropriately, but doesn’t actually generate any HTML versions.

I’m not sure what to do about this. It’d be pretty easy to just pull the data out of the Atom Entries, wrap some basic HTML around it, and have a blogging engine. But I think it’s irresponsible to publish HTML from outside without sanitizing it. While I’m betting that it’s appropriate to do the low-level persistence and CRUD in the bowels of httpd, I’m having trouble believing that HTML sanitation and beautification belong in there too. There are tools like TagSoup and Hpricot which are just the thing for the job.

So maybe there is an ancillary “blogging system” that does the necessary with the Atom entries? Or maybe there’s a TagSoup equivalent available for C that could help out?

To-Do · Suggestions welcome.

  1. Try it out on a few other systems; right now I’ve only tested on OS X. I expect breakage in my hacked-up build system, but not much in the actual code. Programs written in C are portable, everyone knows that.

  2. Shake it down with Joe Gregorio’s APP Test Client.

  3. Add a bunch more tests to the Ape for bits of the protocol which, now having implemented them, I realize are tricky. In particular, the Ape never tested sending a PUT to a media resource, so that portion of the mod_atom code is unexercised and likely buggy.

  4. Add collection paging.

  5. See if anyone at ASF might be interested, now or down the road.

  6. Fix up error handling so that client errors get an explanation in the response body, not just an HTTP error code. Apache doesn’t make this as straightforward as you might expect.

  7. Simultaneously, refactor error-handling internally. Some of my routines return apr_status_t and others char *; it’s kind of ad-hoc and not very well thought through.

  8. Figure out how to do some load testing.

  9. Do some evangelism. My eyes have that Ruby gleam these days, and grinding out all this C has been kind of painful, so it would be nice if it turned out to be useful for somebody.



Contributions


From: Alex Waterhouse-Hayward (Jun 26 2007, at 22:11)

The above is Greek to me in English and in Spanish I would say, "Es chino." I am intrigued by your constant self-labeling as a geek. That reminds me of my friend, baroque violinist (and violist) Paul Luchkow who says that he can call his violin a fiddle but I may not.

Alexwh


From: pkeane (Jun 26 2007, at 23:14)

httpd: Syntax error on line 115 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_atom.so into server: /etc/httpd/modules/mod_atom.so: undefined symbol: qsort_r

very excited about getting this working, by the way...


From: John Cowan (Jun 26 2007, at 23:28)

Rich $alz at IBM has written a C++ version of TagSoup which is going to be released (with my blessing) under Apache 2.0; not all the legal process is done, though. Nag him about it; I can't legally give you my copy.


From: Justin (Jun 26 2007, at 23:50)

Very cool, DeWitt Clinton mentioned the possibility a while ago. If the APP takes off the way many are expecting it to, having an easy-to-set-up server could make for a nice WebDAV alternative, at a minimum.

If the module gains complexity, configuring it using an annotated Atom service document might be an interesting possibility, same goes for category documents. Both seem like good places to provide special directives for the server.


From: Tim (Jun 26 2007, at 23:59)

PKeane (and follow his link for some good guitar tunes): you mean other Unix-like operating systems don't have qsort_r?!?!? Sigh.


From: Manuzhai (Jun 27 2007, at 01:18)

Hmm, it only works with httpd >= 2.2? That's a pity.

checking for Apache 2.0 version >= 2.2.4... no

configure: error: *** Apache version 2.2.4 not found!


From: Asbjørn Ulsberg (Jun 27 2007, at 03:35)

Very exciting, Tim. Looking forward to seeing where this is going; I'm hoping for ASF incubation/adoption. For shared-source environments I'd think it needs a way to configure where to store stuff. And a database backend would be awesome. Not sure how pretty that would be to implement in C, though.


From: katre (Jun 27 2007, at 06:38)

As far as HTML display, it seems to me the best bet is for mod_atom to emit the data for an entry, suitably sanitized, and let the actual user then use XSLT or whatever to transform that into actual HTML. This steps nicely around the issue of including an entire templating system into the module, and lets end users set up whatever frontend they want.

It sounds pretty cool, I'm definitely interested in seeing what happens next. I also want to take a look at the finite automata code and see if I can figure out how it works. :-)


From: John Hart (Jun 27 2007, at 09:08)

What kind of finite automata? A trie?

A Bloom Filter would also work well for this sort of thing (it has no false negatives but it does have a configurable rate of false positives, so you'd have to actually check the directory structure after a "yes" but you're doing that anyway).


From: Seth Gordon (Jun 27 2007, at 11:33)

Planet (http://www.planetplanet.org/) was designed to crawl all the feeds on the blogroll and produce some appropriately formatted HTML page with all their contents; you could just set it up so it only read your own blog's mod_atom feed, make some appropriate template, and voila!


From: d.w. (Jun 27 2007, at 15:16)

Nice -- might be interesting to hack on something like Blosxom to provide the blog-serving part of the equation.


From: lennon (Jun 27 2007, at 15:42)

I've made the requisite trivial change to the sources to make them build on Linux, (i.e., using qsort and a global iterator pointer in place of qsort_r, and accepting that only a forking MPM will be safe) but I'm wondering if there's something I'm missing with the configuration.

The module loads fine, and on-disk storage directories are being created, but I get a 404 for any request under the AtomPub publishing root. Any suggestions for next steps?


From: Tim (Jun 27 2007, at 16:19)

Seth: Cool idea about Planet.

lennon: I patched that one. I'm not willing to give up on concurrency, so my patch laboriously packs the filename/mtime pairs into an array and doesn't require any globals.

lennon: The software is fussy about file names. It explicitly won't return anything unless it looks like (via regex matching) something it created via a POST. What were you trying to retrieve?


From: Marius Mathiesen (Jun 28 2007, at 01:39)

Thanks, Tim, this will be a great help in exploring APP from my Ruby apps. It installed sweetly on my Debian install on Parallels, looks like a fine piece of software.

So your painful C work the last couple of days allows me to do APP stuff in Ruby, making me happy.


From: Thomas Broyer (Jun 28 2007, at 14:56)

Hi Tim,

re HTML parsing/sanitizing, maybe you'd like to port html5lib to C or C++ and/or integrate it to libxml? ;-)

html5lib: http://code.google.com/p/html5lib/

libxml: http://xmlsoft.org/


From: Elliotte Rusty Harold (Jul 02 2007, at 04:22)

Great idea, but shouldn't this be called mod-atompub rather than mod-atom? I had to read halfway down before I figured out what this actually did, and why it mattered. Previously I kept trying to figure out why Apache needed a new module just to serve Atom feeds. :-)


From: Colm Divilly (Jul 06 2007, at 10:41)

Given the following httpd.conf instruction:

AtomPub /blogs/joe /z0/pubs/blogs/jb "Joe's Blog" "J. Blow"

what would the uri of the service document be ? I figured it should be /blogs/joe/atom/service, but I got a 404 for that :(


June 25, 2007