This is a lengthy note to myself. I initially wanted to capture the thinking that went into the construction of mod_atom while it was still fresh in my mind, and dumped out the first dozen or so sections. Then, as I expanded and refactored the code, I found that I was keeping this up to date. This is mostly by way of putting it in a place where I won’t lose it. I can write stuff for ongoing faster than for any other medium, and “On the Net” is a good place not to lose stuff. If mod_atom eventually gets picked up and used, this may be useful to me or anyone else who’s maintaining it; and if it doesn’t, there’ll still eventually be an AtomPub server module for Apache, and this might be useful to whoever builds it. But this is not designed to be entertaining or pedagogical; among other things, it’s in essentially random order.
I’m publishing it in November 2008, 14 months after I started writing it, because I’m giving the first-ever public speech about mod_atom at ApacheCon 2008 and I want something to point at should anyone be interested.
What’s a “Publication”? · mod_atom is built around the notion of a “Publication”, which has a one-to-one correspondence with a “Service Document” as described in RFC5023. There is a strong notion of a publication which is expected to have at least one Entry collection (to which Atom Entries can be POSTed) and one Media collection (to which any old bag of bits can be POSTed).
Abbreviations & Nomenclature · By “Apache” I mean the code produced by the Apache Software Foundation’s HTTP Server Project, often called “httpd” but this is misleading as the executable sometimes runs under the name “apache” or “apache2”. Specifically, I’m referring to the version against which mod_atom was developed, 2.2.*.
To keep function names reasonably short, MLE stands for Media Link Entry, and FP stands for Front Page, i.e. the “index.html” thingie that is the face a publication shows to the world. Pub stands for “publication”, which in the mod_atom context means the set of collections described by one AtomPub Service Document.
A “Feed” is an Atom document whose root is <atom:feed>. A “Collection” is a Feed as used in RFC5023; it can accept POST requests to create new resources.
Files and URIs · The default operations of Apache assume file-backed resources. Thus mod_atom’s mapping from URI into filesystem space gets quite a bit of support from the infrastructure.
Mostly Static · mod_atom tries to impose the minimum possible tax on the URI processing path. It has to check every URI going through the server to see if it applies to a pub, and re-write those that do from URI into filesystem space. Then almost all GET requests can be tossed back to Apache’s static-file processor without further intervention.
There’s an exception when you have to fetch anything but the first twenty entries out of a feed. The feed URI takes count= and start= query parameters and generates the feed dynamically when it sees them. Otherwise, in general each URI in a publication is mapped to an actual static on-disk file.
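As a rough illustration of the dynamic-feed exception, here is a minimal sketch of pulling start= and count= out of a query string. The function name feed_window, the default of 20 (taken from the “first twenty entries” remark) and the treatment of out-of-range values are all assumptions made for the example; this is not the code in mod_atom.c.

```c
#include <stdlib.h>
#include <string.h>

#define DEFAULT_COUNT 20  /* assumption: matches the "first twenty entries" of the static feed */

/* Hypothetical helper. Returns 1 if the query asks for a dynamically
 * generated feed (start= or count= present), 0 if the static file will do.
 * *start and *count always receive usable values. */
int feed_window(const char *query, long *start, long *count)
{
    int dynamic = 0;
    const char *p = query;

    *start = 0;
    *count = DEFAULT_COUNT;
    if (query == NULL)
        return 0;

    while (*p != '\0') {
        /* p always points at the beginning of a name=value pair here */
        if (strncmp(p, "start=", 6) == 0) {
            *start = strtol(p + 6, NULL, 10);
            dynamic = 1;
        } else if (strncmp(p, "count=", 6) == 0) {
            *count = strtol(p + 6, NULL, 10);
            dynamic = 1;
        }
        p = strchr(p, '&');   /* move to the next pair, if any */
        if (p == NULL)
            break;
        p++;
    }
    if (*start < 0)
        *start = 0;               /* assumption: clamp nonsense values */
    if (*count < 1)
        *count = DEFAULT_COUNT;
    return dynamic;
}
```

A request with no query string returns 0 here, which corresponds to handing the URI back to the static-file path.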
Use the Source, Luke! · The main body of code is reasonably well commented, and there are several page-fulls of explanation at the top of mod_atom.c, which are really essential to understanding the details.
Where Stuff Goes · Suppose there’s a publication with its URI space rooted at /pubs/tim and the corresponding directory space rooted at /a4/app/tim.
The URI space for a pub has two subtrees, beginning pub/ and atom/. In fact, they are backed by a single directory structure (beginning /pub).
mod_atom only allows POST, PUT, and DELETE requests (as used in AtomPub) to be routed to the URI space beginning with atom/.
The goal is to keep your security setup simple. With a <Location> directive, you can apply rigorous TLS+authent security to the atom/ subtree, while leaving the pub/ subtree wide-open.
The actual subtree layout is described at exhaustive length in the comments at the top of mod_atom.c. The software has knowledge of this directory layout wired in at a deep level. Changing it much would probably be quite expensive in terms of refactoring/recoding cost.
The “Extras” PUT Playground · To maintain a publication, you need more than just the Entry and Media files that AtomPub makes it easy to do CRUD on. You need ancillary files; CSS and JavaScript and chunks of random XML and logo-ware and so on.
To support this, the top-level pub/ directory has an x/ (for “extras”) subdirectory, which starts out populated with templates/ and css/ and js/ and options/ directories.
mod_atom allows unrestricted PUT and DELETE to anywhere under atom/x/; the idea is that you can use it as a sandbox (probably with the same access-control that applies to the whole atom/ subtree) for your ancillary files.
It allows unrestricted creation and deletion of directories, with automatic recursion as required.
ID Elements · Every Atom feed and entry has to have a unique ID. mod_atom uses the urn:uuid style. The directory structure is such that there’s only one feed per directory, which also contains a file named id. (note the dot) that contains only the feed’s unique identifier.
This exists so that when you re-create a collection you don’t have to read and parse the existing collection to find the ID string.
Element Updates and the Iterator · When you create a new resource with POST, it goes into a directory whose pathname encodes the current date along YYYY/MM/DD lines. Feeds are generated by a directory iterator (see iterator.c) that walks backward through these YYYY/MM/DD directories, returning entries one at a time.
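To make the date-based layout concrete, here is a small sketch, not taken from iterator.c. It assumes the directory names are zero-padded (2008/11/05 rather than 2008/11/5), which is what makes a plain reverse string sort equivalent to reverse chronological order. The names date_dir and newest_first are made up for the example.

```c
#include <stdio.h>
#include <stdlib.h>   /* qsort, used with newest_first */
#include <string.h>
#include <time.h>

/* Hypothetical helper: format "<root>/YYYY/MM/DD" for the given date.
 * Returns what snprintf returns. */
int date_dir(char *buf, size_t size, const char *root, const struct tm *when)
{
    return snprintf(buf, size, "%s/%04d/%02d/%02d", root,
                    when->tm_year + 1900, when->tm_mon + 1, when->tm_mday);
}

/* qsort comparator over an array of "YYYY/MM/DD" strings: newest first.
 * Works only because every field is zero-padded to a fixed width. */
int newest_first(const void *a, const void *b)
{
    const char *const *left = a;
    const char *const *right = b;
    return strcmp(*right, *left);
}
```

Sorting a list of day directories with qsort(days, n, sizeof days[0], newest_first) then yields the order in which a backward-walking iterator would visit them.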
When you update a resource with PUT, it creates a new “Link” file with an uninteresting name in the current (that is, date of the PUT) YYYY/MM/DD directory, which contains only the actual full pathname of the file that was updated. The iterator uses these to generate a feed which is still reverse chronologically ordered in the face of updates.
When an entry is updated multiple times, or updated then deleted, there’s no effort made to remove obsolete link files. These are detected in later iterator scans and removed on a side-effect basis, making the Link/PUT system self-maintaining.
Object-Orientation, or Not · The Apache code manages to have a fairly object-oriented feel even though it’s in C, because the key routines all expect to take a request_rec as an argument. mod_atom has three important structures: pub_t and entry_t, packages of information about the things implied by their names. These are read and written in a fairly ad-hoc way and mostly exist to keep function argument lists under control; so despite appearances, the design of mod_atom really isn’t very O-O at all. There’s a very linear per-request control flow starting at the beginning and proceeding through to the ending.
I wonder whether, if I were re-doing this now, knowing how it’s done, I’d refactor to have seriously Object-like constructs for Entry and Pub and Media. I don’t think it would help that much, but at the same time I’m sure that if I were re-writing it in Java or Ruby, I wouldn’t think twice. I wonder if there’s a lesson here?
Generating HTML · mod_atom can operate either as a pure “Atom Store”, performing CRUD only on Atom Entries and Collections; or, it can operate as a blog publishing system, generating HTML and public-facing Atom Feeds and so on. HTML publishing is turned on by the existence of a resource named pub/x/options/html.
HTML and Tidy · When mod_atom receives a POSTed or PUT entry, and HTML is to be generated, the content, in particular the “text constructs” (see RFC4287) need to be parsed, so that sanitization rules can be applied. For those that are marked as XHTML, this is trivially accomplished using Apache’s built-in parser.
Text that comes in marked with type="html" is a tougher nut to crack. I found two plausible candidates for HTML parsing: TidyLib and libxml2. My research did not find anything online that suggested either was qualitatively better than the other, so I chose TidyLib on the basis that the library you link to is an order of magnitude smaller.
Since Apache’s XML object model is different from Tidy’s, this means there is replicated sanitization and persistence code; oh well.
XML In · Apache has expat compiled in, and there’s a call that makes it easy to point it at either the client request or file and get a tree structure, both of which mod_atom uses. Over the years, I’ve done relatively little with trees, preferring stream parsers, for robustness. It turns out that Apache already has settings to set a hard limit on the size of XML object it’ll try to parse, so I figured it would be OK to allow it to parse individual Atom entries into a tree.
Apache’s XML support has a whole bunch of special-purpose connections to DAV, but you can mostly ignore them.
The object model itself is kind of nice; it forces you to walk through linked lists of adjacent text chunks, but aside from that I found that it pretty well got out of the way.
I eventually created a separate function bundle in children.c to capture a few repeated patterns of child-walking to pick up this or that.
Tidy’s isn’t as slick, but it didn’t cause any pain. I’ve already complained about the interface you have to use for a memory allocator. Whatever.
There’s one instance of fairly horrible code that makes me think I’m doing something wrong: when I get some media bits POSTed, I have to cook up a Media Link Entry. So I hand-construct an XML tree for the purpose; see make_mle_shell. Blecch.
XML Out · mod_atom needs to write quite a bit of XML. Apache does have an apr_xml_to_text call, but there was a pretty severe impedance mismatch with my needs. To start with, I wanted to pump stuff into the XML output straight from the program, without having to wire it into a tree. Second, pumping out the XML as you generate was more natural to code (and also probably a memory-saver).
So I used my own Genx library for the purpose. It has the virtue of being small-ish and pretty well-tested, in production here & there around the Net. It makes it really difficult to generate output that’s not well-formed.
It has a few problems, too. The API was a little kludgy; enough that there’s a chunk of code in genx_glue.c that automates common cases and makes code more readable.
Also, since it generates not only well-formed but canonical XML, you’ve got a real problem when you find something like &nbsp; in incoming HTML-encoded data. As a result, there’s actually a table of all the XHTML 1.1 entity names in the code, and mod_atom turns all those into the actual Unicode character values in the output. And, as a side-effect, all Text Constructs that come in marked type="html" go out marked type="xhtml". Which I think isn’t actually harmful.
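The entity table idea can be sketched in a few lines. This is only an illustration with a four-entry excerpt; the struct name, the function entity_codepoint and the choice of a linear scan are assumptions, and the real table is said to cover every XHTML 1.1 entity name.

```c
#include <string.h>

struct entity {
    const char *name;        /* entity name without the '&' and ';' */
    unsigned int codepoint;  /* Unicode scalar value it stands for */
};

/* A tiny excerpt for illustration only. */
static const struct entity ENTITIES[] = {
    { "nbsp",   0x00A0 },
    { "copy",   0x00A9 },
    { "eacute", 0x00E9 },
    { "mdash",  0x2014 },
};

/* Hypothetical lookup: returns the code point for name[0..len), or 0 if
 * the name is not in the table. A linear scan keeps the sketch short. */
unsigned int entity_codepoint(const char *name, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof ENTITIES / sizeof ENTITIES[0]; i++) {
        if (strlen(ENTITIES[i].name) == len &&
            memcmp(ENTITIES[i].name, name, len) == 0)
            return ENTITIES[i].codepoint;
    }
    return 0;
}
```

The caller would then write the returned code point to the output as an ordinary character, which is why no named entities survive into the canonical XML.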
Error Reporting · There are two kinds of errors; client protocol errors (e.g. an Entry without a title) which should be reported back to the client in enough detail that there’s a chance of fixing them, and internal errors (e.g. “Can’t open file”). In each case, the error should probably be recorded in the apache logfile.
Apache doesn’t make it particularly easy to send a body back to the client with an error return code. mod_dav uses a fairly strained-looking technique, and at the moment mod_atom does really nothing useful, just returning the best-available HTTP status code with no explanation.
Wiring in mod_dav style error bodies is on the to-do list. This should probably be done anywhere you see an instance of return HTTP_<X> where X is anything but INTERNAL_SERVER_ERROR, NOT_FOUND, or NO_CONTENT.
Also, the error-reporting structure inside mod_atom is kind of ad-hoc and maybe needs refactoring. Some returns return an OK or an HTTP error code; others return an apr_status_t, and still others return a char *; NULL on success, otherwise an error message. There may actually be a case for having both these options, but I’m not sure the optimal choices have been made.
Finally, when you get an error in the HTML-generation phase, which doesn’t start until the Atom Store work is done, this does not cause the client to see an HTTP error (although it is sent to the server logfile). Is this correct?
Startup · When httpd starts up, it invokes mod_atom to process each AtomPub directive in the config file. There are two directives; here’s a sample of each:
AtomPub /joe /Users/joe/Public/blog "Joe’s Blog" "Joe Smith"
AtomMetaPub /blogs /var/blogs "Default title" "Default user"
In each case, the first argument is the root in URI-space, the second the root in filespace. AtomPub declares a single pub, AtomMetaPub a facility for managing pubs (more on that later).
mod_atom checks to see if a blog specified by the directive exists, and if it doesn’t, creates an empty directory framework, and empty feed files, from scratch.
Apache startup is in itself a little weird. It runs through the config-processing process twice, once to check whether the file is correct and the directive processing doesn’t blow up, the second time “for real”. At the moment, mod_atom just does all its initialization twice in a row, not trying to avoid doing anything twice. This doesn’t seem to cause any problems.
Pub CRUD · A publication declared with AtomMetaPub (let’s call it a meta-pub) has an Atompub Service Doc and talks the Atompub protocol, but new publications (let’s call them sub-pubs) are created, updated and deleted as a side-effect. (Let’s just say “meta” and “sub” like in the comments.)
Thus, the collection feed for a meta constitutes a directory of subs.
Suppose the meta is rooted at /foo. Then there will be 100 subdirectories, s00 through s99. When a new sub is created, it gets some sort of short name via regular Slug processing (say submarine), and one of the subdirectories (say s37) is chosen at random. Then the sub’s root is /foo/s37/submarine/.
To help discovery, in the meta’s feed, the entry will contain a link like so:
<link rel="publication" href="/foo/s37/submarine/atom/pub.atomsvc" />
When you DELETE the entry in the meta collection that represents (in some sense) the sub, the whole sub is just deleted lock, stock, and barrel.
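The sub-pub placement described above comes down to a little string formatting. The following is a sketch under assumptions, not the module’s code: the names sub_root and random_bucket are invented, and rand() stands in for whatever random source mod_atom really uses.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: pick one of the 100 buckets s00..s99. */
int random_bucket(void)
{
    return rand() % 100;
}

/* Hypothetical helper: write "<meta_root>/sNN/<slug>/" into buf.
 * Returns 0 on success, -1 if the bucket is out of range or buf is too small. */
int sub_root(char *buf, size_t size, const char *meta_root,
             int bucket, const char *slug)
{
    int n;
    if (bucket < 0 || bucket > 99)
        return -1;
    n = snprintf(buf, size, "%s/s%02d/%s/", meta_root, bucket, slug);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

With meta_root "/foo", bucket 37 and slug "submarine" this produces /foo/s37/submarine/, the root used in the example link above.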
Templating · The HTML generation system is described in Autumnal mod_atom. I invented a templating system because I couldn’t find one that seemed a very good match for mod_atom’s needs. It’s straightforward if fairly tedious code; check out run_fp_template and run_fp_template_element. The only trick is that for the front-page template you need to interrupt generating the page in the middle of the template where it says “put the entries here”, so there’s a little bit of recursive shuck-and-jive with a state variable; the first time, as an adult programmer, that I really wished I’d had continuations.
Globals · Apache modules are by default constructed with the use of only one global variable, the “module” structure, which has pointers to structures describing how to handle that module’s directives and which event hooks the module would like to be invoked for.
Since mod_atom’s code extends across several C source files, this doesn’t work. However, we’d like to be careful not to pollute the global function namespace. So, everything that’s visible outside the scope of one source file has one of the prefixes “atom_”, “genx_”, or “Tidy”. This makes for some long and ugly function names.
In general an Apache module can’t have mutable global variables because Apache can run in threaded mode, requiring you to implement access control. mod_atom does have some global variables, and even writes to them, but only at initialization time, which is single-threaded; accesses are read-only at run-time.
atom_join · In the process of mapping between URI space and filesystem space, mod_atom does a tremendous amount of cutting and splicing of UNIX-style pathnames. The book-keeping that goes with this in order to keep the slashes in the right place is awkward and tedious, so the atom_join() function does this; it takes a variable number of arguments, the last of which must be NULL, and splices them together, making sure the segments are separated by exactly one slash.
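For readers who haven’t written a NULL-terminated variadic function lately, here is a minimal sketch of the technique. It is a stand-in, not atom_join() itself: the name join_path is hypothetical, and it allocates with malloc where an Apache module would presumably draw from a pool.

```c
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for atom_join(). Joins its arguments with exactly
 * one '/' at each seam. The argument list must end with (const char *)NULL.
 * The caller frees the result. Returns NULL if allocation fails. */
char *join_path(const char *first, ...)
{
    va_list ap;
    const char *seg;
    size_t total = 1;   /* room for the terminating NUL */
    size_t len = 0;
    char *out;

    /* First pass: an upper bound on the length (each segment plus a slash). */
    va_start(ap, first);
    for (seg = first; seg != NULL; seg = va_arg(ap, const char *))
        total += strlen(seg) + 1;
    va_end(ap);

    out = malloc(total);
    if (out == NULL)
        return NULL;
    out[0] = '\0';

    /* Second pass: append segments, normalizing the slashes at each seam. */
    va_start(ap, first);
    for (seg = first; seg != NULL; seg = va_arg(ap, const char *)) {
        size_t n;
        if (len > 0) {
            while (*seg == '/')                      /* drop leading slashes */
                seg++;
            while (len > 0 && out[len - 1] == '/')   /* drop trailing slashes */
                len--;
            out[len++] = '/';                        /* exactly one separator */
        }
        n = strlen(seg);
        memcpy(out + len, seg, n);
        len += n;
        out[len] = '\0';
    }
    va_end(ap);
    return out;
}
```

A call such as join_path("/pubs/tim/", "/atom/", "x", (const char *)NULL) yields /pubs/tim/atom/x. The explicit cast on the sentinel matters on platforms where a bare NULL may be passed as an int.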
Concurrency · It’s hard; let’s go shopping.
For individual resource creation, mod_atom pushes the concurrency issues down into the filesystem, which seems pretty bullet-proof.
There is exactly one place where it is necessary to mutex: When you’re doing a PUT to a resource, there’s a critical section while you compute its ETag and compare that to the If-Match request header. I think there’s no way around mutexing this; you have to lock others out while you compute the ETag, decide whether or not to take the update, and then maybe apply it.
This is accomplished with APR’s file-locking code, not on the resource itself but on a stub file of the same name with “.mutex” appended.
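The shape of that critical section can be sketched outside Apache. Note the substitution: this uses flock() from <sys/file.h> (available on Linux and the BSDs) in place of APR’s file-locking calls, and the names lock_resource, unlock_resource and may_update are invented for the example. Letting a PUT through when there is no If-Match header is also an assumption, not something the text above states.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/* Take an exclusive lock on "<resource>.mutex", creating the stub if needed.
 * Blocks until the lock is granted. Returns the locked fd, or -1 on error. */
int lock_resource(const char *resource)
{
    char stub[4096];
    int fd;
    int n = snprintf(stub, sizeof stub, "%s.mutex", resource);
    if (n < 0 || (size_t)n >= sizeof stub)
        return -1;
    fd = open(stub, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Release the lock taken by lock_resource(). */
void unlock_resource(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}

/* The decision made while the lock is held: 1 means apply the update,
 * 0 means the precondition failed. */
int may_update(const char *if_match, const char *current_etag)
{
    if (if_match == NULL)     /* assumption: no header means no precondition */
        return 1;
    return strcmp(if_match, "*") == 0 || strcmp(if_match, current_etag) == 0;
}
```

The intended sequence is: lock_resource(), compute the current ETag, call may_update(), write the new content if it says yes, then unlock_resource(). Everything between the lock and the unlock is the critical section.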
Sluggishness · Slug processing is pretty brutal. The text is de-%-escaped, which is supposed to give you UTF-8. Then any characters which aren’t legal UTF-8 or XML are silently crushed out. Then we remove leading and trailing dashes, and all characters which are not XML NameChars, or are ‘.’, are replaced with hyphens.
If there’s no slug, or there’s something horribly wrong with it, and it’s an Atom entry that’s being posted, we try to use the text from its atom:title element.
Failing that, filenames are just random numbers.
If you send multiple posts in succession with the same Slug, mod_atom tries appending random numbers, and if that doesn’t work after a couple of tries, gives up and declares an error.
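Here is a deliberately simplified sketch of the replace-and-trim part of slug clean-up. It assumes the text has already been %-unescaped, looks only at ASCII (bytes of 0x80 and above are passed through untouched rather than validated as UTF-8), and replaces first and trims afterwards. The name clean_slug and those choices are assumptions for the example, not a description of the real code.

```c
#include <ctype.h>
#include <string.h>

/* In-place, ASCII-only approximation. Among ASCII characters the XML
 * NameChars are letters, digits, '.', '-', '_' and ':'; '.' is replaced
 * as well, as the text says. */
void clean_slug(char *slug)
{
    char *p;
    size_t len, lead;

    for (p = slug; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        int keep = c >= 0x80 || isalnum(c) || c == '-' || c == '_' || c == ':';
        if (!keep)
            *p = '-';
    }

    /* Trim trailing hyphens. */
    len = strlen(slug);
    while (len > 0 && slug[len - 1] == '-')
        slug[--len] = '\0';

    /* Trim leading hyphens by sliding the remainder (and its NUL) down. */
    lead = strspn(slug, "-");
    memmove(slug, slug + lead, len - lead + 1);
}
```

An empty result is the “something horribly wrong” case, where the caller would fall back to the title or to a random number.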
Security · In a typical publication, there are parts that you want to expose to the world and parts that require access control. There are a variety of Apache Directives that you can use to support access control. mod_atom enforces two run-time policies to aid in managing your security setup:
POST, PUT, and DELETE operations are rejected unless their target URI has /atom/ immediately after the publication’s root.
GET operations are also rejected unless the /atom/ path-step occurs similarly in the URI.
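The first of these policies is easy to express as a prefix test. This sketch covers only the POST/PUT/DELETE rule; the function names are hypothetical, and it assumes the publication root is given without a trailing slash (as in /joe).

```c
#include <string.h>

/* 1 if the method is one of the AtomPub write methods. */
int is_write_method(const char *method)
{
    return strcmp(method, "POST") == 0 ||
           strcmp(method, "PUT") == 0 ||
           strcmp(method, "DELETE") == 0;
}

/* 1 if uri has "/atom/" immediately after pub_root. */
int under_atom(const char *pub_root, const char *uri)
{
    size_t n = strlen(pub_root);
    if (strncmp(uri, pub_root, n) != 0)
        return 0;
    return strncmp(uri + n, "/atom/", 6) == 0;
}

/* 1 if a write request must be rejected under the first policy. */
int reject_write(const char *method, const char *pub_root, const char *uri)
{
    return is_write_method(method) && !under_atom(pub_root, uri);
}
```

Requiring the trailing slash in "/atom/" keeps a look-alike such as /joe/atomic/ from slipping through.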
Let’s assume the same customization discussed above:
AtomPub /joe /Users/joe/Public/blog "Joe’s Blog" "Joe Smith"
AtomMetaPub /blogs /var/blogs "Default title" "Default user"
Requiring basic authentication for access to the /atom/-based parts of the URI space could be done with Apache configuration directives as follows:
<Location /joe/atom>
AuthType Basic
AuthName JoeBlog-admin
AuthUserFile /var/blog-admin/passwd
Require user joe
</Location>
<LocationMatch /blogs/s[0-9][0-9]>
AuthType Basic
AuthName Blog-Publishers
AuthUserFile /var/blog-admin/passwd
Require valid-user
</LocationMatch>
This could be done in a similar fashion using <Directory> directives if you wanted to work from file paths as opposed to URIs.
Supporting app:draft · If an entry is POSTed with app:draft set, an Atom entry is created and the collection, but neither the public-facing feed nor the HTML version, is updated to show it. Should it eventually be PUT with app:draft removed, the public-facing HTML and feeds are created.
Extensions and entry/content@type · When the HTML version of an entry is generated, its extension depends on the value of the type attribute of its atom:content element. If it’s absent or text, a .text file is created. If it’s html or xhtml, an .html file is created.
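As a sketch, the rule above is a three-way mapping. The function name content_extension is made up, and returning NULL for any other type value is an assumption, since the text only covers the absent, text, html and xhtml cases.

```c
#include <string.h>

/* Hypothetical helper: map atom:content/@type to a file extension.
 * Pass NULL when the attribute is absent. */
const char *content_extension(const char *type)
{
    if (type == NULL || strcmp(type, "text") == 0)
        return ".text";
    if (strcmp(type, "html") == 0 || strcmp(type, "xhtml") == 0)
        return ".html";
    return NULL;   /* other types: not covered by the description above */
}
```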
Lazy Feed Generation · Whenever an entry is created, updated, or deleted, a file named timestamp is updated in the root directory of its collection. When mod_atom is handling a GET request for the collection file, or the public-facing feed file or front page files which are generated from it, the timestamp of the collection file is compared against that of the timestamp file, and if the collection file is older, all three (collection, feed, and index.html) are regenerated prior to serving the request.
This removes some tricky race-condition scenarios and also makes heavy update streams more survivable.
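The staleness test behind lazy generation can be sketched with stat(). The names is_stale and needs_regen are hypothetical, and so are the choices about missing files: no timestamp file is taken to mean nothing has changed, and no collection file is taken to mean it must be generated.

```c
#include <sys/stat.h>
#include <time.h>

/* 1 if the generated file is older than the last recorded change.
 * With one-second mtime resolution, a change landing in the same second
 * as a regeneration would go unnoticed by this strict comparison. */
int is_stale(time_t generated_mtime, time_t stamp_mtime)
{
    return generated_mtime < stamp_mtime;
}

/* 1 if the collection (and what is derived from it) should be rebuilt
 * before the GET is served, 0 otherwise. */
int needs_regen(const char *collection_path, const char *timestamp_path)
{
    struct stat coll, stamp;
    if (stat(timestamp_path, &stamp) != 0)
        return 0;   /* assumption: no timestamp file, nothing has changed */
    if (stat(collection_path, &coll) != 0)
        return 1;   /* assumption: no collection file yet, so generate it */
    return is_stale(coll.st_mtime, stamp.st_mtime);
}
```

Because writers only touch the timestamp file, a burst of updates costs one regeneration on the next read rather than one per update.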
To Do ·
DAV-style user_error() reporting.
Allow PUT with app:draft to un-publish an entry. (This might work, but I haven’t tested it.)