Dave Walker
over at freeform goodness
catches me with my
XML pants, figuratively speaking, down.
I wrote
a piece
about leaving the W3C
TAG entitled (cleverly I thought) </TAG>.
Unfortunately that <
in the title caused all sorts of grief
and breakage, both here at ongoing and downstream in
the world of syndication and aggregation.
I can fix my own problems, but it’s deeper downstream; long term, the answer
is Atom.
Herewith some thoughts on good programming practices and the larger
problem.
[Update: A couple of notes on the “href problem.”]
Local Repair ·
ongoing is written in XML and processed by an XML
parser and a bunch of Perl code to produce both what you’re looking at and
the RSS (and soon Atom) feeds.
There’s a function called escape()
that turns < into &lt;, & into &amp;,
and so on.
The problem was, as I was writing the software, I stuck an
escape()
call in whenever it seemed necessary, without thinking
about the dataflow too much. Bad, bad Tim!
So just now, when I read the freeform goodness essay and went to
look for the breakage, the code was kind of ugly.
There was quite a bit of double-escaping going on, so what appeared as
&lt; in the input ended up as &amp;lt; in
the output.
This was showing up as “<” here at ongoing, but
(maddeningly) as &lt; in the RSS aggregator display.
Please, we need Atom.
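To make the failure concrete, here is a minimal sketch in Python rather than the Perl that ongoing actually runs. Python's standard html.escape() stands in for my escape() function, which is an assumption; the real one may differ in detail.

```python
from html import escape, unescape

title = "</TAG>"          # the text as the author means it

once = escape(title)      # '&lt;/TAG&gt;'
twice = escape(once)      # '&amp;lt;/TAG&amp;gt;' (the double-escaping bug)

# A consumer that decodes exactly one layer recovers the title
# only from the single-escaped form.
assert unescape(once) == title
assert unescape(twice) == once   # the reader sees a literal '&lt;/TAG&gt;'
```

Nothing in the second call is wrong on its own; the bug is that nobody knew the first one had already happened.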
By the way, at Antarctica we had quite a few similar problems, with things surprising us by turning up either unescaped or doubly-escaped.
Getting the Policy Right · I think that software designers have to look at their application dataflows and get the policy right. Here’s a picture:
The policy ideally should be, I think, that all data in the Your Code block has to be known to be escaped or known to be unescaped. That is to say, you always do escaping on the data at the pointy end of the input arrows, or you never do it.
I think always-unescaped is a little better, since some of those output arrows might not be XML or HTML, but probably they all are; so always-escaped is certainly viable.
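A rough sketch of the always-unescaped policy, in Python for illustration. The function names (read_input, to_html, to_plain) are hypothetical, and it assumes the input arrives entity-escaped; if an XML parser is doing the reading, it has already done the decoding for you.

```python
from html import escape, unescape

def read_input(raw: str) -> str:
    """Input arrow: decode entities once, so everything inside is plain text."""
    return unescape(raw)

def to_html(text: str) -> str:
    """XML/HTML output arrow: the only place escaping happens."""
    return "<h1>" + escape(text, quote=False) + "</h1>"

def to_plain(text: str) -> str:
    """Non-markup output arrow: nothing to undo, since nothing was escaped."""
    return text

internal = read_input("&lt;/TAG&gt; &amp; more")
html_out = to_html(internal)    # '<h1>&lt;/TAG&gt; &amp; more</h1>'
plain_out = to_plain(internal)  # '</TAG> & more'
```

The plain-text arrow is where this policy pays off: with always-escaped internals it would have to unescape on the way out.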
Now, in a small, constrained publishing system like here at ongoing, this is achievable. It’s tougher in a big professional multi-user system where there are a lot of input arrows and you don’t control them. Which gives us a third choice: accompany every piece of text in your program with a little boolean metadatum saying whether it’s escaped or not. Quite a bit more work; but maybe the only choice for a big-league system.
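The third choice might look something like the following sketch. The Text class and its method names are made up for illustration, and Python's html.escape()/html.unescape() stand in for whatever escaping routines a real system would use.

```python
from dataclasses import dataclass
from html import escape, unescape

@dataclass(frozen=True)
class Text:
    value: str
    escaped: bool   # the "little boolean metadatum"

    def as_escaped(self) -> "Text":
        # Escape only if it has not been done already, so a second
        # call can never double-escape.
        if self.escaped:
            return self
        return Text(escape(self.value, quote=False), True)

    def as_plain(self) -> "Text":
        if not self.escaped:
            return self
        return Text(unescape(self.value), False)

title = Text("</TAG>", escaped=False)
safe = title.as_escaped().as_escaped()   # value is '&lt;/TAG&gt;', escaped once
```

The flag only helps if every input arrow sets it honestly, which is the hard part in a system you don't control.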
For what it’s worth, I’m now reworking ongoing to make the internal data always-unescaped.
Later: Couldn’t quite manage that, since the first paragraph is stashed away in a persistent database, and contains markup, so is a mixture of unescaped real markup and escaped magic characters in content. So it’s never ever simple.
The Output ·
Once your internal text is in a deterministic state, you have a chance of
generating the correct output.
For XML and HTML, you single-escape, and don’t forget to escape quotation
marks as well as <
and &
or you’re going to
get attribute-value breakage.
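Here is a small Python sketch of the attribute-value breakage. The sample title and the parses() helper are invented for the example; html.escape() and ElementTree are from the standard library.

```python
from html import escape
import xml.etree.ElementTree as ET

title = 'Say "hi" & wave'

# Escaping only < and & leaves the quotation marks to end the attribute early.
bad = '<a title="' + escape(title, quote=False) + '">x</a>'
# quote=True also turns " into &quot; (and ' into &#x27;).
good = '<a title="' + escape(title, quote=True) + '">x</a>'

def parses(markup: str) -> bool:
    """Report whether the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

assert not parses(bad)
assert ET.fromstring(good).get("title") == title
```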
For Atom, you single-escape and set the mode=
attribute and
you’re good to go.
For RSS it’s tougher; lots of people single-escape their HTML and assume it will get executed, which means that you have to double-escape any markup that you don’t want executed. But even so, implementations vary.
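A sketch of what that double-escaping amounts to, assuming the common convention that an aggregator XML-decodes the description once and then renders the result as HTML. The rss_description() helper is hypothetical, and as noted, real aggregators don't all behave this way.

```python
from html import escape
import xml.etree.ElementTree as ET

def rss_description(html_fragment: str) -> str:
    """Single-escape an HTML fragment so it can sit inside <description>."""
    return "<description>" + escape(html_fragment, quote=False) + "</description>"

literal = "</TAG>"   # text to be displayed, not executed as markup

# First escape: make the literal safe inside the HTML fragment.
fragment = "<p>I wrote a piece called " + escape(literal, quote=False) + "</p>"
# Second escape: the whole fragment, already-escaped literal included.
element = rss_description(fragment)

# The <p> tags end up escaped once and the literal twice.
assert "&lt;p&gt;" in element
assert "&amp;lt;/TAG&amp;gt;" in element
# One layer of XML decoding hands the aggregator back the HTML fragment.
assert ET.fromstring(element).text == fragment
```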
And there’s still one nasty sharp-fanged viper lurking in the bushes...
The “href” Problem · There are lots of URIs out there that look like this:
http://example.com/select?y=1999&m=Jan
Well, that’s an &
, right, and everyone knows that those
have to be escaped in well-formed XML, right? So it should look like this in
your HTML:
http://example.com/select?y=1999&amp;m=Jan
Well, yes, but... some browsers have been known to react poorly to this,
probably depending on whether you serve your stuff as text/html
or application/xhtml+xml.
Hrumph. When I figure out the right solution to that one, I’ll let you
know.
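For the XML side of the story at least, the escaped form does round-trip. A quick Python check, with a made-up link text, showing that a conforming XML parser hands back the original URI; it says nothing about what any particular browser does with text/html.

```python
from html import escape
import xml.etree.ElementTree as ET

url = "http://example.com/select?y=1999&m=Jan"

# Escape the URI like any other attribute value.
link = '<a href="' + escape(url, quote=True) + '">January 1999</a>'
# link is '<a href="http://example.com/select?y=1999&amp;m=Jan">January 1999</a>'

# The parser decodes &amp; back to &, so the href is the URI we started with.
assert ET.fromstring(link).get("href") == url
```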
Update:
Julian Reschke writes to tell me that
whereas he’s heard lots of talk about this “problem,” he’s never heard of
anyone getting bitten. Come to think of it, neither have I. And Nik Clayton
writes to point out that you can usually (but not always) use ;
instead of &
for this kind of URI.
Conclusion · OK, I think it’s now right. NetNewsWire is obstinately refusing to show the last character of that </TAG> article, even though it’s irritatingly double-escaped. Brent is no fool. I rest my case; this really needs fixing in the spec.