This entry is specifically about a particular technical issue in the next-gen syndication-format design exercise, but more generally about the wonderful experience of getting a better understanding of complicated things. The lesson continues; see the update at the end.
Learning · I have been fortunate in that on a few occasions I have been privileged to be in the white-hot center of a group of people smarter than me working together to sort out some hard problems and figure out the right thing to do. The ones that stand out in my mind were the two years from 1987-89 working on the computerization of the Oxford English Dictionary at the University of Waterloo, and the period from mid-1996 to late 1997 when we hashed out the guts of XML.
When you’re in this situation, what happens is that you go in with educated opinions about things that you discover, by working through it with people, are uneducated or incomplete or wrong, and sometimes this happens several times in succession on the same issue. Since you’re having the same effect on other people’s perceptions, the effect is that the aggregate wisdom of the group grows. Sometimes, with the right group, it can grow really fast.
When it’s happening, it feels wonderful. It happened over the last couple of days on one particular issue in syndication markup, which I’ll now discuss; people who don’t care about the syntax of newsfeed syndication can check out at this point.
Escaping · Suppose I want to write a very brief weblog entry which says, in toto:
Patricia Barber's album Café Blue is wonderful.
Here’s the source code:
<p><a href="http://www.patriciabarber.com">Patricia
Barber</a>'s album <cite>Café Blue</cite> is
wonderful.</p>
When I was doing the code for ongoing, I observed that other people’s RSS feeds had nicely formatted HTML, so I looked at Jon Udell’s feed, and discovered that I’d have to generate this:
&lt;p>&lt;a href="http://www.patriciabarber.com">Patricia
Barber&lt;/a>'s album &lt;cite>Caf&amp;#233; Blue&lt;/cite> is
wonderful.&lt;/p>
Or this, which is exactly equivalent per XML:
<![CDATA[<p><a href="http://www.patriciabarber.com">Patricia
Barber</a>'s album <cite>Café Blue</cite> is
wonderful.</p>]]>
These techniques do what is called, in XML terms, escaping markup; hiding it so that the XML parser will pass it through without getting confused by the < and & characters and the tags and attributes they mark.
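To make the two escaping techniques concrete, here is a minimal Python sketch using only the standard library. The variable names are mine, and wrapping the content in a bare <description> element is a simplification of a real feed. It suggests that, after parsing, both forms hand back the same string of HTML and that the parser sees no child elements in either case.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# The HTML we want to carry inside the feed (illustrative sample)
html = ('<p><a href="http://www.patriciabarber.com">Patricia Barber</a>'
        "'s album <cite>Caf&#233; Blue</cite> is wonderful.</p>")

# Entity-style escaping: & becomes &amp; and < becomes &lt;
# (escape() also rewrites > as &gt;, which XML permits but does not require)
escaped = escape(html)
entity_feed = "<description>" + escaped + "</description>"

# CDATA-style escaping: the markup is wrapped rather than rewritten
cdata_feed = "<description><![CDATA[" + html + "]]></description>"

# In both cases the XML parser sees plain character data, not elements
entity_elem = ET.fromstring(entity_feed)
cdata_elem = ET.fromstring(cdata_feed)

print(entity_elem.text == cdata_elem.text == html)
print(len(entity_elem), len(cdata_elem))
```

Note that a CDATA section cannot itself contain the sequence ]]>, so the wrapping approach needs extra care for content that might include it.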
When I first saw this I couldn’t believe it, so I went and looked at the RSS 2.0 specification, which said that the <description> element could contain escaped HTML, but not a word about what this meant or what to do with it.
I found this practice kind of horrifying; I know that the ongoing software spits out well-formed XML (XHTML to be precise) and I could see no reason for this kind of obfuscation. I said so over on the very active Wiki page dedicated to this issue.
Getting our Terms Straight · One of the things making this discussion difficult is that people are slinging around the terms “encoding” and “quoting” to refer to escaping, which is confusing not only because “escaping” is the pedantically-correct term but because “encoding” in XML jargon means something else entirely, which however is not entirely unrelated to this issue. But I digress.
Chapter 1: Hard-Liner · I started out as something of a hard-liner: this practice is bogus, let’s stop it and just emit well-formed XML.
But a number of smart people piped up and explained that existing tools often emit “tag soup” HTML, with unbalanced tags and unquoted attributes and so on, and if you escape all that, then you can fit it into your syndication without irritating the XML parser that’s going to read it.
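The tag-soup argument can be illustrated with a small Python sketch. The sample markup and the helper name parses are hypothetical, made up for this illustration. Dropped inline, the soup makes the XML parser give up; escaped, it rides through as text and comes out unchanged.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical tag soup: unclosed <p> and <br>, unquoted attribute value
soup = "<p>Listen to <a href=http://www.patriciabarber.com>this</a><br>now"

def parses(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Inline, the soup breaks the feed; escaped, the feed stays well-formed
inline_ok = parses("<description>" + soup + "</description>")
escaped_ok = parses("<description>" + escape(soup) + "</description>")

# The parser undoes the escaping and returns the soup exactly as written
recovered = ET.fromstring(
    "<description>" + escape(soup) + "</description>").text

print(inline_ok, escaped_ok, recovered == soup)
```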
I argued that those were the bad old days; couldn’t we do better in future? That doesn’t work, because people might want to take their past five years of syndicated entries and package them up in the new syndication format.
Chapter 2: Moderate · OK, these seem like good arguments, so the situation is this: if your log-entry content is nice modern well-formed XML (probably XHTML, but whatever), you can just publish it like that.
Or, maybe it’s not, and you should signal that it should be unescaped before processing, say with an attribute like unescape="true".
Which seemed to me like a reasonable compromise.
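Here is one way a consumer might honor such a flag, as a Python sketch. The attribute name comes from the suggestion above, but the function name content_markup, the sample entries, and the use of a <content> element are assumptions for illustration, not part of any specification.

```python
import xml.etree.ElementTree as ET

def content_markup(elem):
    """Return an entry's markup as a string, honoring unescape="true"."""
    if elem.get("unescape") == "true":
        # The XML parser has already undone the escaping,
        # so the element's text is the original markup
        return elem.text or ""
    # Otherwise the content is real XML: rebuild it from the leading
    # text plus each child (tostring includes the child's tail text)
    parts = [elem.text or ""]
    for child in elem:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

# Escaped tag soup (note the unclosed <b>) flagged for unescaping
escaped_entry = ET.fromstring(
    '<content unescape="true">&lt;p>Patricia &lt;b>Barber&lt;/p></content>')

# Well-formed markup published inline, with no flag
inline_entry = ET.fromstring(
    "<content><p>Patricia <cite>Barber</cite></p></content>")

print(content_markup(escaped_entry))
print(content_markup(inline_entry))
```

Either way the caller ends up with a string it can hand to a renderer; the difference is only in who did the work to produce it.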
Sam Ruby posted on this subject with what I thought was a fairly compatible approach.
And the discussion on the Wiki raged on.
Chapter 3: Sadder But Wiser · Then Brent Simmons, author of the formidable NetNewsWire, the best RSS aggregator I’ve seen on any operating system, waded in, and made a point that’s obvious once you’ve heard it: when the aggregator gets to the content, it’s going to hand it off to an HTML renderer (or XHTML, or maybe in future an SVG renderer or whatever); the aggregator doesn’t want to run the XML parser through the content, it just wants to see a bucket of bits it can hand off to other software.
So from the aggregator’s point of view, escaping the content is always a win.
So the real situation is this: the interests of those who for one reason or another hand-author and hand-read syndication feeds are directly in conflict with those of the people who write the software that reads them.
Readability is a really important virtue, and one of the main reasons why XML has succeeded where a bunch of other universal-interchange frameworks didn’t. On the other hand, having a wonderfully readable language is not a win if it’s such a pain in the ass to write code for that nobody does.
Right now, I don’t know what the right answer is. Escaping stuff that starts out well-formed still strikes me as a really ugly kludge. But I can see that the answer isn’t a slam-dunk.
Conclusions · First, maybe by this time next week we’ll have some more information or insights that will make one of the alternatives look better. Second, the people taking part in this have a much deeper understanding of the problem of escaping syndication feed contents than they did a few days ago.
This is good.
Update: Never-Ending Story · Since I wrote this, Joe Gregorio pitched in on the Wiki page, pointing out that one useful thing an aggregator might want to do is strip out “dangerous” HTML tags like <script> and <object>, and this is going to be a lot easier if it’s well-formed and you can run an XML processor over it. Dare Obasanjo amplifies, talking about XSLT-generated feed views. These guys do want access to the XML in the <content> as XML.
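As a rough Python sketch of the stripping idea: when the content is well-formed, removing unwanted elements is a short tree walk. The function name strip_dangerous and the sample entry are hypothetical, the tag set holds only the two tags named above, and the sketch assumes un-namespaced element names (real XHTML would carry a namespace).

```python
import xml.etree.ElementTree as ET

# Only the two tags named above; a real list would surely be longer
DANGEROUS = {"script", "object"}

def strip_dangerous(elem):
    """Remove dangerous elements in place, at any depth, keeping the
    text that followed each removed element."""
    previous = None
    for child in list(elem):
        if child.tag in DANGEROUS:
            # Reattach the removed element's trailing text
            tail = child.tail or ""
            if previous is None:
                elem.text = (elem.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            elem.remove(child)
        else:
            strip_dangerous(child)
            previous = child

entry = ET.fromstring(
    '<content><p>Hi<script>alert(1)</script> there '
    '<object data="x"/>!</p></content>')
strip_dangerous(entry)
cleaned = ET.tostring(entry, encoding="unicode")
print(cleaned)
```

Doing the same job on escaped tag soup would first require an HTML parser forgiving enough to build a tree from it, which is the extra difficulty being pointed out.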
Jeepers, how many more levels deeper are we going on this one till we get to the bottom? This is fun! Stay tuned...