When Is it OK To Invent New Tags?

Tantek Çelik, smart Microsoft browser guy, is blogging from the big W3C meeting now going on in Boston. Among other things, he's mad because some W3C specifications are written not in HTML but in a completely different XML language called xmlspec, and that language has some tags that are a lot like HTML tags, so why don't we just use HTML tags? I'll address some of the historical background and specifics, but Tantek is pointing at a real important issue in the world of XML: when do you invent your own language, and when do you re-use someone else's? Warning: long, and loaded with markup design theory and obscure standards history.

The History · To the extent that xmlspec ever had a lead designer, that would be me, although I haven't gone there since 1998 or so, and there were major contributions especially from Michael Sperberg-McQueen, and from Dan Connolly, Eve Maler, Gavin Nicol, Lauren Wood, and James Clark.

In 1996, when we started to work on designing XML, to some extent we were reacting against the then-horrible state of HTML. Netscape, the market leader, was arrogantly ignoring the W3C processes and inventing their own tags (<BLINK>, <FONT>, blecch) and Microsoft was trying to claw their way in and prove they were real browser guys too with contributions such as <MARQUEE> (cue distant laughter).

The popular authoring packages, in particular Adobe's hideous "PageMill", spooned table trash into tag soup in a heroic effort to pretend that this was print, not the Web, and that you could actually micromanage where each letter went on the page. Tricks such as using cascaded sequential unclosed <DD> tags to control indentation were the norm.

Ted Nelson, in a keynote spot at the '98 Web Conference, brought the house down by saying that trying to fix HTML would be like trying to graft arms and legs onto hamburger.

A small number of the crowd who'd had experience with SGML were suggesting that we should introduce extensibility and deterministic parsing and error-handling and clean internationalization and so on to the Web. Usually, we were met with howls of derision and gales of laughter; obviously we were fusty old ISO bigots, the same people who'd tried to stuff OSI networking down everyone's throats and resisted TCP/IP.

The only way forward we could see was to cook up XML, and what do you know, today, everyone seems to think that making your HTML well-formed and maybe even valid might be a good idea, and that elements should be chosen in part based on what they say, rather than entirely to achieve visible effect, and so on. I.e., Ted Nelson was wrong, and we won.

Anyhow, in August 1996 Michael Sperberg-McQueen and I, as co-editors of XML, had to figure out how to write the spec. Our goals were:

There had to be a nice usable online HTML form.
There had to be a high-quality printed form.
It had to support a lot of machine-readable BNF productions.
It had to support term definition, and clearly-labeled references to defined terms and BNF nonterminals.

We never for a minute considered using HTML, simply because nobody at that point in history would have. So we stole some from TEI, some from HTML, and made up the rest. A key point is that we never worried that much about what the actual tags were, because we always assumed that they could be easily transformed into something else for publication. We ended up writing a lot of code in Perl and Java (me) and DSSSL (Jon Bosak) to do this, but when the original online and printed XML specs hit the street in late '96, they impressed a lot of people because they were hyperlinked to the max and also had a beautiful printed version. Now they look old-fashioned, but that's fine.

Another reason for inventing a new language was psychological: the whole point of XML is that you can invent your own languages, and XML's definition was proof that you could not only do that but put such a language to work.

Today, the choice would be tougher. If you wanted to write a lot of BNF, xmlspec would probably look better than HTML. xmlspec's <termdef> element is a bit slicker than the HTML machinery, and the <bibref> stuff is nice too. Still, you could write a perfectly OK W3C specification in just HTML, and some people do.

Re-use or Build? · This kind of choice is faced every day by anyone who decides they're going to use XML to solve their problem. At Antarctica, we use good old HTML for our technical documentation, but cheerfully invented a new XML dialect for client-server interchange, and another one for our configuration files.

The cost of inventing a new language is lower than you might think, because it turns out to be fairly easy to transform XML to meet whatever your business needs are. On the other hand, it's higher than you might think, because language design is hard and easy to get wrong.

Some people imagine a future with lots of smaller languages that are combined in instances, using namespaces, to achieve semantic richness; the only real concerted large-scale attempt I've seen to do this is the native XML formats in Microsoft's upcoming XML-savvy release of Office. But it ought to be possible.
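
To make the namespace idea concrete, here is a minimal sketch of one instance that combines two small languages. Both vocabularies and both namespace URIs are made up for illustration; nothing here comes from a real product. The point is only that a consumer can pick out the parts it understands by namespace and ignore the rest.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies in one instance: an "order" language
# (the default namespace) and a separate "address" language.
DOC = """<order xmlns="http://example.com/ns/order"
       xmlns:addr="http://example.com/ns/address">
  <item sku="A-17" qty="2"/>
  <item sku="B-03" qty="1"/>
  <addr:address>
    <addr:city>Vancouver</addr:city>
    <addr:country>CA</addr:country>
  </addr:address>
</order>"""

# Prefixes used in the queries below; they need not match the prefixes
# used in the document, only the namespace URIs have to agree.
NS = {
    "o": "http://example.com/ns/order",
    "a": "http://example.com/ns/address",
}

root = ET.fromstring(DOC)

# Code that only knows the order language reads the items...
skus = [item.get("sku") for item in root.findall("o:item", NS)]

# ...and code that only knows the address language reads the address.
city = root.findtext("a:address/a:city", namespaces=NS)

print(skus, city)
```

Parsing the mixture is the easy part. The hard part, as with any language design, is agreeing on what each of the small languages should say.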

One thing that has become obvious is that while it's not too hard to define an XML language for some application's import or export needs, it's brutally difficult and time-consuming to define cross-application languages of any breadth or depth. The good news is that those application-specific import/export languages are still damn useful, and often provide a better basis for data interchange than anything that's come before.

The Special Case of HTML-Like Languages · The situation gets particularly sticky when you want your stuff to be human-readable, but also carry around semantically-rich information. HTML is immensely successful as a payload format for human-readable information, but has never really worked very well at the semantic level.

My favorite compromise has been to author in a really semantically-rich language tailored to the problem, and then transform to HTML for publication; this way I can generate all sorts of weird HTML idioms (positioning, tables, colors, fonts, whatever) with the sole goal of making the content more readable and usable. If, at some subsequent date, a programmer wants to get at the semantically-rich part, send them the XML source. Everyone wins.
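
As a rough sketch of that author-then-transform approach, assume a tiny glossary vocabulary (the <glossary>, <entry>, <term>, and <meaning> elements are invented for this example) and a few lines of Python standing in for whatever transformation tool you prefer. The semantic source stays as it is; the HTML is a disposable by-product.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantically-rich source; the element names are invented.
SOURCE = """<glossary>
  <entry id="wf">
    <term>well-formed</term>
    <meaning>Obeys the syntax rules of XML.</meaning>
  </entry>
  <entry id="valid">
    <term>valid</term>
    <meaning>Well-formed, and also conforms to a DTD.</meaning>
  </entry>
</glossary>"""


def to_html(xml_text: str) -> str:
    """Map each <entry> to a <dt>/<dd> pair inside an HTML <dl>."""
    root = ET.fromstring(xml_text)
    dl = ET.Element("dl", {"class": "glossary"})
    for entry in root.findall("entry"):
        # Carry the entry's id over so other pages can link to the term.
        dt = ET.SubElement(dl, "dt", {"id": entry.get("id", "")})
        dt.text = entry.findtext("term", default="")
        dd = ET.SubElement(dl, "dd")
        dd.text = entry.findtext("meaning", default="")
    return ET.tostring(dl, encoding="unicode", method="html")


html = to_html(SOURCE)
print(html)
```

The publication side is free to change (a table instead of a definition list, say) without touching the source, and a programmer who wants the terms and their meanings reads SOURCE rather than scraping the HTML.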

Here in ongoing, I have enriched XHTML 1.1 with a few authoring-oriented elements: <letter> (serves as the root element since this isn't HTML), <finished /> (to allow making corrections without giving a note most-recent status), and <cat> (obvious).

RSS · Finally, there's RSS, which is meant in part for human consumption, and is explicitly intended to contain HTML. Probably, RSS could in principle have been done by just using XHTML elements and sticking a bunch of special class= values on them, but I think that would have been awkward, and it's perfectly OK that RSS has its own vocabulary.
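
For a feel of the difference, here is a hedged side-by-side sketch. The RSS fragment uses the familiar <item>, <title>, <link>, and <description> elements; the XHTML fragment is a purely hypothetical encoding of the same item with class= values, not something anyone has standardized. The URL is a placeholder.

```python
import xml.etree.ElementTree as ET

# An item in RSS's own vocabulary.
RSS_ITEM = """<item>
  <title>When Is it OK To Invent New Tags?</title>
  <link>http://example.com/post</link>
  <description>Re-use or build?</description>
</item>"""

# The same item in a hypothetical XHTML-plus-class encoding.
XHTML_ITEM = """<div class="item">
  <h2 class="title">When Is it OK To Invent New Tags?</h2>
  <a class="link" href="http://example.com/post">permalink</a>
  <p class="description">Re-use or build?</p>
</div>"""


def read_rss(text):
    """Element names say what each field is, so lookup is direct."""
    item = ET.fromstring(text)
    return {name: item.findtext(name) for name in ("title", "link", "description")}


def read_xhtml(text):
    """Every element must be inspected for a class value, and the
    link lives in an attribute rather than in element content."""
    div = ET.fromstring(text)
    found = {}
    for el in div.iter():
        cls = el.get("class")
        if cls == "link":
            found[cls] = el.get("href")
        elif cls in ("title", "description"):
            found[cls] = el.text
    return found


print(read_rss(RSS_ITEM))
print(read_xhtml(XHTML_ITEM))
```

Both readers recover the same three fields, but the second has to know a set of out-of-band conventions about which class values matter and where each value is stored, which is roughly the awkwardness I mean.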

So... · My bottom line is that you'd be stupid to go ahead and invent your own XML language without looking around for previous work in the area, but that there's nothing inherently wrong with going ahead and doing it.

March 06, 2003

By Tim Bray.
