Last week, I sent an email to one of the XML standardization lists at the W3C; my first appearance in that conversation in quite a few years. This short piece, of interest only to XML obsessives, gives a bit of background.
The Problem · XML discriminates slightly against certain ethnic groups. You can use any old Unicode character in an XML document, but the set of characters you can use to name an element or attribute is restricted to the letters that were defined in Unicode ten years ago, when XML 1.0 was published. (Yes, that was a bone-headed error by the designers of XML.) Since Unicode keeps growing, that means that a hypothetical programmer working in Cherokee syllabics or in Amharic wouldn’t be able to use their working script in their tag and attribute names.
This is unfortunate. But not that unfortunate; remember, the users of the hypothetical programmer’s software could absolutely use their scripts in their own XML documents.
Solutions · The XML standardization community (a very small and overworked bunch of people) made a first attempt to solve this problem with XML 1.1. Unfortunately, that spec came with excess baggage, namely changed rules on what constitutes white-space, rammed through by IBM for the convenience of their mainframe customers. In any case, XML 1.1 has been widely ignored.
Now, the standardizers are trying once again with XML 1.0 (Fifth Edition). Basically, they’ve rewritten the rules governing the set of characters you can use to name elements and attributes. There is plenty of discussion by the always-authoritative James Clark in XML 1.0 5th edition, including a rich set of links for those who want more background.
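To make the rewritten rules concrete, here is a minimal sketch of a check for whether a character may start an element or attribute name. The code point ranges are transcribed from the `NameStartChar` production in the Fifth Edition grammar and should be verified against the spec before anyone relies on them; the names `NAME_START_RANGES` and `is_name_start_char` are made up for illustration and are not part of any library.

```python
# Code point ranges for NameStartChar, transcribed from the XML 1.0
# (Fifth Edition) grammar. Treat this table as an assumption to be
# checked against the spec, not as an authoritative copy.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6),
    (0xD8, 0xF6),
    (0xF8, 0x2FF),
    (0x370, 0x37D),
    (0x37F, 0x1FFF),     # includes Ethiopic and Cherokee blocks
    (0x200C, 0x200D),
    (0x2070, 0x218F),
    (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF),
    (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def is_name_start_char(ch: str) -> bool:
    """Return True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)


# A Cherokee syllable (U+13A0) and an Ethiopic syllable (U+1200)
# both fall inside the broad 0x37F-0x1FFF range.
cherokee_ok = is_name_start_char("\u13A0")
ethiopic_ok = is_name_start_char("\u1200")
# Digits may appear later in a name but may not start one.
digit_ok = is_name_start_char("1")
```

The design point is that the ranges are broad blocks of code points rather than an enumeration of the letters known to one particular Unicode version, so characters added to Unicode later tend to be covered without another revision of the spec.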
More Problems · Heretofore, I’d shut up. I didn’t think the Fifth Edition was a very cost-effective move, but I wasn’t in the room with the smart, overloaded people who were actually, you know, doing the work. But as James Clark points out, the change introduces an inconsistency between XML 1.0 and XML Namespaces 1.0, which is intolerable. They have to be either revised together or not at all. I understand that there may not be appetite or resources for such an effort. Sigh.
What I’d Like · I threw my hat in the ring years ago, with XML-SW, a proposed spec that includes XML 1.0, XML Namespaces, and the XML Information Set, but discards DTDs. There’s more discussion in XML-SW, a thought experiment, and Drop the <!DOCTYPE>.
If you’re going to go through the immense pain of revising the XML spec, focus on the real problems and do it all at once, which is what XML-SW does. I’d even volunteer cycles to work on such a thing. But I’d be astonished if it happened.
Comment feed for ongoing:
From: William Vambenepe (Oct 20 2008, at 13:12)
Interesting history on XML 1.1 which I wasn't aware of. Another example of IBM losing by winning in standards.
Last month, while referring to other standards (WSDM, WSRF, WS-RT), I wrote:
"IBM seems to have an ability to lose by winning: because they assign so many people to standards they wear out everybody else and at the end, they get the final document to be the way they want it (through the normal process, just by being relentless). But the specification is by then so over-engineered, so IBM-like in its approach and so late that it’s usually a Pyrrhic victory."
From: Thijs van der Vossen (Oct 20 2008, at 13:53)
Start work on the specs outside of the W3C in the same way as the WHATWG has done for HTML5. It has worked once, it may work again.
From: John Cowan (Oct 20 2008, at 15:18)
Allow me to kill off this fungus-like historical error before it spreads. As the prime mover of XML 1.1, I had my attention called to the desire of some IBMers, at least, to allow NEL as an alternative newline character. I accepted this idea with enthusiasm and pushed it through the XML Core WG.
Although I don't hold that excluding mainframe programmers from the full benefits of XML is similar in degree to excluding Ethiopians and Eritreans, I do like to point out that there would have been immense howls of invidious discrimination if Mac Classic line endings (bare CR) had been excluded. It would have created a situation in which well-formed XML was not well-formed plain text, which is intolerable.
As for James's complaints, I don't think it takes much prognosticatory ability to say that they will turn out to be well-founded and will be fixed. The WG is in the process of fixing Namespaces 1.0 anyhow.
From: Arnaud Le Hors (Oct 21 2008, at 11:17)
Tim,
I'm glad John Cowan indicated the claim about IBM is false; unfortunately, most people won't see his comment.
I have to say that I'm surprised you are spreading this kind of rumor and, as the primary representative for IBM on the XML Core WG at the time of XML 1.1, I'm quite offended.
The proposal to fix XML with regard to its handling of end of line characters was never highly controversial and certainly didn't get rammed through by IBM. I challenge you to prove that it was.
Besides, this alone hardly constitutes the reason XML 1.1 wasn't more widely adopted. If anything, I'd say that the backward incompatibility introduced by the exclusion of some characters allowed in XML 1.0 is higher on that list.
The biggest challenge comes from the impact *any* change to the XML character set has on other XML technologies, such as XML Schema. From that point of view, the proposed XML 1.0 5th edition is no better unfortunately.
From: Tim (Oct 21 2008, at 12:21)
Arnaud, I stand by my claim that the whitespace change in XML 1.1 was controversial, and that it was pushed through over significant opposition based on the argument that IBM mainframe programmers needed it. John Cowan reports above that the pressure came from him, not IBM. This is not how I remember it, but it was a long time ago. However, I agree with your opinion as to the reason why XML 1.1 wasn't more successful in the market.