Recently in this space I complained that XML is too hard for programmers. That article got Slashdotted and was subsequently read by over thirty thousand people; I got a lot of feedback, quite a bit of it intelligent and thought-provoking. This note will argue that XML doesn't suck, and discuss some of the issues around the difficulties encountered by programmers.
XML Doesn't Suck · This is going to be fun. XML first saw the light of day in November 1996 and between then and 1999 or so I spent most of my time trying to convince people that XML was a good idea and they should use it. In recent years, though, I've done much less XML-related work, due to my role at Antarctica, and what I have done has focused on corner cases and weird interactions, due to my participation in the W3C TAG. So it's going to be a refreshing change to get on the old familiar pulpit and preach the XML gospel a bit.
I'm doing this because quite a few people reacted to my article by saying "See, a co-inventor of XML admits that it sucks, like I've been saying all along." (Many of them obviously hadn't read the article, but anyhow). Some of the XML-sucks arguments were:
Bah. Sticks and stones, etc. Let's look at some of XML's chief virtues, then I'll address some of the XML-sucks arguments, in the same spirit that Sammy Sosa addresses a fastball.
XML Has Internationalization Pretty Well Nailed · Sometime in the last few years, native speakers of English became a minority of Net users, and I'm quite certain they're a minority among users of computers in general. Up until the late nineties, I suspect that the vast majority of application writers basically didn't understand i18n issues (i18n is an abbreviation for "internationalization") and didn't care, and many didn't think they needed to. Those that did care often thought they could get away with hacks like switching Microsoft Code Pages or using the much-loathed ISO 2022.
XML, I think, gets a lot of the credit for changing that. In XML, there's no ambiguity - a document is a sequence of characters, and characters are numbers, and the numbers mean what Unicode/ISO10646 says they mean. There are lots of different ways to store those numbers as bytes in data files, but XML forces you to say which one you're using right up front. Larry Wall said it best: “An XML document knows what encoding it's in.”
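To make the "knows what encoding it's in" point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The sample text and the helper names `encode_document` and `read_greeting` are made up for illustration. The same characters are stored as three different byte sequences, and in each case the parser works out how to decode them from the document itself, with no out-of-band hint.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text containing a non-ASCII character.
BODY = "<greeting>café</greeting>"


def encode_document(encoding: str) -> bytes:
    """Serialize BODY with an XML declaration that names its own encoding."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    return (declaration + BODY).encode(encoding)


def read_greeting(raw: bytes) -> str:
    """Parse raw bytes; the decoder is chosen from the document, not by the caller."""
    return ET.fromstring(raw).text


for encoding in ("utf-8", "iso-8859-1", "utf-16"):
    raw = encode_document(encoding)
    print(encoding, len(raw), "bytes ->", read_greeting(raw))
```

The byte counts differ from one encoding to the next, but the characters that come out of the parser are the same each time.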
Basically, XML doesn't let you get away with ignoring the issues. While there is some ongoing tinkering with XML's i18n facilities, that's mostly because Unicode/10646 itself has been changing.
If I had to pick the biggest contribution XML has made to the world, this would be it - forcing people to learn the issues and start doing the right thing.
XML Can Represent Pretty Well Anything · I don't need to expand on this very much except to note that XML has been used to represent, without loss of information, algebra, bibles, computer programs, database records, email, filings to regulators, GIS data, human-resource data, iTunes collections, journal entries, KR data, logic, manuals, network maps, ontologies, purchase orders, queries against databases, remote procedure calls, schemas, transactions against commerce servers, update logs, vector graphics, winecellar inventories, XXX movie metadata, yearly calendars, and Zen koans. OK, I don't know for sure about the koans.
That's a lot of syntaxes that didn't have to be invented.
XML Forces Syntax-Level Interoperability · “Interoperability” has been a mantra since I've been in this business, and has been really hard to achieve. For a long time, the industry labored under the illusion that if we could all agree on One Big API then we'd have interoperability. Examples have included Posix, X11, OLE/COM/DCOM, CORBA, DCE, OpenDoc, and the list goes on; and it's never, ever, ever worked (with the single exception of the “sockets” library for IP networking). The only way to achieve interoperability at the software interface level is for there to be exactly one implementation - for example Perl or Linux. Which is not what we originally had in mind.
Syntax can be, and has been, interoperable. The definitions of the telephone network, the Internet, email, and the Web are all bits-on-the-wire definitions of what you send back and forth, and they've all worked well enough to change the world. XML provides a nice set of syntax rules that you can stick in the face of a recalcitrant vendor and say “you claim to be interoperable? Well, ship me some XML then.” And these days, they can't say no, and this is good for everyone.
This belief that bits-on-the-wire is more important than data structures or APIs is at the center of my world-view, and there's another long ongoing rant on the subject.
XML Supports Constructive Finger-Pointing · Shit happens. Particularly in networked computer systems. And when it happens, you need to figure out who has to fix things. XML, because of its inflexible, anal, Draconian syntax and error-handling rules, is a big help.
When somebody sends me something that's advertised to be XML, the first thing I do is run xmlwf (James Clark's expat parser) on it and then open it up in Internet Explorer. (They never disagree, but I do both anyhow). If they tell me the XML is broken, I call up the data source and say “Expat and IE both say your XML is broken.”, and every time, they say “Oops” and fix the problem.
And (blush) at least once I've sent XML off to someone and got the same phone call and had to fix the problem.
This is a Good Thing.
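If you don't have xmlwf handy, roughly the same check can be sketched with Python's `xml.parsers.expat` module, which wraps the expat library. This is an illustration, not a replacement for xmlwf; the function name `check_well_formed` and the sample inputs are assumptions of mine.

```python
from typing import Optional
from xml.parsers import expat


def check_well_formed(data: bytes) -> Optional[str]:
    """Return None if data is well-formed XML, else a description of the first error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk of input
    except expat.ExpatError as err:
        reason = expat.ErrorString(err.code)
        return f"line {err.lineno}, column {err.offset}: {reason}"
    return None


# Hypothetical inputs: the second one never closes <b>.
for sample in (b"<a><b/></a>", b"<a><b></a>"):
    problem = check_well_formed(sample)
    print(sample, "->", "well-formed" if problem is None else problem)
```

The parser stops at the first error and says where it is, which is exactly what you want to have in hand when you make that phone call.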
XML Confers Longevity · When I'm doing a standup speech, I often ask: “Everyone in the audience who thinks they're going to be using the same word processor in ten years, raise your hand.” No hands go up. “Everyone who has data around that's going to have value in ten years?” After a minute's thought, every hand goes up. The lesson is clear: information outlives technology.
And yet, as of today, too much of our intellectual heritage is tied up in fragile, proprietary, binary word processor files. This sucks. XML is the solution.
Enough About XML's Virtues, What About the Complaints? · Now let's address some of the specifics raised by the “XML Sucks” crowd, who by the way have several of their own websites, which I find kind of cool.
XML is Verbose · Given the fact that an increasing proportion of all Internet/Web traffic is multimedia (audio/photos/video/ring-tones), I'm pretty sure that the overhead due to encoding the textual part in XML is going to vanish in the static. On this website alone, the bandwidth I burn is dominated by pictures, even though lots of the entries don't have any.
Anyhow, XML compresses beautifully, and in most cases the payback in terms of interoperability is more than enough to make up for the verbosity. And if you have an application where you just can't afford the bandwidth, don't use XML.
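The compression claim is easy to try for yourself. Here is a rough sketch with Python's standard `zlib` module. The order records are invented and deliberately repetitive, so treat the ratio you see as illustrative rather than as a benchmark; real documents will vary.

```python
import zlib


def sample_document(records: int) -> bytes:
    """Build a made-up, repetitive XML document of the kind data feeds tend to produce."""
    rows = "".join(
        f'<order id="{i}"><customer>Customer {i}</customer>'
        f'<total currency="USD">{i * 10}.00</total></order>'
        for i in range(records)
    )
    return f'<?xml version="1.0" encoding="utf-8"?><orders>{rows}</orders>'.encode("utf-8")


raw = sample_document(1000)
packed = zlib.compress(raw, 9)  # 9 is the highest standard compression level
print(f"{len(raw)} bytes raw, {len(packed)} bytes compressed")
```

The repeated tag names are exactly the kind of redundancy that general-purpose compressors are good at squeezing out.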
XML Does What S-Expressions and CSV Already Could · Except that neither of those ever made a real attempt to get serious about internationalization. And Comma-Separated was aimed totally at database tuples.
As for S-Expressions, I can see the arguments, and can't honestly tell you why the same technologists who ignored decades of S-Expression lore instantly took up XML. It's crystal-clear that you could have used S-Expression syntax for XML and it all would have worked about as well.
Maybe it's because S-Expressions were too closely identified with the tattered dreams of the AI community? Or maybe just because XML's compulsory end-tags make it a little easier to read?
XML Has Both Elements and Attributes, Why? · When I first learned about SGML, XML's predecessor (this would be in 1987) I had the same reaction, and single-handedly coerced the markup of the Oxford English Dictionary online text into an attribute-free style that lasted some years.
Today I observe empirically that people who write markup languages like having elements and attributes, and I feel nervous about telling people what they should and shouldn't like. Also, I have one argument by example that I think is incredibly powerful, a show-stopper:
<a href="http://www.w3.org/">the W3C</a>
This just seems like an elegantly simple and expressive way to encode an anchored one-way hyperlink, and I would resent any syntax that forced me to write it differently.
Mixed Content Sucks · For those who aren't XML pedants and don't know what mixed content is, skip this section. Once again, I'm going to grant the theoretical force of this argument but argue from practice (people like to use this) and, unanswerably again I think, by example:
<p>Recent news about XML may be found at <a href="http://www.w3.org/">the W3C</a>.</p>
Isn't it great, the way the hyperlink just nestles into the text, even allowing you to exclude the “.” at the end of the sentence?
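For programmers wondering what that paragraph looks like after parsing, here is a small sketch using Python's standard `xml.etree.ElementTree`. This is one library's model of mixed content, not the only one: ElementTree keeps the text before the first child in `.text` and the text that follows a child in that child's `.tail`.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p>Recent news about XML may be found at '
    '<a href="http://www.w3.org/">the W3C</a>.</p>'
)
a = p.find("a")

print(repr(p.text))           # the text before the link
print(repr(a.get("href")))    # where the link points
print(repr(a.text))           # the link text itself
print(repr(a.tail))           # the "." that follows the link, still inside <p>
print("".join(p.itertext()))  # the whole sentence, markup stripped
```

The period ends up in the tail of the `a` element, outside the link, which is just where the markup put it.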
XML is Both a Tree and a Sequence · Well, I have news for you, data is often both a tree and a sequence. This may not fit neatly into the programming paradigm you're currently practising, but it's the way life is.
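Both views are available from the same bytes. The sketch below reads one small, invented document twice with Python's standard `xml.etree.ElementTree`: once as a flat sequence of start and end events in document order, and once as a tree you navigate by parent and child.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this example.
DOC = (
    b"<book>"
    b"<chapter><title>One</title><para>First.</para></chapter>"
    b"<chapter><title>Two</title></chapter>"
    b"</book>"
)

# Sequence view: start/end events arrive in document order.
events = [
    (event, elem.tag)
    for event, elem in ET.iterparse(io.BytesIO(DOC), events=("start", "end"))
]

# Tree view: navigate from parent to children.
root = ET.fromstring(DOC)
titles = [chapter.findtext("title") for chapter in root.findall("chapter")]

print(events[:4])
print(titles)
```

Which view you reach for depends on the job: streaming through a big file favors the sequence, picking out one chapter's title favors the tree.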
There Are Ugly Complex Standards Built on XML · Granted. I seriously dislike W3C XML Schemas and several other specifications that have been layered on XML. But I don't think you can blame XML for the things people build on top of it any more than you can blame English for being used to write Harlequin Romances and Schwarzenegger screenplays.
And you absolutely can, and many people do, build all sorts of useful stuff with XML while ignoring the layered complexities.
On the Issues XML Presents to Programmers · Finally, we get to the issues I was discussing in that previous article. I shall climb down from the pulpit because things are not nearly as clear-cut in this space.
First of all, I think that XML has made things locally more difficult but globally easier for programmers. Globally, because it enabled interop (see above) to a degree that we've not previously seen, and suddenly unlocked a lot of interesting data from a lot of valuable applications that didn't need to be rewritten or screen-scraped.
But let's face it, when you parse XML, you get a data structure that is kind of an ordered sequence and kind of a tree and kind of a hypertext. This maps well onto no known programming paradigm. If you're an object-oriented person, you can pretend that XML elements are serialized objects, and that works sometimes, and if you're a Perl hack, you can pretend that XML is an unusually-well-decorated text stream, and that works sometimes. But the impedance mismatch, I suggest, is just a fact of life, and the benefits we get (i18n, interop, and so on) make it worthwhile.
Having said that, the people who wrote to me directly, in other Weblogs, and to Slashdot made a few specific points that are worth highlighting.
libxml2 and JAXP seem to have lots of happy users.
At the End of the Day · When I sat down to figure out how to write ongoing (which I'd like to keep going for potentially a long time), XML was the only format worth thinking about seriously, and my whining was provoked by the fact that it took me a total of, oh, three days' programming to build a one-off weblog publishing system.
And I have no doubt that when I want to change the look and feel I'll be able to, whether it's tomorrow or in 2023.
That's good enough for me.
And let's end this up on a lighter note.