Anne van Kesteren suggests an XML 2.0 mostly defined by less-Draconian error handling, provoking further discussion chez Sam Ruby.
I was recently asked about this by Xavier Borderie in an interview currently appearing at Journal du Net. Since not all ongoing readers will be able to read my incredibly-polished French (well actually, Xavier translated my English, but I nit-picked the translation), I thought I should give the English version here:
Micah Dubinko asks “Is HTML on the Web a special case?”, and the answer is obviously “yes”. Note that the HTML language being developed by the WHATWG is not XML at all, and I'm not brave enough to predict whether that is a good idea.
There have always been a few tools that processed XML data but also accepted broken (non-XML) data; for example, every Web browser. It seems unlikely to me that there will ever be an official new release called “XML 2.0” that has different error-handling rules. But I'm sure that the arguments about when to apply real XML error handling and when software should accept non-XML data will go on forever; among other things they are quite entertaining.
There's a spectrum of situations: at one end, if an electronic-trading system receives an XML message for a transaction valued at €2,000,000, and there's a problem with a missing end tag, you do not want the system guessing what the message meant, you want to report an error. At the other end, if someone sends a blog post from their cellphone with a picture of a cute kitten, you don't want to reject it because there's an “&” in the wrong spot. The world is complicated.
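To make the two ends of that spectrum concrete, here is a small Python sketch (illustrative only; the payload and element names are invented): a conforming XML parser halts on the missing end tag, while an HTML-style tag-soup parser shrugs and carries on.

```python
# Illustrative only: the payload is invented, and the two parsers stand in
# for the two ends of the spectrum.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

broken = "<transaction><amount currency='EUR'>2000000</amount>"  # missing </transaction>

# Draconian end: a conforming XML processor reports the error and stops.
try:
    ET.fromstring(broken)
except ET.ParseError as err:
    print("XML parser refused the message:", err)

# Lenient end: an HTML-style parser keeps going and makes of it what it can.
class SoupParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("saw start tag:", tag)

SoupParser().feed(broken)  # no exception; the unclosed element is tolerated
```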
From: Mark (Jan 30 2007, at 18:58)
> if an electronic-trading system receives an XML message for a transaction valued at €2,000,000, and there's a problem with a missing end tag, you do not want the system guessing what the message meant
You have used this example, or variations of it, since 1997. I think I can finally express why it irritates me so much: you are conflating "non-draconian error handling" with "non-deterministic error handling". It is true that there are some non-draconian formats which do not define an error handling mechanism, and it is true that this leads to non-interoperable implementations, but it is not true that non-draconian error handling implies "the system has to guess." It is possible to specify a deterministic algorithm for graceful (non-draconian) error handling; this is one of the primary things WHATWG is attempting to do for HTML 5.
If any format (including an as-yet-unspecified format named "XML 2.0") allows the creation of a document that two clients can parse into incompatible representations, and both clients have an equal footing for claiming that their way is correct, then that format has a serious bug. Draconian error handling is one way to solve such a bug, but it is not the only way, and for 10 years you've been using an overly simplistic example that misleadingly claims otherwise.
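(As a toy illustration of that distinction, here is a Python sketch of one invented recovery rule -- close any still-open elements at end of input, innermost first. It is not the WHATWG algorithm, merely an existence proof that defined recovery need not mean guessing.)

```python
# A made-up recovery rule (not the WHATWG algorithm): elements still open at
# end of input are closed in reverse order of opening. Because the rule is
# fully specified, two independent implementations agree on the result --
# recovery without guessing. Toy code: it only handles plain start/end tags,
# not self-closing tags, comments, or CDATA.
import re

def recover(markup: str) -> str:
    stack = []
    for slash, name in re.findall(r"<(/?)([A-Za-z][\w.-]*)[^>]*>", markup):
        if slash:
            if stack and stack[-1] == name:
                stack.pop()          # matching end tag
        else:
            stack.append(name)       # newly opened element
    return markup + "".join("</%s>" % name for name in reversed(stack))

print(recover("<transaction><amount>2000000</amount>"))
# -> <transaction><amount>2000000</amount></transaction>
```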
From: Tim Bray (Jan 30 2007, at 19:39)
I'm prepared to believe that there are non-Draconian models which are nonetheless deterministic. The Draconian approach has the virtue of extreme simplicity, with a corresponding lack of corner cases and opportunities to screw up. I personally don't know of any document-error-handling procedures that are as reliably deterministic (not saying they don't exist, just that I don't know of any).
As for the example, I like it and think it's a valuable didactic aid. However, in future, when I have the space, I'll expand the argument to include your point that in the general case, what we're after is determinism. In the case of most network protocols, I'd probably still argue for the Draconian approach.
From: Stephen (Jan 30 2007, at 19:49)
Mark,
Tim's right, though. You have to reject that large financial transaction, because the missing end tag means that information is potentially incomplete.
What if the full stanza is meant to be: transfer €2,000,000 ... only if the client credit rating is AAA+?
The requirement for well-formed XML is precisely *why* it can be trusted as a data source. Think of the close tag as a checksum -- it says "that's it, I wasn't going to say anything else".
From: Joe (Jan 30 2007, at 20:14)
To be honest I've used that example myself, but have since strayed from the draconian camp.
The example is flawed in several other ways besides the one that Mark has pointed out.
The first is that it falsely implies that well-formedness is 'sufficient' for banking transactions.
"My bank account has been drained of €2,000,000"
"But sir, the request was well-formed"
The second flaw is by counter-example: how many billions of dollars of credit card transactions have been initiated from the tag soup pages of Amazon and eBay?
From: Aristotle Pagaltzis (Jan 30 2007, at 20:57)
No, Joe, it does not imply that a well-formed message is correct, any more than the fact that a program in a language with strict static type checking compiles implies that the program is free of bugs. However, I think we can all agree that this doesn’t mean compilers should be able to “recover” from errors by virtue of “defined error handling” and compile an erroneous program into something, “deterministic” though the error recovery process may be. Likewise, you don’t want the well-formedness of a financial transaction message to be your only criterion of its validity (hey, maybe the XML people were onto something when they specified well-formedness and validity as different from each other (although their vision of validity has turned out to be too limited and simultaneously too limiting)). Well-formedness would be a *necessary*, but not a *sufficient*, criterion for the validity of a transaction message.
And no, Mark, I don’t want “defined error handling” for such a message. I want Draconian handling. Halt and catch fire. Screeching brakes. Full stop. There are cases where you really do want that. There are cases when you don’t. There are cases where it’s half this, half that; cases, say, where you only want charset sniffing. There are cases where you want a pony. Not every document has the same grave impact on the world. If there is to ever be an XML 2.0 with some sort of error recovery defined, it will have to acknowledge the fact that sometimes you ride the pony and sometimes you groom it.
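(A minimal Python sketch of the layering described above -- well-formedness as the gate, application-level validity after it, with a failure at either level stopping the transaction cold. The required fields are invented for the example.)

```python
# Sketch of the layering: well-formedness first, application-level validity
# second, and a failure at either level stops the transaction. The required
# fields are invented for the example.
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = {"amount", "currency", "account", "credit-rating"}

def accept_transaction(message: str) -> ET.Element:
    root = ET.fromstring(message)        # raises ParseError if not well-formed
    missing = REQUIRED_FIELDS - {child.tag for child in root}
    if missing:                          # well-formed, but not valid for this application
        raise ValueError("well-formed but invalid: missing %s" % sorted(missing))
    return root
```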
From: Tim Bray (Jan 30 2007, at 21:01)
Joe, just because I assert that rejecting non-well-formed messages is desirable in some class of protocol does not imply that I would also assert that accepting them is as a consequence necessary.
And as for those high-volume B2C purchases, after they get scraped out of the HTML forms and validated all to hell, I bet they get stuffed into something rigid like EDI transactions for further transit. Snicker.
From: Mark (Jan 30 2007, at 21:28)
> The requirement for well-formed XML is precisely *why* it can be trusted as a data source.
"The idea that well-formedness-or-die will create a “culture of quality” on the Web is totally bogus. People will become extremely anal about their well-formedness and transfer their laziness to some other part of the system." http://lists.w3.org/Archives/Public/w3c-sgml-wg/1997May/0074.html
> I think we can all agree that this doesn’t mean compilers should be able to “recover” from errors by virtue of “defined error handling” and compile an erroneous program into something, “deterministic” though the error recovery process may be
Why on earth would you think that we could all agree about that?
From: Joe (Jan 30 2007, at 21:53)
"Joe, just because I assert that rejecting non-well-formed messages is desirable in some class of protocol..."
The problem with draconian error handling in XML 1.0 is that it does not make the distinction of 'some class of protocol'; XML 1.0 states that you MUST reject no matter what 'class of protocol' you're running.
"after they get scraped out of the HTML forms and validated all to hell"
I don't care what happens after that, the point, which you blithely jumped around, is that billions of dollars of transactions are initiated with x-www-form-urlencoded data POSTed from tag soup pages.
From: John Cowan (Jan 30 2007, at 22:09)
There was in fact a programming language developed at Cornell in the 1960s called CORC, which was designed for running student programs in a punched-card line-printer batch environment. Because turnaround times between program runs were up to 24 hours, the compiler went to a lot of trouble to correct, rather than merely detect and reject, *all* syntax errors. For example, it examined all identifiers, and if any were used only once, it looked for sufficiently similar identifiers also appearing in the program and rewrote the offending identifiers.
What counts as sensible depends on the environment, just as Tim says.
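(For flavour, a rough Python approximation of that style of repair -- my guess at its shape, not a reconstruction of CORC: identifiers that occur exactly once are treated as probable typos and rewritten to the closest identifier that occurs more than once.)

```python
# A rough approximation of that style of repair (not the actual CORC
# algorithm): identifiers that occur exactly once are assumed to be typos
# and rewritten to the closest identifier that occurs more than once.
from collections import Counter
from difflib import get_close_matches

def repair_identifiers(identifiers):
    counts = Counter(identifiers)
    common = [name for name, n in counts.items() if n > 1]
    repaired = []
    for name in identifiers:
        if counts[name] == 1:
            match = get_close_matches(name, common, n=1, cutoff=0.6)
            repaired.append(match[0] if match else name)
        else:
            repaired.append(name)
    return repaired

print(repair_identifiers(["total", "count", "count", "cuont", "total"]))
# -> ['total', 'count', 'count', 'count', 'total']
```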
From: Rob Sayre (Jan 30 2007, at 23:23)
Tim, you seem to have mistaken whatever this is for a technical conversation. I believe Sam Ruby would call the name "XML 2.0" an "act of social violence":
http://intertwingly.net/blog/2006/02/23/Version-Numbers/
XML 2.0 is a good name if you're looking to write pithy blog posts and do ESR impressions. Otherwise, people might assume you are working on something boring, like a parser.
From: Aristotle Pagaltzis (Jan 30 2007, at 23:30)
Wow, I didn’t think I’d see a Pythonista defending TIMTOWTDI to a Perl hacker. :-)
From: Sam Ruby (Jan 31 2007, at 03:23)
> I think we can all agree that this doesn’t mean compilers should be able to “recover” from errors by virtue of “defined error handling” and compile an erroneous program into something, “deterministic” though the error recovery process may be
Implicit in that statement is the assumption that single bit transcription errors result in erroneous programs.
Often in languages like Perl, arbitrary "line noise" may be a valid program (less so with use strict, more so with regular expressions).
From: Ed Davies (Jan 31 2007, at 06:46)
If recovery from an error is deterministic, is it an error? If a tree falls in the wood but nobody hears it...
Seriously, what's the practical difference between adding deterministic error recovery and simply expanding the grammar? Is it just the way it's documented ("ideal" format + recovery methods vs the broader error recovered grammar) or is there something more to it?
From: Anne van Kesteren (Jan 31 2007, at 08:33)
Ed Davies, defining graceful error handling instead of expanding the grammar allows for extensibility.
From: Evan DiBiase (Jan 31 2007, at 09:23)
Furthermore, deterministic error handling might help tag soup situations work better, but it doesn't necessarily make them correct. If content is malformed, and that malformed content is being parsed with deterministic error handling, the result will fall into one of three categories:
1. The author of the content knew that the parser would handle the error in a certain way, and that certain way is what the author meant,
2. The author of the content didn't know that they were sending content with an error, but the result of handling the error resulted in something the author was okay with, or
3. The author of the content didn't know that they were sending content with an error, and the result of handling the error resulted in something the author was not okay with.
In the first case, the error isn't, exactly -- the author could have created well-formed content, but chose not to, for some reason.
In the second case, the deterministic error handling is more of a heuristic than anything else, and the content author got lucky. This is presumably the situation that web browsers handle with tag soup: they can do a good enough job enough of the time that content authors and users, in general, end up happy.
In the third case, the deterministic handling is still a heuristic, but the content author got screwed.
My point, after all of that, is that deterministic error handling is potentially helpful in ways that draconian error handling isn't (as indicated by the second case), but also opens itself up to problems that draconian error handling avoids (the third case).
From: Aristotle Pagaltzis (Jan 31 2007, at 10:33)
Sam:
Given high enough density of hashmarks almost anything is valid Perl, that is true… That uninteresting special case aside though, given strictures, most transcription errors will result in syntax errors. Without strictures, you are right, perl will swallow a lot of bitflips without complaint. So what does it say about the value or not of lax error handling for Perl that strictures are enabled for any piece of Perl I place in a file of its own? That any good Perl programmer considers their absence on any non-trivial code a tentative red flag?
Your point is well taken, but after some consideration it doesn’t seem to affect the issue at hand much.
John:
Interesting anecdote. Of course, that’s an environment with very slow edit-compile-debug cycles, and likely one where program complexity was limited by programming being done as an educational exercise, not to mention the dearth of machine resources at the time. Under such circumstances, accepting rare subtle bugs (when the spelling corrector guessed wrong) in exchange for salvaging many edit cycles that would otherwise have gone to waste was apparently worthwhile.
Mark:
In all seriousness, though, how would you sensibly recover from errors in program text? Say the compiler ran into a random syntax error, ultimately caused by a missing string-closing quote somewhere in the preceding text. How do you make something sensible of that? I can’t think of any way where tripping it and making it fall flat on its face wouldn’t be trivial.
In thinking about this I realised the actual crucial difference between program text and markup: there is no publisher/consumer divide during development, which makes it practical to run source code through a validator continuously and incidentally to the actual work being done. John’s anecdote corroborates this: in an environment with fast edit-compile-debug turnaround, the right place for such things as identifier spelling correctors is within the editor/IDE as a user-driven tool, not within the compiler as an automatic process.
This inevitably leads to the conclusion that we are writing markup today roughly the way we used to write programs back when punchcards were the only input channel.
I was initially surprised by this realisation, but in retrospect it’s not a particularly profound insight; not to mention it seems uncomfortably close to some camps’ “the tools will save us” stance that I consider misguided. Hmm. I’ve led myself to a strange place…
From: jgraham (Jan 31 2007, at 15:01)
So, from my point of view the problem with the current error handling in XML is that it is either undesirable, unnecessary, or insufficient. The case of a financial transaction is a prime example of where it is insufficient; merely having a well-formed message does not ensure that the information received is correct, so, whatever the XML error handling behavior, the application will have to layer all sorts of custom checks on the content of the message to ensure it is valid (in an application-specific sense).
Conversely, for an HTML document it is almost always unimportant whether the content is well-formed as long as the page is displayed as intended (which may be easily checked). Indeed the extra effort needed to consistently produce well-formed markup, combined with the added risk of an error creeping in and the lack of hard advantages, has effectively prevented the adoption of XHTML on the web.
Since the optimum error handling is application dependent and XML is supposed to be a multi-domain metaformat, it seems odd that the XML specification should try to impose strict error handling. Much better to have a well-defined error recovery strategy for those applications where such a thing would be a benefit and, for those applications where well-formedness of the message is considered an essential part of the validity checking, allow the "bozo-bit" to be read so the application can fail in an XML 1.0 style.
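(The "bozo-bit" is not hypothetical: Mark Pilgrim's Universal Feed Parser exposes exactly such a flag. A small sketch, assuming the third-party feedparser package, of letting the application rather than the parser decide how draconian to be.)

```python
# Sketch of reading the bozo bit (requires the third-party feedparser
# package): the parser recovers where it can, but records that the feed was
# ill-formed so the caller can choose to be draconian anyway.
import feedparser

def load_feed(url, strict=False):
    parsed = feedparser.parse(url)
    if strict and parsed.bozo:
        raise parsed.bozo_exception   # fail XML-1.0-style
    return parsed                     # otherwise, use the recovered result

# A browser-ish consumer might call load_feed(url); a bank-ish one,
# load_feed(url, strict=True).
```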
From: Sam Ruby (Jan 31 2007, at 15:39)
re: "I thinking about this I realised the actual crucial difference between program text and markup: there is no publisher/consumer divide during development, which makes it practical to run source code through a validator continuously and incidentally to the actual work being done."
That depends on the usage scenario. Documentation produced using DOCBOOK tends to be high-quality XML, and often makes use of a number of XML's advanced features, such as internal DTDs.
I personally would rather that DOCBOOK fail fast. But as Tim pointed out, I don't necessarily want that same behavior when posting a cat picture from a camera phone.
As I see it, we can go down two paths: let everybody define their own error recovery strategy (including rejection), or try to work together to define a common set of best practices for handling common errors, like these:
http://googlereader.blogspot.com/2005/12/xml-errors-in-feeds.html
In many (actually most) of these cases the WHATWG has considered similar scenarios and documented error recovery procedures for handling each. Whether these recovery scenarios apply equally in other contexts is something open for debate, but they are proven in at least one widely deployed context.
But even if nobody else is interested, this is something I plan to pursue, implement, and deploy.
Joe Gregorio's new comment system is based on the same implementation that I plan on continuing to evolve. Conceptually, there is no reason why his inevitable implementation of APP couldn't take advantage of this same logic.
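(For instance, one common feed error of the kind such lists describe -- undefined named entities like &nbsp; -- admits a simple shared recovery rule. A sketch of mine, not anyone's specified procedure: rewrite known HTML entity names as numeric character references, then hand the result to an ordinary conforming parser.)

```python
# My sketch of one shared recovery rule: undefined HTML entity names such as
# &nbsp; are rewritten as numeric character references before parsing.
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}

def fix_named_entities(text):
    def replace(match):
        name = match.group(1)
        if name in XML_BUILTINS or name not in name2codepoint:
            return match.group(0)                  # leave it as-is
        return "&#%d;" % name2codepoint[name]      # e.g. &nbsp; -> &#160;
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

doc = "<item><title>Fish&nbsp;&amp;&nbsp;Chips</title></item>"
print(ET.fromstring(fix_named_entities(doc)).find("title").text)
# -> Fish & Chips (the no-break spaces survive as characters)
```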
From: Jim (Jan 31 2007, at 19:19)
> "The idea that well-formedness-or-die will create a “culture of quality” on the Web is totally bogus. People will become extremely anal about their well-formedness and transfer their laziness to some other part of the system."
Be realistic - they are *already* being lazy in other parts of the system. Well-formedness-or-die merely means that there's one less place where complicated, buggy error handling is necessary. It doesn't shift the laziness elsewhere, because there is no part of the system in which laziness is not already commonplace.
Why do you assume a Law of Laziness Conservation? Eliminating it from one area doesn't compel people to be lazy elsewhere.
From: Ed Davies (Feb 01 2007, at 05:01)
Anne van Kesteren: "defining graceful error handling instead of expanding the grammer allows for extensibility."
So, a document which is, this week, invalid but is given an interpretation within the current semantics via silent error recovery will, next week, be valid in a new grammar with semantics defined by the extension? Ouch!
I'd suggest that graceful error handling actually works against extensibility because any string which was previously invalid but had determined error handling cannot now be used for an extension - unless the "error handling" was very simple - something like "must ignore" or similar.
(P.S., yes, my old, nearly-defunct web site serves XHTML as text/html - fixing that or switching to "pure" HTML is something I need to do before resurrecting the site.)
From: Morten Frederiksen (Feb 02 2007, at 01:04)
I fail to see why it would be better to have a specification plus specified deterministic error handling rather than a smaller specification.
While e.g. the ultra liberal feed parser is a useful tool, it is likely impossible to spec and reimplement from scratch. Compare with MS Office and Word95 special cases - it seems to me there is no difference.
Specifications are needed for interoperability, and smaller ones are better than large ones for multiple correct implementations.
From: Asbjørn Ulsberg (Feb 02 2007, at 06:24)
What I think is missing from XML today is choice: the choice to recover from errors in a well-described and deterministic way. What would help is if we had two profiles for XML, which could be defined in this so-called "XML 2.0". One profile is closer to today's XML 1.0 parsing rules and the other is closer to HTML 5.0's. Which profile a consumer is supposed to use can be defined in an 'xml:profile' attribute on the root node of the document or in a processing instruction. Software should default to the XML 1.0 parsing mode, but if the loose profile flag is turned on, it can choose to recover from errors if that improves the user experience, for instance in a web browser.
I'm pretty sure we need both parsing profiles. Both are valuable and neither of them can be discarded. So if we want to define an "XML 2.0", both of them need to be defined and used in the format in a clear and concise way.
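(A sketch of what that switch might look like, with everything in it hypothetical: the 'xml:profile' attribute is Asbjørn's proposal rather than a standard, and lxml's recovering parser merely stands in for a loose profile that an "XML 2.0" would have to specify properly.)

```python
# Everything here is hypothetical: 'xml:profile' is the proposed attribute,
# not a standard, and lxml's recover=True (third-party package) merely
# stands in for a loose profile that an "XML 2.0" would have to specify.
import re
from lxml import etree

def parse_with_profile(text):
    # Sniff the root start tag for the proposed attribute; strict is the default.
    start_tag = re.search(r"<[^!?/][^>]*>", text)
    loose = bool(start_tag and 'xml:profile="loose"' in start_tag.group(0))
    parser = etree.XMLParser(recover=True) if loose else etree.XMLParser()
    return etree.fromstring(text, parser)

parse_with_profile('<doc><p>fine</p></doc>')                    # strict, parses normally
parse_with_profile('<doc xml:profile="loose"><p>oops</doc>')    # loose, recovers
# parse_with_profile('<doc><p>oops</doc>') would raise XMLSyntaxError.
```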
From: Mark (Feb 02 2007, at 06:35)
> the ultra liberal feed parser is a useful tool, it is likely impossible to spec and reimplement from scratch
Tell that to the people who ported it to Java. And Ruby. And probably some other languages I don't know about. They ported the test cases, then coded to them.
> Specifications are needed for interoperability, and smaller are better than large for multiple correct implementations.
"Make things as simple as possible... but no simpler." (attrib. Einstein) XML's draconian error handling is too simple. We need something just a little bit more complex. And lots and lots of test cases.
From: Morten Frederiksen (Feb 02 2007, at 09:43)
Mark,
Fair enough, I didn't include test cases in my equation -- and obviously I didn't know about the other feedparser implementations.
However, I think my point regarding magnitude still stands. Even when you include the test cases -- as many as you want -- there will still be cases that aren't covered by a complex specification with exceptions etc. A simpler specification would require a smaller and finite amount of test cases.
And then of course there's the errors in the error handling...
From: Mark (Feb 02 2007, at 13:02)
> Fair enough, I didn't include test cases in my equation
I would posit that, for the vast majority of feed producers, feedvalidator.org *is* RSS (and Atom). People only read the relevant specs when they want to argue that the validator has a false positive (which has happened, and results in a new test) or a false negative (which has also happened, and also results in a new test). Around the time that RFC 4287 was published, Sam rearranged the tests by spec section: http://feedvalidator.org/testcases/atom/ This is why specs matter. The validator service lets morons be efficient morons, and the tests behind it let the assholes be efficient assholes. More on this in a minute.
> A simpler specification would require a smaller and finite amount of test cases.
The only thing with a "finite amount of test cases" is a dead fish wrapped in yesterday's newspaper.
On October 2, 2002, the service that is now hosted at feedvalidator.org came bundled with 262 tests. Today it has 1707. That ain't all Atom. To a large extent, the increase in tests parallels an increase in understanding of feed formats and feed delivery mechanisms. The world understands more about feeds in 2007 than it did in 2002, and much of that knowledge is embodied in the validator service.
If a group of people want to define an XML-ish format with robust, deterministic error handling, then they will charge ahead and do so. Some in that group will charge ahead to write tests and a validator, which (one would hope) will be available when the spec finally ships. And then they will spend the next 5-10 years refining the validator, and its tests, based on the world's collective understanding. It will take this long to refine the tests into something bordering on comprehensive *regardless of how simple the spec is* in the first place.
In short, you're asking the wrong question: "How can we reduce the number of tests that we would need to ship with the spec in order to feel like we had complete coverage?" That's a pernicious form of premature optimization. The tests you will actually need (and, hopefully, will actually *have*, 5 years from now) bear no relationship to the tests you can dream up now. True "simplicity" emerges over time, as the world's understanding grows and the format proves that it won't drown you in "gotchas" and unexpected interactions. XML is over 10 years old now. How many XML parsers still don't support RFC 3023? How many do support it if you only count the parts where XML is served as "application/xml"?
I was *really proud* of those 262 validator tests in 2002. But if you'd forked the validator on October 3rd, 2002, and never synced it, you'd have something less than worthless today. Did the tests rot? No; the world just got smarter.
From: Sam Ruby (Feb 02 2007, at 20:14)
> But if you'd forked the validator on October 3rd, 2002, and never synced it, you'd have something less than worthless today
Or if you forked it on February 2nd, 2004 (yes, exactly three years to the day), the same thing would be true.
To those who don't get the inside reference, don't worry about it.
From: Brett Zamir (Feb 09 2007, at 14:37)
Howdy,
Though I have a couple of questions not _directly_ related to the thread, since I think they are important questions, I thought I'd see if I could post them here to garner your respected opinion.
Given your part in the huge achievement of the interoperability XML brought, I wonder what you would think about what to my mind would surely be an even greater achievement of interoperability (and that is, needless to say, saying a lot given the gigantic importance of XML)--a world auxiliary language (for human language, that is).
I would be interested to hear your opinions in the context of such a language being given a sufficient opportunity to take hold via a representative global convocation (e.g., through the Inter-Parliamentary Union?) to decide the issue, after thorough consultation and with the stated willingness by governments to implement the eventual decision. That is, not merely through the hope that the de facto English will quickly win over all governments to start teaching English at an early age (or the blindness that mistakenly asserts that it already is sufficiently doing so to have English work now as a fully "world" language), nor through the hope that a constructed language such as Esperanto will sufficiently take root (at least solely) through grass-roots teaching and study, but rather in a forum which left the ultimate decision on the type of language (and the language itself) open for debate and which committed to fully support the decision. Surely, the tired and tiring response that the idea is too idealistic is not a sufficient answer for you, as many standards--including in the realm of human governance--have in fact been accomplished progressively over time (the world is no longer restricted to tribal government, for example, so we are not inherently incapable of greater unities).
The other question I have (and which may be a naive one) is about the practicality of representing XHTML in other languages such as Chinese. Given that XSL could today already be made to transform such a language into the English-coded XHTML, I am very curious whether work is being done on this and what it is, and if not, why not. It would seem that such an ability would have large implications in allowing many coders in the developing world to more easily "get started" and contribute back their talents rather than being relegated to being seen as mere recipients of technology developed elsewhere. Of course many in the developing world are already contributing to their own communities and back to the global community, but language is no doubt a very large barrier to entry, at least in getting more people started on the path to web design and programming.
And in a similar vein, how about representing accessible scripting languages like PHP in XML format (which could be transformed to their native textual format) for the sake of easier localization, not to mention possible conversion into other computer languages. I'm sorry if these latter questions are naive ones; if they are, I beg your indulgence, but I hope you can at least respond to the ideas even if my referencing of examples is inadequate.
Lastly, I'd like to ask if you have any thoughts about getting native XML databases (like Berkeley DB XML) to become more of a staple of client-side as well as server-side programming. I absolutely love the abilities the latter gives along with its use of XQuery, but wish that even more languages (like Javascript) could be made to tap into these to allow people to more easily develop and share XML applications that worked with larger and more complex storage of data on the client side (with the ability to interact with a central server) without expecting users to go out of their way to download and set up their own database server.
thank you kindly,
Brett Zamir