XML is a bouncing, thriving five-year-old now, and yet I've been feeling unsatisfied with it lately, particularly in my capacity as a programmer.
During the process of setting up ongoing, for the first time in a year or more I wrote a bunch of code to process arbitrary incoming XML, and I found it irritating, time-consuming, and error-prone.
Some other recent data points:
Programming Baskets · Some more background. Serious programming these days more or less all falls into three baskets:
I think all of these communities are having more trouble than they really ought to with XML. Oddly enough, the problem isn't in writing the XML processor, which isn't that hard; look at the number that are out there. The difficulty is in using one.
An XML-Oriented Programming Language? · One response has been a suggestion that we need a language whose semantics and native data model are optimized for XML. That premise is silly on the face of it; here are two reasons why:
struct-centered worldview to O-O code+data encapsulation is really a move away from the tabular paradigm. You can embed SQL in most languages now, but normally you don't implement any serious business logic in it. If this hasn't happened after decades in the relational world, why would we expect it to happen in the XML world?
Life in the Scripting Basket · As regards XML, I've been living in the land of scripting generally and Perl specifically in recent times; the internals of the Antarctica runtime codebase are all C, the back end has Java and C++, but these all build and manage internal data structures that look nothing like XML, and the XML we generate is via the venerable printf()-plus-markup-escaping approach.
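For anyone who hasn't met the printf()-plus-markup-escaping approach, here is a minimal sketch of it in Perl. The xml_escape helper and the entry element are made up for illustration and are not taken from the Antarctica codebase; the only real requirement is that every value gets escaped exactly once on its way into the format string.

```perl
use strict;
use warnings;

# Escape the five characters that are special in XML text and
# attribute values. One pass, so nothing is escaped twice.
sub xml_escape {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                  '"' => '&quot;', "'" => '&apos;');
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Hypothetical record: the element and attribute names are invented.
my $line = sprintf(qq{<entry id="%s">%s</entry>},
                   xml_escape('a"1'), xml_escape('Fish & <Chips>'));
print "$line\n";
```

The markup lives in the format string and the data goes through the escaper, which is why this works without any XML-shaped data structure in the program.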
That leaves input data munging, which I do a lot of, and a lot of input data these days is XML. Now here's the dirty secret: most of it is machine-generated XML, and in most cases, I use the Perl regexp engine to read and process it. I've even gone to the length of writing a prefilter to glue together tags that got split across multiple lines, just so I could do the regexp trick.
The reasons are not complicated: if I use any of the Perl+XML machinery, it wants me either to let it read the whole thing and build a structure in memory, or go to a callback interface.
Since we're typically reading very large datasets, and typically looking at the vast majority of it, preloading it into a data structure would be impractical, not to say stupid. Thus we'd be forced to use parser callbacks of one kind or another, which is sufficiently non-idiomatic and awkward that I'd rather just live in regexp-land.
When I came to do ongoing, I decided as a matter of principle that the input had to be XML and had to be read with a real XML processor. Since, once again, I was going to be using every byte of every file, I decided that loading it all into an in-memory data structure so I could run through it in order was egregiously stupid, and went with callbacks. Which are irritating.
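A prefilter of that kind might look roughly like the sketch below. This is a guess at the shape, not the actual prefilter: glue_tags is a hypothetical name, and the bracket counting assumes machine-generated input with no literal '<' or '>' inside attribute values, comments, or CDATA sections.

```perl
use strict;
use warnings;

# Glue physical lines together until every '<' seen so far has its '>'.
sub glue_tags {
    my @lines = @_;
    my (@out, $pending);
    for my $line (@lines) {
        chomp $line;
        if (defined $pending) {
            $line =~ s/^\s+//;        # drop continuation-line indentation
            $pending .= " $line";
        } else {
            $pending = $line;
        }
        my $opens  = () = $pending =~ /</g;
        my $closes = () = $pending =~ />/g;
        next if $opens > $closes;     # a tag is still open; keep reading
        push @out, $pending;
        undef $pending;
    }
    push @out, $pending if defined $pending;   # unterminated tail, passed through
    return @out;
}

my @glued = glue_tags(qq{<img\n}, qq{   src="a.jpg"\n},
                      qq{   alt="x"/>\n}, qq{<p>hi</p>\n});

# With each tag on one line, the regexp trick becomes possible.
my @jpegs = map { /<img\s[^>]*\bsrc="([^"]*\.jpg)"/i ? $1 : () } @glued;
print "$_\n" for @glued;
```

Note that the regexp still depends on double quotes around the attribute value, which is exactly the kind of syntax detail a real XML processor would hide.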
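To make the irritation concrete, here is a minimal sketch of the callback style. It assumes the CPAN module XML::Parser is installed; the text does not say which parser ongoing actually uses, and the entry, title, and img element names are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser;   # CPAN module, assumed installed

# State shared between the handlers, set up before the parser runs.
my ($in_title, $title, $image_count) = (0, '', 0);

sub on_start {
    my ($expat, $element, %attr) = @_;
    $in_title = 1 if $element eq 'title';
    $image_count++ if $element eq 'img';
}

sub on_char {
    my ($expat, $text) = @_;
    # Text may arrive in several pieces, so append rather than assign.
    $title .= $text if $in_title;
}

sub on_end {
    my ($expat, $element) = @_;
    $in_title = 0 if $element eq 'title';
}

# Hypothetical entry, for illustration only.
my $xml = '<entry><title>Fish &amp; Chips</title>'
        . '<img src="a.jpg"/><img src="b.png"/></entry>';

my $parser = XML::Parser->new(
    Handlers => { Start => \&on_start, Char => \&on_char, End => \&on_end });
$parser->parse($xml);    # the handlers fire from inside this call

print "title=$title images=$image_count\n";
```

The control flow is inverted: the program's logic is scattered across three subroutines that communicate through shared variables, and the main line of the program is a single call to parse().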
The program that writes ongoing sets up for processing an entry by initializing a bunch of global state variables, unleashes the XML parser, and stands back. I've been writing Perl since 1993 or so and this just feels awkward and unnecessary. The canonical Perl program, in my idiom anyhow, looks something like:
    my ($state_var1, $state_var2) = (0, '');
    my (%collector1, $collector2);
    while (<STDIN>) {
        next if (/regexp-for-something-I-ignore/);
        if (/something-I'm interested-in/)
            { $state_var1 = &foo($1, $4, \%collector1); }
        elsif (/something-else/)
            { $state_var2 = &bar($_, $state_var1); }
        elsif (/yet another/)
            {
                $state_var1 = $state_var2 + $collector1{baz};
            }
        else { print; }
    }
This may feel primitive to the O-O heavies out there, but it's the way a lot of the Net is stitched together.
I'm not sure what the right solution to the XML awkwardness is in O-O land or close-to-the-metal-ville, but I'm pretty damn sure what I'd like to see in Scripting Village. By example:
    while (<STDIN>) {
        next if (X<meta>X);
        if (X<h1>|<h2>|<h3>|<h4>X)
            { $divert = 'head'; }
        elsif (X<img src="/^(.*\.jpg)$/i>X)
            { &proc_jpeg($1); }
        # and so on...
    }
The idea is that the element-ish and attribute-y syntax in regexps abstracts away all the XML syntax weirdness, ignoring line-breaks, attribute order, choice of quotemarks and so on. I've invented some Perl syntax off the top of my head, which is a highly dangerous thing to do, particularly in the fraught land of regexps, particularly since the Perloids are re-inventing all that right now in the Perl 6 project; so let's be clear that the above is not a serious syntax proposal. But essentially, I want to have my idiomatic regexp cake and eat my well-formed XML goodness too. Too much to ask?
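Something with the same shape can be approximated in today's Perl, though without the pretty syntax. The sketch below is one possible arrangement, not a proposal: it assumes the CPAN module XML::Parser is installed and uses its incremental parse_start/parse_more interface, so that the program keeps its line-at-a-time loop while a real XML processor deals with line breaks, attribute order, and quoting. The drain helper, the @queue buffer, and the sample document are all invented for illustration.

```perl
use strict;
use warnings;
use XML::Parser;   # CPAN module, assumed installed

my @queue;   # start tags recorded by the handler, drained by the loop below
my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $name, %attr) = @_;
        push @queue, { name => $name, attr => {%attr} };
    },
});

my ($divert, @jpegs) = ('');

# Hypothetical dispatch: the same branches as the wished-for loop.
sub drain {
    while (my $tag = shift @queue) {
        next if $tag->{name} eq 'meta';
        if ($tag->{name} =~ /^h[1-4]$/) {
            $divert = 'head';
        }
        elsif ($tag->{name} eq 'img'
               and ($tag->{attr}{src} // '') =~ /^(.*\.jpg)$/i) {
            push @jpegs, $1;     # stands in for &proc_jpeg($1)
        }
    }
}

my $xml = <<'XML';
<html><head><meta name="x" content="y"/></head>
<body><h2>Title</h2>
<img alt="pic"
     src='Photo.JPG'/>
<img src="logo.png"/></body></html>
XML

my $feeder = $parser->parse_start;      # incremental (non-blocking) parse
for my $line (split /^/, $xml) {        # stands in for while (<STDIN>)
    $feeder->parse_more($line);
    drain();
}
$feeder->parse_done;
drain();

print "divert=$divert jpegs=@jpegs\n";
```

The img tag here is split across two lines, puts alt before src, and uses single quotes, and the loop doesn't have to care. The callbacks are still there, but they do nothing except record events, so the logic reads top to bottom again.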
Out of the Scripting Basket · I suspect there are parallel proposals to be made for the people who live in the O-O and close-to-the-metal worlds, but they don't leap to the front of my mind. I will make one slightly-brave prediction though: I think that the stream-processing mode of reading and using XML is going to occupy a substantial part of the landscape no matter which basket you're living in; the costs of the alternatives are frequently going to be just too high.
So I think the key first step is to make XML stream processing idiomatic in as many programming languages as possible. Rumor has it that the .NET CLR is going the right way on this one, but I haven't been there.
I guess I ought to say in closing that even given the irritation which programmers encounter in dealing with XML, the benefits are sufficient that the current trend toward using it as the interchange format for more or less everything still seems sound. But we can make people's lives easier, I think.