This fragment is mostly a note to myself and a placeholder, but it might prove useful to someone slashing through the XML undergrowth with bleeding-edge Ruby. Briefly: I revived my “RX” Ruby tokenizer (see here, here, and here) to contribute to Antonio Cangiano’s proposed Ruby benchmark suite, which I think is a Really Good Idea. I had a bit of pain getting the code to run on both Ruby 1.8 and 1.9, and then when I tried sanity-checking the output by comparing it to REXML on 1.9, REXML blew chunks. There are, apparently, issues about REXML and 1.9. Read on for details in the unlikely event that you care about any of this.
Benchmarking ·
There’s a problem: there are now a lot of plausible-looking Ruby implementations (MRI, YARV, JRuby, Rubinius, IronRuby, MagLev) and it would be nice to compare their performance. I was talking to some of the implementers about this and someone (Charles Nutter, I think) said “Problem is, there’s this huge gap between running fib() and running Rails.”
So, for example, how do we find out how fast MagLev will run Rails, without going through all the pain of making MagLev run Rails?
Antonio Cangiano sensibly proposed Let’s create a Ruby Benchmark Suite; when Avi Bryant told me he’d tried my RX code on MagLev, it occurred to me that it might be an interesting benchmark.
RX refresher: It’s a pure automaton-based XML tokenizer whose performance is totally dependent on the efficiency of dereferencing integer arrays, and it turns out that mainstream Ruby really sucks at this.
To make it a little more competitive with REXML, the de-facto standard Ruby parser, I had kludged it all over the place with regex preprocessing to cut down on the array traffic.
So I asked Antonio whether it would be interesting for his suite if I de-optimized RX to make it a pure array benchmark; he said yes, so I did.
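To give a flavor of what that pure array benchmark spends its time on, the hot loop of a table-driven tokenizer looks roughly like the following. This is a minimal sketch, not the actual RX code; the states and the tiny transition table are invented for illustration:
# Each input byte costs one integer-array dereference to find the next
# state, so array-indexing speed is everything.
NSTATES = 4
TRANSITIONS = Array.new(NSTATES) { Array.new(256, 0) }
TRANSITIONS[0][60] = 1  # byte 60 is '<': start of a tag
TRANSITIONS[1][62] = 0  # byte 62 is '>': end of the tag

def toy_tokenize(bytes)
  state = 0
  bytes.each do |b|
    state = TRANSITIONS[state][b]
    # a real tokenizer would emit tokens on particular transitions here
  end
end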
1.8.6 vs. 1.9 · Perhaps the single most visible difference between today’s Ruby and tomorrow’s is in the low-level string-handling API. Well, an XML parser lives entirely right there, so boy did I ever learn all about it. I had previously converted RX to run on 2006-vintage YARV, but I wanted one version of the code that would run in both 1.8.6 and 1.9. Sigh.
Here’s one of the detail issues, to give a feeling for the problems. Suppose you know that your input stream is in UTF-8 and you’ve read a buffer-full of data and you want to turn it into Unicode integer characters for the parser. The problem is that the buffer might end in the middle of a multi-byte UTF-8 character. Easy enough, a glance at the last byte will diagnose that. The problem is, how do you pull out the unsigned-integer value of the last byte of a buffer, without processing through the whole (potentially large) buffer, with code that runs in both Rubies?
I poked around on IRC and Eric Hodel managed to improve on my original suggestion. Read it and weep:
# Return the unsigned integer value of the i'th byte of s, in both 1.8 and
# 1.9: slice out a one-byte substring and unpack it as an unsigned char.
def byte_at(s, i)
  s[i, 1].unpack('C')[0]
end
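In use, the “glance at the last byte” check described above looks something like this. It’s a sketch only, and the buffer handling in RX itself differs, but it builds on byte_at:
# Decide whether buf ends in the middle of a multi-byte UTF-8 character,
# looking only at the last few bytes rather than scanning the whole buffer.
def ends_mid_character?(buf)
  last = byte_at(buf, buf.length - 1)
  return false if last < 0x80   # ASCII: ends on a character boundary
  i = buf.length - 1
  i -= 1 while i > 0 && (byte_at(buf, i) & 0xC0) == 0x80  # skip continuation bytes
  lead = byte_at(buf, i)
  expected = if    (lead & 0xE0) == 0xC0 then 2  # 110xxxxx: two-byte sequence
             elsif (lead & 0xF0) == 0xE0 then 3  # 1110xxxx: three-byte sequence
             elsif (lead & 0xF8) == 0xF0 then 4  # 11110xxx: four-byte sequence
             else  1                             # malformed; call it complete
             end
  buf.length - i < expected     # fewer bytes present than the lead byte promises
end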
REXML Ouch ·
RX has a primitive unit-test suite; what I do to sanity-check it at a high level is feed a nontrivial XML doc to both it and REXML and check that they find the same number of elements, PIs, paragraphs, img elements with a src= attribute whose value ends in .jpg, and occurrences of the word “the” in running text.
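The REXML side of that check, in DOM mode, is roughly the following. It’s a sketch with an illustrative file name, not the actual test harness:
require 'rexml/document'

doc = REXML::Document.new(File.read('ongoing-sample.xml'))

# count every element in the document
element_count = 0
REXML::XPath.each(doc, '//*') { element_count += 1 }

# count img elements whose src attribute ends in .jpg
jpg_imgs = REXML::XPath.match(doc, '//img[@src]').select { |img|
  img.attributes['src'] =~ /\.jpg\z/
}.size

puts "elements: #{element_count}, .jpg images: #{jpg_imgs}"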
Well, when I finally got it running in Ruby 1.9, and started the sanity check, REXML blew up on my document, 2.8 Meg of the input to this blog.
With a bit of poking around, I ascertained that:
REXML blew up differently depending on whether I used its stream or DOM mode.
In the stream mode, it apparently fell over while trying to handle an instance of ½, but I couldn’t replicate the failure with a small file containing that.
In the DOM mode, it incorrectly reported an error on a CDATA section containing an XML declaration and DOCTYPE, but I couldn’t replicate the failure with a small file containing that.
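For reference, the two modes being driven here look roughly like this; the file name and the tiny counting listener are illustrative, not the RX sanity-check code:
require 'rexml/document'
require 'rexml/streamlistener'

# DOM mode: parse the whole document into a tree up front.
dom = REXML::Document.new(File.new('ongoing-sample.xml'))

# Stream mode: REXML pushes events at a listener as it parses.
class CountingListener
  include REXML::StreamListener
  attr_reader :elements
  def initialize
    @elements = 0
  end
  def tag_start(name, attrs)
    @elements += 1
  end
end

listener = CountingListener.new
REXML::Document.parse_stream(File.new('ongoing-sample.xml'), listener)
puts "saw #{listener.elements} start tags"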
I asked Sam Ruby whether I should be surprised, and a glance at his continuous-integration setup for 1.9 reveals that I shouldn’t: lots of tests are failing.
There is quite a bit of disgruntlement about XML and Ruby right at this point in time; see The Status of Ruby’s libxml and My Frustrations with REXML: Ruby’s Standard Library for Reading/Writing XML; or, Ruby’s Problem Is Its Type System, and Don’t Try to Tell Me Otherwise and Does Ruby's support for XML suck? I need your input.
The issue is under discussion on the ruby-core mailing list.
Well, there you go. By the way, in Ruby 1.9’s favor, it runs the (simplified, de-optimized) RX about three times as fast as 1.8.6 does. Any other implementers want a whack at it?
From: MenTaLguY (Jun 11 2008, at 07:19)
I'm not sure what's worse: that REXML has issues, or that lots of people seem to think that Hpricot (a permissive HTML parser lacking namespace support and implementing a small subset of XPath) is a better replacement.
From: automatthew (Jun 11 2008, at 10:12)
The author of Hpricot commented recently that he does not "think Rubyists and XMLists share much of a Venn diagram."
http://www.rubyflow.com/items/388
Many rubyists can get away with using Hpricot for XML parsing, the way I can get away with using Google's translated pages to make DNS changes on a foreign registrar's website. Unpleasant, but effective so long as it's infrequent.
From: Scott Johnson (Jun 12 2008, at 08:28)
All of this makes me happy to stick with Python. I haven't personally had any issues with XML support there.