A few days ago I wrote a little report on regular-expression performance; it drew a surprising amount of feedback, including one piece that throws an interesting sidelight on the trade-offs around Java and Open Source.
Geek Feedback · Simon Cozens, among others, suspects that perl was slow because the regex is built around Unicode patterns. Well yeah, but, that’s the future; Larry Wall has pointed out that any time you find yourself writing [a-zA-Z] into a regex you’ve probably just uttered a bug.
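To make the [a-zA-Z] point concrete, here is a minimal sketch in Java (not the code from the original report). The class name `LetterRuns`, the helper `words`, and the sample text are all made up for illustration; `\p{L}` is the standard `java.util.regex` syntax for "any Unicode letter".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterRuns {
    // ASCII-only letters: the kind of pattern the remark warns about.
    static final Pattern ASCII_WORD = Pattern.compile("[a-zA-Z]+");
    // Any Unicode letter, via the \p{L} general category.
    static final Pattern UNICODE_WORD = Pattern.compile("\\p{L}+");

    // Hypothetical helper: collect every match of p in text, in order.
    static List<String> words(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "naive cafe" with an i-diaeresis and an e-acute, written as escapes
        // so the source file encoding does not matter.
        String text = "na\u00efve caf\u00e9";
        // The ASCII class splits at each accented letter: three fragments.
        System.out.println(words(ASCII_WORD, text).size());
        // The Unicode class keeps both words whole: two matches.
        System.out.println(words(UNICODE_WORD, text).size());
    }
}
```

The ASCII pattern chops the text into "na", "ve", and "caf", silently dropping the accented letters, which is exactly the sort of bug the remark has in mind.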
Also, check out the interesting dialogue at Perlmonks.org; they suggest that I could have done the job better and they may be right. But I did need to group the subpatterns, whatever they say, to do the tokenization.
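For readers wondering why grouping matters for tokenization, here is a hedged sketch of the general technique, not the pattern used in the report (which isn't shown here). The token classes, the group names `word`, `number`, and `punct`, and the class `GroupTokenizer` are assumptions chosen for illustration; named groups require Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupTokenizer {
    // One alternation, one named group per (hypothetical) token class.
    // Whichever group is non-null after a match tells us the token type.
    static final Pattern TOKEN = Pattern.compile(
        "(?<word>\\p{L}+)|(?<number>\\p{Nd}+)|(?<punct>\\p{P})");

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        // find() skips anything no alternative matches (whitespace here).
        while (m.find()) {
            if (m.group("word") != null) {
                out.add("WORD:" + m.group("word"));
            } else if (m.group("number") != null) {
                out.add("NUMBER:" + m.group("number"));
            } else {
                out.add("PUNCT:" + m.group("punct"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Perl 5, Java 2!")) {
            System.out.println(token);
        }
    }
}
```

Without the groups, a match would only say that some alternative fired, not which one, so the tokenizer would have to re-test each token to classify it.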
Patching and Forking · Here’s where it gets interesting. Kevin Burton, who has written about this before, emailed me: “This should have been a dedicated OSS project outside of the JRE. It has no requirements on internals. There are some more patches (according to the maintainer) in 1.5 but it will be years before I can even begin to consider that. If the SUN license allowed it I would just fork this code and create a new branch with my performance patches... but no... development has to stop and we have to wait until JDK 1.5 is deployed everywhere. Innovation has to stop...”
Well... yeah, but here’s a story. I was kind of surprised at the results on my OS X box, so I took the code and data over to a nearby Windows XP box and got similar results, so I took ’em to a Debian-stable box I had handy and once again, about the same story.
Only, not quite the same. Because it was Perl 5.8.1 on one box and 5.8.3 on another and 5.6.1 on the third, and the i18n/regex code was slightly different in each version, and no two of them gave quite the same results.
Now, I’ll grant that Unicode/i18n is one of the wobbliest areas in recent Perl versions. But you know, Java on all three boxes gave me the exact same output, bit for bit. Because you can’t do what Kevin suggested and fork it.
So it all comes down to what matters to you. Yep, if I’m at Google running a billion-hits-a-day web site, screw uniformity, I’ll rip it apart and patch furiously and fork like crazy for that last 10% performance bump if I have to.
On the other hand, if I’m deploying enterprise applications in a heterogeneous networked environment, I care about predictability and I really want to be able to rely on things running the same way here, there, and everywhere and I don’t want to be hearing about any damn forks. I’ve been on both sides of this trade-off and probably will be again.
Note that I am not taking a position on Java and OSS. I’ve been learning about this stuff for months, and I still haven’t figured out what open-sourcing Java really means, what problems it would solve, and whether it would be a good or bad thing for Sun. I’ve learned one thing though: anyone who says the issue is simple, that’s someone who doesn’t understand it.