I needed a quick and dirty tokenizer for a big chunk of XML-ish text to feed into some Java code so I was going to fire up Perl, then I remembered that modern Java comes with its own regular-expression library. Hey, it’s good! I put it together in quick-n-dirty hacker style, and it ran over a 100M file, finding fifteen million tokens, in about three minutes of CPU time on my 1.25GHz PowerBook. Quite respectable, but, I thought with a snicker, I bet Perl can beat that. (Perl’s regex engine is generally regarded as the state of the art.) So I whacked together a Perl version and, just to make sure I was getting the right answers, I had both the Java and Perl versions print out all the tokens they found. They both burned something over ten minutes, and Perl was maybe 10% faster; might have been the I/O or other static. I was impressed to find Java within 10% of the best. So then I ran it again without the output, just counting the tokens, and yowie zowie, Perl was at 8 minutes 47 seconds, Java back at 3 minutes 4 seconds. So I re-ran on a nearby Debian box, on the theory that the OS X versions of Java and Perl might not be representative of their kind. There are all sorts of variations around I/O and so on, but my finding is that for this problem, the Java 1.4.2 regex processing is somewhere around twice as fast as Perl 5.8.1. Frankly, I’m astounded. Read on for acknowledgements, some gory details, and a tasteful selection of Google ads for regular expression software. [Update: There is a good reason things are the way they are, and Perl’s trade-off may well be better.]
Acknowledgements · Thanks to Phillip Pearson, Sriram Srinivasan, Chris Lightfoot, Robert Sayre, and Chris Nokleberg for pointing out my thinko on the first cut of the perl regexp. Thanks to Anne van Kesteren and Russell Beattie for CSS help.
Regexes · Here’s the java version.
(<[^/]([^>]*[^/>])?>)|(</[^>]*>)|(<[^>]*/>)|((\p{Lu}|\p{Ll}|\p{Lt}|\p{Nd}|\p{Nl}|\p{No}|[\u4e00-\u9fa5]|\u3007|[\u3021-\u3029])((\p{Lu}|\p{Ll}|\p{Lt}|\p{Nd}|\p{Nl}|\p{No}|[-._:']|[\u4e00-\u9fa5]|\u3007|[\u3021-\u3029])*(\p{Lu}|\p{Ll}|\p{Lt}|\p{Nd}|\p{Nl}|\p{No}|[\u4e00-\u9fa5]|\u3007|[\u3021-\u3029]))?)
Here’s the perl version, which differs only in using \x
with
parentheses
rather than \u
to reference Unicode codepoints.
(<[^/]([^>]*[^/>])?>)|(</[^>]*>)|(<[^>]*/>)|((\p{Lu}|\p{Ll}|\p{Lt}|\p{Nd}|\p{Nl}|\p{No}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])((\p{Lu}|\p{Ll}|\p{Lt}|\p{Nd}|\p{Nl}|\p{No}|[-._:']|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])*(\p{Lu}|\p{Ll}|\p{Lt}|\p{Nd}|\p{Nl}|\p{No}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}]))?)
Unfortunately, they don’t produce quite the same results, with occasional variation around international characters. But close enough for the purpose of this experiment.