Welcome to another installment in ongoing's ongoing tour through text-processing issues. This one is about programming-language support, and while it makes specific reference to Java, it tries to be generally applicable to modern software environments. The conclusion is that Java is OK for some kinds of text processing, but has real problems when the lifting gets heavy.
Last time out I said this was going to be a three-part essay, but now I realize I'd already written two other text-processing-centric pieces before that, one an intro to Unicode, and the other entitled On Character Strings. The present essay will recapitulate some of the material in that second note, but no matter how you cut it, we're already (to quote Douglas Adams) on volume four of the trilogy. To make it worse, I'm gestating some essays on full-text-search, so we'll just call it a continuing series.
I had also said that I was going to discuss C# to a lesser degree, and I thought I could probably get away with that simply because C# is so amazingly like Java, but I found the language and library documentation so hard to find and navigate on the Microsoft site that I abandoned that project. I suspect that some of the remarks will apply, and if someone wants to follow up on the differences (if any), I'll point to or reproduce such material.
I think that most text processing falls into one of two buckets, which I'll illustrate with a story.
You Look Inside? · I'm remembering a meeting that took place sometime in '95 or '96. My company, Open Text, had just acquired a company called Odesta. We were mostly a full-text-search company; Odesta was mostly a pure document-management outfit.
Shortly after the acquisition, we had a technical summit of the senior engineers from the two companies to explain to each other what our stuff did and figure out how to move forward. The Odesta guys talked first. Their Doc Man stuff was pretty cool; it had an RDBMS of interesting metadata about each object checked into and out of the system, and the software could slice and dice this many different ways, had a very sophisticated security model, and so on. We were up next, and our head search-algorithm guy was talking about tokenizing and inflexion processing and index update and so on, when one of the Odesta guys put up his hand with a question and said in a slow Southern-accented voice: “Hold on... you mean you guys look inside the documents?!?” At which point a silence fell in the room as we all had an “Aha!” moment.
Classes of Text Processing · I think that story illustrates the two main classes of text processing software, which for the purposes of this essay I'll call “light” and “heavy”.
Light Text Processing · This is what most people do most of the time. You get some text, from a file or, more likely, via a database. For example:
SELECT uri, description FROM Essays WHERE age < 24 * 60 * 60;
You keep the text around in a String object, and maybe you intern it for efficiency and use it as a hash key, and eventually you jam it into an HTML template or send it off in a SOAP message or update another table. What you don't do is invest many lines of code in reaching into that chunk of text and fiddling around with characters; rather, you take what you're given and route it around.
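To make that concrete, here's a minimal sketch of the light-processing pattern in Java; the JDBC connection string, the table, and the markup are invented for illustration, not taken from any real system.

// A sketch of "light" text processing: fetch strings, intern one for
// cheap comparisons, use it as a hash key, and wrap markup around the
// text without ever looking inside it. Connection details are placeholders.
import java.sql.*;
import java.util.*;

public class LightProcessing {
    public static void main(String[] args) throws Exception {
        Connection conn =
            DriverManager.getConnection("jdbc:somedb:essays", "user", "password");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT uri, description FROM Essays WHERE age < 24 * 60 * 60");
        Map essays = new HashMap();
        while (rs.next()) {
            String uri = rs.getString("uri").intern();    // interned: cheap to compare
            essays.put(uri, rs.getString("description")); // string as hash key
        }
        rs.close();
        conn.close();
        // Route the text around; no character-level fiddling anywhere.
        for (Iterator it = essays.keySet().iterator(); it.hasNext();) {
            String uri = (String) it.next();
            System.out.println("<a href=\"" + uri + "\">" + essays.get(uri) + "</a>");
        }
    }
}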
An example is the code that generates ongoing, which takes the chunks of text emitted by the XML parser, sticks them into an RDBMS, wraps XHTML tags around them and writes them into the output files.
Heavy Text Processing · This is the province of search engines and parsers and semantic analyzers and browsers and rendering engines and word processors.
Such software lives by “looking inside the document”, disassembling, searching, sorting, transforming, and reconstructing strings of characters. Its needs are qualitatively different from those of the light-processing world.
An example is the Expat XML parser and the Perl regular expression engine that the ongoing software uses to do all the heavy lifting.
Java String is Lightweight · Before I go any further, I should note that my remarks on Java are in the context of the “Java 2 Platform Std. Ed. v1.3.1”, which is what Apple packages up with the Macintosh.
Anyone who's tried to do heavy lifting with Java's String class is rolling their eyes at this section's headline, but what I mean to say is that String is well-suited to the needs of light text processing: you can intern it and do very lightweight comparison. I'm less friendly to the mutable StringBuffer class. It has all the methods you could want for twiddling characters, but its designers seem to think that successive append method dispatches are a good way to build up character strings, and if your lifting gets heavy at all, this collection of blunt-edged tools is apt to feel like trying to play a Stradivarius with welding gloves on.
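A tiny sketch of both points, with invented strings; the interned comparison really does come down to a reference check.

// intern() gives back a canonical String, so comparison is a cheap
// reference check; StringBuffer building is a chain of append() calls.
public class StringDemo {
    public static void main(String[] args) {
        String a = new String("ongoing").intern();
        String b = "ongoing";            // literals are interned automatically
        System.out.println(a == b);      // true: one shared String object

        String description = "a continuing series";
        StringBuffer buf = new StringBuffer();
        buf.append("<p>").append(description).append("</p>");
        System.out.println(buf.toString());
    }
}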
Still, most light text processing is going to be in applications where the costs are dominated by database, UI, and network overhead, so this is probably not too bad.
But if you're going to be doing any heavy text processing, you may find that Java's String class and char primitive are part of the problem, not part of the solution.
Character Support · In Java, characters are represented by the char data type, which is claimed to be a “16-bit Unicode character”. Unfortunately, as I pointed out recently, there really is no such thing. To be precise, a Java char represents a UTF-16 code unit, which may represent a character or may, via the surrogate mechanism, represent only half a character.
The consequence of this is that the following methods of the String class can produce results that are incorrect: charAt, getChars, indexOf, lastIndexOf, length, and substring.
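Here's a minimal demonstration, using U+1D11E (MUSICAL SYMBOL G CLEF), a character outside the Basic Multilingual Plane that needs a surrogate pair in UTF-16.

// One "astral-plane" character looks like two chars to String.
public class AstralDemo {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E";          // U+1D11E as a surrogate pair
        System.out.println(clef.length());     // prints 2, though it's one character
        char half = clef.charAt(0);            // a high surrogate: half a character
        System.out.println(Integer.toHexString(half));      // d834
        System.out.println(clef.substring(0, 1).length());  // a lone surrogate, length 1
    }
}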
Of course, if you are really sure that you will never have to deal with an
“astral-plane” character, to the point of being willing to accept
that your software will break messily if one shows up, you can pretend that
these errors can't happen.
To me, this feels just like deciding that you'll never have to deal with more than 64K of memory, or a database bigger than 32 bits can address, or a date after December 31, 1999. What Hunter S. Thompson would call “bad craziness.” I'll settle for “shortsighted.”
But still probably OK for light processing, where you're not counting on looking inside those String objects.
I note that Java sorta kinda recognizes the problem; the indexOf and lastIndexOf methods of String (correctly) use an int to hold the character argument, but the replace method still uses char.
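The asymmetry is right there in the signatures; here's a sketch (whether this particular indexOf call actually finds an astral-plane character in a given library release is another question).

// indexOf(int) can at least be handed any code point as an argument;
// replace(char, char) can only ever talk about single 16-bit chars.
public class SignatureDemo {
    public static void main(String[] args) {
        String s = "abc\uD834\uDD1Edef";
        int where = s.indexOf(0x1D11E);    // int argument; behavior varies by release
        System.out.println(where);
        System.out.println(s.replace('a', 'z'));  // char-only: no way to name U+1D11E
    }
}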
One minor problem with Java should be noted: DataInputStream and DataOutputStream have readUTF and writeUTF methods which do not in fact read or write any of the recognized Unicode encoding forms, but that's OK, because InputStreamReader and OutputStreamWriter handle them just fine; just don't get confused by the names.
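For instance, to actually get UTF-8 on and off the disk you go through the Reader and Writer classes with an explicit encoding name; a minimal sketch, with a made-up file name:

// Reading and writing UTF-8 the way that actually works: wrap streams
// in InputStreamReader/OutputStreamWriter with an explicit encoding.
import java.io.*;

public class Utf8Copy {
    public static void main(String[] args) throws IOException {
        Writer out = new OutputStreamWriter(
            new FileOutputStream("essay.txt"), "UTF-8");
        out.write("Some text, possibly non-ASCII: \u00e9\n");
        out.close();

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new FileInputStream("essay.txt"), "UTF-8"));
        System.out.println(in.readLine());
        in.close();
    }
}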
Fortunately, Those Who Control Java have noticed the UTF-16 problem, and there is work in progress to address it. I am a bit puzzled, though, by the remark in the above-noted document that most of the changes will take place in java.lang.Character. Now admittedly this is a class which I've never found occasion to use, but it seems to me the real breakage is on the String front. In any case, it's good that they're addressing the issue.
Characters Considered Harmful · One of the key principles of doing heavy text processing in an internationalized environment is that it's almost always wrong to work directly with individual characters, especially at the API level. I'll cover this in brief, but for those working in this area, it would be good to spend some quality time with a W3C Working Draft entitled Character Model for the World Wide Web. This is not light reading, but it's important stuff. It highlights a number of problems with treating characters as units of work.
I'm not going to lay all this out in detail, but the conclusion is sound: if there's a place in your API that takes a “character” as input or provides one as output, you're probably looking at a design error.
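One small illustration of the trouble (my example, not the Working Draft's): two strings any user would call identical look entirely different to character-at-a-time code.

// Precomposed é versus e + combining acute accent: canonically
// equivalent text, but char-level logic sees different lengths,
// different "characters", and unequal strings.
public class CombiningDemo {
    public static void main(String[] args) {
        String precomposed = "caf\u00E9";   // é as a single character
        String decomposed  = "cafe\u0301";  // e followed by U+0301 COMBINING ACUTE
        System.out.println(precomposed.equals(decomposed)); // false
        System.out.println(precomposed.length());           // 4
        System.out.println(decomposed.length());            // 5
    }
}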
Strings or Character Arrays or Byte Arrays? · In a previous essay inspired by some fine writing by Paul Graham, I considered the question he raised: do we need String data objects at all?
After all, a string is just a sequence of characters, and every programming language under the sun has arrays, so why do you need a separate object to package one of these up? Some of the most sophisticated text-processing software in the world, er, make that most of the most sophisticated text-processing software in the world, is written in C, which gets along just fine with null-terminated byte arrays.
So for Java, should we abandon String and do all our work with char[] constructs? I don't think so, simply because I think the char primitive is just too deeply broken. Also, I want to use tricks like strcmp() and strncat(), beloved of grey-bearded Unix veterans.
Gasp! Surely he's not suggesting that we retreat to putting everything in a byte[] construct, and presumably revert to living in caves and courting women with clubs?!? Well, no, because I am an object-oriented kinda guy, when I can get away with it.
So, how do we get the heavy industrial machinery for doing superior text processing in modern languages without compromising their virtues? Stay tuned.
Acknowledgements · Grateful thanks to Rick Jelliffe, Miles Sabin, and Elliotte Rusty Harold over on xml-dev, who helped hash through some of these issues.