Articles in this space have introduced Unicode, discussed how it is processed by computers, and argued that Java's primitives are less than ideal for heavy text processing. To explore this further, I've been writing a Java class called Ustr for “Unicode String,” pronounced “Yooster.” The design goals are correct Unicode semantics, support for as much of the Java String API as reasonable, and support for the familiar, efficient null-terminated byte array machinery from C. Here goes:
package com.textuality;

public class Hello
{
    public static void main (String[] args)
    {
        // room for 13 UTF-8 bytes
        Ustr message = new Ustr(13);

        // construct Ustr from String
        Ustr hello = new Ustr("Hello");

        // blast it into message
        message.strcpy(hello);

        // append a character given as an integer
        message.prepareAppend();
        message.appendChar((int) ' ');

        // construct Ustr from some integers
        int [] wints = { 'w', 'o', 'r', 'l', 'd' };
        Ustr world = new Ustr(wints);

        // stick it on the end, using byte ops
        Ustr.strcat(message.s, world.s);

        // there's no room in the buffer for all these bangs
        Ustr bangs = new Ustr("!!!!!!!!!!!!!!!!!!!!!");

        // damn the torpedoes, we have safe methods (note extra 's')
        message.sstrcat(bangs);

        // it would be more stylish to do this in Hebrew or Korean
        System.out.println(message);
    }
}
I've got enough of it working to start sharing it with the world. I can't post the code till I sort out the copyright—anything I produce belongs to Antarctica, but since there's no money in this, the best thing for the world and the company, in the event there's any interest, is to publish it under some sane OSS license.
Also, I'm a little reluctant to publish code until there's been a bit of feedback, because I haven't programmed Java professionally for a few years and it's quite possible that the interface reveals my profound ignorance of some new trick or approach that I should fix. But I promise to get it out there in the next seven days.
So here's the Javadoc, as a basis for discussion. It provides enough info for anyone competent to implement Ustr or (hint hint) the C# equivalent; it only took me a couple of weeks working an hour here and an hour there, mostly late at night. There are just under 1500 lines of Java (at least half Javadoc bloat; that's OK, Javadoc bloat is a good thing) and the class file is 11K.
How It Works · A Ustr is a thin wrapper around a null-terminated UTF-8 byte sequence. Nothing is private; the byte array s and the start of the sequence base are both public fields. You could allocate a really big byte array and have lots of different Ustrs in it. There's one more field called offset, which is used for stepping through the characters embedded in the UTF-8.
I went to some effort to get the Unicode right; there are methods called appendChar() and nextChar() that store and retrieve Unicode characters (as integers) from the UTF-8, and use that offset field to work through the text in a natural way. Also there are constructors and generators for integer arrays and Java's kind-of-UTF-16 String thingies.
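To make the offset idea concrete, here is a minimal sketch of how a cursor can store and retrieve code points in a null-terminated UTF-8 buffer. It is not the Ustr source: the class name Utf8CursorSketch, the rewind() helper, and the behavior at the terminator are all assumptions made for illustration, and the decoder assumes well-formed UTF-8.

```java
// Hypothetical illustration only; not the actual Ustr implementation.
class Utf8CursorSketch {
    final byte[] s;   // buffer holding null-terminated UTF-8
    final int base;   // where the string starts within s
    int offset;       // cursor shared by appendChar() and nextChar()

    Utf8CursorSketch(int capacity) {
        s = new byte[capacity + 1]; // one extra byte for the terminating 0
        base = 0;
        offset = 0;
    }

    // Encode one code point as UTF-8 at the cursor, then re-terminate.
    // U+0000 cannot be stored, since 0 is the terminator.
    void appendChar(int c) {
        if (c < 0x80) {
            s[offset++] = (byte) c;
        } else if (c < 0x800) {
            s[offset++] = (byte) (0xC0 | (c >> 6));
            s[offset++] = (byte) (0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            s[offset++] = (byte) (0xE0 | (c >> 12));
            s[offset++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            s[offset++] = (byte) (0x80 | (c & 0x3F));
        } else {
            s[offset++] = (byte) (0xF0 | (c >> 18));
            s[offset++] = (byte) (0x80 | ((c >> 12) & 0x3F));
            s[offset++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            s[offset++] = (byte) (0x80 | (c & 0x3F));
        }
        s[offset] = 0;
    }

    // Move the cursor back to the start of the string (assumed helper).
    void rewind() {
        offset = base;
    }

    // Decode the code point at the cursor and advance past it.
    // Returns 0 at the terminator. Assumes well-formed UTF-8.
    int nextChar() {
        int b = s[offset] & 0xFF;
        if (b == 0) {
            return 0;
        }
        offset++;
        int extra, c;
        if (b < 0x80) {
            return b;
        } else if (b < 0xE0) {
            extra = 1; c = b & 0x1F;
        } else if (b < 0xF0) {
            extra = 2; c = b & 0x0F;
        } else {
            extra = 3; c = b & 0x07;
        }
        while (extra-- > 0) {
            c = (c << 6) | (s[offset++] & 0x3F);
        }
        return c;
    }
}
```

Appending a few code points, calling rewind(), and then calling nextChar() until it returns 0 walks the text one Unicode character at a time, however many bytes each one occupies.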
I'm thinking about making a strong claim: with some more polishing, this package, in the hands of someone who knows what they're doing, is going to be both more correct and more efficient for doing heavy-lifting text processing than what comes with Java.
strcpy() and Friends · Once you have null-terminated byte arrays, why shouldn't you be able to make like a classic C programmer? So Ustr has strcat(), strcmp(), strcpy(), strlen(), strstr(), strchr(), and strrchr(). They all operate on bytes, not Unicode characters. There are lots more where that came from (strspn(), strtok(), etc etc ad nauseam); I just implemented the ones I've actually used regularly over the years. The exception is that I implemented strchr() and strrchr() instead of the slightly-more-idiomatic index() and rindex(), because I wanted the term “index” to always mean counts of Unicode characters, not bytes.
Each comes in at least two flavors: first, a nice modern object-oriented version, so if you have a Ustr named ustr, you can say things like ustr.strcmp(other_ustr). Second, there are down-to-the-metal static functions that just pump the contents of byte arrays back and forth: Ustr.strcpy(to, from). One of the reasons the class is Ustr rather than ExcellentPoMoUnicodeString is that Ustr.strcpy() is less typing (snicker).
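The static, byte-pumping flavor described above might look roughly like the following. This is a sketch under stated assumptions, not the Ustr code: the class name ByteStrSketch is hypothetical, the return types are guesses modeled on C, and nothing here checks for overrun.

```java
// Hypothetical illustration of C-style routines over null-terminated byte arrays.
class ByteStrSketch {
    // Number of bytes before the terminating 0.
    static int strlen(byte[] s) {
        int n = 0;
        while (s[n] != 0) {
            n++;
        }
        return n;
    }

    // Copy from into to, including the terminator; no bounds checking.
    static byte[] strcpy(byte[] to, byte[] from) {
        int i = 0;
        while ((to[i] = from[i]) != 0) {
            i++;
        }
        return to;
    }

    // Append from to the end of to, including the terminator.
    static byte[] strcat(byte[] to, byte[] from) {
        int t = strlen(to);
        int f = 0;
        while ((to[t++] = from[f++]) != 0) {
            // all the work happens in the loop condition
        }
        return to;
    }

    // Compare as unsigned bytes: negative, zero, or positive, as in C.
    static int strcmp(byte[] a, byte[] b) {
        int i = 0;
        while (a[i] != 0 && a[i] == b[i]) {
            i++;
        }
        return (a[i] & 0xFF) - (b[i] & 0xFF);
    }
}
```

Comparing bytes as unsigned values matters here: with UTF-8, unsigned byte order agrees with code point order, so a byte-level strcmp() sorts strings the same way a code-point comparison would.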
I haven't done the “n” variants (strncpy() etc) yet, because for anything that copies data I made a safe version, e.g. for strcpy() there's sstrcpy() (note the extra “s”) which efficiently makes sure you don't overrun the target buffer by catching ArrayIndexOutOfBoundsException. Which is fine, but I think there's a good case for strncpy() and friends anyhow.
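One way the catch-the-exception approach could work is sketched below. What the real sstrcpy() does after catching the exception is not stated here, so the policy in this sketch (truncate, back up to a character boundary, re-terminate, and return whether everything fit) is an assumption, as is the class name SafeCopySketch.

```java
// Hypothetical illustration; the truncation policy and return value are assumptions.
class SafeCopySketch {
    // Copy from (assumed null-terminated) into to. The loop has no explicit
    // bounds test; Java's own array check raises the exception on overrun.
    // Returns true if the whole string fit, false if it was truncated.
    static boolean sstrcpy(byte[] to, byte[] from) {
        int i = 0;
        try {
            while ((to[i] = from[i]) != 0) {
                i++;
            }
            return true;
        } catch (ArrayIndexOutOfBoundsException e) {
            if (to.length == 0) {
                return false; // nowhere to put even a terminator
            }
            int end = Math.min(i, to.length - 1);
            // Back up over UTF-8 continuation bytes (10xxxxxx) so the
            // truncated copy does not end in a partial character.
            while (end > 0 && end < from.length && (from[end] & 0xC0) == 0x80) {
                end--;
            }
            to[end] = 0;
            return false;
        }
    }
}
```

The design choice being illustrated is that the common case, where the string fits, runs the same tight loop as the unsafe version, and the extra work only happens when the buffer is actually too small.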
The java.lang.String Family · Pretty well all of String that's not actually pernicious is in Ustr. The constructors are a bit different, but the other methods are there, except for anything that involves case-folding (arguably wrong and empirically horribly expensive in Unicode) and all the valueOf() stuff. Also I started implementing trim() by calling out to the String version, decided that what that version does is horribly, unsalvageably, wrong, and postponed it, because doing this properly with respect to the Unicode tables will take a bit of work.
The other difference is that methods such as charAt() and substring() operate correctly in terms of Unicode characters. I even did intern() which is kind of questionable since a
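On the point about charAt() indexing by Unicode character: the sketch below shows the distinction using only java.lang.String, since the Ustr code is not shown here. String.charAt() counts UTF-16 code units, so a character outside the Basic Multilingual Plane occupies two index positions. The helper names in CodePointIndexSketch are hypothetical; codePointAt(), offsetByCodePoints(), and codePointCount() are standard String methods (Java 5 and later).

```java
// Hypothetical helpers showing character-based indexing on a Java String.
class CodePointIndexSketch {
    // The Unicode character at the given character index (not code-unit index).
    static int charAt(String s, int index) {
        return s.codePointAt(s.offsetByCodePoints(0, index));
    }

    // Length in Unicode characters rather than UTF-16 code units.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as a surrogate pair, then 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println("code units: " + s.length());
        System.out.println("characters: " + length(s));
        System.out.println("String.charAt(1): U+" + Integer.toHexString(s.charAt(1)).toUpperCase());
        System.out.println("character 1:      U+" + Integer.toHexString(charAt(s, 1)).toUpperCase());
    }
}
```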
Implementation Notes · The implementation (while fairly well-tested) is entirely unoptimized; it's all done with the bare minimum amount of code. This, I would argue, is entirely correct in a first cut. I can imagine all sorts of optimizations that my intuition tells me would make it run a lot faster (I note that in GNU libc the str*() routines are largely in assembler), but I say “Get thee behind me, intuition!”
It should be emphasized that this is a power tool; pumping bytes around like this is efficient, but it relies on you to do null-termination and allocate enough space and all that good stuff. So if you're not doing heavy lifting, Java's built-in String is probably a much better choice.
TDD Again · This is my first outing with JUnit in hand and an aggressive TDD approach. JUnit rocks, and TDD is programmers' crack of the highest purity. There is no going back, nosirree. (Mind you, on the Mac, JUnit or its Swing wrapper has a busy loop of some sort and burns CPU steadily. But still.)
So there's a TestUstr harness with 1500 or so lines of code and 26 test functions with 134 assertions last time I counted. At that, I don't think the harness is as complete as Kent Beck would like, and could well be expanded. As long as this remains my baby the policy will be that nothing goes into Ustr.java without something corresponding in TestUstr.java.
Plans? · I don't know. There's a kind of low-level geek thrill in implementing strcpy() and friends, and there are subtleties to some of these calls that I hadn't appreciated. It would appear that I wasn't quite as comprehensively on top of text-processing issues as I maybe thought I was, and this is worth doing if only to have learned that.
In the unlikely event that other people want to take this seriously and use it or work with it, I'd be perfectly happy to park it on sourceforge and run it for a while. I can't see it being much work.
For the first time, I regret not having a comments feature for ongoing, but I'm just not going to have time to write that code. Is this a job for Yahoo groups or some such? Advice on hosting discussions would be received with gratitude; Google and five seconds will get you my email address.
And if I don't get any mail, this has all been a pleasant experiment, and so far, I have failed to falsify my hypotheses about string processing in high-level languages.