Quite a few software professionals have learned that they need to worry about internationalizing software, and some of those have learned how to go about doing it. For those getting started, herewith a brief introduction to Unicode, the one technology that you have to get comfortable with if you're going to do a good job as a software citizen of the world.
Right: U+0024
DOLLAR SIGN
Why Should You Care? · Whether you're doing business or academic research or public service, you have to deal with people, and these days, it's quite likely that some of the people you want to deal with come from somewhere else, and you'll sometimes want to deal with them in their own language. And if your software is unable to collect, store, and display a name, an address, or a part description in Chinese, Bengali, or Greek, there's a good chance that this could become very painful very quickly.
There are a few organizations that as a matter of principle operate in one language only (the US Department of Defense, the Académie française), but as a proportion of the world, they shrink every year.
Right: U+05D4
HEBREW LETTER HE
If you're in the business of specifying, paying for, or building software, and you're not paying attention to this stuff, you're probably not doing your job. The good news is that doing the right thing isn't that difficult or that expensive.
Writing Systems · The number of human languages is much larger than the number of systems for writing them down, but the definitive reference on the subject, The World's Writing Systems (Peter T. Daniels and William Bright, eds.), still has 74 big sections, most of which discuss not one but a family of related writing systems. The Unicode system, which we'll discuss in depth, covers some three dozen different language-oriented character sets.
Right: U+0E12
THAI CHARACTER THO PHUTHAO
Many languages, of course, aren't written in our A-to-Z alphabet; in fact, many aren't written with alphabets at all. Many scripts don't fill the page left-to-right and top-to-bottom, don't have spaces between words, don't have alphabetical order, and don't conform to Western expectations in lots of other ways.
And once you get past languages you have to deal with symbols for currency, mathematics, and science.
If you're feeling intimidated, don't; there is good technology in place to help deal with this, and the really hard problems have mostly been solved for you by other people.
We'll start with the basics: how do you get the languages of the world into and out of computers?
Right: U+AE7D
HANGUL SYLLABIC
Input Methods · How do people get text in all the world's languages into the computer? This is one of the many problems that you don't have to solve; anyone who sells a computer, or a PDA, or a cellphone, equips it with technology to do this.
For languages which have a reasonably-small number of characters (Hebrew, Arabic, Greek, the languages of India) you just use a keyboard with those characters painted on the keys.
For Chinese, Japanese, and Korean, there are a variety of tricks people use to enter thousands of characters using only a few dozen keys. I won't go into detail, but if you haven't seen it before, it's pretty impressive to watch a Japanese person pounding text into their PDA at high speed using just their thumbs.
Right: U+00D8
LATIN CAPITAL LETTER O WITH STROKE
Fonts and Rendering · How do computers display text in all those writing systems? The bad news is that this is a horribly hard problem; the good news is that once again, the people who make computer systems have done most of the work. If you're going to have to support complicated text-editing operations including select/cut/paste, you're going to have to bite the bullet and learn a lot more about this than you probably want to, but most software only really needs to accept short chunks of text in the fields of a form, and then to hand off other chunks of text to a browser or equivalent for display on the screen.
Right: U+4F5B
HAN IDEOGRAPH
There are two pieces of technology necessary to make this work. The first is fonts; if you have a Russian customer and send them some Cyrillic text (for example their name), they probably have the appropriate fonts installed on their computer and everything will just work out; on the other hand, if you're a Canadian anglophone like me and try to open an Indian website that's written in Gujarati, there's a good chance the fonts won't be there. Having said that, Macintosh OS X comes with an astoundingly wide selection of fonts that covers pretty well the whole world, and I believe modern Windows boxes are reasonably well-supplied as well.
Fonts don't solve the whole problem. Many languages just can't be rendered without some built-in knowledge of how characters, words, and lines fit together; for example, many versions of Windows can't display Thai text without downloading some special Thai rendering software (which Microsoft supplies). Once again, the good news is that nobody expects you to write this into your software.
Right: U+0634
ARABIC LETTER SHEEN
Unicode, ISO, Politics · You can do the right thing at a reasonable cost, mostly because of an excellent standard normally referred to as “Unicode”. There's a lot of history behind this simple label; Unicode proper is a consortium of technology vendors that, many years ago in a flash of intelligence and public-spiritedness, decided to unify their work with that going on at the ISO. Thus, while there are officially two standards you should care about, Unicode and ISO 10646, through some political/organizational magic they are exactly the same, and if you're using one you're also using the other.
Right: U+0F03
TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA
The reason we usually talk about Unicode rather than ISO 10646 is that Unicode has a helpful web site and also publishes its standard as a beautifully printed book, which you should think seriously about buying; more on that later.
What's a “Character” Anyhow? · All human languages are written using characters, and while philologists can enjoy decades-long arguments about what characters are, as far as Unicode (and computers) are concerned, a character can usefully be defined as the smallest atomic unit of text with semantic value.
Computers usually store characters as small numbers; back in the days of A-to-Z ASCII, you could fit a character into an eight-bit byte, but those days are long gone.
Right: U+221E
INFINITY
Historically, there have been hundreds of different systems for assigning characters to numbers and then stuffing those numbers into bytes of computer storage. Given that every computer manufacturer in the world tended to cook up their own scheme for every language in the world, this was clearly an interoperability disaster in the making, and led to the ISO and Unicode work.
How Unicode Works · The basics of Unicode are actually pretty simple.
It defines a large (and steadily growing) number of characters, just under 100,000 last time I checked.
Each character gets a name and a number; for example, LATIN CAPITAL LETTER A is 65 and TIBETAN SYLLABLE OM is 3840.
Unicode includes a table of useful character properties such as "this is lower case" or "this is a number" or "this is a punctuation mark".
Right: U+0A8A
GUJARATI LETTER UU
Also, for each of these characters, the standard provides a helpful picture of a reasonably-typical rendition.
For reasons we need not explore here, Unicode numbers are given in four hex digits preceded by U+, so “A” is U+0041 and “Tibetan Om” is U+0F00.
Now the labels for the pictures in the right margin should make sense.
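As a rough illustration of the name, number, and property tables described above, here is a minimal Python sketch using the standard library's `unicodedata` module. The helper name `describe` is made up for this example, and the two-letter category codes it prints (such as `Lu` for an uppercase letter) are only one of the properties the standard defines.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (category)' for a single character.

    `describe` is a hypothetical helper written for this example.
    """
    code_point = ord(ch)                      # the character's number
    name = unicodedata.name(ch, "<unnamed>")  # the character's official name
    category = unicodedata.category(ch)       # general category, e.g. 'Lu'
    return f"U+{code_point:04X} {name} ({category})"


# "A", Tibetan Om, the dollar sign, and the infinity sign.
for ch in ["A", "\u0f00", "$", "\u221e"]:
    print(describe(ch))
```

The `:04X` format pads the number to at least four uppercase hex digits, which matches the U+ labels used in this essay.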
The Unicode standard also includes a large volume of helpful rules and explanations about how to display these characters properly, do line-breaking and hyphenation and sorting and all sorts of other stuff that you probably don't have to worry about, but if you do, it's all right here and easy to find.
Right: U+091D
DEVANAGARI LETTER JHA
Encodings · From Unicode's point of view, text is stored on a computer as a series of numbers, one per character. There are many different ways to arrange these numbers in memory (or in a network transmission), some straightforward and efficient, some less so. These are called “encodings”. Unicode itself defines several different encoding schemes, the two best known of which are UTF-8 and UTF-16.
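To make the distinction between characters and encodings concrete, the following sketch (in Python, chosen here only for illustration) takes the same two characters and lays them out as bytes in UTF-8 and in UTF-16. The big-endian variant `utf-16-be` is used so that no byte-order mark appears in the output.

```python
# Two characters: LATIN CAPITAL LETTER A followed by TIBETAN SYLLABLE OM.
text = "A\u0f00"

# The same series of numbers, arranged as bytes in two different ways.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # big-endian, no byte-order mark

print([f"U+{ord(c):04X}" for c in text])  # the numbers themselves
print(utf8.hex(" "))    # 1 byte for "A", 3 bytes for Tibetan Om
print(utf16.hex(" "))   # 2 bytes for each of these characters

# Decoding either byte sequence gives back the same text.
assert utf8.decode("utf-8") == utf16.decode("utf-16-be") == text
```

The byte counts differ, but the text that comes back out is identical; the encoding is only a storage and transmission detail.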
However, there's a good chance that your software will have to input and output characters in some other pre-Unicode encoding scheme such as ASCII, ISO-8859, or a Microsoft Code Page. Fortunately, converting back and forth is a fairly well-defined process, if a little bit less efficient than we would like.
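A hedged sketch of what that conversion can look like in Python: the byte string below stands in for input from a hypothetical legacy system that uses ISO-8859-1, and the helper `to_legacy` is invented for this example. The one real subtlety is that converting back only works for characters the older encoding actually contains.

```python
# Bytes as they might arrive from a legacy system using ISO-8859-1
# (an assumed source, for illustration only).
legacy_bytes = b"Acad\xe9mie fran\xe7aise"

# Input: legacy bytes -> Unicode text.
text = legacy_bytes.decode("iso-8859-1")

# Storage: Unicode text -> UTF-8 bytes.
stored = text.encode("utf-8")


def to_legacy(s: str, encoding: str) -> bytes:
    """Hypothetical helper: encode for an old system, substituting '?'
    for any character the target encoding cannot represent."""
    return s.encode(encoding, errors="replace")


print(text)                               # the decoded text
print(to_legacy(text, "iso-8859-1"))      # round-trips to the original bytes
print(to_legacy("\u0f00", "iso-8859-1"))  # Tibetan Om has no ISO-8859-1 form
```

Python's codec names such as `cp1252` cover the common Microsoft code pages in the same way; whether silent `?` substitution is acceptable depends on the application.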
Right: U+0178
LATIN CAPITAL LETTER Y WITH DIAERESIS
It would be a really good idea to start storing all your data internally, in your software, as either UTF-8 or UTF-16, starting now. I'll discuss the trade-offs between these two in another essay, which will be quite a bit more technical than this one.
Special Problems in Asian Scripts · The Asian scripts (Chinese, Japanese, and Korean, often abbreviated “CJK”) present special problems, both political and technical. The process by which all these related character sets were organized into the Unicode tables was somewhat controversial and left bruised egos in various places around Asia, in particular Japan. For quite some time, whether or not you were using Unicode, you had to be really careful what you said about it in Japan or you could end up catching some real grief.
Right: U+306C
HIRAGANA LETTER NU
However, today there seems to be fairly widespread acceptance of the fact that while Unicode may not be perfect, it's probably an acceptable compromise and substantially better than the chaos that came before.
Another problem is that in these parts of the world, it is not unheard-of to invent new characters. The Japanese word for such characters is gaiji; historically they were invented for personal or company names. Just last year, I found out that NTT DoCoMo has been inventing new characters for teenagers to include in their cellphone text messages. This made my blood run a little bit cold, and I think the jury's still out on what the impact is going to be from a business point of view.
Right: U+0A14
GURMUKHI LETTER AU
Search · One of the most common things you have to do with text is search it. In an internationalized environment, this is tricky with Unicode and essentially impossible without it. It's tricky because a decent search capability knows about things like singular/plural, verb conjugations, and maybe something about synonyms. This is obviously different from language to language.
Another problem is that in some languages (for example Japanese and Chinese) there are no spaces between the words. This is a problem for software that needs to search such text. Not all search-engine vendors have done a good job of this, and if you're building your own search capability you're going to have to think about it.
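The word-boundary problem is easy to see with a small Python experiment. The Japanese phrase below is my own sample (roughly "the languages of the world"), not taken from any particular system. Splitting on whitespace, which is what a naive indexer might do, finds the words in the English phrase but treats the whole Japanese phrase as a single token.

```python
english = "the languages of the world"
japanese = "世界の言語"  # sample phrase, assumed for illustration

# Naive tokenization: split on whitespace.
english_tokens = english.split()
japanese_tokens = japanese.split()

print(english_tokens)   # five separate words
print(japanese_tokens)  # one unbroken token

# A search for the word for "language" alone finds no matching token,
# even though those characters do appear in the text.
print("言語" in japanese_tokens)  # token match fails
print("言語" in japanese)         # substring match succeeds
```

Real search software handles this with language-aware segmentation, which this sketch does not attempt.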
Does This All Work? · It's important to realize that all this is here today and really works. The following is a bit of an experiment and depends on how many fonts you have in your browser. Suppose you wanted to send an invoice to me, Tim Bray; it's a good idea to spell someone's name correctly, particularly when you're asking for money. So, if I were living in Cairo you'd probably want to send it to تم براي, and if in Osaka, to チムブレー.
What You Have to Do · So what, practically speaking, should you do, as a software practitioner? Here are a bunch of recommendations: