The “On Search” series resumes with this look at the issues that arise in search when (as you must) you deal with words from all the world’s languages, written in all the characters that the world’s people use. “I18n” stands for “internationalization.”
Let’s start with that abbreviation: for those who haven’t seen “i18n” before, it means “‘i’, then 18 letters, then ‘n’.” You see it all the time, because “internationalization” is just way too long for man or beast. There are a few other patterns along this line; I’ve seen “l10n” for “localization” and so on. And you can call me “T1m.”
All the discussion so far on the psychology and mechanics of search holds pretty well across languages and character sets. This essay isn’t going to tell you everything you need to know about dealing with i18n issues in the search arena—that would fill a substantial book. Rather, I’ll try to outline the issues that people who deploy search systems, or make serious use of them, should probably be aware of; also I’ll recommend a substantial book that covers much of this territory.
Unicode, D’oh · Anyone who’s creating search technology these days totally has to base the guts on Unicode. I’ve written before both on why this technology is the only sane choice and why it’s not that hard for programmers. I personally would recommend using UTF-8 for the low-level machinery but, particularly if you’re in Java- or C#-land, you might find yourself more comfy in UTF-16. In any case, to do this kind of thing, you do need to educate yourself on the issues.
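To make that concrete, here’s a tiny Java sketch (the sample string is my own invention) showing the same text occupying different numbers of bytes depending on which encoding you pick; either one can represent every character you’ll meet.

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
      public static void main(String[] args) {
        // Hypothetical sample mixing plain Latin, accented Latin, and Kanji.
        String sample = "Québec 東京";
        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = sample.getBytes(StandardCharsets.UTF_16BE);
        // UTF-8 spends one to four bytes per character; UTF-16 spends two
        // (four for characters outside the Basic Multilingual Plane).
        System.out.println("UTF-8:  " + utf8.length + " bytes");
        System.out.println("UTF-16: " + utf16.length + " bytes");
      }
    }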
Know Your Language · If you are going to be indexing text for full-text search, it’s really important that you know what language that text is in. There are a bunch of different ways to ascertain this. Most computer operating systems provide a way for a user to establish a language environment. Some data formats, including XML and HTML, include special markup that can be inserted to say what language the content is in.
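Here’s a rough sketch, in Java, of where those two kinds of hints might come from: the operating system’s locale via Locale.getDefault(), and a document-level declaration like xml:lang, which I’ve faked up as a hard-coded string since how you actually extract it depends on your parser.

    import java.util.Locale;

    public class LanguageHints {
      public static void main(String[] args) {
        // The user's language environment, as the operating system reports it.
        Locale userLocale = Locale.getDefault();
        System.out.println("OS says: " + userLocale.toLanguageTag());

        // A document-level hint, e.g. xml:lang="fr-CA" in XML or <html lang="ja">.
        // Hypothetical value; a real indexer would pull it out of the markup.
        String declared = "fr-CA";
        Locale docLocale = Locale.forLanguageTag(declared);
        System.out.println("Document says: " + docLocale.getDisplayName(Locale.ENGLISH));
      }
    }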
Inevitably, in any ambitious internationalized search application, you will encounter texts whose language the system doesn’t know, and that’s just too bad.
Ignore Collation · Suppose you want to sort a bunch of words alphabetically, or as geeks say, “lexically.” It turns out that different cultures around the world, even the ones that use our alphabet, have different ways of doing this; for example, traditional Spanish sorting puts “ch” after all the other “c” words, and “ll” after “l”. Most operating systems allow you to set your “locale,” which affects these sorting rules and so on. For the purposes of sorting postings and structuring your index, you can ignore all this and just sort word-lists and so on in plain Unicode code-point order.
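A little sketch of the distinction, assuming Java: a locale-sensitive Collator for anything a human will read, plain code-point order for the index internals. (Whether a given platform’s Spanish collation tables still apply the traditional “ch” rule varies, so don’t take the exact display ordering on faith.)

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Locale;

    public class TwoKindsOfSorting {
      public static void main(String[] args) {
        String[] words = {"llama", "luz", "chico", "color"};

        // Locale-sensitive order, for display to a Spanish-speaking user.
        String[] forPeople = words.clone();
        Arrays.sort(forPeople, Collator.getInstance(new Locale("es")));

        // Raw code-point order, fine for postings and index structures.
        String[] forTheIndex = words.clone();
        Arrays.sort(forTheIndex);

        System.out.println("For people:    " + Arrays.toString(forPeople));
        System.out.println("For the index: " + Arrays.toString(forTheIndex));
      }
    }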
On Case and Diacritics · One of the trickiest issues in internationalized search is upper and lower case. This notion of case is limited to languages written in the Latin, Greek, and Cyrillic character sets. English-speakers naturally expect search to be case-insensitive, if only because they’re lazy: if Nadia Jones wants to look herself up on Google she’ll probably just type in nadia jones and expect the system to take care of it.
So it’s fairly common for search systems to “normalize” words by converting them all to lower- or upper-case, both for indexing and queries.
The trouble is that the mapping between cases is not always as straightforward as it is in English. For example, the German lower-case character “ß” becomes “SS” when upper-cased, and good old capital “I”, when down-cased in Turkish, becomes the dotless “ı” (yes, they have “i”; its upper-case version is “İ”). I have read (but not verified first-hand) that the rules for upcasing accented characters such as “é” are different in France and Québec. One of the results of all this is that software such as java.lang.String.toLowerCase() tends to run astonishingly slowly as it tries to work around all these corner cases.
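You can watch a couple of those corner cases in action with nothing fancier than the standard library; note that you have to tell it which locale you mean, otherwise the Turkish example comes out wrong. A small sketch:

    import java.util.Locale;

    public class CaseCorners {
      public static void main(String[] args) {
        // German: the lower-case "ß" has no single-character upper-case form.
        System.out.println("straße".toUpperCase(Locale.GERMAN));   // STRASSE

        // Turkish: "I" down-cases to dotless "ı", and "i" up-cases to dotted "İ".
        Locale turkish = new Locale("tr");
        System.out.println("DIŞ".toLowerCase(turkish));            // dış
        System.out.println("istanbul".toUpperCase(turkish));       // İSTANBUL
      }
    }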
That discussion of French highlights the problem of diacritics. To a Francophone, the letters “e” and “é” are quite different, but an English-speaker like me might well type in “quebec” and expect to find “Québec”.
Another member of this family of problems is the German umlaut; in general, umlauted German vowels can be replaced by the same letter with a following “e”; this rule would regard Müller and Mueller as equivalent. In German-speaking countries they like to write official rules codifying spelling, and in Switzerland this equivalence is officially blessed. But I have been told of at least one Swiss family named Mueller who are indignant when mis-identified as Müller.
Unless you’re actually implementing a search system, you don’t have to track down and implement all these rules; but even users of search systems can benefit from knowing what’s going on.
People who build search systems may choose to address some of these problems by treating variant spellings, for example “Québec” and “quebec”, as synonyms.
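If a system does go that route, one common (if blunt) approach is to decompose accented characters and strip the combining marks; here’s a sketch in Java. Note that it would fold “Müller” to “muller” rather than “mueller”, so the Swiss preference mentioned above would need separate handling.

    import java.text.Normalizer;
    import java.util.Locale;

    public class AccentFolding {
      // Blunt normalization: lower-case (locale-neutrally), decompose, drop combining marks.
      static String fold(String term) {
        String lowered = term.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lowered, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
      }

      public static void main(String[] args) {
        System.out.println(fold("Québec"));   // quebec
        System.out.println(fold("Müller"));   // muller, not mueller
      }
    }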
What’s a Word? · Probably the trickiest problem in internationalized search is figuring out what words ought to be indexed. In English, it’s terribly easy; words have spaces between them and that’s how you spot them. There are some judgement calls—for example, whether “mis-spoke” is treated as one word or two—but it’s not rocket science.
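For what it’s worth, even the stock BreakIterator in Java handles the easy Western cases; here’s a sketch (don’t expect it to settle the “mis-spoke” question for you).

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    public class SimpleWords {
      // Walk the locale-aware word boundaries and keep the chunks that contain
      // at least one letter or digit (the rest is spaces and punctuation).
      static List<String> words(String text, Locale locale) {
        List<String> found = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
          String chunk = text.substring(start, end);
          if (chunk.codePoints().anyMatch(Character::isLetterOrDigit)) {
            found.add(chunk);
          }
        }
        return found;
      }

      public static void main(String[] args) {
        System.out.println(words("She mis-spoke, didn't she?", Locale.ENGLISH));
      }
    }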
Some Western languages make heavy use of compound words, which present interesting indexing problems. For example, you might hear a German refer to the “Bahnhofkiosk”, the newsstand at the train station. It wouldn’t be surprising to a German to have this turn up in a search for either “Bahnhof” or “Kiosk”. But they might not see that as a basic requirement, either. And of course “Bahnhof” is itself a compound word.
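A system that wants “Bahnhofkiosk” to answer queries for “Bahnhof” or “Kiosk” has to decompound, and the naive version is just dictionary lookup. Here’s a toy sketch with a made-up four-word dictionary; real decompounders also have to cope with linking letters, recursion, and ambiguity.

    import java.util.List;
    import java.util.Set;

    public class ToyDecompounder {
      // Split a compound into two known dictionary words, if possible.
      static List<String> split(String compound, Set<String> dictionary) {
        for (int i = 1; i < compound.length(); i++) {
          String head = compound.substring(0, i);
          String tail = compound.substring(i);
          if (dictionary.contains(head) && dictionary.contains(tail)) {
            return List.of(head, tail);
          }
        }
        return List.of(compound);   // give up, index it whole
      }

      public static void main(String[] args) {
        Set<String> dictionary = Set.of("bahnhof", "kiosk", "bahn", "hof");
        System.out.println(split("bahnhofkiosk", dictionary));   // [bahnhof, kiosk]
      }
    }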
CJK · The problem gets really severe in Asian languages such as Chinese, Japanese, and Korean, which don’t put spaces between words at all. The problems of processing text in these three languages are similar enough that the acronym “CJK” is commonly used to describe them as a whole. Some but not all issues arising in processing Vietnamese are similar, so you’ll occasionally see discussions of “CJKV”. I’ll repeat my recommendation from an earlier essay that if you need to do serious text processing in the CJK domain and you’re not already a native speaker with computer programming experience in the space, you should purchase Ken Lunde’s excellent book on the subject from O’Reilly.
The big problem here is how to break the stream of characters into words without spaces to help. I have the most experience with Japanese, so here are a couple of introductory notes on how it is written: in a combination of four different character sets. The first and largest is the set of characters originally adapted from Chinese, called Kanji. Then there are two different syllabic alphabets called “Hiragana” and “Katakana”, collectively known as “kana”. For example, my name in Katakana is チムブレー, and in Hiragana (but normally you’d use Katakana for a foreigner’s name) is ちむぶれー. Finally, Latin characters make regular appearances in Japanese texts. For example, “Sony” is not the English version of a Japanese word: that company’s name is written with four Latin letters.
In Japanese, some words are composed entirely of Kanji, some entirely of kana, and quite a few of some Kanji followed by some kana. You can use these patterns to help spot words; for example, the transition from Latin to Kanji is a reliable word boundary, and (apart from some well-known exceptions) so is the transition from kana to Kanji. But that leaves a lot of places where there is no obvious word boundary.
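That script-transition heuristic is easy to mechanize, because Unicode records which block each character belongs to. Here’s a sketch; the sample phrase is my own made-up “Sony’s Tokyo office”, which runs through Latin, hiragana, Kanji, and katakana in one short phrase.

    public class ScriptSpotter {
      // Label each character's script; a change of label is a candidate word boundary.
      static String scriptOf(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        if (block == Character.UnicodeBlock.HIRAGANA) return "hiragana";
        if (block == Character.UnicodeBlock.KATAKANA) return "katakana";
        if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) return "kanji";
        if (block == Character.UnicodeBlock.BASIC_LATIN) return "latin";
        return "other";
      }

      public static void main(String[] args) {
        String text = "Sonyの東京オフィス";
        text.codePoints().forEach(cp ->
            System.out.println(new String(Character.toChars(cp)) + "  " + scriptOf(cp)));
      }
    }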
So in the CJK space, there are a couple of strategies that have been adopted. You can go along through the text, looking at each character and checking whether it and the characters following it constitute a word by looking them up in a dictionary, and if so, indexing that word. You can worry about apparently-overlapping words, or you can just go ahead and index everything that looks like a word. This can make indexing seriously expensive if the dictionary is big, which it should be.
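Here’s a toy version of the dictionary approach, using the classic headache 東京都 (“Tokyo Metropolis”), which contains both 東京 (Tokyo) and 京都 (Kyoto) as substrings; the three-word dictionary and the two-character limit are made up for the example.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    public class GreedyDictionaryLookup {
      // At each position, collect every dictionary word that starts there,
      // overlaps and all; a real segmenter would have to disambiguate.
      static List<String> candidates(String text, Set<String> dictionary, int maxLen) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
          for (int len = 1; len <= maxLen && i + len <= text.length(); len++) {
            String candidate = text.substring(i, i + len);
            if (dictionary.contains(candidate)) {
              hits.add(candidate);
            }
          }
        }
        return hits;
      }

      public static void main(String[] args) {
        Set<String> dictionary = Set.of("東京", "京都", "都");
        System.out.println(candidates("東京都", dictionary, 2));   // [東京, 京都, 都]
      }
    }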
Alternatively, you can just treat each character as a separate word and index them all; when someone types in a five-character search you treat it as a five-word phrase search. Since the average word (in Japanese anyhow) has a length of between two and three characters, this seems like a weird strategy, likely to produce all sorts of “false-positive” hits.
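And here is the everything-is-a-one-character-word strategy in miniature: the index records each character’s positions, and a query is answered by insisting that its characters turn up at consecutive positions. The class and the sample text are my own; the position lists stand in for what a real engine’s postings would hold.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class OneCharacterIndex {
      // Postings: each character maps to the positions where it occurs.
      private final Map<Character, List<Integer>> postings = new HashMap<>();

      OneCharacterIndex(String text) {
        for (int pos = 0; pos < text.length(); pos++) {
          postings.computeIfAbsent(text.charAt(pos), c -> new ArrayList<>()).add(pos);
        }
      }

      // Treat the query as a phrase: its characters must appear back-to-back.
      List<Integer> phraseMatches(String query) {
        List<Integer> matches = new ArrayList<>();
        for (int start : postings.getOrDefault(query.charAt(0), List.of())) {
          boolean consecutive = true;
          for (int i = 1; i < query.length(); i++) {
            List<Integer> positions = postings.getOrDefault(query.charAt(i), List.of());
            if (!positions.contains(start + i)) {
              consecutive = false;
              break;
            }
          }
          if (consecutive) {
            matches.add(start);
          }
        }
        return matches;
      }

      public static void main(String[] args) {
        OneCharacterIndex index = new OneCharacterIndex("東京都の東京タワー");
        System.out.println(index.phraseMatches("東京"));   // [0, 4]
      }
    }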
I don’t have personal experience with either Chinese or Korean, but in Japanese, in my experience, a combination of heuristics for dealing with strings of kana and this trick of indexing every character works remarkably well. That is to say, when ordinary Japanese people type in ordinary queries, this method seems to produce reasonable results that don’t surprise them, with few enough false positives that it isn’t a problem. I suspect there’s a deep lesson in the structure of the Japanese language lurking here, but it’s beyond my talents.
Looking for Help · All this is tricky stuff. Fortunately, you don’t have to go to the mat with it unless you’re actually writing a search engine. And if you are writing a search engine, you’re not entirely on your own. I’ve run across lots of library-ware over the years, Perl modules and IBM Alphaworks packages and so on, for helping you process text in Chinese and Hungarian and Hebrew and lots of others.
For the person who’s deploying or making serious use of a search engine, the real lesson is that this stuff is more complicated than you think it is, and that you can’t ignore it.