In this, the sixth instalment of my search saga, a survey of the fuzzy edges of words and their meanings and the (surprisingly moderate) consequences for search systems.
Synonyms and Collisions and Fuzzy Edges · We search in order to find things, and we search with words. Consider: the search software at one time ran on mainframes, but now it runs on Unix boxes. Some of those servers get electricity through cables, but others get power through wires. Which is different, of course, from the power you get with the root password.
Which is to say, sometimes different words mean the same thing, and sometimes the same word means different things. One of my favorite examples: “Time flies like an arrow, but fruit flies like bananas.” How is a search engine to deal with this?
The International Dimension · Of course, the way that words twist and turn around is highly language-dependent. English is what’s called an “inflected” language, which is to say words change their form depending on their grammatical role: verb conjugation, singular/plural, and so on. (Interestingly, “inflection” has a common variant spelling: “inflexion”.) Other languages (for example Turkish and Finnish) are “agglutinative”, where words are formed by combining “morphemes.” A third category is the “analytic” or “isolating” languages, where words do not change and grammatical roles are established by sequences of words. The best-known example is Chinese.
But words in all of these systems remain slippery and the principle holds true that different words can mean the same thing, and the same word can mean different things.
There are lots of language-specific problems, too. Germans readily combine words: consider Bahnhofskiosk, the newsstand in the railway station; a German would not be surprised to have this turn up in a search result list for either Bahnhof or Kiosk.
And finally there is the issue of cross-language synonyms: the most common word in English, “the,” is easily confused with the French word for tea.
Pinker · The study of language is as old as the study of anything, and many fine authors have written many good if self-referential books in this area. And of course the world is full of dictionaries and encyclopedias and other reference works on language.
I wanted to mention a living author who has recently caught my eye: Steven Pinker. I’ve only read Words and Rules, which manages the astounding feat of talking almost entirely about irregular verb forms for several hundred pages and still being amusing, readable and very educational. On that basis, I plan to track down and read some more Pinker.
It’s Not As Bad As It Looks · Clearly, any search engine that wants to be taken seriously had better equip itself with some heavy word-variation machinery, and a sophisticated language-sensitive thesaurus. Also a basic requirement is a linguistically-sophisticated part-of-speech analyzer so that the verb “to fly” is distinguished from the similar-sounding insect.
Right? Wrong. Let’s look at Google. Search for swan then swans; or fly then flew; or wire then cable. On the evidence, the world’s most successful search engine is not doing very much at all in this area.
And in my experience, the effort of wrestling with inflexions and synonyms and antonyms and homonyms and so on, in a search engine, is usually not particularly cost-effective. I’m sure that someone must have done good research on this, but I’m afraid that I don’t have it at my fingertips. But I think that some of the reasons are pretty obvious on the face of it.
Usefulness vs. Variability · First of all, the words that have the most variation in meaning and the most collisions with other words are the common ones. In the Oxford English Dictionary, the three words with the longest entries (i.e. largest number of meanings) are “set,” “run,” and “get.”
And, for obvious reasons, these are also the words that are among the least useful for search. You’re very unlikely to find anything useful by searching for any of these alone. On the other hand, tea set, chicken run, and GET transaction are all apt to produce useful results, and the variability isn’t going to get in your way.
People are Smarter · Most people will quickly figure out that they need to search for “tea set” rather than just “set”, because each human carries around a powerful language analysis and processing engine between his or her ears. In practice, people who need to know have little trouble figuring out that they need to, for example, search for both “cattle ranching” and “beef production” in order to coax a good answer out of Google or equivalent. I’ve never seen language-processing software that I’d expect to do as good a job as a reasonably clever high-school student, particularly one who was really motivated.
So, barring some major AI breakthroughs, I’m really not sure how cost-effective cleverness in this area is.
Information Loss · Any scheme that conflates words hides information, and hiding information should worry you. Suppose, for example, that you decide that since “swans” is the plural of “swan,” you’ll conflate the two in your full-text index. You’ve just made life considerably harder for someone trying to find out about the Sydney Swans Aussie-rules football team. Type “swans” into Google, though, and they’re in the first page of results.
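To make the loss concrete, here’s a toy sketch in Python; the filenames are invented and real stemmers are smarter than rstrip, but the loss is the same in kind:

    # A toy index with naive plural conflation: "swan" and "swans" both
    # land under the key "swan", and the distinction is gone for good.
    index = {}

    def add(doc, words):
        for w in words:
            stem = w.lower().rstrip("s")  # crude stemming, for illustration only
            index.setdefault(stem, set()).add(doc)

    add("waterfowl.html", ["the", "mute", "swan"])
    add("footy.html", ["Sydney", "Swans", "season", "preview"])

    # A query for "swans" gets stemmed the same way, so the index can
    # no longer tell the football team from the bird:
    print(index["swan"])  # {'waterfowl.html', 'footy.html'}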
Where Thesauri Are Useful · This isn’t to say that word-mapping and thesaurus techniques are never useful in search applications. In my experience, they’re more useful the more specific they are, and increasingly less useful as they try to be general. For example, if I’m doing legal research in a business-oriented database, I really want the database to know that “Philip Morris” and “Altria” are really the same company; same for “GTE” and “Verizon”. It’s not hard to think of other very-specific examples in pharmaceuticals, financial reporting, and lots of other specialized domains.
At one time (perhaps under Verity’s influence) many people felt that structured thesauri with IsA and homonym/antonym/synonym mapping and so on were the way to go, but I’m far from convinced that they do enough better than a simple synonym-list to justify the extra effort.
Statistical Techniques · There are other ways than thesauri to improve the recall of search systems. Perhaps the best known is “Latent Semantic Indexing.” I hesitate to dive deep on that because while I’ve read about it, I’ve never actually tried implementing it myself. The idea is that you statistically analyze the word patterns in documents and develop measures of which words cluster together and which words are particularly strongly associated with a document. Then, when someone searches for a particular word, you return documents that are strongly associated with words that are strongly associated with your word, whether they actually contain your word or not. LSI has been getting good reviews from smart people, but, as with many search techniques, how well it works depends a lot on the specifics of the kind of database and the kind of query. For really general-purpose search, I’d still be inclined to bet on the inventiveness of a motivated human searcher.
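For the curious, here’s a minimal sketch of the core idea in Python, using numpy’s SVD. The documents and the rank k are invented, and a real LSI system would weight the matrix (tf-idf and friends) and keep a few hundred dimensions, not two:

    import numpy as np

    # Tiny term-document count matrix. Rows are terms, columns are docs:
    #   d1: "cattle ranching grazing"    d2: "cattle beef production"
    #   d3: "beef production exports"    d4: "swan lake"
    terms = ["cattle", "ranching", "grazing", "beef",
             "production", "exports", "swan", "lake"]
    A = np.array([
        [1, 1, 0, 0],  # cattle
        [1, 0, 0, 0],  # ranching
        [1, 0, 0, 0],  # grazing
        [0, 1, 1, 0],  # beef
        [0, 1, 1, 0],  # production
        [0, 0, 1, 0],  # exports
        [0, 0, 0, 1],  # swan
        [0, 0, 0, 1],  # lake
    ], dtype=float)

    # Truncated SVD: keep only k latent "concept" dimensions.
    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # "ranching" never occurs in d2, but in the rank-k reconstruction it
    # picks up a positive weight there (roughly [0.9, 0.22, -0.2, 0]),
    # because "cattle" ties d1 and d2 together. That cell is the latent
    # association: a search for "ranching" can now surface d2.
    r = terms.index("ranching")
    print("raw counts:   ", A[r])
    print("reconstructed:", Ak[r].round(2))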
Then there’s the technique that Google uses. If you search Google for tim bray, ongoing comes up first on the list, although (last time I checked) my name doesn’t actually appear on the front page. Google, however, is cleverly noticing that many pages containing the string “tim bray” also point to ongoing. This is a very useful technique; of course it depends on having many millions of volunteer hypertext authors building your linkage network to drive the computation. For this reason, it may not work very well on, for example, Intranet search applications.
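I obviously haven’t seen the code, but one plausible mechanical reading of the trick uses the text of the links themselves; here’s a toy sketch, in which every link is invented:

    from collections import defaultdict

    # (source page, anchor text, target page) triples harvested from a crawl.
    links = [
        ("blog-a.example.com", "tim bray", "tbray.org/ongoing"),
        ("blog-b.example.com", "tim bray's ongoing", "tbray.org/ongoing"),
        ("news.example.com",   "tim bray", "tbray.org/ongoing"),
    ]

    # Index each page under the words of the links pointing AT it,
    # not just the words it happens to contain.
    anchor_index = defaultdict(lambda: defaultdict(int))
    for _source, anchor, target in links:
        for word in anchor.lower().split():
            anchor_index[word][target] += 1  # every inbound link is a vote

    def score(query):
        """Rank pages by how often inbound anchor text uses the query words."""
        totals = defaultdict(int)
        for word in query.lower().split():
            for target, votes in anchor_index.get(word, {}).items():
                totals[target] += votes
        return sorted(totals.items(), key=lambda kv: -kv[1])

    # ongoing comes out on top even if "tim bray" never appears on the page.
    print(score("tim bray"))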
Implementation Techniques · Let’s suppose we’ve decided that when someone searches for “Verizon”, we really want them to find “GTE” too. There are basically two ways we can arrange for this to happen.
First, when we analyze a query before we start searching, we can look up the words in it to see if any of them are synonyms that we care about; in this case, we notice the Verizon/GTE link. Then we do two searches, merge the results, and make it look like they searched for Verizon OR GTE.
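In code, the query-rewriting approach might look something like this sketch (the synonym table is hand-made and tiny, which in my experience is how the useful ones start out):

    # Consulted at query time, before the search runs. Hypothetical data.
    SYNONYMS = {
        "verizon": ["gte"],
        "gte": ["verizon"],
    }

    def expand(query_words):
        """Turn ['verizon', 'earnings'] into [['verizon', 'gte'], ['earnings']]."""
        return [[w] + SYNONYMS.get(w.lower(), []) for w in query_words]

    # The search layer treats each inner list as an OR-group, so the user
    # has effectively searched for (verizon OR gte) AND earnings.
    print(expand(["Verizon", "earnings"]))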
Alternatively, we can put the linkage right in the index, at the postings level. So when we write a posting like this:
<posting doc="http://example.com/verizon.html" word="verizon" />
We add another one to the index like so:
<posting doc="http://example.com/verizon.html" word="gte" />
Then, when someone searches for either “Verizon” or “GTE,” this posting will come up and the verizon.html document will be in the result list.
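Here’s the same toy synonym table applied at indexing time instead; a sketch, not anybody’s production indexer:

    # When a posting is written for a word that has synonyms, write extra
    # postings under the synonyms too, so either query term finds the doc.
    SYNONYMS = {"verizon": ["gte"], "gte": ["verizon"]}

    postings = {}  # word -> set of documents containing (or conflated with) it

    def add_posting(doc, word):
        word = word.lower()
        postings.setdefault(word, set()).add(doc)
        for synonym in SYNONYMS.get(word, []):
            postings.setdefault(synonym, set()).add(doc)  # the extra posting

    add_posting("http://example.com/verizon.html", "Verizon")

    # Either query term now finds the page.
    print(postings["gte"])      # {'http://example.com/verizon.html'}
    print(postings["verizon"])  # {'http://example.com/verizon.html'}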
It’s not obvious which approach is better. The postings-list approach is going to run a little slower at indexing time and quite a lot faster at run-time. But it’s less flexible; you don’t get to decide at run-time whether or not to do the synonym search based, for example, on user preferences.
Conclusion · Languages are tricky, slippery, and subtle. Fortunately, your users are better-adapted to using them than any software you’re ever apt to write. So it’s not at all clear that this area of technology is a key to success in your next search project.