This is the first of a series on search, by which I mean full-text search.
Anyone who uses computers now uses search pretty well every day, so this is
an important chunk of our technology spectrum.
This piece covers the business and history angles; future instalments will
explain how search engines work and the interfaces to them.
I plan to conclude with a description of the next search engine, which
doesn’t exist yet but someone ought to start building.
(Updated: Microsoft Indexing found.
Slashdot search explained)
The Number of the Beast · When I went to my first-ever computer trade show (this would be in the early Eighties) I ran across a booth for a search system; they were running on an IBM PC, which at that time would have been a 4.77MHz 8088 processor if I recall correctly. They had an electronic Bible and were happy to demo. At that time I knew more or less nothing, but I wanted to show them up, so I asked them to look up “The Number of the Beast,” which appears of course in Revelation, near the end of the book. It turned up the answer in a second or so, and at that point I was hooked.
I worked on the New Oxford English Dictionary project in the late Eighties at the University of Waterloo, where we were refining a search engine named “Pat”, developed by Gaston Gonnet; we eventually spun the OED Project technology out into a company called Open Text, and one of our core products was Pat; we were a search company.
I worked there until 1996, and built one of the first commercial Web search engines, the Open Text Index; at one point, there was us and Lycos and Infoseek and that’s about all. Antarctica’s Visual Net contains a pretty nice built-in full-text search engine (if I say so myself), the first one I’ve built that really does Unicode right.
All of which is a long-winded way of saying that while anything I say about full-text search may be wrong, it’s not wrong on the basis of inexperience.
The “Search Business” · A usage note: When someone says they’re in the Search business, they usually mean they’re a head-hunter, i.e. recruiter. Knowing this can help avoid confusion.
STAIRS and More · As I said, at Open Text we got into the commercial search business around 1990, only to discover that it was a crowded place. I can’t find an authoritative history of search, but as far as I know the first player really to get mainstream market share, back in the days when “computer” meant “IBM Mainframe”, was naturally a mainframe product from IBM, namely STAIRS (for Storage and Information Retrieval System). When I went poking around to research this article I discovered, much to my delight, that SearchManager/370, a STAIRS successor, still exists. I bet it’s damn well debugged.
Back in those days there was a thundering herd of full-text search products; here are the ones I can dredge out of my memory with some assistance from the ODP: AskSam, Information Dimensions’ Basis (sold to Open Text), BRS (Open Text too), Context (from Oracle), Excalibur (now Convera), Fulcrum (now part of Hummingbird), ISYS, PLS (bought by AOL and closed), Verity, Thunderstone, and ZyIndex.
There was another interesting now-vanished sub-species of search vendors: people who shipped special-purpose hardware for searching high text volumes without pre-indexing. The only one I can remember is “Fast Data Finder.”
Open Text, by the way, is now a general content-management vendor with several lines of good search software.
One other historical player deserves separate mention: WAIS, spun out of
Thinking Machines by Brewster Kahle, which for a few brief moments in 1993 or
thereabouts had 100% mindshare in the nascent Web-space.
There was even a wais: URI scheme.
These days http://www.wais.com points at Hummingbird; I
don’t know how that happened.
Today · Many of the old players are still out there, and the new faces who got famous in the Internet Search business are also trying to make money selling search software: Google, Atomz, Inktomi, and so on. I have neither the time nor the space to conduct a tour of current commercial full-text search offerings. But there are some things that are obvious to the educated observer.
It’s Expensive · If you want to go out and buy a search subsystem that’s going to do a good job on a large set of data, you’re going to end up spending serious money, well into the six digits.
It’s Commoditized · This is the part of the essay that’s going to get the vendors mad, because it’s the number-one dirty secret: All search engines work more or less the same, and offer more or less the same APIs, and provide more or less the same quality of result.
I’ll go into detail in a later essay in this series, but the fact of the matter is that there really hasn’t been much progress in the basic science of how to search since the seventies.
There are some important differences, at the margins: some engines are much better than others at in-place updates, some are better internationalized, some can handle more file formats. But these aside, they’re much of a muchness.
Basic Website Searching · Suppose you’re putting up a website and you want to offer full-text search of your own content. In fact, I’m eventually going to have to do this for ongoing; I’m already spending too much time looking for some previous article to refer back to. Here are your good choices:
Use Microsoft IIS? · Pre-XP, Microsoft’s IIS used to come with a pretty good little full-text search engine called Index Server or Indexing Services, depending on which version you had. It did more or less exactly what you’d want for a web-site; you pointed it at a directory in web-space, and it arranged for all the HTML to be searchable. It came with a nice templating facility, so you could make your search screen and results list look like part of your site. Best of all, it was zero-maintenance; just turn it on and it kept track of what was changed and added and deleted, and updated the index automatically.
Of course, this only works for static content, but that’s an important part of the problem.
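The zero-maintenance part, noticing what was added, changed, and deleted, is easy to picture. Here is a minimal sketch in Python of one way to do it, by comparing two scans of a directory tree; this is not how Index Server actually worked (which I don’t know), and the names `snapshot` and `diff_snapshots` are made up for illustration.

```python
import os

def snapshot(root):
    """Map each .html file under root to its last-modified time."""
    seen = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                path = os.path.join(dirpath, name)
                seen[path] = os.path.getmtime(path)
    return seen

def diff_snapshots(old, new):
    """Return (added, changed, deleted) paths between two snapshots."""
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    # A file counts as changed if it is in both scans with a different mtime.
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    return added, changed, deleted
```

An indexer built this way would keep the previous snapshot around, take a fresh one on a timer, and re-index only the added and changed files while dropping the deleted ones from the index.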
I couldn’t find a pointer to the contemporary version of this on the always-execrable Microsoft webfarm, but Deepak Shetty did (thanks).
Use Google · Anyone who has a reasonably popular site knows how often Google’s robots come to call, so why not use them and get something back? The form below will search for anything here on ongoing; give it a try.
So, what’s not to like? Google is, after all, pretty good, and it’s free (well, not entirely, I’m paying for the bandwidth their crawlers use up). And indeed, this is a lot better than no search.
But many people who run websites do have a bit of a control-freak streak, and that’s a problem. Because the result page is a Google result page, not an ongoing result page. And the ranking is a Google ranking, which works pretty well on the mega-scale of the Web, but I may have my own opinions about what’s important here on ongoing.
The branding issue you can work around, by wrapping your own UI around Google’s search API. Of course, there are only a certain number of free searches, and beyond that you have to pay. That isn’t unreasonable on the face of it; Google is in this case providing a service and not getting any branding or advertising sugar out of the deal.
Use an Open-Source Tool · There are any number of open-source tools out there for searching. Rather than pick one or two from the pack and give pointers to them, I’ll give a short list of a few of the better-known names; if you cut and paste this into Google, you’ll reliably get some survey sites that will mention all the ones worth looking at. Here goes: Lucene, ht://dig, SWISH-E.
Check ’em out yourself; there are a few webloggers out there who are using these things. So far, I’m unsatisfied. Each of the ones I’ve looked at has a problem (lightly/poorly maintained, scalability problems, lack of internationalization, awkward API).
But there is some good stuff out there; for example Slashdot’s search engine seems to run smooth, clean, and fast.
JY Stervinou, Ray G from Yahoo, Damien Bonvillain, and Robin Berjon wrote to explain that it uses MySQL’s search builtins, but refuses to search for words of fewer than four characters; hmm.
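To see why a length cutoff like that makes short words unfindable, here is a toy inverted index in Python. It is a sketch under assumptions, not MySQL’s implementation: the constant `MIN_WORD_LEN` and the functions `tokenize`, `build_index`, and `search` are hypothetical names, and the cutoff of four simply mirrors the behavior described above.

```python
import re
from collections import defaultdict

MIN_WORD_LEN = 4  # hypothetical cutoff, mirroring the four-character limit

def tokenize(text):
    """Lowercase the text and keep only words at least MIN_WORD_LEN long."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= MIN_WORD_LEN]

def build_index(docs):
    """Build a word -> set-of-document-ids inverted index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every indexable query word."""
    words = tokenize(query)
    if not words:
        # Every query word fell under the cutoff, so nothing can match.
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Because short words are thrown away at indexing time, a query for something like “XML” has nothing to look up; the word was never in the index in the first place.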
What We Need · What we need is for Apache to come out-of-the-box with a built-in search capability that you just push a button and it works, and it’s fast, and doesn’t need much care and feeding, and it’s internationalized, and it has the right API for when you want to get fancy.
Building one would be nontrivial, but not really backbreaking either, since you probably wouldn’t have to start from scratch. I’ll have lots more to say on that.