This is the last in my series of On Search essays. I’ve written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I’d like to change this part of the world. In short, I’d like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I’ll write something on how it might get built.
Naming the Baby · An important piece of software needs to have a name, but that takes time and creativity and can wait; for now I’ll just call this thing the Basic Resource Finder (BRF).
Requirements · Then a couple of non-requirements and a conclusion.
BRF is Open-Source · My heartfelt apologies to anyone still trying to make a go of it in server-side search; but that business is just so over. It always was a lousy business, nobody has ever made real money there on a sustained basis, and yet it’s something that every Web deployment needs. For a substantial site you can easily drop six figures for a search engine, and all the bells and whistles that buys you are mostly not cost-effective.
So BRF is going to be open-source. That doesn’t mean that you can’t make money with search software; it just means you have to do it in services. There are always going to be search deployments loaded with tricky implementation and deployment work: figuring out where the data is, aggregating it, cleaning it up, building the workflows so these things keep happening, maintaining some application-specific synonyms, the list goes on and on, and none of these things are free. And they are much better things to spend money on than software licenses.
… is Web-Centric · Web technology is a natural basis for a search system, most obviously because the Web is where people are used to encountering search. But as I’ve argued previously, the Web also naturally provides the kind of interfaces a search engine needs. So even when you’re deploying a search app in a non-Web context, if that software needs to talk to a Web server behind the scenes to get the searching done, I don’t see that as a problem.
Specifically, what I mean by this requirement is that BRF’s interfaces to the world are more or less as described in my previous essay on the subject.
… Interfaces to Everything · Although stated as a requirement, this comes for free as a consequence of being Web-centric. You can update BRF’s index and run searches against it from anything that can interchange XML messages, i.e. from pretty well anywhere. For the same reason, BRF doesn’t care in the slightest what programming language you want to use, what application server you like, or what operating system you’re running.
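To make the "anything that can interchange XML messages" point concrete, here is a minimal sketch of a client building an index-update message with nothing but a standard library. The `<update>`, `<doc>` and `<field>` element names are invented for illustration; this essay does not define BRF's wire format.

```python
import xml.etree.ElementTree as ET

def build_update(uri: str, fields: dict[str, str]) -> bytes:
    """Serialize one document update as a UTF-8 XML message.

    The update/doc/field vocabulary is hypothetical, not a BRF spec.
    """
    root = ET.Element("update")
    doc = ET.SubElement(root, "doc", {"uri": uri})
    for name, text in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = text
    # encoding="utf-8" makes tostring return bytes ready for an HTTP body
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)

message = build_update(
    "http://www.example.com/about",
    {"title": "About Example Corp", "body": "We make examples."},
)
```

The resulting bytes could be sent as the body of an HTTP POST from any language with an XML library and an HTTP client, which is the whole point of the requirement.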
… is Part of Apache · This may be a little controversial but I think it’s a no-brainer. Apache has a mature, sophisticated, well-thought-through programming infrastructure that includes the low-level functions a developer needs, delivered in just the way they’re most needed in a Web context. Also, it’s already running a large proportion of the world’s Web servers.
For the rest of the world, this means that if you have a Web site, you can ask your tech admin “Please Turn On Search” and the admin will say “Okay” and your web site will have search running by that time the next day.
This requirement may have some consequences. I’m not sure that the Apache Java subsystems are well-enough integrated in the basic install that they meet BRF’s low-barrier-to-entry requirement. Put another way, I think it may not be acceptable, in a lot of implementations, to have to deploy and configure a JVM to “Turn On Search.”
… is Internationalized · Given all my ranting here at ongoing on the subject, this can’t come as a surprise. Clearly, BRF has to be Unicode-based, internally.
Even if this weren’t a good idea, it would be forced by the use of the XML-over-HTTP interface I’ve already required; once you’ve committed to accept XML, the sender can send you text in any weird assortment of Unicode characters.
… Comes with a Bunch of Document Readers · These days, you could be forgiven for thinking that all you need to index and search is HTML; but you’d be wrong. A huge proportion of our species’ intellectual capital is locked up in Word and PowerPoint and PDF and lots of non-HTML XML vocabularies. For BRF to be useful, it’s going to have to come on day one with a suite of software that can read most of these formats at least enough to be useful.
Because of the strongly-decoupled nature of the BRF interfaces, of course, any such document reader is plug-replaceable. There might be a line of business for someone in selling a line of software that does a better job of tokenizing Chinese word-processor documents or whatever.
… Comes with a Robot · A lot of the text that you’d like to be able to search is scattered around the Web or around your own Intranet. So BRF is going to have to include a good basic spider; I’ve covered the basics here previously.
The spider will be rule-driven, where the rules are patterns that match the set of URIs that you want crawled (or excluded). For example, if you’re the CIO at Example Corp, you might tell BRF that you want everything in http://corp.example.com and http://www.example.com and http://hr.example.com/public/ indexed. And if your number-one competitor is Ejemplo Sistemas of Guadalajara you’d also have the BRF robot get everything it can from http://*.ejemplo.com.
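The Example Corp rules can be sketched as a small matcher. The rule format here (a host pattern plus a path prefix) is an assumption made for illustration, as is accepting https alongside http; the essay only says that rules are patterns over URIs.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule format: (host pattern, path prefix).
INCLUDE = [
    ("corp.example.com", "/"),
    ("www.example.com", "/"),
    ("hr.example.com", "/public/"),
    ("*.ejemplo.com", "/"),
]

def should_crawl(uri, include=INCLUDE, exclude=()):
    """Return True if uri matches an include rule and no exclude rule."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    path = parts.path or "/"

    def matches(rules):
        return any(
            fnmatchcase(host, host_pattern) and path.startswith(prefix)
            for host_pattern, prefix in rules
        )

    return matches(include) and not matches(exclude)
```

With these rules, `http://hr.example.com/public/benefits.html` is crawled and `http://hr.example.com/private/salaries.html` is not. Note that `*.ejemplo.com` as written here matches subdomains only, so the bare host `ejemplo.com` would need its own rule.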
… Comes with a Filesystem Walker · In a lot of cases, the material you’d like indexed is sitting in a filesystem. Even if it’s exposed through a Web server, it’s going to be way more efficient to read it out of the filesystem for indexing. So BRF is going to have a smart filesystem walker.
On modern versions of Windows, the walker can be super-smart and subscribe to changes in interesting parts of the filesystem to make sure that part of the index is really up-to-date.
Of course, the filesystem walker would need to use the file readers for tokenizing what it finds, and of course it would use the standard message-passing API to talk to BRF just like any client software.
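Where change notification isn't available, a walker can fall back to polling. This is a minimal sketch under that assumption: it compares modification times against what it saw on the previous pass. The function name and the `last_seen` dictionary are illustrative, not part of any BRF design.

```python
import os

def changed_files(root, last_seen):
    """Yield paths under root that are new or modified since the last pass.

    last_seen maps path -> mtime in nanoseconds and is updated in place.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime_ns
            except OSError:
                continue  # file vanished between listing and stat
            if last_seen.get(path) != mtime:
                last_seen[path] = mtime
                yield path
```

Each path this yields would then be handed to the appropriate document reader and sent to BRF as an ordinary update message. Detecting deletions would need a second pass over `last_seen`, which this sketch leaves out.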
… is Self-Managing · This is crucial: BRF needs to be fire-and-forget. Business users and their web admin people don’t want to think about indexing runs and space compression or thesaurus administration: they just want to say “Turn On Search.”
BRF, once you start it, updates its index while it’s running; it never needs to be initialized or re-organized or optimized or tuned.
… Keeps Running · BRF has to be exceptionally, remarkably robust. This means that when the system crashes, you restart it and it just restarts. Even if it was in the middle of updating its index, it recovers without any damage and without any loss of information; if you were in the middle of transmitting an update and you hadn’t seen your HTTP acknowledgement back yet, that update might get lost; but once BRF acknowledges an update, that update will get done and won’t be lost.
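One common way to get that guarantee is an append-only log that is forced to disk before the acknowledgement goes out. The sketch below swaps in that technique for illustration; it is not a claim about how BRF would actually be built, and the `UpdateLog` class is hypothetical.

```python
import json
import os

class UpdateLog:
    """Append-only log: an update is acknowledged only once it is on disk."""

    def __init__(self, path):
        self.path = path

    def append(self, update: dict) -> bool:
        line = json.dumps(update) + "\n"
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())  # force to stable storage before acknowledging
        return True  # caller may now send the HTTP acknowledgement

    def recover(self) -> list:
        """Return all fully written updates; truncate a torn final line."""
        updates = []
        if not os.path.exists(self.path):
            return updates
        good = 0
        with open(self.path, "rb+") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # partial write from a crash; it was never acknowledged
                updates.append(json.loads(raw))
                good += len(raw)
            f.truncate(good)
        return updates
```

After a crash, `recover()` returns exactly the updates that were acknowledged, which the engine can re-apply to its index. A half-written final record is discarded, matching the case where the client never saw its acknowledgement.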
… is XML-Capable · Frankly, I’m far from convinced that there are going to be lots of big repositories full of XML, or that querying XML is going to be a high-value business problem. But I’m also not convinced that there won’t and it won’t. In any case, BRF has to have enough hooks in its core machinery that it can be used to build an efficient implementation of XQuery (whether or not I like it).
Fortunately, this is a problem that’s been solved in the past, and the index structures you need to do this are not child’s play but they’re not out of reach either.
… is Fast · BRF isn’t worth doing unless it’s fast enough to be pleasant to use, even for lots of users on large amounts of text. Interesting numbers to measure would be how many search results per second it can serve, how many millions of postings updates per hour it can soak up, and what the aggregate search-and-index performance is.
I’m not ready to provide numeric targets yet, and it’s crazy to try to engineer around bottlenecks before you know where they are, and in a complex system you never know where they are until you build the system. So what you do is try to avoid anything egregiously stupid in the design, and assume that there will be bottlenecks, and instrument the system from day one so that when it’s not fast enough, you can find the bottlenecks and then fix them.
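Instrumenting from day one can be as cheap as a timing wrapper around each stage. This is a minimal sketch; the stage names and the `timed` and `slowest` helpers are made up for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# stage name -> [call count, total seconds]
STATS = defaultdict(lambda: [0, 0.0])

@contextmanager
def timed(stage):
    """Accumulate wall-clock time spent inside the with-block under stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = STATS[stage]
        entry[0] += 1
        entry[1] += time.perf_counter() - start

def slowest(n=3):
    """Return the n stages with the largest total time, slowest first."""
    return sorted(STATS.items(), key=lambda kv: kv[1][1], reverse=True)[:n]

with timed("tokenize"):
    tokens = "the quick brown fox".split()
```

When the system turns out to be too slow, `slowest()` points at where the time is going, rather than where someone guessed it would go.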
… is Ready for PHP, JSP, and Friends · Let’s imagine someone’s “Turned On Search” for a website; they want a way to feed queries in and get results out. Most people build web sites with one kind of templating facility or another; well-known examples are PHP, JSP, then there’s all the blogging tools and any number of different portal-ware offerings.
I don’t have a lot of experience with this kind of software since I’ve always cooked my own website engines, but I suspect that probably BRF on day one would need support for PHP, JSP, and basic Apache server-side includes.
Hopefully, good open-source implementations of these things would provide enough encapsulated experience to make it easy to build BRF into the next couple of dozen templating software modules.
… Has Programmable Ranking · BRF will have built in most of the lore on result ranking I wrote up earlier in this series, with the possible exception of Latent Semantic Indexing. Crucially, it will have some facilities to make it easy to feed back popularity and usage counts into the ranking heuristics.
Also, there are lots of deployments where the data owner knows perfectly well what should be used to rank results and doesn’t want the engine’s opinion getting in the way; BRF needs to have hooks for this too.
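A ranking hook could be as simple as letting the caller supply the scoring function. In this sketch the hit fields (`text_score`, `clicks`, `priority`) and the weighting in `default_score` are all assumptions chosen for illustration, not a proposed ranking formula.

```python
import math

def default_score(hit):
    """Engine's opinion: text relevance nudged by a damped usage count."""
    return hit["text_score"] + 0.1 * math.log1p(hit.get("clicks", 0))

def rank(hits, scorer=default_score):
    """Order hits best-first using whichever scorer the deployment supplies."""
    return sorted(hits, key=scorer, reverse=True)

hits = [
    {"uri": "/a", "text_score": 1.0, "clicks": 0, "priority": 1},
    {"uri": "/b", "text_score": 0.9, "clicks": 500, "priority": 3},
    {"uri": "/c", "text_score": 0.5, "clicks": 2, "priority": 9},
]

by_engine = [h["uri"] for h in rank(hits)]
by_owner = [h["uri"] for h in rank(hits, scorer=lambda h: h["priority"])]
```

Here the default scorer lets the popular `/b` overtake `/a`, while the data owner's scorer ignores the engine entirely and sorts on its own `priority` field.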
… Does Booleans, Phrases, And So On · Of course, BRF will search for phrases as well as words. Also, it’ll do what most people think of as “Boolean Querying” but in fact is set arithmetic on postings lists. The query language will be what today’s Web Search engines seem to have found consensus on.
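The "set arithmetic on postings lists" idea fits in a few lines. This sketch simplifies a postings list down to a set of document IDs; real postings would also carry word positions, which phrase search needs, and the tiny corpus is invented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "open source search", 2: "web search engine", 3: "open web"}
index = build_index(docs)

both = index["open"] & index["search"]     # open AND search
either = index["open"] | index["web"]      # open OR web
without = index["search"] - index["web"]   # search AND NOT web
```

Intersection, union and difference on those sets are the whole of what users experience as AND, OR and NOT.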
Non-Requirements · A sure way to arrange for BRF to fail would be to try to make it do more than it really needs to; we’re just trying to build something to bring pretty decent search to a Web site near you without you having to spend much time or money to make it happen.
… isn’t Web-Scale · BRF is not going to be built to try to replace Google or anything close to that; which means that there will be no machinery for index partitioning nor massive-scale parallelism. The reason is that I think that the brute-force application of an ordinary server with a lot of RAM ought to be able to provide all the search muscle just about any imaginable enterprise-scale search problem needs. The Web, that’s a different class of problem, and one that only needs to be solved once or twice, and already has been, and the solution isn’t cheap.
… isn’t Intelligent · My suspicion is that search software won’t really be intelligent in my lifetime—see the earlier discussion—and in any case, I wouldn’t have the slightest idea how to begin building real human intelligence into the system, so let’s leave that in the non-goal category.
It Can Be Done · None of the requirements in this essay are out of reach. More important, the combination of all of them isn’t either. We’ve been doing this search thing for a few decades now and the Web thing for a solid decade, and they were made for each other. I’m not the only person out there who could build most of the pieces of this single-handed. Furthermore, quite a few of the pieces have already been built and are out there and are good. V1.0 could be done inside of a year; no biggie.