Searching is a branch of computer programming, which is supposed to be a quantitative discipline and a member of the engineering family. That means we should have metrics: measures of how good our search techniques are. Otherwise, how can we ever measure improvements in one system or the differences between two systems? “Precision” and “recall” are the most common measures of search performance. But they’re not as helpful as we’d like.
Defining Terms · Recall measures how well a search system finds what you want, and precision measures how well it weeds out what you don’t want. It’s easier to illustrate than explain. Suppose that your collection contains twenty documents relevant to your query, and your search returns sixteen documents, ten of which are relevant.
In this case, you’ve found ten of the twenty relevant documents, so your recall is 50%. Ten of the sixteen you found were relevant, so your precision is 62.5%.
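To make the arithmetic concrete, here’s a minimal sketch in Python; the function and argument names are mine, invented for illustration, and the numbers are just the ones from the example above.

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    """Precision and recall for a single query.

    retrieved_relevant: relevant documents the search actually returned
    retrieved_total:    all documents the search returned
    relevant_total:     all relevant documents in the collection
    """
    precision = retrieved_relevant / retrieved_total
    recall = retrieved_relevant / relevant_total
    return precision, recall

# The example above: twenty relevant documents exist, the search returns
# sixteen, and ten of those sixteen are relevant.
p, r = precision_recall(retrieved_relevant=10, retrieved_total=16, relevant_total=20)
print(f"precision = {p:.1%}, recall = {r:.1%}")  # precision = 62.5%, recall = 50.0%
```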
Problems · While precision and recall are very helpful in talking about how good search systems are, they are nightmarishly difficult to actually use quantitatively. First, the notion of “relevance” is definitely in the eye of the beholder, and not, in the real world, a mechanical yes/no decision. Second, any information base big enough to make search engines interesting is going to be too big to actually compute recall numbers (to compute recall, you have to know how many matches there are, and if you did, you wouldn’t need a search engine).
Third, precision and recall aren’t, in the real world, standalone numbers; they are strongly related. Consider a Google result list; as you add each successive page of results, you expect your recall to improve and your precision to worsen.
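Here’s a toy illustration of that trade-off, assuming we somehow know which results are relevant; the ranked list, the page size of five, and the collection total of twenty relevant documents are all invented for the sake of the example.

```python
# Hypothetical ranked result list: True marks a relevant hit, False a miss.
ranked_hits = [True, True, False, True, False, False, True, False, False, False]
RELEVANT_IN_COLLECTION = 20  # assumed number of relevant documents out there

# Walk down the list one "page" (five results) at a time: recall creeps up
# while precision slides down.
for cutoff in range(5, len(ranked_hits) + 1, 5):
    found = sum(ranked_hits[:cutoff])
    print(f"top {cutoff:2d}: precision {found / cutoff:.0%}, "
          f"recall {found / RELEVANT_IN_COLLECTION:.0%}")
```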
TREC · Just because precision and recall are tough to measure doesn’t mean that people aren’t trying. The US National Institute of Standards and Technology has since 1992 been running TREC, a series of conferences in which researchers test their technology against a controlled pre-cooked set of documents and “topics” (by which they mean queries).
The most recent TREC proceedings are here; the work is organized into tracks, for example “filtering,” “video,” and “Web.”
As I said, the job is difficult; they have to formalize what a document is, what a query is, what relevance is, how relevance is measured, and so on; each of these definitions rests on a fabric of assumptions and approximations. Having said that, this is admirable work, and the few times I’ve looked at TREC numbers, they’ve been interesting. Furthermore, TREC claims that since they got started, the “effectiveness” of retrieval systems has more or less doubled.
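For a taste of what those formalizations produce, here’s a sketch of one textbook effectiveness measure over a ranked result list, average precision; this is a generic formulation, not necessarily TREC’s exact scoring, and the document ids and relevance judgments below are invented.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids) if relevant_doc_ids else 0.0

# Invented example: five ranked documents; two of the three judged-relevant
# documents show up, at ranks 1 and 3.
print(average_precision(["d4", "d7", "d1", "d9", "d2"], {"d4", "d1", "d5"}))
# (1/1 + 2/3) / 3 ≈ 0.56
```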
One phrase in that last paragraph should have raised your eyebrows: “The few times I’ve looked...” Huh? I am an acknowledged long-time search hack and have earned my living in this space for some number of years. Wouldn’t you expect me to be hounding TREC faithfully every year?
Industrial Search and Academic Research · Well, I’m not the only person in the search business who is a somewhat less than ardent follower of TREC. Check out the list of participants in the TREC 2002 Conference Overview. Here are some names that I don’t see: Google, Inktomi, Overture, Teoma, and Yahoo. It turns out that to attend the conference, you either have to have participated in the TREC tests or be a member of a sponsoring government agency, which in most cases would mean a spook.
TREC isn’t the only hotbed of academic research on search; there is also the ACM’s Special Interest Group on Information Retrieval; a glance at their next conference program gives a flavor for what the hot issues are. Once again, the big names in the search technology that we see out here in the Web trenches are mostly distinguished by their absence.
The exception, in both TREC and SIGIR, is Microsoft Research, which I see popping up again and again. And while Microsoft is not, at the moment, a real force in the search community, that may change.
How Good Does it Get? · Back when researchers started trying to measure precision and recall, the numbers were appalling; few systems could achieve 50% along either axis, given sufficiently general problems. I don’t think it’s meaningful to ask how good precision and recall are in any kind of general way; in the real world, the results are going to depend a lot on the specifics of the data, the specifics of the query, the specifics of the metadata (i.e. what you know about your documents other than the text they contain), and the specifics of who’s doing the searching. That doesn’t mean that if you’re in the search business, you shouldn’t try to measure and improve your precision and recall.
Because when you talk to a human expert who knows a field and is on top of work in it, you normally expect precision and recall of around 100%. That’s what we’d really like: for our search systems to be smart, in the way that people are. I’ll take up that issue, of intelligence in search, in the next instalment of this series.