I was talking to someone building a search engine, and he was moaning
about the cost of sorting result lists in real time. Only you don’t have to.
Anyone who’s built a big search engine eventually works this out, but posting
it here might save a few minutes for some future developer.
The idea is, you should never have to do an O(N·log(N))
sort on a
result list.
[Update: Experimental validation below.]
Consider a Google search with a huge number of results that’s unlikely to have been pre-computed. Suppose you’re the engineer writing the code for that two-word search. You’re going to have to find the list of documents that matches the first word, then the list that matches the second, then merge them (pretty easy since they’ll be sorted in document order). Then you have to pick out the best twenty to show the user.
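To make that concrete, here’s a sketch of the merge in Python; intersect, postings_a, and postings_b are names I’m inventing here, and I’m assuming each posting list is already sorted ascending by document id:

    def intersect(postings_a, postings_b):
        """Walk two posting lists sorted by document id, keeping the
        documents that appear in both, i.e. that match both words."""
        matches = []
        i = j = 0
        while i < len(postings_a) and j < len(postings_b):
            if postings_a[i] == postings_b[j]:
                matches.append(postings_a[i])    # matches both words
                i += 1
                j += 1
            elif postings_a[i] < postings_b[j]:
                i += 1                           # advance whichever list is behind
            else:
                j += 1
        return matches                           # still in document order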
Quite a few computer programmers will say “oh, I’ll just call a library
routine to sort()
the result list by relevance and send the top
20.”
The trouble is, your relevance rating might be a little bit expensive to
compute (for example it might involve things like how close the words are to
each other),
so you don’t want to recompute it O(N·log(N))
times.
You also don’t want to have to temporarily store millions of values while you
do the sort.
Here’s the trick: nobody will ever look at more than the first hundred or so results. So you don’t have to sort at all; you just have to find the twenty highest relevance values. Here’s the algorithm (there’s a code sketch below the list):
Maintain a list of the twenty most relevant entries, and their relevancy numbers. Initially it’s empty.
Fill the list with the first twenty results in whatever order they’re in, and sort them in descending order of relevance.
For each of the remaining items in the result list:
Compute its relevance value.
If it’s less relevant than #20 in the most-relevant list, you’re done with it.
If it’s more relevant than #20, add it to the most-relevant list in the proper place, shuffling down and losing #20.
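A minimal sketch of that loop in Python; select_top and relevance are names I’m making up, I keep the list sorted ascending so the entry to beat, #20, sits at index 0, and I’m assuming document ids are comparable (for breaking score ties):

    import bisect

    def select_top(doc_ids, relevance, k=20):
        """Pick the k most relevant documents without sorting them all.
        relevance() is assumed expensive, so it runs exactly once per
        document: N computations instead of N*log(N)."""
        top = []  # (score, doc_id) pairs, ascending; top[0] is the #k entry
        for doc in doc_ids:
            score = relevance(doc)
            if len(top) < k:
                bisect.insort(top, (score, doc))   # still filling the list
            elif score > top[0][0]:                # beats the current #k?
                bisect.insort(top, (score, doc))   # slot it into place...
                del top[0]                         # ...and lose the old #k
            # else: less relevant than #k, so we're done with it
        return [doc for score, doc in reversed(top)]  # most relevant first

The standard library’s heapq.nlargest does the same job with a heap, but the point is the same either way: one cheap comparison against #20 rejects almost every document.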
That’s all there is to it. If ongoing were the Journal of the ACM, at this point I’d work out, assuming randomly distributed data, how often you end up working a new value into the most-relevant list, and then prove a couple of theorems with Greek letters in them. Except I’m sure someone else already has, and the answer is: “not very often.”
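(For the record, the back-of-the-envelope version: assuming the relevance values arrive in random order, the chance that result number i cracks the current top twenty is 20/i, so the expected number of insertions over N results is

    \sum_{i=21}^{N} \frac{20}{i} \;\approx\; 20 \ln\!\left(\frac{N}{20}\right) \;\approx\; 216 \quad \text{for } N = 10^{6},

which really is “not very often” out of a million results.)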
Suppose your user reads the first 20 and then clicks the “next 20” button. That’s OK: just repeat the process, but build a top-40 list and pick the second 20 of the top 40. Repeat as often as necessary; nobody ever asks to see more than the first few hundred.
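With the hypothetical select_top sketch above, that second page is just:

    page = 2                                  # user clicked "next 20"
    top = select_top(doc_ids, relevance, k=20 * page)
    results = top[20 * (page - 1):]           # entries 21-40, most relevant first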
Experimental Validation · Michael Bazely writes:
You’re right on. When you take your “jungle sex” query and jump ahead in the results (by modifying the query string in the URL), you get this: “Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 1000.)”