Anyone who regards themselves as a serious programmer has internalized a lot of different ways of searching: hash tables, binary, and many different kinds of trees. I've used pretty well all of these seriously at some point, but for a decade or so, as far as can I recall I've used almost exclusively binary search, and I see no reason to change that. Herewith an essay for programmers, with fully-worked out examples in Java, on why. [Updated 39 months after publishing when I read with horror Josh Bloch’s exposé of a long-lurking bug. If we can’t get binary search right, what chance do we have with real software?]
I'll start out with a discussion of the performance of binary search, look at some of the reasons why people don't use it and try to poke some holes in them, and conclude with a small tutorial on how to code binary search properly, with working Java examples.
Motivation · If you've already fully internalized binary search and make heavy regular use of it, you can skip what, after all, is basically a rehash of something you took in second-year CS. This note is provoked by a recent conversation where someone said “There are five million records here, and we need to do quick lookup; why don't we stick them in an Oracle table?” I had a better idea.
Also, I think binary search is aesthetically pleasing.
Logarithmic Time... mmm, Tasty ·
Everybody knows that in Computer Science jargon, binary search is
O(log2(N))
.
Let's just look at what that means. If it takes you 1 millisecond
to search a million (say 220) items, then it'll take you about
1.05 to search 2 million, 1.10 to search 4 million, and a whole 1.301 when
you get up to 64 million.
Those are awfully nice numbers.
The Alternatives ·
Now, everyone knows that a hashtable has O(1)
search time,
and trees can be updated in place, so why would you use something as
primitive as binary search when you have these excellent alternatives?
Also, with trees and so on you can potentially search really large data sets,
too large to fit into memory, with persistent on-disk structures.
All that granted, here are some reasons you might want to use binary search anyhow:
Simplicity · Simplicity is good, and binary search, coded properly, (see below) is awfully damn simple. Hashing requires you to deal with collisions and chaining and all that kind of stuff, and while tree search is pretty straightforward, tree update is hellishly complex. Less code means less room for bugs, and better use of your CPU's I-cache, and is just generally good.
Memory Overhead · Binary search requires that you store the search items you're going to compare in an array of some sort, with exactly zero indexing overhead. Zero, that's a nice number. If you need the kind of performance you get from in-memory searching, you're going to be able to cram more in with binary than with any other approach.
Code Your Own · Of course, any good programming system library will come with a decent precooked hashtable function and maybe a persistent on-disk updateable tree too, so why not use them?
Well, since they're coded portably they'll require you to implement
hashkey()
and/or compare()
methods for your
objects, so you're going to be doing a whole lot of method dispatching or
procedure calling, which is a reasonable price for using pre-written code.
On the other hand, binary search is so simple and robust that you can code up
five subtly different binary searches for different places in your code, all
using optimized direct references into your internal structures, and not
suffer any software engineering angst.
Performance ·
In fact, in practical terms, O(1)
is in many cases not
noticeably faster than O(log2(N))
. The difference, if
it isn't eaten up in all that collision handling and chaining code, tending
to vanish in the static.
Run some measurements; I have.
Size · Suppose you have way too much data to fit that array into memory. To start with, are you really sure? Check out the cost of stacking up the necessary memory versus the cost of writing and maintaining the code to manage on-disk data structures; memory prices have probably gone down since the last time you looked.
Suppose it's still too big.
Consider trying to stuff it in anyhow.
Modern operating systems have these wonderful system calls that tell the
system to consider a file to be mapped into memory; on Unix and Linux
systems, it's called mmap()
.
You'll be surprised the first time you call this on a file that's eight times
the size of your memory and the system comes back a couple of nanoseconds
later and says “OK, what next?”.
What's happening is, the job's been turned over to the operating sytem's
Virtual Memory subsystem.
This code tends to have been written by people like Dennis Ritchie and Bill
Joy and Linus Torvalds, i.e. better programmers than you or I, and be
really terrifically smart about making the best use of memory.
Look at it this way: you can write a bunch of complicated code to manage the on-disk vs. in-memory parts of your data, or you can pretend it's all in memory and use that Dennis/Bill/Linus code to take care of the details.
Although I am given to frequent (ab)use of mmap()
, I continue
to be astonished how well it holds up under ridiculous stress.
Finally, in a couple of years, when the cost of RAM has dropped to even more ridiculously small levels, and the current practice of putting half-a-gig of RAM on a laptop seems quaint, your data will have all smoothly migrated into RAM without you having to change anything.
Update ·
Yes (gasp), it is perfectly possible to update large contiguous
arrays in RAM.
GNU Emacs, one of the fastest and most robust editing systems on the planet,
just blasts your file into a great big buffer and addresses it as an array of
bytes using a low-level C primitive it calls CharAt
, which
addresses the character at any given integer offset.
The trick is, it keeps the buffer in two extents with a “gap” in the
middle, and moves the gap around to let you edit; CharAt
checks
where the gap is, does a little bit of ridiculously-simple arithmetic and lets
you pretend it's all contiguous.
Last time I checked (er maybe ten years ago) CharAt
was a macro
and you'd see things like:
CharAt(curPos++) = whateverTheUserJustTyped;
There are also tricks like page tables which let you monkey with arrays at run-time; for example, as a programmer you are allowed to believe that your process memory is a big contiguous array, but that's all a pleasant illusion brought about by that same virtual-memory code mentioned above.
Of course, you have to worry about concurrency, but if your data structure is, well, an array, or an array with a gap, or an array with a page table, the locking and mutexing you have to do is way simpler than with a tree.
How To Code a Binary Search · There's a story I heard from a prof, that the basic Binary Search idea was around when Alan Turing was still in diapers, but that the first 18 published implementations were buggy. And I know for a fact that the first few times I coded it I got it wrong.
At this point I should say that I wasn't smart enough to devise a non-buggy implementation, nor smart enough to see that a lot of places where people use other things they should really be using binary search. I learned both those things in the late Eighties from Gaston Gonnet, my mentor back in the days of New Oxford English Dictionary Project at the University of Waterloo, who is both a frighteningly good mathematician and a world-class programmer, responsible for much of the implementation of the Maple algebra language.
All that said, the core idea of binary search is mind-bogglingly simple; you have a sorted array, you look at the middle element which tells you what you're looking for is in the top or bottom half, then you take that half, rinse and repeat until done.
Binary Search is a Two-Step Process ·
The easiest way to go off the rails with binary search is checking whether
you've found your target while you're doing the cutting-down-by-half.
This adds all kinds of complexity for the benefit of cutting down the number
of steps from O(log2(N))
to one or two less, which is
stupid.
So the recipe is:
That said, here's the basic code.
I've been mostly in Perl-and-C land since about 1999, but decided to outline
this in Java because of all the computer languages I know, Java is the most
readable.
People whom I respect say Python is good too, but I've
never got around to learning it.
The following routine is part of a fully-worked-out .java
file
with a test driver and so on that you can download, we'll get to that.
If you don't like my coding style, too bad.
1 static public int search1(int [] array, int target)
2 {
3 int high = array.length, low = -1, probe;
4 while (high - low > 1)
5 {
6 probe = (low + high) >>> 1;
7 if (array[probe] > target)
8 high = probe;
9 else
10 low = probe;
11 }
12 if (low == -1 || array[low] != target)
13 return -1;
14 else
15 return low;
16 }
Some commentary:
high
and
low
bounds are next to
each other; there is no loop breakout.probe
can
never fail to be a valid array index, nor can it end up out of the closed
high
-low
interval.high
is one
off the end of the array, or it points at something greater than
target
; low
is either -1 or points to something
<= target
.low
will end up
pointing at the target.
We can't test for that directly, because if we started with something lower
than the lowest value in the array, low
will be left at
-1
.Now there's a little problem with this; suppose the array contains
[1,2,2,3]
and you search for 2
. At the end of the
search loop, high
will be pointing at the 3
and
low
at the second 2
; that is to say, when
there are stretches of equal values in the array, this will always find the
last one.
Suppose this isn't what you want; if you're comfortable with the idiom, it's
easy enough to turn around so it finds the first of a repeated sequence:
static public int search(int [] array, int target)
{
int high = array.length, low = -1, probe;
while (high - low > 1)
{
probe = (low + high) >>> 1;
if (array[probe] < target)
low = probe;
else
high = probe;
}
if (high == array.length || array[high] != target)
return -1;
else
return high;
}
With any kind of a decent compiler, this is going to generate hellaciously tight code, particularly the search loop. And, it can't break.
And, once you get used to it, you can code it up without having to think
too much.
I got the body of the search1
method up there right, character
for character, the first time I typed it in; it compiled and the tests came
out right.
Getting Fancy ·
Once you get comfortable with this idiom, you realize you can fiddle the
algorithm depending on what you're trying to accomplish.
Suppose, for example, you want to search for a range in that array;
take two input numbers and find the index of the entry that is smaller than
the lesser of them, and of the entry that is larger than the greater.
For example, if the array is [2,4,6,8]
and I pass in
5
and 6
as arguments, I want the indices
1
and 3
back.
Return the answer in a 2-integer array; the values -1
and
array.length
are perfectly reasonable.
static public int [] range(int [] array, int floor, int ceiling)
{
int [] answer = new int[2];
int high, low, probe;
// work on floor
high = array.length; low = -1;
while (high - low > 1)
{
probe = (low + high) >>> 1;
if (array[probe] < floor)
low = probe;
else
high = probe;
}
answer[0] = low;
// work on ceiling
high = array.length; low = -1;
while (high - low > 1)
{
probe = (low + high) >>> 1;
if (array[probe] > ceiling)
high = probe;
else
low = probe;
}
answer[1] = high;
return answer;
}
I'll leave out the detailed commentary; I just wanted to make the point that once you get comfy with it, you can do all sorts of useful things with this efficient, compact, simple, robust programming idiom
Measurements · I ran some performance tests, which reminded me how aggravating Java can be, there's basically no good way to measure CPU consumption without linking in OS-specific code and going the JNI route. However, you can measure wall-clock time quite easily. I ran ten passes each of three searches, looking up ten thousand targets in arrays of size one, five, and ten million integers. I lack the patience to aggregate the numbers and run proper stats, but here's a summary of what I typically saw:
1M | 5M | 10M | |
---|---|---|---|
min, max (sec) | .021, .034 | .033, .035 | .037, .039 |
In other words, the performance difference in going from a million to ten million basically vanishes in the static. This is a a good thing.
You can grab the Java source if you want to take a closer look.