Here’s a Google search for a famous phrase, to be or not to be; give it a try and see what happens. When you look at word frequencies, it appears that there are a few words that appear unreasonably often and carry unreasonably little information. They are called “stopwords,” and this (brief) eighth bead in the On Search necklace considers them.
What many search engines have done with stopwords is, well, nothing. That is to say, they don’t get indexed, and you can’t search for them. The theory is that they’re expensive to index because there are so many of them, and they carry little useful information.
The Numbers · Let’s start with some numbers.
Just before I started writing this, a collection of ad-hoc Perl and sed and sort alleged that ongoing as of then contained 169,908 words of text, comprising 15,370 unique words.
Here are the twenty-seven most common, in decreasing order of frequency.
The column labels are a bit short to keep the table from spreading, so to
amplify: “Count” means the number of times this word appears,
“Running” means the total occurrence of this word and all those
above it, and “%” is “Running” as a percentage of
all 169,908 words in ongoing.
Word | Count | Running | % |
---|---|---|---|
the | 8886 | 8886 | 5.2 |
and | 5499 | 14385 | 8.5 |
a | 4576 | 18961 | 11.2 |
to | 4466 | 23427 | 13.8 |
of | 4406 | 27833 | 16.4 |
in | 2821 | 30654 | 18.0 |
i | 2500 | 33154 | 19.5 |
is | 2423 | 35577 | 20.9 |
that | 2354 | 37931 | 22.3 |
it | 1943 | 39874 | 23.5 |
on | 1577 | 41451 | 24.4 |
you | 1505 | 42956 | 25.3 |
this | 1499 | 44455 | 26.2 |
for | 1469 | 45924 | 27.0 |
but | 1126 | 47050 | 27.7 |
with | 1111 | 48161 | 28.3 |
are | 1077 | 49238 | 29.0 |
have | 921 | 50159 | 29.5 |
be | 909 | 51068 | 30.1 |
at | 836 | 51904 | 30.5 |
or | 833 | 52737 | 31.0 |
as | 793 | 53530 | 31.5 |
was | 789 | 54319 | 32.0 |
so | 763 | 55082 | 32.4 |
if | 699 | 55781 | 32.8 |
out | 686 | 56467 | 33.2 |
not | 679 | 57146 | 33.6 |
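For anyone who wants to produce a table like this for their own text, here is a minimal sketch in Python. It is not the Perl, sed, and sort pipeline that produced the numbers above; the function name `frequency_table` and the tokenizing rule (lowercase runs of letters and apostrophes) are assumptions made for illustration, and a different rule would give somewhat different counts.

```python
import re
from collections import Counter


def frequency_table(text, top_n):
    """Return (word, count, running, percent) rows for the top_n most common words.

    "running" is the cumulative count of this word and all those above it;
    "percent" is running as a percentage of all word occurrences in text.
    """
    # Assumed tokenizer: lowercase everything, keep runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Sort by descending count, breaking ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    rows, running = [], 0
    for word, count in ranked:
        running += count
        rows.append((word, count, running, round(100 * running / total, 1)))
    return rows


for row in frequency_table("to be or not to be that is the question", 3):
    print(row)
```

On that ten-word sample the first row is `('be', 2, 2, 20.0)` and the third is `('is', 1, 5, 50.0)`: half the occurrences are covered by three words.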
Why Stop? · The numbers tell the story. By leaving out the 27 most common words, we eliminate a third of all word occurrences. (If you haven’t read the write-up on how search indices work, you might want to take a side-trip there now.) Each word occurrence requires that you create, store, and search one posting. Most of the space-cost of search is in postings, and most of the compute time is reading and merging postings lists. Consider that occurrences of “the” comprise just over 5% of the total. If you’re running something like Google, with billions of documents and hundreds of billions of words, you’re looking at many billions of postings you can get rid of by discarding stopwords. Consider the task of doing a set intersection on the billions of matches to “to,” “be,” “or,” and so on. It’s no surprise, really, that you get that polite little note from Google about all of Hamlet’s words except “not” being too common to be useful.
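To make the saving concrete, here is a toy sketch of an index that stores one posting per word occurrence, built once with every word and once with a stopword list. The names `build_index` and `posting_count`, the three tiny documents, and the five-word stopword list are all made up for illustration; a real engine's postings carry more than a document number and a position.

```python
from collections import defaultdict


def build_index(docs, stopwords=frozenset()):
    """Map each word to a postings list of (doc_id, position) pairs.

    Words in stopwords get no postings at all, so they cannot be searched.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            if word not in stopwords:
                index[word].append((doc_id, position))
    return index


def posting_count(index):
    """Total number of postings stored across all words."""
    return sum(len(postings) for postings in index.values())


docs = ["to be or not to be", "the lord of the rings", "the limited"]
full = build_index(docs)
pruned = build_index(docs, stopwords={"the", "to", "be", "or", "of"})
print(posting_count(full), posting_count(pruned))  # prints: 13 4
```

Thirteen postings shrink to four, which is the attraction. The toy sample exaggerates the ratio, but the direction is the same as in the table.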
Why Not Stop? · Of course, skipping the stopwords comes at a cost; for example, “to be or not to be.” Another amusing example is the well-known retail chain “The Limited,” which is going to be pretty hard to find in a database that doesn’t index “the.”
And as we’ve come to expect from Google, they’re not stupid. They do in fact index the stopwords, and you can search for "to be or not to be" just fine. See the quotes around the string? This is a phrase search. I won’t go into the details, but combining huge lists of common-word postings is immensely cheaper for a phrase search than for a simple AND or OR. You can find The Limited just fine, too.
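For the curious, here is one textbook way a phrase search can work once stopwords are indexed: keep the position of every occurrence, and accept a document only if the words appear at consecutive positions. This is a sketch under that assumption, not a description of what Google actually does; `build_positional_index` and `phrase_search` are hypothetical names, and the toy code makes no attempt to demonstrate the cost difference.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} for every word, stopwords included."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents where the phrase's words occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Only documents that contain every word can possibly match.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        # Position sets for the second and later words in this document.
        later = [set(index[w][doc_id]) for w in words[1:]]
        # Try each occurrence of the first word as the start of the phrase.
        if any(all(start + k + 1 in positions for k, positions in enumerate(later))
               for start in index[words[0]][doc_id]):
            hits.append(doc_id)
    return hits


docs = [
    "to be or not to be that is the question",
    "not to be or to be",
    "be to or not",
]
index = build_positional_index(docs)
print(phrase_search(index, "to be or not to be"))  # prints: [0]
```

All three sample documents contain every word, so a plain AND would return all of them; the position check narrows that to the one document with the words in order.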
In fact, by putting + characters in front of each of the words, you can make Google claim to do a search for each of the words separately, although when I do the arithmetic in my head, I find it hard to believe they’re actually processing that many billions of postings.
Interestingly, I note that in the first Google search in the article, to be or not to be with no quotes or pluses, the one word it’s willing to search on is “not,” the least common of the most common (in ongoing anyhow).
The bottom line: refusing, by default, to search for common words is good usability practice; when I search for “Lord of the Rings,” nobody misses the two words in the middle. But simply leaving words out of your index because they’re common is a bug, not a feature.
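That policy (index everything, ignore common words by default, honor them when the user insists) can be sketched as a small query planner. The stopword list, the name `plan_query`, and the exact handling of quotes and + prefixes are assumptions for illustration, loosely modeled on the behavior described above.

```python
import re

# Illustrative list only; a real engine would derive this from its own statistics.
STOPWORDS = frozenset({"the", "of", "to", "be", "or", "and", "a", "in"})


def plan_query(query, stopwords=STOPWORDS):
    """Split a query into (phrases, terms, ignored).

    phrases: word lists from "quoted strings"; stopwords are kept.
    terms:   unquoted words to search, including stopwords prefixed with +.
    ignored: unquoted stopwords dropped by default.
    """
    quoted = re.findall(r'"([^"]*)"', query)
    phrases = [p.lower().split() for p in quoted if p.strip()]
    rest = re.sub(r'"[^"]*"', " ", query)
    terms, ignored = [], []
    for token in rest.lower().split():
        if token.startswith("+") and len(token) > 1:
            terms.append(token[1:])   # the user insisted on this word
        elif token in stopwords:
            ignored.append(token)     # dropped from the query, still in the index
        else:
            terms.append(token)
    return phrases, terms, ignored


print(plan_query("lord of the rings"))
print(plan_query('"to be or not to be"'))
print(plan_query("+the limited"))
```

With this list, “lord of the rings” searches only on “lord” and “rings,” while the quoted Hamlet line and “+the limited” keep every word, because nothing was ever left out of the index.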