Small Tables

[This fragment is available in an audio version.]

Computer programs organize bits and bytes into “data structures”. In software of any import, the data structures are usually more interesting than the code around them. This part of the Quamina Diary takes a close look at a very simple data structure that I have greatly enjoyed using to build finite automata, and which I think has lessons to teach; it’s called smallTable.

The problem · As described in a (pretty short) previous episode of this Diary, Quamina matches Patterns to Events, which are flattened into field-name/value pairs for the purpose. We try to match field names and values from Events to those offered in a Pattern. Matching names is easy, they’re immutable strings both in Events and Patterns, and thus the following suffices. ¶

transitions map[string]*valueMatcher

Now, inside that valueMatcher is a harder problem: How to match the value of a field, because patterns can include incantations like *.jpg and "anything-but": ["foo"] and so on. Clearly a map keyed by a string isn’t going to do the job. In fact, we need another flavor of finite automaton that works at the level of characters within fields.

Characters or bytes? · If you want to run an automaton over Unicode text, you have to decide whether you’re going to transition on code points, of which there are potentially 1,114,112 (although only 144,697 are in use as I write this), or on UTF-8 bytes, for which there are only 245 distinct values. Yes, a byte has eight bits, but the values 0XF5-0XFF can’t appear in correctly constructed UTF-8. Which is better, code points or bytes? ¶

On the one hand, here are generally fewer code points than UTF-8 bytes in a string, which means fewer steps to get through your automaton. On the other, the Events coming in over the wire are almost certainly already in UTF-8, and anyhow, a lot of things you’re going to match will be ASCII (for example, numbers), where the number of bytes and characters is the same. For Quamina’s predecessor, which I wrote while at AWS, I picked bytes and never regretted it. And given how easy Go makes it to work with []byte, I wasn’t tempted to do anything else.

Goals and constraints · Let’s say we have a step type that represents a transition between automaton states in response to a UTF-8 byte. Here are two simple Go idioms that represent that kind of state table: map[byte]step and [245]step. Neither of these made me happy, based on prior experience. ¶

First of all, I know that size will be a problem; adding a million Patterns to a Quamina instance is something that people will want to do, then they’ll complain about memory burn.

The second lesson came from gathering statistics on state machines in the wild. Many of the UTF-8 states have only one or two branches, and even in big complicated automata, the average number of out-transitions from a state remains well below 10.

Third, when you’re building automata, it’s common to want to express “transition on a range of byte values.” For example, to match a wildcard construct like *a you transition to a new state when you see a, and loop back on any other value. Similarly, you might be interested in transitioning on anything that’s a decimal digit while matching numbers, or on hex digits while matching IPv6 addresses.

Both the Go map and array fragments above look like they’re going to use way more memory than needed, and in particular Go’s map construct is not simple and not free to initialize. It’s complicated enough that, even after considerable searching, I couldn’t figure out how it actually implements map[byte]. And neither simple option is very appealing for storing byte ranges. So…

Meet smallTable · You can look on GitHub, but here it is: ¶

type smallTable[S comparable] struct {
    ceilings []byte
    steps    []S
}

It’s generic and the S stands for “step”; the steps are different for deterministic and nondeterministic automata, but the logic for processing them is the same.

First, here’s the code for “find the step that corresponds to a UTF-8 byte value”.

// step finds the member of "steps" in the smallTable that corresponds to a UTF-8 byte value. It may return nil.
func (t *smallTable[S]) step(utf8Byte byte) S {
    for index, ceiling := range t.ceilings {
        if utf8Byte < ceiling {
            return t.steps[index]
        }
    }
    panic("Malformed smallTable")
}

“Wait,” you say, “that’s going to run off the end of the array!” Here’s why it doesn’t.

const byteCeiling int = 0xf6

// newSmallTable mostly exists to enforce the constraint that every smallTable has a byteCeiling entry at
// the end, which smallTable.step totally depends on.
func newSmallTable[S comparable]() *smallTable[S] {
    var sNil S // declared but not assigned, thus serves as nil
    return &smallTable[S]{
        ceilings: []byte{byte(byteCeiling)},
        steps:    []S{sNil},
    }
}

Here’s an explanation-by-example of how smallTable works. Suppose we want to model a table where byte values 3 and 4 map to ss1 and 0x34 maps to ss2. Then the smallTable would look like:

ceilings	steps
3	`nil`
5	`&ss1`
0x34	`nil`
0x35	`&ss2`
`byteCeiling`	`nil`

Suppose we wanted the same thing, but to map all values other than 3, 4, and 0x34 to ss3. Then we’d have:

ceilings	steps
3	`&ss3`
5	`&ss1`
0x34	`&ss3`
0x35	`&ss2`
`byteCeiling`	`&ss3`

Invariant: The last element of ceilings is always byteCeiling. [The attentive reader may be wondering why byteCeiling is 0xf6 rather than 0xf5. We’ll get to that in a subsequent article.]

Why? · Everybody knows that dereferencing a hash table (as in map) is O(1), so why would I want to search sequentially? Bear in mind that I’m pulling in a byte array, so I get eight or more ceiling values for each memory fetch, and we know that the size of ceilings is usually less then ten. So I figure that I can run (on average) halfway through the byte array by the time the hashtable code has done a couple of memory fetches and computed a hash. (All computers wait for memory at the same speed.) That’s why ceilings and steps are separate arrays, as opposed to an array of ceiling/step structures. It also helps that the code itself is very compact and should be cache-friendly. ¶

Speculation aside, it used to be a hash table and it didn’t slow down when I switched in smallTable. Plus, it used way less memory. Plus, it represents matches to ranges of bytes very nicely.

It’s not perfect · There’s a lot to like about smallTable. And, one big problem: It’s hard to update. A couple of times I’ve tried to write the code to change an arbitrary byte mapping in an existing smallTable and pretty soon my head started to hurt. So what the code’s doing now is unpacking the smallTable into a unpackedTable: ¶

// unpackedTable replicates the data in the smallTable ceilings and steps arrays.  It's quite hard to
// update the list structure in a smallTable, but trivial in an unpackedTable.  The idea is that to update
// a smallDfaTable you unpack it, update, then re-pack it.  Not gonna be the most efficient thing so at some future point…
// TODO: Figure out how to update a smallDfaTable in place
type unpackedTable[S comparable] [byteCeiling]S

The support routines look like this; the body of code isn’t that interesting:

func unpackTable[S comparable](t *smallTable[S]) *unpackedTable[S]
func (t *smallTable[S]) pack(u *unpackedTable[S])

The profiler shows that in situations where you’re adding a lot of patterns to a Quamina instance, this approach is expensive, especially that pack call. I’m pretty sure that with more head-scratching and a bit of inspiration, I’ll find a faster way to update these things.

smallTable makes me happy. I think that if you’re building an automaton driven by a range of small numbers with modest state fanout, you probably need something much like it.

Contributions

Comment feed for ongoing:

From: Ed Davies (Jun 27 2022, at 03:54)

Questions I'd ask if I was doing a code review:

Why in `if utf8Byte < ceiling` do you use `<` rather than '<=`? `<=` would allow operation on arbitrary byte sequences as it doesn't require a sentinel value outside the range of acceptable values. Also, a test for a single value would use that particular value directly which feels more natural to me.

Couldn't `panic("Malformed smallTable")` also be triggered by malformed UTF-8 so be a misleading error message?

Why does smallTable need S to be comparable?

[link]

From: Rob Sayre (Jun 28 2022, at 13:38)

The rule of thumb that I know is that less than 12 entries can be sequentially compared faster than a hash table can be searched.

After that, you get into cache-aware or cache-oblivious data structures, or at least different kinds of hash table addressing.

running: "sudo lshw -C memory"

shows me the size of the L2 cache on my Linux machine, you're probably staying on the right side of that.

[link]

From: Tim (Jun 29 2022, at 13:44)

There are some useful comments in https://news.ycombinator.com/item?id=31890672

[link]

ongoing

What this is ·

Truth · Biz · Tech

author · Dad
colophon · rights

June 25, 2022
· Technology (90 fragments)
· · Quamina Diary (17 more)

By Tim Bray.

The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.

I’m on Mastodon!