[This fragment is available in an audio version.]
Computer programs organize bits and bytes into “data structures”. In software of any import, the data structures are usually more interesting than the code around them. This part of the Quamina Diary takes a close look at a very simple data structure that I have greatly enjoyed using to build finite automata, and which I think has lessons to teach; it’s called smallTable.
The problem · As described in a (pretty short) previous episode of this Diary, Quamina matches Patterns to Events, which are flattened into field-name/value pairs for the purpose. We try to match field names and values from Events to those offered in a Pattern. Matching names is easy, they’re immutable strings both in Events and Patterns, and thus the following suffices.
transitions map[string]*valueMatcher
Now, inside that valueMatcher
is a harder problem: How to match the value of a field, because patterns
can include incantations like *.jpg
and "anything-but": ["foo"]
and so on. Clearly a map keyed by a string isn’t
going to do the job. In fact, we need another flavor of finite automaton that works at the level of characters within
fields.
Characters or bytes? · If you want to run an automaton over Unicode text, you have to decide whether you’re going to transition on code points, of which there are potentially 1,114,112 (although only 144,697 are in use as I write this), or on UTF-8 bytes, for which there are only 245 distinct values. Yes, a byte has eight bits, but the values 0XF5-0XFF can’t appear in correctly constructed UTF-8. Which is better, code points or bytes?
On the one hand, here are generally fewer code points than UTF-8 bytes in a string, which means fewer steps to get through
your automaton. On the other, the Events coming in over the wire are almost certainly already in UTF-8, and anyhow, a lot of
things you’re going to match will be ASCII (for example, numbers), where the number of bytes and characters is the same.
For Quamina’s predecessor, which I wrote while at AWS, I picked bytes and never regretted it. And given how easy Go makes
it to work with []byte
, I wasn’t tempted to do anything else.
Goals and constraints ·
Let’s say we have a step
type that represents a transition between automaton states in response to a UTF-8 byte.
Here are two simple Go idioms that represent that kind of state table: map[byte]step
and [245]step
.
Neither of these made me happy, based on prior experience.
First of all, I know that size will be a problem; adding a million Patterns to a Quamina instance is something that people will want to do, then they’ll complain about memory burn.
The second lesson came from gathering statistics on state machines in the wild. Many of the UTF-8 states have only one or two branches, and even in big complicated automata, the average number of out-transitions from a state remains well below 10.
Third, when you’re building automata, it’s common to want to express “transition on a range of byte values.” For
example, to match a wildcard construct like *a
you transition to a new state when you see a
, and loop
back on any other value. Similarly, you might be interested in transitioning on anything that’s a decimal digit while matching
numbers, or on hex digits while matching IPv6 addresses.
Both the Go map and array fragments above look like they’re going to use way more memory than needed, and in particular Go’s
map construct is
not simple and not free to initialize. It’s
complicated enough that, even after considerable searching, I couldn’t figure out how it actually implements
map[byte]
. And neither simple option is very appealing for storing byte ranges. So…
Meet smallTable
·
You can
look on GitHub, but here it is:
type smallTable[S comparable] struct { ceilings []byte steps []S }
It’s generic and the S
stands for “step”; the steps are different for deterministic and nondeterministic
automata, but the logic for processing them is the same.
First, here’s the code for “find the step that corresponds to a UTF-8 byte value”.
// step finds the member of "steps" in the smallTable that corresponds to a UTF-8 byte value. It may return nil. func (t *smallTable[S]) step(utf8Byte byte) S { for index, ceiling := range t.ceilings { if utf8Byte < ceiling { return t.steps[index] } } panic("Malformed smallTable") }
“Wait,” you say, “that’s going to run off the end of the array!” Here’s why it doesn’t.
const byteCeiling int = 0xf6 // newSmallTable mostly exists to enforce the constraint that every smallTable has a byteCeiling entry at // the end, which smallTable.step totally depends on. func newSmallTable[S comparable]() *smallTable[S] { var sNil S // declared but not assigned, thus serves as nil return &smallTable[S]{ ceilings: []byte{byte(byteCeiling)}, steps: []S{sNil}, } }
Here’s an explanation-by-example of how smallTable
works.
Suppose we want to model a table where byte values 3 and 4 map to ss1
and 0x34 maps to ss2
.
Then the smallTable would look like:
ceilings | steps |
---|---|
3 | nil |
5 | &ss1 |
0x34 | nil |
0x35 | &ss2 |
byteCeiling | nil |
Suppose we wanted the same thing, but to map all values other than 3, 4, and 0x34 to ss3
. Then we’d have:
ceilings | steps |
---|---|
3 | &ss3 |
5 | &ss1 |
0x34 | &ss3 |
0x35 | &ss2 |
byteCeiling | &ss3 |
Invariant: The last element of ceilings
is always byteCeiling
.
[The attentive reader may be wondering why byteCeiling
is 0xf6 rather than 0xf5. We’ll get to that in a
subsequent article.]
Why? ·
Everybody knows that dereferencing a hash table (as in map
) is O(1)
, so why would I want to search
sequentially?
Bear in mind that I’m pulling in a byte array, so I get eight or more ceiling
values for each memory fetch, and we
know that the size of ceilings
is usually less then ten. So I figure that I can run (on average) halfway
through the byte array by the time the hashtable code has done a couple of memory fetches and computed a hash. (All computers
wait for memory at the same speed.) That’s why ceilings
and steps
are separate arrays, as
opposed to an array of ceiling
/step
structures. It also helps that the code itself is very compact
and should be cache-friendly.
Speculation aside, it used to be a hash table and it didn’t slow down when I switched in smallTable
. Plus, it used way less
memory. Plus, it represents matches to ranges of bytes very nicely.
It’s not perfect ·
There’s a lot to like about smallTable
. And, one big problem: It’s hard to update. A couple of times I’ve tried
to write the code to change an arbitrary byte mapping in an existing smallTable
and pretty soon my head started to
hurt.
So what the code’s doing now is unpacking the smallTable
into a unpackedTable
:
// unpackedTable replicates the data in the smallTable ceilings and steps arrays. It's quite hard to // update the list structure in a smallTable, but trivial in an unpackedTable. The idea is that to update // a smallDfaTable you unpack it, update, then re-pack it. Not gonna be the most efficient thing so at some future point… // TODO: Figure out how to update a smallDfaTable in place type unpackedTable[S comparable] [byteCeiling]S
The support routines look like this; the body of code isn’t that interesting:
func unpackTable[S comparable](t *smallTable[S]) *unpackedTable[S] func (t *smallTable[S]) pack(u *unpackedTable[S])
The profiler shows that in situations where you’re adding a lot of patterns to a Quamina instance, this approach is
expensive, especially that pack
call. I’m pretty sure that with more head-scratching and a bit of inspiration,
I’ll find a faster way to update these things.
smallTable
makes me happy. I think that if you’re building an automaton driven by a range of small numbers with
modest state fanout, you
probably need something much like it.
Comment feed for ongoing:
From: Ed Davies (Jun 27 2022, at 03:54)
Questions I'd ask if I was doing a code review:
Why in `if utf8Byte < ceiling` do you use `<` rather than '<=`? `<=` would allow operation on arbitrary byte sequences as it doesn't require a sentinel value outside the range of acceptable values. Also, a test for a single value would use that particular value directly which feels more natural to me.
Couldn't `panic("Malformed smallTable")` also be triggered by malformed UTF-8 so be a misleading error message?
Why does smallTable need S to be comparable?
[link]
From: Rob Sayre (Jun 28 2022, at 13:38)
The rule of thumb that I know is that less than 12 entries can be sequentially compared faster than a hash table can be searched.
After that, you get into cache-aware or cache-oblivious data structures, or at least different kinds of hash table addressing.
running: "sudo lshw -C memory"
shows me the size of the L2 cache on my Linux machine, you're probably staying on the right side of that.
[link]
From: Tim (Jun 29 2022, at 13:44)
There are some useful comments in https://news.ycombinator.com/item?id=31890672
[link]