[This fragment is available in an audio version.]
When I introduced Quamina, I described the core trick: You prepare an arbitrary JSON blob for automaton-based matching by “Flattening” it into a list of name/value pairs, then sorting them in name order. Today, a closer look at how you work with the flattened data.
This is one of a series of essays on programming topics motivated by my work on eventing services at AWS and, since then, on Quamina, a content-filtering library implemented in Go (GitHub).
By example · Consider the following Pattern:
{"x": ["a"], "y": [1, 2]}
It should match any JSON blob containing a top-level x
field whose value is the string "a"
and a
top-level y
field whose value is 1
or 2
.
Since we sort the fields of incoming events in order of name, we know how to build the automaton. There are two things that aren’t obvious:
Matching each field is a two-step process: First, see if the field name matches, then the value.
This pattern doesn’t care about any fields except for the top-level x
and y
. So the automaton
has to bypass all the others.
Given that, here’s the little automaton that could:
Notes ·
There are two kinds of things in the picture: field matchers and value matchers, labeled in the diagram with names beginning
fm
and vm
. In Go code, they’re implemented by types named
fieldMatcher and
valueMatcher.
The matchers transition straightforwardly, looking for an x
field
with value "a"
and then a y
field with value 1
or 2
. But, each of these
matchers have a *
representing “anything else” that loops back to the field-matcher state. This is how fields
that don’t appear in the pattern are ignored. The automaton loops in fm0
until it sees an x
field, then
moves along to fm1
if the field’s value is "a"
, otherwise looping back to fm0
. If it does get
to fm1
, it’ll loop there forever until it sees a y
field with value 1
or
2
, transitioning to fm2 or fm3 respectively and bypassing any other fields we don’t care about.
The big red !T
on fm2
and fm3
says reaching either state means you’ve matched
the pattern T
that was added in the code sample above.
Problem ·
I’d used finite automata more than once before this, but on every other occasion I was parsing a programming language or
Internet data format, which have none of this “skip any fields you’re not looking for” crap. Which meant that the state
machine I drew on the whiteboard in my AWS office looked like the one just above, but without those loopbacks labeled
*
. Yeah, even though they’re implicit in the design. What can I say, I was in a hurry.
So when I sat down to write the code to traverse the automaton, I just couldn’t figure out how to make it work. I felt, as I often feel, that I’d gotten in over my head.
Since I wasn’t smart enough to write down the correct automaton, I had to figure out how to code around the one I had. I came up with this. Let’s look at the code.
// tryToMatch tries to match the field at fields[index] to the provided state. If it does match and generate // 1 or more transitions to other states, it calls itself recursively to see if any of the remaining fields // can continue the process by matching that state. func tryToMatch(fields []Field, index int, state *fieldMatcher, matches *matchSet) { // try to transition through the machine nextStates := state.transitionOn(fields[index]) // for each state in the possibly-empty list of transitions from this state on fields[index] for _, nextState := range nextStates { nextStateFields := nextState.fields() matches = matches.addXSingleThreaded(nextStateFields.matches...) // for each state we've transitioned to, give each subsequent field a chance to // transition on it, assuming it's not in an object that's in a different element // of the same array for nextIndex := index + 1; nextIndex < len(fields); nextIndex++ { if noArrayTrailConflict(fields[index].ArrayTrail, fields[nextIndex].ArrayTrail) { tryToMatch(fields, nextIndex, nextState, matches) } }
In English ·
You have an automaton that looks like the picture above, only without the *
back-links. It has a start state. You
don’t know which fields in the event (if any) are going to match the pattern. You do know that any match has to begin
with the start state.
The fact that the fields are sorted by name really matters here; as you work your way through the automaton, you never have to worry about transferring back to a previous state.
If you’re looking at that code, please ignore the static about “ArrayTrailConflict”, although that’s maybe interesting enough to get a write-up later in this series.
I don’t know if this approach to traversing automata (a) has been investigated, (b) has a name, or (c) is any good. I do know that it worked really well in practice, handling many millions of events per second in multiple AWS services.
I’ve done a bit of pen-and-paper analysis and don’t think the amount of work is meaningfully different from a conventional traversal. But it did occur to me that in principle this approach could be made multithreaded; you could process multiple proposals in parallel on different cores. But anyhow, the profiler says this part of traversing the automaton is hardly visible as part of the total compute. So I left it this way in Quamina for sentimental reasons.
Tables? ·
In the first cut, the field matcher was just map[string]*valueMatcher
and the value-matcher was
map[string]*fieldMatcher
. It worked OK and the fieldMatcher
is still like that but, for reasons
I’ll write about later, that’s a bad choice for the valueMatcher
.
Smarter than me · At some level, Quamina is. One symptom is places like this in the code, distinguished by extended verbose comments. These are where I got stuck and bashed my head across the wall until I got something that worked, and knew I’d have no chance of understanding it later (nor would any subsequent visitor) unless I could squeeze out a coherent English explanation. As Prof. Feynman said, if you can’t explain something in simple language, you don’t really understand it. The observation that computer programmers can build executable abstractions that work but they then have trouble understanding is not new and not surprising. Lots of our code is smarter than we are.
The code is also smart because at AWS I had extremely talented collaborators who added things that I didn’t think were possible until they worked. The proportion of the useful ideas in here that are actually mine is probably less than 50% now.
Finally, we had the insane luxury of running this in production against millions-per-second event flows and watching what broke. And of hearing from other teams using it about what they had managed to break. I guarantee: Nobody is smart enough to predict the behavior of software under this kind of stress without experiencing it.
So if you are face-to-face with a piece of production software that’s new to you, and find yourself feeling stupid, don’t worry about it. That software represents a collection of flashes of peak-energy inspiration from people more or less as smart as you, and there were probably quite a few of those people, and it’s been enhanced with the wisdom that comes only from being used at length for real work.
If course it’s smarter than you. That’s OK, you can still improve it.
News · As of late May, Quamina has picked up a couple of collaborators with way more GitHub expertise than me, and its repo is growing all sorts of bells and whistles, mostly on the CI/CD front. Which means that I hope to do a release next month and see if anyone actually wants to use this. Also, more stuff to write about!
Comment feed for ongoing:
From: Michael Schürig (May 29 2022, at 05:49)
Would it be possible to speed up filtering by sorting matchers and fields not alphabetically, but in order of selectivity? This could be done statically or adaptively. The additional processing might eat up any gains.
Inspiration: database indexes and Boyer-Moore string search.
[link]
From: Tim (May 29 2022, at 10:14)
Michael: It might be, but you couldn't do it dynamically, because you have to sort the pattern fields at automaton build time and then the incoming event fields in the same way. So you'd need a good advance metric of which event fields would in practice be the most selective.
But TBH I bet it wouldn't make any performance difference. As we'll see later in this series, the actual matching compute is not that important.
[link]
From: Steve Coffman (Jun 03 2022, at 11:56)
This is pretty great! I have several times implemented a protobuf (GCP PubSub) to workload dispatch (kubernetes pod) that I would like to make generic, and this seems like it would drop in nicely (as long as I get the protobufs in json).
[link]