[This fragment is available in an audio version.]
No parent will name a favorite among their children. But I do have one among my brainchildren, my software contributions over the decades: The event-streaming code I helped build at AWS. After rage-quitting I missed it so much that over the last few months, I wrote a library (in Go) called Quamina (GitHub) that does some of the same things. This is about that.
Quamina offers an API to a construct called a “Matcher”. You add one or a hundred or a million “Patterns” to a Matcher then feed “Events” (data objects with possibly-nested fields and values) to it, and it will return you an array (possibly empty) of the Patterns each Event matches. Both Patterns and Events are represented by JSON objects (but it should be easy to support other Event encodings).
Quamina (and here I beg pardon for a bit of
chest-pounding) is really freaking fast. But what’s more interesting is that its speed doesn’t depend much
on the number of Patterns that have been added. (Not strictly speaking O(1)
, but pretty close.) Concretely:
On my 2019 Intel MacBook Pro, it can pattern-match several hundred thousand Events per second without much regard for how many
Patterns are loaded into the Matcher. The benchmark I use most for optimization has maliciously-constructed rules that force the
Matcher to
look at basically all the fields in 1KB-ish Events, where the values are arrays of long floating-point numbers. I try to keep this
worst-case running at 150K/second or better.
History · Back in 2014-15 I wrote v1.0 of a conceptually similar library operating inside AWS, which now implements pattern-matching capabilities for EventBridge and quite a few other teams around Amazon, filtering millions and millions of events per second so they can be accepted, rejected, or routed. Lots of super-talented collaborators emerged over the years and I bet that these days, my code isn’t even a majority of the thing.
I’ve been lobbying AWS to open-source it and that conversation continues, it might happen. But I missed working on it so much that I built Quamina. Also, the AWS version isn’t in Go and I thought a Go library might be nice.
I plan to write a few pieces about Quamina. Whether or not its features interest you, it illustrates lessons in dealing with data in motion at a scale that not very many people are lucky enough to be given a chance to code for.
Automata · Quamina is based on finite state machines a.k.a. finite automata. There is a rich body of academic investigation into the discrete-math theory of automata, almost none of which I’ve studied. But once I got the basic idea, state machines have been the first tool I’ve reached for across the decades when faced with hard programming problems. I’m not going to introduce the concept here because anyone who’s either got a CS degree or written much software knows the basics.
I like using automata because they have very, very predictable performance. If you’re processing a row of data items you only have to look at each item once, which is a pretty strong constraint on the time you’re going to spend. Also, the typical unit of work in stepping through an automaton is small. Thus, fast and predictable. Assuming I write more on Quamina, there’ll be lots of digging into details of how to make automata useful.
Quamina by example · This is going to be easier if we have an example to work with. Here’s a sample Event:
{ "Image": { "Width": 800, "Height": 600, "Title": "View from 15th Floor", "Thumbnail": { "Url": "http://www.example.com/image/481989943", "Height": 125, "Width": 100 }, "Animated" : false, "IDs": [116, 943, 234, 38793] } }
(This is actually the canonical example of a JSON object from RFC8259.)
Here are two Patterns that would match the Event:
{"Image": {"Width": [800]}}
{ "Image": { "Animated": [ false ], "Thumbnail": { "Height": [ 125 ] }, "IDs": [ 943, 811 ] } }
For more examples and explanation, see the README.
This doesn’t look much like a typical query or pattern-matching language; the Pattern echoes the “shape” of the Event you’re trying to match. This turns out to have pleased a few developers, one or two of whom have said nice things to me.
Unfortunately I have no recollection of the thought processes that got us there in the white heat of the early design sprints for what became EventBridge. I remember a meeting in Seattle where I got up and sketched a few of these on the whiteboard and people said “It’s not SQL should it be?” and then eventually “Yeah, looks OK”, then we moved on. There must have been some discussion before that?
The trick · The idea of compiling together a bunch of automata and using them to match data is not exactly new; I think the “|” operator was introduced to grep while I was still in high school. The idea of using it on data that takes the form of JSON objects is a little less obvious. Here’s how it’s done:
First, you flatten the Event fields into name/value pairs. Here are a few of the fields in flattened form:
Image.Width: 800 Image.Thumbnail.Width: 100 Image.IDs: [116, 943]
Then you sort them by pathname. This is perfectly OK, because the order of fields in a JSON object doesn’t matter.
Image.IDs: [116, 943] Image.Thumbnail.Width: 100 Image.Width: 800
Then you do the same trick with the Pattern. For example, let’s use the second one above:
Image.Animated: false Image.IDs: 943 Image.Thumbnail.Height: 125
Then you make a state machine based on the flattened sorted Pattern and run it against the flattened sorted Event, and away you go.
What next? · Quamina has enough features to be useful and the unit test coverage isn’t terrible. I’d like people to look at it and see if solves any problems they have, or alternately if they see a reason why it’s All Wrong. Bear in mind that there’ll probably be another (more feature-rich) version in another popular programming language coming from AWS later this summer.
It isn’t at 1.0 yet and has no CI/CD, which means I reserve the right to change the APIs. In particular, should the MatchesForJsonEvent() be made a Go generic? (Quamina already has generics internally). I couldn’t stop anyone putting it into production, but I probably wouldn’t if I were you. But if that idea even crosses your mind, please get in touch.
Comment feed for ongoing:
From: Joshua F (May 12 2022, at 21:08)
This is great. I feel that I'll be using this in the near future. Thank you for making this available.
[link]
From: André Eriksson (May 13 2022, at 06:23)
This is great, thanks for sharing. We've been thinking about adding better event streaming/event sourcing support to Encore and I think something like this could play a central part in it.
Would love to understand more about the sort of use cases where it was successful at AWS.
[link]
From: Andrew Reilly (May 17 2022, at 20:16)
Nice work! Feels like "awk" for the post-regexp/JSON era. I have no idea whether anyone's awk implementation ever jointly optimized the pattern application the way you have though.
Do you have sub-patterns (regexes) for string fields, or ranges for numerics? Returned captures?
[link]
From: David A (May 26 2022, at 06:36)
Tim, Quamina this is so fantastic. I love the library and your write up and thank you immensely for releasing this. I've built a horrible narrow version of something similar for a few projects, but now that Quamina is here I will absolutely swap to your implementation which is very elegant, more robust, and faster. Please, please write more about this. Your long form explanatory articles are amazing. More on Quamina please. Thank you.
[link]