Last week, AWS announced (docs, blog) event filtering for Lambdas reading from SQS, DynamoDB, and Kinesis. Lambda’s new filtering is consistent with EventBridge’s. Also announced was faster, slicker, cheaper S3/EventBridge integration. These features rely on software I helped write. The problem turns out to be more interesting than you might expect, by a factor of 10¹² or so. I think it’d be really great if the code were open-sourced.
The core idea is that in modern applications, streams of data flow around between services, and very often you want to select just some of the data out of the stream. For example, if you’ve written a few Lambdas, quite likely some of your code was about looking at the input events and deciding whether to process or ignore them. Or deciding which other software to dispatch them to. If you can achieve the same effect by writing pattern-matching rules, you can subtract code and save money: a win-win.
Interesting why? · It’s not as though these filtering features are doing anything new or innovative or surprising; this sort of thing goes back as far as 1973, with grep. There are lots more landmarks along that road; for example a lot of smart people seem to like jq these days. So, why should anyone care about this software?
First of all, the stream it’s filtering might be carrying millions of events per second. Second, people will want to post a million rules against the stream. (Neither of those numbers is hypothetical.) So it’s a million times a million times trickier than you might think.
Bearing that in mind, we created a library which I’ll call the Matcher. Its query expressions are a little idiosyncratic, but people seem to like them; that blog I linked to earlier is probably the best introduction I’ve seen.
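To give you the flavor (this is the pattern syntax from the public EventBridge docs, which these features share): a pattern is a JSON object whose leaves are arrays of candidate values, and an event matches if every field the pattern names shows up in the event with one of those values. This one, for example, selects EC2 state-change events for instances entering the “running” or “stopped” states:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["running", "stopped"]
  }
}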
The performance envelope is, um, unusual. But pleasing. “Good enough for EventBridge” is decent testimony. Even stronger is “good enough for S3”. Check out this short thread from Usman Khalid, GM for EventBridge and some other stuff, a guy I really enjoyed working with at AWS. He explains that the event-matching logic has been pushed down into S3, which doesn’t even emit the events unless they match a rule. Now, “inside of S3” is a very special place. An almost unimaginable amount of customer data lives in there, subject to an almost unimaginable transaction rate, and everyone is very very latency-sensitive.
How it works · Well, that’d be a secret. At least I thought so until I sat down to write this, and discovered U.S. Patent 11,068,487: Event-stream searching using compiled rule patterns, filed in 2015 but just issued last July. Patent theory says that the disclosure should be sufficient to enable a competent practitioner to implement the technology. Heh, good luck with that. I know a better way than patents to share engineering progress.
Open-source why? · The Matcher would be useful for a lot of people, if the AWS experience is any guide. It’s easy to integrate. It’d be a significant and visible give-back from AWS to the software commons from which it’s profited so richly. I also think it’d be a useful teaching tool (see below).
Having said that, and to be brutally honest, my reasons are selfish. I enjoyed working on this so much, and would like to do it some more. So from that point of view, AWS would get some free time out of an engineer with highly relevant experience.
I might also like to make versions of it in other programming languages.
And courseware · I applied plenty of lessons hard-won in decades of programming to this software, and learned useful new ones. So there are deep software-tech blog pieces screaming to be written. In fact, I think I could probably put together a 400-level CompSci course on “Topics in high-performance software design” using the Matcher as a lab substrate.
I’ll support that claim with one interesting and counter-intuitive lesson. But first…
History · When I started at AWS on December 1, 2014, my hiring manager said “Our idea is we’re gonna have an event bus carrying notifications of everything that’s happening in everybody’s AWS account and you can subscribe to it, select the events you care about, and process them with Lambdas.” So, it wasn’t my idea; and that manager is now a VP who owns a huge swathe of AWS Networking.
But I helped write the PR/FAQ and was senior engineer on the dev team.
I couldn’t stop thinking about matching the rules that defined the subscriptions. My mind hadn’t stretched to millions times millions yet, but it seemed obvious that the biggest customers, who would have the most events, would also have the most rules. So if you coded up a simple-minded loop like “for each event, retrieve all the rules posted by the event’s account and check one by one to see if they match” that’d probably be bad.
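To make the cost concrete, here’s a minimal Go sketch of that simple-minded loop (Event and Rule are hypothetical stand-ins, not the Matcher’s types):

package naive

// Event and Rule are illustrative stand-ins, just to show the shape
// of the problem.
type Event struct {
	Account string
	Fields  map[string]string
}

type Rule struct {
	Matches func(Event) bool // reports whether the event satisfies the rule
}

// dispatch does work proportional to the number of rules for every
// single event. At a million rules and a million events per second,
// that inner check runs 10¹² times a second. Probably bad.
func dispatch(e Event, rulesByAccount map[string][]Rule) []Rule {
	var matched []Rule
	for _, r := range rulesByAccount[e.Account] {
		if r.Matches(e) {
			matched = append(matched, r)
		}
	}
	return matched
}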
So I coded up most of release 1.0 of the Matcher by January, with a couple of breakthroughs happening on my Mom’s couch when we visited her in Saskatchewan for Christmas.
EventBridge (launched as “CloudWatch Events”) worked out pretty well and soon other groups started using the Matcher. It grew plenty of features over the years, some only exposed internally. Naturally, like any other widely-used piece of software that started out lean and mean, it became fatter and more complex. While there’s a ton of my code there, I’m sure it’s a minority by now.
I met some wonderful people from other AWS and Amazon teams who said “Oh we need Matcher but it also needs to do X, mind if I code that up?” Working at AWS was a fun job, and Matcher was one of the most-fun parts.
Now, I promised a non-obvious finding that emerged during the Matcher work…
Lesson: JSON is faster than binary · Here’s a surprising statement: If you want to filter messages very, very quickly, you can do it much faster if they’re in JSON as opposed to a strongly-typed “binary” format like Protobufs or Avro.
No, really. Because for those formats, you have to parse the whole binary blob into a native data structure, which means looking at all the fields and allocating memory for them. You really want to minimize memory allocation for anything that happens millions of times per second.
JSON, however, allows stream parsing. Consider, for example, the Jackson library’s JsonParser.nextToken() or the Token() method on Go’s encoding/json Decoder.
Which means you can skip through each event, only keeping around the fields that you’re interested in filtering on, which in most event-processing scenarios is very few of them. So yeah, you do have to parse strings of digits into numbers, but only for the fields you care about.
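Here’s a minimal Go sketch of the technique, using that Token() interface. To be clear, this is my illustration, not the Matcher’s code; extract and skipValue are names I just made up, and it assumes the fields you’re filtering on are top-level scalars:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extract pulls out just the named top-level fields of a JSON event,
// skipping everything else without building the full tree that
// json.Unmarshal would allocate.
func extract(event string, wanted map[string]bool) (map[string]json.Token, error) {
	dec := json.NewDecoder(strings.NewReader(event))
	found := make(map[string]json.Token)
	if _, err := dec.Token(); err != nil { // consume the opening '{'
		return nil, err
	}
	for dec.More() {
		keyTok, err := dec.Token() // object keys are always strings
		if err != nil {
			return nil, err
		}
		key, _ := keyTok.(string)
		if wanted[key] {
			val, err := dec.Token() // assumes the fields we want are scalars
			if err != nil {
				return nil, err
			}
			found[key] = val
		} else if err := skipValue(dec); err != nil {
			return nil, err
		}
	}
	return found, nil
}

// skipValue steps over one value, however deeply nested, keeping
// nothing around.
func skipValue(dec *json.Decoder) error {
	depth := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			return err
		}
		switch tok {
		case json.Delim('{'), json.Delim('['):
			depth++
		case json.Delim('}'), json.Delim(']'):
			depth--
		}
		if depth == 0 {
			return nil
		}
	}
}

func main() {
	event := `{"source":"aws.ec2","detail":{"state":"running","tags":["a","b"]},"id":"abc-123"}`
	fields, err := extract(event, map[string]bool{"source": true, "id": true})
	if err != nil {
		panic(err)
	}
	fmt.Println(fields) // map[id:abc-123 source:aws.ec2]
}

The win is in skipValue: nested structure you don’t care about streams past without ever being turned into the tree of maps and slices a full unmarshal would have to build.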
And there are better structures for storing a small selection of data fields than JSON’s messy, weakly-typed tree. I’ve measured; the performance delta between parsing incoming events into an object and just skipping through their fields is freaking huge.
That’s just one take-away; there are plenty more. I’d sure like to share them.
[Update: I’ve seen feedback that there’s no reason in principle that you couldn’t build streaming parsers for the strongly-typed binary data formats and that, for Avro in particular, it ought to be really fast. As far as I can tell, there isn’t much in the way of such APIs out there at the moment; sounds like a fun little programming project.]
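The skip-ahead trick is genuinely simple, as a couple of the comments below point out. Here’s a toy Go sketch (my own illustration, not any library’s API) of walking protobuf’s wire format: every field starts with a varint tag carrying the field number and wire type, and length-delimited fields carry their size, so skipping a field you don’t care about is just arithmetic:

package main

import (
	"encoding/binary"
	"fmt"
)

// scan walks a protobuf-encoded buffer, reporting each top-level
// field's number and skipping its bytes without decoding them.
// (A toy; the deprecated group wire types 3 and 4 are not handled.)
func scan(buf []byte) error {
	for len(buf) > 0 {
		tag, n := binary.Uvarint(buf) // tag = field number << 3 | wire type
		if n <= 0 {
			return fmt.Errorf("bad tag")
		}
		buf = buf[n:]
		field, wire := tag>>3, tag&7
		var size int // bytes to skip, including any length prefix
		switch wire {
		case 0: // varint
			_, size = binary.Uvarint(buf)
		case 1: // fixed64
			size = 8
		case 2: // length-delimited: strings, bytes, nested messages
			length, m := binary.Uvarint(buf)
			size = m + int(length)
		case 5: // fixed32
			size = 4
		default:
			return fmt.Errorf("unsupported wire type %d", wire)
		}
		if size <= 0 || size > len(buf) {
			return fmt.Errorf("truncated field %d", field)
		}
		fmt.Printf("field %d: wire type %d, %d bytes\n", field, wire, size)
		buf = buf[size:] // skip ahead; nothing decoded, nothing allocated
	}
	return nil
}

func main() {
	// {field 1: varint 150, field 2: string "hi"}, encoded by hand.
	if err := scan([]byte{0x08, 0x96, 0x01, 0x12, 0x02, 'h', 'i'}); err != nil {
		panic(err)
	}
}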
Comment feed for ongoing:
From: Zellyn Hunter (Dec 10 2021, at 19:41)
fwiw, it would be very simple to process protobufs in a streaming fashion too. They're just nested blobs, with numbered fields, but nested messages also have length prepended, so it's easy to skip ahead.
From: Janne (Dec 11 2021, at 02:39)
Have you seriously considered writing a book? With all your experience and facility with words perhaps you should.
"Computing at Scale", with an O'Reilly animal cover featuring either a blue whale or a large school of fish swimming in synchrony.
From: Douglas Creager (Dec 11 2021, at 06:13)
I _think_ you’re saying that if you’re starting with JSON data, then it’s more efficient to process that directly, than to first translate it into something more typed like Protobuf or Avro, which sounds reasonable.
But if you’re starting from something like Avro, my hunch is that processing that directly would be even faster than processing original JSON directly. You’d be able to use the same streaming tricks, and Avro in particular puts all of the typing information up front in the file header, instead of inline within each message. That means that the streaming parsing (of only the fields you need) and skipping (of the fields you don’t) can be VERY fast.
From: James (Dec 12 2021, at 14:48)
For the record, most protobuf language-support libraries that I've come across come with some form of streaming protobuf parser. I've done streamed protobuf parsing in both Java and JavaScript; the only language support I've come across that doesn't support it is Go. Here's the Java streamed parser; note the ability to skip fields and messages:
https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/CodedInputStream
Sure, these are fairly low-level APIs, and they work at the wire-protocol level: they don't map the field numbers to the high-level field names and/or types, but it isn't that hard to do that. The hardest thing is that unlike JSON, which is human-readable and therefore makes it trivial to understand how language types map down to the wire protocol, Protobuf is opaque to most people; they never bother to understand the wire format, so there is a fair bit more to learn when you go from just working with the generated types to working with streaming the raw underlying datatypes.
From: Jens Alfke (Dec 13 2021, at 18:29)
There are several binary structured-data formats that can be read without parsing or unpacking, and should be even faster to filter than JSON. I can name 3 off the top of my head:
* FlatBuffers / FlexBuffers (Google)
* Cap'n Proto (by the original creator of ProtoBufs)
* Fleece (Couchbase; I wrote this one.)
From: Rob Sayre (Dec 24 2021, at 20:35)
"Lesson: JSON is faster than binary"
I don't think this is true at all, even accounting for existing APIs.
Jackson seems ruled out right away, because I think it will be doing UTF-8 -> Java UCS-2 conversion.
Protobuf, by default, does do something like an XML DOMParser. But this neglects to mention the common gRPC message framing of "unary" vs "streaming", something like XML stanzas in XMPP. If you're working in a language that makes arena allocation easy, it can win big.
As mentioned above, Flatbuffers and Cap'n Proto are even more efficient. If you combine these allocation-efficient formats and an ownership system as in Rust, you can often avoid allocation altogether.
There's just no way a JSON parser hunting for a closing quote in a big string will beat one of these binary formats. If all of the fields are short, it might be close.
The big win is to use a language with UTF-8 strings, avoiding conversion, and potentially allocation.