When you fire an event into the cloud, can you be sure it’ll only come out again once? It turns out that sometimes they come out more often than they go in. This may or may not be a problem in your application. If it is, there are techniques to help work around it.
“At-least once” · This is a phrase you’ll hear a lot when you hang around with eventing/messaging people (and cloud people generally). Builders work so hard at making sure everything gets delivered that they can end up doing it more than once.
How can that happen? · Here’s a scenario: Suppose your software wants to retrieve and process events from a bus or topic or stream or whatever the service is called. And suppose you retrieve one, then something goes wrong before you can acknowledge it. For example, the host your code’s running on might have failed. Or your code just crashed (mine never has bugs, but they tell me it happens). Or your software is written in Java and unfortunately went into a 45-second stop-the-world garbage-collection stall.
When any of these happen, the eventing software will get the idea that you didn’t successfully receive the data, and since its primary purpose in life is to deliver reliably, it will try again. Which means you end up processing it twice.
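The mechanics can be sketched in a few lines. This toy in-memory queue mimics “at least once” delivery: a received message stays invisible until it’s acknowledged, and if the consumer never acks (crash, GC stall), the message becomes visible again and gets redelivered. The class and method names here are made up for illustration, not any real service’s API.

```python
import time

class Queue:
    def __init__(self, visibility_timeout=1.0):
        self.visibility_timeout = visibility_timeout
        self.messages = []

    def send(self, body):
        self.messages.append({"body": body, "visible_at": 0.0})

    def receive(self):
        now = time.monotonic()
        for msg in self.messages:
            if msg["visible_at"] <= now:
                # Hide the message while the consumer works on it.
                msg["visible_at"] = now + self.visibility_timeout
                return msg
        return None

    def ack(self, msg):
        # Delete only after processing succeeds.
        self.messages.remove(msg)

q = Queue(visibility_timeout=0.1)
q.send("hello")

first = q.receive()   # consumer gets the message...
# ...and then crashes before it can call q.ack(first).
time.sleep(0.2)       # the visibility timeout expires

second = q.receive()  # the queue redelivers: same message, processed twice
print(first["body"], second["body"])
```

Real services (SQS among them) use essentially this visibility-timeout dance, which is why a slow consumer looks exactly like a dead one.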
On the way in…
…on the way out.
Sometimes duplicates aren’t your fault. Suppose one data center gets cut off from all the others in an AWS region. Yes, we try really hard to arrange for multiple redundant connections, but shit happens. A little birdie told me that one day a few years ago, a badly-built bridge in Beijing fell down and it had been carrying three different network providers’ fibres.
We call this situation a network partition. When it happens, both sides of the partition will try really hard not to lose any data. The details can get complicated, but duplicate messages are a common result.
Obviously this could be a problem for your app.
What to do? · There are situations where you can ignore the problem. Since duplicates are rare, in an analytics app or anything else that’s doing stats, they’re probably not a big deal. But there are lots of situations where they are.
If your app has a database that does transactions, you’re probably OK, because you can safely remember when you’ve seen each event’s unique ID, and just discard duplicates.
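Here’s a minimal sketch of that approach, using SQLite as the stand-in database. The event ID goes into a table with a PRIMARY KEY constraint, in the same transaction as the real work, so a second delivery of the same event fails the insert and gets discarded. The table and event names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (event_id TEXT PRIMARY KEY)")

def process_once(event_id, body, results):
    try:
        # One transaction: the dedup record and the side effect
        # commit together, or not at all.
        with conn:
            conn.execute("INSERT INTO processed (event_id) VALUES (?)",
                         (event_id,))
            results.append(body)  # stand-in for the real work
        return True
    except sqlite3.IntegrityError:
        return False              # already seen: drop the duplicate

results = []
print(process_once("evt-1", "first delivery", results))      # True
print(process_once("evt-1", "duplicate delivery", results))  # False
print(results)  # ['first delivery']
```

The important part is doing the dedup check and the work in one transaction; a separate “have I seen this?” lookup followed by the work reopens the window for duplicates.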
Another useful technique is idempotency. That is to say, structure your application such that API calls can be repeated without changing the result. An example is anything that can be expressed as a pure HTTP PUT request. You can set a field to a given value as many times as you like without doing any damage. Designing an app to work this way is tricky. But it’s an option that’s worth investigating, because having idempotent operations tends to produce apps that are robust in the face of all sorts of common failure scenarios.
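To make the PUT analogy concrete, here’s a sketch (the store is just a dict; a real app would use a database or object store). Setting a field to a value is idempotent, so a redelivered event is harmless; an increment is not, so a duplicate corrupts it.

```python
store = {}

def put(key, value):
    # Last-writer-wins: repeating this call changes nothing.
    store[key] = value

put("user/42/email", "pat@example.com")
put("user/42/email", "pat@example.com")  # duplicate delivery: no change
print(store)  # {'user/42/email': 'pat@example.com'}

# Contrast: an increment is NOT idempotent.
counter = {"n": 0}

def bump():
    counter["n"] += 1

bump()
bump()  # duplicate delivery
print(counter["n"])  # 2, not the 1 you wanted
```

The design trick, when you can pull it off, is to restate “add one to the count” operations as “set this record to this state” operations.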
Here’s one thing to note: Duplicates are rare, but when they do happen, they tend to come in clusters (think about that bridge in Beijing). I don’t know if that fact is useful in the context of your app, but just in case.
But there are some apps that just can’t live with dupes.
“Exactly once” · That’s the terminology used for software that comes with built-in de-duping. One example would be SQS FIFO. Upon encountering this capability, you might ask yourself “why don’t I just use this for everything?” It turns out, just as with FIFO, de-duping isn’t free and in fact isn’t particularly cheap.
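A sketch of what built-in de-duping looks like from the outside, loosely modeled on SQS FIFO’s deduplication ID: the service remembers recently seen IDs for a window (five minutes, in SQS FIFO’s case) and silently drops repeats. This toy keeps the window in memory; the class name and numbers are illustrative, not the real service’s.

```python
import time

class DedupQueue:
    def __init__(self, window_seconds=300.0):
        self.window = window_seconds
        self.seen = {}       # dedup_id -> time first accepted
        self.messages = []

    def send(self, body, dedup_id):
        now = time.monotonic()
        first = self.seen.get(dedup_id)
        if first is not None and now - first < self.window:
            return False     # duplicate within the window: dropped
        self.seen[dedup_id] = now
        self.messages.append(body)
        return True

q = DedupQueue()
print(q.send("order placed", dedup_id="order-123"))  # True
print(q.send("order placed", dedup_id="order-123"))  # False (deduplicated)
print(len(q.messages))  # 1
```

Keeping that “seen” table consistent across hosts and failures is where the cost comes from, which is part of why this capability isn’t free.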
It can also get kind of complicated and there are more details than you might think. Consider this blog: How the Amazon SQS FIFO API Works. It dives deep on all the details, which ends up taking over two thousand words.
My advice would be to teach your software to live with duplicates, if at all possible. “At least once” systems are just part of the cloud landscape.