Suppose you’re running your organization’s crucial apps in the cloud. Specifically, suppose you’re running them on AWS, and in particular in the “us-east-1” region. Could us-east-1 go away? What might you do about it? Let’s catastrophize!
Acks & disclaimers · First, thanks to Corey Quinn for this Twitter thread, which got me thinking.
Second, while I worked for AWS for 5½ years, I’ve never been near a data center, nor do I have any inside information about the buildings, servers, or networking. On the other hand, I do have a decent understanding of AWS culture and capabilities in software engineering and operations. Bear those facts in mind as you read this.
Finally, since this blog fragment concerns itself entirely with catastrophic scenarios, I’ll try to be cheerful about it.
[Those of you who know what us-east-1 is can skip over the next section to the first entertaining disaster.]
“us-east-1”? · AWS means “Amazon Web Services”, Amazon’s insanely huge ($60B/year revenue) and profitable (~30% margin) collection of cloud-computing services. Basically, AWS will rent you computers and databases and the use of many other software services. So more or less everything your IT department owns can be rented by the hour (or second) rather than installed in your own data center.
If you’re using AWS, you have to pick one (or more) of its (24, as I write) “regions” to host your systems. They have boring names like “us-west-2” (Portland) and “ap-northeast-1” (Tokyo).
“us-east-1” (N. Virginia) is generally thought to be the biggest region, by a huge margin. There have been estimates that 30% of all Internet traffic flows through it. Here’s AWS’s official write-up and here’s a nice Atlantic story by a person who drove around Northern Virginia looking for the actual buildings.
Before we leave the subject, I should say that each AWS region is divided into multiple “availability zones” (AZ’s), data centers that are independently operated and geographically separated, so to really lose a whole region, you’d have to take all of them out.
If us-east-1 went off the air, it would be Really Bad. How could that happen?
Terrestrial disaster · This is the first one anybody thinks of.
Suppose a big late-summer hurricane somehow misses Florida and Texas, cruises north offshore picking up energy from an anomalously-warm western Atlantic, turns left just south of DC, and savages anywhere that’s easy driving distance from Dulles airport. We’re talking about inches of rain in a few hours so every waterway floods; also, high winds and lightning are playing hell with the electrical and network infrastructure.
The other obvious candidate would be an earthquake, which can ravage infrastructure to a degree unequaled by any other flavor of natural catastrophe. Among other things, the Potomac bridges and lots of freeway overpasses would be rubble, so your ability to bring help in would be severely reduced.
If you’re the unlucky proprietor of systems hosted at us-east-1, they’d be off the air, and while AWS would probably arrange to answer your distress call, there’s really not much that could be done. How would your business do if it were off the air for, uh, nobody really knows how long?
How much should you worry? · This one worries me less than a lot of the other scenarios here. First off, the hurricane scenario is so utterly predictable that I bet anyone with a significant data-center presence in the region has been planning and wargaming around this one for at least a decade.
Modern data centers all come with self-contained backup generators and some sort of power-bridging gear, so assuming the water doesn’t actually get in and flood the equipment rooms, things should be fine. You’d expect Internet-provider outages as well, but once again, modern data centers strive for redundant connections and are built in places where there are multiple providers, so they’d all have to go down to go completely off the air.
Having said that, the climate is changing and possibly, everything we know about that storm system will turn out to have been wrong.
The earthquake scenario is tougher, but fortunately that’s not a seismically active zone.
Also bear in mind that the availability-zone architecture is going to help you. You can imagine one data centre’s backup power failing to operate, but it’d be really unlikely for that to happen in all the AZ’s.
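To put rough numbers on that intuition, here is a back-of-envelope sketch. The 5% per-AZ failure chance is a made-up figure, and the arithmetic assumes the AZ’s fail independently, which is exactly the assumption a region-wide catastrophe would break; read it as the best case for how much the architecture helps.

```python
def all_zones_fail(p_single: float, zones: int) -> float:
    """Chance that every zone's backup power fails, assuming independent failures."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    return p_single ** zones


# Hypothetical figure: a 5% chance that any one AZ's backup power fails.
for zones in (1, 3, 6):
    print(f"{zones} AZ(s): {all_zones_fail(0.05, zones):.2e}")
```

The probability drops fast as zones are added, but only as long as one flood or one quake can’t reach all of them at once.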
I’m not sure this is much consolation, but: If an event of this scale occurs, you’re not going to be the only operation who’s off the air. Probably, quite a lot of the United States government would be in the same boat. So while your customers and employees are going to be mad, they’re also going to be distracted from worrying about your downtime.
Extraterrestrial disaster · What about devastation raining down from space?
Sophie Schmieg is a high-level cryptography/security Googler, and Knows What She’s Talking About. She refers to the Carrington Event, a major solar storm (“Coronal Mass Ejection” they say) that happened in 1859, and severely disrupted the world’s telegraph system for about eight hours. This is an example of a Solar proton event. If/when one happens, it’s going to seriously suck for astronauts and for anyone who depends on aerial radio-frequency communications. How hard will it hit modern data-center and Internet infrastructure? The deepest dive on the subject seems to be Solar Superstorms: Planning for an Internet Apocalypse (PDF) by Sangeetha Abdu Jyothi.
Physicists I’ve talked to say “Yeah, that’s gonna happen someday.” Bear in mind that since the duration is measured in hours, we might get lucky and find us-east-1 facing away from the sun.
How much should you worry? · I figure that this is actually a more likely disaster scenario for us-east-1 than either the hurricane or the earthquake. But I’ve got no special insights into how much it will hurt. In Abdu Jyothi’s paper, she offers lots of specific recommendations about how to solar-storm-proof the infrastructure. How much have the operators of us-east-1 tried, and how well will their efforts work? We don’t know.
However, as with the terrestrial disasters, your personal pain may not matter that much. After all, as Abdu Jyothi points out, “A recent study … which analyzed the risks posed by a Carrington-scale event to the US power grid today found that 20 - 40 million people could be without power for up to 2 years, and the total economic cost will be 0.6 - 2.6 trillion USD.” So… there’s not going to be much leftover attention for your little outage.
Labor unrest · It’s increasing around the globe as multiple decades of increasing inequality in wealth and power bite down harder and harder. Also, it may turn out that Covid has disturbed the balance of power between the working and owning classes. A wave of Big Tech unionization would be surprising, but not that surprising.
So here’s the scenario: Some group of employees whose services are essential for the operation of us-east-1 wins a unionization vote and starts trying to negotiate a contract with AWS, because they’re looking at that 30% margin on the tens of billions in revenue.
Unsurprisingly, Amazon goes all hard-ass, explains that unionization is incompatible with Day One thinking and Amazon Leadership Principles, and refuses to talk. So they take a strike vote, and on one fine spring day, don’t come to work. Nobody’s watching the graphs, whether those are graphs of electrical-supply stability, fiber-repeater failures, or data-storage latencies. How long does us-east-1 stay operative? I have no idea. But it’s a terrifying scenario.
It’s going to be difficult to explain to your customers that you can’t service them because of a labor dispute between a company they’re not dealing with and a union that doesn’t contain any of your employees.
How much should you worry? · Not at all. This will never happen.
Let’s ignore the passion and fury with which Amazon will resist unionization, and suppose hypothetically that things proceed as described, the strike vote passes, and it’s becoming apparent that several thousand essential workers are absolutely not going to show up on a near-future morning. What happens? Amazon caves instantly and does whatever it takes to come to a settlement with the workers.
The company is always talking customer obsession and that’s no BS, they really mean it. Failing to provide services that customers pay for and rely on because of internal management failure (and this is one of those) is violently antipathetic to Amazon culture. So they just won’t let it happen.
AWS software or operational failure · I’m talking about something like what happened to Facebook this month: For reasons that nobody who’s not a serious software geek can understand, us-east-1 suddenly vanishes from the network. Or is still on the network but is refusing all requests. Or is accepting requests but timing them out. Or is accepting requests but returning empty answers.
Once again, you’re in a bad spot when you have to explain to your customers that you’re off the air because you made a bet on a provider who couldn’t deliver the goods.
How much should you worry? · I’m not going to say this could never happen. But I’d be shocked. AWS has been doing cloud at scale for longer than anyone, they have the most experience, and they’ve seen everything imaginable that could go wrong, most things multiple times, and are really good at learning from errors.
Also, AWS has a powerful and consciously-constructed culture of operational excellence based on extreme paranoia. To be honest, I’m just the tiniest bit concerned over the recent departure of Charlie Bell, because he, more than anyone else, deserves credit for building and maintaining that culture. But it runs very deep.
War · It doesn’t seem likely that foreign attackers are going to swarm ashore on the Virginia beaches and send tank battalions through the industrial parks to blow up us-east-1. So maybe you don’t need to worry?
But wait; how about civil war? Let’s see; suppose Trump wins the Republican nomination for 2024, and runs on a rabble-rousing campaign of Revenge For The Steal, and explicitly rallies the Proud Boys, Oath Keepers, Sovereign Citizens, Three Percenters, Groypers, and police unions, telling them, “We can’t lose in a fair election, so if we do, let’s not let them steal it again.”
His election rallies are stuffed with Second-Amendment fanatics brandishing assault weapons. Every debate and campaign interview features questions along the lines of “If you lose, will there be an insurrection?” The majority of voters are out of patience with Trump and vote in Kamala Harris by a decent popular margin, but once again it’s a squeaker in the Electoral College.
The Trump supporters scream “Steal!” and launch a march on Washington; it turns out they have support from significant factions in the police forces and the US armed forces. Northern Virginia becomes a key strategic battleground, and both sides deploy heavy artillery…
OK, that’s a little far-fetched (I hope). Here’s another scenario: Beijing launches an invasion of Taiwan and the US comes to its defence. China’s cyberwar apparatus turns out to have discovered multiple zero-day attacks against Internet exchanges, poison pills that knock BGP off the air and keep it from coming back up. In this scenario, us-east-1 may be up and running, but nobody can reach it.
How much should you worry? · Probably not very much. Like the hurricane or solar storm, your problems are going to vanish in the static.
Enemy action · In this scenario, the Bad Guys (who knows, maybe those Chinese cyberwarfighters I just mentioned) figure out some combination of poison pills and DDOS and Linux kernel zero-days to knock over us-east-1 and keep it that way.
Once again, there you are explaining to your customers why AWS’s incompetence is screwing up their lives.
How much should you worry? · Not at all; I just can’t see this happening. I remember an AWS meeting with a customer looking at moving to the cloud, who asked “What about DDOS attacks?” The Amazon executive in the room said “Yeah, there’s probably three or four of those going on right now, they’re a cost of doing business for us.” There’s nobody in the world with more experience than AWS in dealing with this kind of crap.
But there’s a bigger reason. The vast majority of hackers are in it for the money, and they know perfectly well that AWS has one of the best-defended attack surfaces on the planet. So it’s in their interest to go after softer targets; big companies with juicy customer lists and password files and so on who aren’t minding their perimeters.
Note: You might be one of those big companies; while AWS is generally secure, it’s possible to run insecurely on it. So while the Bad Guys might come after you, they’re almost certainly not going to go after us-east-1 as a whole.
Public legal risk · It seems quite unlikely that any force of nature or criminal action could wipe out us-east-1. How about the US Government? Bear in mind that Republicans hate Amazon because of Bezos’s Washington Post and because the whole tech industry is (somewhat correctly) perceived as progressive.
Suppose Trump or some guttersnipe like Cruz or DeSantis wins the Presidency in 2024, and the Republicans control congress. Could AWS survive a US Federal legal move that forced a us-east-1 shutdown? Could it even survive a continuous credible threat of such a thing happening? The temptation might be too much for the GOP goons.
How much should you worry? · I would. But in a more general way; the existential peril to the USA following on the exercise of power by the Trumpist faction seems to me very severe, not something that can be ignored. So I would be watching which PACs I donated money to, and encouraging grassroots political activism to stave off the wreckage before it happens.
But then, I’m on the respectable left of the Canadian political spectrum, which makes me a raving Commie by US standards.
Surviving · Let’s assume you’re not going to wait for us-east-1 to come back, you want to resume operations elsewhere. So, you need to pick another region. Depending on which scenario worries you the most, you might want to be (as Sophie Schmieg suggested) in a different hemisphere, or if you’re worried about political/legal risks, at least a different jurisdiction.
The best thing you could possibly do is, don’t wait: Run “active-active”, which is to say have your application live in both regions all the time. Netflix kind of wrote the book on this, for example consider this 2013 write-up. I’ll be honest: I don’t know if Netflix has ever actually failed over in the face of an actual region outage. But their thinking is correct: The only way you can be sure that your backup region will run in production is by running it in production.
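A toy sketch of the active-active idea, not how Netflix or anyone else actually does it: every request goes to one of the regions currently passing health checks, so the "backup" region is serving real traffic every day. The function name, the 50/50 weights, and the choice of eu-west-1 as the second region are all assumptions for illustration; in practice this decision usually lives in DNS or a global load balancer rather than in application code.

```python
from __future__ import annotations

import random


def pick_region(health: dict[str, bool], weights: dict[str, float],
                rng: random.Random) -> str:
    """Choose a region for one request, weighted among the healthy ones."""
    live = [region for region, ok in health.items() if ok]
    if not live:
        raise RuntimeError("no healthy region available")
    return rng.choices(live, weights=[weights[r] for r in live], k=1)[0]


rng = random.Random(0)
weights = {"us-east-1": 0.5, "eu-west-1": 0.5}  # hypothetical even split

# Normal day: both regions take production traffic.
print(pick_region({"us-east-1": True, "eu-west-1": True}, weights, rng))

# us-east-1 is gone: every request lands on the survivor.
print(pick_region({"us-east-1": False, "eu-west-1": True}, weights, rng))
```

Failover here is nothing special, just the routing rule that was already running with one fewer candidate.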
But let’s suppose you’re less ambitious; you’re not going to try to keep operations running continuously in the case of a failed region, you just need to be able to get back on the air in a reasonable amount of time, probably accepting that some transactions happening just as disaster struck might get lost.
Your app inventory, if it’s typical, probably includes virts running your code, along with load-balancing and fire-walling gear, and your code accesses a variety of services such as messaging systems and databases and serverless stuff. Let’s assume you’ve got your configurations all stored as code with Terraform or CloudFormation or whatever, so that if you needed to rebuild the system from scratch, you could. You do, right? Seriously, given that, if us-east-1 got blown to hell and you have a copy of the config code, revivifying your app is plausible.
Then there’s your data, which lives in some combination of databases, filesystems, and S3.
S3 has had region-to-region replication built in for a long time, and clearly people at AWS have been thinking about this; consider Introducing Multi-Region Asynchronous Object Replication Solution.
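For a sense of what turning on that replication involves, here is a sketch that builds a one-rule replication configuration as a plain dictionary. The role ARN, account number, and bucket names are placeholders, and the field layout is written from memory of the S3 replication API, so check it against the current AWS documentation before relying on it.

```python
import json


def replication_config(role_arn: str, dest_bucket: str, prefix: str = "") -> dict:
    """Build a one-rule S3 replication configuration as a plain dict."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "replicate-to-backup-region",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }
        ],
    }


config = replication_config(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",  # placeholder
    dest_bucket="example-backup-bucket",  # placeholder, lives in the other region
)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, this would be applied roughly as:
#   boto3.client("s3").put_bucket_replication(
#       Bucket="example-source-bucket", ReplicationConfiguration=config)
# Both buckets need versioning enabled before replication can be switched on.
```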
If it were me in my ideal world, I’d have copies of everything stored in S3 because of its exceptional durability; I sincerely believe there is no safer place on the planet to save data. Then I’d have a series of scripts that would rehydrate all my databases and config from S3, reconfigure all my code, and fire up my applications. I’d test this script regularly; any more than a few weeks untested and I’d lose confidence that it’d work.
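The skeleton of such a script might look like the following. Everything here is hypothetical: the step names are placeholders for real work (pulling dumps from S3, applying config, starting services), and "a few weeks" is read as three.

```python
from datetime import datetime, timedelta, timezone

MAX_DRILL_AGE = timedelta(weeks=3)  # assumption: "a few weeks" taken as three


def drill_is_stale(last_success: datetime, now: datetime) -> bool:
    """True when the last successful restore drill is older than MAX_DRILL_AGE."""
    return now - last_success > MAX_DRILL_AGE


def run_restore(steps):
    """Run (name, callable) steps in order, stopping at the first failure.

    Returns (names of completed steps, error message or None).
    """
    done = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            return done, f"{name} failed: {exc}"
        done.append(name)
    return done, None


# Placeholder steps standing in for the real rehydration work.
steps = [
    ("rehydrate databases", lambda: None),
    ("apply configuration", lambda: None),
    ("start applications", lambda: None),
]

done, error = run_restore(steps)
now = datetime.now(timezone.utc)
print("completed:", done)
print("error:", error)
print("five-week-old drill stale?", drill_is_stale(now - timedelta(weeks=5), now))
```

Wiring `drill_is_stale` into an alert is the cheap part; the value comes from actually running the drill against a scratch region on a schedule.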
Anyhow… · We probably won’t lose us-east-1. I’m not absolutely 100% sure that these scenarios are even worth thinking about, in a strictly economic sense. But if I were running a big important app, I wouldn’t be able to not think about it.
Comment feed for ongoing:
From: Jon Stewart (Oct 10 2021, at 18:05)
Virginia had a 5.8 earthquake in 2011, with the epicenter right about between Richmond and Charlottesville. “Meh, 5.8,” you scoff. But because the geology of the east coast is so different than the west, the quake traveled far, far longer than a 5.8 on the west coast would. It caused pretty severe damage to the Washington Monument and the National Cathedral.
https://en.m.wikipedia.org/wiki/2011_Virginia_earthquake
While exceedingly rare, an earthquake in northern Virginia is likely to have a much greater area of effect than on the west coast, nor are building codes designed for them. A data center might have more robust building standards but it still depends on other infrastructure to be effective. So, it’s not inconceivable that a ~7 Richter scale earthquake in NoVA could put a real crimp in us-east-1.
From: Beth Macknik (Oct 10 2021, at 20:57)
Yes, that earthquake was rated severe VIII on the shake index (more applicable for this application than the Richter scale,) and was felt by more Americans than any previous earthquake. It took years to repair major buildings in the DC area.
A stronger earthquake would be devastating.
From: Tim (but not THE Tim) (Oct 10 2021, at 22:20)
Nice article. My only comment is about "they’ve seen everything imaginable that could go wrong" - it's the things you don't manage to imagine that bite you hard.
From: Dave Pawson (Oct 11 2021, at 00:13)
Well thought out piece Tim. Easy on the acronyms (or explain on first use) please? BGF?
From: anonymous (Oct 11 2021, at 05:11)
The most likely cause of an outage is AWS Networking breaking an update and taking the region off the air.
Like they did with Brazil last year.
I would point you to the COE but you don't have access anymore. AWS Networking are substantially worse than other parts of AWS.
From: Nathan (Oct 11 2021, at 06:10)
In the political insurrection hypothetical, you have Ms. Harris winning the presidency. I'm curious what happened to President Biden in this scenario!
Love the worst-case brainstorming though. If these conversations are NOT happening in high-level meetings at companies across the planet, they definitely should be.
From: Rob (Oct 11 2021, at 09:01)
For me, your last bit gets at the most interesting question-- what happens after the Event. The strength/impermeability of a thing often appears to be less important than (and usually antagonistic to) its resiliency-- How much damage can it take and keep on functioning? How quickly can it spring back after being damaged or taken down? Wasn't resilience the raison d'etre for the design of the internet in the first place?
After all practically the defining feature of a catastrophe is precisely its non-predictability...
From: folderk (Oct 11 2021, at 09:35)
Is this post mostly an excuse to post a few raving-lunatic-level pro-Democrat doom scenarios?
From: tom (Oct 12 2021, at 01:13)
a point I didn't see mentioned in this article is edge stuff, specifically how cloudfront interacts with other components.
I can deploy a cloudfront in any region i want, it is region specific unlike for eg iam, but if I want to reference an edge lambda (common) or an ACM certificate (very common) then both the edge lambda and the ACM cert must be shipped to us-east-1 regardless of where the cloudfront goes.
cloudfront will only use certs and edge lambdas from us-east-1
From: Aidan Kehoe (Oct 13 2021, at 11:46)
Nathan, President Joe Biden is 80. The actuarial tables surprise me a little, in that the chances of an 80 year old making it to 81 are nineteen in twenty, but a one in twenty chance of death is completely within the bounds of possibility. And the chances of an eighty year old making it to 84 are four in five, not nineteen in twenty. It would be completely unremarkable if Vice-President Harris were the next US Democratic candidate for president.
From: JW (Oct 13 2021, at 15:10)
One scenario I'm missing is infiltration by a terrorist or spy. What is the chance that there are already, at this moment, Chinese spies working in Amazon data centres? If I were the head of the "Chinese NSA", I'd have sent some of my spies there already. (If I were the head of the NSA, same in fact.) And if I was trying to be the next Al Qaeda, maybe too.
So imagine a conflict between the US and China over Taiwan, or a war with Iran, or a new terrorist group targeting the US. What are the chances they can get people inside and what damage can they cause?
From: Anurag (Oct 18 2021, at 07:08)
The only time I remember where there was a meaningful risk of a region outage was the blizzard in N. Virginia. The buildings weren't built to handle lots of snow on the roof. Thankfully, it passed...
From: Rick Wagner (Oct 18 2021, at 07:12)
I agree with the contributor who asked if the post is an excuse to make loony left-leaning statements.
Seriously, Trump supporters and cops rioting, but no possibility of leftist riots? Statistics would indicate lefties riot much more, and cause much more damage.
Outages because of unionization?
Needs balance. Badly needs balance.
From: Brendan (Oct 19 2021, at 22:01)
No mention in this post or comments so far of cascading or compounding physical effects and other “unforeseen” multipliers, for example the Fukushima nuclear disaster from the adjacent 2011 Tōhoku earthquake *and* tsunami, or for another example the accumulated risk from a severe cold wave impacting failing infrastructure as we saw in Texas this year (Texas power crisis), or a meteorite or asteroid.
From: JJ (Oct 21 2021, at 08:33)
wondering why you didn't mention last November's Kinesis outage in us-east-1; that was a big deal.
From: Geordie (Oct 21 2021, at 08:59)
One thing that is missing from this discussion is the reality of what many customers' workloads are and the difference between data unavailability and data loss. I work in the backup industry (#IWork4Dell but opinions are my own to satisfy the lawyers) with most of my customers running "traditional workloads" in the cloud.
When customers approach me about multi-region or multi-cloud backups, my response is how long do you think it will take you to fully recover your DNS and directory services infrastructure in a brand new data center? Then how long will it take to get the dozens or hundreds of interlocking application services up after that? In any conceivable scenario is that less time than it will take AWS to get us-east-1 operating normally? Almost always the answer is no.
Geographically separated datacenters ARE worthwhile but doing it in a way that increases reliability requires a holistic strategy that takes CAP theorem into account at all levels of the design.
From: Stephen (Oct 21 2021, at 09:58)
Thanks for thinking through all these scenarios. Great article.
My only complaint is that "us-west-2" isn't Portland, it's The Dalles, about 2 hours east of Portland. You could simplify that to "Oregon" and it'd be fine though.
From: Al Sargent (Oct 22 2021, at 14:57)
Great article, thanks. One takeaway is the importance of having backups of data and applications in the Southern Hemisphere, to address the risk of a major solar flare (Carrington Event). That makes AWS South Africa worth considering even if you don't have customers in Africa. And of course, AWS Sydney, São Paulo, and their upcoming New Zealand region, even if you don't have customers in those locations. If such a flare would occur during the Northern Hemisphere summer, having backups in the southern hemisphere, and distributed across time zones, could save your business.
From: Fantomking (Oct 25 2021, at 01:07)
The approach of Netflix is more likely right: 'The only way you can be sure that your backup region will run in production is by running it in production'. But if us-east-1 fails at 100%, everyone will run away to other regions, and with the sudden high load on those regions, it probably just crashes down region by region.
From: Gavin B (Dec 08 2021, at 04:23)
BUMP:
Amazon Web Services outage hits sites and apps such as IMDb and Tinder.
Users in North America and Europe report patchy service after cloud computing goes down
https://www.theguardian.com/technology/2021/dec/07/amazon-web-services-outage-hits-sites-and-apps-such-as-imdb-and-tinder