This is eventually about the public cloud and Open Source, but — apologies in advance — takes an indirect and long-ish path.
In AWS engineering, we develop stuff and we operate stuff. I think the second is more important.
We have good hardware and software engineers, and infrastructure that feels pretty magic to me (faves: the racks and networking gear, the consensus manager underlying QLDB, and the voodoo that makes S3 go). But, like Bill Joy said, “Wherever you work, most of the smart people are somewhere else”, so I’m not gonna kid myself that we’re magically unique at programming.
But on the operations side, the picture is really unique. First of all, there are very few places in the world where you can get operational experience at this scale. Second, AWS doesn’t run on SRE culture; the same engineers who write the code live by the dashboards and alarms and metrics that try to reflect and protect the customers’ experience (not perfectly, but we make progress).
The obsessive focus on operational excellence isn’t subtle and it’s not a secret. There’s been a re:Invent presentation about how we run our ops meetings and we even open-sourced the AWS Ops Wheel.
But it’s not all meetings. We build and deploy a lot of technology with no direct connection to any feature or function or API that a customer will ever see. These are all about having the right dashboards, and being able to extract the key ratio from petabytes of logs, and predicting what might melt down before it even gets warm.
The asshole ratio · I’ve already written that at AWS, it’s lower than I experienced at other BigTech outfits. Here’s why this is relevant: There is plenty of evidence that you can be a white-hot flaming asswipe and still ship great software. But (going out on a limb) I don’t think you can be an asshole and be good at operations.
Because ops requires being humble in the face of the evidence, acknowledging fallibility, assuming that the problem is your problem even when quite likely it’s not, and always being eager to investigate theories B, C, and D even when you’re pretty sure your current theory A is right-on. Since problems in complex services are almost never solved by a single individual’s efforts, you have to be good at working with people under pressure.
Those LPs · I have a hypothesis about that good ratio and it involves the Amazon Leadership Principles (we just say LPs). I’ve gotten flak from friends who think having such things is lame and corny. But in practice they turn out to be useful, and to explain how, I’m going to take a side-trip into modern clinical medicine.
There’s this guy Atul Gawande, a surgeon and writer whose work I’ve admired, mostly in The New Yorker, for years. I recommend pretty well anything he writes but in particular I recommend The Checklist Manifesto. Do me (and yourself) a favor, follow that link and read the Malcolm Gladwell review excerpt. From which:
“…the routine tasks of surgeons have now become so incredibly complicated that mistakes of one kind or another are virtually inevitable: it’s just too easy for an otherwise competent doctor to miss a step, or forget to ask a key question or, in the stress and pressure of the moment, to fail to plan properly for every eventuality.” [Sounds just like updating a million-TPS Web Service. -Tim] “Gawande then visits with pilots and the people who build skyscrapers and comes back with a solution. Experts need checklists–literally–written guides that walk them through the key steps in any complex procedure.”
Well, one insanely-complex routine task that we do all the time is hiring. You know what the LPs are at hiring time? A checklist. Now even the typical all-day interview marathon isn’t gonna reliably dig into every LP, but we do an acceptable job of taking a close look at enough of them. I believe that’s very helpful in bringing down the asshole ratio.
Open Source · Which brings me to the touchy subject of the relationship between Cloud Providers and Open Source. We and our competitors have made a good business of infrastructure operations, keeping service-oriented software servicing; reliably, durably, 24/7/365. The core EC2 business is about operating Linux boxes and IP networking at extreme scale, efficiently enough that we can rent them out at an attractive price and still make a buck.
In recent Open-Source years, some very gifted people have created wonderful pieces of software — Kafka, ElasticSearch, Mongo — and taken a new course, launching VC-financed companies to monetize with service and support. Then sometimes they find themselves competing with multiple public-cloud providers.
I have a load of sympathy for the virtuoso engineers who created these wonderful pieces of work. But here’s the thing: I have at least as much for the customers who (let’s take Kafka for an example) just need reliable high-performance streaming. A direct quote: “I’ll cheerfully pay monthly to never worry about Zookeeper again.”
On the other hand, I have little sympathy with modern VC-driven business models.
It’s like this: The qualities that make people great at carving high-value software out of nothingness aren’t necessarily the ones that make them good at operations. This has two unfortunate effects: They don’t necessarily have the right skills to build and run a crack operations team, and they might not manage to get a job at an operations-obsessed company.
I have recent personal experience with failing to hire a senior committer to a well-known OSS project, and also with paying an “open-source company” for tech support when we were spinning up a service around a package we didn’t know very well. Both of these left me unhappy.
Jack and Jonathan · Let me tell you a story. Sometime around 2008, I and Jonathan Schwartz, then the CEO of Sun Microsystems, made a sales call on Jack Dorsey at Twitter. Sun had acquired MySQL and Twitter was using the hell out of it. We wanted them to start paying us for support; after all, they were existentially dependent on this technology and everyone knew that serious Enterprises would never use unsupported software.
Jack was nice, and listened to our pitch, but we didn’t get the business.
And while, as a career software guy, I entirely love open-source culture and technologies and methods, the hypothesis that Open Source in and of itself constitutes a business model is not well supported by the evidence.
Which way forward? · Google Cloud’s recent Open Source partnerships are interesting. I look at that list of companies and it’s not obvious to me that they’re going to offer better operational excellence than Google’s, but maybe I’m wrong. It’s an interesting and probably useful experiment.
At the end of the day I’m not that worried. Most of us who’ve open-sourced stuff love the creative process for its own sake; touching and improving other engineers’ lives. The skillset evidenced by having done so will probably help you get really good jobs. Yeah, you might not get to be a Bay Area Unicorn. But you probably weren’t going to anyhow.
Comment feed for ongoing:
From: Paul Ramsey (Sep 22 2019, at 11:51)
I too have little to no sympathy for the open core VC companies, who frankly only do open source as an adoption channel, not out of any sense of reciprocal obligation and value creation. But glossing the cloud/OSS issue as being exclusively about the open core companies misses a lot of territory. I work in the PostgreSQL ecosystem, and find it incredibly odd that the development effort is provided almost exclusively by a handful of relatively small pure-play OSS support companies, while the cloud companies, who almost certainly derive the lion's share of the dollars made in the PgSQL world, contribute very little. It seems like those kinds of imbalances are not sustainable in the long term.
From: Geoff Arnold (Sep 22 2019, at 13:37)
Hey there, Tim. Long time, etcetera.
So.... I agree with pretty much everything you write here. Open source isn't a business model, operational excellence is what people will pay for, and so forth. I'm applying most of these ideas in my current gig.
But there's a big elephant in the room, and I think you really need to acknowledge its presence. The fact is that even if a small software company follows your advice, and does a stellar job in operations and support, Amazon or Google is often (usually) going to be able to win the business because of (effective) monopoly power over pricing and multi-service integration.
It doesn't matter whether AWS/Azure/GCP actually exerts this power, of course. Lack of transparency, and dependency on underlying services will do the trick. If I'm using Cloudwatch for some of my services, I'm likely to prefer a new service offering which is Cloudwatch-enabled. And of course these underlying services are rarely open source...
So keep on touting the excellence of AWS operations and reminding people that stellar "ops" is much harder than brilliant "dev". But please recognize that there are other economic forces at work.
--
PS One more point: I'm curious about your criticism of VC business models. In most cases, the objective is to pump up the visibility and influence of a startup to the point that a larger company will buy them. IPO unicorns are nice, but rare. And the way to make an innovation influential is to open source it, either stand-alone, or as part of a broader community (hello, Kubernetes!). Trying to compete with a closed-source product against an open source alternative is really tough in today's market.
From: John Cowan (Sep 22 2019, at 14:26)
You know what I like about Amazon? I'm a <i>customer</i>. I pay my money (not much) and you give me what I want (enough for me). I deal with Google and lots of other Big Tech companies too, but I never forget that I'm not the customer there. "The TV business is like the chicken business: the chickens think that because they get fed, they're the customers."
From: PeterL (Sep 23 2019, at 19:37)
"... AWS doesn’t run on SRE culture; the same engineers who write the code live by the dashboards and alarms and metrics"
Been drinking the KoolAid? ;)
I've lived this culture at Yahoo and I can assure you that the Google SWE/SRE system works better, at least in groups where SWEs and SREs have figured out a way to work together (e.g., the parts of Google that bring in the $$$s). The SREs work to keep the data centre and all its lower level parts working (Borg, Colossus, Bigtable, networks, power fail-over, etc.); SWEs deal with the higher level stuff and work with the SREs to make sure things don't break on deployment. (I could talk more about this but don't want to break my NDA.)
There are parts of Google where SWE+SRE has become SWE vs SRE and that's not good (I experienced that for a bit, when I switched to another group). But overall, I think Google gets it right and SWE-as-SRE is not the way to go; the characteristics of a good SRE aren't always the characteristics of a good SWE.
(BTW, it's common for a Google SWE to take a 6 month temporary assignment as an SRE -- SRE is a very different world from SWE.)
If you want to go into this deeper, you can buy me coffee next time I'm in YVR or next time you're in Silicon Gulch.
From: Alex (Oct 30 2019, at 17:15)
I'm curious to know how much money Amazon has donated to the OSS distributions and OSS software developers, to encourage them to continue working on "free software" as opposed to resorting to an open core model with VC funding. You seem to imply these software companies got VC funding before becoming popular, but in fact they got VC funding because they were popular and couldn't afford to work for "free" anymore. Amazon could easily solve that problem, but instead Amazon has always taken advantage of popular OSS exclusively for their own good, to the detriment of the original developers.
From: Miles (Oct 30 2019, at 21:34)
Point taken -- AWS is better at ops than the developers who create open source tools.
Then why not pay a fraction of the money you're earning on offering these tools back to the open source community? Using something like GitHub's new funding tools?