Early this year, Amazon’s S3 service experienced a major outage. For about four to five hours on February 28th, many high-profile software-as-a-service (SaaS) providers that depend on S3’s East Coast region experienced outages as well, leading to the claim that “Amazon broke the internet.”
In the time since Amazon recovered from their service outage, I have seen a few different types of reactions around the internet community, which I’ll attempt to summarize below:
After witnessing these groups of reactions, one thing became very clear to me: for a development team working with the cloud, the choices (as with so many choices in development) are all about trade-offs.
On the flip side, reducing risk by using multiple regions or providers carries a cost of its own. Beyond the explicit cost of keeping a copy of a company’s data in another data center, there is also the risk and associated cost of synchronizing that data, securing it, testing it, and ensuring appropriate failover mechanisms.
Given the low likelihood of a regional failure on a service like Amazon’s S3, businesses felt confident that the trade-off was worth the risk. Now that an outage on this scale has actually occurred, I imagine businesses will be re-evaluating the impact of that risk.
Personally, my bet is that enough customers understand scale and upstream dependencies that, when something on the scale of this outage happens and they’re informed about it via the media and the provider itself (in this case, Amazon), they’re willing to be forgiving. As a result, the customer relationship doesn’t suffer as much. I’ve also noticed customers being much more forgiving when a business employs good DevOps practices, such as status pages and clear communication.
But what if an outage scenario involved data loss? The calculus would change altogether.
So, “expect things to fail” really means “make your trade-offs based on a small likelihood of downtime, not a 0% chance of downtime, and plan accordingly.” For a service that can survive an hours-long outage, the cost-saving trade-offs are a no-brainer. Services that are mission-critical or have volatile customer bases do not have this luxury, and they may need to build fault tolerance across regions and cloud providers, at a cost to themselves and possibly their end users.
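To make that trade-off a little more concrete, here is a minimal back-of-the-envelope sketch. Every number in it is hypothetical (the outage frequency, the hourly losses, and the yearly cost of running a second region are all assumptions chosen for illustration), and the function names are my own. It compares only the expected yearly loss from downtime against the yearly cost of redundancy.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, loss_per_hour):
    """Expected yearly loss from outages: frequency x duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def redundancy_pays_off(expected_outage_cost, redundancy_cost_per_year):
    """True when the expected loss exceeds what redundancy would cost per year."""
    return expected_outage_cost > redundancy_cost_per_year


# All figures below are hypothetical, chosen only to illustrate the comparison.
OUTAGES_PER_YEAR = 0.2             # assume one regional outage every five years
HOURS_PER_OUTAGE = 5               # roughly the length of the outage described above
REDUNDANCY_COST_PER_YEAR = 30_000  # assumed cost of a second region (storage, sync, testing)

# A service that can ride out a few hours of downtime (assumed $2,000 lost per hour).
tolerant_service = expected_annual_outage_cost(OUTAGES_PER_YEAR, HOURS_PER_OUTAGE, 2_000)

# A mission-critical service (assumed $100,000 lost per hour).
critical_service = expected_annual_outage_cost(OUTAGES_PER_YEAR, HOURS_PER_OUTAGE, 100_000)

print(redundancy_pays_off(tolerant_service, REDUNDANCY_COST_PER_YEAR))
print(redundancy_pays_off(critical_service, REDUNDANCY_COST_PER_YEAR))
```

With these made-up inputs, redundancy does not pay for itself for the tolerant service but does for the mission-critical one. The sketch deliberately leaves out harder-to-price factors such as data loss and customer churn, which can change the answer entirely.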
In IT, we are always in the business of value and trade-offs. As SaaS providers are now finding out, the key is knowing your customers’ expectations for the value your business provides, and making the correct trade-offs to deliver that value.