Early this year, Amazon’s S3 service experienced a major outage. For about 4-5 hours on February 28th, many high-profile software-as-a-service (SaaS) providers that depend on S3’s US East region experienced outages as well, leading to the claim that “Amazon broke the internet.”
In the time since Amazon recovered from their service outage, I have seen a few different types of reactions around the internet community, which I’ll attempt to summarize below:
- Angry customers, who were displeased that their service of choice, which they depend on, had an outage.
- Understanding customers, who patronize these services and understand that when operating at this level of scale and value, sometimes things happen.
- Unforgiving developers, who state that builders of SaaS apps should expect things to fail (as Jeff Bezos has previously advised) and therefore should have spread their applications across different regions, clouds, or services to mitigate an outage like this one.
- Forgiving developers, who believe that SaaS apps have made the best decision for their customers’ costs and satisfaction by taking on a little risk with a single-region setup.
- Cautious adopters, who are concerned that moving into the cloud means yielding some control over their own destiny.
After witnessing these groups of reactions, one thing became very clear to me – for a development team working with the cloud, their choices (as with so many choices in development) are all about trade-offs.
Reducing risk by using multiple regions or providers carries its own cost. Beyond the explicit cost of keeping a copy of a company’s data in another data center, there is also the cost of synchronizing and securing that data, testing the replicas, and ensuring that failover mechanisms actually work when they are needed.
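To make the failover part of that cost concrete, here is a minimal sketch of client-side regional failover for object reads. The region names, the in-memory `store`, and the `fetch_object` helper are all hypothetical stand-ins for a real storage client; the point is only to show the extra logic an application takes on when it spans regions.

```python
# Hypothetical sketch: read from a primary region, fall back to a secondary.
# The regions, store, and fetch logic are illustrative, not a real S3 client.

PRIMARY = "us-east-1"
SECONDARY = "us-west-2"

def fetch_object(region, key, store):
    """Simulate fetching an object from one regional store; raise on outage."""
    if store.get(region) is None:
        raise ConnectionError(f"{region} is unavailable")
    return store[region][key]

def fetch_with_failover(key, store, regions=(PRIMARY, SECONDARY)):
    """Try each region in order, returning the first successful read."""
    last_error = None
    for region in regions:
        try:
            return fetch_object(region, key, store)
        except ConnectionError as err:
            last_error = err
    raise last_error  # every region failed

# Simulate an outage in the primary region: the read still succeeds
# because the secondary region holds a replicated copy.
store = {"us-east-1": None, "us-west-2": {"report.csv": b"data"}}
print(fetch_with_failover("report.csv", store))  # b'data'
```

Even this toy version hints at the hidden costs the paragraph describes: the secondary copy must exist and be in sync before the fallback is worth anything, and the failover path itself has to be tested regularly or it will fail exactly when it is needed.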
Given the low likelihood of a regional failure on a service like Amazon’s S3, businesses felt confident enough that their trade-off was worth the risk. Now that an outage on this scale has actually occurred, I imagine businesses will be re-evaluating the impact of that risk.
Personally, my bet is that enough customers understand scale and upstream dependencies that, when something on the scale of this outage happens, they are willing to be forgiving, provided they are informed about the outage via the media and the provider itself (in this case, Amazon). As a result, the customer relationship doesn’t suffer as much. I’ve also noticed customers being much more forgiving when good DevOps practices, such as status pages and clear communication, are employed by a business.
But what if an outage scenario involved data loss? The calculus would change altogether.
So, “expect things to fail” really means “make your trade-offs based on a small likelihood of downtime, not a 0% chance of downtime, and plan accordingly.” For a service that can survive an hours-long outage, the cost-saving trade-offs are a no-brainer. Services that are mission-critical or have volatile customer bases do not have this luxury, and they may need to build fault tolerance across regions and cloud providers, at a cost to themselves and possibly their end users.
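A quick back-of-the-envelope calculation makes the “small likelihood of downtime” trade-off tangible: each extra “nine” of availability shrinks the yearly downtime budget by a factor of ten, which is roughly what the cost of multi-region redundancy is buying. The availability targets below are illustrative examples, not any provider’s published SLA.

```python
# Back-of-the-envelope: yearly downtime implied by an availability target.
HOURS_PER_YEAR = 24 * 365  # 8760

def downtime_hours(availability):
    """Hours of downtime per year permitted at a given availability level."""
    return (1 - availability) * HOURS_PER_YEAR

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} availability -> {downtime_hours(target):.2f} h/year")
```

At 99.9%, a service can be down almost nine hours a year and still meet its target, which is why a single multi-hour regional outage is survivable for many businesses; at 99.99%, the same outage blows the entire yearly budget several times over.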
In IT, we are always in the business of value and trade-offs. As SaaS providers are now finding out, the key is knowing your customers’ expectations for the value your business provides, and making the correct trade-offs to deliver that value.