Wide Awake Developers

S3 Outage Report and Perspective

Amazon has issued a more detailed statement explaining the S3 outage of July 20, 2008. In my company, we’d call this a "Post Incident Report" or PIR. It has all the necessary sections:

  • Observed behavior
  • Root cause analysis
  • Follow-up actions: corrective and operational

This is exactly what I’d expect from any mature service provider.

There are a few interesting bits in the report. First, the condition seems to have arisen from an unexpected failure mode in the platform’s self-management protocol. This shouldn’t surprise anyone. Amazon is pioneering a new way of doing business, built on some of the most innovative software development around and applied at the largest scales. Bugs will creep in.

In fact, I’d expect to find more cases of odd emergent behavior at large scale.

Second, the back of my envelope still shows S3 at 99.94% availability for the year. That’s better than most data center providers. It’s certainly better than most corporate IT departments do.
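For anyone who wants to redo that envelope math, here is a minimal sketch of the conversion between an availability percentage and hours of downtime. The function names are mine, and the only input taken from above is the 99.94% figure; the actual outage durations behind it aren’t stated here, so none are assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime implied by an availability percentage over a period."""
    return (1 - availability_pct / 100) * period_hours


def availability_from_downtime(hours_down: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Availability percentage implied by a given number of hours of downtime."""
    return (1 - hours_down / period_hours) * 100


if __name__ == "__main__":
    # 99.94% over a full year leaves a budget of roughly five and a quarter hours.
    print(f"{downtime_hours(99.94):.2f} hours of downtime per year")
```

Run over a full year, 99.94% works out to a little more than five hours of total downtime. Changing `period_hours` gives the same figure for a month or a quarter.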

Third, Amazon rightly recognizes that transparency is a necessary condition for trust. Many service providers would fall into the "bunker mentality" of the embattled organization. That’s a deteriorating spiral of distrust, coverups, and spin control. Transparency is most vital after an incident. If you cannot maintain transparency then, it won’t matter at any other time.