Wide Awake Developers

S3 Outage Report and Perspective

Amazon has issued a more detailed statement explaining the S3 outage of July 20, 2008. In my company, we'd call this a "Post Incident Report" or PIR. It has all the necessary sections:

  • Observed behavior
  • Root cause analysis
  • Follow-up actions: corrective and operational

This is exactly what I'd expect from any mature service provider.

There are a few interesting bits in the report. First, the condition seems to have arisen from an unexpected failure mode in the platform's self-management protocol. This shouldn't surprise anyone. S3 is a new way of doing business, built on some of the most innovative software development around and run at the largest scales. Bugs will creep in.

In fact, I'd expect to find more cases of odd emergent behavior at large scale.

Second, the back of my envelope still shows S3 at 99.94% availability for the year. That's better than most data center providers. It's certainly better than most corporate IT departments do.
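For anyone who wants to redo the envelope math, here is a minimal sketch. The function names are made up for illustration, and it assumes a non-leap 8,760-hour year. It makes no claim about the actual downtime figures behind the 99.94% number; it only converts between an availability fraction and hours of downtime.

```python
HOURS_PER_YEAR = 365 * 24  # 8760; assumption: ignore the leap day


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period the service was up."""
    return 1.0 - downtime_hours / period_hours


def downtime_budget_hours(target, period_hours=HOURS_PER_YEAR):
    """Hours of downtime implied by a target availability fraction."""
    return (1.0 - target) * period_hours


# 99.94% over a year leaves a little over five hours of downtime.
print(f"{downtime_budget_hours(0.9994):.2f} hours")  # prints "5.26 hours"
```

Run in the other direction, `availability(5.256)` comes back to roughly 0.9994.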

Third, Amazon rightly recognizes that transparency is a necessary condition for trust. Many service providers would fall into the "bunker mentality" of the embattled organization. That's a deteriorating spiral of distrust, coverups, and spin control. Transparency is most vital after an incident. If you cannot maintain transparency then, it won't matter at any other time.

Comments

I dunno, the problem report indicates to me that Amazon simply messed up by insufficiently protecting the integrity of the content of its gossip messages.

If they were using TCP/IP, it has long been known that TCP checksums are insufficiently robust. If you send enough messages, you *will* run into these problems:

http://portal.acm.org/citation.cfm?doid=347059.347561
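To make the point concrete, here is a rough sketch of why the 16-bit Internet checksum used by TCP can miss corruption, and what an application-level integrity check might look like. This is not Amazon's gossip protocol or their fix. The `seal` and `verify` helpers and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum, the kind TCP uses."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def seal(payload: bytes) -> bytes:
    """Hypothetical helper: prepend a SHA-256 digest to the payload."""
    return hashlib.sha256(payload).digest() + payload


def verify(message: bytes) -> bytes:
    """Hypothetical helper: return the payload, or raise if the digest does not match."""
    digest, payload = message[:32], message[32:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload failed integrity check")
    return payload


# The checksum is a sum of 16-bit words, so reordering words goes unnoticed.
original, reordered = b"ABCD", b"CDAB"
print(internet_checksum(original) == internet_checksum(reordered))  # True

# A digest over the payload catches a single flipped bit.
message = bytearray(seal(b"node-17 state: healthy"))
message[-1] ^= 0x01  # flip one bit in the payload
try:
    verify(bytes(message))
except ValueError as err:
    print("rejected:", err)
```

The sum-based checksum cannot see reordered words, while a digest computed and checked by the application covers the whole path between sender and receiver, not just one network hop.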
