Wide Awake Developers

Software Failure Takes Down Blackberry Services

| Comments

Anyone who’s addicted to a Blackberry already knows about Monday’s four-hour outage. For some of us, the Blackberry isn’t just an electronic leash, it’s part of our business operations.

Like cell phones, Blackberries have a huge, hidden infrastructure behind them. Corporate Blackberry Event Servers (BES) relay email, calendar, and contact information through RIM’s infrastructure, out through the wireless carriers. It was RIM’s own infrastructure that suffered from intermittent failures during the outage.

Data Center Knowledge reports that the outage was caused by a failed software upgrade

Releases are risky. We use testing and QA to reduce the risk, but every line of new or modified code represents an unknown.

How can we reduce the risk of an upgrade? One way is to roll it out slowly. Companies with widely distributed point-of-sale (POS) systems know this. They never push a release out to every store at once. They start with one or two. If that works, they go up to a larger handful, maybe four to eight. After a couple of days, they’ll roll it out to an entire district. It can take a week or more to roll the release out everywhere.

In the interim, there are plenty of checkpoints where the release can be rolled back.

I strongly recommend approaching Web site releases the same way. Roll the new release out to one or two servers in your farm. Let a fraction of your customers into the new release. Watch for performance regressions, capacity problems, and functional errors. Absolutely ensure that you can roll it back if you need to. Once it’s "baked" for a while in production, then roll it to the remaining app servers.

This approach demands a few corollaries. First, your database updates have to be structured in a forward-compatible way, and they must always allow for rollback. There can be no irrevocable updates. Second, two versions of your software will be operating simultaneously. That means your integration protocols and static assets have to be able to accommodate both versions. I discuss specific strategies for each of these aspects in Release It.

Finally, an aside: RIM’s statement about the outage isn’t reflected anywhere on their site. Once again, if what you want is the latest true information about a company, the very last place to find it is the company’s own web site.