Wide Awake Developers


Failover: Messy Realities

People who don't live in operations can carry some funny misconceptions in their heads. Some of my personal faves:

  • Just add some servers!
  • I want a report of every configuration setting that's different between production and QA!
  • We're going to make sure this (outage) never happens again!

I've recently been reminded of this during some discussions about disaster recovery. This topic seems to breed misconceptions. Somewhere, I think most people carry around a mental model of failover that looks like this:

Normal operation transitions directly and cleanly to "failed over."

That is, failover is essentially automatic and magical.

Sadly, there are many intermediate states that aren't found in this mental model. For example, there can be quite some time between a failure and its detection. Depending on the detection and notification mechanisms, there can be quite a delay before failover is initiated at all. (I once spoke with a retailer whose primary notification mechanism seemed to be the Marketing VP's wife.)
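To make the detection gap concrete, here is a minimal sketch of a heartbeat-based failure detector. The interval and threshold values are illustrative assumptions, not numbers from any real system; the point is that detection always lags the failure by at least a few missed heartbeats.

    import time

    # Illustrative values. A node is only declared dead after several
    # heartbeats in a row go missing, so detection always trails the
    # actual failure by at least interval * threshold seconds.
    HEARTBEAT_INTERVAL_S = 10
    MISSED_BEATS_THRESHOLD = 3

    class HeartbeatMonitor:
        def __init__(self):
            self.last_seen = {}  # node name -> timestamp of last heartbeat

        def record_heartbeat(self, node):
            self.last_seen[node] = time.monotonic()

        def suspected_failures(self):
            """Return nodes whose heartbeats have been missing too long."""
            now = time.monotonic()
            deadline = HEARTBEAT_INTERVAL_S * MISSED_BEATS_THRESHOLD
            return [node for node, seen in self.last_seen.items()
                    if now - seen > deadline]

    # Even in this toy, the best case is a 30-second window between
    # "failed" and "detected", and that is before anyone gets paged.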

Once you account for delays, you also have to account for faulty mechanisms. Failover itself often fails, usually due to configuration drift. Regular drills and failover exercises are the only way to ensure that failover works when you need it. When the failover mechanisms themselves fail, your system gets thrown into one of these terminal states that require manual recovery.
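Configuration drift is at least partly checkable by machine, which is one more reason drills pay off: they force you to compare what the standby actually has against what it needs. Here is a minimal sketch; it assumes you can already dump each node's settings as a flat dictionary, and the example keys and values are made up.

    def config_drift(primary: dict, standby: dict) -> dict:
        """Compare two flat config dicts and report every difference."""
        drift = {}
        for key in primary.keys() | standby.keys():
            p_val = primary.get(key, "<missing>")
            s_val = standby.get(key, "<missing>")
            if p_val != s_val:
                drift[key] = (p_val, s_val)
        return drift

    # Made-up example: the standby points at the wrong database host and
    # is missing a pool setting, exactly the kind of drift that makes
    # failover fail at the worst possible moment.
    primary = {"db.host": "db-prod-1", "pool.size": "50", "tls": "on"}
    standby = {"db.host": "db-prod-1.old", "tls": "on"}
    print(config_drift(primary, standby))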

Just off the cuff, I think the full model looks a lot more like this:

Many more states exist in the real world, including failure of the failover mechanism itself.
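Since the diagram itself isn't reproduced here, the sketch below is one way to write the fuller model down as code. The state names and transitions are an approximation of the states described in the text, not a faithful copy of the picture.

    from enum import Enum, auto

    class FailoverState(Enum):
        NORMAL_OPERATION = auto()
        FAILED_NOT_DETECTED = auto()        # failure happened, nobody knows yet
        DETECTED_NOT_INITIATED = auto()     # alerts fired, failover not started
        FAILOVER_IN_PROGRESS = auto()       # waiting for the passive node
        FAILED_OVER = auto()                # running on the secondary
        FAILOVER_FAILED = auto()            # the mechanism itself broke

    # Transitions, including the unhappy paths the simple mental model omits.
    TRANSITIONS = {
        FailoverState.NORMAL_OPERATION:       [FailoverState.FAILED_NOT_DETECTED],
        FailoverState.FAILED_NOT_DETECTED:    [FailoverState.DETECTED_NOT_INITIATED],
        FailoverState.DETECTED_NOT_INITIATED: [FailoverState.FAILOVER_IN_PROGRESS],
        FailoverState.FAILOVER_IN_PROGRESS:   [FailoverState.FAILED_OVER,
                                               FailoverState.FAILOVER_FAILED],
        FailoverState.FAILED_OVER:            [FailoverState.NORMAL_OPERATION,
                                               FailoverState.FAILED_NOT_DETECTED],
        FailoverState.FAILOVER_FAILED:        [],  # terminal: manual recovery
    }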

It's worth considering each of these states and asking yourself the following questions:

  • Is the state transition triggered automatically or manually?
  • Is the transition step executed by hand or through automation?
  • How long will the state transition take?
  • How can I tell whether it worked or not?
  • How can I recover if it didn't work?
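One way to keep yourself honest about these questions is to answer them per transition, in a form you can review and update after every drill. The fields and the example entry below are purely an illustration, not anything prescribed above.

    from dataclasses import dataclass

    @dataclass
    class TransitionChecklist:
        name: str
        triggered_automatically: bool   # or does a person have to decide?
        executed_by_automation: bool    # or is it a runbook of manual steps?
        expected_duration_s: int        # how long should it take?
        verification: str               # how do we know it worked?
        recovery_plan: str              # what do we do if it didn't?

    # Hypothetical entry for the "initiate failover" transition.
    initiate_failover = TransitionChecklist(
        name="failure detected -> failover initiated",
        triggered_automatically=False,   # a human pulls the trigger
        executed_by_automation=True,     # ...but a script does the work
        expected_duration_s=300,
        verification="health checks green on the passive node; traffic shifted",
        recovery_plan="abort the script, fail back, open an incident bridge",
    )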

Comments

Another question that you might ask about each transition is: "What race conditions can occur during this transition?"
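A classic example of such a race is the old primary coming back mid-transition and accepting writes alongside the new one. One common defence, sketched below purely as an illustration, is a fencing token: every failover bumps an epoch, and writes carrying a stale epoch are rejected.

    class FencedStore:
        """Rejects writes from nodes holding a stale failover epoch."""
        def __init__(self):
            self.current_epoch = 1
            self.data = {}

        def failover(self):
            # Each failover bumps the epoch; the new primary learns the new value.
            self.current_epoch += 1
            return self.current_epoch

        def write(self, epoch, key, value):
            if epoch < self.current_epoch:
                raise PermissionError("stale epoch: node was fenced off")
            self.data[key] = value

    store = FencedStore()
    old_primary_epoch = store.current_epoch
    new_primary_epoch = store.failover()
    store.write(new_primary_epoch, "order-42", "shipped")         # accepted
    try:
        store.write(old_primary_epoch, "order-42", "cancelled")   # stale epoch
    except PermissionError as e:
        print(e)  # the old primary is fenced off instead of corrupting data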

This is the best explanation I've seen yet about the messy realities of recovery. For completeness, the only suggestion I can offer is adding two more transitions:

1) Between "failure detected, failover not initiated" and "waiting for passive node" there's a state where one or more people have to interpret the notification and decide whether failover needs to be initiated. This state is fraught with error and delay, and can often involve things like initiating conference calls, tracking down dependent upstream and downstream owners, and deciding whether the failure is really bad enough to fail over or whether we can stick it out. Maybe call this state "interpreting and deciding on a course of action". I've seen it take as long as 30-60 minutes in less mature organizations.

2) Right after "failover succeeded" there's a state where you're waiting for adjustments to your failover to ripple through the rest of the organization, both upstream and downstream. For enterprises that aren't holistically designed and/or don't have good abstraction between systems, this can be messy and take a long time. As Paul said in the comment above me, race condition vulnerabilities are especially gnarly during this time.

It's very interesting to see that between the normal operation state, the failure state, and normal operation again there is a whole universe. From my own experience: I once had to explain a similar problem to my American colleagues. During surgical operations, they had implemented a restart of the surgical hardware to work around a firmware failure. The result of this restart was a memory leak and an unloaded dynamic library, and the hardware had to be restarted roughly every 10 to 15 minutes. I can't tell you how hard it was to convince them that we had a memory leak and a problem with the instantiation of the dynamic libraries that feed data from the hardware to the software application. That application's responsibility is to decide whether to warn the surgeon that the instrument is close to a spinal cord nerve! As far as I know, they had already performed an operation with the new system running in parallel with the old one, and I am not sure how safe that combination was. The advice, probably, is to know the dependencies of the system and its behavior during failure.

Anton Jorov Antonov

Another possible route is that, after "failover succeeded," you might have another failure, if the problem was induced by some external cause that repeats, like freakishly high demand or a malformed query. Not sure whether that merits inclusion, but it felt worth mentioning.
