Wide Awake Developers


Reliability Math

Suppose you build a web site out of a single stack: one web server, one app server, and one database server. What sort of availability SLA should you be willing to support for this site?

We'll approach this in a few steps. For the first cut, you'd say that the appropriate SLA is just the expected availability of the site. Availability is defined in different ways depending on when and how you expect to measure it, but for the time being, we'll say that availability is the probability of getting an HTTP response when you submit a request. This is the instantaneous availability.

What is the probability of getting a response from the web server? Assuming that every request goes through all three layers, the probability of a response is the probability that all three components are working. That is:

Pr(Web ∩ App ∩ DB) = Pr(Web) × Pr(App) × Pr(DB)

This follows our intuition pretty closely. Since any of the three servers can go down, and any one server down takes down the site, we'd expect to just multiply the probabilities together. But what should we use for the reliability of the individual boxes? We haven't done a test to failure or life cycle test on our vendor's hardware. In fact, if our vendor has any MTBF data, they're keeping it pretty quiet.
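As a minimal sketch of that multiplication, assuming the servers fail independently of one another (the function name `series_availability` and the 99% figures are illustrative, not values from this post):

```python
from math import prod

def series_availability(availabilities):
    """Availability of components in series: all must be up at once.

    Assumes the components fail independently of one another.
    """
    return prod(availabilities)

# Hypothetical example: three servers that are each up 99% of the time.
stack = series_availability([0.99, 0.99, 0.99])
print(f"{stack:.4f}")  # 0.9703
```

Even with three fairly reliable boxes, the stack as a whole is noticeably less available than any one of them.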

We can spend some time hunting down server reliability data later. For now, let's just try to estimate it. In fact, let's estimate widely enough that we can be 90% confident that the true value is within our range. This will give us some pretty wide ranges, but that's OK... we haven't dug up much data yet, so there should be a lot of uncertainty. Uncertainty isn't a show stopper, and it isn't an excuse for inaction. It just means there are things we don't yet know. If we can quantify our uncertainty, then we can still make meaningful decisions. (And some of those decisions may be to go study something to reduce the uncertainty!)

Even cheap hardware is getting pretty reliable. Would you expect every server to fail once a year? Probably not. It's less frequent than that. Every server failing once every two years? That seems a little pessimistic, but not impossible. Let's start there. If every server fails once every two years, at a constant rate [1], then the probability that a given server gets through a year without a failure is e^(-1/2), so we can say that the lower bound on server reliability is 60.6%. Would we expect all of these servers to run for five years straight without a failure? Possible, but unlikely. Let's use one failure over five years as our upper bound. One failure out of fifteen server-years, or e^(-1/15), would give an annual availability of 93.5% for each server.
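Those two bounds can be reproduced in a few lines of Python. This is a sketch under the constant-failure-rate assumption from footnote 1, where the chance of a server getting through a year without failing is exp(-failures per server-year). The function name `annual_survival` is made up for illustration.

```python
import math

def annual_survival(failures, server_years):
    """Probability that one server runs a full year without failing,
    assuming a constant failure rate (exponential time to failure)."""
    rate = failures / server_years  # expected failures per server per year
    return math.exp(-rate)

lower = annual_survival(failures=1, server_years=2)   # pessimistic bound
upper = annual_survival(failures=1, server_years=15)  # optimistic bound
print(f"lower: {lower:.4f}, upper: {upper:.4f}")
```

The results come out at roughly 0.6065 and 0.9355, which the text quotes as 60.6% and 93.5%.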

So, each server's availability is somewhere between 60.6% and 93.5%. That's a pretty wide range, and won't be satisfactory to many people. That's OK, because it reflects our current degree of uncertainty.

To find the overall reliability, I could just take the worst case and plug it in for all three probabilities, then plug in the best case. That slightly overstates the edge cases, though. I'm better off getting Excel to help me run a Monte Carlo analysis to give me an average across a bunch of scenarios. I'll construct a row that randomly samples a scenario from within these ranges. It will pick three values between 60.6% and 93.5% and compute their product. Then, I'll copy that row 10,000 times by dragging it down the sheet. Finally, I'll average out the computed products to get a range for the overall reliability. When I do that, I get a weighted range of 28.9% to 62.6%. [2] [3]
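For readers without Excel, here is a rough Python equivalent of that spreadsheet. It is a sketch, not the original workbook. Following the author's description in the comments, it assumes each server's availability is normally distributed with the 60.6% to 93.5% range as its 90% confidence interval, and it reports the mean of the products plus or minus about 1.645 standard deviations. The names `simulate_stack` and `TRIALS` and the fixed seed are illustrative choices.

```python
import random
import statistics

LOW, HIGH = 0.606, 0.935   # 90% range for one server, from the text
TRIALS = 10_000            # one trial per spreadsheet row
Z90 = statistics.NormalDist().inv_cdf(0.95)  # about 1.645

def simulate_stack(trials=TRIALS, seed=1):
    """Sample three server availabilities per trial and multiply them.

    Returns the (low, high) ends of a 90% range for the product.
    Assumption: each availability is Gaussian with LOW..HIGH as its 90% CI.
    Samples are not clipped to [0, 1], so a few individual draws may exceed 1.
    """
    rng = random.Random(seed)
    mean = (LOW + HIGH) / 2
    sigma = (HIGH - LOW) / (2 * Z90)
    products = [
        rng.gauss(mean, sigma) * rng.gauss(mean, sigma) * rng.gauss(mean, sigma)
        for _ in range(trials)
    ]
    center = statistics.fmean(products)
    spread = Z90 * statistics.stdev(products)
    return center - spread, center + spread

low, high = simulate_stack()
print(f"stack availability: {low:.1%} to {high:.1%}")
```

Under these assumptions the result should land close to the 28.9% to 62.6% quoted above, varying a little with the seed.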

Yep, this single stack web site will be available somewhere between 28.9% of the time and 62.6%. [4]

Actually, it's likely to be worse than that. There are two big problems in the analysis so far. First, we've only accounted for hardware failures, but software failures are a much bigger contributor to downtime. Second, more seriously, the equation for overall reliability assumes that all failures are independent. That is, we implicitly assumed that nothing could cause more than one of these servers to fail simultaneously. Talk about Pollyanna! We've got common mode failures all over the place, especially in the network, power, and data center arenas.

Next time, we'll start working toward a more realistic calculation.


1. I'm using a lot of simplifying assumptions right now. Over time, I'll strip these away and replace them with more realistic calculations. For example, a constant failure rate implies an exponential distribution function. It is mathematically convenient, but doesn't represent the effects of aging on moving components like hard drives and fans.

2. You can download the spreadsheet here.

3. These estimation and analysis techniques are from "How to Measure Anything" by Doug Hubbard.

4. Clearly, for a single-threaded stack like this, you can achieve much higher reliability by running all three layers on a single physical host.

Comments

Hi Michael. I think the formula used in this post is not quite right.

You mentioned that the probability of a successful request is:

Pr (Web U App U DB) - but this is the probability that any one component is working, and it is calculated as Max (Pr(Web),Pr(App),Pr(DB ))

But in your example you are using the intersection of events, which has the following formula:
Pr (Web ^ App ^ DB )

So instead of the Union sign, the Intersection sign should be used.

Pavel,

You are quite correct, of course. I've updated the formula in the graphic.

Fortunately, it was just a typo in the left hand side. That formula was incorrect, but the remainder of the calculation is the same.

Regards,
-Michael

Great post. Enjoyed reading it and enjoyed thinking about it afterwards even more. But I have many questions :)

Firstly, your equation corresponds to a model where impact of how many queries your system receives (1 query per hour or 1MM queries a second) is assumed to be nil. There is nothing wrong with that, you define a model any way you want - I am just saying.

However, think about it. In a model where software failures are assumed to be non-existent (per last paragraph) and impact of incoming flow of queries is negligible (per equation) and all hardware failures are fatal (non-transient), instantaneous availability as you defined it in paragraph 1 is exactly 1 if all servers are running and exactly 0 if at least one server crashed.

Note that it's 1 from start of observation till at least one server goes down. Then it's 0 forever.

If you allow transient (sporadic, recoverable) hardware failures, your graph might be smoothed, but still - once at least one machine crashes, your probability falls to 0 and remains there.

I also think Monte Carlo simulation is not applicable here, because in it measurements are independent of each other - this is not the case in our model, because if a server crashes, it can't leave that state by itself. Hence, prior observations may have an impact on future observations.

I believe you incorrectly applied a notion of time and averaging over time to calculate general probability, as opposed to a probability of an event occurring within a certain timeframe.

What do you think?

Please note I am not a very sophisticated statistician :), even though I am a math major. So my thinking can be totally off. Looking forward to hearing your thoughts.

Cheers,
Dmitriy

Dmitriy,

Thank you for your excellent comments. I'll try to answer them all here.

For the most part, I think you're just getting ahead of me. I've envisioned this as a series of posts where I progressively remove the (inaccurate) simplifying assumptions. For example, hardware failures do not follow an exponential distribution, but are much better represented as Weibull distributions. Software failures, on the other hand, are more related to load than age. (Running software doesn't wear out.) So, in future posts, I will decouple hardware from software. Second, this post doesn't really account for recycle time (operational availability), it only accounts for instantaneous availability. In future posts, I will account for repair time as well.

Regarding use of Monte Carlo simulation, I made a transition from speaking about probability of events over time, in the first half of my post, to talking about uncertainty in measurements.

With measurements uncertain in 3 dimensions, I could simply multiply out the best and worst cases. This would correspond to a uniform distribution across a cubic region of event space. That's usually unrealistic, in that it puts too much probability density in the corners. I used the Monte Carlo spreadsheet to represent a 3-dimensional fuzzy sphere with a Gaussian distribution in each dimension. By assuming the original reliability range was a 90% CI, I can derive the mean and standard deviation of the distribution in each axis, then use the Monte Carlo technique to find the sigma-radius of the resulting probability cloud. This was the transition from time basis to uncertainty basis.

On a broader scale, you'll have noticed that I'm making a lot of unvalidated assumptions. There is room for much more precision and accuracy. This is part of the larger point I'm making in my series on time and uncertainty.

In the early days of a project, it is _always_ the case that we lack hard data. It's not uncommon to have an order of magnitude of uncertainty in our estimates. (If we even have estimates!) "This query will take somewhere between a tenth of a second and a second." That can't stop us from making progress, though. We just have to be clear about our uncertainties and approximations. Every project will be a series of successively more accurate approximations, up until the system goes into production. Then, and only then, will we have the absolute concrete data to plug into our models. Until that time, we just need to understand how uncertain we should be, and what we can do to reduce the uncertainty.

Cheers,
-Michael
