Wide Awake Developers


Quantum Backups

Backups are the only macroscopic system we commonly deal with that exhibits quantum mechanical effects. This is odd enough that I've spent some time getting tangled up in these observations.

Until you attempt a restore, a backup set is neither good nor bad, but a superposition of both. This is the superposition principle.

The peculiarity of the superposition principle is dramatically illustrated with the experiment of Schrödinger's backup. This is when you attempt to restore Schrödinger's pictures of his cat, and discover that the cat is not there.

In a startling corollary, if you use offsite vaulting, a second quantum variable is introduced, in that the backup set exists and does not exist simultaneously. A curious effect emerges upon applying the Hamiltonian operator. The operator shows that certain eigenvalues are always zero, revealing that prime-numbered tapes greater than 5 in a set never exist.

Finally, the Heisenbackup principle says that the user of a system is entangled with the system itself. As a result, within 30 days of consciously deciding that you do not need to run a backup, you will experience a complete disk crash. Because you've just read this, your 30 days start now.

Sorry about that.

Update: Sun Cloud API Not the Same as Amazon

It looks like the early reports that Sun's cloud API would be compatible with AWS resulted from the reporters' exuberance (or mere confusion).

It's actually nicer than Amazon's.

It is based on the REST architectural style, with representations in JSON. In fact, I might start using it as the best embodiment of REST principles. You start with an HTTP GET of "/". In this and every other response, it is the hyperlinks that indicate what actions are allowed.
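To make that interaction pattern concrete, here is a minimal sketch in Python. The endpoint and the "links"/"rel"/"href" field names are placeholders for this sketch, not Sun's actual resource names; the point is only that the client hard-codes nothing beyond "/" and discovers everything else from hyperlinks in the responses.

    import json
    from urllib.request import urlopen

    BASE = "https://cloud.example.com"   # placeholder endpoint, not Sun's real host

    def get_json(uri):
        """GET a resource and parse its JSON representation."""
        url = uri if uri.startswith("http") else BASE + uri
        with urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))

    # Start at the root; no other URIs are hard-coded.
    cloud = get_json("/")

    # The representation carries hyperlinks telling the client what it may do next.
    for link in cloud.get("links", []):
        print(link["rel"], "->", link["href"])

    # Follow a link by its relation name rather than by a known URI.
    # ("vdc" is a hypothetical relation name; the real API defines its own.)
    vdc_links = [l for l in cloud.get("links", []) if l.get("rel") == "vdc"]
    if vdc_links:
        vdc = get_json(vdc_links[0]["href"])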

Sun has a wiki to describe the API, with a very nicely illustrated "Hello, Cloud" example.

Can you make that meeting?

I'm convinced that the next great productivity revolution will be de-matrixing the organizations we've just spent ten years slicing and dicing.

Yesterday, I ran into a case in point: What are the odds that three people can schedule a meeting this week versus having to push it into next week?

Turns out that if they're each 75% utilized, then there's only a 15% chance they can schedule a one-hour meeting this week. (If you always schedule 30-minute meetings instead of one hour, the odds go up to about 25%.)

Here's the probability curve that the meeting can happen. This assumes, by the way, that there are no lunches or vacation days, and that all parties are in the same time zone. It only gets worse from here.

So, overall, there's about an 85% chance that 3 random people in a meeting-driven company will have to defer until next week.

Bring it up to 10 people, in a consensus-driven, meeting-oriented company, and the odds drop to 0.00095%.
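For the curious, here's a small Monte Carlo sketch of one way to estimate odds like these. It assumes a 40-hour week, each person's busy hours scattered independently and uniformly at random, and a meeting that needs a single common free hour. Real calendars cluster busy time into blocks and the week is usually already half gone, so this toy model won't reproduce the figures above exactly; it's just a way to play with the shape of the curve.

    import random

    def common_free_hour_odds(people=3, utilization=0.75, hours=40, trials=20000):
        """Estimate the chance that `people` calendars share at least one free
        hour in an `hours`-hour week, each person busy `utilization` of the time."""
        busy_per_person = int(round(utilization * hours))
        hits = 0
        for _ in range(trials):
            free_sets = []
            for _ in range(people):
                # Busy hours placed independently and uniformly at random.
                busy = set(random.sample(range(hours), busy_per_person))
                free_sets.append(set(range(hours)) - busy)
            if set.intersection(*free_sets):
                hits += 1
        return hits / trials

    for n in (3, 10):
        print(n, "people:", common_free_hour_odds(people=n))

Requiring a contiguous block instead of a single hour, or shrinking the number of hours left in the week, drives the odds down quickly, which is the direction the curve above is pointing.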

No wonder "time to first meeting" seems to dominate "time to do stuff."

Amazon as the new Intel

Update: Please read this update. The information underlying this post was based on early, somewhat garbled, reports.

A brief digression from the unpleasantness of reliability.

This morning, Sun announced their re-entry into the cloud computing market. After withdrawing Network.com from the marketplace a few months ago, we were all wondering what Sun's approach would be. No hardware vendor can afford to ignore the cloud computing trend... it's going to change how customers view their own data centers and hardware purchases.

One thing that really caught my interest was the description of Sun's cloud offering. It sounded really, really similar to AWS. Then I heard the E-word and it made perfect sense. Sun announced that they will use EUCALYPTUS as the control interface to their solution. EUCALYPTUS is an open-source implementation of the AWS APIs.
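To make "implements the AWS APIs" concrete: client code written against EC2 can, at least in principle, be pointed at a EUCALYPTUS front end just by changing the connection details. Here's a minimal sketch using the boto library; the endpoint, port, and path are placeholders for whatever a particular EUCALYPTUS installation exposes, not anything from Sun's announcement.

    from boto.ec2.connection import EC2Connection
    from boto.ec2.regioninfo import RegionInfo

    # Hypothetical EUCALYPTUS front end; substitute your own cloud's endpoint.
    euca = RegionInfo(name="my-cloud", endpoint="cloud.example.org")

    conn = EC2Connection(
        aws_access_key_id="YOUR-ACCESS-KEY",
        aws_secret_access_key="YOUR-SECRET-KEY",
        region=euca,
        is_secure=False,
        port=8773,                     # a common EUCALYPTUS default, not universal
        path="/services/Eucalyptus",
    )

    # From here on, the code looks exactly like code talking to Amazon EC2.
    for image in conn.get_all_images():
        print(image.id, image.location)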

Last week at QCon London, we heard Simon Wardley give a brilliant talk, in which he described Canonical's plan to create a de facto open standard for cloud computing by seeding the market with open source implementations. Canonical's plan? Ubuntu and private clouds running EUCALYPTUS.

It looks like Amazon may be setting the standard for cloud computing, in the same way that Intel set the standard for desktop and server computing, by defining the programming interface.

I don't worry about this, for two reasons. One, it forestalls any premature efforts to force a de jure standard. This space is still young enough that an early standard can't help but be a drag on exploration of different business and technical models. Two, Amazon has done an excellent job as a technical leader. If their APIs "win" and become de facto standards, well, we could do a lot worse.

Getting Real About Reliability

In my last post, I used some back-of-the-envelope reliability calculations, with just one interesting bit, to estimate the availability of a single-stacked web application, shown again here. I cautioned that there were a lot of unfounded assumptions baked in. Now it's time to start removing those assumptions, though I reserve the right to introduce a few new ones.

Is it there when I want it?

First, let's talk about the hardware itself. It's very likely that these machines are what some vendors are calling "industry-standard servers." That's a polite euphemism for "x86" or "ia64" that just doesn't happen to mention Intel. ISS servers are expected to exhibit 99.9% availability.

There's something a little bit fishy about that number, though. It's one thing to say that a box is up and running ("available") 99.9% of the times you look at it. If I check it every hour for a year, and find it alive at least 8,756 out of 8,765 times, then it's 99.9% available. It might have broken just once for 9 hours, or it might have broken 9 times for an hour each, or it might have broken 18 times for half an hour each.

This is the difference between availability and reliability. Availability measures the likelihood that a system can perform its function at a specific point in time. Reliability, on the other hand, measures the likelihood that a system has operated without failure up to a point in time. Availability and reliability both matter to your users. In fact, a large number of small outages can be just as frustrating as a single large event. (I do wonder... since both ends of the spectrum seem to stick out in users' memories, perhaps there's an optimum value for the duration and frequency of outages, where they are seldom enough to seem infrequent, but short enough to seem forgivable?)
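To put numbers on that: the outage patterns above all burn roughly the same nine-hour downtime budget, so they produce nearly the same availability figure even though they feel very different to users. A common way to relate the quantities is availability = MTBF / (MTBF + MTTR); the sketch below just applies that formula to the patterns from the example (the 8,766-hour year is an assumption of the sketch).

    HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

    def availability(outages_per_year, hours_per_outage):
        """Availability from mean time between failures and mean time to repair."""
        downtime = outages_per_year * hours_per_outage
        mtbf = (HOURS_PER_YEAR - downtime) / outages_per_year  # mean time between failures
        mttr = hours_per_outage                                # mean time to repair
        return mtbf / (mtbf + mttr)

    # One long outage, several medium ones, many short ones -- same availability.
    for count, duration in [(1, 9.0), (9, 1.0), (18, 0.5)]:
        print("%2d outages of %.1f h -> availability %.4f"
              % (count, duration, availability(count, duration)))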

We need a bit more math at this point.

It must be science... it's got integrals.

Let's suppose that hardware failures can be described as a function of time, and that they are essentially random. It's not like the story of the "priceless" server room, where failure can be expected based on actions or inaction. We'll also carry over the previous assumption that hardware failures among these three boxes are independent. That is, failure of any one box does not make the other boxes more likely to fail.

We want to determine the likelihood that the box is available, but the random event we're concerned with is a fault. Thus, we first need to find the probability that a fault has occurred by time t. Checking for a failure is sampling for an event X between times 0 and t.
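In symbols (reconstructing the standard formula the text describes, since the original equation isn't reproduced here), the probability that a fault has occurred by time t is the cumulative distribution of the failure time X:

    F(t) = \Pr(X \le t) = \int_0^t f(\tau)\,d\tau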

The function f(t) is the probability density function that describes failures of this system. We'll come back to that shortly, because a great deal hinges on what function we use here. The reliability of the system, then, is the probability that the event X didn't happen by time t.
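Written out, again in the standard form:

    R(t) = \Pr(X > t) = 1 - F(t) = \int_t^\infty f(\tau)\,d\tau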

One other equation that will help in a bit is the failure rate, the number of failures to expect per unit time. Like reliability, the failure rate can vary over time. The failure rate is:
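In the usual notation this is the hazard function, reconstructed here in its standard form:

    \lambda(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{1 - F(t)}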

Failure distributions

So now we've got integrals to infinity of unknown functions. This is progress?

It is progress, but there are some missing pieces. Next time, I'll talk about different probability distributions, which ones make sense for different purposes, and how to calibrate them with observations.