Wide Awake Developers

Update: Sun Cloud API Not the Same as Amazon


It looks like the early reports that Sun’s cloud API would be compatible with AWS resulted from the reporters’ exuberance (or mere confusion.)

It’s actually nicer than Amazon’s.

It is based on the REST architectural style, with representations in JSON. In fact, I might start using it as the best embodiment of REST principles. You start with an HTTP GET of “/”. In the response to this, and to every other request, it is the hyperlinks that indicate what actions are allowed.
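
For the flavor of it, here's a rough Python sketch of what a client interaction looks like under that style. The endpoint and the shape of the JSON are stand-ins, not the actual Sun API; the point is that the client starts at “/” and follows whatever links the server hands back.

```python
import requests

# Stand-in entry point; substitute the real cloud endpoint.
BASE = "https://cloud.example.com"

# Start with a GET of the root resource, asking for JSON.
root = requests.get(BASE + "/", headers={"Accept": "application/json"}).json()

# Rather than hard-coding URIs, follow the hyperlinks the server advertises.
# (The "links" key is illustrative; the real representation may name things differently.)
for rel, href in root.get("links", {}).items():
    print(rel, "->", href)
```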

Sun has a wiki to describe the API, with a very nicely illustrated “Hello, Cloud” example.

Can You Make That Meeting?


I’m convinced that the next great productivity revolution will be de-matrixing the organizations we’ve just spent ten years slicing and dicing.

Yesterday, I ran into a case in point: What are the odds that three people can schedule a meeting this week versus having to push it into next week?

Turns out that if they’re each 75% utilized, then there’s only a 15% chance they can schedule a one hour meeting this week. (If you always schedule 30 minute meetings instead of one hour, then the odds go up to about 25%.)

Here’s the probability curve that the meeting can happen. This assumes, by the way, that there are no lunches or vacation days, and that all parties are in the same time zone. It only gets worse from here.

So, overall, there’s about an 85% chance that 3 random people in a meeting-driven company will have to defer until next week.

Bring it up to 10 people, in a consensus-driven, meeting-oriented company, and the odds drop to 0.00095%.
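
If you want to play with the numbers yourself, here's a crude Monte Carlo sketch in Python. It assumes a 40-hour week with busy hours scattered independently at random, which is certainly not how real calendars look, and not necessarily the model behind the figures above, so treat it as a starting point for your own assumptions rather than a reproduction of mine.

```python
import random

def common_hour_exists(people=3, utilization=0.75, slots=40):
    """One simulated week: is at least one hour slot free for everyone?
    Busy hours are placed independently at random for each person."""
    busy_count = round(utilization * slots)
    free_sets = []
    for _ in range(people):
        busy = set(random.sample(range(slots), busy_count))
        free_sets.append(set(range(slots)) - busy)
    return len(set.intersection(*free_sets)) > 0

trials = 20_000
for n in (3, 10):
    hits = sum(common_hour_exists(people=n) for _ in range(trials))
    print(f"{n} people: {hits / trials:.2%} chance of a common free hour this week")
```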

No wonder “time to first meeting” seems to dominate “time to do stuff.”

Amazon as the New Intel


Update: Please read this update. The information underlying this post was based on early, somewhat garbled, reports.

A brief digression from the unpleasantness of reliability.

This morning, Sun announced its re-entry into the cloud computing market. After Sun withdrew Network.com from the marketplace a few months ago, we were all wondering what its approach would be. No hardware vendor can afford to ignore the cloud computing trend… it’s going to change how customers view their own data centers and hardware purchases.

One thing that really caught my interest was the description of Sun’s cloud offering. It sounded really, really similar to AWS. Then I heard the E-word and it made perfect sense. Sun announced that they will use EUCALYPTUS as the control interface to their solution. EUCALYPTUS is an open-source implementation of the AWS APIs.

Last week at QCon London, we heard Simon Wardley give a brilliant talk, in which he described Canonical’s plan to create a de facto open standard for cloud computing by seeding the market with open source implementations. Canonical’s plan? Ubuntu and private clouds running EUCALYPTUS.

It looks like Amazon may be setting the standard for cloud computing, in the same way that Intel set the standard for desktop and server computing, by defining the programming interface.

I don’t worry about this, for two reasons. One, it forestalls any premature efforts to force a de jure standard. This space is still young enough that an early standard can’t help but be a drag on exploration of different business and technical models. Two, Amazon has done an excellent job as a technical leader. If their APIs “win” and become de facto standards, well, we could do a lot worse.

Getting Real About Reliability


In my last post, I used some back-of-the-envelope reliability calculations, with just one interesting bit, to estimate the availability of a single-stacked web application, shown again here. I cautioned that there were a lot of unfounded assumptions baked in. Now it’s time to start removing those assumptions, though I reserve the right to introduce a few new ones.

Is it there when I want it?

First, let’s talk about the hardware itself. It’s very likely that these machines are what some vendors are calling "industry-standard servers." That’s a polite euphemism for "x86" or "ia64" that just doesn’t happen to mention Intel. ISS servers are expected to exhibit 99.9% availability.

There’s something a little bit fishy about that number, though. It’s one thing to say that a box is up and running ("available") 99.9% of the times you look at it. If I check it every hour for a year, and find it alive at least 8,756 out of 8,765 times, then it’s 99.9% available. It might have broken just once for 9 hours, or it might have broken 9 times for an hour each, or it might have broken 18 times for half an hour each.

This is the difference between availability and reliability. Availability measures the likelihood that a system can perform its function at a specific point in time. Reliability, on the other hand, measures the likelihood that a system has operated without failure up to a point in time. Availability and reliability both matter to your users. In fact, a large number of small outages can be just as frustrating as a single large event. (I do wonder… since both ends of the spectrum seem to stick out in users’ memories, perhaps there’s an optimum value for the duration and frequency of outages, where they are seldom enough to seem infrequent, but short enough to seem forgivable?)

We need a bit more math at this point.

It must be science… it’s got integrals.

Let’s suppose that hardware failures can be described as a function of time, and that they are essentially random. It’s not like the story of the "priceless" server room, where failure can be expected based on actions or inaction. We’ll also carry over the previous assumption that hardware failures among these three boxes are independent. That is, failure of any one box does not make the other boxes more likely to fail.

We want to determine the likelihood that the box is available, but the random event we’re concerned with is a fault. Thus, we first need to find the probability that a fault has occurred by time t. Checking for a failure is sampling for an event X between times 0 and t.
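
Calling that cumulative probability F(t):

$$ F(t) = P(X \le t) = \int_0^t f(\tau)\, d\tau $$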

The function f(t) is the probability density function that describes failures of this system. We’ll come back to that shortly, because a great deal hinges on what function we use here. The reliability of the system, then, is the probability that the event X didn’t happen by time t.
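
In symbols:

$$ R(t) = P(X > t) = 1 - F(t) = 1 - \int_0^t f(\tau)\, d\tau $$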

One other equation that will help in a bit is the failure rate, the number of failures to expect per unit time. Like reliability, the failure rate can vary over time. The failure rate is:
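
$$ \lambda(t) = \frac{f(t)}{R(t)} $$

(This ratio is sometimes called the hazard rate.)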

Failure distributions

So now we’ve got integrals to infinity of unknown functions. This is progress?

It is progress, but there are some missing pieces. Next time, I’ll talk about different probability distributions, which ones make sense for different purposes, and how to calibrate them with observations.

Reliability Math


Suppose you build a web site out of a single stack of one web, app, and database server. What sort of availability SLA should you be willing to support for this site?

We’ll approach this in a few steps. For the first cut, you’d say that the appropriate SLA is just the expected availability of the site. Availability is defined in different ways depending on when and how you expect to measure it, but for the time being, we’ll say that availability is the probability of getting an HTTP response when you submit a request. This is the instantaneous availability.

What is the probability of getting a response from the web server? Assuming that every request goes through all three layers, then the probability of a response is the probability that all three components are working. That is:
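
$$ P(\text{response}) = P(\text{web up}) \times P(\text{app up}) \times P(\text{db up}) $$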

This follows our intuition pretty closely. Since any of the three servers can go down, and any one server down takes down the site, we’d expect to just multiply the probabilities together. But what should we use for the reliability of the individual boxes? We haven’t done a test to failure or life cycle test on our vendor’s hardware. In fact, if our vendor has any MTBF data, they’re keeping it pretty quiet.

We can spend some time hunting down server reliability data later. For now, let’s just try to estimate it. In fact, let’s estimate widely enough that we can be 90% confident that the true value is within our range. This will give us some pretty wide ranges, but that’s OK… we haven’t dug up much data yet, so there should be a lot of uncertainty. Uncertainty isn’t a show stopper, and it isn’t an excuse for inaction. It just means there are things we don’t yet know. If we can quantify our uncertainty, then we can still make meaningful decisions. (And some of those decisions may be to go study something to reduce the uncertainty!)

Even cheap hardware is getting pretty reliable. Would you expect every server to fail once a year? Probably not. It’s less frequent than that. Each of the three servers failing once every two years? Seems a little pessimistic, but not impossible. Let’s start there. If every server fails once every two years, at a constant rate [1], then we can say that the lower bound on server reliability is 60.6%. Would we expect all of these servers to run for five years straight without a failure? Possible, but unlikely. Let’s use one failure over five years as our upper bound. One failure out of fifteen server-years would give an annual availability of 93.5% for each server.
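
To see where those numbers come from: with a constant failure rate of λ failures per year, the probability that a server makes it through a whole year without failing is e^(−λ). Plugging in the two bounds:

$$ e^{-1/2} \approx 0.606 \qquad\qquad e^{-1/15} \approx 0.935 $$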

So, each server’s availability is somewhere between 60.6% and 93.5%. That’s a pretty wide range, and won’t be satisfactory to many people. That’s OK, because it reflects our current degree of uncertainty.

To find the overall reliability, I could just take the worst case and plug it in for all three probabilities, then plug in the best case. That slightly overstates the edge cases, though. I’m better off getting Excel to help me run a Monte Carlo analysis to give me an average across a bunch of scenarios. I’ll construct a row that randomly samples a scenario from within these ranges. It will pick three values between 60.6% and 93.5% and compute their product. Then, I’ll copy that row 10,000 times by dragging it down the sheet. Finally, I’ll average out the computed products to get a range for the overall reliability. When I do that, I get a weighted range of 28.9% to 62.6%. [2] [3]
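
If you’d rather script it than drag rows around a spreadsheet, here’s a rough Python equivalent. Sampling uniformly between the two bounds is my own assumption, so the exact percentiles will wobble a bit relative to the spreadsheet’s.

```python
import random

LOW, HIGH = 0.606, 0.935   # 90% confidence range for one server's availability
TRIALS = 10_000

# Each trial: draw an availability for each of the three servers and multiply,
# since the site is only up when web, app, and database servers are all up.
products = []
for _ in range(TRIALS):
    web = random.uniform(LOW, HIGH)
    app = random.uniform(LOW, HIGH)
    db = random.uniform(LOW, HIGH)
    products.append(web * app * db)

products.sort()
# Report a 90% interval across the scenarios (5th to 95th percentile), plus the mean.
print("5th percentile: ", round(products[int(0.05 * TRIALS)], 3))
print("95th percentile:", round(products[int(0.95 * TRIALS)], 3))
print("mean:           ", round(sum(products) / TRIALS, 3))
```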

Yep, this single stack web site will be available somewhere between 28.9% and 62.6% of the time. [4]

Actually, it’s likely to be worse than that. There are two big problems in the analysis so far. First, we’ve only accounted for hardware failures, but software failures are a much bigger contributor to downtime. Second, more seriously, the equation for overall reliability assumes that all failures are independent. That is, we implicitly assumed that nothing could cause more than one of these servers to fail simultaneously. Talk about Pollyanna! We’ve got common mode failures all over the place, especially in the network, power, and data center arenas.

Next time, we’ll start working toward a more realistic calculation.


1. I’m using a lot of simplifying assumptions right now. Over time, I’ll strip these away and replace them with more realistic calculations. For example, a constant failure rate implies an exponential distribution function. It is mathematically convenient, but doesn’t represent the effects of aging on moving components like hard drives and fans.

2. You can download the spreadsheet here.

3. These estimation and analysis techniques are from "How to Measure Anything" by Doug Hubbard.

4. Clearly, for a single-threaded stack like this, you can achieve much higher reliability by running all three layers on a single physical host.

Fast Iteration Versus Elegant Design


I love the way that proggit bubbles stuff around. Today, for a while at least, the top link is to a story from Salon in May of 2000 about Bill and Lynne Jolitz, the creators of 386BSD.

[An aside: I’m not sure exactly when I became enough of a graybeard to remember as current events things which are now discussed as history. It’s really disturbing that an article from almost a decade ago talks about events seven years earlier than that, and I remember them happening! To me, the real graybeards are the guys that created UNIX and C to begin with. Me? I’m part of the second or third UNIX generation, at best. Sigh…]

Anyway, Bill and Lynne Jolitz created the first free, open-source UNIX that ran on x86 chips. Coherent was around before that, and I think SCO UNIX was available for x86 at the same time. SCO wasn’t evil then, just expensive. In those days, you had to lay down some serious jing to get UNIX on your PC. Minix was available for free, but Tanenbaum held firm that Minix should teach principles rather than be a production OS, so he favored pedagogical value over functionality. Consequently, Minix wasn’t a full UNIX implementation. (At least at that time. It might be now.)

Just contemplate the hubris of two programmers deciding that they would create their own operating system, to be UNIX, but fixing the flaws, hacks, and workarounds that had built up over more than a decade. Not only that, but they would choose to give it away for the cost of floppies! And not only that, but they would build it for a processor that serious UNIX people sneered at. Most impressive of all, they succeeded. 386BSD was a technically superior, well-architected version of UNIX for commodity hardware. The Jolitzes extrapolated Intel’s growth curve and rapid product cycles and saw that x86 processors would advance far faster than the technically superior RISC chips.

At various times, I ran Minix, 386BSD, and SCO UNIX on my PC well before I even heard of Linux. Each of them had the field before Linus even made his 0.1 release.

So why is Linux everywhere, and we only hear about 386BSD in historical contexts? There is exactly one answer, and it’s what Eric Raymond was really talking about in The Cathedral and the Bazaar. TCatB has been seen mostly as an argument for open-source versus commercial software, but what Raymond saw was that the real competition comes down to an open contribution model versus closed contributions. Linus’ promiscuous contribution policy simply let Linux out-evolve 386BSD. More contributors meant more drivers, more bug fixes, more enhancements… more ideas, ultimately. Two people, no matter how talented, cannot outcode thousands of Linux contributors. The best programmers are 10 times more productive than the average, and I would rate Bill and Lynne among the very best. But, as of last April, the Linux Foundation reported that more than 3,600 people had contributed to the kernel alone.

Iteration is one of the fundamental dynamics. Iteration facilitates adaptation, and adaptation wins competition. History is littered with the carcasses of "superior" contenders that simply didn’t adapt as fast as their victorious challengers.

Why Do Enterprise Applications Suck?


What is it about enterprise applications that makes them suck?

I mean, have you ever seen someone write 1,500 words about how much they love their corporate expense reporting system? Or spend their free time mashing up the job posting system with Google Maps? Of course not. But why not?

There’s a quality about some software that inspires love in its users, and it’s totally absent from enterprise software. The best you can ever say about enterprise software is that it doesn’t get in the way of the business. At its worst, enterprise software creates more work than it automates.

For example, in my company, we’ve got a personnel management tool that’s so unpredictable that every admin in the company keeps his or her own spreadsheet of requests that have been entered. They have to do that because the system itself randomly eats input, drops requests, or silently trashes forms. It’s not a training problem, it’s just lousy software.

We’ve got a time-tracking system that has a feature where an employee can enter a vacation request. There’s a little workflow triggered to have the supervisor approve the vacation request. I’ve seen it used inside two groups. In both cases, the employee negotiates the leave request via email, then enters it into the time-tracking system. I know several people who use Travelocity to find their flights before they log in to our corporate travel system. And you wouldn’t even believe how hard our sales force automation system is to use compared to Salesforce.com.

Way back in 1937, Ronald Coase elaborated his theory about why corporations exist. He said that a firm’s boundaries should be drawn so as to minimize transaction costs… search and information costs, bargaining costs, and cost of policing behavior. By almost every measure, then, external systems offer lower transaction costs than internal ones. No wonder some people think IT doesn’t matter.

If the best you can do is not mess up a nondifferentiating function like personnel management, it’s tough to claim that IT can be a competitive advantage. So, again I’ll ask, why?

I think there are exactly four reasons that internal corporate systems are so unloved and unlovable.

1. They serve their corporate overlords, not their users.

This is simple. Corporate systems are built according to what analysts believe will make the company more efficient. Unfortunately, this too often falls prey to penny-wise-pound-foolish decisions that micro-optimize costs while suboptimizing the overall value stream. Optimizing one person’s job with a system that creates more work for a number of other people doesn’t do any good for the company as a whole.

2. They only do gray-suited, stolidly conservative things.

Corporate IM now looks like an obvious idea, but messaging started frivolously. It was blocked, prohibited, and firewalled. In 1990, who would have spent precious capital on something to let cubicle-dwellers ask each other what they were doing for lunch? As it turns out, a few companies were on the leading edge of that wave, but their illicit communications were done in spite of IT.  How many companies would build something to "Create Breakthrough Products Through Collaborative Play?"

3. They have captive audiences.

If your company has six purchasing systems, that’s a problem. If you have a choice of six online stores, that’s competition.

4. They lack "give-a-shitness".

I think this one matters most of all. Commerce sites, Web 2.0 startups, IM networks… the software that people love was created by people who love it, too. It’s either their ticket to F-U money, it’s their brainchild, or it’s their livelihood. The people who build those systems live with them for a long time, often years. They have reason to care about the design and about keeping the thing alive.

This is also why, once acquired, startups often lose their luster. The founders get their big check and cash out. The barnstormers that poured their passion into it discover they don’t like being assimilated and drift away.

Architects, designers, and developers of corporate systems usually have little or no voice in what gets built, or how, or why. (Imagine the average IT department meeting where one developer says this system really ought to be built using Scala and Lift.) They don’t sign on; they get assigned. I know that individual developers do care passionately about their work, but they usually have no way to really make a difference.

The net result is that corporate software is software that nobody gives a shit about: not its creators, not its investors, and not its users.


Tracking and Trouble


Pick something in your world and start measuring it.  Your measurements will surely change a little from day to day. Track those changes over a few months, and you might have a chart something like this.

First 100 samples

Now that you’ve got some data assembled, you can start analyzing it. The average over this sample is 59.5. It’s got a standard deviation of 17, which is about 28% of the mean. You can look for trends. For example, we seem to see an upswing for the first few months, then a pullback starting around 90 days into the cycle. In addition, it looks like there is a pretty regular oscillation superimposed on the main trend, so you might be looking at some kind of weekly pattern as well.

The next few months of data should make the patterns clearer.

First 200 samples.

Indeed, from this chart, it looks pretty clear that the pullback around 100 days was the early indicator of a flattening in the overall growth trend from the first few months. Now, the weekly oscillations are pretty much the only movement, with just minor wobbles around a ceiling.

I’ll fast forward and show the full chart, spanning 1000 samples (almost three years’ worth of daily measurements.)

Full chart of 1000 samples

Now we can see that the ceiling established at 65 held against upward pressure until about 250 days in, when it finally gave way and we reached a new support at about 80. That support lasted for another year, when we started to see some gradual downward pressure resulting in a pullback to the mid-70s.

You’ve probably realized by now that I’m playing a bit of a game with you. These charts aren’t from any stock market or weather data. In fact, they’re completely random. I started with a base value of 55 and added a little random value each "day".
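
A few lines of Python are enough to produce a chart like these. The step size and seed here are arbitrary, but the recipe is just the one described: start at 55 and add a little random value each day.

```python
import random

random.seed(7)            # any seed will do; the data is pure noise either way
value = 55.0              # the base value
series = []
for day in range(1000):
    value += random.uniform(-1.5, 1.5)   # "a little random value each day"; the step size is arbitrary
    series.append(value)

# Plot the series (or just squint at the numbers) and you will still "see"
# trends, ceilings, and weekly-looking wiggles that mean nothing at all.
print(f"min {min(series):.1f}, max {max(series):.1f}, mean {sum(series)/len(series):.1f}")
```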

When you see the final chart, it’s easy to see it as the result of a random number generator. If you were to live this chart, day by day, however, it’s exceedingly hard not to impose some kind of meaning or interpretation on it. The tough part is that you actually can see some patterns in the data. I didn’t force the weekly oscillations into the random number function; they just appeared in the graph. We are all exceptionally good at pattern detection and matching. We’re so good, in fact, that we find patterns all over the place. When we are confronted with obvious patterns, we tend to believe that they’re real or that they emerge from some underlying, meaningful structure. But sometimes they’re nothing more than randomness.

Nassim Nicholas Taleb is today’s guru of randomness, but Benoit Mandelbrot wrote about it earlier in the decade, and Benjamin Graham wrote about this problem back in the 1920’s. I suspect someone has sounded this warning every decade since statistics were invented. Graham, Mandelbrot, and Taleb all tell us that, if we set out to find patterns in historical data, we will always find them. Whether those patterns have any intrinsic meaning is another question entirely. Unless we discover that there are real forces and dynamics that underlie the data, we risk fooling ourselves again and again.

We can’t abandon the idea of prediction, though. Randomness is real, and we have a tendency to be fooled by it. Still, even in the face of those facts, we really do have to make predictions and forecasts. Fortunately, there are about a dozen really effective ways to deal with the fundamental uncertainty of the future. I’ll spend a few posts exploring these different ways to deal with the uncertainty of the future.

Booklist


I made a LibraryThing list of books relevant to the stuff that’s banging around in my head now. These are in no particular order or organization. In fact, this is a live widget, so it might change as I think of other things that should be on the list.

The key themes here are time, complexity, uncertainty, and constraints. If you’ve got recommendations along these lines, please send them my way.