Wide Awake Developers


Reliability Math

Suppose you build a web site out of a single stack of one web, app, and database server. What sort of availability SLA should you be willing to support for this site?

We'll approach this in a few steps. For the first cut, you'd say that the appropriate SLA is just the expected availability of the site. Availability is defined in different ways depending on when and how you expect to measure it, but for the time being, we'll say that availability is the probability of getting an HTTP response when you submit a request. This is the instantaneous availability.

What is the probability of getting a response from the web server? Assuming that every request goes through all three layers, the probability of a response is the probability that all three components are working. That is:

  P(response) = P(web works) × P(app works) × P(db works)

This follows our intuition pretty closely. Since any of the three servers can go down, and any one server down takes down the site, we'd expect to just multiply the probabilities together. But what should we use for the reliability of the individual boxes? We haven't done a test to failure or life cycle test on our vendor's hardware. In fact, if our vendor has any MTBF data, they're keeping it pretty quiet.

We can spend some time hunting down server reliability data later. For now, let's just try to estimate it. In fact, let's estimate widely enough that we can be 90% confident that the true value is within our range. This will give us some pretty wide ranges, but that's OK... we haven't dug up much data yet, so there should be a lot of uncertainty. Uncertainty isn't a show stopper, and it isn't an excuse for inaction. It just means there are things we don't yet know. If we can quantify our uncertainty, then we can still make meaningful decisions. (And some of those decisions may be to go study something to reduce the uncertainty!)

Even cheap hardware is getting pretty reliable. Would you expect every server to fail once a year? Probably not. It's less frequent than that. Each server failing once every two years? Seems a little pessimistic, but not impossible. Let's start there. If every server fails once every two years, at a constant rate [1], then the lower bound on each server's annual reliability is 60.6%. Would we expect all three of these servers to run for five years straight without a failure? Possible, but unlikely. Let's use one failure across all three servers in five years as our upper bound. One failure in fifteen server-years gives an annual availability of 93.5% for each server.
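
As a sanity check, here's a tiny Ruby sketch of that constant-failure-rate math. It assumes the same model as footnote 1, where annual reliability is e^(-failures per year), and uses the two guessed rates above rather than any measured data.

# Annual reliability under a constant failure rate: R = e^(-lambda * t), with t = 1 year.
# The two failure rates below are the estimates from the text, not vendor data.
def annual_reliability(failures_per_year)
  Math.exp(-failures_per_year)
end

puts annual_reliability(1.0 / 2)    # one failure every two years        => ~0.606
puts annual_reliability(1.0 / 15)   # one failure per 15 server-years    => ~0.935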

So, each server's availability is somewhere between 60.6% and 93.5%. That's a pretty wide range, and won't be satisfactory to many people. That's OK, because it reflects our current degree of uncertainty.

To find the overall reliability, I could just take the worst case and plug it in for all three probabilities, then plug in the best case. That slightly overstates the edge cases, though. I'm better off getting Excel to help me run a Monte Carlo analysis across a bunch of scenarios. I'll construct a row that randomly samples a scenario from within these ranges. It will pick three values between 60.6% and 93.5% and compute their product. Then, I'll copy that row 10,000 times by dragging it down the sheet. Finally, I'll look at the spread of the computed products to get a range for the overall reliability. When I do that, I get a range of 28.9% to 62.6%. [2] [3]
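
For readers who'd rather skip Excel, here's a rough Ruby sketch of the same exercise. It is not the spreadsheet from footnote 2; it just samples each server's availability uniformly between the two bounds and reports the 5th and 95th percentiles of the product, which land near the range quoted above (the exact numbers wobble a bit from run to run).

# Monte Carlo estimate of overall availability for three servers in series.
# Each server's availability is sampled uniformly between the estimated bounds.
LOW, HIGH, TRIALS = 0.606, 0.935, 10_000

products = Array.new(TRIALS) do
  (1..3).inject(1.0) { |acc, _| acc * (LOW + rand * (HIGH - LOW)) }
end.sort

p5, p95 = products[(TRIALS * 0.05).to_i], products[(TRIALS * 0.95).to_i]
puts "5th percentile:  #{(p5 * 100).round(1)}%"
puts "95th percentile: #{(p95 * 100).round(1)}%"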

Yep, this single-stack web site will be available somewhere between 28.9% and 62.6% of the time. [4]

Actually, it's likely to be worse than that. There are two big problems in the analysis so far. First, we've only accounted for hardware failures, but software failures are a much bigger contributor to downtime. Second, and more seriously, the equation for overall reliability assumes that all failures are independent. That is, we implicitly assumed that nothing could cause more than one of these servers to fail at the same time. Talk about Pollyanna! We've got common-mode failures all over the place, especially in the network, power, and data center arenas.

Next time, we'll start working toward a more realistic calculation.


1. I'm using a lot of simplifying assumptions right now. Over time, I'll strip these away and replace them with more realistic calculations. For example, a constant failure rate implies an exponential distribution function. It is mathematically convenient, but doesn't represent the effects of aging on moving components like hard drives and fans.

2. You can download the spreadsheet here.

3. These estimation and analysis techniques are from "How to Measure Anything" by Doug Hubbard.

4. Clearly, for a single-threaded stack like this, you can achieve much higher reliability by running all three layers on a single physical host.

2009 Calendar as OmniGraffle Stencil

I had need of a stencil that would let me drop monthly calendars on a number of pages, so I made one. I found it useful, and someone else might, too.

Download the stencil.

Fast Iteration versus Elegant Design

I love the way that proggit bubbles stuff around. Today, for a while at least, the top link is to a story from Salon in May of 2000 about Bill and Lynne Jolitz, the creators of 386BSD.

[An aside: I'm not sure exactly when I became enough of a graybeard to remember as current events things which are now discussed as history. It's really disturbing that an article from almost a decade ago talks about events seven years earlier than that, and I remember them happening! To me, the real graybeards are the guys that created UNIX and C to begin with. Me? I'm part of the second or third UNIX generation, at best. Sigh...]

Anyway, Bill and Lynne Jolitz created the first free, open-source UNIX that ran on x86 chips.  Coherent was around before that, and I think SCO UNIX was available for x86 at the same time. SCO wasn't evil then, just expensive. In those days, you had to lay down some serious jing to get UNIX on your PC. Minix was available for free, but Tanenbaum held firm that Minix should teach principles rather than be a production OS, so he favored pedagogical value over functionality. Consequently, Minix wasn't a full UNIX implementation. (At least at that time. It might be now.)

Just contemplate the hubris of two programmers deciding that they would create their own operating system: it would be UNIX, minus the flaws, hacks, and workarounds that had built up over more than a decade. Not only that, but they would choose to give it away for the cost of floppies! And not only that, but they would build it for a processor that serious UNIX people sneered at. Most impressive of all, they succeeded. 386BSD was a technically superior, well-architected version of UNIX for commodity hardware. The Jolitzes extrapolated Intel's growth curve and rapid product cycles and saw that x86 processors would advance far faster than the technically superior RISC chips.

At various times, I ran Minix, 386BSD, and SCO UNIX on my PC well before I even heard of Linux. Each of them had the field before Linus even made his 0.1 release.

So why is Linux everywhere, while we only hear about 386BSD in historical contexts? There is exactly one answer, and it's what Eric Raymond was really talking about in The Cathedral and the Bazaar. TCatB has been seen mostly as an argument for open-source versus commercial software, but what Raymond saw was that the real competition comes down to an open contribution model versus a closed one. Linus' promiscuous contribution policy simply let Linux out-evolve 386BSD. More contributors meant more drivers, more bug fixes, more enhancements... more ideas, ultimately. Two people, no matter how talented, cannot outcode thousands of Linux contributors. The best programmers are 10 times more productive than the average, and I would rate Bill and Lynne among the very best. But, as of last April, the Linux Foundation reported that more than 3,600 people had contributed to the kernel alone.

Iteration is one of the fundamental dynamics. Iteration facilitates adaptation, and adaptation wins competition. History is littered with the carcasses of "superior" contenders that simply didn't adapt as fast as their victorious challengers.

Why Do Enterprise Applications Suck?

What is it about enterprise applications that makes them suck?

I mean, have you ever seen someone write 1,500 words about how much they love their corporate expense reporting system? Or spend their free time mashing up the job posting system with Google Maps? Of course not. But why not?

There's a quality about some software that inspires love in its users, and it's totally absent from enterprise software. The best you can ever say about enterprise software is when it doesn't get in the way of the business. At its worst, enterprise software creates more work than it automates.

For example, in my company, we've got a personnel management tool that's so unpredictable that every admin in the company keeps his or her own spreadsheet of requests that have been entered. They have to do that because the system itself randomly eats input, drops requests, or silently trashes forms. It's not a training problem, it's just lousy software.

We've got a time-tracking system that has a feature where an employee can enter a vacation request. There's a little workflow triggered to have the supervisor approve the vacation request. I've seen it used inside two groups. In both cases, the employee negotiates the leave request via email, then enters it into the time-tracking system. I know several people who use Travelocity to find their flights before they log in to our corporate travel system. And you wouldn't even believe how hard our sales force automation system is to use compared to Salesforce.com.

Way back in 1937, Ronald Coase elaborated his theory about why corporations exist. He said that a firm's boundaries should be drawn so as to minimize transaction costs... search and information costs, bargaining costs, and cost of policing behavior. By almost every measure, then, external systems offer lower transaction costs than internal ones. No wonder some people think IT doesn't matter.

If the best you can do is not mess up a nondifferentiating function like personnel management, it's tough to claim that IT can be a competitive advantage. So, again I'll ask, why?

I think there are exactly four reasons that internal corporate systems are so unloved and unlovable.

1. They serve their corporate overlords, not their users.

This is simple. Corporate systems are built according to what analysts believe will make the company more efficient. Unfortunately, this too often falls prey to penny-wise-pound-foolish decisions that micro-optimize costs while suboptimizing the overall value stream. Optimizing one person's job with a system that creates more work for a number of other people doesn't do any good for the company as a whole.

2. They only do gray-suited, stolidly conservative things.

Corporate IM now looks like an obvious idea, but messaging started frivolously. It was blocked, prohibited, and firewalled. In 1990, who would have spent precious capital on something to let cubicle-dwellers ask each other what they were doing for lunch? As it turns out, a few companies were on the leading edge of that wave, but their illicit communications were done in spite of IT.  How many companies would build something to "Create Breakthrough Products Through Collaborative Play?"

3. They have captive audiences.

If your company has six purchasing systems, that's a problem. If you have a choice of six online stores, that's competition.

4. They lack "give-a-shitness".

I think this one matters most of all. Commerce sites, Web 2.0 startups, IM networks... the software that people love was created by people who love it, too. It's either their ticket to F-U money, it's their brainchild, or it's their livelihood. The people who build those systems live with them for a long time, often years. They have reason to care about the design and about keeping the thing alive.

This is also why, once acquired, startups often lose their luster. The founders get their big check and cash out. The barnstormers that poured their passion into it discover they don't like being assimilated and drift away.

Architects, designers, and developers of corporate systems usually have little or no voice in what gets built, or how, or why. (Imagine the average IT department meeting where one developer says this system really ought to be built using Scala and Lift.) They don't sign on; they get assigned. I know that individual developers do care passionately about their work, but they usually have no way to really make a difference.

The net result is that corporate software is software that nobody gives a shit about: not its creators, not its investors, and not its users.

 

Tracking and Trouble

Pick something in your world and start measuring it.  Your measurements will surely change a little from day to day. Track those changes over a few months, and you might have a chart something like this.

First 100 samples

Now that you've got some data assembled, you can start analyzing it. The average over this sample is 59.5. It's got a variance of 17, which is about 28% of the mean. You can look for trends. For example, we seem to see an upswing for the first few months, then a pullback starting around 90 days into the cycle. In addition, it looks like there is a pretty regular oscillation superimposed on the main trend, so you might be looking at some kind of weekly pattern as well.

The next few months of data should make the patterns clearer.

First 200 samples.

Indeed, from this chart, it looks pretty clear that the pullback around 100 days was the early indicator of a flattening in the overall growth trend from the first few months. Now, the weekly oscillations are pretty much the only movement, with just minor wobbles around a ceiling.

I'll fast forward and show the full chart, spanning 1,000 samples (nearly three years' worth of daily measurements).

Full chart of 1000 samples

Now we can see that the ceiling established at 65 held against upward pressure until about 250 days in, when it finally gave way and we reached a new support at about 80. That support lasted for another year, when we started to see some gradual downward pressure resulting in a pullback to the mid-70s.

You've probably realized by now that I'm playing a bit of a game with you. These charts aren't from any stock market or weather data. In fact, they're completely random. I started with a base value of 55 and added a little random value each "day".
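
As an aside, here is a small Ruby sketch of how such a series can be generated. The starting value of 55 comes from the description above; the size of the daily random nudge is my own guess, since the original spreadsheet isn't shown here.

# Generate 1,000 "daily" samples: start at 55 and add a small random step each day.
value = 55.0
samples = Array.new(1000) { value += (rand - 0.5) * 2.0 }   # step between -1.0 and +1.0

mean = samples.reduce(:+) / samples.length
variance = samples.reduce(0.0) { |sum, x| sum + (x - mean) ** 2 } / samples.length
puts "mean: #{mean.round(1)}, variance: #{variance.round(1)}"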

When you see the final chart, it's easy to see it as the result of a random number generator. If you were to live this chart day by day, however, it's exceedingly hard not to impose some kind of meaning or interpretation on it. The tough part is that you actually can see some patterns in the data. I didn't force the weekly oscillations into the random number function; they just appeared in the graph. We are all exceptionally good at pattern detection and matching. We're so good, in fact, that we find patterns all over the place. When we are confronted with obvious patterns, we tend to believe that they're real or that they emerge from some underlying, meaningful structure. But sometimes they're really nothing more than randomness.

Nassim Nicholas Taleb is today's guru of randomness, but Benoit Mandelbrot wrote about it earlier in the decade, and Benjamin Graham wrote about this problem back in the 1920's. I suspect someone has sounded this warning every decade since statistics were invented. Graham, Mandelbrot, and Taleb all tell us that, if we set out to find patterns in historical data, we will always find them. Whether those patterns have any intrinsic meaning is another question entirely. Unless we discover that there are real forces and dynamics that underlie the data, we risk fooling ourselves again and again.

We can't abandon the idea of prediction, though. Randomness is real, and we have a tendency to be fooled by it. Still, even in the face of those facts, we really do have to make predictions and forecasts. Fortunately, there are about a dozen really effective ways to deal with the fundamental uncertainty of the future. I'll spend the next few posts exploring them.

Booklist

I made a LibraryThing list of books relevant to the stuff that's banging around in my head now. These are in no particular order or organization. In fact, this is a live widget, so it might change as I think of other things that should be on the list.

The key themes here are time, complexity, uncertainty, and constraints. If you've got recommendations along these lines, please send them my way.

Cold Turkey

Last night, I did something pretty drastic.  It wasn't on impulse... I had been thinking about this for quite a while. Finally, I decided to take the band-aid approach and just do it all at once.

I deleted all my games.

New and old alike, they all went. Bioshock, System Shock, System Shock II. GTA IV. GTA: Vice City. (I skipped San Andreas.) Venerable Diablo I and II, not to mention their leering cousin Overlord. Age of Empires. Several versions of Peggle and Bejeweled. Warcraft III. Every incarnation of Half-Life and Half-Life 2. Uplink, Darwinia, Wingnuts, Weird Worlds and SPORE.

Well, OK, it  wasn't that hard to give up SPORE, but seriously, deleting Darwinia hurt.

Why chuck hundreds of dollars of software into the bin? It's all about time. My own time and time with a capital 'T'. I need time to understand Time. Too much recombinant thought has taken up residence. It's time to marshal these unruly ideas and get them out. So, the games served during gestation, but now it's time and Time and past time for me to put them aside and get scholastic. Put pen to paper, or fingers to keyboard. Time to run some numbers, see the scenarios, and try to synthesize a cohesive whole. Time to abstract and distill and methodologize.

I know I'm being obscure. How can I not? Take the number of people exposed to a given process theory (OODA). Multiply it by the fraction who also know the second through the seventh (ToC, Lean, Six Sigma, TQM, Agile, Strategic Navigation). Mix in dynamical systems thinking (Senge, Liker, Hock.) Intersect that group with people who know something about uncertainty, complexity, and time. Now intersect it with people who view all the world as material and economic flux.  (If you are a member of the resulting set, I want to talk to you!)  I know all these things are deeply connected, but if I could articulate how, and why, then I'd already be done.

One thing I am already sure about, though, is this: It is all about Time. Time is far more fundamental and far less understood than you'd think. I'm not just talking about inappropriately scaled-up quantum mechanics metaphors. I mean that people fundamentally trip up on Time all the time. "The Black Swan" is just the tip of the iceberg.

If it works, I'll sound like an utter crackpot, raving and waving my very own personal ToE.

If it doesn't work, well, Steam knows which games I bought. I can always reinstall them.

Subtle Interactions, Non-local Problems

Alex Miller has a really interesting blog post up today. In LBQ + GC = slow, he shows how LinkedBlockingQueue can leave a chain of references from tenured dead objects to live young objects.  That sounds really dirty, but it actually means something to Java programmers. Something bad.

The effect here is a subtle interaction between the code and the mostly hidden, yet omnipresent, garbage collector. This interaction just happens to hit a known sore spot for the generational garbage collector. I won't spoil the ending, because I want you to read Alex's piece.

In effect, a one-line change to LinkedBlockingQueue has a dramatic effect on the garbage collector's performance. Because it causes more full GCs, you'd be likely to observe the symptoms in an area completely unconnected with the queue itself. By leaving these refchains worming through multiple generations in the heap, the queue damages a resource needed by every other part of the application.

This is a classic common-mode dependency, and it's very hard to diagnose because it results from hidden and asynchronous coupling.

Combining here docs and blocks in Ruby

Like a geocache, this is another post meant to help somebody who stumbles across it in a future Google search. (Or as an external reminder for me, when I forget how I did this six months from now.)

I've liked here-documents since the days of shell programming. Ruby has good support for here docs with variable interpolation. For example, if I want to construct a SQL query, I can do this:

def build_query(customer_id)
  <<-STMT
    select *
      from customer
     where id = #{customer_id}
  STMT
end

Disclaimer: Don't do this if customer_id comes from user input!
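
If it does come from user input, bind variables keep the value out of the SQL string entirely. Here's a minimal sketch, assuming the same RubyDBI setup used later in this post and an already-connected handle called dbh:

require 'dbi'

# A sketch using a bind variable instead of string interpolation.
# Assumes dbh is an already-connected DBI::DatabaseHandle.
def fetch_customer(dbh, customer_id)
  sth = dbh.prepare("select * from customer where id = ?")
  sth.execute(customer_id)
  rows = sth.fetch_all
  sth.finish
  rows
end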

Recently, I wanted a way to build inserts using a matching number of column names and placeholders.

def build_query
  <<-STMT
    insert into #{table} ( #{columns()} ) values ( #{column_placeholders()} )
  STMT
end

In this case, columns and column_placeholders were both functions.
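
Just to make the shape of the output concrete, the helpers might look something like this. The column names (and the column_names method itself) are made up for illustration; they are not from the original code.

# Hypothetical helpers, only to show what gets interpolated into the insert statement.
def column_names
  %w[id name email]
end

def columns
  column_names.join(", ")                     # => "id, name, email"
end

def column_placeholders
  (["?"] * column_names.length).join(", ")    # => "?, ?, ?"
end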

One oddity I ran into is the combination of here documents and block syntax. RubyDBI lets you pass a block when executing a query, the same way you would pass a block to File::open(). The block gets a "statement handle", which gets cleaned up when the block completes.

  dbh.execute(query) { |sth| 
    sth.fetch() { |row|
      # do something with the row
    }
  }

Combining these two lets you write something that looks like SQL invading Ruby:

  dbh.execute(<<-STMT) { |sth|
      select distinct customer, business_unit_id, business_unit_key_name
       from problem_ticket_lz
       order by customer
    STMT
    sth.fetch { |row|
      print "#{row[1]}\t#{row[0]}\t#{row[2]}\n"
    }
  }

This looks pretty good overall, but take a look at how the block opening interacts with the here doc. The here doc appears to be line-oriented, so it always begins on the line after the <<-STMT token. On the other hand, the block open follows the function, so the here doc gets lexically interpolated in the middle of the block, even though it has no syntactic relation to the block. No real gripe, just an oddity.

Beautiful Architecture

O'Reilly has released "Beautiful Architecture," a compilation of essays by software and system architects. I'm happy to announce that I have a chapter in this book. The finished book is shipping now, and available through Safari. I think the whole thing has turned out amazingly well, both instructive and interesting.

One of the editors, Diomidis Spinellis, has posted an excellent description and summary.






Another Cause of TNS-12541

There are about a jillion forum posts and official pages on the web that talk about ORA-12541, the infamous "TNS:No Listener" error. Somewhere around 70% of them appear to be link-farmers who just scrape all the Oracle forums and mailing lists.  Virtually all of the pages just refer back to the official definition from Oracle, which says "there's no listener running on the server" and tells you to log in to the server as admin and start up the listener.

Not all that useful, especially if you're not the DBA.

I found a different way that you can get the same error code, even when the listener is running. Special thanks to blogger John Jacob, whose post didn't quite solve my problem, but did set me on the right track.

Here's my situation. My client is a laptop connecting to the destination network through a VPN client. I'm connecting to an Oracle 10g service with 2 nodes. Tnsping reported success, the connection assistant could connect successfully, but sqlplus always reported TNS-12541 TNS:No listener.  The listener was fine.

Turning on client-side tracing, I saw that the initial connection attempt to the service VIP was successful, but that the server then sent back a packet with the hostname of a specific node to use. Here's where the problem begins.

Thanks to some quirk in the VPN configuration, I can only resolve DNS names on the VPN if they're fully qualified. The default search domain just flat doesn't work.  So, I can resolve proddb02.example.com but not proddb02. That's the catch, because the database sends back just the host portion of the node, not the FQDN. DNS resolution fails, but sqlplus reports it as "No listener", rather than saying "Host not found" or something useful like that.

Again, there are a jillion posts and articles telling network admins how to fix the default domain search on a VPN concentrator. And, again, I'm not the network admin, either.

The best I can do as a user is work around this issue by adding the IPs of the physical DB nodes to the hosts file on my own client machine.  Sure, some day it'll break when we re-address the DB nodes, and I will have long forgotten that I even put those addresses in C:\Windows\System32\Drivers\etc\hosts. Still, at least it works for now.
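
For reference, the workaround is just a couple of hosts-file entries along these lines. The IP addresses and the first hostname are placeholders I made up; only proddb02.example.com appears above.

# C:\Windows\System32\Drivers\etc\hosts  (addresses below are illustrative)
10.0.0.51    proddb01.example.com    proddb01
10.0.0.52    proddb02.example.com    proddb02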