Wide Awake Developers

Units of Measure in Scala


Failure to understand or represent units has caused several major disasters, most famously the loss of the Mars Climate Orbiter in 1999. This is one of those things that DSLs often get right, but mainstream programming languages just ignore. Or, worse, they end up with a clunky unit-of-measure library that ensures you can never again write a sensible arithmetic expression.

While I was at JAOO Australia this week, Amanda Laucher showed some F# code for a recipe that caught my attention. It used numeric literals that attached units directly to quantities. What’s more, it was intelligent about combining units.

I went looking for something similar in Scala. I googled my fingertips off, but without much luck, until Miles Sabin pointed out that there’s already a compiler plugin sitting right next to the core Scala code itself.

Installing Units

Scala has its own package manager, called sbaz. It can directly install the units extension:

sbaz install units

This will install it under your default managed installation. If you haven’t done anything else, that will be your Scala install directory. If you have done something else, you probably already know what you’re doing, so I won’t try to give you instructions.

Using Units

To use units, you first have to import the library’s “Preamble”. It’s also helpful to go ahead and import the “StandardUnits” object. That brings in a whole set of useful SI units.

I’m going to do all this from the Scala interactive interpreter.

scala> import units.Preamble._
import units.Preamble._

scala> import units.StandardUnits._
import units.StandardUnits._

After that, you can multiply any number by a unit to create a dimensional quantity:

scala> 20*m
res0: units.Measure = 20.0*m

scala> res0*res0
res1: units.Measure = 400.0*m*m

scala> Math.Pi*res0*res0
res2: units.Measure = 1256.6370614359173*m*m

Notice that when I multiplied a length (in meters) times itself, I got an area (square meters). To me, this is a really exciting thing about the units library. It can combine dimensions sensibly when you do math on them. In fact, it can help prevent you from incorrectly combining units.

scala> val length = 5*mm
length: units.Measure = 5.0*mm

scala> val weight = 12*g
weight: units.Measure = 12.0*g

scala> length + weight
units.IncompatibleUnits: Incompatible units: g and mm

I can’t add grams and millimeters, but I can multiply them.
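Multiplying them, on the other hand, is fine; the units just ride along. It comes out looking something like this (I haven’t checked which order the library prints the combined units in):

scala> length * weight
res3: units.Measure = 60.0*g*mm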

Creating Units

The StandardUnits package includes a lot of common units relating to basic physics. It doesn’t have any relating to system capacity metrics, so I’d like to create some units for that.

scala> import units._
import units._

scala> val requests = SimpleDimension("requests")
requests: units.SimpleDimension = requests

scala> val req = SimpleUnit("req", requests, 1.0)
req: units.SimpleUnit = req

scala> val Kreq = SimpleUnit("Kreq", requests, 1000.0)
Kreq: units.SimpleUnit = Kreq

Now I can combine that simple dimension with others. If I want to express requests per second, I can just write it directly.

scala> 565*req/s
res4: units.Measure = 565.0*req/s

Conclusion

This extension will be the first thing I add to new projects from now on. The convenience of literals, combined with the extensibility of adding my own dimensions and units, means I can easily keep units attached to all of my numbers.

There’s no longer any excuse to neglect your units in a mainstream programming language.

Kudos to Relevance and Clojure


It’s been a while since I blogged anything, mainly because most of my work lately has either been mind-numbing corporate stuff, or so highly contextualized that it wouldn’t be productive to write about.

Something came up last week, though, that just blew me away.

For various reasons, I’ve engaged Relevance to do a project for me. (Actually, the first results were so good that I’ve now got at least three more projects lined up.) They decided—and by “they”, I mean Stuart Halloway—to write the engine at the heart of this application in Clojure. That makes it sound like I was reluctant to go along, but actually, I was interested to see if the result would be as expressive and compact as everyone says.

Let me make a brief aside here and comment that I’m finding it much harder to be the customer on an agile project than to be a developer. I think there are two main reasons. First, it’s hard for me to keep these guys supplied with enough cards to fill an iteration. They’re outrunning me all the time. Big organizations like my employer just take a long time to decide anything. Second, there’s nobody else I can defer to when the team needs a decision. It often takes two weeks just for me to get a meeting scheduled with all of the stakeholders inside my company. That’s an entire iteration gone, just waiting to get to the meeting to make a decision! So, I’m often in the position of making decisions that I’m not 100% sure will be agreeable to all parties. So far, they have mostly worked out, but it’s a definite source of anxiety.

Anyway, back to the main point I wanted to make.

My personal theme is making software production-ready. That means handling all the messy things that happen in the real world. In a lab, for example, only one batch file ever needs to be processed at once. You never have multiple files waiting for processing, and files are always fully present before you start working on them. In production, that only happens if you guarantee it.

Another example, from my system. We have a set of rules (which are themselves written in Clojure code) that can be changed by privileged users. After changing the configuration, you can tell the daemonized Clojure engine to “(reload-rules!)”. The “!” at the end of that function means it’s an imperative with major side effects, so the rules get reloaded right now.

I thought I was going to catch them out when I asked, oh so innocently, “So what happens when you say (reload-rules!) while there’s a file being processed on the other thread?” I just love catching people when they haven’t dealt with all that nasty production stuff.

After a brief sidebar, Stu and Glenn Vanderburg decided that, in fact, nothing bad would happen at all, despite reloading rules in one thread while another thread was in the middle of using the rules.

Clojure uses a flavor of transactional memory, along with persistent data structures. No, that doesn’t mean they go in a database. It means that changes to a shared reference can only be made inside a transaction, and that the new version of a data structure and the old version exist simultaneously, for as long as there are outstanding references to them. So, in my case, that meant that the daemon thread would “see” the old version of the rules, because it had dereferenced the collection prior to the “reload-rules!” Meanwhile, the reload-rules! function would swap in the new rules in its own transaction. The next time the daemon thread comes back around and uses the reference to the rules, it’ll just see the new version of the rules.

In other words, two threads can both use the same reference, with complete consistency, because they each see a point-in-time snapshot of the collection’s state. The team didn’t have to do anything special to make this happen… it’s just the way that Clojure’s references, persistent data structures, and transactional memory work.
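Here’s a rough sketch of that snapshot idea in Scala (my own illustration, with made-up types; it uses a plain atomic reference rather than Clojure’s STM): when the rules live in an immutable collection and the threads share only a reference to it, a reader that dereferences once keeps a consistent view no matter what a writer swaps in afterward.

import java.util.concurrent.atomic.AtomicReference

object RulesEngine {
  type Rule = String => Boolean          // stand-in for the real (Clojure) rules

  // Shared mutable *reference* to an immutable collection of rules.
  private val rules = new AtomicReference[Vector[Rule]](Vector.empty)

  // Analogue of (reload-rules!): swap in a brand-new immutable vector.
  def reloadRules(newRules: Vector[Rule]): Unit = rules.set(newRules)

  // The daemon thread dereferences once and works against that snapshot;
  // a concurrent reload cannot alter the vector this thread is holding.
  def process(file: String): Boolean = {
    val snapshot = rules.get()
    snapshot.forall(rule => rule(file))
  }
}

Swap in a new vector from one thread while another is inside process, and the reader still finishes against the version it grabbed.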

Even though I didn’t get to catch Stu and Glenn out on a production readiness issue, I still had to admit that was pretty frickin’ cool.

JAOO Australia in 1 Month


The Australian JAOO conferences are now just one month away. I’ve wanted to get to Australia for at least ten years now, so I am thrilled to finally get there.

I’ll be delivering a tutorial on production-ready software at both the Brisbane and Sydney conferences. This tutorial was a hit at QCon London, where I first delivered it. The Australian version will be further improved.

During the main conference, I’ll be delivering a two-part talk on how distributed systems break and how to recover from such breakage. These talks apply whether you’re building web-facing systems or internal shared services/SOA projects.

Quantum Backups


Backups are the only macroscopic system we commonly deal with that exhibits quantum mechanical effects. This is odd enough that I’ve spent some time getting tangled up in these observations.

Until you attempt a restore, a backup set is neither good nor bad, but a superposition of both. This is the superposition principle.

The peculiarity of the superposition principle is dramatically illustrated with the experiment of Schrödinger’s backup. This is when you attempt to restore Schrödinger’s pictures of his cat, and discover that the cat is not there.

In a startling corollary, if you use offsite vaulting, a second quantum variable is introduced, in that the backup set exists and does not exist simultaneously. A curious effect emerges upon applying the Hamiltonian operator. The operator shows that certain eigenvalues are always zero, revealing that prime numbered tapes greater than 5 in a set never exist.

Finally, the Heisenbackup principle says that the user of a system is entangled with the system itself. As a result, within 30 days of consciously deciding that you do not need to run a backup, you will experience a complete disk crash. Because you’ve just read this, your 30 days start now.

Sorry about that.

Update: Sun Cloud API Not the Same as Amazon


It looks like the early reports that Sun’s cloud API would be compatible with AWS resulted from the reporters’ exuberance (or mere confusion).

It’s actually nicer than Amazon’s.

It is based on the REST architectural style, with representations in JSON. In fact, I might start using it as the best embodiment of REST principles. You start with an HTTP GET of “/”. In the response to this and every other request, hyperlinks indicate what actions are allowed.

Sun has a wiki to describe the API, with a very nicely illustrated “Hello, Cloud” example.

Can You Make That Meeting?


I’m convinced that the next great productivity revolution will be de-matrixing the organizations we’ve just spent ten years slicing and dicing.

Yesterday, I ran into a case in point: What are the odds that three people can schedule a meeting this week versus having to push it into next week?

Turns out that if they’re each 75% utilized, then there’s only a 15% chance they can schedule a one hour meeting this week. (If you always schedule 30 minute meetings instead of one hour, then the odds go up to about 25%.)

Here’s the probability curve that the meeting can happen. This assumes, by the way, that there are no lunches or vacation days, and that all parties are in the same time zone. It only gets worse from here.

So, overall, there’s about an 85% chance that 3 random people in a meeting-driven company will have to defer until next week.

Bring it up to 10 people, in a consensus-driven, meeting-oriented company, and the odds drop to 0.00095%.

No wonder “time to first meeting” seems to dominate “time to do stuff.”

Amazon as the New Intel


Update: Please read this update. The information underlying this post was based on early, somewhat garbled, reports.

A brief digression from the unpleasantness of reliability.

This morning, Sun announced their re-entry into the cloud computing market. After withdrawing Network.com from the marketplace a few months ago, we were all wondering what Sun’s approach would be. No hardware vendor can afford to ignore the cloud computing trend… it’s going to change how customers view their own data centers and hardware purchases.

One thing that really caught my interest was the description of Sun’s cloud offering. It sounded really, really similar to AWS. Then I heard the E-word and it made perfect sense. Sun announced that they will use EUCALYPTUS as the control interface to their solution. EUCALYPTUS is an open-source implementation of the AWS APIs.

Last week at QCon London, we heard Simon Wardley give a brilliant talk, in which he described Canonical’s plan to create a de facto open standard for cloud computing by seeding the market with open source implementations. Canonical’s plan? Ubuntu and private clouds running EUCALYPTUS.

It looks like Amazon may be setting the standard for cloud computing, in the same way that Intel set the standard for desktop and server computing, by defining the programming interface.

I don’t worry about this, for two reasons. One, it forestalls any premature efforts to force a de jure standard. This space is still young enough that an early standard can’t help but be a drag on exploration of different business and technical models. Two, Amazon has done an excellent job as a technical leader. If their APIs “win” and become de facto standards, well, we could do a lot worse.

Getting Real About Reliability


In my last post, I used some back-of-the-envelope reliability calculations, with just one interesting bit, to estimate the availability of a single-stacked web application, shown again here. I cautioned that there were a lot of unfounded assumptions baked in. Now it’s time to start removing those assumptions, though I reserve the right to introduce a few new ones.

Is it there when I want it?

First, let’s talk about the hardware itself. It’s very likely that these machines are what some vendors are calling "industry-standard servers." That’s a polite euphemism for "x86" or "ia64" that just doesn’t happen to mention Intel. ISS servers are expected to exhibit 99.9% availability.

There’s something a little bit fishy about that number, though. It’s one thing to say that a box is up and running ("available") 99.9% of the times you look at it. If I check it every hour for a year, and find it alive at least 8,756 out of 8,765 times, then it’s 99.9% available. It might have broken just once for 9 hours, or it might have broken 9 times for an hour each, or it might have broken 18 times for half an hour each.

This is the difference between availability and reliability. Availability measures the likelihood that a system can perform its function at a specific point in time. Reliability, on the other hand, measures the likelihood that a system has kept performing its function, without failure, up to a point in time. Availability and reliability both matter to your users. In fact, a large number of small outages can be just as frustrating as a single large event. (I do wonder… since both ends of the spectrum seem to stick out in users’ memories, perhaps there’s an optimum value for the duration and frequency of outages, where they are seldom enough to seem infrequent, but short enough to seem forgivable?)

We need a bit more math at this point.

It must be science… it’s got integrals.

Let’s suppose that hardware failures can be described as a function of time, and that they are essentially random. It’s not like the story of the "priceless" server room, where failure can be expected based on actions or inaction. We’ll also carry over the previous assumption that hardware failures among these three boxes are independent. That is, failure of any one box does not make the other boxes more likely to fail.

We want to determine the likelihood that the box is available, but the random event we’re concerned with is a fault. Thus, we first need to find the probability that a fault has occurred by time t. Checking for a failure is sampling for an event X between times 0 and t.
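In the usual notation (F is just my name for this cumulative probability), that is:

F(t) = P(X \le t) = \int_0^t f(\tau)\, d\tau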

The function f(t) is the probability density function that describes failures of this system. We’ll come back to that shortly, because a great deal hinges on what function we use here. The reliability of the system, then, is the probability that the event X didn’t happen by time t.
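In symbols (writing R for reliability):

R(t) = P(X > t) = 1 - F(t)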

One other equation that will help in a bit is the failure rate, the number of failures to expect per unit time. Like reliability, the failure rate can vary over time. The failure rate is:
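In the same notation (λ is my symbol for the rate):

\lambda(t) = \frac{f(t)}{R(t)} = \frac{f(t)}{\int_t^{\infty} f(\tau)\, d\tau}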

Failure distributions

So now we’ve got integrals to infinity of unknown functions. This is progress?

It is progress, but there are some missing pieces. Next time, I’ll talk about different probability distributions, which ones make sense for different purposes, and how to calibrate them with observations.

Reliability Math


Suppose you build a web site out of a single stack of one web, app, and database server. What sort of availability SLA should you be willing to support for this site?

We’ll approach this in a few steps. For the first cut, you’d say that the appropriate SLA is just the expected availability of the site. Availability is defined in different ways depending on when and how you expect to measure it, but for the time being, we’ll say that availability is the probability of getting an HTTP response when you submit a request. This is the instantaneous availability.

What is the probability of getting a response from the web server? Assuming that every request goes through all three layers, the probability of a response is the probability that all three components are working. That is:
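(Writing A for the availability of each piece:)

A_{site} = A_{web} \times A_{app} \times A_{db}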

This follows our intuition pretty closely. Since any of the three servers can go down, and any one server down takes down the site, we’d expect to just multiply the probabilities together. But what should we use for the reliability of the individual boxes? We haven’t done a test to failure or life cycle test on our vendor’s hardware. In fact, if our vendor has any MTBF data, they’re keeping it pretty quiet.

We can spend some time hunting down server reliability data later. For now, let’s just try to estimate it. In fact, let’s estimate widely enough that we can be 90% confident that the true value is within our range. This will give us some pretty wide ranges, but that’s OK… we haven’t dug up much data yet, so there should be a lot of uncertainty. Uncertainty isn’t a show stopper, and it isn’t an excuse for inaction. It just means there are things we don’t yet know. If we can quantify our uncertainty, then we can still make meaningful decisions. (And some of those decisions may be to go study something to reduce the uncertainty!)

Even cheap hardware is getting pretty reliable. Would you expect every server to fail once a year? Probably not. It’s less frequent than that. Each server failing once every two years? Seems a little pessimistic, but not impossible. Let’s start there. If every server fails once every two years, at a constant rate [1], then we can say that the lower bound on server reliability is 60.6%. Would we expect all of these servers to run for five years straight without a failure? Possible, but unlikely. Let’s use one failure over five years as our upper bound. One failure out of fifteen server-years would give an annual availability of 93.5% for each server.
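Those numbers come straight from the constant-rate (exponential) assumption in footnote [1]. Over one year,

R(1\,\mathrm{yr}) = e^{-\lambda \cdot 1\,\mathrm{yr}}

so \lambda = 1/2 per server-year gives e^{-0.5} \approx 0.606, and \lambda = 1/15 per server-year gives e^{-1/15} \approx 0.935.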

So, each server’s availability is somewhere between 60.6% and 93.5%. That’s a pretty wide range, and won’t be satisfactory to many people. That’s OK, because it reflects our current degree of uncertainty.

To find the overall reliability, I could just take the worst case and plug it in for all three probabilities, then plug in the best case. That slightly overstates the edge cases, though. I’m better off getting Excel to help me run a Monte Carlo analysis across a bunch of scenarios. I’ll construct a row that randomly samples a scenario from within these ranges. It will pick three values between 60.6% and 93.5% and compute their product. Then, I’ll copy that row 10,000 times by dragging it down the sheet. Finally, I’ll summarize the computed products to get a range for the overall reliability. When I do that, I get a weighted range of 28.9% to 62.6%. [2] [3]
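If you’d rather skip Excel, the same idea fits in a few lines of Scala. This is just my own toy sketch, not the spreadsheet from footnote [2]; it samples uniformly between 60.6% and 93.5% and reports the middle 90% of the simulated products, so its numbers will drift a little from the ones above:

import scala.util.Random

object ReliabilityMonteCarlo extends App {
  val (low, high) = (0.606, 0.935)   // per-server annual reliability range
  val trials = 10000

  // Sample one server's reliability uniformly from the estimated range.
  def sample(): Double = low + Random.nextDouble() * (high - low)

  // The site works only if web, app, and database servers all work.
  val products = Vector.fill(trials)(sample() * sample() * sample()).sorted

  // Report the middle 90% of the simulated outcomes.
  val p5  = products((trials * 0.05).toInt)
  val p95 = products((trials * 0.95).toInt)
  println(f"Overall availability roughly ${p5 * 100}%.1f%% to ${p95 * 100}%.1f%%")
}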

Yep, this single-stack web site will be available somewhere between 28.9% and 62.6% of the time. [4]

Actually, it’s likely to be worse than that. There are two big problems in the analysis so far. First, we’ve only accounted for hardware failures, but software failures are a much bigger contributor to downtime. Second, more seriously, the equation for overall reliability assumes that all failures are independent. That is, we implicitly assumed that nothing could cause more than one of these servers to fail simultaneously. Talk about Pollyanna! We’ve got common mode failures all over the place, especially in the network, power, and data center arenas.

Next time, we’ll start working toward a more realistic calculation.


1. I’m using a lot of simplifying assumptions right now. Over time, I’ll strip these away and replace them with more realistic calculations. For example, a constant failure rate implies an exponential distribution function. It is mathematically convenient, but doesn’t represent the effects of aging on moving components like hard drives and fans.

2. You can download the spreadsheet here.

3. These estimation and analysis techniques are from "How to Measure Anything" by Doug Hubbard.

4. Clearly, for a single-stacked site like this, you can achieve much higher reliability by running all three layers on a single physical host.