Wide Awake Developers


Metaphoric Problems in REST Systems

I used to think that metaphor was just a literary technique, something you could use to dress up a piece of creative writing. Reading George Lakoff's Metaphors We Live By, though, has changed my mind about that.

I now see that metaphor is not just something we use in writing; it's actually a powerful technique for structuring thought. We use metaphor when we are creating designs. We say that a class is like a factory, that an object is a kind of a thing. The thing may be an animal, it may be a part of a whole, or it may be representative of some real world thing.

All those are uses of metaphor, but there is a deeper structure of metaphors that we use every day, without even realizing it. We don't think of them as metaphors because in a sense these are actually the ways that we think. Lakoff uses the example of "The tree is in front of the mountain." Perfectly ordinary sentence. We wouldn't think twice about saying it.

But the mountain doesn't actually have a front, and neither does the tree. Or if the mountain does have a front, how do we know it's facing us? What we actually mean, if we unpack that metaphor, is something like, "The distance from me to the tree is less than the distance from me to the mountain." Or, "The tree is closer to me than the mountain is." Assigning "in front" to that relationship is actually a metaphoric construct.

When we say, "I am filled with joy," we are actually using a double metaphor: two different metaphors related structurally. One is "A Person Is A Container"; the other is "An Emotion Is A Physical Quantity." Together they make sense: if a person is a container and an emotion is a physical thing, then the person can be full of that emotion. In reality, of course, the person is no such thing. The person is full of all the usual things a person is full of: tissues, blood, bones, and other fluids that are best kept on the inside.

But we are embodied beings: we have an inside and an outside, and so we think of ourselves as a container with something on the inside.

This notion of containers is actually really important.

Because we are embodied beings, we tend to view other things as containers as well. It would make perfect sense to you if I said, "I am in the room." The room is a container, the building is a container. The building contains the room. The room contains me. No problem.

It would also make perfect sense to you, if I said, "That program is in my computer." Or we might even say, "that video is on the Internet." As though the Internet itself were a container rather than a vast collection of wires and specialized computers.

None of these things are containers, but it's useful for us to think of them as such. Metaphorically, we can treat them as containers. This isn't just an accident in the choice of prepositions. Rather, I think the use of those prepositions reflects the way that we think about these things.

We also tend to think about our applications as containers. The contents that they hold are the features they provide. This has provided a powerful way of thinking about and structuring our programs for a long time. In reality, no such thing is happening. The program source text doesn't contain features. It contains instructions to the computer. The features are actually sort of emergent properties of the source text.

Increasingly, the features aren't even fully specified within the source text. We went through a period when we could pretend that everything was inside an application. Take web systems, for example. We would pretend that the source text specified the program completely. We even talked about application containers. There was always a little bit of fuzziness around the edges. Sure, most of the behavior was inside the container, but there were always those extra bits. There was the web server, with its various rules for access control, rewrite rules, and ways to present friendly URLs. There were load balancers and firewalls. These active components meant that it was necessary to understand more than the program text in order to fully understand what the program was doing.

The more the network devices edged into Layer 7, previously the domain of the application, the more false the metaphor of program as container became. Look at something like a web application firewall. Or the miniature programs you can write inside of an F5 load balancer. These are functional behavior. They are part of the program. However, you will never find them in the source text. And most of the time, you don't find them inside the source control systems either.

Consequently, systems today are enormously complex. It's very hard to tell what a system is going to do once you put it into production, especially in those edge cases within hard-to-reach sections of the state space. We are just bad at thinking about emergent properties. It's hard to design properties to emerge from simple rules.

I think we'll feel this most acutely in RESTful architectures. In a fully mature REST architecture, the state of the system doesn't really exist in either the client or the server, but rather in the communication between the two of them. We say "Hypertext As The Engine Of Application State" (HATEOAS, a sort of shibboleth used to distinguish true RESTafarians from the rest of the world), but the truth is: what the client is allowed to do is told to it by the server at any point in time, and the next state transition is whatever the client chooses to invoke. Once we have that, the true behavior of the system can't actually be known just by the service provider.

In a REST architecture we follow an open-world assumption. When we're designing the service provider, we don't actually know who all the consumers are going to be or what their individual and particular workflows may be. Therefore we have to design for a visible system, an open system that communicates what it can do and what it has done at any point in time. Once we do that, the behavior is no longer just in the server. And in a sense it's not really in the client either. It's in the interaction between the two of them, in the collaborations.

That means the features of our system are emergent properties of the communication among these several parts. They're externalized. They're no longer in anything. There is no container. One could almost say there's no application. The features exist somewhere in the white space between the boxes on the architecture diagram.

I think we lack some of the conceptual tools for that as well. We certainly don't have a good metaphorical structure for thinking about behavior as a hive-like property emerging from the collaboration of these relatively independent and self-directed pieces of software.

I don't know where the next set of metaphors will come from. I do know that the attempt to force web-shaped systems into the "application as container" metaphor simply won't work anymore. In truth, it never worked all that well. But now it's broken down completely.

Metaphoric Problems in REST Systems (audio)


Circuit Breaker in Scala

FaKod (I think that translates as "The Fatalistic Coder"?) has written a nice Scala implementation of the Circuit Breaker pattern, and even better, has made it available on GitHub.

Check out http://github.com/FaKod/Circuit-Breaker-for-Scala for the code.

The Circuit Breaker can be mixed in to any type. See http://wiki.github.com/FaKod/Circuit-Breaker-for-Scala/ for an example of usage.
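The pattern itself is language-agnostic. For readers who don't use Scala, here is a minimal sketch of the same idea in Python; the class and method names are mine, not FaKod's API:

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after too many
    consecutive failures, half-open after a cooldown period,
    and closed again after a successful trial call."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # allow one trial call
            else:
                raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

    def _record_success(self):
        self.failures = 0
        self.state = "closed"
```

The key property, in any language, is that callers fail fast while the circuit is open instead of piling up on a sick dependency.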

Failover: Messy Realities

People who don't live in operations can carry some funny misconceptions in their heads. Some of my personal faves:

  • Just add some servers!
  • I want a report of every configuration setting that's different between production and QA!
  • We're going to make sure this (outage) never happens again!

I've recently been reminded of this during some discussions about disaster recovery. This topic seems to breed misconceptions. Somewhere, I think most people carry around a mental model of failover that looks like this:

Normal operations transitions directly and cleanly to failed over

That is, failover is essentially automatic and magical.

Sadly, there are many intermediate states that aren't found in this mental model. For example, there can be quite some time between a failure and its detection. Depending on the detection and notification mechanisms, there can be quite a delay before failover is initiated at all. (I once spoke with a retailer whose primary notification mechanism seemed to be the Marketing VP's wife.)

Once you account for delays, you also have to account for faulty mechanisms. Failover itself often fails, usually due to configuration drift. Regular drills and failover exercises are the only way to ensure that failover works when you need it. When the failover mechanisms themselves fail, your system gets thrown into one of these terminal states that require manual recovery.

Just off the cuff, I think the full model looks a lot more like this:

Many more states exist in the real world, including failure of the failover mechanism itself.

It's worth considering each of these states and asking yourself the following questions:

  • Is the state transition triggered automatically or manually?
  • Is the transition step executed by hand or through automation?
  • How long will the state transition take?
  • How can I tell whether it worked or not?
  • How can I recover if it didn't work?
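One way to make those questions concrete is to write the states down as a transition table. A sketch in Python, with state names that paraphrase my diagram rather than quote it, and transitions that will differ for any real system:

```python
# Hypothetical failover state machine. State and event names here are
# illustrative paraphrases; a real system will have its own labels.
TRANSITIONS = {
    ("normal", "fault occurs"):                   "failed, undetected",
    ("failed, undetected", "monitoring notices"): "failed, detected",
    ("failed, detected", "operator paged"):       "failover initiated",
    ("failover initiated", "mechanism works"):    "failed over",
    ("failover initiated", "mechanism fails"):    "manual recovery",
    ("failed over", "primary repaired"):          "normal",
    ("manual recovery", "humans fix it"):         "normal",
}

def next_state(state, event):
    # Unknown events leave you stuck in the current state,
    # which is itself a realistic failure mode.
    return TRANSITIONS.get((state, event), state)

# Even the happy path passes through several intermediate states:
path = ["normal"]
for event in ["fault occurs", "monitoring notices", "operator paged",
              "mechanism works", "primary repaired"]:
    path.append(next_state(path[-1], event))
```

Walking the table makes the point of the questions above: every arrow is a place where you must decide what triggers it, who executes it, how long it takes, and how you know it worked.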

Life's Little Frustrations

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -Leslie Lamport

On my way to QCon Tokyo and QCon China, I had some time to kill so I headed over to Delta's Skyclub lounge. I've been a member for a few years now. And why not? I mean, who could pass up tepid coffee, stale party snacks, and a TV permanently locked to CNN? Wait... that actually doesn't sound like such a hot deal.

Oh! I remember, it's for the wifi access. (Well, that plus reliably clean bathrooms, but we need not discuss that.) Being able to count on wifi access without paying for yet another data plan has been pretty helpful for me. (As an aside, I might change my tune once I try a mifi box. Carrying my own hotspot sounds even better.)

Like most wifi providers, the Skyclub has a captive portal. Before you can get a TCP/IP connection to anything, you have to submit a form with a checkbox to agree to 89 pages of terms and conditions. I'm well aware that Delta's lawyers are trying to make sure the company isn't liable if I go downloading bootlegs of every Ally McBeal episode. But I really don't know if these agreements are enforceable. For all I know, page 83 has me agreeing to 7 years indentured servitude cleaning Delta's toilets.

Anyway, Delta has outsourced operations of their wifi network to Concourse Communications. And apparently, they've had an outage all morning that has blocked anyone from using wifi in the Minneapolis Skyclubs. When I submit the form with the checkbox, I get the following error page:

Including this bit of stacktrace:

There's a lot to dislike here.

  1. Why is this yelling at me, the user? To anyone who isn't a web site developer, this makes it sound like the user did something wrong. There's a ton of scary language here: "instance-specific error", "allow remote connections", "Named Pipes Provider"... heck, this sounds like it's accusing the user of hacking servers. "Stack trace" sure sounds like the Feds are hot on somebody's trail, doesn't it?
  2. Isn't it fabulous to know that Ken keeps his projects on his D: drive? If I had to lay bets, I'd say that Ken screwed up his configuration string. In fact, the whole problem smells like a failed deployment or poorly executed change. Ken probably pushed some code out late on a Friday afternoon, then boogied out of town. My prediction (totally unverifiable, of course) is that this problem will take less than 5 minutes to resolve, once Ken gets his ass back from the beach.
  3. We mere users get to see quite a bit of internal information here. Nothing really damaging, unless of course Wilson ORMapper has some security defects or something like that.
  4. Stepping back from this specific error message, we have the larger question: is it sensible to couple availability of the network to the availability of this check-the-box application? Accessing the network is the primary purpose of this whole system. It is the most critical feature. Is collecting a compulsory boolean "true" from every user really as important as the reason the whole damn thing was built in the first place? Of course not! (As an aside, this is an example of Le Chatelier's Principle: "Complex systems tend to oppose their own proper function.")

We see this kind of operational coupling all the time. Non-critical features are allowed to damage or destroy critical features. Maybe there's a single thread pool that services all kinds of requests, rather than reserving a separate pool for the important things. Maybe a process is overly linearized and doesn't allow for secondary, after-the-fact processing. Or, maybe a critical and a non-critical system both share an enterprise service---producing a common-mode dependency.

Whatever the proximate cause, the underlying problem is lack of diligence in operational decoupling.
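The thread-pool example is worth making concrete. A minimal sketch in Python of reserving a separate pool for the critical path; the pool sizes and the request shape are illustrative assumptions, not recommendations:

```python
from concurrent.futures import ThreadPoolExecutor

# Reserve capacity for critical work instead of letting one shared pool
# serve everything. If the bulk pool saturates, critical requests are
# unaffected; with a single shared pool, they would queue behind junk.
critical_pool = ThreadPoolExecutor(max_workers=8, thread_name_prefix="critical")
bulk_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="bulk")

def process(request):
    # Stand-in for real request handling.
    return f"handled {request['id']}"

def handle(request):
    pool = critical_pool if request["critical"] else bulk_pool
    return pool.submit(process, request)
```

The same decoupling argument applies to connection pools, queues, and shared downstream services: the non-critical traffic must not be able to starve the critical traffic.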

Topics in Architecture

I'm working on a syllabus for an extensive course on web architecture. This will be for experienced programmers looking to become architects.

Like all of my work about architecture, this covers technology, business, and strategic aspects, so there's an emphasis on creating high-velocity, competitive organizations.

In general, I'm aiming for a mark that's just behind the bleeding edge. So, I'm including several of the NoSQL persistence technologies, for example, but not including Erjang because it's too early. (Or is that "erl-y"?)

(What I'd really love to do is make a screencast series out of all of these. I'm daunted, though. There's a lot of ground to cover here!)

EDIT: Added function and OO styles of programming. (Thanks @deanwampler.) Added JRuby/Java under languages. (Thanks @glv.)

I'm interested in hearing your feedback. What would you add? Remove?

  • Methods and Processes

    • Systems Thinking/Learning Organization
    • High Velocity Organizations
    • Safety Culture
    • Error-Inducing Systems ("Normal Accidents")
    • Points of Leverage
    • Fundamental Dynamics: Iteration, Variation, Selection, Feedback, Constraint
    • 5D architecture
    • Failures of Intuition
    • ToC
    • Critical Chain
    • Lean Software Development
    • Real Options
    • Strategic Navigation
    • OODA
    • Tempo, Adaptation
    • XP
    • Scrum
    • Lean
    • Kanban
    • TDD
  • Architecture Styles

    • REST / ROA
    • SOA
    • Pipes & Filters
    • Actors
    • App-server centric
    • Event-Driven Architecture
  • Web Foundations

    • The "architecture" of the web
    • HTTP 1.0 & 1.1
    • Browser fetch behaviors
    • HTTP Intermediaries
  • The Nature of the Web

    • Crowdsourcing
    • Folksonomy
    • Mashups/APIs/Linked Open Data
  • Testing

    • TDD
    • Unit testing
    • BDD/Spec testing
    • ScalaCheck
    • Selenium
  • Persistence

    • Redis
    • CouchDB
    • Neo4J
    • eXist
    • "Web-shaped" persistence
  • Technical architecture

    • 8 Fallacies of Distributed Computing
    • CAP Theorem
    • Scalability
    • Reliability
    • Performance
    • Latency
    • Capacity
    • Decoupling
    • Safety
  • Languages and Frameworks

    • Spring
    • Groovy/Grails
    • Scala
    • Lift
    • Clojure
    • Compojure
    • JRuby
    • Rails
    • OSGi
  • Design

    • Code Smells
    • Object Thinking
    • Object Design
    • Functional Thinking
    • API Design
    • Design for Operations
    • Information Hiding
    • Recognizing Coupling
  • Deployment

    • Physical
    • Virtual
    • Multisite
    • Cloud (AWS)
    • Chef
    • Puppet
    • Capistrano
  • Build and Version Control

    • Git
    • Ant
    • Maven
    • Leiningen
    • Private repos
    • Collaboration across projects

GMail Outage Was a Chain Reaction

Google has published an explanation of the widespread GMail outage from September 1st. In this explanation, they trace the root cause to a layer of "request routers":

...a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.

This perfectly describes the "Chain Reaction" stability antipattern from Release It!

An AspectJ Circuit Breaker

Spiros Tzavellas pointed me to his implementation of Circuit Breaker. His approach uses AspectJ and can be applied using a bytecode weaver or AspectJ compiler. He's also got unit tests with 85% coverage.

Spiros' project page is here, and the code is (where else?) on GitHub. He appears to be quite actively developing the project.

Two New Circuit Breaker Implementations

The excellent Will Sargent has created a Circuit Breaker gem that's quite nice. You can read the docs at rdoc.info. He's released the code (under LGPL) on GitHub.

The other one has actually been out for a couple of months now, but I forgot to blog about it. Scott Vlamnick created a Grails plugin that uses AOP to weave Circuit Breaker functionality as "around" advice. This one can also report its state via JMX. In a particularly nice feature, this plugin supports different configurations in different environments.

Kudos to Relevance and Clojure

It's been a while since I blogged anything, mainly because most of my work lately has either been mind-numbing corporate stuff, or so highly contextualized that it wouldn't be productive to write about.

Something came up last week, though, that just blew me away.

For various reasons, I've engaged Relevance to do a project for me. (Actually, the first results were so good that I've now got at least three more projects lined up.) They decided---and by "they", I mean Stuart Halloway---to write the engine at the heart of this application in Clojure. That makes it sound like I was reluctant to go along, but actually, I was interested to see if the result would be as expressive and compact as everyone says.

Let me make a brief aside here and comment that I'm finding it much harder to be the customer on an agile project than to be a developer. I think there are two main reasons. First, it's hard for me to keep these guys supplied with enough cards to fill an iteration. They're outrunning me all the time. Big organizations like my employer just take a long time to decide anything. Second, there's nobody else I can defer to when the team needs a decision. It often takes two weeks just for me to get a meeting scheduled with all of the stakeholders inside my company. That's an entire iteration gone, just waiting to get to the meeting to make a decision! So, I'm often in the position of making decisions that I'm not 100% sure will be agreeable to all parties. So far, they have mostly worked out, but it's a definite source of anxiety.

Anyway, back to the main point I wanted to make.

My personal theme is making software production-ready. That means handling all the messy things that happen in the real world. In a lab, for example, only one batch file ever needs to be processed at once. You never have multiple files waiting for processing, and files are always fully present before you start working on them. In production, that only happens if you guarantee it.

Another example, from my system. We have a set of rules (which are themselves written in Clojure code) that can be changed by privileged users. After changing the configuration, you can tell the daemonized Clojure engine to "(reload-rules!)". The "!" at the end of that function means it's an imperative with major side effects, so the rules get reloaded right now.

I thought I was going to catch them up when I asked, oh so innocently, "So what happens when you say (reload-rules!) while there's a file being processed on the other thread?" I just love catching people when they haven't dealt with all that nasty production stuff.

After a brief sidebar, Stu and Glenn Vanderburg decided that, in fact, nothing bad would happen at all, despite reloading rules in one thread while another thread was in the middle of using the rules.

Clojure uses a flavor of transactional memory, along with persistent data structures. No, "persistent" doesn't mean they go in a database. It means the data structures are immutable: a change produces a new version, and the new version and the old version exist simultaneously, for as long as there are outstanding references to them. Changes to a shared reference can only be made inside a transaction. So, in my case, that meant that the daemon thread would "see" the old version of the rules, because it had dereferenced the collection prior to the reload-rules! call. Meanwhile, reload-rules! would swap in the new collection in its own transaction. The next time the daemon thread comes back around and uses the reference to the rules, it'll just see the new version of the rules.

In other words, two threads can both use the same reference, with complete consistency, because they each see a point-in-time snapshot of the collection's state. The team didn't have to do anything special to make this happen... it's just the way that Clojure's references, persistent data structures, and transactional memory work.
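For readers who don't write Clojure, here is a rough Python analogue of that snapshot behavior. Python has no STM, so a lock-guarded swap of a single reference to an immutable tuple stands in for Clojure's refs and persistent structures; the rule contents are made up for illustration:

```python
import threading

# Rules live in an immutable tuple behind a single reference. A reader
# dereferences once and keeps that snapshot; a writer swaps in a whole
# new tuple. The old tuple survives for any reader still holding it.
_lock = threading.Lock()
_rules = (("max-size", 100),)           # snapshot, version 1

def reload_rules(new_rules):
    global _rules
    with _lock:
        _rules = tuple(new_rules)       # replace the reference atomically

def process_file(name):
    rules = _rules                      # deref once: a stable snapshot
    # ...even if reload_rules() runs right now, `rules` doesn't change...
    return (name, rules)

snapshot_before = process_file("batch-1")[1]
reload_rules([("max-size", 200)])
snapshot_after = process_file("batch-2")[1]
```

This only imitates the one scenario from the story; Clojure's refs give you coordinated, transactional updates across multiple references, which this sketch does not attempt.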

Even though I didn't get to catch Stu and Glenn out on a production readiness issue, I still had to admit that was pretty frickin' cool.

Getting Real About Reliability

In my last post, I used some back-of-the-envelope reliability calculations, with just one interesting bit, to estimate the availability of a single-stacked web application, shown again here. I cautioned that there were a lot of unfounded assumptions baked in. Now it's time to start removing those assumptions, though I reserve the right to introduce a few new ones.

Is it there when I want it?

First, let's talk about the hardware itself. It's very likely that these machines are what some vendors are calling "industry-standard servers." That's a polite euphemism for "x86" or "ia64" that just doesn't happen to mention Intel. ISS servers are expected to exhibit 99.9% availability.

There's something a little bit fishy about that number, though. It's one thing to say that a box is up and running ("available") 99.9% of the times you look at it. If I check it every hour for a year, and find it alive at least 8,756 out of 8,765 times, then it's 99.9% available. It might have broken just once for 9 hours, or it might have broken 9 times for an hour each, or it might have broken 18 times for half an hour each.

This is the difference between availability and reliability. Availability measures the likelihood that a system can perform its function at a specific point in time. Reliability, on the other hand, measures the likelihood that a system will have operated without failure up to a point in time. Availability and reliability both matter to your users. In fact, a large number of small outages can be just as frustrating as a single large event. (I do wonder... since both ends of the spectrum seem to stick out in users' memories, perhaps there's an optimum value for the duration and frequency of outages, where they are seldom enough to seem infrequent, but short enough to seem forgivable?)
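A quick bit of arithmetic makes the point that availability alone hides the outage pattern:

```python
HOURS_PER_YEAR = 8765              # hourly checks for a year, as above

def availability(total_down_hours):
    """Fraction of hourly checks that find the box alive."""
    return (HOURS_PER_YEAR - total_down_hours) / HOURS_PER_YEAR

# One 9-hour outage and nine 1-hour outages both spend 9 hours down,
# so the availability number cannot tell them apart, even though they
# feel very different to users.
single_big_outage = availability(9)
many_small_outages = availability(9 * 1)
```

Both come out to the same 99.9%, which is exactly why we need reliability, and not just availability, to describe a system's behavior over time.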

We need a bit more math at this point.

It must be science... it's got integrals.

Let's suppose that hardware failures can be described as a function of time, and that they are essentially random. It's not like the story of the "priceless" server room, where failure could be expected based on actions or inaction. We'll also carry over the previous assumption that hardware failures among these three boxes are independent. That is, failure of any one box does not make the other boxes more likely to fail.

We want to determine the likelihood that the box is available, but the random event we're concerned with is a fault. Thus, we first need to find the probability that a fault has occurred by time t. Checking for a failure is sampling for an event X between times 0 and t:

F(t) = P(X ≤ t) = ∫[0,t] f(τ) dτ

The function f(t) is the probability density function that describes failures of this system. We'll come back to that shortly, because a great deal hinges on which function we use here. The reliability of the system, then, is the probability that the event X didn't happen by time t:

R(t) = 1 − F(t) = ∫[t,∞) f(τ) dτ

One other equation that will help in a bit is the failure rate: the number of failures to expect per unit time. Like reliability, the failure rate can vary over time. The failure rate is:

λ(t) = f(t) / R(t)

Failure distributions

So now we've got integrals to infinity of unknown functions. This is progress?

It is progress, but there are some missing pieces. Next time, I'll talk about different probability distributions, which ones make sense for different purposes, and how to calibrate them with observations.
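As a small preview of that discussion, the constant-failure-rate case has simple closed forms. A sketch, with the exponential distribution standing in for f(t):

```python
import math

# Exponential failure model: f(t) = lam * exp(-lam * t).
# With a constant failure rate lam, the integrals above collapse into
# closed forms, which is exactly what makes this distribution so
# mathematically convenient (and so often too optimistic).

def failure_cdf(t, lam):
    """F(t): probability a fault has occurred by time t."""
    return 1.0 - math.exp(-lam * t)

def reliability(t, lam):
    """R(t) = 1 - F(t): probability of surviving past time t."""
    return math.exp(-lam * t)

def failure_rate(t, lam):
    """f(t) / R(t): constant for the exponential distribution."""
    f = lam * math.exp(-lam * t)
    return f / reliability(t, lam)

# One failure every two years (lam = 0.5 per year) gives the 60.6%
# one-year per-server reliability used in the last post.
r_one_year = reliability(1.0, 0.5)
```

Notice that the failure rate comes out constant: no infant mortality, no wear-out. That is the assumption we'll have to revisit for hardware with moving parts.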

Reliability Math

Suppose you build a web site out of a single stack of one web, app, and database server. What sort of availability SLA should you be willing to support for this site?

We'll approach this in a few steps. For the first cut, you'd say that the appropriate SLA is just the expected availability of the site. Availability is defined in different ways depending on when and how you expect to measure it, but for the time being, we'll say that availability is the probability of getting an HTTP response when you submit a request. This is the instantaneous availability.

What is the probability of getting a response from the web server? Assuming that every request goes through all three layers, the probability of a response is the probability that all three components are working. That is:

P(site up) = P(web up) × P(app up) × P(db up)

This follows our intuition pretty closely. Since any of the three servers can go down, and any one server down takes down the site, we'd expect to just multiply the probabilities together. But what should we use for the reliability of the individual boxes? We haven't done a test to failure or life cycle test on our vendor's hardware. In fact, if our vendor has any MTBF data, they're keeping it pretty quiet.

We can spend some time hunting down server reliability data later. For now, let's just try to estimate it. In fact, let's estimate widely enough that we can be 90% confident that the true value is within our range. This will give us some pretty wide ranges, but that's OK... we haven't dug up much data yet, so there should be a lot of uncertainty. Uncertainty isn't a show stopper, and it isn't an excuse for inaction. It just means there are things we don't yet know. If we can quantify our uncertainty, then we can still make meaningful decisions. (And some of those decisions may be to go study something to reduce the uncertainty!)

Even cheap hardware is getting pretty reliable. Would you expect every server to fail once a year? Probably not. It's less frequent than that. One out of the three servers failing every two years? That seems a little pessimistic, but not impossible. Let's start there. If every server fails once every two years, at a constant rate [1], then we can say that the lower bound on server reliability is 60.6%. Would we expect all of these servers to run for five years straight without a failure? Possible, but unlikely. Let's use one failure over five years as our upper bound. One failure out of fifteen server-years would give an annual availability of 93.5% for each server.

So, each server's availability is somewhere between 60.6% and 93.5%. That's a pretty wide range, and won't be satisfactory to many people. That's OK, because it reflects our current degree of uncertainty.

To find the overall reliability, I could just take the worst case and plug it in for all three probabilities, then plug in the best case. That slightly overstates the edge cases, though. I'm better off getting Excel to help me run a Monte Carlo analysis to give me an average across a bunch of scenarios. I'll construct a row that randomly samples a scenario from within these ranges. It will pick three values between 60.6% and 93.5% and compute their product. Then, I'll copy that row 10,000 times by dragging it down the sheet. Finally, I'll average out the computed products to get a range for the overall reliability. When I do that, I get a weighted range of 28.9% to 62.6%. [2] [3]

Yep, this single stack web site will be available somewhere between 28.9% of the time and 62.6%. [4]
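The spreadsheet procedure is easy to reproduce in code. A sketch in Python; uniform sampling within the estimated range is my assumption, and the exact percentile weighting behind the 28.9% and 62.6% figures may differ:

```python
import random

random.seed(42)                      # make the run reproducible

LOW, HIGH = 0.606, 0.935             # per-server availability bounds

# Mimic the spreadsheet: sample each server's availability from the
# estimated range, multiply the three together, repeat 10,000 times.
trials = [random.uniform(LOW, HIGH)
          * random.uniform(LOW, HIGH)
          * random.uniform(LOW, HIGH)
          for _ in range(10_000)]

worst = LOW ** 3                     # ~0.223: all three at the low end
best = HIGH ** 3                     # ~0.817: all three at the high end
mean = sum(trials) / len(trials)     # clusters well below either server's
                                     # own availability
```

Whatever the sampling details, the qualitative result is the same: chaining three uncertain components multiplies their unreliability, so the site is markedly worse than any single box.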

Actually, it's likely to be worse than that. There are two big problems in the analysis so far. First, we've only accounted for hardware failures, but software failures are a much bigger contributor to downtime. Second, more seriously, the equation for overall reliability assumes that all failures are disjoint. That is, we implicitly assumed that nothing could cause more than one of these servers to fail simultaneously. Talk about Pollyanna! We've got common mode failures all over the place, especially in the network, power, and data center arenas.

Next time, we'll start working toward a more realistic calculation.

1. I'm using a lot of simplifying assumptions right now. Over time, I'll strip these away and replace them with more realistic calculations. For example, a constant failure rate implies an exponential distribution function. It is mathematically convenient, but doesn't represent the effects of aging on moving components like hard drives and fans.

2. You can download the spreadsheet here.

3. These estimation and analysis techniques are from "How to Measure Anything" by Doug Hubbard.

4. Clearly, for a single-threaded stack like this, you can achieve much higher reliability by running all three layers on a single physical host.

Subtle Interactions, Non-local Problems

Alex Miller has a really interesting blog post up today. In LBQ + GC = slow, he shows how LinkedBlockingQueue can leave a chain of references from tenured dead objects to live young objects.  That sounds really dirty, but it actually means something to Java programmers. Something bad.

The effect here is a subtle interaction between the code and the mostly hidden, yet omnipresent, garbage collector. This interaction just happens to hit a known sore spot for the generational garbage collector. I won't spoil the ending, because I want you to read Alex's piece.

In effect, a one-line change to LinkedBlockingQueue has a dramatic effect on the garbage collector's performance. In fact, because the problem causes more full GCs, you'd be likely to observe it in an area completely unconnected with the queue itself. By leaving these refchains worming through multiple generations of the heap, the queue damages a resource needed by every other part of the application.

This is a classic common-mode dependency, and it's very hard to diagnose because it results from hidden and asynchronous coupling.

Attack of Self-Denial, 2008 Style

"Good marketing can kill your site at any time."

--Paul Lord, 2006

I just learned of another attack of self-denial from this past week.

Many retailers are suffering this year, particularly in the brick-and-mortar space. I have heard from several, though, who say that their online performance is not suffering as much as the physical stores are. In some cases, where the brand is strong and the products are not fungible, the online channel is showing year-over-year growth.

One retailer I know was running strong, with the site near its capacity. They fit the bill for an online success in 2008: great name recognition, a very strong global brand, and customers who love their products. This past week, their marketing group decided to "take it to the next level."

They blasted an email campaign to four million customers.  It had a good offer, no qualifier, and a very short expiration time---one day only.  A short expiration like that creates a sense of urgency.  Good marketing reaches people and induces them to act, and in that respect, the email worked. Unfortunately, when that means millions of users hitting your site, you may run into trouble.

Traffic flooded the site and knocked it offline. It took more than 6 hours to get everything functioning again.

Instead of getting an extra bump in sales, they lost six hours of holiday-season revenue. As a rule of thumb, you should assume that a peak hour of holiday sales counts for six hours of off-season sales.

There are technological solutions that help with this kind of traffic flood. For instance, the UX group can create a static landing page for the offer. Then marketing links to that static page in their email blast. Ops can push that static page out to their cache servers, or even into their CDN's edge network. This requires some preparation for each offer, and it takes some extra preparation before the first such offer, but it's very effective. The static page absorbs the bulk of the traffic, so only customers who really want to buy get passed into the dynamic site.

Marketing can also send the email out in waves, so people receive it at different times. That spreads the traffic spike out over a few hours. (Though this doesn't work so well when you send the waves throughout the night, because customers will all see it in a couple of hours in the morning.)

In really extreme cases, a portion of capacity can be carved out and devoted to handling promotional traffic. That way, if the promotion goes nuclear, at least the rest of the site is still online. Obviously, this would be more appropriate for a long-running promotion than a one-day event.

Of course, it should be obvious that all of these technological solutions depend on good communication.

At a surface level, it's easy to say that this happened because marketing had no idea how close to the edge the site was already running. That's true. It's also true, however, that operations previously had no idea what the capacity was. If marketing called and asked, "Can we support 4 million extra visits?" the current operations group could have answered "no". Previously, the answer would have been "I don't know."

So, operations got better, but marketing never got re-educated. Lines of communication were never opened, or re-opened. Better communication would have helped.

In any online business, you must have close communications between marketing, UX, development, and operations. They need to regard themselves as part of one integrated team, rather than four separate teams. I've often seen development groups that view operations as a barrier to getting their stuff released. UX and marketing view development as the barrier to getting their ideas implemented, and so on. This dynamic evolves from the "throw it over the wall" approach, and it can only result in finger-pointing and recriminations.

I'd bet there's a lot of finger-pointing going on in that retailer's hallways this weekend.

Connection Pools and Engset

In my last post, I talked about using Erlang models to size the front end of a system. By using some fundamental capacity models that are almost a century old, you can estimate the number of request handling threads you need for a given traffic load and request duration.

Inside the Box

It gets tricky, though, when you start to consider what happens inside the server itself. Processing the request usually involves some kind of database interaction with a connection pool. (There are many ways to avoid database calls, or at least minimize the damage they cause. I'll address some of these in a future post, but you can also check out Two Ways to Boost Your Flagging Web Site for starters.) Database calls act like a kind of "interior" request that can be considered to have its own probability of queuing.

Exterior call to server becomes an "interior" call to a database.

Because this interior call can block, we have to consider what effects it will have on the duration of the exterior call. In particular, the exterior call must take at least the sum of the blocking time plus the processing time for the interior call.

At this point, we need to make a few assumptions about the connection pool. First, the connection pool is finite. Every connection pool should have a ceiling. If nothing else, the database server can only handle a finite number of connections. Second, I'm going to assume that the pool blocks when exhausted. That is, calling threads that can't get a connection right away will happily wait forever rather than abandoning the request. This is a simplifying assumption that I need for the math to work out. It's not a good configuration in practice!

With these assumptions in place, I can predict the probability of blocking within the interior call. It's a formula closely related to the Erlang model from my last post, but with a twist. The Erlang models assume an essentially infinite pool of requestors. For this interior call, though, the pool of requestors is quite finite: it's the number of request handling threads for the exterior calls. Once all of those threads are busy, there aren't any left to generate more traffic on the interior call!

The formula to compute the blocking probability with a finite number of sources is the Engset formula. Like the Erlang models, Engset originated in the world of telephony. It's useful for predicting the outbound capacity needed on a private branch exchange (PBX), because the number of possible callers is known. In our case, the request handling threads are the callers and the connection pool is the PBX.

Practical Example

Using our 1,000,000 page views per hour from last time, Table 1 shows the Engset table for various numbers of connections in the pool. This assumes that the application server has a maximum of 40 request handling threads. This also supposes that the database processing time uses 200 milliseconds of the 250 milliseconds we measured for the exterior call.


Notice that when we get to 18 connections in the pool, the probability of blocking drops below 50%.  Also, notice how sharply the probability of blocking drops off around 23 to 31 connections in the pool. This is a decidedly nonlinear effect!

From this table, it's clear that even though there are 40 request handling threads that could call into this pool, there's not much point in having more than 30 connections in the pool. At 30 connections, the probability of blocking is already less than 1%, meaning that the queuing time is only going to add a few milliseconds to the average request.
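The post doesn't include the code behind Table 1, but the Engset call-congestion probability can be computed directly from its definition. Here is a minimal sketch in Python. The parameter names are mine, and the per-source offered load `a` used in the loop is purely illustrative; the exact parameterization behind the table above isn't shown, so you'd have to estimate `a` from your own measurements:

```python
from math import comb

def engset_blocking(sources, servers, a):
    """Engset call-congestion probability for a finite population.

    sources -- number of potential requestors (request handling threads)
    servers -- number of servers (connections in the pool)
    a       -- offered load per idle source (service time / idle time)
    """
    # Call congestion uses (sources - 1) in the binomial coefficients,
    # because a source already being served generates no new traffic.
    numerator = comb(sources - 1, servers) * a ** servers
    denominator = sum(comb(sources - 1, k) * a ** k
                      for k in range(servers + 1))
    return numerator / denominator

# 40 request handling threads calling into pools of various sizes.
for pool_size in (10, 18, 24, 30):
    p = engset_blocking(40, pool_size, 2.0)  # a = 2.0 is illustrative only
    print(f"{pool_size:3d} connections: P(block) = {p:.4f}")
```

Note the finite-source effect: with a pool of 40 connections (one per thread), the blocking probability is exactly zero, which no infinite-source Erlang model would predict.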

Why do we care? Why not just crank up the connection pool size to 40? After all, if we did, then no request could ever block waiting for a connection. That would minimize latency, wouldn't it?

Yes, it would, but at a cost. Increasing the number of connections to the database by a third means more memory and CPU time on the database just managing those connections, even if they're idle. If you've got two app servers, then the database probably won't notice an extra 10 connections. Suppose you scale out at the app tier, though, and you now have 50 or 60 app servers. You'd better believe that the DB will notice an extra 500 to 600 connections. They'll affect memory needs, CPU utilization, and your ability to fail over correctly when a database node goes down.

Feedback and Coupling

There's a strong coupling between the total request duration in the interior call and the request duration for the exterior call. If we assume that every request must go through the database call, then the exterior response time must be strictly greater than the interior blocking time plus the interior processing time.

In practice, it actually gets a little worse than that, as this causal loop diagram illustrates.

 Time dependencies between the interior call and the exterior call.

It reads like this: "As the interior call blocking time increases, the exterior call duration increases." This type of representation helps clarify relations between the different layers. It's very often the case that you'll find feedback loops this way. Any time you do find a feedback loop, it means that slowdowns will produce increasing slowdowns. Blocking begets blocking, quickly resulting in a site hang.


Queues are like timing dots. Once you start seeing them, you'll never be able to stop. You might even start to think that your entire server farm looks like one vast, interconnected set of queues.

That's because it is.

People use database connection pools because creating new connections is very slow. Tuning your database connection pool size, however, is all about optimizing the cost of queueing against the cost of extra connections. Each connection consumes resources on the database server and in the application server. Striking the right balance starts by identifying the required exterior response time, then sizing the connection pool---or changing the architecture---so the interior blocking time doesn't break the SLA.

For much, much more on the topic of capacity modeling and analysis, I definitely recommend Neil Gunther's website, Performance Agora. His books are also a great---and very practical---way to start applying performance and capacity management.

Thread Pools and Erlang Models

Sizing, Danish Style

Folks in telecommunications and operations research have used Erlang models for almost a century. A. K. Erlang, a Danish telephone engineer, developed these models to help plan the capacity of the phone network and predict the grade of service that could be guaranteed, given some basic metrics about call volume and duration. Telephone networks are expensive to deploy, particularly when upgrading your trunk lines involves digging up large portions of rocky Danish ground or running cables under the North Sea.

The Erlang-B formula predicts the probability that an incoming call cannot be serviced, based on the call arrival rate, average call time, and number of lines available.  Erlang-C is similar, but allows for calls to be queued while waiting for service. It predicts the probability that a call will be queued. It can also show when calls will never be serviced, because the rate of arriving calls exceeds the system's total capacity to serve them.

Erlang models are widely used in telecomm, including GPRS network sizing, trunk line sizing, call center staffing models, and other capacity planning arenas where request arrival is apparently random. In fact, you can use it to predict the capacity and wait time at a restaurant, bank branch, or theme park, too.

It should be pretty obvious that Erlang models are widely applicable in computer performance analysis, too. There's a rich body of literature on this subject that goes back to the dawn of the mainframe. Erlang models are the foundation of most capacity management groups. I'm not even going to scratch the surface here, except to show how some back-of-the-envelope calculations can help you save millions of dollars.

One Million Page Views

In my case, I wanted to look at thread pool sizing. Suppose you have an even 1,000,000 requests per hour to handle. This implies an arrival rate (or lambda) of 0.27777... requests per millisecond. (Erlang units are dimensionless, but you need to start with the same units of time, whether it's hours, days, or milliseconds.) I'm going to assume for the moment that the system is pretty fast, so it handles a request in 250 milliseconds, on average.

(Please note that there are many assumptions underneath simple statements like "on average". For the moment, I'll pretend that request processing time follows a normal distribution, even though any modern system is more likely to be bimodal.)

Table 1 shows a portion of the Erlang-C table for these parameters. Feel free to double-check my work with this spreadsheet or this short C program to compute the Erlang-B and Erlang-C values for various numbers of threads. (Thanks to Kenneth J. Christensen for the original program. I can only claim credit for the extra "for" loop.)

Table 1. Erlang-C values at 250 ms / request

N    Pr_Queue (Erlang-C)

From Table 1, I can immediately see that anything less than 70 threads will never keep up. With less than 70 threads, the queue of unprocessed requests will grow without bound. I need at least 91 threads to get below a 1% chance that a request will be delayed by queueing.
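The linked spreadsheet and C program aren't reproduced here, so here is a rough Python equivalent of the same calculation, using the standard recursive form of Erlang-B and deriving Erlang-C from it. The traffic calculation follows the numbers above: 1,000,000 requests per hour at 250 ms each, or about 69.4 erlangs of offered load.

```python
def erlang_b(servers, traffic):
    """Erlang-B blocking probability via the numerically stable recursion."""
    b = 1.0
    for n in range(1, servers + 1):
        b = traffic * b / (n + traffic * b)
    return b

def erlang_c(servers, traffic):
    """Erlang-C probability that a request has to queue.

    Only meaningful when traffic < servers; otherwise the queue of
    unprocessed requests grows without bound.
    """
    b = erlang_b(servers, traffic)
    rho = traffic / servers
    return b / (1.0 - rho * (1.0 - b))

lam = 1_000_000 / 3_600_000   # arrivals per millisecond
traffic = lam * 250           # offered load in erlangs, about 69.4

for threads in (70, 80, 91, 100):
    print(f"{threads:3d} threads: P(queue) = {erlang_c(threads, traffic):.4f}")
```

Since the offered load is about 69.4 erlangs, any thread count at or below 69 can never keep up, which matches the observation about needing at least 70 threads.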

Performance and Capacity

Now, what happens if the average request processing time goes up by 100 milliseconds on those same million requests? Adjusting the parameters, I get Table 2.

Table 2. Erlang-C values at 350 ms / request

N    Pr_Queue (Erlang-C)

Now we need a minimum of 99 threads before we can even expect to keep up and we need 122 threads to get down under that 1% queuing threshold.

On the other hand, what about increasing performance by 100 milliseconds per request? I'll let you run the calculator for that, but it looks to me like we need between 42 and 59 threads to meet the same thresholds.

That swing, from 150 to 350 milliseconds per request, makes a huge difference in the number of concurrent threads your system must support to handle a million requests per hour---almost a factor of three. Would you be willing to triple your hardware for the same request volume? Next time anyone says that "CPU is cheap", fold your arms and tell them "Erlang would not approve." On the flip side, it might be worth spending some administrator time on performance tuning to bring down your average page latency. Or maybe some programmer time to integrate memcached so every single page doesn't have to trudge all the way to the database.

Summary and Extension

Obviously, there's a lot more to performance analysis for web servers than this. Over time, I'll be mixing more analytic pieces with the pragmatic, hands-on posts that I usually make. It'll take some time. For one thing, I have to go back and learn about stochastic processes and Markov chains. Pattern recognition and signal processing I've got. Advanced probability and statistics I don't got.

In fact, I'll offer a free copy of Release It! to the first commenter who can show me how to derive an Erlang-like model that accounts for a) garbage collection times (bimodal processing time distribution), b) multiple coupled wait states during processing, c) non-equilibrium system states, and d) processing time that varies as a function of system utilization.

Perfection is Not Always Required

In my series on dirty data, I made the argument that sometimes incomplete, inaccurate, or inconsistent data was OK. In fact, not only is it OK, but it can be an advantage.

There's a really slick Ruby library called WhatLanguage that illustrates this beautifully. The author also wrote a nice article introducing the library. WhatLanguage automatically determines the language that a piece of text is written in.

For example (from the article):

require 'whatlanguage'

"Je suis un homme".language # => :french

Very nice.

WhatLanguage works by comparing words in the input text to a data structure that can tell you whether a word exists in the corpus. There's the catch, though. It can return a false positive! That would mean you get an incorrect "yes" sometimes for words that aren't in the language in question. On the other hand, it's guaranteed against false negatives.

You might imagine that there are pretty limited circumstances when you'd use a data structure that sometimes returns incorrect answers. (There is a calculable probability of a false positive. It never reaches zero.) It works for WhatLanguage, though.

You see, each word contributes to a histogram binned by possible language. Ultimately, one language "wins", based on whichever has the most entries in the histogram. False positives may contribute an extra point to incorrect languages, but the correct language will pretty much always emerge from the noise, provided there's enough source text to work from.
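The article doesn't show WhatLanguage's internals here, but a membership structure with false positives and no false negatives describes a Bloom filter. The voting approach above can be sketched like this in Python; the corpora are tiny, hypothetical word lists, and the filter sizing is arbitrary:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false-positive rate."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, word):
        # Derive several independent bit positions from one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = True

    def __contains__(self, word):
        # True if every position is set; may be a false positive.
        return all(self.bits[pos] for pos in self._positions(word))

def guess_language(text, corpora):
    """Vote each word into a histogram; the language with the most hits wins."""
    scores = {lang: 0 for lang in corpora}
    for word in text.lower().split():
        for lang, bloom in corpora.items():
            if word in bloom:
                scores[lang] += 1
    return max(scores, key=scores.get)

# Hypothetical toy corpora -- real ones would hold full word lists.
french = BloomFilter()
for w in ["je", "suis", "un", "homme"]:
    french.add(w)
english = BloomFilter()
for w in ["i", "am", "a", "man"]:
    english.add(w)

corpora = {"french": french, "english": english}
```

An occasional false positive adds a stray point to the wrong language's score, but the correct language still wins the vote as long as there's enough text.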

So, there's another example of information emerging from noisy inputs, just as long as there's enough of it.



ReadWriteWeb on Dirty Data

A short while back, I did a brief series on the value of "dirty data"---copious amounts of unstructured, non-relational data created by the many interactions users have with your site and each other.

ReadWriteWeb has a post up about Four Ad-Free Ways that Mined Data Can Make Money, along very similar lines.  Well worth a read.

How Buildings Learn

Stewart Brand's famous book How Buildings Learn has been on my reading queue for a while, possibly a few years. Now that I've begun reading it, I wish I had gotten it sooner. Listen to this:

The finished-looking model and visually obsessive renderings dominate the let's-do-it meeting, so that shallow guesses are frozen as deep decisions. All the design intelligence gets forced to the earliest part of the building process, when everyone knows the least about what is really needed.

Wow. It's hard to tell what industry he's talking about there. It could easily apply to software development. No wonder Brand is so well-regarded in the Agile community!

Another wonderful parallel is between what Brand calls "Low Road" and "High Road" buildings. A Low Road building is one that is flexible, cheap, and easy to modify. It's hackable. Lofts, garages, old factory floors, warehouses, and so on. Each new owner can gut and modify it without qualms. A building where you can drill holes through the walls, run your own cabling, and rip out every interior wall is a Low Road building.

High Road buildings evolve gradually over time, through persistent care and love. There doesn't necessarily have to be a consistent--or even coherent--vision, but each owner does need to feel a strong sense of preservation. High Road buildings become monuments, but they aren't made that way. They just evolve in that direction as each generation adds its own character.

Then there are the buildings that aren't High or Low Road. Too static to be Low Road, but not valued enough to be High Road. Resistant to change, bureaucratic in management. Diffuse responsibility produces static (i.e., dead) buildings. Deliberately setting out to design a work of art, paradoxically, prevents you from creating a living, livable building.

Again, I see some clear parallels to software architecture here. On the one hand, we've got Low Road architecture. Easy to glue together, easy to rip apart. Nobody gets bent out of shape if you blow up a hodge-podge of shoestring batch jobs and quick-and-dirty web apps. CGI scripts written in Perl are classic Low Road architecture. It doesn't mean they're bad, but they're probably not going to go a long time without being changed in some massive ways.

High Road architecture would express a conservatism that we don't often see. High Road is not "big" architecture. Rather, High Road means cohesive systems lovingly tended. Emacs strikes me as a good example of High Road architecture. Yes, it's accumulated a lot of bits and oddments over the years, but it's quite conservative in its architecture.

Enterprise SOA projects, to me, seem like dead buildings. They're overspecified and too focused on the moment of rollout. They're the grand facades with leaky roofs. They're the corporate office buildings that get gerrymandered into paralysis. They preach change, but produce stasis.

Dan Pritchett on Availability

Dan Pritchett is a man after my own heart. His latest post talks about the path to availability enlightenment. The obvious path--reliable components and vendor-supported commercial software--leads only to tears.

You can begin on the path to enlightenment when you set aside dreams of perfect software running on perfect hardware, talking over perfect networks. Instead, embrace the reality of fallible components. Don't design around them, design for them.

How do you design for failure-prone components? That's what most of Release It! is all about.

Hard Problems in Architecture

Many of the really hard problems in web architecture today exist outside the application server.  Here are three problems that haven't been solved. Partial solutions exist today, but nothing comprehensive.

Uncontrolled demand

Users tend to arrive at web sites in huge gobs, all at once. As the population of the Net continues to grow, and the need for content aggregators/filters grows, the "front page" effect will get worse.

One flavor of this is the "Attack of Self-Denial", an email, radio, or TV campaign that drives enough traffic to crash the site.  Marketing can slow you down. Really good marketing can kill you at any time.

Versioning, deployment and rollback

With large scale applications, as with large enterprise integrations, versioning rapidly becomes a problem. Versioning of schemas, static assets, protocols, and interfaces. Add in a dash of SOA, and you have a real nightmare. You can count on having at least one interface broken at any given time. Or, you introduce such powerful governors that nothing ever changes.

As the number of nodes increases, you eventually find that there's always at least one deployment going on. A "deployment" becomes less of a point-in-time activity than it is a rolling wave. A new service version will take hours or days to be deployed to every node. In the meantime, both the old and new service version must coexist peacefully. Since both service versions will need to support multiple protocol versions (see above) you have a combinatorial problem.

And, of course, some of these deployments will have problems of their own. Today, many application deployments are "one way" events. The deployment process itself has irreversibly destructive effects. This will have to change, so every deployment can be done both forward and back. Oh, and every deployment will also be deploying assets to multiple targets---web, application, and database---while also triggering cache flushes and, possibly, metadata changes to external partners like Akamai.

Applications will need to participate in their own versioning, deployment, and management. 

Blurring the lines

There used to be a distinction between the application and the infrastructure it ran on. That meant you could move applications around and they would behave pretty much the same in a development environment as in the real production environment. These days, firewalls, load balancers, and caching appliances blur the lines between "infrastructure" and "application". It's going to get worse as the lines between "operational support system" and "production applications" get blurred, too. Automated provisioning, SLA management, and performance management tools will all have interactions with the applications they manage. These will inevitably introduce unexpected interactions... in a word, bugs.

Webber and Fowler on SOA Man-Boobs

InfoQ posted a video of Jim Webber and Martin Fowler doing a keynote speech at QCon London this Spring. It's a brilliant deconstruction of the concept of the Enterprise Service Bus. I can attest that they're both funny and articulate (whether on the stage or off.)

Along the way, they talk about building services incrementally, delivering value at every step along the way. They advocate decentralized control and direct alignment between services and the business units that own them. 

I agree with every word, though I'm vaguely uncomfortable with how often they say "enterprise man boobs".

Coincidence or Back-end Problem?

An odd thing happened to me today. Actually, an odd thing happened yesterday, but it's having the same odd thing happen today that really makes it odd. With me so far?

Yesterday, while I was shopping at Amazon, Amazon told me that my American Express card had expired. The card does expire in May, but a May several years in the future. I didn't think too much of it, because when I re-entered the same information, Amazon accepted it.

Today, I got the same thing with the same card on iTunes!

Online stores don't do a whole lot with your credit cards. For the most part, they just make a call out to a credit card processor. Small stores have to go through a second-tier CCVS system that charges a few pennies per transaction. Large ones---and do they get larger than Amazon?---generally connect directly to a payment processor. The payment processor may charge a fraction of a cent per transaction, but they definitely make it up in volume.

(There are other business factors, too, like the committed transaction volume, response time SLAs, and the like.)

Asynchronously, the payment processor collects from the issuing bank. It's the issuing bank that actually bills you, and sets your interest rate and payment terms.

Whereas VISA and MasterCard work with thousands of issuers, American Express doesn't. When you get an AmEx card, they are the issuing bank as well as the payment processor.

Which makes it highly suspect that the same card gave me the same error through two different sites. It makes me think that American Express has introduced a bug in their validation system, causing spurious declines for expiration. 

SOA: Time For a Rethink

The notion of a service-oriented architecture is real, and it can deliver. The term "SOA", however, has been entirely hijacked by a band of dangerous Taj Mahal architects. They seem innocuous, in part because they'll lull you to sleep with endless protocol diagrams. Behind the soporific technology discussion lies a grave threat to your business.

"SOA" has come to mean top-down, up-front, strong-governance, all-or-nothing process (moving at glacial speed) implemented by an ill-conceived stack of technologies. SOAP is not the problem. WSDL is not the problem. Even BPEL is not the problem. The problem begins with the entire world view.

We need to abandon the term "SOA" and invent a new one. "SOA" is chasing a false goal. The idea that services will be so strongly defined that no integration point will ever break is unachievable. Moreover, it's optimizing for the wrong thing. Most businesses today are not safety-critical. Instead, they are highly competitive.

We need loosely-coupled services, not orchestration.

We need services that emerge from the business units they serve, not an IT governance panel.

We need services to change as rapidly as the business itself changes, not after a chartering, funding, and governance cycle.

Instead of trying to build an antiseptic, clockwork enterprise, we need to embrace the messy, chaotic, Darwinian nature of business. We should be enabling rapid experimentation, quick rollout of "barely sufficient" systems, and fast feedback. We need to enable emergence, not stifle it.

Anything that slows down that cycle of experimentation and adjustment puts your business on the same evolutionary path as the Great Auk. I never thought I'd find myself quoting Tom Peters in a tech blog, but the key really is to "Test fast, fail fast, adjust fast."

The JVM is Great, But...

Much of the interest in dynamic languages like Groovy, JRuby, and Scala comes from running on the JVM. That lets them leverage the tremendous R&D that's gone into JVM performance and stability. It also opens up the universe of Java libraries and frameworks.

And yet, much of my work deals with the 80% of cost that comes after the development project is done. I deal with the rest of the software's lifetime. The end of development is the beginning of the software's life. Throughout that life, many of the biggest, toughest problems exist around and between the JVM's: Scalability, stability, interoperability, and adaptability.

As I previously showed in this graphic, the easiest thing for a Java developer to create is a slow, unscalable, and unstable web application. Making high-performance, robust, scalable applications still requires serious expertise. This is a big problem, and I don't see it getting better. Scala might help here in terms of stability, but I'm not yet convinced it's suitable for the largest masses of Java developers. Normal attrition means that the largest population of developers will always be the youngest and least experienced. This is not a training problem: in the post-commoditization world, the majority of code will always be written by undertrained, least-cost coders. That means we need platforms where the easiest thing to do is also the right thing to do.

Scaling distributed systems has gotten better over the last few years. Distributed memory caching has reached the mainstream. Terracotta and Coherence are both mature products, and they both let you try them out for free. In the open source crowd, as usual, you lose some manageability and some time-to-implement, but the projects work when you use them right. All of these do the job of connecting individual JVMs to a caching layer. On the other hand, I can't help but feel that the need for these products points to a gap in the platform itself.

OSGi is finally reaching the mainstream. It's been needed for a long time, for a couple of reasons. First, it's still too common to see gigantic classpaths containing multiple versions of JAR files, leading to the perennial joy of finding obscure, it-works-fine-in-QA bugs. So, keeping individual projects in their own bundles, with no classpath pollution will be a big help. Versioning application bundles is also important for application management and deployment. OSGi is what we should have had since the beginning, instead of having the classpath inflicted on us.

I predict that we'll see more production operations moving to hot deployment on OSGi containers. For enterprise services that require 100% uptime, it's just no longer acceptable to bring down the whole cluster in order to do deployments. Even taking an entire server down to deploy a new revision may become a thing of the past. In the Erlang world, it's common to see containers running continuously for months or years. In Programming Erlang, Joe Armstrong talks about sending an Erlang process a message to "become" a new package. It works without disrupting any current requests and it happens atomically between one service request and the next. (In fact, Joe says that one of the first things he does on a new system is deploy the container processes, at the very beginning of the project. Later, once he knows what the system is supposed to do, he deploys new packages into those containers.) Hot deployment can be safe, if the code being deployed is sufficiently decoupled from the container itself. OSGi does that.

OSGi also enables strong versioning of the bundles and their dependencies. This is an all-around good thing, since it will let developers and operations agree on exactly which versions of which components belong in production at a given time.


SAP has been talking up their suite of SOA tools. The names all run together a bit, since they each contain some permutation of "enterprise" and "builder", but it's a very impressive set of tools.

Everything SAP does comes off of an enterprise service repository (ESR). This includes a UDDI registry, and it supports service discovery and lookup. Development tools allow developers to search and discover services through their "ES Workspace". Interestingly, this workspace is open to partners as well as internal developers.

From the ESR, a developer can import enough of a service definition to build a composite application. Composite applications include business process definitions, new services of their own, local UI components, and remote service references.

Once a developer creates a composite application, it can be deployed to a local container or a test server. Presumably, there's a similar tool available for administrators to deploy services, composite applications, and other enterprise components onto servers.

Through it all, the complete definition of every component goes into the ESR.

In order to make the entire service lifecycle work, SAP has defined a strong meta-model and a very strong governance process.

This is the ultimate expression of the top-down, strong-governance model for enterprise SOA.

If you're into that sort of thing.


SOA at 3.5 Million Transactions Per Hour

Matthias Schorer talked about FIDUCIA IT AG and their service-oriented architecture. This financial services provider works with 780 banks in Europe, processing 35,000,000 transactions during the banking day. That works out to a little over 3.5 million transactions per hour.

Matthias described this as a service-oriented architecture, and it is. Be warned, however, that SOA does not imply or require web services. The services here exist in the middle tier. Instead of speaking XML, they mainly use serialized Java objects. As Matthias said, "if you control both ends of the communication, using XML is just crazy!"

They do use SOAP when communicating out to other companies.

They've done a couple of interesting things. They favor asynchronous communication, which makes sense when you architect for latency. Where many systems push data into the async messages, FIDUCIA does not. Instead, they put the bulk data into storage (usually files, sometimes structured data) and send control messages instructing the middle tier to process the records. This way, large files can be split up and processed in parallel by a number of the processing nodes. Obviously, this works when records are highly independent of each other.
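The control-message pattern above can be sketched in a few lines: the bulk records stay in storage, and a small message names a range of records for a worker to claim. This is my own illustration, not FIDUCIA's code; the class and method names are invented, and a summing loop stands in for real record processing.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: bulk data stays in storage; a small control message tells a
// worker which range of records to process. Works because records are
// independent of each other.
public class ChunkedProcessor {
    // A control message names a range of records, not the data itself.
    record Chunk(int start, int end) {}

    // Split N records into fixed-size chunks for independent workers.
    static List<Chunk> split(int totalRecords, int chunkSize) {
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < totalRecords; i += chunkSize) {
            chunks.add(new Chunk(i, Math.min(i + chunkSize, totalRecords)));
        }
        return chunks;
    }

    // Process all chunks in parallel; each worker reads only its own range.
    static long processAll(int[] records, int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Chunk c : split(records.length, chunkSize)) {
                results.add(pool.submit(() -> {
                    long sum = 0;                 // stand-in for real work
                    for (int i = c.start(); i < c.end(); i++) sum += records[i];
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> f : results) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] records = new int[10_000];
        Arrays.fill(records, 1);
        System.out.println(processAll(records, 1_000, 4)); // prints 10000
    }
}
```

The point of the pattern is that the message queue carries only the cheap `Chunk` descriptors, so a large file splits naturally across as many processing nodes as you have.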

Second, they have defined explicit rules and regulations about when to expect transactional integrity. The restrictions are tight enough that transactional operations are a minority. In all other cases, developers are required to design for the fact that ACID properties do not hold.

Third, they've built a multi-layered middle tier. Incoming requests first hit a pair of "Central Process Servers" which inspect the request. Requests are dispatched to individual "portals" based on their customer ID. Different portals run different versions of the software, so FIDUCIA can support customers on different production versions. Instead of attempting to combine versions on a single cluster, they just partition the portals (clusters).

Each portal has its own load distribution mechanism, using work queues that the worker nodes listen to.

This multilevel structure lets them scale to over 1,000 nodes while keeping each cluster small and manageable.

The net result is that they can process up to 2,500 transactions per second, with no scaling limit in sight.

Amazon Blows Away Objections

Amazon must have been burning more midnight oil than usual lately.

Within the last two weeks, they've announced three new features that basically eliminate any remaining objections to their AWS computing platform.

Elastic IP Addresses 

Elastic IP addresses solve a major problem on the front end.  When an EC2 instance boots up, the "cloud" assigns it a random IP address. (Technically, it assigns two: one external and one internal.  For now, I'm only talking about the external IP.) With a random IP address, you're forced to use some kind of dynamic DNS service such as DynDNS. That lets you update your DNS entry to connect your long-lived domain name with the random IP address.

Dynamic DNS services work pretty well, but not universally well. For one thing, there is a small amount of delay. Dynamic DNS works by setting a very short time-to-live (TTL) on the DNS entries, which instructs intermediate DNS servers to cache the entry only for a few minutes. Even when that works well, you still have a few minutes of downtime when you need to reassign your DNS name to a new IP address. For some parts of the Net, dynamic DNS doesn't work well at all, usually because some ISP doesn't respect the TTL on DNS entries, but caches them for a longer time.

Elastic IP addresses solve this problem. You request an elastic IP address through a Web Services call.  The easiest way is with the command-line API:

$ ec2-allocate-address

Once the address is allocated, you own it until you release it. At this point, it's attached to your account, not to any running virtual machine. Still, this is good enough to go update your domain registrar with the new address. After you start up an instance, then you can attach the address to the machine. If the machine goes down, then the address is detached from that instance, but you still "own" it.

So, for a failover scenario, you can reassign the elastic IP address to another machine, leave your DNS settings alone, and all traffic will now come to the new machine.

Now that we've got elastic IPs, there's just one piece missing from a true HA architecture: load distribution. With just one IP address attached to one instance, you've got a single point of failure (SPOF). Right now, there are two viable options to solve that. First, you can allocate multiple elastic IPs and use round-robin DNS for load distribution. Second, you can attach a single elastic IP address to an instance that runs a software load balancer: pound, nginx, or Apache+mod_proxy_balancer. (It wouldn't surprise me to see Amazon announce an option for load-balancing-in-the-cloud soon.) You'd run two of these, with the elastic IP attached to one at any given time. Then, you need a third instance monitoring the other two, ready to flip the IP address over to the standby instance if the active one fails. (There are already some open-source and commercial products to make this easy, but that's the subject for another post.)

Availability Zones 

The second big gap that Amazon closed recently deals with geography.

In the first rev of EC2, there was absolutely no way to control where your instances were running. In fact, there wasn't any way inside the service to even tell where they were running. (You had to resort to traceroute or geo-mapping of the IPs.) This presents a problem if you need high availability, because you really want more than one location.

Availability Zones let you specify where your EC2 instances should run. You can get a list of them through the command-line (which, let's recall, is just a wrapper around the web services):

$ ec2-describe-availability-zones
AVAILABILITYZONE    us-east-1a    available
AVAILABILITYZONE    us-east-1b    available
AVAILABILITYZONE    us-east-1c    available

Amazon tells us that each availability zone is built independently of the others. That is, they might be in the same building or separate buildings, but they have their own network egress, power systems, cooling systems, and security. Beyond that, Amazon is pretty opaque about the availability zones. In fact, not every AWS user will see the same availability zones. They're mapped per account, so "us-east-1a" for me might map to a different hardware environment than it does for you.

How do they come into play? Pretty simply, as it turns out. When you start an instance, you can specify which availability zone you want to run it in.

Combine these two features, and you get a bunch of interesting deployment and management options.

Persistent Storage

Storage has been one of the most perplexing issues with EC2. Simply put, anything you stored to disk while your instance was running would be lost when you restart the instance. Instances always go back to the bundled disk image stored on S3.

Amazon has just announced that they will be supporting persistent storage in the near future. A few lucky users get to try it out now, in its pre-beta incarnation.

With persistent storage, you can allocate space in chunks from 1 GB to 1 TB.  That's right, you can make one web service call to allocate a freaking terabyte! Like IP addresses, storage is owned by your account, not by an individual instance. Once you've started up an instance---say a MySQL server, for example---you attach the storage volume to it. To the virtual machine, the storage looks just like a device, so you can use it raw or format it with whatever filesystem you want.

Best of all, because this is basically a virtual SAN, you can do all kinds of SAN tricks, like snapshot copies for backups to S3.

Persistent storage done this way obviates some of the other dodgy efforts that have been going on, like  FUSE-over-S3, or the S3 storage engine for MySQL.

SimpleDB is still there, and it's still much more scalable than plain old MySQL data storage, but we've got scores of libraries for programming with relational databases, and very few that work with key-value stores. For most companies, and for the foreseeable future, programming to a relational model will be the easiest thing to do. This announcement really lowers the barrier to entry even further.


With these announcements, Amazon has cemented AWS as a viable computing platform for real businesses.


The Granularity Problem

I spend most of my time dealing with large sites. They're always hungry for more horsepower, especially if they can serve more visitors with the same power draw. Power goes up much faster with more chassis than with more CPU cores. Not to mention, administrative overhead tends to scale with the number of hosts, not the number of cores. For them, multicore is a dream come true.

I ran into an interesting situation the other day, on the other end of the spectrum.

One of my teammates was working with a client that had relatively modest traffic levels. They're in a mature industry with a solid, but not rabid, customer base. Their web traffic needs could easily be served by one Apache server running one CPU and a couple of gigs of RAM.

The smallest configuration we could offer, and still maintain SLAs, was two hosts, with a total of 8 CPU cores running at 2 GHz, 32 gigs of RAM, and 4 fast Ethernet ports.

Of course that's oversized! Of course it's going to cost more than it should! But at this point in time, if we're talking about dedicated boxes, that's the smallest configuration we can offer! (Barring some creative engineering, like using fully depreciated "classics" hardware that's off its original lease, but still has a year or two before EOL.)

As CPUs get more cores, the minimum configuration is going to become more and more powerful. The quantum of computing is getting large.

Not every application will need that much, and that's another reason I think private clouds make a lot of sense. Companies can buy big boxes, then allocate fractions of them to specific applications. That gains cost efficiency in administration, power, and space consumption (though not heat production!) while still letting business units optimize their capacity downward to meet their actual demand.

Software Failure Takes Down Blackberry Services

Anyone who's addicted to a Blackberry already knows about Monday's four-hour outage. For some of us, the Blackberry isn't just an electronic leash, it's part of our business operations.

Like cell phones, Blackberries have a huge, hidden infrastructure behind them. Corporate BlackBerry Enterprise Servers (BES) relay email, calendar, and contact information through RIM's infrastructure, out through the wireless carriers. It was RIM's own infrastructure that suffered from intermittent failures during the outage.

Data Center Knowledge reports that the outage was caused by a failed software upgrade.

Releases are risky. We use testing and QA to reduce the risk, but every line of new or modified code represents an unknown.

How can we reduce the risk of an upgrade? One way is to roll it out slowly. Companies with widely distributed point-of-sale (POS) systems know this. They never push a release out to every store at once. They start with one or two. If that works, they go up to a larger handful, maybe four to eight. After a couple of days, they'll roll it out to an entire district. It can take a week or more to roll the release out everywhere.

In the interim, there are plenty of checkpoints where the release can be rolled back.

I strongly recommend approaching Web site releases the same way. Roll the new release out to one or two servers in your farm. Let a fraction of your customers into the new release. Watch for performance regressions, capacity problems, and functional errors. Absolutely ensure that you can roll it back if you need to. Once it's "baked" for a while in production, then roll it to the remaining app servers.
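Letting "a fraction of your customers into the new release" requires a routing decision that is stable per customer, so the same person doesn't flap between versions from one request to the next. Here's a minimal sketch of one way to do that, hashing a customer ID against a rollout percentage; the class and method names are my own invention, not a product API.

```java
// Sketch of a canary rollout: a fixed fraction of customers is routed
// to the new release, and the fraction is raised as the release "bakes".
public class CanaryRouter {
    private final int rolloutPercent;  // 0..100

    public CanaryRouter(int rolloutPercent) {
        this.rolloutPercent = rolloutPercent;
    }

    // Stable per-customer decision: the same customer always lands in
    // the same bucket, so sessions stay on one version of the software.
    public boolean useNewRelease(String customerId) {
        int bucket = Math.floorMod(customerId.hashCode(), 100);
        return bucket < rolloutPercent;
    }

    public static void main(String[] args) {
        CanaryRouter router = new CanaryRouter(10); // 10% canary
        System.out.println(router.useNewRelease("customer-42"));
    }
}
```

Raising `rolloutPercent` from 10 to 50 to 100 over a few days mirrors the POS-style staged rollout, and setting it back to 0 is your instant rollback lever.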

This approach demands a few corollaries. First, your database updates have to be structured in a forward-compatible way, and they must always allow for rollback. There can be no irrevocable updates. Second, two versions of your software will be operating simultaneously. That means your integration protocols and static assets have to be able to accommodate both versions. I discuss specific strategies for each of these aspects in Release It.

Finally, an aside: RIM's statement about the outage isn't reflected anywhere on their site. Once again, if what you want is the latest true information about a company, the very last place to find it is the company's own web site. 

The Pragmatic Architect on Security

Catching up on some reading, I finally got a chance to read Ted Neward's article "Pragmatic Architecture: Security".  It's very good.  (Actually, the whole series is pretty good, and I recommend them all.  At least as of February 2008... I make no assertions about future quality!)

Ted nails it.  I agree with all of the principles he identifies, and I particularly like his advice to "fail securely". 

I would add one more, though: Be visible.

After any breach, the three worst questions are always:

  1. How long has this been happening?
  2. How much have we lost?
  3. Why didn't we know about it sooner?

The answers are always, respectively, "Far too long", "We have no idea", and "We didn't expect that exploit". To which the only possible response is, "Well, duh, if you'd expected it, you would have closed the vulnerability."

Successful exploits are always successful because they stay hidden. Are you sure that nobody's in your systems right now, leaching data, stealing credit card numbers, or stealing products? Of course not. For a vivid case in point, google "Kerviel Societe Generale".

While you cannot prove a negative, you can improve your odds of detecting nefarious activity by making sure that everything interesting is logged. (And by "interesting", I mean "potentially valuable".) 

There are some pretty spiffy event correlation tools out there these days. They can monitor logs across hundreds of servers and network devices, extracting patterns of anomalous behavior. But, they only work if your application exposes data that could indicate a breach.

For example, you might not be able to log every login attempt, but you probably should log every admin login attempt.

Or, you might consider logging every price change. (I shudder to think about collusion between a merchant with pricing control and an outside buyer.  Imagine a 10-minute long sale on laptops: 90% off for 10 minutes only.)

If your internal web service listens on a port, then it should only accept connections from known sources. Whether you enforce that through IPTables, a hardware firewall, or inside the application itself, make sure you're logging refused connections.

Then, of course, once you're logging the data, make sure someone's monitoring it and keeping pattern and signature definitions up to date!

Two Books That Belong In Your Library

I seldom plug books---other than my own, that is. I've just read two important books, however, that really deserve your attention.

Concurrency, Everybody's Doing It

The first is Java Concurrency in Practice by Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. I've been doing Java development for close to thirteen years now, and I learned an enormous amount from this fantastic book. For example, I knew what the textbook definition of a volatile variable was, but I never knew why I would actually want to use one. Now I know when to use them and when they won't solve the problem.

Of course, JCP talks about the Java 5 concurrency library at great length. But this is no paraphrasing of the javadoc. (It was Doug Lea's original concurrency utility library that eventually got incorporated into Java, and we're all better off for it.) The authors start with illustrations of real issues in concurrent programming. Before they introduce the concurrency utilities, they explain a problem and illustrate potential solutions. (Usually involving at least one naive "solution" that has serious flaws.) Once they show us some avenues to explore, they introduce some neatly-packaged, well-tested utility class that either solves the problem or makes a solution possible. This removes the utility classes from the realm of "inscrutable magic" and presents them as "something difficult that you don't have to write."

The best part about JCP, though, is the combination of thoroughness and clarity with which it presents a very difficult subject. For example, I always understood about the need to avoid concurrent modification of mutable state. But, thanks to this book, I also see why you have to synchronize getters, not just setters. (Even though assignment to an integer is guaranteed to happen atomically, that isn't enough to guarantee that the change is visible to other threads. The only way to guarantee ordering is by crossing a synchronization barrier on the same lock.)
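That visibility point is worth seeing in code. In this small example (my own, in the spirit of the book's), the getter is synchronized even though it "only reads": entering the same monitor as the writer is what guarantees the reading thread sees the latest value.

```java
// Why getters need synchronization too: int assignment is atomic, but
// without crossing the same lock (or marking the field volatile), a
// reader thread may never observe a writer thread's update.
public class VisibleCounter {
    private int value;  // guarded by "this"

    public synchronized void set(int v) { value = v; }

    // Synchronized for visibility, not atomicity: acquiring the monitor
    // establishes the happens-before edge with the writer's release.
    public synchronized int get() { return value; }

    public static void main(String[] args) {
        VisibleCounter c = new VisibleCounter();
        c.set(42);
        System.out.println(c.get()); // prints 42
    }
}
```

An unsynchronized `get()` would usually appear to work in testing, which is exactly what makes this class of bug so nasty in production.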

Blocked Threads are one of my stability antipatterns. I've seen hundreds of web site crashes. Every single one of them eventually boils down to blocked threads somewhere. Java Concurrency in Practice has the theory, practice, and tools that you can apply to avoid deadlocks, live locks, corrupted state, and a host of other problems that lurk in the most innocuous-looking code.

Capacity Planning is Science, Not Art

The second book that I want to recommend today is Capacity Planning for Web Services. I've had this book for a while. When I first started reading it, I put it down right away thinking, "This is way too basic to solve any real problems." That was a big error.

Capacity Planning may get off to a slow start, but that's only because the authors are both thorough and deliberate. Later in the book, that deliberate pace is very helpful, because it lets us follow the math.

This is the only book on capacity planning I've seen that actually deals with transmission time for HTTP requests and responses. In fact, some of the examples even compute the number of packets that a request or reply will need.

I have objected to some capacity planning books because they assume that every process can be represented by an average. Not this one. In the section on standalone web servers, for example, the authors break files into several classes, then use a weighted distribution of file sizes to compute the expected response time and bandwidth requirements. This is a very real-world approach, since web requests tend toward a bimodal distribution: small HTML, Javascript, and CSS intermixed with large media files and images. (In fact, I plan on using the models in this book to quantify the effect of segregating media files from dynamic pages.)
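The arithmetic behind a weighted distribution is simple but worth making concrete. In this sketch (my numbers, not the book's), expected response time is the probability-weighted sum over file classes rather than a single overall average:

```java
// Worked example of a weighted file mix: expected service time is the
// weighted sum over file classes. The class mix and per-class times
// below are invented for illustration.
public class WeightedMix {
    // Expected service time given class probabilities and per-class
    // service times (both arrays in the same order, times in seconds).
    static double expectedTime(double[] prob, double[] time) {
        double e = 0.0;
        for (int i = 0; i < prob.length; i++) e += prob[i] * time[i];
        return e;
    }

    public static void main(String[] args) {
        // 85% small text assets at 5 ms, 15% large media files at 80 ms.
        double[] prob = {0.85, 0.15};
        double[] time = {0.005, 0.080};
        System.out.println(expectedTime(prob, time)); // 0.01625 s
    }
}
```

Notice that the rare large files dominate the result: the 15% media class contributes nearly three times as much to the expected time as the 85% text class, which is why averaging everything together misleads.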

This is also the only book I've seen that recognizes that capacity limits can propagate both downward and upward through tiers. There's a great example of how doubling the CPU performance in an app tier ends up increasing the demand on the database server, which almost totally nullifies the effect of the CPU upgrade. It also recognizes that all requests are not created equal, and recommends clustering request types by their CPU and I/O demands, instead of averaging them all together.

Nearly every result or abstract law has an example, written in concrete terms, which helps bridge theory and practice.

Both of these books deal with material that easily leads off into clouds of theory and abstraction. (JCP actually quips, "What's a memory model, and why would I want one?") These excellent works avoid the Ivory Tower trap and present highly pragmatic, immediately useful wisdom.

Should Email Errors Keep Customers From Buying?

Somewhere inside every commerce site, there's a bit of code sending emails out to customers.  Email campaigning might have been in the requirements and that email code stands tall at the brightly-lit service counter.  On the other hand, it might have been added as an afterthought, languishing in some dark corner with the "lost and found" department.  Either way, there's a good chance it's putting your site at risk.

The simplest way to code an email sending routine looks something like this:

  1. Get a javax.mail.Session instance
  2. Get a javax.mail.Transport instance from the Session
  3. Construct a javax.mail.internet.MimeMessage instance
  4. Set some fields on the message: from, subject, body.  (Setting the body may involve reading a template from a file and interpolating values.)
  5. Set the recipients' Addresses on the message
  6. Ask the Transport to send the message
  7. Close the Transport
  8. Discard the Session

This goes into a servlet, a controller, or a stateless session bean, depending on which MVC framework or JEE architecture blueprint you're using.

There are two big problems here. (Actually, there are three, but I'm not going to deal with the "one connection per message" issue.)

Request-Handling Threads at Risk

As written, all the work of sending the email happens on the request-handling thread that's also responsible for generating the response page. Even on a sunny day, that means you're spending some precious request-response cycles on work that doesn't help build the page.

You should always look at a call out to an external server with suspicion. Many of them can execute asynchronously to page generation. Anything that you can offload to a background thread, you should offload so the request-handler can get back in the pool sooner. The user's experience will be better, and your site's capacity will be better, if you do.

Also, keep in mind that SMTP servers aren't always 100% reliable. Neither are the DNS servers that point you to them. That goes double if you're connecting to some external service. (And please, please don't even tell me you're looking up the recipient's MX record and contacting the receiving MTA directly!)

If the MTA is slow to accept your connection, or to process the email, then the request-handling thread could be blocked for a long time: seconds or even minutes. Will the user wait around for the response? Not likely. He'll probably just hit "reload" and double-post the form that triggered the email in the first place.

Poor Error Recovery

The second problem is the complete lack of error recovery.  Yes, you can log an exception when your connection to the MTA fails. But that only lets the administrator know that some amount of mail failed. It doesn't say what the mail was! There's no way to contact the users who didn't get their messages. Depending on what the messages said, that could be a very big deal.

At a minimum, you'd like to be able to detect and recover from interruptions at the MTA---scheduled maintenance, Windows patching, unscheduled index rebuilds, and the like. Even if "recovery" means someone takes the users' info from the log file and types in a new message on their desktops, that's better than nothing.

A Better Way

The good news is that there's a handy way to address both of these problems at once. Better still, it works whether you're dealing with internal SMTP based servers or external XML-over-HTTP bulk mailers.

Whenever a controller decides it's time to reach out and touch a user through email, it should drop a message on a JMS queue. This lets the request-handling thread continue with page generation immediately, while leaving the email for asynchronous processing.

You can either go down the road of message-driven beans (MDB) or you can just set up a pool of background threads to consume messages from the queue. On receipt of a message, the subscriber just executes the same email generation and transmission as before, with one exception. If the message fails due to a system error, such as a broken socket connection, the message can just go right back onto the message queue for later retry. (You'll probably want to update the "next retry time" to avoid livelock.)
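The consume-and-retry loop can be sketched without a full JMS provider. Here an in-memory `BlockingQueue` stands in for the message queue, and `EmailJob` and the `Transport` interface are invented for illustration; the shape of the logic is the point, not the names.

```java
import java.util.concurrent.*;

// Sketch of the queue-and-retry pattern: the request thread only
// enqueues; a background consumer sends, and requeues on failure
// instead of losing the message.
public class EmailWorker {
    record EmailJob(String to, String body, int attempts) {}

    interface Transport { void send(EmailJob job) throws Exception; }

    private final BlockingQueue<EmailJob> queue = new LinkedBlockingQueue<>();
    private final Transport transport;
    private final int maxAttempts;

    EmailWorker(Transport transport, int maxAttempts) {
        this.transport = transport;
        this.maxAttempts = maxAttempts;
    }

    // Cheap enqueue: this is all the request-handling thread does.
    void submit(String to, String body) {
        queue.add(new EmailJob(to, body, 0));
    }

    // Background consumer step. Returns true if a message was sent.
    boolean drainOne() {
        EmailJob job = queue.poll();
        if (job == null) return false;
        try {
            transport.send(job);
            return true;
        } catch (Exception e) {
            if (job.attempts() + 1 < maxAttempts) {
                // A real broker would also set a "next retry time"
                // here to avoid livelock against a dead MTA.
                queue.add(new EmailJob(job.to(), job.body(), job.attempts() + 1));
            }
            return false;
        }
    }

    public static void main(String[] args) {
        EmailWorker worker =
            new EmailWorker(job -> System.out.println("sent to " + job.to()), 3);
        worker.submit("user@example.com", "Your order has shipped");
        System.out.println(worker.drainOne()); // prints the send, then true
    }
}
```

With a real MDB, the container gives you the consumer loop and the redelivery behavior for free; this sketch just shows why the pattern survives an MTA outage where the inline approach drops mail on the floor.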

Better Still

If you have a cluster of application servers that can all generate outbound email, why not take the next step? Move the MDBs out into their own app server and have the message queues from all the app servers terminate there? (If you're using pub-sub instead of point-to-point, this will be pretty much transparent.) This application will resemble a message broker... for good reason. It's essentially just pulling messages in from one protocol, transforming them, then sending them out over another protocol.

The best part? You don't even have to write the message broker yourself. There are plenty of open-source and commercial alternatives.


Sending email directly from the request-handling thread performs poorly, creates unpredictable page latency for users and risks dropping their emails right on the floor. It's better to drop a message in a queue for asynchronous transformation by a message broker: it's faster, more reliable, and there's less code for you to write.

Two Sites, One Antipattern

This week, I had Groundhog Day in December.  I was visiting two different clients, but they each told the same tale of woe.

At my first stop, the director of IT told me about a problem they had recently found and eliminated.

They're a retailer. Like many retailers, they try to increase sales through "upselling" and "cross-selling". So, when you go to check out, they show you some other products that you might want to buy.  It's good to show customers relevant products that are also advantageous to sell.
For example, if a customer buys a big HDTV, offer them cables (80% margin) instead of DVDs (3% margin).

All but one of the slots on that page are filled through deliberate merchandising. People decide what to display there, the same way they decide what to put in the endcaps or next to the register in a physical store. The final slot, though, gets populated automatically according to the products in the customer's cart. Based on the original requirements for the site, the code to populate that slot looked for products in the catalog with similar attributes, then sorted through them to find the "best" product.  (Based on some balance of closely-matched attributes and high margin, I suspect.)

The problem was that there were too many products that would match.  The attributes clustered too much for the algorithm, so the code for this slot would pull back thousands of products from the catalog.  It would turn each row in the result set into an object, then weed through them in memory.

Without that slot, the page would render in under a second.  With it, two minutes, or worse.

It had been present for more than two years. You might ask, "How could that go unnoticed for two years?" Well, it didn't, of course. But, because it had always been that way, most everyone was just used to it. When the wait times would get too bad, this one guy would just restart app servers until it got better.

Removing that slot from the page not only improved their stability, it vastly increased their capacity. Imagine how much more they could have added to the bottom line if they hadn't overspent for the last two years to compensate. 

At my second stop, the site suffered from serious stability problems. At any given time, it was even odds that at least one app server would be vapor locked. Three to five times a day, that would ripple through and take down all the app servers. One key symptom was a sudden spike in database connections.

Some nice work by the DBAs revealed a query from the app servers that was taking way too long. No query from a web app should ever take more than half a second, but this one would run for 90 seconds or more. Usually that means the query logic is bad.  In this case, though, the logic was OK, but the query returned 1.2 million rows. The app server would doggedly convert those rows into objects in a Vector, right up until it started thrashing the garbage collector. Eventually, it would run out of memory, but in the meantime, it held a lot of row locks.  All the other app servers would block on those row locks.  The team applied a band-aid to the query logic, and those crashes stopped.

What's the common factor here? It's what I call an "Unbounded Result Set".  Neither of these applications limited the amount of data they requested, even though there certainly were limits to how much they could process.  In essence, both of these applications trusted their databases.  The apps weren't prepared for the data to be funky, weird, or oversized. They assumed too much.

You should make your apps paranoid about their data. If your app processes one record at a time, then looping through an entire result set might be OK---as long as you're not making a user wait while you do. But if your app turns rows into objects, then it had better be very selective about its SELECTs. The relationships might not be what you expect. The data producer might have changed in a surprising way, particularly if it's not under your control. Purging routines might not be in place, or might have gotten broken. Definitely don't trust some other application or batch job to load your data in a safe way.
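That paranoia can be enforced with a hard cap at the point where rows become objects. In this sketch (my own; a plain `Iterator` stands in for a JDBC `ResultSet`, and the limit and exception type are illustrative), the app fails loudly instead of materializing a million-row Vector:

```java
import java.util.*;

// Defensive fetch: never turn an unbounded result set into objects.
public class BoundedFetch {
    static class TooManyRowsException extends RuntimeException {
        TooManyRowsException(int limit) {
            super("query exceeded row limit of " + limit);
        }
    }

    // Copy at most `limit` rows; fail loudly if the producer sent more,
    // instead of thrashing the garbage collector.
    static <T> List<T> fetchAtMost(Iterator<T> rows, int limit) {
        List<T> out = new ArrayList<>();
        while (rows.hasNext()) {
            if (out.size() >= limit) throw new TooManyRowsException(limit);
            out.add(rows.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ok = fetchAtMost(List.of(1, 2, 3).iterator(), 10);
        System.out.println(ok.size()); // prints 3
    }
}
```

Throwing is a design choice: a loud failure on funky data gets noticed and fixed, where silently truncating results can hide a broken purge routine for years, as in the retailer's case above.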

No matter what odd condition your app stumbles across in the database, it should not be vulnerable.

Read-write splitting with Oracle

Speaking of databases and read/write splitting, Oracle had a session at OpenWorld about it.

Building a read pool of database replicas isn't something I usually think of doing with Oracle, mainly due to their non-zero license fees.  It changes the scaling equation.

Still, if you are on Oracle and the fees work for you, consider Active Data Guard.   Some key facts from the slides:

  • Average latency for replication was 1 second
  • The maximum latency spike they observed was 10 seconds.
  • A node can take itself offline if it detects excessive latency.
  • You can use DBLinks to allow applications to think they're writing to a read node.  The node will transparently pass the writes through to the master.
  • This can be done without any tricky JDBC proxies or load-balancing drivers, just the normal Oracle JDBC driver with the bugs we all know and love.
  • Active Data Guard requires Oracle 11g.

A Dozen Levels of Done

What does "done" mean to you?  I find that my definition of "done" continues to expand. When I was still pretty green, I would say "It's done" when I had finished coding.  (Later, a wiser and more cynical colleague taught me that "done" meant that you had not only finished the work, but made sure to tell your manager you had finished the work.)

The next meaning of "done" that I learned had to do with version control. It's not done until it's checked in.

Several years ago, I got test infected and my definition of "done" expanded to include unit testing.

Now that I've lived in operations for a few years and gotten to know and love Lean Software Development, I have a new definition of "done".

Here goes:

A feature is not "done" until all of the following can be said about it:

  1. All unit tests are green.
  2. The code is as simple as it can be.
  3. It communicates clearly.
  4. It compiles in the automated build from a clean checkout.
  5. It has passed unit, functional, integration, stress, longevity, load, and resilience testing.
  6. The customer has accepted the feature.
  7. It is included in a release that has been branched in version control.
  8. The feature's impact on capacity is well-understood.
  9. Deployment instructions for the release are defined and do not include a "point of no return".
  10. Rollback instructions for the release are defined and tested.
  11. It has been deployed and verified.
  12. It is generating revenue.

Until all of these are true, the feature is just unfinished inventory.

Two Ways To Boost Your Flagging Web Site

Being fast doesn't make you scalable. But it does mean you can handle more capacity with your current infrastructure. Take a look at this diagram of request handlers.

13 Threads Needed When Requests Take 700ms

You can see that it takes 13 request handling threads to process this amount of load. In the next diagram, the requests arrive at the same rate, but in this picture it takes just 200 milliseconds to answer each one.

3 Threads Needed When Requests Take 200ms

Same load, but only 3 request handlers are needed at a time. So, shortening the processing time means the same infrastructure can handle more transactions in the same unit of time.
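The relationship in these two diagrams is just Little's Law: average concurrency equals arrival rate times service time. Here's a quick sketch (the 18 requests/second arrival rate is an assumption for illustration, not a number from the diagrams):

```python
import math

def threads_needed(arrival_rate_per_sec: float, service_time_sec: float) -> int:
    """Average concurrency = arrival rate x service time (Little's Law),
    rounded up because threads come in whole units."""
    return math.ceil(arrival_rate_per_sec * service_time_sec)

# At an assumed steady 18 requests/second:
print(threads_needed(18, 0.700))  # 700 ms each -> 13 threads
print(threads_needed(18, 0.200))  # 200 ms each -> 4 threads
```

The exact counts depend on how the arrivals are spaced, but the lesson holds: cutting service time by 3.5x cuts the thread count by roughly the same factor.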

Suppose your site is built on the classic "six-pack" architecture shown below. As your traffic grows and the site slows, you're probably looking at adding more oomph to the database servers. Scaling that database cluster up gets expensive very quickly. Worse, you have to bulk up both guns at once, because each one still has to be able to handle the entire load. So you're paying for big boxes that are guaranteed to be 50% idle.

Classic Six Pack

Let's look at two techniques almost any site can use to speed up requests, without having the Hulk Hogan and Andre the Giant of databases lounging around in your data center.

Cache Farms

Cache farming doesn't mean armies of Chinese gamers stomping rats and making vests. It doesn't involve registering a ton of domain names, either.

Pretty much every web app is already caching a bunch of things at a bunch of layers. Odds are, your application is already caching database results, maybe as objects or maybe just query results. At the top level, you might be caching page fragments. HTTP session objects are nothing but caches. The net result of all this caching is a lot of redundancy. Every app server instance has a bunch of memory devoted to caching. If you're running multiple instances on the same hosts, you could be caching the same object once per instance.

Caching is supposed to speed things up, right? Well, what happens when those app server instances get short on memory? Those caches can tie up a lot of heap space. If they do, then instead of speeding things up, the caches will actually slow responses down as the garbage collector works harder and harder to free up space.

So what do we have? If there are four app instances on each of two hosts, then a frequently accessed object---like a product featured on the home page---will be duplicated eight times. Can we do better? Well, since I'm writing this article, you might suspect the answer is "yes". You'd be right.

The caches I've described so far are in-memory, internal caches. That is, they exist completely in RAM, and each process uses its own RAM for caching. There are products, commercial and open-source, that let you externalize that cache. By moving the cache out of the app server process, you can access the same cache from multiple instances, reducing duplication. Getting those objects out of the heap also lets you make the app server heap smaller, which will reduce garbage collection pauses. If you make the cache distributed, as well as external, then you can reduce duplication even further.

External caching can also be tweaked and tuned to help deal with "hot" objects. If you look at the distribution of accesses by ID, odds are you'll observe a power law. That means the popular items will be requested hundreds or thousands of times as often as the average item. In a large infrastructure, making sure that the hot items are on cache servers topologically near the application servers can make a huge difference in time lost to latency and in load on the network.

External caches are subject to the same kind of invalidation strategies as internal caches. On the other hand, when you invalidate an item from each app server's internal cache, they're probably all going to hit the database at about the same time. With an external cache, only the first app server hits the database. The rest will find that it's already been re-added to the cache.

External cache servers can run on the same hosts as the app servers, but they are often clustered together on hosts of their own. Hence, the cache farm.

Six Pack With Cache Farm

If the external cache doesn't have the item, the app server hits the database as usual. So I'll turn my attention to the database tier.

Read Pools

The toughest thing for any database to deal with is a mixture of read and write operations. The write operations have to create locks and, if transactional, locks across multiple tables or blocks. If the same tables are being read, those reads will have highly variable performance, depending on whether a read operation randomly encounters one of the locked rows (or pages, blocks, or tables, depending).

But the truth is that your application almost certainly does more reads than writes, probably to an overwhelming degree. (Yes, there are some domains where writes exceed reads, but I'm going to momentarily disregard mindless data collection.) For a travel site, the ratio will be about 10:1. For a commerce site, it will be from 50:1 to 200:1. There are a lot of variables here, especially when you start doing more effective caching, but even then, the ratios are highly skewed.

When your database starts to get that middle-age paunch and it just isn't as zippy as it used to be, think about offloading those reads. At a minimum, you'll be able to scale out instead of up. Scaling out with smaller, consistent, commodity hardware pleases everyone more than forklift upgrades. In fact, you'll probably get more performance out of your writes once all that pesky read I/O is off the write master.

How do you create a read pool? Good news! It uses nothing more than the built-in replication features of the database itself. Basically, you just configure the write master to ship its archive logs (or whatever your DB calls them) to the read pool databases. They replay the logs to bring their state into sync with the write master.

Six Pack With Cache Farm and Read Pool

By the way, for read pooling, you really want to avoid database clustering approaches. The synchronization overhead cancels out the benefits of read pooling in the first place.
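On the application side, steering traffic to the pool can be as simple as routing by SQL verb: writes go to the single write master, reads round-robin across the replicas. A sketch, with hypothetical connection names standing in for real connection pools:

```python
import itertools

class ReadPoolRouter:
    """Sketch of application-side routing. Connection names here are
    placeholders for real database connection pools."""
    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql: str):
        # Reads can tolerate replication lag; everything else must not.
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in ("SELECT", "SHOW", "EXPLAIN"):
            return next(self._replicas)
        return self.master

router = ReadPoolRouter("master", ["replica-1", "replica-2", "replica-3"])
print(router.connection_for("UPDATE prices SET amount = 9.99"))  # -> master
print(router.connection_for("SELECT * FROM prices"))             # -> replica-1
print(router.connection_for("SELECT * FROM items"))              # -> replica-2
```

Real deployments often push this below the application, into a driver or proxy, but the routing decision is the same.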

At this point, you might be objecting, "Wait a cotton-picking minute! That means the read machines are garun-damn-teed to be out of date!" (That's the Foghorn Leghorn version of the objection. I'll let you extrapolate the Tony Soprano and Geico Gecko versions yourself.) You would be correct. The read machines will always reflect an earlier point in time.

Does that matter?

To a certain extent, I can't answer that. It might matter, depending on your domain and application. But in general, I think it matters less often than it seems. I'll give you an example from the retail domain that I know and love so well. Take a look at this product detail page from BestBuy.com. How often do you think each data field on that page changes? Suppose there is a pricing error that needs to be corrected immediately (for some definition of immediately.) What's the total latency before that pricing error will be corrected? Let's look at the end-to-end process.

  1. A human detects the pricing error.
  2. The observer notifies the responsible merchant.
  3. The merchant verifies that the price is in error and determines the correct price.
  4. Because this is an emergency, the merchant logs in to the "fast path" system that bypasses the nightly batch cycle.
  5. The merchant locates the item and enters the correct price.
  6. She hits the "publish" button.
  7. The fast path system connects to the write master in production and updates the price.
  8. The read pool receives the logs with the update and applies them.
  9. The read pool process sends a message to invalidate the item in the app servers' caches.
  10. The next time users request that product detail page, they see the correct price.

That's the best-case scenario! In the real world, the merchant will be in a meeting when the pricing error is found. It may take a phone call or lookup from another database to find out the correct price. There might be a quick conference call to make the decision whether to update the price or just yank the item off the site. All in all, it might take an hour or two before the pricing error gets corrected. Whatever the exact sequence of events, odds are that the replication latency from the write master to the read pool is the very least of the delays.

Most of the data is much less volatile or critical than the price. Is an extra five minutes of latency really a big deal? When it can save you a couple of hundred thousand dollars on giant database hardware?

Summing It Up

The reflexive answer to scaling is, "Scale out at the web and app tiers, scale up in the data tier." I hope this shows that there are other avenues to improving performance and capacity.


For more on read pooling, see Cal Henderson's excellent book, Building Scalable Web Sites: Building, scaling, and optimizing the next generation of web applications.

The most popular open-source external caching framework I've seen is memcached. It's a flexible, multi-lingual caching daemon.

On the commercial side, GigaSpaces provides distributed, external, clustered caching. It adapts to the "hot item" problem dynamically to keep a good distribution of traffic, and it can be configured to move cached items closer to the servers that use them, reducing network hops to the cache.

Two Quick Observations

Several of the speakers here have echoed two themes about databases.

1. MySQL is in production in a lot of places. I think the high cost of commercial databases (read: Oracle) leads to a kind of budgetechture that concentrates all data in a single massive database. If you remove that cost from the equation, the idea of either functionally partitioning your data stores or creating multiple shards becomes much more palatable.

2. By far the most common database cluster structure has one write master with many read masters. Ian Flint spoke to us about the architectures behind Yahoo Groups and Yahoo Bix. Bix has 30 MySQL read servers and just one write master. Dan Pritchett from eBay had a similar ratio. (His might have been 10:1 rather than 30:1.) In a commerce site, where 98% of the traffic is browsing and only 2% is buying, a read-pooled cluster makes a lot of sense.


Three Vendors Worth Evaluating

Several vendors are sponsoring QCon. (One can only wonder what the registration fees would be if they didn't.) Of these, I think three have products worth immediate evaluation.


In the category of "really cool, but would I pay for it?" is Semmle. Their flagship product, SemmleCode, lets you treat your codebase as a database against which you can run queries. SemmleCode groks the static structure of your code, including relationships and dependencies. Along the way, it calculates pretty much every OO metric yet invented. It also looks at the source repository.

What can you do with it? Well, you can create a query that shows you all the cyclic dependencies in your code. The results can be rendered as a tree with explanations, a graph, or a chart. Or, you can chart your distribution of cyclomatic complexity scores over time. You can look for the classes or packages most likely to create a ripple effect.

Semmle ships with a sample project: the open-source drawing framework JHotDraw. In a stunning coincidence, I'm a contributor to JHotDraw. I wrote the glue code that uses Batik to export a drawing as SVG. So I can say with confidence that when Semmle showed all kinds of cyclic dependencies in the exporters, it was absolutely correct. Every one of the queries I saw run against JHotDraw confirmed my own experience with that codebase. Where Semmle indicated difficulty, I had difficulty. Where Semmle showed JHotDraw had good structure, it was easy to modify and extend.

There are an enormous number of things you could do with this, but one thing they currently lack is build-time automation. Semmle integrates with Eclipse, but not Ant or Maven. I'm told that's coming in a future release.


Virtualization is a hot topic. VMware has the market lead in this space, but I'm very impressed with 3Tera's AppLogic.

AppLogic takes virtualization up a level.  It lets you visually construct an entire infrastructure, from load balancers to databases, app servers, proxies, mail exchangers, and everything. These are components they keep in a library, just like transistors and chips in a circuit design program.

Once you've defined your infrastructure, a single button click will deploy the whole thing into the grid OS. And there's the rub. AppLogic doesn't work with just any old software and it won't work on top of an existing "traditional" infrastructure.

As a comparison, HP's SmartFrog just runs an agent on a bunch of Windows, Linux, or HP-UX servers. A management server sends instructions to the agents about how to deploy and configure the necessary software. So SmartFrog could be layered on top of an existing traditional infrastructure.

Not so with AppLogic. You build a grid specifically to support this deployment style. That makes it possible to completely virtualize load balancers and firewalls along with servers. Of course, it also means complete and total lock-in to 3Tera.

Still, for someone like a managed hosting provider, 3Tera offers the fastest, most complete definition and provisioning system I've seen.


What can I say about GigaSpaces? Anyone who's heard me speak knows that I adore tuple-spaces. GigaSpaces is a tuple-space in the same way that Tibco is a pub-sub messaging system. That is to say, the foundation is a tuple-space, but they've added high-level capabilities based on their core transport mechanism.

So, they now have a distributed caching system.  (They call it an "in-memory data grid". Um, OK.) There's a database gateway, so your front end can put a tuple into memory (fast) while a back-end process takes the tuple and writes it into the database.

Just this week, they announced that their entire stack is free for startups. (Interesting twist: most companies offer the free stuff to open-source projects.) They'll only start charging you money when you get over $5M in revenue. 

I love the technology. I love the architecture.

Architecting for Latency

Dan Pritchett, Technical Fellow at eBay, spoke about "Architecting for Latency". His aim was not to talk about minimizing latency, as you might expect, but rather about architecting as though you believe latency is unavoidable and real.

We all know the effect latency can have on performance. That's the first-level effect. If you consider synchronous systems---such as SOAs or replicated DR systems---then latency has a dramatic effect on scalability as well. Whenever a synchronous call reaches across a long wire, the latency gets added directly to the processing time.

For example, if client A calls service B, then A's processing time will be at least B's processing time plus the latency between A and B. (Yes, it seems obvious when you state it like that, but many architectures still act as though latency is zero.)

Furthermore, latency over IP networks is fundamentally variable. That means A's performance is unpredictable, and can never be made predictable.

Latency also introduces semantic problems. A replicated database will always have some discrepancy with the master database. A functionally or horizontally partitioned system will either allow discrepancies or must serialize traffic and give up scaling. You can imagine that eBay is much more interested in scaling than serializing traffic.

For example, when a new item is posted to eBay, it does not immediately show up in the search results. The ItemNode service posts a message that eventually causes the item to show up in search results. Admittedly, this is kept to a very short period of time, but still, the item will reach different data centers at different times. So, the search service inside the nearest data center will get the item before the search service inside the farthest. I suspect many eBay users would be shocked, and probably outraged, to hear that shoppers see different search results depending on where they are.

Now, the search service is designed to become consistent within a limited amount of time---for that item. With a constant flow of items being posted from all over the country, you can imagine that there is a continuous variance among the search services. Like the quantum foam, however, this is near-impossible to observe. One user cannot see it, because a user gets pinned to a single data center. It would take multiple users, searching in the right category, with synchronized clocks, taking snapshots of the results, to even observe that the discrepancies happen. And even then, they would only have a chance of seeing it, not a certainty.

Another example. Dan talked about payment transfer from one user to another. In the traditional model, that would look something like this.

Synchronous write to both databases

You can think of the two databases as being either shards that contain different users or tables that record different parts of the transaction.

This is a design that pretends latency doesn't exist. In other words, it subscribes to Fallacy #2 of the Fallacies of Distributed Computing. Performance and scalability will suffer here.

(Frankly, it has an availability problem, too, because the availability of the payment service Pr(Payment) will now be Pr(Database 1) * Pr(Network 1) * Pr(Database 2) * Pr(Network 2). In other words, the availability of the payment service is coupled to the availability of Database 1, Database 2, and the two networks connecting Payment to the two databases.)
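The arithmetic is worth seeing: chaining components multiplies their availabilities, so the synchronous design is less available than any single piece of it. The "three nines" figures below are hypothetical:

```python
def coupled_availability(*components: float) -> float:
    """Availability of a synchronous chain is the product of its parts:
    Pr(Payment) = Pr(DB1) * Pr(Net1) * Pr(DB2) * Pr(Net2)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

# Two databases and two networks, each at a hypothetical 99.9%:
a = coupled_availability(0.999, 0.999, 0.999, 0.999)
print(round(a, 5))  # -> 0.99601: four "three nines" parts lose a nine together
```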

Instead, Dan recommends a design more like this:

Payment service with back end reconciliation

In this case, the payment service can set an expectation with the user that the money will be credited within some number of minutes. The availability and performance of the payment service is now independent from that of Database 2 and the reconciliation process. Reconciliation happens in the back end. It can be message-driven, batch-driven, or whatever. The main point is that it is decoupled in time and space from the payment service. Now the databases can exist in the same data center or on separate sides of the globe. Either way, the performance, availability, and scalability characteristics of the payment service don't change. That's architecting for latency.
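Here is a toy sketch of that decoupling, with an in-memory queue standing in for the real messaging or batch infrastructure (none of these names are eBay's; the design is the point, not the code):

```python
from collections import deque

outbox = deque()                  # stands in for a message queue
db1 = {"alice": 100, "bob": 0}    # local database: on the critical path
db2 = {"alice": 100, "bob": 0}    # remote database: updated asynchronously

def pay(sender, receiver, amount):
    """Returns immediately; only db1 and the queue are on the critical path."""
    db1[sender] -= amount
    db1[receiver] += amount
    outbox.append((sender, receiver, amount))
    return "accepted: funds will appear within a few minutes"

def reconcile():
    """Back-end process: message-driven or batch, it doesn't matter.
    Decoupled in time and space from the payment service."""
    while outbox:
        sender, receiver, amount = outbox.popleft()
        db2[sender] -= amount
        db2[receiver] += amount

print(pay("alice", "bob", 25))
print(db2["bob"])   # -> 0: db2 lags behind...
reconcile()
print(db2["bob"])   # -> 25: ...and becomes consistent eventually
```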

Instead of ACID semantics, we think about BASE semantics. (A cute backronym for Basically Available, Soft-state, Eventually-consistent.)

Now, many analysts, developers, and business users will object to the loss of global consistency. We heard a spirited debate last night between Dan and Nati Shalom, founder and CTO of GigaSpaces, about that very subject.

I have two arguments of my own to support architecting for latency.

First, any page you show to a user represents a point-in-time snapshot. It can always be inaccurate even by the time you finish generating the page. Think about a commerce site saying "Ships in 2 - 3 days". That's true at the instant when the data is fetched. Ten milliseconds later, it might not be true. By the time you finish generating the page and the user's browser finishes rendering it (and fetching the 29 JavaScript files needed for the whizzy AJAX interface) the data is already a few seconds old. So global consistency is kind of useless in that case, isn't it? Besides, I can guarantee there's already a large amount of latency in the data path from inventory tracking to the availability field in the commerce database anyway.

Second, the cost of global consistency is global serialization. If you assume a certain volume of traffic you must support, the cost of a globally consistent solution will be a multiple of the cost of a latency-tolerant solution. That's because global consistency can only be achieved by making a single master system of record. When you try to reach large scales, that master system of record is going to be hideously expensive.

Latency is simply an immutable fact of the universe. If we architect for it, we can use it to our advantage. If we ignore it, we will instead be its victims.

For more of Dan's thinking about latency, see his article on InfoQ.com.

SOA Without the Edifice

Sometimes the best interactions at a conference aren't the talks; they're the shouting. A crowded bar with an over-amped DJ may seem like an unlikely place for a discussion on SOA. Even so, when it's Jim Webber, ThoughtWorks' SOA practice lead, doing the shouting, it works. Given that Jim's topic is "Guerilla SOA", shouting is probably more appropriate than the hushed and reverential cathedral whispers that usually attend SOA discussions.

Jim's attitude is that SOA projects tend to attract two things: Taj Mahal architects and parasitic vendors. (My words, not Jim's.) The combined efforts of these two groups result in monumentally expensive edifices that don't deliver value. Worse still, these efforts consume work and attention that could go to building services aligned with the real business processes, not some idealized vision of what the processes ought to be.

Jim says that services should be aligned with business processes. When the business process changes, change the service. (To me, this automatically implies that the service cannot be owned by some enterprise governance council.) When you retire the business process, simply retire the service.

This sounds like such common sense that it's hard to imagine it could be controversial.

I'll be in the front row for Jim's talk later today.


Cameron Purdy: 10 Ways to Botch Enterprise Java Scalability and Reliability

Here at QCon, well-known Java developer Cameron Purdy gave a fun talk called "10 Ways to Botch Enterprise Java Scalability and Reliability".  (He also gave this talk at JavaOne.)

While I could quibble with Cameron's counting---there were actually more like 16 points thanks to some numerical overloading---I liked his content.  He echoes many of the antipatterns from Release It.   In particular, he talks about the problem I call "Unbounded Result Sets".  That is, whether using an ORM tool or straight database queries, you can always get back more than you expect. 

Sometimes, you get back way, way more than you expect. I once saw a small messaging table that normally held ten or twenty rows grow to over ten million rows.  The application servers never contemplated that there could be so many messages.  Each one would attempt to fetch the entire contents of the table and turn them into objects.  So, each app server would run out of memory and crash.  That rolled back the transaction, allowing the next app server to impale itself on the same table.

Unbounded Result Sets don't just happen from "SELECT * FROM FOO;", though.  Think about an ORM handling the parent-child relationship for you.  Simply calling something like customer.getOrders() will return every order for that customer.  By writing that call, you implicitly assume that the set of orders for a customer will always be small.  Maybe.  Maybe not.  How about blogUser.getPosts()?  Or tickerSymbol.getTrades()?

Unbounded Result Sets also happen with web services and SOAs.  A seemingly innocuous request for information could create an overwhelming deluge---an avalanche of XML that will bury your system.  At the least, reading the results can take a long time.  In the worst case, you will run out of memory and crash.
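One defensive idiom is to fetch `limit + 1` rows and treat overflow as a signal rather than materializing everything. A sketch using SQLite, with a made-up `messages` table (the bound and its handling are illustrative choices, not a library feature):

```python
import sqlite3

MAX_ROWS = 1000  # an explicit bound, chosen per use case

def fetch_messages(conn, limit=MAX_ROWS):
    """Fetch one row more than the bound; if we get it, something is wrong."""
    cur = conn.execute(
        "SELECT id, body FROM messages ORDER BY id LIMIT ?", (limit + 1,))
    rows = cur.fetchall()
    if len(rows) > limit:
        # More rows exist than we bargained for: log it, page through,
        # or fail fast -- but never turn them all into objects at once.
        raise RuntimeError(f"result set exceeds {limit} rows")
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO messages (body) VALUES (?)", [("msg",)] * 20)
print(len(fetch_messages(conn)))  # -> 20, comfortably under the bound
```

The same discipline applies to ORM relationships: prefer a query with an explicit limit over `customer.getOrders()` when the collection size is unbounded.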

The fundamental flaw with an Unbounded Result Set is that you are trusting someone else not to harm you, either a data producer or a remote web service. 

Take charge of your own safety! 

Be defensive!

Don't get hurt again in another dysfunctional relationship!

Kent Beck's Keynote: "Trends in Agile Development"

Kent Beck spoke with his characteristic mix of humor, intelligence, and empathy.  Throughout his career, Kent has brought a consistently humanistic view of development.  That is, software is written by humans---emotional, fallible, creative, and messy---for other humans.  Any attempt to treat development as robotic will end in tears.

During his keynote, Kent talked about engaging people through appreciative inquiry.  This is a learnable technique, based in human psychology, that helps focus on positive attributes.  It counters the negativity that so many developers and engineers are prone to.  (My take: we spend a lot of time, necessarily, focusing on how things can go wrong.  Whether by nature or by experience, that leads us to a pessimistic view of the world.)

Appreciative inquiry begins by asking, "What do we do well?"  Even if all you can say is that the garbage cans get emptied every night, that's at least something that works well.  Build from there.

He specifically recommended The Thin Book of Appreciative Inquiry, which I've already ordered.

I should also note that Kent has a new book out, called Implementation Patterns, which he described as being about, "Communicating with other people, through code."

From QCon San Francisco

I'm at QCon San Francisco this week.  (An aside: after being a speaker at No Fluff, Just Stuff, it's interesting to be the audience again.  As usual, on returning from travels in a different domain, one has a new perspective on familiar scenes.) This conference targets senior developers, architects, and project managers.  One of the very appealing things is the track on "Architectures you've always wondered about".  This covers high-volume architectures for sites such as LinkedIn and eBay as well as other networked applications like Second Life.  These applications live and work in thin air, where traffic levels far outstrip most sites in the world.  Performance and scalability are two of my personal themes, so I'm very interested in learning from these pioneers about what happens when you've blown past the limits of traditional 3-tier, app-server centered architecture.

Through the remainder of the week, I'll be blogging five ideas, insights, or experiences from each day of the conference.

You Keep Using That Word. I Do Not Think It Means What You Think It Means.

"Scalable" is a tricky word. We use it like there's one single definition. We speak as if it's binary: this architecture is scalable, that one isn't.

The first really tough thing about scalability is finding a useful definition. Here's the one I use:

Marginal revenue / transaction > Marginal cost / transaction

The cost per transaction has to account for all cost factors: bandwidth, server capacity, physical infrastructure, administration, operations, backups, and the cost of capital.

(And, by the way, it's even better when the ratio of revenue to cost per transaction grows as the volume increases.)
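As a sketch, the definition is just an inequality over all the per-transaction cost factors listed above. Every dollar figure here is invented for illustration:

```python
def marginal_cost(bandwidth, server, infrastructure, admin, ops, backups, capital):
    """All-in cost per transaction: every factor has to be counted."""
    return bandwidth + server + infrastructure + admin + ops + backups + capital

def is_scalable(revenue_per_txn, cost_per_txn):
    """'Scalable' by this definition: each extra transaction earns
    more than it costs."""
    return revenue_per_txn > cost_per_txn

# Hypothetical per-transaction costs, in dollars:
cost = marginal_cost(0.002, 0.010, 0.004, 0.006, 0.008, 0.001, 0.005)
print(round(cost, 3))            # -> 0.036
print(is_scalable(0.25, cost))   # -> True at these made-up numbers
print(is_scalable(0.01, cost))   # -> False: growth makes things worse
```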

The second really tough thing about scalability and architecture is that there isn't one that's right.  An architecture may work perfectly well for a range of transaction volumes, but fail badly as one variable gets large.

Don't treat "scalability" as either a binary issue or a moral failing. Ask instead, "how far will this architecture scale before the marginal cost deteriorates relative to the marginal revenue?" Then, follow that up with, "What part of the architecture will hit a scaling limit, and what can I incrementally replace to remove that limit?"

Engineering in the White Space

"Is software Engineering, or is it Art?"

Debate between the Artisans and the Engineers has simmered, and occasionally boiled, since the very introduction of the phrase "Software Engineering".  I won't restate all the points on both sides here, since I would surely forget someone's pet argument, and also because I see no need to be redundant.

Deep in my heart, I believe that building programs is art and architecture, but not engineering.

But, what if you're not just building programs?

Programs and Systems

A "program" has a few characteristics that I'll assign here:

  1. It accepts input.
  2. It produces output.
  3. It runs a sequence of instructions.
  4. Statically, it exhibits cohesion in its executable form. [*]
  5. Dynamically, it exhibits cohesion in its address space. [**]

* That is, the transitive closure of all code to be executed is finite, although it may not all be known in advance of execution.  This allows dynamic extension via plugins, but not, for example, dynamic execution of any scripts or code found on the Web.  So, a web browser is a program, but JavaScript executed on some page is an independent program, not part of the browser itself.

** For "address space", feel free to substitute "object space", "process space", or "virtual memory". Cohesion requires that all the code that can access the address space should be regarded as a single program.  (IPC through shared memory is a special case of an output, and should be considered more akin to a database or memory-mapped file than to part of the program's own address space.)

Suppose you have two separate scripts that each manipulate the same database.  I would regard those as two separate---though not independent---programs.  A single instance of Tomcat may contain several independent programs, but all the servlets in one EAR file are part of one program.

For the moment, I will not consider trivial objections, such as two distinct sets of functionality that happen to be packaged and delivered in a single EAR file.  It's less interesting to me whether code does access the entire address space than whether it could.  A library checkout program that includes functions for both librarians and patrons may not use common code for card number lookup, but it could.  (And, arguably, it should.)  That makes it one program, in my eyes.

A "System", on the other hand, consists of interdependent programs that have commonalities in their inputs and outputs.  They could be arranged in a chain, a web, or a loop.  No matter: if one program's input depends on another program's output, then they are part of a system.

Systems can be composed, whereas programs cannot.  

Tricky White Space

Some programs run all the time, responding to intermittent inputs; these we call "servers".  It is very common to see servers represented as a deceptively simple little rectangle on a diagram.  Between servers, we draw little arrows to indicate communication of some sort.

One little arrow might mean, "Synchronous request/reply using SOAP-XML over HTTP." That's quite a lot of information for one little glyph to carry.  There's not usually enough room to write all that, so we label the unfortunate arrow with either "XML over HTTP"---if viewing it from an internal perspective---or "SKU Lookup"---if we have an external perspective.

That little arrow, bravely bridging the white space between programs, looks like direct contact.  It is Voyager, carrying its recorded message to parts unknown.  It is Arecibo, blasting a hopeful greeting into the endless dark.

Well, not really...

These days, the white space isn't as empty as it once was.  A kind of luminiferous ether fills the void between servers on the diagram.

The Substrate

There is many a slip 'twixt cup and lip.  In between points A and B on our diagram, there exist some or all of the following:

  • Network interface cards
  • Network switches
  • Layer 2 - 3 firewalls
  • Layer 7 (application) firewalls
  • Intrusion Detection and Prevention Systems
  • Message queues
  • Message brokers
  • XML transformation engines
  • Flat file translations
  • FTP servers
  • Polling jobs
  • Database "landing zone" tables
  • ETL scripts
  • Metro-area SONET rings
  • MPLS gateways
  • Trunk lines
  • Oceans
  • Ocean liners
  • Philippine fishing trawlers (see "Underwater Cable Break")

Even in the simple cases, there will be four or five computers between program A and B, each running their own programs to handle things like packet switching, traffic analysis, routing, threat analysis, and so on.

I've seen a single arrow, running from one server to another, labelled "Fulfillment".  It so happened that one server was inside my client's company while the other server was in a fulfillment house's company.  That little arrow, so critical to customer satisfaction, really represented a Byzantine chain of events that resembled a game of "Mousetrap" more than a single interface.  It had messages going to message brokers that appended lines to files, which were later picked up by an hourly job that would FTP the files to the "gateway" server (still inside my client's company).  The gateway server read each line from the file and constructed an XML message, which it then sent via HTTP to the fulfillment house.

It Stays Up

We analogize bridge-building as the epitome of engineering. (Side note: I live in the Twin Cities area, so we're a little leery of bridge engineering right now.  Might better find another analogy, OK?)  Engineering a bridge starts by examining the static and dynamic load factors that the bridge must support: traffic density, weight, wind and water forces, ice, snow, and so on.

Bridging between two programs should consider static and dynamic loads, too.  Instead of just "SOAP-XML over HTTP", that one little arrow should also say, "Expect one query per HTTP request and send back one response per HTTP reply.  Expect up to 100 requests per second, and deliver responses in less than 250 milliseconds 99.999% of the time."
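A contract like that can even be written down as data and checked against observed traffic. Here's a minimal sketch; the `InterfaceContract` class is invented for illustration, with the numbers taken from the example above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InterfaceContract:
    """Static and dynamic load factors for one arrow on the diagram."""
    protocol: str
    max_requests_per_second: int
    latency_budget_ms: int
    latency_percentile: float  # fraction of calls that must meet the budget

    def within_budget(self, observed_latencies_ms):
        """Check a sample of observed latencies against the contract."""
        if not observed_latencies_ms:
            return True
        met = sum(1 for t in observed_latencies_ms if t <= self.latency_budget_ms)
        return met / len(observed_latencies_ms) >= self.latency_percentile

# The "SKU Lookup" arrow, stated as numbers instead of a label:
sku_lookup = InterfaceContract(
    protocol="SOAP-XML over HTTP",
    max_requests_per_second=100,
    latency_budget_ms=250,
    latency_percentile=0.99999,
)
```

Once the arrow's expectations are explicit like this, they can be asserted in load tests instead of living only in a diagram's label.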

It Falls Down

Building the right failure modes is vital. The last job of any structure is to fall down well. The same is true for programs, and for our hardy little arrow.

The interface needs to define what happens on each end when things come unglued. What if the caller sends more than 100 requests per second? Is it OK to refuse them? Should the receiver drop requests on the floor, refuse politely, or make the best effort possible?

What should the caller do when replies take more than 250 milliseconds? Should it retry the call? Should it wait until later, or assume the receiver has failed and move on without that function?

What happens when the caller sends a request with version 1.0 of the protocol and gets back a reply in version 1.1? What if it gets back some HTML instead of XML?  Or an MP3 file instead of XML?
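One way to make those decisions concrete is to put them in the calling code itself. The sketch below is hypothetical---`send` stands in for whatever transport the real system uses---but it shows a caller that enforces the latency budget, retries with backoff, validates the reply's media type and protocol version, and eventually gives up rather than hang forever:

```python
import time

def call_with_failure_policy(send, request, timeout_s=0.25, retries=2, backoff_s=0.1):
    """Call `send`, enforcing the decisions the interface must spell out:
    what to do on slow replies, errors, and replies we don't understand.
    `send` is a stand-in for the real transport (hypothetical)."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            reply = send(request, timeout_s)
            # Don't trust the reply just because it arrived: check that
            # it is the media type and protocol version we expect.
            if reply.get("content_type") != "application/xml":
                raise ValueError("expected XML, got %r" % reply.get("content_type"))
            if reply.get("version", "1.0") != "1.0":
                raise ValueError("unexpected protocol version %r" % reply.get("version"))
            return reply["body"]
        except Exception as e:
            last_error = e
            time.sleep(backoff_s * (attempt + 1))  # back off before retrying
    # Move on without the function rather than hang forever.
    raise RuntimeError("giving up after %d attempts: %s" % (retries + 1, last_error))
```

The point is not this particular policy; it's that *some* policy is written down and enforced, instead of being discovered at 5 A.M.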

When a bridge falls down, it is shocking, horrifying, and often fatal. Computers and networks, on the other hand, fall down all the time.  They always will.  Therefore, it's incumbent on us to ensure that individual computers and networks fail in predictable ways. We need to know what happens to that arrow when one end disappears for a while.

In the White Space

This, then, is the essence of engineering in the white space. Decide what kind of load that arrow must support.  Figure out what to do when the demand is more than it can bear.  Decide what happens when the substrate beneath it falls apart, or when the duplicitous rectangle on the other end goes bonkers.

Inside the boxes, we find art.

The arrows demand engineering.


The 5 A.M. Production Problem

I've got a new piece up at InfoQ.com, discussing the limits of unit and functional testing: 

"Functional testing falls short, however, when you want to build software to survive the real world. Functional testing can only tell you what happens when all parts of the system are behaving within specification. True, you can coerce a system or subsystem into returning an error response, but that error will still be within the protocol! If you're calling a method on a remote EJB that either returns "true" or "false" or it throws an exception, that's all it will do. No amount of functional testing will make that method return "purple". Nor will any functional test case force that method to hang forever, or return one byte per second.

One of my recurring themes in Release It is that every call to another system, without exception, will someday try to kill your application. It usually comes from behavior outside the specification. When that happens, you must be able to peel back the layers of abstraction, tear apart the constructed fictions of "concurrent users", "sessions", and even "connections", and get at what's really happening."
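One way to exercise behavior outside the specification is a test double that misbehaves on purpose. This sketch is illustrative (the `MisbehavingLookup` class is invented for the example): a stub that drips, returns "purple" where the real service would return true or false, or surfaces a hang as a timeout:

```python
import time

class MisbehavingLookup:
    """A test double for a remote call that breaks the protocol on purpose.
    The real service (hypothetical) only ever returns True or False; this
    stub does the things functional tests never exercise."""
    def __init__(self, mode):
        self.mode = mode

    def lookup(self, sku):
        if self.mode == "slow":
            time.sleep(0.05)          # stand-in for a one-byte-per-second drip
            return True
        if self.mode == "garbage":
            return "purple"           # a value outside the True/False protocol
        if self.mode == "hang":
            raise TimeoutError("simulated hang, surfaced as a timeout")
        return True
```

Wire a double like this behind the interface in integration tests, and assert that the caller survives each mode instead of hanging or crashing.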

Flash Mobs and TCP/IP Connections

In Release It, I talk about users and the harm they do to our systems.  One of the toughest types of user to deal with is the flash mob.  A flash mob often results from Attacks of Self-Denial, like when you suddenly offer a $3000 laptop for $300 by mistake.

When a flash mob starts to arrive, you will suddenly see a surge of TCP/IP connection requests at your load-distribution layer.  If the mob arrives slowly enough (less than 1,000 connections per second) then the app servers will be hurt the most.  For a really fast mob, like when your site hits the top spot on digg.com, you can get way more than 1,000 connections per second.  This puts the hurt on your web servers.

As the TCP/IP connection requests arrive, the server's TCP/IP stack answers each SYN with a SYN/ACK; when the client acknowledges that, the connection is established, and the OS queues it for servicing by the application.  When the application gets around to calling "accept" on the server socket, it takes the next established connection off that queue and hands it to a worker thread to process the request.  Meanwhile, the thread that accepted the connection goes back to accept the next one.

Well, when a flash mob arrives, the connection requests come in faster than the application can accept and dispatch them.  The TCP/IP stack protects itself by limiting the number of pending connections, so the queue grows until the stack has to start refusing new connection requests.  At that point, your server will be returning intermittent errors and you're already failing.

The solution is much easier said than done: accept and dispatch connections faster than they arrive.
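The accept-and-dispatch loop, with its pending-connection queue, can be sketched in a few lines. This is a minimal illustration of the mechanism, not a production server (no error handling, unbounded threads):

```python
import socket
import threading

def serve(host="127.0.0.1", port=0, backlog=128, handler=None):
    """Minimal accept-and-dispatch loop.  The listen backlog is the
    pending-connection queue described above; once it fills, the
    stack starts refusing connections on our behalf."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(backlog)  # pending-connection queue depth

    def accept_loop():
        while True:
            conn, addr = listener.accept()
            # Hand the established connection to a worker and get back
            # to accepting as fast as possible.
            threading.Thread(target=handler, args=(conn,), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener
```

The thread-per-connection dispatch here is the simple version; the NIO-style connectors discussed below multiplex many connections onto a few threads instead, which is what lets them absorb thousands of connections without thousands of threads.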

Filip Hanik compares some popular open-source servlet containers to see how well they stand up to floods of connection requests.  In particular, he demonstrates the value of Tomcat 6's new NIO connector.  Thanks to some very careful coding, this connector can accept 4,000 connections in 4 seconds on one server.  Ultimately, he gets it to accept 16,000 concurrent connections on a single server.  (Not surprisingly, RAM becomes the limiting factor.)

It's not clear that these connections can actually be serviced at that point, but that's a story for another day.

Self-Inflicted Wounds

My friend and colleague Paul Lord said, "Good marketing can kill you at any time."

He was describing a failure mode that I discuss in Release It!: Design and Deploy Production-Ready Software as "Attacks of Self-Denial".  These have all the characteristics of a distributed denial-of-service attack (DDoS), except that a company asks for it.  No, I'm not blaming the victim for electronic vandalism... I mean, they actually ask for the attack.

The anti-pattern goes something like this: marketing conceives of a brilliant promotion, which they send to 10,000 customers.  Some of those 10,000 pass the offer along to their friends.  Some of them post it to sites like FatWallet or TechBargains.  On the appointed day, hour, and minute, the site has a date with destiny as a million or more potential customers hit the deep link that marketing sent around in the email.  You know, the one that bypasses the content distribution network, embeds a session ID in the URL, and uses SSL?

Nearly every retailer I know has done this to themselves at one point.  Two holidays ago, one of my clients did it to themselves, when they announced that XBox 360 preorders would begin at a certain day and time.  Between actual customers and the amateur shop-bots that the tech-savvy segment cobbled together, the site got crushed.  (Yes, this was one where marketing sent the deep link that bypassed all the caching and bot-traps.)

Last holiday, Amazon did it to themselves when they discounted the XBox 360 by $300.  (What is it about the XBox 360?)  They offered a thousand units at the discounted price and got ten million shoppers.  All of Amazon was inaccessible for at least 20 minutes.  (It may not sound like much, but some estimates say Amazon generates $1,000,000 per hour during the holiday season, so that 20-minute outage probably cost them over $300,000!)

In Release It!, I discuss some non-technical ways to mitigate this behavior, as well as some design and architecture patterns you can apply to minimize damage when one of these Attacks of Self-Denial occurs.

How to become an "architect"

Over at The Server Side, there's a discussion about how to become an "architect".  Though TSS comments often turn into a cesspool, I couldn't resist adding my own two cents.

I should also add that the title "architect" is vastly overused.  It's tossed around like a job grade on the technical ladder: associate developer, developer, senior developer, architect.  If you talk to a consulting firm, it goes more like: senior consultant (1 - 2 years experience), architect (3 - 5 years experience), senior technical architect (5+ years experience).  Then again, I may just be too cynical.

There are several qualities that the architecture of a system should be:

  1. Shared.  All developers on the team should have more or less the same vision of the structure and shape of the overall system.
  2. Incremental.  Grand architecture projects lead only to grand failures.
  3. Adaptable. Successful architectures can be used for purposes beyond their designers' original intentions.  (Examples: Unix pipes, HTTP, Smalltalk)
  4. Visible.  The "sacred, invisible architecture" will fall into disuse and disrepair.  It will not outlive its creator's tenure or interest.

Is the designated "architect" the only one who can produce these qualities?  Certainly not.  He/she should be the steward of the system, however, leading the team toward these qualities, along with the other -ilities, of course.

Finally, I think the most important qualification of an architect should be: someone who has created more than one system and lived with it in production.  That automatically implies that the architect must have delivered at least one system into production.  I've run into "architects" who've never had a project actually make it into production, or if they have, they've rolled off the project---again with the consultants---just as Release 1.0 went out the door.

In other words, architects should have scars. 

Planning to Support Operations

In 2005, I was on a team doing application development for a system that would be deployed to 600 locations. About half of those locations would not have network connections. We knew right away that deploying our application would be key, particularly since it is a "rich-client" application. (What we used to call a "fat client", before they became cool again.) Deployment had to be done by store associates, not IT. It had to be safe, so that a failed deployment could be rolled back before the store opened for business the next day. We spent nearly half of an iteration setting up the installation scripts and configuration. We set our continuous build server up to create the "setup.exe" files on every build. We did hundreds of test installations in our test environment.

Operations said that our software was "the easiest installation we've ever had." Still, that wasn't the end of it. After the first update went out, we asked operations what could be done to improve the upgrade process. Over the next three releases, we made numerous improvements to the installers:

  • Make one "setup.exe" that can install either a server or a client, and have the installer itself figure out which one to do.
  • Abort the install if the application is still running. This turned out to be particularly important on the server.
  • Don't allow the user to launch the application twice. Very hard to implement in Java. We were fortunate to find an installer package that made this a check-box feature in the build configuration file!
  • Don't show a blank Windows command prompt window. (An artifact of our original .cmd scripts that were launching the application.)
  • Create separate installation discs for the two different store brands.
  • When spawning a secondary application, force its window to the front, avoiding the appearance of a hang if the user accidentally gives focus to the original window.

These changes reduced support call volume by nearly 50%.

My point is not to brag about what a great job we did. (Though we did a great job.) To keep improving our support for operations, we deliberately set aside a portion of our team capacity each iteration. Operations had an open invitation to our iteration planning meetings, where they could prioritize and select story cards the same as our other stakeholders. In this manner, we explicitly included Operations as a stakeholder in application construction. They consistently brought us ideas and requests that we, as developers, would not have come up with.

Furthermore, we forged a strong bond with Operations. When issues arose---as they always will---we avoided all of the usual finger-pointing. We reacted as one team, instead of two disparate teams trying to avoid responsibility for the problems. I attribute that partly to the high level of professionalism in both development and operations, and partly to the strong relationship we created through the entire development cycle.

Reflexivity and Introspection

A fascinating niche of programming languages consists of those languages which are constructed in themselves. For instance, Squeak is a Smalltalk whose interpreter is written in Squeak. Likewise, the best language for writing a LISP interpreter turns out to be LISP itself. (That one is more like nesting than bootstrapping, but it's closely related.)

I think Ruby has enough introspection to be built the same way. Recently, a friend clued me in to PyPy, a Python interpreter written in Python.

I'm sure there are many others. In fact, the venerable GCC is written in its own flavor of C. Compiling GCC from scratch requires a bootstrapping phase: first, a small version of GCC, written in a more portable subset of C, is compiled with some other C compiler. Then that stage-one micro-GCC compiles the whole GCC for the target platform.

Reflexivity arises when the language has sufficient introspective capabilities to describe itself. I cannot help but be reminded of Gödel, Escher, Bach and the difficulties that reflexivity causes. Gödel's Theorem doesn't really kick in until a formal system is complex enough to describe itself. At that point, Gödel's Theorem proves that there will be true statements, expressed in the language of the formal system, that cannot be proven true. These are inevitably statements about themselves---the symbolic logic form of, "This sentence is false."

Long-time LISP programmers create works with such economy of expression that we can only use artistic metaphors to describe them. Minimalist. Elegant. Spare. Rococo.

Forth was my first introduction to self-creating languages. FORTH starts with a tiny kernel (small enough that it fit into a 3KB cartridge for my VIC-20) that gets extended one "word" at a time. Each word adds to the vocabulary, essentially customizing the language to solve a particular problem. It's really true that in FORTH, you don't write programs to solve problems. Instead, you invent a language in which solving the problem is trivial, then you spend your time implementing that language.
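That style is easy to demonstrate in miniature. The toy below is a sketch of the FORTH idea, not real FORTH: a stack, a dictionary of words, and a ":" ... ";" form that extends the vocabulary one word at a time:

```python
def make_forth():
    """A toy FORTH-style interpreter: a stack, a dictionary of words,
    and a way to extend the vocabulary with ':' ... ';' definitions."""
    stack = []
    words = {
        "+": lambda: stack.append(stack.pop() + stack.pop()),
        "*": lambda: stack.append(stack.pop() * stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
    }

    def run(source):
        tokens = source.split()
        i = 0
        while i < len(tokens):
            tok = tokens[i]
            if tok == ":":                      # start a new word definition
                name = tokens[i + 1]
                end = tokens.index(";", i)
                body = " ".join(tokens[i + 2:end])
                words[name] = lambda body=body: run(body)  # new vocabulary
                i = end + 1
            elif tok in words:
                words[tok]()
                i += 1
            else:
                stack.append(int(tok))          # anything else is a number
                i += 1
        return stack

    return run

run = make_forth()
run(": square dup * ;")   # extend the vocabulary with a new word
```

After that definition, "square" is just as much a part of the language as "+" was: the program *is* the growing vocabulary.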

Another common aspect of these self-describing languages seems to be that they never become widely popular. I've heard several theories that attempted to explain this. One says that individual LISP programmers are so productive that they never need large teams. Hence, cross-pollination is limited and it is hard to demonstrate enough commercial demand to seem convincing. Put another way: if you started with equal populations of Java and LISP programmers, demand for Java programmers would quickly outstrip demand for LISP programmers... not because it's a superior language, but just because you need more Java programmers for any given task. This demand becomes self-reinforcing, as commercial programmers go where the demand is, and companies demand what they see is available.

I also think there's a particular mindset that admires and relates to the dynamic of the self-creating language. I suspect that programmers possessing that mindset are also the ones who get excited by metaprogramming.

More Wiki

My personal favorite is TWiki. It has some nice features like file attachments, a great search interface, high configurability, and a rich set of available plugins (including an XP tracker plugin.)

One cool thing about TWiki: configuration settings are accomplished through text on particular topics. For example, each "web" (set of interrelated topics) has a topic called "WebPreferences". The text on the WebPreferences topics actually controls the variables. Likewise, if you want to set personal preferences, you set them as variables--in text--on your personal topic. It's a lot harder to describe than it is to use.

There are some other nice features like role-based access control (each topic can have a variable that says which users or groups can modify the topic), multiple "webs", and so on.

The search interface is available as variable interpolation on a topic, so something like the "recent changes" topic just ends up being a date-ordered search of changes, limited to ten topics. This means that you can build dynamic views based on content, metadata, attachments, or form values. I once put a search variable on my home topic that would show me any task I was assigned to work on or review.

I've also been looking at Oahu Wiki. It's an open source Java wiki. It's fairly short on features at this point, but it has by far the cleanest design I've seen yet. I look forward to seeing more from this project.

Too Much Abstraction

The more I deal with infrastructure architecture, the more I think that somewhere along the way, we have overspecialized. There are too many architects that have never lived with a system in production, or spent time on an operations team. Likewise, there are a lot of operations people that insulate themselves from the specification and development of systems for which they will ultimately take responsibility.

The net result is suboptimization in the hardware/software fit. As a result, overall availability of the application suffers.

Here's a recent example.

First, we're trying to address the general issue of flowing data from production back into pre-production systems -- QA, production support, development, staging. The first attempt took 6 days to complete. Since the requirements of the QA environment stipulate that the data should be no more than one week out of date relative to production, that's a big problem. On further investigation, it appears that the DBA who was executing this process spent most of the time doing scps from one host to another. It's a lot of data, so in one respect 10 hour copies are reasonable.

But the DBA had never been told about the storage architecture. That's the domain of a separate "enterprise service" group. They are fairly protective of their domain and do not often allow their architecture documents to be distributed. They want to reserve the right to change them at will. Now, they will be quite helpful if you approach them with a storage problem, but the trick is knowing when you have a storage problem on your hands.

You see, all of the servers that the DBA was copying files from and to are all on the same SAN. An scp from one host on the SAN to another host on the SAN is pretty redundant.

There's an alternative solution that involves a few simple steps: Take a database snapshot onto a set of disks with mirrors, split the mirrors, and join them onto another set of mirrors, then do an RMAN "recovery" from that snapshot into the target database. Total execution time is about 4 hours.

From six days to four hours, just by restating the problem to the right people.

This is not intended to criticize any of the individuals involved. Far from it, they are all top-notch professionals. But the solution required merging the domains of knowledge from these two groups -- and the organizational structure explicitly discouraged that merging.

Another recent example.

One of my favorite conferences is the Colorado Software Summit. It's a very small, intensely technical crowd. I sometimes think half the participants are also speakers. There's a year-round mailing list for people who are interested in, or have been to, the Summit. These are very skilled and talented people. This is easily the top 1% of the software development field.

Even there, I occasionally see questions about how to handle things like transparent database connection failover. I'll admit that's not exactly a journeyman topic. Bring it up at a party and you'll have plenty of open space to move around in. What surprised me is that there are some fairly standard infrastructure patterns for enabling database connection failover that weren't known to people with decades of experience in the field. (E.g., cluster software reassigns ownership of a virtual IP address to one node or the other, with all applications using the virtual IP address for connections).
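From the application's point of view, that pattern is simple: keep connecting to the same virtual address and let the cluster move it. A sketch, with the address and the `connect` hook as placeholders:

```python
import socket
import time

VIRTUAL_IP = ("db-vip.example.internal", 5432)  # hypothetical virtual address

def connect_with_failover(address, attempts=5, delay_s=2.0,
                          connect=socket.create_connection):
    """Failover is invisible here: the cluster software reassigns the
    virtual IP to the surviving node, and our only job is to keep
    retrying the same address until the connection lands there."""
    last_error = None
    for _ in range(attempts):
        try:
            return connect(address)
        except OSError as e:
            last_error = e
            time.sleep(delay_s)     # give the cluster time to move the VIP
    raise last_error
```

The application never knows which physical node it reached; that's the whole point of the virtual IP.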

This tells me that we've overspecialized, or at least, that the groups are not talking nearly enough. I don't think it's possible to be an expert in high availability, infrastructure architecture, enterprise data management, storage solutions, OOA/D, web design, and network architecture. Somehow, we need to find an effective way to create joint solutions, so we don't have software being developed that's completely ignorant of its deployment architecture, nor should we have infrastructure investments that are not capable of being used by the software. We need closer ties between operations, architecture, and development.

Don't Build Systems That Boink

Note: This piece originally appeared in the "Marbles Monthly" newsletter in April 2003

I caught an incredibly entertaining special on The Learning Channel last week. A bunch of academics decided that they were going to build an authentic Roman-style catapult, based on some ancient descriptions. They had great plans, engineering expertise, and some really dedicated and creative builders. The plan was to hurl a 57 pound stone 400 yards, with a machine that weighed 30 tons. It was amazing to see the builders' faces swing between hope and fear. Excitement mingled with apprehension.

At one point, the head carpenter said that it would be wonderful to see it work, but "I'm fairly certain it's going to boink." I immediately knew what he meant. "Boink" sums up all the myriad ways this massive device could go horribly wrong and wreak havoc upon them all. It could fall over on somebody. It could break, releasing all that kinetic energy in the wrong direction, or in every direction. The ball could fly off backwards. The rope might relax so much that it just did nothing. One of the throwing arms could break. They could both break. In other words, it could do anything other than what it was intended to do.

That sounds pretty familiar. I see the same expressions on my teammates' faces every day. This enormous project we're slaving on could fall over and crush us all into jelly. It could consume our minds and our every waking hour. Worst case, it might cost us our families, our health, our passion. It could embarrass the company, or cost it tons of money. In fact, just about the most benign thing it could do is nothing.

So how do you make a system that doesn't boink? It is hard enough just making the system do what it is supposed to. The good news is that some simple "do's and don'ts" will take us a long way toward non-boinkage.

Automation is Your Friend #1: Run lots of tests -- and run them all the time

Automated unit tests and automated functional tests will guarantee that you don't backslide. They provide concrete evidence of your functionality, and they force you to keep your code integrated.

Automation is Your Friend #2: Be fanatic about build and deployment processes

A reliable, fully automated build process will prevent headaches and heartbreaks. A bad process--or a manual process--will introduce errors and make it harder to deliver on an iterative cycle.

Start with a fully automated build script on day one. Start planning your first production-class deployment right away, and execute a deployment within the first three weeks. A build machine (it can be a workstation) should create a complete, installable numbered package. That same package should be delivered into each environment. That way, you can be absolutely certain that QA gets exactly the same build that went into integration testing.
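The packaging step can be as simple as this sketch: build one numbered archive, record its checksum, and verify that checksum in every environment before installing. File names and layout here are illustrative:

```python
import hashlib
import os
import tarfile

def build_package(source_dir, build_number, out_dir="."):
    """Create one numbered, installable package and record its checksum.
    The same file (verified by hash) is what every environment receives."""
    name = os.path.join(out_dir, "app-build-%d.tar.gz" % build_number)
    with tarfile.open(name, "w:gz") as tar:
        tar.add(source_dir, arcname="app")
    digest = hashlib.sha256(open(name, "rb").read()).hexdigest()
    with open(name + ".sha256", "w") as f:
        f.write(digest)
    return name, digest

def verify_package(name):
    """Before installing in any environment, prove it's the same build
    that passed integration testing."""
    digest = hashlib.sha256(open(name, "rb").read()).hexdigest()
    return digest == open(name + ".sha256").read().strip()
```

The checksum check is what turns "we think QA tested this build" into "we know QA tested this exact file."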

Avoid the temptation to check out the source code to each environment. An unbelievable amount of downtime can be traced to a version label being changed between the QA build and the production build.

Everything In Its Place

Keep separated the things that change at different speeds. Log files change very fast, so isolate them. Data changes a little less quickly but is still dynamic. "Content" changes slower yet, but is still faster than code. Configuration settings usually come somewhere between code and content. Each of these should go in its own location, isolated and protected from the others.

Be transparent

Log everything interesting that happens. Log every exception or warning. Log the start and end of long-running tasks. Always make sure your logs include a timestamp!

Be sure to make the location of your log files configurable. It's not usually a good idea to keep log files in the same filesystem as your code or data. Filling up a filesystem with logs should not bring your system down.
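A sketch of that setup (the directory variable and log format are illustrative): the log location comes from the environment rather than the code, and every line carries a timestamp:

```python
import logging
import os
import tempfile

# Configurable location: operations decides, not the code.
# Falls back to a scratch directory for this illustration.
LOG_DIR = os.environ.get("APP_LOG_DIR", tempfile.mkdtemp())

def configure_logging(log_dir):
    """Timestamped log lines, written wherever operations points us,
    never into the filesystems holding code or data."""
    os.makedirs(log_dir, exist_ok=True)
    handler = logging.FileHandler(os.path.join(log_dir, "app.log"))
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"))
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)
    return root

log = configure_logging(LOG_DIR)
log.info("long-running task started")
log.info("long-running task finished")
```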

Keep your configuration out of your code

It is always a good idea to separate metadata from code. This includes settings like host names, port numbers, database URLs and passwords, and external integrations.

A good configuration plan will allow your system to exist in different environments -- QA versus production, for example. It should also allow for clustered or replicated installations.
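A minimal sketch of that idea: the environment name selects the settings, and nothing is compiled into the code. The settings themselves are invented for the example; in practice they'd live in files outside the deployed package:

```python
import os

# Hypothetical per-environment settings, kept outside the code:
CONFIGS = {
    "qa":   {"db_url": "jdbc:oracle:thin:@qa-db:1521/APP",
             "fulfillment_host": "qa-gw"},
    "prod": {"db_url": "jdbc:oracle:thin:@prod-db:1521/APP",
             "fulfillment_host": "gw"},
}

def load_config(environment=None):
    """One code base, many environments: the environment name selects
    the settings.  Refuse to guess when the environment is unknown."""
    env = environment or os.environ.get("APP_ENV", "qa")
    try:
        return CONFIGS[env]
    except KeyError:
        raise ValueError("unknown environment %r" % env)
```

Failing loudly on an unknown environment matters: a system that silently falls back to defaults is a system that someday points production at the QA database.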

Keep your code and your data separated

The object-oriented approach is a good way to build software, but it's a lousy way to deploy systems. Code changes at a different frequency than data. Keep them separated. For example, in a web system, it should be easy to deploy a new code drop without disrupting the content of the site. Likewise, new content should not affect the code.

Keep Your Secrets

Here's a system I call "KeepYourSecrets.org". Recall a film noir detective telling the criminal mastermind that unless he drops a postcard in the mail in the next three days, all the details will go straight to the newspaper.

You can upload any kind of file -- it's all treated like binary. You can set some parameters like a distribution list and a checkin frequency. The system uses an IRC-like network to split your file in n parts, of which some k parts are needed to re-create the original. Up to n-k parts can be lost or compromised without losing or compromising the whole. (See "Applied Cryptography" for details.) With lots of hosts, you can split a document into multiple overlapping sets of pieces to provide another layer of resiliency against damage.
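The k-of-n splitting is Shamir's secret sharing, one of the schemes Applied Cryptography describes: the secret becomes the constant term of a random polynomial of degree k-1, and each share is a point on that polynomial. A compact sketch over a prime field, treating the secret as an integer (demo-sized, not hardened):

```python
import random

PRIME = 2 ** 127 - 1  # a Mersenne prime large enough for a demo secret

def split_secret(secret, n, k):
    """Split `secret` into n shares, any k of which reconstruct it.
    Shares are points (x, f(x)) on a random degree k-1 polynomial
    whose constant term is the secret."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def join_shares(shares):
    """Lagrange interpolation at x=0 recovers the constant term,
    i.e. the secret.  Needs at least k distinct shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # pow(den, PRIME - 2, PRIME) is the modular inverse (Fermat).
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

Any k shares reconstruct the secret; k-1 shares reveal nothing, which is exactly the "up to n-k parts can be lost or compromised" property described above.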

From then on, if you do not check in with the network on some periodic basis, the document goes out to the distribution list. NYTimes, Washington Post, CIA, whoever is on the distribution list for your file.

The network of servers doesn't ever have to know who you are.  The servers just need to know that you hold the private key that matches the public key that was used to upload the package.

It's possible to construct voting algorithms that the servers can use to decide if you have really checked in or not. This lets the network protect against a single compromised or hostile host. (You have to be resilient against hostile implementations.)

Because the hosts all communicate via some pub/sub or relay-chat protocol (Jabber, maybe?), the network of hosts can be self-forming and self-identifying. If there is no central point of control, then the network as a whole cannot be stopped, subverted, or forced to give up secrets by any single agency.

What you end up with is a secure, anonymous drop box that cannot be blocked, traced, or infiltrated. It is self-forming and highly resilient to the loss of constituent pieces.


Needles, Haystacks

So, this may seem a little off-topic, but it comes round in the end. Really, it does.

I've been aggravated with the way members of the fourth estate have been treating the supposed "information" that various TLAs had before the September 11 attacks.  (That used to be my birthday, by the way. I've since decided to change it.)  We hear that four or five good bits of information scattered across the hundreds of FBI, CIA, NSA, NRO, IRS, DEA, INS, or IMF offices "clearly indicate" that terrorists were planning to fly planes into buildings.  Maybe so.  Still, it doesn't take a doctorate in complexity theory to figure out that you could probably find just as much data to support any conclusion you want.  I'm willing to bet that if the same amount of collective effort were invested, we could prove that the U. S. Government has evidence that Saddam Hussein and aliens from Saturn are going to land in Red Square to re-establish the Soviet Union and launch missiles at Guam.

You see, if you already have the conclusion in hand, you can sift through mountain ranges of data to find those bits that best support your conclusion. That's just hindsight. It's only good for gossipy hens clucking over the backyard fence, network news anchors, and not-so-subtle innuendos by Congresscritters.

The trouble is, it doesn't work in reverse. How many documents does just the FBI produce every day? 10,000? 50,000? How would anyone find exactly those five or six documents that really matter and ignore all of the chaff? That's the job of analysis, and it's damn hard. A priori, you could only put these documents together and form a conclusion through sheer dumb luck. No matter how many analysts the agencies hire, they will always be crushed by the tsunami of data.

Now, I'm not trying to make excuses for the alphabet soup gang. I think they need to reconsider some of their basic operations. I'll leave questions about separating counter-intelligence from law enforcement to others. I want to think about harnessing randomness. You see, government agencies are, by their very nature, bureaucratic entities. Bureaucracies thrive on command-and-control structures. I think it comes from protecting their budgets. Orders flow down the hierarchy, information flows up. Somewhere, at the top, an omniscient being directs the whole shebang. A command-and-control structure hates nothing more than randomness. Randomness is noise in the system, evidence of inadequate procedures. A properly structured bureaucracy has a big, fat binder that defines who talks to whom, and when, and under what circumstances.

Such a structure is perfectly optimized to ignore things. Why? Because each level in the chain of command has to summarize, categorize, and condense information for its immediate superior. Information is lost at every exchange. Worse yet, the chance for somebody to see a pattern is minimized. The problem is this whole idea that information flows toward a converging point. Whether that point is the head of the agency, the POTUS, or an army of analysts in Foggy Bottom, they cannot assimilate everything. There isn't even any way to build information systems to support the mass of data produced every day, let alone correlating reports over time.

So, how do Dan Rather and his cohorts find these things and put them together? Decentralization. There are hordes of pit-bull journalists just waiting for the scandal that will catapult them onto CNN. ("Eat your heart out, Wolf, I found the smoking gun first!")

Just imagine if every document produced by the Minneapolis field office of the FBI were sent to every other FBI agent and office in the country. A vast torrent of data flowing constantly around the nation. Suppose that an agent filing a report about suspicious flight school activity could correlate that with other reports about students at other flight schools. He might dig a little deeper and find some additional reports about increased training activity, or a cluster of expired visas that overlap with the students in the schools. In short, it would be a lot easier to correlate those random bits of data to make the connections. Humans are amazing at detecting patterns, but they have to see the data first!

This is what we should focus on. Not on rebuilding the $6 Billion Bureaucracy, but on finding ways to make available all of the data collected today. (Notice that I haven't said anything that requires weakening our 4th or 5th Amendment rights. This can all be done under laws that existed before 9/11.) Well, we certainly have a model for a global, decentralized document repository that will let you search, index, and correlate all of its contents. We even have technologies that can induce membership in a set. I'd love to see what Google Sets would do with the 19 hijackers' names, after you have it index the entire contents of the FBI, CIA, and INS databases. Who would it nominate for membership in that set?

Basically, the recipe is this: move away from ill-conceived ideas about creating a "global clearinghouse" for intelligence reports. Decentralize it. Follow the model of the Internet, Gnutella, and Google. Maximize the chances for field agents and analysts to be exposed to that last, vital bit of data that makes a pattern come clear. Then, when an agent perceives a pattern, make damn sure the command-and-control structure is ready to respond.


Here's a good roundup of recent traffic regarding REST.

REST and Change in APIs

In case it didn't come through, I'm intrigued by REST, because it seems more fluid than the WS-* specifications. I can do an HTTP request in about 5 lines of socket code in any modern language, from any client device.
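To make the "5 lines of socket code" claim concrete, here is a minimal sketch in Python. It spins up a throwaway one-shot server on localhost purely so the client has something to talk to; in real use the client half would point at an actual host. The URI path is a hypothetical example, not a real endpoint.

```python
import socket
import threading

# Tiny one-shot HTTP server on an ephemeral localhost port, just so the
# client below has a peer to connect to. Not part of the "5 lines".
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

def serve_once():
    conn, _ = server.accept()
    conn.recv(4096)  # read (and ignore) the request
    conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nhi")
    conn.close()

threading.Thread(target=serve_once).start()

# The client side: an HTTP GET really is just a few lines of socket code.
sock = socket.create_connection(("127.0.0.1", port))
sock.sendall(b"GET /purchaseOrders/1138 HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n")
response = b""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

status_line = response.split(b"\r\n")[0].decode()
print(status_line)  # HTTP/1.0 200 OK
```

Compare that with the machinery a SOAP client drags in: a stub generator, a WSDL parser, an XML serializer. The asymmetry is the point.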

The WS-splat crowd seems to be building YABS (yet another brittle standard). Riddle me this: what use is a service description in a standardized form if there is only one implementor of that service? WSDL only attains full value when there are standards built on top of WSDL. Just like XML, WSDL is a meta-standard. It is a standard for specifying other standards. Collected and diverse industry behemoths and leviathans make the rules for that playground.

I see two equally likely outcomes for any given service definition:

  • A defining body will standardize the interface for a particular web service. This will take far too long.
  • A dominant company in a star-like topology with its customers and suppliers (think Wal-mart) will impose an interface that its business partners must use.

Once such interfaces are defined, how easily might they be changed? I mean the WSDL (or other) definition of the service itself. Can anyone say CORBAservices? You'd better define your services right the first time, because there appears to be substantial friction opposing change.

How does REST avoid this issue? By eliminating layers. If I support a URI naming scheme like http://company.com/groupName/divisionName/departmentName/purchaseOrders/poNumber as a RESTful way to access purchase orders, and I find that we need to change it to /purchaseOrders/departmentNumber/poNumber, then both forms can co-exist. The alternative change in SOAP/WSDL-land would either modify the original endpoint (an incompatible change!) or would define a new service to support the new mode of lookup. (I suppose other hacks are available, too. Service.getPurchaseOrder2() or Service.getPurchaseOrderNew() for example.)
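A sketch of how those two URI schemes can coexist behind one resource. The route patterns, the in-memory store, and the PO number are all hypothetical; the point is only that a dispatcher can map both the old, deeply nested name and the new, flatter name to the same purchase order, so old clients keep working.

```python
import re

# Hypothetical in-memory store of purchase orders, keyed by PO number.
PURCHASE_ORDERS = {"1138": {"department": "17", "total": 99.50}}

# Old scheme: /groupName/divisionName/departmentName/purchaseOrders/poNumber
# New scheme: /purchaseOrders/departmentNumber/poNumber
# Both routes resolve to the same underlying resource.
ROUTES = [
    re.compile(r"^/\w+/\w+/\w+/purchaseOrders/(?P<po>\w+)$"),  # old scheme
    re.compile(r"^/purchaseOrders/\w+/(?P<po>\w+)$"),          # new scheme
]

def get_purchase_order(uri):
    """Resolve either URI form to a purchase order, or None (a 404)."""
    for route in ROUTES:
        match = route.match(uri)
        if match:
            return PURCHASE_ORDERS.get(match.group("po"))
    return None

old = get_purchase_order("/widgets/midwest/fulfillment/purchaseOrders/1138")
new = get_purchase_order("/purchaseOrders/17/1138")
assert old is new  # same resource, reachable under two names
```

No client breaks, no `getPurchaseOrder2()`: the new name is just added alongside the old one, and the old one can be retired (or redirected with a 301) on its own schedule.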

Of course, neither of these service architectures is implemented widely enough to really evaluate which one will be more accepting of change. I can tell you, though, that one of the huge CORBA-killers was the slow pace and resistance to change in the CORBAservices.

Here's another excellent discussion about REST for web services.

Debating "Web Services"

There is a huge and contentious debate under way right now related to "Web services". A sizable contingent of the W3C and various XML pioneers are challenging the value of SOAP, WSDL, and other "Web service" technology.

This is a nuanced discussion with many different positions being taken by the opponents. Some are critical of the W3C's participation in something viewed as a "pay to play" maneuver from Microsoft and IBM. Others are pointing out serious flaws in SOAP itself. To me, the most interesting challenge comes from the W3C's Technical Architecture Group (TAG). This is the group tasked with defining what the web is and is not. Several members of the TAG, including the president of the Apache Foundation, are arguing that "Web services" as defined by SOAP fundamentally are not "the web". ("The web" being defined crudely as "things are named via URIs" and "every time I ask for the same URI, I get the same results". My definition, not theirs.) With a "Web service", a URI doesn't name a thing, it names a process. What I get when I ask for a URI is no longer dependent solely on the state of the thing itself. Instead, what I get depends on my path through the application.

I'd encourage you to all sample this debate, as summarized by Simon St. Laurent (one of the original XML designers).


For the ultimate in temporal, architectural, language and spatial decoupling, try two of my favorite fluid technologies: publish-subscribe messaging and tuple-spaces.