Wide Awake Developers


Attack of Self-Denial, 2008 Style

"Good marketing can kill your site at any time."

--Paul Lord, 2006

I just learned of another attack of self-denial from this past week.

Many retailers are suffering this year, particularly in the brick-and-mortar space. I have heard from several, though, who say that their online performance is not suffering as much as the physical stores are. In some cases, where the brand is strong and the products are not fungible, the online channel is showing year-over-year growth.

One retailer I know was running strong, with the site near its capacity. They fit the bill for an online success in 2008: great name recognition, a very strong global brand, and customers who love their products. This past week, their marketing group decided to "take it to the next level."

They blasted an email campaign to four million customers.  It had a good offer, no qualifier, and a very short expiration time---one day only.  A short expiration like that creates a sense of urgency.  Good marketing reaches people and induces them to act, and in that respect, the email worked. Unfortunately, when that means millions of users hitting your site, you may run into trouble.

Traffic flooded the site and knocked it offline. It took more than six hours to get everything functioning again.

Instead of getting an extra bump in sales, they lost six hours of holiday-season revenue. As a rule of thumb, you should assume that a peak hour of holiday sales counts for six hours of off-season sales.

There are technological solutions to help with this kind of traffic flood. For instance, the UX group can create a static landing page for the offer. Then marketing links to that static page in their email blast. Ops can push that static page out to their cache servers, or even onto their CDN's edge network. This takes some extra setup before the first such offer, and a little preparation for each one after, but it's very effective. The static page absorbs the bulk of the traffic, so only customers who really want to buy get passed into the dynamic site.
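As a minimal sketch of the origin side of that setup, assuming a Python/Flask stack and a pull-through cache or CDN (the retailer's actual stack isn't described here, and an ops team might push the file out instead): the Cache-Control header is what lets cache servers and edge nodes serve the page without ever touching the dynamic site.

    from flask import Flask, Response

    app = Flask(__name__)

    # Pre-rendered at deploy time; no database or session work at request
    # time. "one_day_sale.html" is a hypothetical static file for the offer.
    PROMO_HTML = open("one_day_sale.html").read()

    @app.route("/promo/one-day-sale")  # hypothetical URL used in the email blast
    def promo_landing_page():
        return Response(
            PROMO_HTML,
            mimetype="text/html",
            # "public" lets shared caches store the page; an hour of max-age
            # is long enough to absorb the spike, short enough to fix mistakes.
            headers={"Cache-Control": "public, max-age=3600"},
        )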

Marketing can also send the email out in waves, so people receive it at different times. That spreads the traffic spike out over a few hours. (Though this doesn't work so well when you send the waves throughout the night, because customers will all see it within the same couple of hours in the morning.)
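The mechanics are simple. Here's a rough sketch, assuming a send_campaign_email() helper that wraps whatever email service is in use; the helper, the wave size, and the delay are all placeholders:

    import time

    def send_campaign_email(address: str) -> None:
        """Stub; replace with your email service's send call."""
        print(f"sending offer to {address}")

    def send_in_waves(recipients, wave_size=250_000, delay_seconds=3600):
        # Four million recipients at 250,000 per wave is 16 waves. One wave
        # per hour spreads the click traffic out---just schedule the waves
        # during waking hours, per the caveat above.
        for i in range(0, len(recipients), wave_size):
            for address in recipients[i:i + wave_size]:
                send_campaign_email(address)
            time.sleep(delay_seconds)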

In really extreme cases, a portion of capacity can be carved out and devoted to handling promotional traffic. That way, if the promotion goes nuclear, at least the rest of the site is still online. Obviously, this would be more appropriate for a long-running promotion than a one-day event.

Of course, it should be obvious that all of these technological solutions depend on good communication.

At a surface level, it's easy to say that this happened because marketing had no idea how close to the edge the site was already running. That's true. It's also true, however, that operations once had no idea what the capacity was. If marketing had called and asked, "Can we support four million extra visits?", the current operations group could have answered "no." Previously, the answer would have been "I don't know."

So, operations got better, but marketing never got re-educated. Lines of communication were never opened, or re-opened. Better communication would have helped.

In any online business, you must have close communications between marketing, UX, development, and operations. They need to regard themselves as part of one integrated team, rather than four separate teams. I've often seen development groups that view operations as a barrier to getting their stuff released. UX and marketing view development as the barrier to getting their ideas implemented, and so on. This dynamic evolves from the "throw it over the wall" approach, and it can only result in finger-pointing and recriminations.

I'd bet there's a lot of finger-pointing going on in that retailer's hallways this weekend.

(Human | Pattern) Languages, part 2

At the conclusion of the modulating bridge, we expect to be in the contrasting key of C minor. Instead, the bridge concludes in the distantly related key of F sharp major... Instead of resolving to the tonic, the cadence concludes with two isolated E pitches. They are completely ambiguous. They could belong to E minor, the tonic for this movement. They could be part of E major, which we've just heard peeking out from behind the minor mode curtains. [He] doesn't resolve them into a definite key until the beginning of the third movement, characteristically labeled a "Scherzo".

In my last post, I lamented the missed opportunity we had to create a true pattern language about software. Perhaps calling it a missed opportunity is too pessimistic. Bear with me on a bit of a tangent. I promise it comes back around in the end.

The example text above is an amalgam of a lecture series I've been listening to. I'm a big fan of The Teaching Company and their courses. In particular, I've been learning about the meaning and structure of classical, baroque, romantic, and modern music from Professor Robert Greenberg.[1] The sample I used here is from a series on Beethoven's piano sonatas. This isn't an actual quote, but a condensation of statements from one of the lectures. I'm not going to go into all the music theory behind this, but it is interesting.[2]

There are two things I want you to observe about the sample text. First, it's loaded with jargon. It has to be! You'd exhaust the conversational possibilities about the best use of a D-sharp pretty quickly. Instead, you'll talk about structures, tonalities, and relationships between that D-sharp and other pitches. (A D-sharp played together with a C? Very different from a quick sequence of D-sharp, E, D-sharp, C.) You can be sure that composers don't think in terms of individual notes. A D-sharp by itself doesn't mean anything. It only acquires meaning through its relation to other pitches. Hence all that stuff about keys---tonic, distantly related, contrasting. "Key" is a construct for discussing whole collections of pitches in a kind of shorthand. To a musician, there's a world of difference between G major and A flat minor, even though their tonics are only a half-step apart.

Also notice that the text addresses some structural features. The purpose and structure of a modulating bridge is pretty well understood, at least in certain circles. The notion that you can have an "expected" key certainly implies that there are rules for a sonata. In fact, the term "sonata" itself means some fairly specific things[3]... although to know whether we're talking about "a sonata" or "a movement in sonata form" requires some additional context.

In fact, this paragraph is all about context. It exists in the context of late Classical, early Romantic era music, specifically the music of Beethoven. In the Classical era, musical forms---such as sonata form---pretty much dictated the structure of the music. The number of movements, their relationships to each other, their keys, and even their tempos were well understood. A contemporary listener had every reason to expect that a first movement would be fast and bright, and that if the first movement was in C major, then the second, slower movement would be a minuet and trio in G major.

Music and music theory have evolved over the last thousand-odd years. We have a vocabulary---the potentially off-putting jargon of the field. We have nesting, interrelating contexts. Large-scale patterns (a piano sonata) create context for medium-scale patterns (the first movement "allegretto"), which, in turn, create context for the medium- and small-scale patterns (the first theme in the allegretto consists of an ABA'BA phrasing, in which the opening theme sequences a motive upward over octaves). We even have the ability to talk about non sequiturs---like the modulating bridge above---where a deliberate violation of the pattern language is done for effect.[4]

What is all this stuff if it isn't a pattern language?

We can take a few lessons, then, from the language of music.

The first lesson is this: give it time. Musical language has evolved over a long time. It has grown and been pruned back over centuries. New terms are invented as needed to describe new answers to a context. In turn, these new terms create fresh contexts to be exploited with yet other inventions.

Second, any such language must be able to assimilate change. Nothing is lost, even amidst the most radical revolutions. When the twentieth-century modernists rejected the tonal system, they could only reject the structures and strictures of that language. They couldn't destroy the language itself. Phish plays fugues in concert... they just play them with electric guitars instead of harpsichords. There are Baroque orchestras today. They play in the same concert halls as the Pops and Philharmonics. The monophonic texture of plainchant still exists, and so do the once-heretical polyphony and the homophonic textures that came later. Nothing is lost, but new things can be encompassed and incorporated.

And mainframes still exist, with their COBOL programs, alongside distributed object systems, message passing, and web services. The Singleton and Visitor patterns will never truly go away, any more than batch programming will disappear.

Third, we must continue to look at the relationships between different parts of our nascent pattern language. Just as individual objects aren't very interesting, isolated patterns are less interesting than the ways they can interact with each other.

I believe that the true language of software has as much to do with programming languages as the language of music has to do with notes. So, instead of missed opportunity, let us say instead that we are just beginning to discover our true language.


[1] Professor Greenberg is a delightful traveling companion. He's witty and knowledgeable, and he has a way of teaching complex subjects without ever being condescending. He also sounds remarkably like Penn Jillette.

[2] The main reason is that I would surely get it wrong in some details and risk losing the main point of my post here.

[3] And here we see yet another of the complexities of language. The word "sonata" refers, at different times, to a three-movement concert work, a single movement in a characteristic structure, a four-movement concert work, and, in Beethoven's case, to a couple of great fantasias that he declares to be sonatas simply because he says so.

[4] For examples ad nauseam, see Richard Wagner and the "abortive gesture".

(Human | Pattern) Languages

We missed the point when we adopted "patterns" in the software world. Instead of an organic whole, we got a bag of tricks.

The commonly accepted definition of a pattern is "a solution to a problem in a context." This is true, but limiting. This definition loses an essential characteristic of patterns: Patterns relate to other patterns.

We talk about the context of a problem. "Context" is a mental shorthand. If we unpack the context, it means many things: constraints, capabilities, style, requirements, and so on. We sometimes mislead ourselves by using the fairly fuzzy, abstract term "context" as a mental handle on a whole variety of very concrete issues. Context includes stated constraints like the functional requirements, along with unstated constraints like, "The computation should complete before the heat death of the universe." It includes other forces like, "This program is written in C#, so the solution to this problem should be in the same language or a closely related one." It should not require a supercooled quantum computer, for example.

Where does the context for a small-scale pattern originate?[1] Context does not arise ex nihilo. No, the context for a small-scale pattern is created by larger patterns. Large-grained patterns create the fabric of forces that we call the context for smaller patterns. In turn, smaller patterns fit into this fabric and, by their existence, they change it. Thus, the small-scale patterns create feedback that can either resolve or exacerbate tensions inherent in the larger patterns.

Solutions that respect their context fit better with the rest of the organic whole. It would be strange to be reading some Java code, built into a layered architecture with a relational database for storage, then suddenly find one component that has its own LISP interpreter and some functional code. With all respect to "polyglot programming", there'd better be a strong motivation for such an odd inclusion. It would be a discontinuity... in other words, it doesn't fit the context I described. That context---the layered architecture, the OO language, the relational database---was created by other parts of the system.

If, on the other hand, the system was built as a blackboard architecture, using LISP as glue code over intelligent agents acting asynchronously, then it wouldn't be at all odd to find some recursive lambda expressions. In that context, they fit naturally and the Java code would be an oddity.

This interrelation across scale knits patterns together into a pattern language. By and large, what we have today is a growing group of proper nouns. Please don't get me wrong: the nouns themselves have use. It's very helpful to say "you want a Null Object there," and be understood. That vocabulary, and the compression it provides, is really important.

But we shouldn't mistake a group of nouns for a real pattern language. A language is more than just its nouns. A language also implies ways of connecting statements sensibly. It has idioms and semantics and semiotics.[2] In a language, you can have dialog and argumentation. Imagine a dialog in patterns as they exist today:

"Pipes and filters."

"Observer?"

"Chain of Responsibility!"

You might be able to make a comedy sketch out of that, but not much more. We cannot construct meaningful dialogs about patterns at all scales.

What we have are fragments of what might become a pattern language. GoF, the PLoPD books, the PoSA books... these are like a few charted territories on an unmapped continent. We don't yet have the language that would even let us relate these works together, let alone relating them to everything else.

Everything else? Well, yes. By and large, patterns today are an outgrowth of the object-oriented programming community. I contend, however, that "object-oriented" is a pattern! It's a large-scale pattern that creates really significant context for all the other patterns that can work within it. Solutions that work within the "object-oriented" context make no sense in an actor-oriented context, or a functional context, or a procedural context, and so on. Each of these other large-scale patterns admits different solutions to similar problems: persistence, user interaction, and system integration, to name a few. I can imagine a pattern called "Event Driven" that would work very well with "Object Oriented", "Functional", and "Actor Oriented", but somewhat less well with "Procedural Programming", and would clash utterly with "Batch Processing". (Though there might be a link between them called "Buffer File" or something like that.)

That's the piece that we missed. We don't have a pattern language yet. We're not even close.


1. By "large" and "small", I don't mean to imply that patterns simply nest hierarchically. It's more complex and subtle than that. When we do have a real pattern language, we'll find that there are medium-grained patterns that work together with several, but not all, of the large ones. Likewise, we'll find small-scale patterns that make medium sized ones more or less practical. It's not a decision tree or a heuristic.

[2] That's what keeps "Fill the idea with blue" from being a meaningful sentence. All the words work, and they're even the right part of speech, yet the sentence as a whole doesn't fit together.

Connection Pools and Engset

In my last post, I talked about using Erlang models to size the front end of a system. By using some fundamental capacity models that are almost a century old, you can estimate the number of request handling threads you need for a given traffic load and request duration.

Inside the Box

It gets tricky, though, when you start to consider what happens inside the server itself. Processing the request usually involves some kind of database interaction with a connection pool. (There are many ways to avoid database calls, or at least minimize the damage they cause. I'll address some of these in a future post, but you can also check out Two Ways to Boost Your Flagging Web Site for starters.) Database calls act like a kind of "interior" request that can be considered to have its own probability of queuing.

[Figure: An exterior call to the server becomes an "interior" call to the database.]

Because this interior call can block, we have to consider what effect it will have on the duration of the exterior call. In particular, the exterior call must take at least as long as the interior call's blocking time plus its processing time.

At this point, we need to make a few assumptions about the connection pool. First, the connection pool is finite. Every connection pool should have a ceiling. If nothing else, the database server can only handle a finite number of connections. Second, I'm going to assume that the pool blocks when exhausted. That is, calling threads that can't get a connection right away will happily wait forever rather than abandoning the request. This is a simplifying assumption that I need for the math to work out. It's not a good configuration in practice!
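To make that second assumption concrete, here's a minimal sketch of a bounded, blocking pool using nothing but the Python standard library. The class and its methods are illustrative, not any particular library's API:

    import queue

    class BlockingConnectionPool:
        """Bounded pool; callers block indefinitely when it's empty."""

        def __init__(self, connections):
            # The ceiling is simply however many connections we put in.
            self._pool = queue.Queue()
            for conn in connections:
                self._pool.put(conn)

        def acquire(self):
            # Blocks forever when the pool is exhausted---the simplifying
            # assumption above. A production pool would pass a timeout,
            # e.g. self._pool.get(timeout=0.250).
            return self._pool.get()

        def release(self, conn):
            self._pool.put(conn)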

With these assumptions in place, I can predict the probability of blocking within the interior call. It's a formula closely related to the Erlang model from my last post, but with a twist. The Erlang models assume an essentially infinite pool of requestors. For this interior call, though, the pool of requestors is quite finite: it's the number of request handling threads for the exterior calls. Once all of those threads are busy, there aren't any left to generate more traffic on the interior call!

The formula to compute the blocking probability with a finite number of sources is the Engset formula. Like the Erlang models, Engset originated in the world of telephony. It's useful for predicting the outbound capacity needed on a private branch exchange (PBX), because the number of possible callers is known. In our case, the request handling threads are the callers and the connection pool is the PBX.
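Here's a minimal sketch of that calculation in Python (the function name and parameterization are mine): n is the number of connections in the pool, s is the number of request handling threads acting as finite sources, and a is the offered traffic per thread, in Erlangs. With the numbers from the practical example below, it reproduces the percentages in Table 1.

    from math import comb

    def engset(n, s, a):
        """Engset blocking probability for n servers (pool connections),
        s sources (request handling threads), and a Erlangs of offered
        traffic per source."""
        denominator = sum(comb(s, i) * a**i for i in range(n + 1))
        return comb(s, n) * a**n / denominator

    # Example below: 1,000,000 exterior requests per hour, each holding a
    # connection for 200 ms, generated by 40 request handling threads.
    a = (1_000_000 / 3600) * 0.200 / 40    # ~1.39 Erlangs per thread
    print(f"{engset(18, 40, a):.5%}")      # 48.34604%, matching Table 1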

Practical Example

Using our 1,000,000 page views per hour from last time, Table 1 shows the Engset blocking probability for various numbers of connections in the pool. This assumes that the application server has a maximum of 40 request handling threads. It also supposes that the database processing accounts for 200 milliseconds of the 250 milliseconds we measured for the exterior call.

Table 1. Engset blocking probability for N pool connections, with S = 40 request handling threads and A = the offered database traffic.

     N   Engset(N, A, S)
     0        100.00000%
     1         98.23183%
     2         96.37740%
     3         94.43061%
     4         92.38485%
     5         90.23293%
     6         87.96709%
     7         85.57891%
     8         83.05934%
     9         80.39867%
    10         77.58656%
    11         74.61210%
    12         71.46397%
    13         68.13065%
    14         64.60087%
    15         60.86421%
    16         56.91211%
    17         52.73932%
    18         48.34604%
    19         43.74105%
    20         38.94585%
    21         34.00023%
    22         28.96875%
    23         23.94730%
    24         19.06718%
    25         14.49235%
    26         10.40427%
    27          6.97050%
    28          4.30152%
    29          2.41250%
    30          1.21368%
    31          0.54082%
    32          0.21081%
    33          0.07093%
    34          0.02028%
    35          0.00483%
    36          0.00093%
    37          0.00014%
    38          0.00002%
    39          0.00000%
    40          0.00000%

Notice that when we get to 18 connections in the pool, the probability of blocking drops below 50%.  Also, notice how sharply the probability of blocking drops off around 23 to 31 connections in the pool. This is a decidedly nonlinear effect!

From this table, it's clear that even though there are 40 request handling threads that could call into this pool, there's not much point in having more than 30 connections in the pool. At 30 connections, the probability of blocking is already down to about 1%, meaning that queuing is only going to add a few milliseconds to the average request.

Why do we care? Why not just crank up the connection pool size to 40? After all, if we did, then no request could ever block waiting for a connection. That would minimize latency, wouldn't it?

Yes, it would, but at a cost. Increasing the number of connections to the database by a third means more memory and CPU time on the database just managing those connections, even if they're idle. If you've got two app servers, then the database probably won't notice an extra 10 connections. Suppose you scale out at the app tier, though, and you now have 50 or 60 app servers. You'd better believe that the DB will notice an extra 500 to 600 connections. They'll affect memory needs, CPU utilization, and your ability to fail over correctly when a database node goes down.
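That trade-off is easy to automate. Continuing the engset() sketch from above, you can search for the smallest pool that meets a target blocking probability instead of matching the thread count. The helper and the 1% target are my choices:

    def smallest_pool_size(threads, a, max_blocking=0.01):
        """Smallest pool whose Engset blocking probability meets the target."""
        for n in range(1, threads + 1):
            if engset(n, threads, a) <= max_blocking:
                return n
        return threads

    # With Table 1's traffic (a ~= 1.39 Erlangs per thread), a 1% target
    # yields 31 connections---consistent with "not much point in having
    # more than 30."
    print(smallest_pool_size(40, a))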

Feedback and Coupling

There's a strong coupling between the total request duration in the interior call and the request duration for the exterior call. If we assume that every request must go through the database call, then the exterior response time must be strictly greater than the interior blocking time plus the interior processing time.

In practice, it actually gets a little worse than that, as this causal loop diagram illustrates.

[Figure: Time dependencies between the interior call and the exterior call.]

It reads like this: "As the interior call's blocking time increases, the exterior call's duration increases. As exterior calls take longer, more request handling threads stay busy, and blocking on the interior call increases further." This type of representation helps clarify relations between the different layers. It's very often the case that you'll find feedback loops this way. Any time you do find a feedback loop, it means that slowdowns will produce increasing slowdowns. Blocking begets blocking, quickly resulting in a site hang.

Conclusions

Queues are like timing dots. Once you start seeing them, you'll never be able to stop. You might even start to think that your entire server farm looks like one vast, interconnected set of queues.

That's because it is.

People use database connection pools because creating new connections is very slow. Tuning your database connection pool size, however, is all about trading off the cost of queueing against the cost of extra connections. Each connection consumes resources on the database server and in the application server. Striking the right balance starts by identifying the required exterior response time, then sizing the connection pool---or changing the architecture---so the interior blocking time doesn't break the SLA.

For much, much more on the topic of capacity modeling and analysis, I definitely recommend Neil Gunther's website, Performance Agora. His books are also a great---and very practical---way to start applying performance and capacity management.