Wide Awake Developers

Subtle Interactions, Non-local Problems

| Comments

Alex Miller has a really interesting blog post up today. In LBQ + GC = slow, he shows how LinkedBlockingQueue can leave a chain of references from tenured dead objects to live young objects.  That sounds really dirty, but it actually means something to Java programmers. Something bad.

The effect here is a subtle interaction between the code and the mostly hidden, yet omnipresent, garbage collector. This interaction just happens to hit a known sore spot for the generational garbage collector. I won’t spoil the ending, because I want you to read Alex’s piece.

In effect, a one-line change to LinkedBlockingQueue dramatically changes the garbage collector’s performance. In fact, because the problem causes more full GCs, you’d be likely to observe it in an area completely unconnected with the queue itself.  By leaving these refchains worming through multiple generations in the heap, the queue damages a resource needed by every other part of the application.

This is a classic common-mode dependency, and it’s very hard to diagnose because it results from hidden and asynchronous coupling.

Combining Here Docs and Blocks in Ruby

| Comments

Like a geocache, this is another post meant to help somebody who stumbles across it in a future Google search. (Or as an external reminder for me, when I forget how I did this six months from now.)

I’ve liked here-documents since the days of shell programming. Ruby has good support for here docs with variable interpolation. For example, if I want to construct a SQL query, I can do this:

def build_query(customer_id)
  <<-SQL
    select *
      from customer
     where id = #{customer_id}
  SQL
end

Disclaimer: Don’t do this if customer_id comes from user input!
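If customer_id does come from user input, a safer variant keeps the SQL and the value separate and lets the driver do the quoting through a placeholder. (A sketch: the method name build_safe_query is mine, and a RubyDBI-style handle is assumed in the commented call.)

```ruby
# Same query shape, but the value arrives via a placeholder instead of
# string interpolation, so the driver handles quoting and escaping.
def build_safe_query
  <<-SQL
    select *
      from customer
     where id = ?
  SQL
end

# dbh.execute(build_safe_query, customer_id) { |sth| ... }
puts build_safe_query
```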

Recently, I wanted a way to build inserts using a matching number of column names and placeholders.

def build_query
  <<-SQL
    insert into #{table} ( #{columns()} ) values ( #{column_placeholders()} )
  SQL
end

In this case, columns and column_placeholders were both functions.
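Those two helpers might look something like the sketch below. (The column list and the build_insert wrapper are purely illustrative; they're not from the original code.)

```ruby
# Illustrative column list; in the real code this would come from metadata.
COLUMN_NAMES = %w[id name email]

# Comma-separated column names for the insert's column list.
def columns
  COLUMN_NAMES.join(', ')
end

# One "?" placeholder per column, so the counts always match.
def column_placeholders
  (['?'] * COLUMN_NAMES.size).join(', ')
end

def build_insert(table)
  "insert into #{table} ( #{columns} ) values ( #{column_placeholders} )"
end

puts build_insert('customer')
# => insert into customer ( id, name, email ) values ( ?, ?, ? )
```

Deriving both lists from the same source is the point: add a column in one place and the names and placeholders stay in sync.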

One oddity I ran into is the combination of here documents and block syntax. RubyDBI lets you pass a block when executing a query, the same way you would pass a block to File::open(). The block gets a "statement handle", which gets cleaned up when the block completes.

  dbh.execute(query) { |sth| 
    sth.fetch() { |row|
      # do something with the row
    }
  }

Combining these two lets you write something that looks like SQL invading Ruby:

  dbh.execute(<<-STMT) { |sth|
      select distinct customer, business_unit_id, business_unit_key_name
       from problem_ticket_lz
       order by customer
    STMT
    sth.fetch { |row|
      print "#{row[1]}\t#{row[0]}\t#{row[2]}\n"
    }
  }
This looks pretty good overall, but take a look at how the block opening interacts with the here doc. The here doc appears to be line-oriented, so it always begins on the line after the <<-STMT token. On the other hand, the block open follows the function, so the here doc gets lexically interpolated in the middle of the block, even though it has no syntactic relation to the block. No real gripe, just an oddity.
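If the interleaving bothers you, one way to sidestep it is to bind the here doc to a variable first, so the block reads top to bottom with no SQL in the middle (a minor restructuring of the same query):

```ruby
# Bind the here doc first; the block that follows contains only Ruby.
query = <<-STMT
    select distinct customer, business_unit_id, business_unit_key_name
     from problem_ticket_lz
     order by customer
STMT

# dbh.execute(query) { |sth|
#   sth.fetch { |row|
#     print "#{row[1]}\t#{row[0]}\t#{row[2]}\n"
#   }
# }
puts query
```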

Beautiful Architecture

| Comments

O’Reilly has released "Beautiful Architecture," a compilation of essays by software and system architects. I’m happy to announce that I have a chapter in this book. The finished book is shipping now, and available through Safari. I think the whole thing has turned out amazingly well, both instructive and interesting.

One of the editors, Diomidis Spinellis, has posted an excellent description and summary.

Another Cause of TNS-12541

| Comments

There are about a jillion forum posts and official pages on the web that talk about ORA-12541, the infamous "TNS:No Listener" error. Somewhere around 70% of them appear to be link-farmers who just scrape all the Oracle forums and mailing lists.  Virtually all of the pages just refer back to the official definition from Oracle, which says "there’s no listener running on the server" and tells you to log in to the server as admin and start up the listener.

Not all that useful, especially if you’re not the DBA.

I found a different way that you can get the same error code, even when the listener is running. Special thanks to blogger John Jacob, whose post didn’t quite solve my problem, but did set me on the right track.

Here’s my situation. My client is a laptop connecting to the destination network through a VPN client. I’m connecting to an Oracle 10g service with 2 nodes. Tnsping reported success, the connection assistant could connect successfully, but sqlplus always reported TNS-12541 TNS:No listener.  The listener was fine.

Turning on client side tracing, I saw that the initial connection attempt to the service VIP was successful, but that the server then sends back a packet with the hostname of a specific node to use. Here’s where the problem begins.

Thanks to some quirk in the VPN configuration, I can only resolve DNS names on the VPN if they’re fully qualified. The default search domain just flat doesn’t work.  So, I can resolve proddb02.example.com but not proddb02. That’s the catch, because the database sends back just the host portion of the node, not the FQDN. DNS resolution fails, but sqlplus reports it as "No listener", rather than saying "Host not found" or something useful like that.

Again, there are a jillion posts and articles telling network admins how to fix the default domain search on a VPN concentrator. And, again, I’m not the network admin, either.

The best I can do as a user is work around this issue by adding the IPs of the physical DB nodes to the hosts file on my own client machine.  Sure, some day it’ll break when we re-address the DB nodes, and I will have long forgotten that I even put those addresses in C:\Windows\System32\Drivers\etc\hosts. Still, at least it works for now.

Using a Custom WindowProc From Ruby

| Comments

This is off the beaten path today, maybe even off the whole reservation. Still, I searched for some code to do this, and couldn’t find it. Maybe this will help somebody else trying to do the same thing.

I’m currently prototyping a desktop utility using Ruby and wxRuby. The combination actually makes Windows desktop programming palatable, which is a very pleasant surprise.

Part of what I’m doing involves showing messages with Snarl. I want my Ruby program to generate messages that can be clicked. Snarl is happy to tell you that your message has been clicked. It does it by sending your window a message, using whatever message code you want.

So, for example, if I want to get a WM_USER message back, then I create a new notification like this:

@msg = Snarl.new('Clickable message', {:message => 'Click me, please!', :timeout => Snarl::NO_TIMEOUT, :reply_window => @win_handle, :reply_window_message => Windows::WM_USER})

If the user clicks on my message, I’ll get a WM_USER event delivered to my window (identified by @win_handle). Since I’m using wxRuby, which wraps wxWidgets, that presents a bit of a problem. Although wxWidgets allows you to subclass its default window proc, wxRuby does not. A couple of forum posts suggested using the Windows API to hook the window proc, which is what I did.

Here’s the code:

begin
  require 'rubygems'
rescue LoadError
end

I installed wxRuby as a gem, so that’s boilerplate.

require 'lib/snarl'
require 'wx'
require 'windows/api'

module WindProc
  include Windows

  WM_USER = 0x04FF

  API.auto_namespace = 'WindProc'
  API.auto_constant = true

  API.new('SetWindowLong', 'LIK', 'L', 'user32')
  API.new('CallWindowProc', 'PIIIL', 'L', 'user32')
end

This module just gets me access to the Windows API functions SetWindowLong and CallWindowProc. SetWindowLong is deprecated in favor of SetWindowLongPtr, but I couldn’t get that to load properly through the windows/api module. At some point, when you’re prototyping something, you just have to decide not to solve every puzzle, especially if you can find a workable alternative.

API.new() constructs a Ruby object implemented by some C native code. It uses the prototype string in the second argument to translate Ruby parameters into C values when you eventually call the API function. The conversion is done in glue code that knows how to map some Ruby primitives to C values, but it’s not all that bright. In particular, there’s no way to introspect on the Win32 API itself to see if you’re lying to the glue code. In fact, I’m lying a little bit here. The prototype I used—'LIK'—tells the API module that I’m looking for a function that takes a long, an integer, and a callback. Strictly speaking, this should have been 'LIL', but I needed the glue code to convert a Ruby procedure into a C pointer.

The next section defines a subclass of Wx::Frame, the base type for all standalone windows.

class HookedFrame < Wx::Frame
  def initialize(parent, id, title)
    super(parent, -1, title)

    evt_window_create() { |event| on_window_create(event) }
  end

I register a handler for the window create event. At this point, I’m still within the bounds of wxWidget’s own event handling framework. The interesting bits happen inside the on_window_create method.

  def on_window_create(event)
    @old_window_proc = 0
    @my_window_proc = Win32::API::Callback.new('LIIL', 'I') { |hwnd, umsg, wparam, lparam|
      if not self.hooked_window_proc(hwnd, umsg, wparam, lparam) then
        WindProc::CallWindowProc.call(@old_window_proc, hwnd, umsg, wparam, lparam)
      end
    }
    @old_window_proc = WindProc::SetWindowLong.call(self.handle, WindProc::GWL_WNDPROC, @my_window_proc)
  end

There are several juicy bits here. First, I’m using Win32::API::Callback.new() to create a callback object. How does this get used? It’s a little roundabout. When I call WindProc::SetWindowLong(), I pass the callback object. (This is why I used 'LIK' as the prototype string earlier.) Now, WindProc::SetWindowLong() isn’t just a pointer to the native Windows library function. It’s actually a Ruby object that wraps the library function. The API object is implemented by C code. Like the API object, the callback object is a Ruby object implemented by C code. In particular, it has an ivar that points to a Ruby procedure. Because I passed a block to Callback.new(), the block itself will be the procedure. Inside API.call(), any argument of type "K" gets set as the "active callback" and then substituted with a C function called CallbackFunction. CallbackFunction looks up the active callback, translates parameters according to the callback’s prototype, then tells Ruby to invoke the proc associated with the callback.


So, I call SetWindowLong.call(), passing it the Callback I created with a block. SetWindowLong.call() ultimately calls the Windows DLL function SetWindowLong, passing it the address of CallbackFunction. When Windows calls CallbackFunction, it looks up the Ruby Callback object and invokes its procedure.

Another oddity. For some reason, although the callback object has an instance variable called @function, there seems to be no way to set it after construction. If you pass a block, @function will point to the block. If you don’t, @function will be nil, with no way to set it to anything else. In other words, the API will happily let you create useless Callback objects.

The rest is easy. Inside my block, I just call out to a method that can be overridden by descendants of HookedFrame. My test implementation just blurts out some stuff to let me know the plumbing is working.

  def hooked_window_proc(hwnd, uMsg, wParam, lParam)
    puts "In the hook: 0x#{uMsg.to_s(16)}\t#{wParam}\t#{lParam}\n"
    if uMsg == NotifierApp::WM_USER then
      puts "That's what I've been waiting to hear:\t#{wParam}\t#{lParam}\n"
      true   # handled; don't pass to the original window proc
    else
      false  # not handled; fall through to the original window proc
    end
  end

As I reviewed this post, I realized something else. ActiveCallback is static in the C glue code. That means there can only be one callback set at a time. If I called some other Windows API function with its own callback, that would overwrite the reference to my Ruby code. But, Windows would still keep calling to the same pointer as before. In other words, calling any other Windows API function that takes a callback would cause that callback to become my window proc! Yikes!

Overall, this works, but seems like a kludge. Ironically, even as I got this working, I started getting dissatisfied with Snarl itself. I think I need more flexibility to display persistent information, rather than just alerts.

OTUG Tonight

| Comments

This evening, I’m speaking at OTUG. The topic is "Clouds, Grids, and Fog".

There’s no denying that "cloud" has become a huge buzzword. It’s a crossover trend, too. It’s not just the CIO who is interested in cloud computing. It’s the CFO and the CMO, too. (Not to mention the CSO, if there is one.)  Underneath the buzz, though, there is something real and valuable.

I will talk about the driving trends that are leading us toward cloud computing and how it differs from grids and software-as-a-service. I’ll also talk at length about the architectural implications and effects of running your software on a cloud.

If you live in the Twin Cities, I hope to see you there.

Attack of Self-Denial, 2008 Style

| Comments

"Good marketing can kill your site at any time."

–Paul Lord, 2006

I just learned of another attack of self-denial from this past week.

Many retailers are suffering this year, particularly in the brick-and-mortar space. I have heard from several, though, who say that their online performance is not suffering as much as the physical stores are. In some cases, where the brand is strong and the products are not fungible, the online channel is showing year-over-year growth.

One retailer I know was running strong, with the site near its capacity. They fit the bill for an online success in 2008. They have great name recognition, a very strong global brand, and their customers love their products. This past week, their marketing group decided to "take it to the next level."

They blasted an email campaign to four million customers.  It had a good offer, no qualifier, and a very short expiration time—one day only.  A short expiration like that creates a sense of urgency.  Good marketing reaches people and induces them to act, and in that respect, the email worked. Unfortunately, when that means millions of users hitting your site, you may run into trouble.

Traffic flooded the site and knocked it offline. It took more than 6 hours to get everything functioning again.

Instead of getting an extra bump in sales, they lost six hours of holiday-season revenue. As a rule of thumb, you should assume that a peak hour of holiday sales counts for six hours of off-season sales.

There are other technological solutions to help with this kind of traffic flood. For instance, the UX group can create a static landing page for the offer. Then marketing links to that static page in their email blast. Ops can push that static page out into their cache servers, or even into their CDN’s edge network. This requires some preparation for each offer, and it takes some extra preparation before the first such offer, but it’s very effective. The static page absorbs the bulk of the traffic, so only customers who really want to buy get passed into the dynamic site.

Marketing can also send the email out in waves, so people receive it at different times. That spreads the traffic spike out over a few hours. (Though this doesn’t work so well when you send the waves throughout the night, because customers will all see it in a couple of hours in the morning.)

In really extreme cases, a portion of capacity can be carved out and devoted to handling promotional traffic. That way, if the promotion goes nuclear, at least the rest of the site is still online. Obviously, this would be more appropriate for a long-running promotion than a one-day event.

Of course, it should be obvious that all of these technological solutions depend on good communication.

At a surface level, it’s easy to say that this happened because marketing had no idea how close to the edge the site was already running. That’s true. It’s also true, however, that operations previously had no idea what the capacity was. If marketing called and asked, "Can we support 4 million extra visits?" the current operations group could have answered "no". Previously, the answer would have been "I don’t know."

So, operations got better, but marketing never got re-educated. Lines of communication were never opened, or re-opened. Better communication would have helped.

In any online business, you must have close communications between marketing, UX, development, and operations. They need to regard themselves as part of one integrated team, rather than four separate teams. I’ve often seen development groups that view operations as a barrier to getting their stuff released. UX and marketing view development as the barrier to getting their ideas implemented, and so on. This dynamic evolves from the "throw it over the wall" approach, and it can only result in finger-pointing and recriminations.

I’d bet there’s a lot of finger-pointing going on in that retailer’s hallways this weekend.

(Human | Pattern) Languages, Part 2

| Comments

At the conclusion of the modulating bridge, we expect to be in the contrasting key of C minor. Instead, the bridge concludes in the distantly related key of F sharp major… Instead of resolving to the tonic, the cadence concludes with two isolated E pitches. They are completely ambiguous. They could belong to E minor, the tonic for this movement. They could be part of E major, which we’ve just heard peeking out from behind the minor mode curtains. [He] doesn’t resolve them into a definite key until the beginning of the third movement, characteristically labeled a "Scherzo".

In my last post, I lamented the missed opportunity we had to create a true pattern language about software. Perhaps calling it a missed opportunity is too pessimistic. Bear with me on a bit of a tangent. I promise it comes back around in the end.

The example text above is an amalgam of a lecture series I’ve been listening to. I’m a big fan of The Teaching Company and their courses. In particular, I’ve been learning about the meaning and structure of classical, baroque, romantic, and modern music from Professor Robert Greenberg.1 The sample I used here is from a series on Beethoven’s piano sonatas. This isn’t an actual quote, but a condensation of statements from one of the lectures. I’m not going to go into all the music theory behind this, but it is interesting.2

There are two things I want you to observe about the sample text. First, it’s loaded with jargon. It has to be! You’d exhaust the conversational possibilities about the best use of a D-sharp pretty quickly. Instead, you’ll talk about structures, tonalities, relationships between that D-sharp and other pitches. (D-sharp played together with a C? Very different from a quick sequence of D-sharp, E, D-sharp, C.) You can be sure that composers don’t think in terms of individual notes. A D-sharp by itself doesn’t mean anything. It only acquires meaning by its relation to other pitches. Hence all that stuff about keys—tonic, distantly related, contrasting. "Key" is a construct for discussing whole collections of pitches in a kind of shorthand. To a musician, there’s a world of difference between G major and A flat minor, even though the basic pitch (the tonic) is only one half-step apart.

Also notice that the text addresses some structural features. The purpose and structure of a modulating bridge is pretty well understood, at least in certain circles. The notion that you can have an "expected" key certainly implies that there are rules for a sonata. In fact, the term "sonata" itself means some fairly specific things3… although to know whether we’re talking about "a sonata" or "a movement in sonata form" requires some additional context.

In fact, this paragraph is all about context. It exists in the context of late Classical, early Romantic era music, specifically the music of Beethoven. In the Classical era, musical forms—such as sonata form—pretty much dictated the structure of the music. The number of movements, their relationships to each other, their keys, and even their tempos were well understood. A contemporary listener had every reason to expect that a first movement would be fast and bright, and if the first movement was in C major, then the second, slower movement would be a minuet and trio in G major.

Music and music theory have evolved over the last thousand-odd years. We have a vocabulary—the potentially off-putting jargon of the field. We have nesting, interrelating contexts. Large scale patterns (a piano sonata) create context for medium scale patterns (the first movement "allegretto") which, in turn, create context for the medium and small scale patterns (the first theme in the allegretto consists of an ABA’BA phrasing, in which the opening theme sequences a motive upward over octaves.)  We even have the ability to talk about non sequiturs—like the modulating bridge above—where deliberate violation of the pattern language is done for effect.4

What is all this stuff if it isn’t a pattern language?

We can take a few lessons, then, from the language of music.

The first lesson is this: give it time. Musical language has evolved over a long time. It has grown and been pruned back over centuries. New terms are invented as needed to describe new answers to a context. In turn, these new terms create fresh contexts to be exploited with yet other inventions.

Second, any such language must be able to assimilate change. Nothing is lost, even amidst the most radical revolutions. When the Twentieth Century modernists rejected the tonal system, they could only reject the structures and strictures of that language. They couldn’t destroy the language itself. Phish plays fugues in concert… they just play them with electric guitars instead of harpsichords. There are Baroque orchestras today. They play in the same concert halls as the Pops and Philharmonics. The homophonic texture of plain chant still exists, and so do the once-heretical polyphony and church-sanctioned monophony. Nothing is lost, but new things can be encompassed and incorporated.

And, mainframes still exist with their COBOL programs, together with distributed object systems, message passing, and web services. The Singleton and Visitor patterns will never truly go away, any more than batch programming will disappear.

Third, we must continue to look at the relationships between different parts of our nascent pattern language. Just as individual objects aren’t very interesting, isolated patterns are less interesting than the ways they can interact with each other.

I believe that the true language of software has as much to do with programming languages as the language of music has to do with notes. So, instead of missed opportunity, let us say instead that we are just beginning to discover our true language.

1. Professor Greenberg is a delightful traveling companion. He’s witty, knowledgeable and has a way of teaching complex subjects without ever being condescending. He also sounds remarkably like Penn Jillette.

2. The main reason is that I would surely get it wrong in some details and risk losing the main point of my post here.

3. And here we see yet another of the complexities of language. The word "sonata" refers, at different times, to a three movement concert work, a single movement in a characteristic structure, a four movement concert work, and in Beethoven’s case, to a couple of great fantasias that he declares to be sonatas simply because he says so.

4. For examples ad nauseam, see Richard Wagner and the "abortive gesture".

(Human | Pattern) Languages

| Comments

We missed the point when we adopted "patterns" in the software world. Instead of an organic whole, we got a bag of tricks.

The commonly accepted definition of a pattern is "a solution to a problem in a context." This is true, but limiting. This definition loses an essential characteristic of patterns: Patterns relate to other patterns.

We talk about the context of a problem. "Context" is a mental shorthand. If we unpack the context it means many things: constraints, capabilities, style, requirements, and so on. We sometimes mislead ourselves by using the fairly fuzzy, abstract term "context" as a mental handle on a whole variety of very concrete issues. Context includes stated constraints like the functional requirements, along with unstated constraints like, "The computation should complete before the heat death of the universe." It includes other forces like, "This program is written in C#, so the solution to this problem should be in the same language or a closely related one." It should not require a supercooled quantum computer, for example.

Where does the context for a small-scale pattern originate?1 Context does not arise ex nihilo. No, the context for a small-scale pattern is created by larger patterns. Large grained patterns create the fabric of forces that we call the context for smaller patterns. In turn, smaller patterns fit into this fabric and, by their existence, they change it. Thus, the small scale patterns create feedback that can either resolve or exacerbate tensions inherent in the larger patterns.

Solutions that respect their context fit better with the rest of the organic whole. It would be strange to be reading some Java code, built into layered architecture with a relational database for storage, then suddenly find one component that has its own LISP interpreter and some functional code. With all respect to "polyglot programming", there’d better be a strong motivation for such an odd inclusion. It would be a discontinuity… in other words, it doesn’t fit the context I described. That context—the layered architecture, the OO language, relational database—was created by other parts of the system.

If, on the other hand, the system was built as a blackboard architecture, using LISP as glue code over intelligent agents acting asynchronously, then it wouldn’t be at all odd to find some recursive lambda expressions. In that context, they fit naturally and the Java code would be an oddity.

This interrelation across scale knits patterns together into a pattern language. By and large, what we have today is a growing group of proper nouns. Please don’t get me wrong, the nouns themselves have use. It’s very helpful to say "you want a Null Object there," and be understood. That vocabulary and the compression it provides is really important.

But we shouldn’t mistake a group of nouns for a real pattern language. A language is more than just its nouns. A language also implies ways of connecting statements sensibly. It has idioms and semantics and semiotics.2 In a language, you can have dialog and argumentation.  Imagine a dialog in patterns as they exist today:

"Pipes and filters."


"Chain of Responsibility!"

You might be able to make a comedy sketch out of that, but not much more. We cannot construct meaningful dialogs about patterns at all scales.

What we have are fragments of what might become a pattern language. GoF, the PLoPD books, the PoSA books… these are like a few charted territories on an unmapped continent. We don’t yet have the language that would even let us relate these works together, let alone relating them to everything else.

Everything else?  Well, yes. By and large, patterns today are an outgrowth of the object-oriented programming community.  I contend, however, that "object-oriented" is a pattern! It’s a large-scale pattern that creates really significant context for all the other patterns that can work within it. Solutions that work within the "object-oriented" context make no sense in an actor-oriented context, or a functional context, or a procedural context, and so on. Each of these other large-scale patterns admits different solutions to similar problems: persistence, user interaction, and system integration, to name a few. I can imagine a pattern called "Event Driven" that would work very well with "Object oriented", "Functional", and "Actor Oriented", but somewhat less well with "Procedural programming", and would contradict "Batch Processing" utterly. (Though there might be a link between them called "Buffer file" or something like that.)

That’s the piece that we missed. We don’t have a pattern language yet. We’re not even close.

1. By "large" and "small", I don’t mean to imply that patterns simply nest hierarchically. It’s more complex and subtle than that. When we do have a real pattern language, we’ll find that there are medium-grained patterns that work together with several, but not all, of the large ones. Likewise, we’ll find small-scale patterns that make medium sized ones more or less practical. It’s not a decision tree or a heuristic.

2. That’s what keeps, "Fill the idea with blue" from being a meaningful sentence. All the words work, and they’re even the right part of speech, yet the sentence as a whole doesn’t fit together.

Connection Pools and Engset

| Comments

In my last post, I talked about using Erlang models to size the front end of a system. By using some fundamental capacity models that are almost a century old, you can estimate the number of request handling threads you need for a given traffic load and request duration.

Inside the Box

It gets tricky, though, when you start to consider what happens inside the server itself. Processing the request usually involves some kind of database interaction with a connection pool. (There are many ways to avoid database calls, or at least minimize the damage they cause. I’ll address some of these in a future post, but you can also check out Two Ways to Boost Your Flagging Web Site for starters.) Database calls act like a kind of "interior" request that can be considered to have its own probability of queuing.

Exterior call to server becomes an "interior" call to a database.

Because this interior call can block, we have to consider what effects it will have on the duration of the exterior call. In particular, the exterior call must take at least the sum of the blocking time plus the processing time for the interior call.

At this point, we need to make a few assumptions about the connection pool. First, the connection pool is finite. Every connection pool should have a ceiling. If nothing else, the database server can only handle a finite number of connections. Second, I’m going to assume that the pool blocks when exhausted. That is, calling threads that can’t get a connection right away will happily wait forever rather than abandoning the request. This is a simplifying assumption that I need for the math to work out. It’s not a good configuration in practice!

With these assumptions in place, I can predict the probability of blocking within the interior call. It’s a formula closely related to the Erlang model from my last post, but with a twist. The Erlang models assume an essentially infinite pool of requestors. For this interior call, though, the pool of requestors is quite finite: it’s the number of request handling threads for the exterior calls. Once all of those threads are busy, there aren’t any left to generate more traffic on the interior call!

The formula to compute the blocking probability with a finite number of sources is the Engset formula. Like the Erlang models, Engset originated in the world of telephony. It’s useful for predicting the outbound capacity needed on a private branch exchange (PBX), because the number of possible callers is known. In our case, the request handling threads are the callers and the connection pool is the PBX.
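The Engset calculation is easy to transcribe. Here is a sketch in Ruby (the method names, and the illustrative load of 0.5 erlangs per source, are mine, not figures from the article): S sources are the request handling threads, N servers are the pooled connections, and alpha is the offered load per idle source.

```ruby
# C(n, k), built incrementally so every intermediate value stays an exact integer.
def binomial(n, k)
  return 0 if k < 0 || k > n
  (1..k).reduce(1) { |acc, i| acc * (n - i + 1) / i }
end

# Engset blocking probability: S sources each offering alpha erlangs
# (while idle) to a pool of N servers.
def engset_blocking(sources, servers, alpha)
  denominator = (0..servers).reduce(0.0) { |sum, i|
    sum + binomial(sources - 1, i) * alpha**i
  }
  binomial(sources - 1, servers) * alpha**servers / denominator
end

# With 40 request-handling threads, blocking falls as the pool grows,
# and reaches exactly zero once every thread can hold a connection at once.
[10, 20, 30, 40].each do |pool_size|
  printf("%2d connections: %.4f\n", pool_size, engset_blocking(40, pool_size, 0.5))
end
```

Note the binomial(sources - 1, servers) term in the numerator: when the pool is as large as the thread count, that coefficient is zero, which is the formal version of "no request can ever block."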

Practical Example

Using our 1,000,000 page views per hour from last time, Table 1 shows the Engset table for various numbers of connections in the pool. This assumes that the application server has a maximum of 40 request handling threads. This also supposes that the database processing time uses 200 milliseconds of the 250 milliseconds we measured for the exterior call.


Notice that when we get to 18 connections in the pool, the probability of blocking drops below 50%.  Also, notice how sharply the probability of blocking drops off around 23 to 31 connections in the pool. This is a decidedly nonlinear effect!

From this table, it’s clear that even though there are 40 request handling threads that could call into this pool, there’s not much point in having more than 30 connections in the pool. At 30 connections, the probability of blocking is already less than 1%, meaning that the queuing time is only going to add a few milliseconds to the average request.

Why do we care? Why not just crank up the connection pool size to 40? After all, if we did, then no request could ever block waiting for a connection. That would minimize latency, wouldn’t it?

Yes, it would, but at a cost. Increasing the number of connections to the database by a third means more memory and CPU time on the database just managing those connections, even if they’re idle. If you’ve got two app servers, then the database probably won’t notice an extra 10 connections. Suppose you scale out at the app tier, though, and you now have 50 or 60 app servers. You’d better believe that the DB will notice an extra 500 to 600 connections. They’ll affect memory needs, CPU utilization, and your ability to fail over correctly when a database node goes down.

Feedback and Coupling

There’s a strong coupling between the total request duration in the interior call and the request duration for the exterior call. If we assume that every request must go through the database call, then the exterior response time must be strictly greater than the interior blocking time plus the interior processing time.

In practice, it actually gets a little worse than that, as this causal loop diagram illustrates.

 Time dependencies between the interior call and the exterior call.

It reads like this: "As the interior call blocking time increases, the exterior call duration increases." This type of representation helps clarify relations between the different layers. It’s very often the case that you’ll find feedback loops this way. Any time you do find a feedback loop, it means that slowdowns will produce increasing slowdowns. Blocking begets blocking, quickly resulting in a site hang.


Queues are like timing dots. Once you start seeing them, you’ll never be able to stop. You might even start to think that your entire server farm looks like one vast, interconnected set of queues.

That’s because it is.

People use database connection pools because creating new connections is very slow. Tuning your database connection pool size, however, is all about optimizing the cost of queueing against the cost of extra connections. Each connection consumes resources on the database server and in the application server. Striking the right balance starts by identifying the required exterior response time, then sizing the connection pool—or changing the architecture—so the interior blocking time doesn’t break the SLA.

For much, much more on the topic of capacity modeling and analysis, I definitely recommend Neil Gunther’s website, Performance Agora. His books are also a great—and very practical—way to start applying performance and capacity management.