Wide Awake Developers

Another Cause of TNS-12541


There are about a jillion forum posts and official pages on the web that talk about ORA-12541, the infamous "TNS:No Listener" error. Somewhere around 70% of them appear to be link-farmers who just scrape all the Oracle forums and mailing lists.  Virtually all of the pages just refer back to the official definition from Oracle, which says "there’s no listener running on the server" and tells you to log in to the server as admin and start up the listener.

Not all that useful, especially if you’re not the DBA.

I found a different way that you can get the same error code, even when the listener is running. Special thanks to blogger John Jacob, whose post didn’t quite solve my problem, but did set me on the right track.

Here’s my situation. My client is a laptop connecting to the destination network through a VPN client. I’m connecting to an Oracle 10g service with 2 nodes. Tnsping reported success, the connection assistant could connect successfully, but sqlplus always reported TNS-12541: TNS:no listener. The listener was fine.

Turning on client side tracing, I saw that the initial connection attempt to the service VIP succeeded, but that the server then sent back a packet with the hostname of a specific node to use. Here’s where the problem begins.

Thanks to some quirk in the VPN configuration, I can only resolve DNS names on the VPN if they’re fully qualified. The default search domain just flat doesn’t work.  So, I can resolve proddb02.example.com but not proddb02. That’s the catch, because the database sends back just the host portion of the node, not the FQDN. DNS resolution fails, but sqlplus reports it as "No listener", rather than saying "Host not found" or something useful like that.

Again, there are a jillion posts and articles telling network admins how to fix the default domain search on a VPN concentrator. And, again, I’m not the network admin, either.

The best I can do as a user is work around this issue by adding the IPs of the physical DB nodes to the hosts file on my own client machine.  Sure, some day it’ll break when we re-address the DB nodes, and I will have long forgotten that I even put those addresses in C:\Windows\System32\Drivers\etc\hosts. Still, at least it works for now.
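For the record, the workaround amounts to a couple of lines like these in the hosts file (the addresses are invented, and I’m assuming the first node follows the same naming pattern as proddb02):

10.1.2.11    proddb01    proddb01.example.com
10.1.2.12    proddb02    proddb02.example.com

The short names are the important part, since the short hostname is what the database hands back.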

Using a Custom WindowProc From Ruby


This is off the beaten path today, maybe even off the whole reservation. Still, I searched for some code to do this, and couldn’t find it. Maybe this will help somebody else trying to do the same thing.

I’m currently prototyping a desktop utility using Ruby and wxRuby. The combination actually makes Windows desktop programming palatable, which is a very pleasant surprise.

Part of what I’m doing involves showing messages with Snarl. I want my Ruby program to generate messages that can be clicked. Snarl is happy to tell you that your message has been clicked. It does it by sending your window a message, using whatever message code you want.

So, for example, if I want to get a WM_USER message back, then I create a new notification like this:

@msg = Snarl.new('Clickable message', {
  :message              => 'Click me, please!',
  :timeout              => Snarl::NO_TIMEOUT,
  :reply_window         => @win_handle,
  :reply_window_message => Windows::WM_USER
})

If the user clicks on my message, I’ll get a WM_USER event delivered to my window (identified by @win_handle). Since I’m using wxRuby, which wraps wxWidgets, that presents a bit of a problem. Although wxWidgets allows you to subclass its default window proc, wxRuby does not. A couple of forum posts suggested using the Windows API to hook the window proc, which is what I did.

Here’s the code:

begin
  require 'rubygems'
rescue LoadError
end

I installed wxRuby as a gem, so that’s boilerplate.

require 'lib/snarl'
require 'wx'
require 'windows/api'

module WindProc
  include Windows
  
  GWL_WNDPROC = -4

  WM_USER = 0x04FF

  API.auto_namespace = 'WindProc'
  API.auto_constant = true

  API.new('SetWindowLong', 'LIK', 'L', 'user32')
  API.new('CallWindowProc', 'PIIIL', 'L', 'user32')
end

This module just gets me access to the Windows API functions SetWindowLong and CallWindowProc. SetWindowLong is deprecated in favor of SetWindowLongPtr, but I couldn’t get that to load properly through the windows/api module. At some point, when you’re prototyping something, you just have to decide not to solve every puzzle, especially if you can find a workable alternative.

API.new() constructs a Ruby object implemented by some C native code. It uses the prototype string in the second argument to translate Ruby parameters into C values when you eventually call the API function. The conversion is done in glue code that knows how to map some Ruby primitives to C values, but it’s not all that bright. In particular, there’s no way to introspect on the Win32 API itself to see if you’re lying to the glue code. In fact, I’m lying a little bit here. The prototype I used—'LIK'—tells the API module that I’m looking for a function that takes a long, an integer, and a callback. Strictly speaking, this should have been 'LIL', but I needed the glue code to convert a Ruby procedure into a C pointer.

The next section defines a subclass of Wx::Frame, the base type for all standalone windows.

class HookedFrame < Wx::Frame
  def initialize(parent, id, title)
    super(parent, id, title)

    evt_window_create() { |event| on_window_create(event) }
  end

I register a handler for the window create event. At this point, I’m still within the bounds of wxWidget’s own event handling framework. The interesting bits happen inside the on_window_create method.

  def on_window_create(event)
    @old_window_proc = 0
    @my_window_proc = Win32::API::Callback.new('LIIL', 'I') { |hwnd, umsg, wparam, lparam|
      if self.hooked_window_proc(hwnd, umsg, wparam, lparam)
        0  # handled here; a window proc must still return an integer
      else
        WindProc::CallWindowProc.call(@old_window_proc, hwnd, umsg, wparam, lparam)
      end
    }
    @old_window_proc = WindProc::SetWindowLong.call(self.handle, WindProc::GWL_WNDPROC, @my_window_proc)
  end

There are several juicy bits here. First, I’m using Win32::API::Callback.new() to create a callback object. How does this get used? It’s a little roundabout. When I call WindProc::SetWindowLong(), I pass the callback object. (This is why I used 'LIK' as the prototype string earlier.) Now, WindProc::SetWindowLong() isn’t just a pointer to the native Windows library function. It’s actually a Ruby object that wraps the library function. The API object is implemented by C code. Like the API object, the callback object is a Ruby object implemented by C code. In particular, it has an ivar that points to a Ruby procedure. Because I passed a block to Callback.new(), the block itself will be the procedure. Inside API.call(), any argument of type "K" gets set as the "active callback" and then substituted with a C function called CallbackFunction. CallbackFunction looks up the active callback, translates parameters according to the callback’s prototype, then tells Ruby to invoke the proc associated with the callback.

Whew.

So, I call SetWindowLong.call(), passing it the Callback I created with a block. SetWindowLong.call() ultimately calls the Windows DLL function SetWindowLong, passing it the address of CallbackFunction. When Windows calls CallbackFunction, it looks up the Ruby Callback object and invokes its procedure.

Another oddity. For some reason, although the callback object has an instance variable called @function, there seems to be no way to set it after construction. If you pass a block, @function will point to the block. If you don’t, @function will be nil, with no way to set it to anything else. In other words, the API will happily let you create useless Callback objects.

The rest is easy. Inside my block, I just call out to a method that can be overridden by descendants of HookedFrame. My test implementation just blurts out some stuff to let me know the plumbing is working.

  def hooked_window_proc(hwnd, uMsg, wParam, lParam)
    puts "In the hook: 0x#{uMsg.to_s(16)}\t#{wParam}\t#{lParam}\n"
    if uMsg == Windows::WM_USER then
      puts "That's what I've been waiting to hear:\t#{wParam}\t#{lParam}\n"
      return true
    end
    false
  end

As I reviewed this post, I realized something else. ActiveCallback is static in the C glue code. That means there can only be one callback set at a time. If I called some other Windows API function with its own callback, that would overwrite the reference to my Ruby code. But, Windows would still keep calling the same pointer as before. In other words, calling any other Windows API function that takes a callback would cause that callback to become my window proc! Yikes!

Overall, this works, but seems like a kludge. Ironically, even as I got this working, I started getting dissatisfied with Snarl itself. I think I need more flexibility to display persistent information, rather than just alerts.

OTUG Tonight


This evening, I’m speaking at OTUG. The topic is "Clouds, Grids, and Fog".

There’s no denying that "cloud" has become a huge buzzword. It’s a crossover trend, too. It’s not just the CIO who is interested in cloud computing. It’s the CFO and the CMO, too. (Not to mention the CSO, if there is one.)  Underneath the buzz, though, there is something real and valuable.

I will talk about the driving trends that are leading us toward cloud computing and how it differs from grids and software-as-a-service. I’ll also talk at length about the architectural implications and effects of running your software on a cloud.

If you live in the Twin Cities, I hope to see you there.

Attack of Self-Denial, 2008 Style


"Good marketing can kill your site at any time."

–Paul Lord, 2006

I just learned of another attack of self-denial from this past week.

Many retailers are suffering this year, particularly in the brick-and-mortar space. I have heard from several, though, who say that their online performance is not suffering as much as the physical stores are. In some cases, where the brand is strong and the products are not fungible, the online channel is showing year-over-year growth.

One retailer I know was running strong, with the site near its capacity. They fit the bill for an online success in 2008. They have great name recognition, a very strong global brand, and their customers love their products. This past week, their marketing group decided to "take it to the next level."

They blasted an email campaign to four million customers.  It had a good offer, no qualifier, and a very short expiration time—one day only.  A short expiration like that creates a sense of urgency.  Good marketing reaches people and induces them to act, and in that respect, the email worked. Unfortunately, when that means millions of users hitting your site, you may run into trouble.

Traffic flooded the site and knocked it offline. It took more than 6 hours to get everything functioning again.

Instead of getting an extra bump in sales, they lost six hours of holiday-season revenue. As a rule of thumb, you should assume that a peak hour of holiday sales counts for six hours of off-season sales.

There are technological solutions to help with this kind of traffic flood. For instance, the UX group can create a static landing page for the offer. Then marketing links to that static page in their email blast. Ops can push that static page out into their cache servers, or even into their CDN’s edge network. This requires a little preparation for each offer, and some extra setup before the first one, but it’s very effective. The static page absorbs the bulk of the traffic, so only customers who really want to buy get passed into the dynamic site.
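As a sketch of the idea in Ruby (assuming a Rack-style app and a hypothetical pre-built offer.html; the one-day max-age matches the offer’s expiration):

require 'rack'

# Serve the pre-built offer page as a static, cacheable document so that
# cache servers and the CDN can absorb the bulk of the traffic.
offer_page = File.read('offer.html')

app = lambda do |env|
  [200,
   { 'Content-Type'  => 'text/html',
     # Public and cacheable for the life of the one-day offer.
     'Cache-Control' => 'public, max-age=86400' },
   [offer_page]]
end

Rack::Handler::WEBrick.run(app, :Port => 8080)

Only the "buy" links on that page need to reach the dynamic site.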

Marketing can also send the email out in waves, so people receive it at different times. That spreads the traffic spike out over a few hours. (Though this doesn’t work so well when you send the waves throughout the night, because customers will all see it in a couple of hours in the morning.)

In really extreme cases, a portion of capacity can be carved out and devoted to handling promotional traffic. That way, if the promotion goes nuclear, at least the rest of the site is still online. Obviously, this would be more appropriate for a long-running promotion than a one-day event.

Of course, it should be obvious that all of these technological solutions depend on good communication.

At a surface level, it’s easy to say that this happened because marketing had no idea how close to the edge the site was already running. That’s true. It’s also true, however, that operations previously had no idea what the capacity was. If marketing called and asked, "Can we support 4 million extra visits?" the current operations group could have answered "no". Previously, the answer would have been "I don’t know."

So, operations got better, but marketing never got re-educated. Lines of communication were never opened, or re-opened. Better communication would have helped.

In any online business, you must have close communications between marketing, UX, development, and operations. They need to regard themselves as part of one integrated team, rather than four separate teams. I’ve often seen development groups that view operations as a barrier to getting their stuff released. UX and marketing view development as the barrier to getting their ideas implemented, and so on. This dynamic evolves from the "throw it over the wall" approach, and it can only result in finger-pointing and recriminations.

I’d bet there’s a lot of finger-pointing going on in that retailer’s hallways this weekend.

(Human | Pattern) Languages, Part 2


At the conclusion of the modulating bridge, we expect to be in the contrasting key of C minor. Instead, the bridge concludes in the distantly related key of F sharp major… Instead of resolving to the tonic, the cadence concludes with two isolated E pitches. They are completely ambiguous. They could belong to E minor, the tonic for this movement. They could be part of E major, which we’ve just heard peeking out from behind the minor mode curtains. [He] doesn’t resolve them into a definite key until the beginning of the third movement, characteristically labeled a "Scherzo".

In my last post, I lamented the missed opportunity we had to create a true pattern language about software. Perhaps calling it a missed opportunity is too pessimistic. Bear with me on a bit of a tangent. I promise it comes back around in the end.

The example text above is an amalgam of a lecture series I’ve been listening to. I’m a big fan of The Teaching Company and their courses. In particular, I’ve been learning about the meaning and structure of classical, baroque, romantic, and modern music from Professor Robert Greenberg.[1] The sample I used here is from a series on Beethoven’s piano sonatas. This isn’t an actual quote, but a condensation of statements from one of the lectures. I’m not going to go into all the music theory behind this, but it is interesting.[2]

There are two things I want you to observe about the sample text. First, it’s loaded with jargon. It has to be! You’d exhaust the conversational possibilities about the best use of a D-sharp pretty quickly. Instead, you’ll talk about structures, tonalities, relationships between that D-sharp and other pitches. (D-sharp played together with a C? Very different from a quick sequence of D-sharp, E, D-sharp, C.) You can be sure that composers don’t think in terms of individual notes. A D-sharp by itself doesn’t mean anything. It only acquires meaning by its relation to other pitches. Hence all that stuff about keys—tonic, distantly related, contrasting. "Key" is a construct for discussing whole collections of pitches in a kind of shorthand. To a musician, there’s a world of difference between G major and A flat minor, even though the basic pitch (the tonic) is only one half-step apart.

Also notice that the text addresses some structural features. The purpose and structure of a modulating bridge is pretty well understood, at least in certain circles. The notion that you can have an "expected" key certainly implies that there are rules for a sonata. In fact, the term "sonata" itself means some fairly specific things[3]… although to know whether we’re talking about "a sonata" or "a movement in sonata form" requires some additional context.

In fact, this paragraph is all about context. It exists in the context of late Classical, early Romantic era music, specifically the music of Beethoven. In the Classical era, musical forms—such as sonata form—pretty much dictated the structure of the music. The number of movements, their relationships to each other, their keys, and even their tempos were well understood. A contemporary listener had every reason to expect that a first movement would be fast and bright, and if the first movement was in C major, then the second, slower movement would be a minuet and trio in G major.

Music and music theory have evolved over the last thousand-odd years. We have a vocabulary—the potentially off-putting jargon of the field. We have nesting, interrelating contexts. Large scale patterns (a piano sonata) create context for medium scale patterns (the first movement "allegretto") which, in turn, create context for the medium and small scale patterns (the first theme in the allegretto consists of an ABA’BA phrasing, in which the opening theme sequences a motive upward over octaves.)  We even have the ability to talk about non sequiturs—like the modulating bridge above—where deliberate violation of the pattern language is done for effect.[4]

What is all this stuff if it isn’t a pattern language?

We can take a few lessons, then, from the language of music.

The first lesson is this: give it time. Musical language has evolved over a long time. It has grown and been pruned back over centuries. New terms are invented as needed to describe new answers to a context. In turn, these new terms create fresh contexts to be exploited with yet other inventions.

Second, any such language must be able to assimilate change. Nothing is lost, even amidst the most radical revolutions. When the Twentieth Century modernists rejected the tonal system, they could only reject the structures and strictures of that language. They couldn’t destroy the language itself. Phish plays fugues in concert… they just play them with electric guitars instead of harpsichords. There are Baroque orchestras today. They play in the same concert halls as the Pops and Philharmonics. The homophonic texture of plain chant still exists, and so do the once-heretical polyphony and church-sanctioned monophony. Nothing is lost, but new things can be encompassed and incorporated.

And, mainframes still exist with their COBOL programs, together with distributed object systems, message passing, and web services. The Singleton and Visitor patterns will never truly go away, any more than batch programming will disappear.

Third, we must continue to look at the relationships between different parts of our nascent pattern language. Just as individual objects aren’t very interesting, isolated patterns are less interesting than the ways they can interact with each other.

I believe that the true language of software has as much to do with programming languages as the language of music has to do with notes. So, instead of missed opportunity, let us say instead that we are just beginning to discover our true language.


1. Professor Greenberg is a delightful traveling companion. He’s witty, knowledgeable and has a way of teaching complex subjects without ever being condescending. He also sounds remarkably like Penn Jillette.

2. The main reason is that I would surely get it wrong in some details and risk losing the main point of my post here.

3. And here we see yet another of the complexities of language. The word "sonata" refers, at different times, to a three movement concert work, a single movement in a characteristic structure, a four movement concert work, and in Beethoven’s case, to a couple of great fantasias that he declares to be sonatas simply because he says so.

4. For examples ad nauseam, see Richard Wagner and the "abortive gesture".

(Human | Pattern) Languages


We missed the point when we adopted "patterns" in the software world. Instead of an organic whole, we got a bag of tricks.

The commonly accepted definition of a pattern is "a solution to a problem in a context." This is true, but limiting. This definition loses an essential characteristic of patterns: Patterns relate to other patterns.

We talk about the context of a problem. "Context" is a mental shorthand. If we unpack the context it means many things: constraints, capabilities, style, requirements, and so on. We sometimes mislead ourselves by using the fairly fuzzy, abstract term "context" as a mental handle on a whole variety of very concrete issues. Context includes stated constraints like the functional requirements, along with unstated constraints like, "The computation should complete before the heat death of the universe." It includes other forces like, "This program is written in C#, so the solution to this problem should be in the same language or a closely related one." It should not require a supercooled quantum computer, for example.

Where does the context for a small-scale pattern originate?[1] Context does not arise ex nihilo. No, the context for a small-scale pattern is created by larger patterns. Large grained patterns create the fabric of forces that we call the context for smaller patterns. In turn, smaller patterns fit into this fabric and, by their existence, they change it. Thus, the small scale patterns create feedback that can either resolve or exacerbate tensions inherent in the larger patterns.

Solutions that respect their context fit better with the rest of the organic whole. It would be strange to be reading some Java code, built into a layered architecture with a relational database for storage, then suddenly find one component that has its own LISP interpreter and some functional code. With all respect to "polyglot programming", there’d better be a strong motivation for such an odd inclusion. It would be a discontinuity… in other words, it doesn’t fit the context I described. That context—the layered architecture, the OO language, relational database—was created by other parts of the system.

If, on the other hand, the system was built as a blackboard architecture, using LISP as glue code over intelligent agents acting asynchronously, then it wouldn’t be at all odd to find some recursive lambda expressions. In that context, they fit naturally and the Java code would be an oddity.

This interrelation across scale knits patterns together into a pattern language. By and large, what we have today is a growing group of proper nouns. Please don’t get me wrong, the nouns themselves have use. It’s very helpful to say "you want a Null Object there," and be understood. That vocabulary and the compression it provides is really important.

But we shouldn’t mistake a group of nouns for a real pattern language. A language is more than just its nouns. A language also implies ways of connecting statements sensibly. It has idioms and semantics and semiotics.[2] In a language, you can have dialog and argumentation.  Imagine a dialog in patterns as they exist today:

"Pipes and filters."

"Observer?"

"Chain of Responsibility!"

You might be able to make a comedy sketch out of that, but not much more. We cannot construct meaningful dialogs about patterns at all scales.

What we have are fragments of what might become a pattern language. GoF, the PLoPD books, the PoSA books… these are like a few charted territories on an unmapped continent. We don’t yet have the language that would even let us relate these works together, let alone relating them to everything else.

Everything else?  Well, yes. By and large, patterns today are an outgrowth of the object-oriented programming community.  I contend, however, that "object-oriented" is a pattern! It’s a large-scale pattern that creates really significant context for all the other patterns that can work within it. Solutions that work within the "object-oriented" context make no sense in an actor-oriented context, or a functional context, or a procedural context, and so on. Each of these other large-scale patterns admits different solutions to similar problems: persistence, user interaction, and system integration, to name a few. I can imagine a pattern called "Event Driven" that would work very well with "Object oriented", "Functional", and "Actor Oriented", but somewhat less well with "Procedural programming", and contradict "Batch Processing" utterly. (Though there might be a link between them called "Buffer file" or something like that.)

That’s the piece that we missed. We don’t have a pattern language yet. We’re not even close.


1. By "large" and "small", I don’t mean to imply that patterns simply nest hierarchically. It’s more complex and subtle than that. When we do have a real pattern language, we’ll find that there are medium-grained patterns that work together with several, but not all, of the large ones. Likewise, we’ll find small-scale patterns that make medium sized ones more or less practical. It’s not a decision tree or a heuristic.

2. That’s what keeps "Fill the idea with blue" from being a meaningful sentence. All the words work, and they’re even the right part of speech, yet the sentence as a whole doesn’t fit together.

Connection Pools and Engset


In my last post, I talked about using Erlang models to size the front end of a system. By using some fundamental capacity models that are almost a century old, you can estimate the number of request handling threads you need for a given traffic load and request duration.

Inside the Box

It gets tricky, though, when you start to consider what happens inside the server itself. Processing the request usually involves some kind of database interaction with a connection pool. (There are many ways to avoid database calls, or at least minimize the damage they cause. I’ll address some of these in a future post, but you can also check out Two Ways to Boost Your Flagging Web Site for starters.) Database calls act like a kind of "interior" request that can be considered to have its own probability of queuing.

Exterior call to server becomes an "interior" call to a database.

Because this interior call can block, we have to consider what effects it will have on the duration of the exterior call. In particular, the exterior call must take at least the sum of the blocking time plus the processing time for the interior call.

At this point, we need to make a few assumptions about the connection pool. First, the connection pool is finite. Every connection pool should have a ceiling. If nothing else, the database server can only handle a finite number of connections. Second, I’m going to assume that the pool blocks when exhausted. That is, calling threads that can’t get a connection right away will happily wait forever rather than abandoning the request. This is a simplifying assumption that I need for the math to work out. It’s not a good configuration in practice!

With these assumptions in place, I can predict the probability of blocking within the interior call. It’s a formula closely related to the Erlang model from my last post, but with a twist. The Erlang models assume an essentially infinite pool of requestors. For this interior call, though, the pool of requestors is quite finite: it’s the number of request handling threads for the exterior calls. Once all of those threads are busy, there aren’t any left to generate more traffic on the interior call!

The formula to compute the blocking probability with a finite number of sources is the Engset formula. Like the Erlang models, Engset originated in the world of telephony. It’s useful for predicting the outbound capacity needed on a private branch exchange (PBX), because the number of possible callers is known. In our case, the request handling threads are the callers and the connection pool is the PBX.

Practical Example

Using our 1,000,000 page views per hour from last time, Table 1 shows the Engset table for various numbers of connections in the pool. This assumes that the application server has a maximum of 40 request handling threads. This also supposes that the database processing time uses 200 milliseconds of the 250 milliseconds we measured for the exterior call.

Table 1. Engset blocking probability for N connections in the pool (40 request threads)

N    Engset(N,A,S)
0    100.00000%
1     98.23183%
2     96.37740%
3     94.43061%
4     92.38485%
5     90.23293%
6     87.96709%
7     85.57891%
8     83.05934%
9     80.39867%
10    77.58656%
11    74.61210%
12    71.46397%
13    68.13065%
14    64.60087%
15    60.86421%
16    56.91211%
17    52.73932%
18    48.34604%
19    43.74105%
20    38.94585%
21    34.00023%
22    28.96875%
23    23.94730%
24    19.06718%
25    14.49235%
26    10.40427%
27     6.97050%
28     4.30152%
29     2.41250%
30     1.21368%
31     0.54082%
32     0.21081%
33     0.07093%
34     0.02028%
35     0.00483%
36     0.00093%
37     0.00014%
38     0.00002%
39     0.00000%
40     0.00000%

Notice that when we get to 18 connections in the pool, the probability of blocking drops below 50%.  Also, notice how sharply the probability of blocking drops off around 23 to 31 connections in the pool. This is a decidedly nonlinear effect!

From this table, it’s clear that even though there are 40 request handling threads that could call into this pool, there’s not much point in having more than 30 connections in the pool. At 30 connections, the probability of blocking is already less than 1%, meaning that the queuing time is only going to add a few milliseconds to the average request.
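You can reproduce these numbers with a minimal Ruby sketch of the Engset recursion. (The per-source offered load of 1.42 erlangs in the usage lines is my reverse-engineered guess at the parameters behind Table 1; derive it from your own arrival rate and hold time.)

# Engset blocking probability (call congestion) for n connections in the
# pool, s sources (request handling threads), and an offered load per idle
# source of a erlangs.
def engset(n, s, a)
  b = 1.0                  # with zero connections, every call blocks
  (1..n).each do |j|
    t = a * (s - j) * b    # call congestion counts s - 1 sources
    b = t / (j + t)
  end
  b
end

puts engset(18, 40, 1.42)  # ~0.48, where Table 1 crosses below 50%
puts engset(30, 40, 1.42)  # ~0.012, the point of diminishing returns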

Why do we care? Why not just crank up the connection pool size to 40? After all, if we did, then no request could ever block waiting for a connection. That would minimize latency, wouldn’t it?

Yes, it would, but at a cost. Increasing the number of connections to the database by a third means more memory and CPU time on the database just managing those connections, even if they’re idle. If you’ve got two app servers, then the database probably won’t notice an extra 10 connections. Suppose you scale out at the app tier, though, and you now have 50 or 60 app servers. You’d better believe that the DB will notice an extra 500 to 600 connections. They’ll affect memory needs, CPU utilization, and your ability to fail over correctly when a database node goes down.

Feedback and Coupling

There’s a strong coupling between the total request duration in the interior call and the request duration for the exterior call. If we assume that every request must go through the database call, then the exterior response time must be strictly greater than the interior blocking time plus the interior processing time.

In practice, it actually gets a little worse than that, as this causal loop diagram illustrates.

 Time dependencies between the interior call and the exterior call.

It reads like this: "As the interior call’s blocking time increases, the exterior call’s duration increases." This type of representation helps clarify relations between the different layers. It’s very often the case that you’ll find feedback loops this way. Any time you do find a feedback loop, it means that slowdowns will produce increasing slowdowns. Blocking begets blocking, quickly resulting in a site hang.

Conclusions

Queues are like timing dots. Once you start seeing them, you’ll never be able to stop. You might even start to think that your entire server farm looks like one vast, interconnected set of queues.

That’s because it is.

People use database connection pools because creating new connections is very slow. Tuning your database connection pool size, however, is all about trading off the cost of queueing against the cost of extra connections. Each connection consumes resources on the database server and in the application server. Striking the right balance starts by identifying the required exterior response time, then sizing the connection pool—or changing the architecture—so the interior blocking time doesn’t break the SLA.

For much, much more on the topic of capacity modeling and analysis, I definitely recommend Neil Gunther’s website, Performance Agora. His books are also a great—and very practical—way to start applying performance and capacity management.

Thread Pools and Erlang Models


Sizing, Danish Style

Folks in telecommunications and operations research have used Erlang models for almost a century. A. K. Erlang, a Danish telephone engineer, developed these models to help plan the capacity of the phone network and predict the grade of service that could be guaranteed, given some basic metrics about call volume and duration. Telephone networks are expensive to deploy, particularly when upgrading your trunk lines involves digging up large portions of rocky Danish ground or running cables under the North Sea.

The Erlang-B formula predicts the probability that an incoming call cannot be serviced, based on the call arrival rate, average call time, and number of lines available.  Erlang-C is similar, but allows for calls to be queued while waiting for service. It predicts the probability that a call will be queued. It can also show when calls will never be serviced, because the rate of arriving calls exceeds the system’s total capacity to serve them.

Erlang models are widely used in telecomm, including GPRS network sizing, trunk line sizing, call center staffing models, and other capacity planning arenas where request arrival is apparently random. In fact, you can use it to predict the capacity and wait time at a restaurant, bank branch, or theme park, too.

It should be pretty obvious that Erlang models are widely applicable in computer performance analysis, too. There’s a rich body of literature on this subject that goes back to the dawn of the mainframe. Erlang models are the foundation of most capacity management groups. I’m not even going to scratch the surface here, except to show how some back-of-the-envelope calculations can help you save millions of dollars.

One Million Page Views

In my case, I wanted to look at thread pool sizing. Suppose you have an even 1,000,000 requests per hour to handle. This implies an arrival rate (or lambda) of 0.27777… requests per millisecond. (Erlang units are dimensionless, but you need to start with the same units of time, whether it’s hours, days, or milliseconds.) I’m going to assume for the moment that the system is pretty fast, so it handles a request in 250 milliseconds, on average.

(Please note that there are many assumptions underneath simple statements like "on average". For the moment, I’ll pretend that request processing time follows a normal distribution, even though any modern system is more likely to be bimodal.)

Table 1 shows a portion of the Erlang-C table for these parameters. Feel free to double-check my work with this spreadsheet or this short C program to compute the Erlang-B and Erlang-C values for various numbers of threads. (Thanks to Kenneth J. Christensen for the original program. I can only claim credit for the extra "for" loop.)
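If you’d rather not chase those links, here is a minimal Ruby sketch of the same computation. This is my port of the standard recurrences, not Christensen’s code, so any bugs are mine.

# Erlang-B: probability that a request is rejected because all m servers
# are busy, given an offered load of a erlangs.
def erlang_b(a, m)
  b = 1.0
  (1..m).each { |k| b = (a * b) / (k + a * b) }
  b
end

# Erlang-C: probability that a request must queue for a server. Undefined
# when the offered load meets or exceeds the server count ("undef" below).
def erlang_c(a, m)
  return nil if a >= m
  b = erlang_b(a, m)
  (m * b) / (m - a * (1.0 - b))
end

a = (1_000_000 / 3_600_000.0) * 250  # ~69.44 erlangs at 250 ms/request
puts erlang_c(a, 91)                 # ~0.0084, matching Table 1
puts erlang_c(a * 350 / 250, 122)    # ~0.0096, matching Table 2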

Table 1. Erlang-C values at 250 ms / request

N    Pr_Queue (Erlang-C)
67   undef
68   undef
69   undef
70   0.921417281
71   0.791698369
72   0.676255938
73   0.574128540
74   0.484342834
75   0.405921606
76   0.337892350
77   0.279296163
78   0.229196685
79   0.186688788
80   0.150906701
81   0.121031288
82   0.096296202
83   0.075992736
84   0.059473196
85   0.046152756
86   0.035509802
87   0.027084849
88   0.020478191
89   0.015346497
90   0.011398581
91   0.008390600
92   0.006120940
93   0.004424999
94   0.003170077
95   0.002250524
96   0.001583268
97   0.001103786
98   0.000762573
99   0.000522098

From Table 1, I can immediately see that fewer than 70 threads will never keep up; the queue of unprocessed requests will grow without bound. I need at least 91 threads to get below a 1% chance that a request will be delayed by queueing.

Performance and Capacity

Now, what happens if the average request processing time goes up by 100 milliseconds on those same million requests? Adjusting the parameters, I get Table 2.

Table 2. Erlang-C values at 350 ms / request

N     Pr_Queue (Erlang-C)
96    undef
97    undef
98    0.907100356
99    0.797290966
100   0.697789489
101   0.608014385
102   0.527376532
103   0.455282634
104   0.391138874
105   0.334354749
106   0.284347016
107   0.240543652
108   0.202387733
109   0.169341130
110   0.140887936
111   0.116537521
112   0.095827141
113   0.078324041
114   0.063626999
115   0.051367297
116   0.041209109
117   0.032849334
118   0.026016901
119   0.020471625
120   0.016002658
121   0.012426630
122   0.009585560
123   0.007344611
124   0.005589775
125   0.004225555

Now we need a minimum of 98 threads before we can even expect to keep up, and we need 122 threads to get down under that 1% queuing threshold.

On the other hand, what about improving performance by 100 milliseconds per request? I’ll let you run the calculator for that, but it looks to me like we need between 42 and 59 threads to meet the same thresholds.

That swing, from 150 to 350 milliseconds per request, makes a huge difference in the number of concurrent threads your system must support to handle a million requests per hour—almost a factor of three. Would you be willing to triple your hardware for the same request volume? Next time anyone says that “CPU is cheap”, fold your arms and tell them “Erlang would not approve.” On the flip side, it might be worth spending some administrator time on performance tuning to bring down your average page latency. Or maybe some programmer time to integrate memcached so every single page doesn’t have to trudge all the way to the database.

Summary and Extension

Obviously, there’s a lot more to performance analysis for web servers than this. Over time, I’ll be mixing more analytic pieces with the pragmatic, hands-on posts that I usually make. It’ll take some time. For one thing, I have to go back and learn about stochastic processes and Markov chains. Pattern recognition and signal processing I’ve got. Advanced probability and statistics I don’t got.

In fact, I’ll offer a free copy of Release It! to the first commenter who can show me how to derive an Erlang-like model that accounts for a) garbage collection times (bimodal processing time distribution), b) multiple coupled wait states during processing, c) non-equilibrium system states, and d) processing time that varies as a function of system utilization.

Constraint, Chaos, Collapse


Patrick Mueller has an interesting post about being brainwashed into believing that the outrageous is normal. It’s a good read. (Hat tip to Reddit, whence many good things.) As often happens, I wrote such a long comment to his post that I felt it worthwhile to repost here.

My comment revolves around this chart of the Dow Jones Industrial Average over the last eighty years. (For the record, I’m not disputing anything about the rest of Patrick’s post. In fact, I agree with most of what he says. This chart and my comments aren’t central to his discussion about web development.) Some of you know that I’ve worked in finance before, and most of you know I have an interest in dynamics and complex systems. It’s been an interesting year.

Here’s a snapshot of the chart in question. It’s from Yahoo! Finance, and the image links to the live chart.



Most of the chart looks like an exponential, which suggests the effect of compound growth. In a functioning capital-based system you’d expect exactly that. Capital invested produces more capital. Any time an output is also a required input, you get exponential growth. One of Patrick’s other commenters points out that it looks almost linear when plotted on a logarithmic scale… a dead giveaway of an exponential.

No real system can produce infinite growth. Instead, they always hit a constraint. That could be a physical limitation on the available inputs. It could be a limit on the throughput of the system itself. In a sense, it almost doesn’t matter what the constraint itself happens to be. Rather, you should assume that a constraint exists.

In systems with a chaotic tendency, the system doesn’t slow down at all when approaching the constraint. In fact, it may be increasing at its greatest rate just before the constraint clamps down hardest. In such cases, you’ll either see a catastrophic collapse or a chaotic fluctuation.

I don’t know what the true constraint was in the financial system. Plenty of other people believe they know, and I’m happy to let them believe what they like. Just from looking at the chart, though, you could make a strong case that we really hit the constraint in 1999 and the rest has been chaos since then.

Licensing for Windows on EC2

| Comments

One thing I noticed when I fired up my first Windows instances on EC2 was that Windows never asked me for a license key.  From examining the registry, it appears that a valid license key is installed at boot time.  On two instances of image ami-b53cd8dc (ec2-public-windows-images/Server2003r2-i386-anon-v1.01 for i386) I got exactly the same key.

Likewise, on two different instances of ami-7b2bcf12 (ec2-public-windows-images/Server2003r2-x86_64-anon-v1.00 or x64), I got the same license key–though not the same key as the i386 image.

This tells me that the license key is probably baked into the image. It’s also possible that these particular license keys are unique to my account. If someone else wants to compare keys, it’d be an interesting experiment.
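For anyone who wants to run that experiment, here’s a rough Ruby sketch. The registry path, byte offsets, and base-24 decoding are the commonly circulated ones for this generation of Windows; treat them as assumptions, since I haven’t verified them against these particular AMIs.

require 'win32/registry'

def windows_product_key
  path = 'SOFTWARE\Microsoft\Windows NT\CurrentVersion'
  raw = Win32::Registry::HKEY_LOCAL_MACHINE.open(path) { |r| r['DigitalProductId'] }
  bytes = raw.unpack('C*')[52, 15]   # the encoded key lives in bytes 52..66
  chars = 'BCDFGHJKMPQRTVWXY2346789' # base-24 digit set used by product keys
  key = ''
  24.downto(0) do |i|
    acc = 0
    14.downto(0) do |j|              # long division of the 15 bytes by 24
      acc = acc * 256 + bytes[j]
      bytes[j] = acc / 24
      acc %= 24
    end
    key = chars[acc, 1] + key
    key = '-' + key if i % 5 == 0 && i != 0
  end
  key
end

puts windows_product_key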

Either way, the extra 2.5 cents per hour on the small instance must go to Microsoft to pay for license rental.