Wide Awake Developers

Thread Pools and Erlang Models

Sizing, Danish Style

Folks in telecommunications and operations research have used Erlang models for almost a century. A. K. Erlang, a Danish telephone engineer, developed these models to help plan the capacity of the phone network and predict the grade of service that could be guaranteed, given some basic metrics about call volume and duration. Telephone networks are expensive to deploy, particularly when upgrading your trunk lines involves digging up large portions of rocky Danish ground or running cables under the North Sea.

The Erlang-B formula predicts the probability that an incoming call cannot be serviced, based on the call arrival rate, average call time, and number of lines available.  Erlang-C is similar, but allows for calls to be queued while waiting for service. It predicts the probability that a call will be queued. It can also show when calls will never be serviced, because the rate of arriving calls exceeds the system’s total capacity to serve them.
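
For reference, these are the standard textbook forms of the two formulas. The notation is chosen here rather than taken from the post: N is the number of lines, and A = λh is the offered load (arrival rate λ times average call time h). Both formulas assume randomly arriving (Poisson) calls, and Erlang-C additionally assumes exponentially distributed call times.

```latex
B(N, A) = \frac{A^N / N!}{\sum_{k=0}^{N} A^k / k!}
\qquad
C(N, A) = \frac{\dfrac{A^N}{N!}\,\dfrac{N}{N - A}}
               {\sum_{k=0}^{N-1} \dfrac{A^k}{k!} + \dfrac{A^N}{N!}\,\dfrac{N}{N - A}},
\quad N > A
```

When N ≤ A, the Erlang-C value is undefined: this is the case where arriving calls exceed the total capacity to serve them.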

Erlang models are widely used in telecom, including GPRS network sizing, trunk line sizing, call center staffing models, and other capacity planning arenas where request arrival is apparently random. In fact, you can use them to predict the capacity and wait time at a restaurant, bank branch, or theme park, too.

It should be pretty obvious that Erlang models are widely applicable in computer performance analysis, too. There’s a rich body of literature on this subject that goes back to the dawn of the mainframe. Erlang models are the foundation of most capacity management groups. I’m not even going to scratch the surface here, except to show how some back-of-the-envelope calculations can help you save millions of dollars.

One Million Page Views

In my case, I wanted to look at thread pool sizing. Suppose you have an even 1,000,000 requests per hour to handle. This implies an arrival rate (or lambda) of 0.27777… requests per millisecond. (Erlang units are dimensionless, but you need to start with the same units of time, whether it’s hours, days, or milliseconds.) I’m going to assume for the moment that the system is pretty fast, so it handles a request in 250 milliseconds, on average.

(Please note that there are many assumptions underneath simple statements like "on average". For the moment, I’ll pretend that request processing time follows a normal distribution, even though any modern system is more likely to be bimodal.)

Table 1 shows a portion of the Erlang-C table for these parameters. Feel free to double-check my work with this spreadsheet or this short C program to compute the Erlang-B and Erlang-C values for various numbers of threads. (Thanks to Kenneth J. Christensen for the original program. I can only claim credit for the extra "for" loop.)
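
A minimal Python sketch of the same calculation follows. It is not the C program mentioned above: the function names `erlang_b` and `erlang_c` are made up here, and the code uses the standard Erlang-B recurrence instead of factorials so that large thread counts do not overflow. It assumes the usual Erlang-C setting of random arrivals and a single shared queue.

```python
def erlang_b(servers, offered_load):
    """Blocking probability via the standard recurrence:
    B(0) = 1, B(n) = A * B(n-1) / (n + A * B(n-1))."""
    b = 1.0
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b


def erlang_c(servers, offered_load):
    """Probability that an arriving request has to queue.
    Returns None when servers <= offered load, where the queue
    grows without bound and the formula is undefined."""
    if servers <= offered_load:
        return None
    b = erlang_b(servers, offered_load)
    return servers * b / (servers - offered_load * (1.0 - b))


REQUESTS_PER_HOUR = 1_000_000
MS_PER_HOUR = 3_600_000
arrival_rate = REQUESTS_PER_HOUR / MS_PER_HOUR   # about 0.2778 requests per ms
service_time_ms = 250
offered_load = arrival_rate * service_time_ms    # about 69.44 erlangs

# Print the same range of thread counts as Table 1.
for threads in range(67, 100):
    p = erlang_c(threads, offered_load)
    print(threads, "undef" if p is None else f"{p:.9f}")
```

Changing `service_time_ms` to 350 and the range to 96 through 125 gives the rows for the second scenario discussed below.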

Table 1. Erlang-C values at 250 ms / request

N     Pr_Queue (Erlang-C)
67    undef
68    undef
69    undef
70    0.921417281
71    0.791698369
72    0.676255938
73    0.574128540
74    0.484342834
75    0.405921606
76    0.337892350
77    0.279296163
78    0.229196685
79    0.186688788
80    0.150906701
81    0.121031288
82    0.096296202
83    0.075992736
84    0.059473196
85    0.046152756
86    0.035509802
87    0.027084849
88    0.020478191
89    0.015346497
90    0.011398581
91    0.008390600
92    0.006120940
93    0.004424999
94    0.003170077
95    0.002250524
96    0.001583268
97    0.001103786
98    0.000762573
99    0.000522098

From Table 1, I can immediately see that fewer than 70 threads will never keep up. With fewer than 70 threads, the queue of unprocessed requests will grow without bound. I need at least 91 threads to get below a 1% chance that a request will be delayed by queueing.

Performance and Capacity

Now, what happens if the average request processing time goes up by 100 milliseconds on those same million requests? Adjusting the parameters, I get Table 2.

Table 2. Erlang-C values at 350 ms / request

N      Pr_Queue (Erlang-C)
96     undef
97     undef
98     0.907100356
99     0.797290966
100    0.697789489
101    0.608014385
102    0.527376532
103    0.455282634
104    0.391138874
105    0.334354749
106    0.284347016
107    0.240543652
108    0.202387733
109    0.169341130
110    0.140887936
111    0.116537521
112    0.095827141
113    0.078324041
114    0.063626999
115    0.051367297
116    0.041209109
117    0.032849334
118    0.026016901
119    0.020471625
120    0.016002658
121    0.012426630
122    0.009585560
123    0.007344611
124    0.005589775
125    0.004225555

Now we need a minimum of 98 threads before we can even expect to keep up, and we need 122 threads to get down under that 1% queueing threshold.

On the other hand, what about improving performance by 100 milliseconds per request, down to 150 milliseconds? I’ll let you run the calculator for that, but it looks to me like we need between 42 and 59 threads to meet the same thresholds.

That swing, from 150 to 350 milliseconds per request, makes a huge difference in the number of concurrent threads your system must support to handle a million requests per hour: more than a factor of 2. Would you be willing to double your hardware for the same request volume? Next time anyone says that “CPU is cheap”, fold your arms and tell them “Erlang would not approve.” On the flip side, it might be worth spending some administrator time on performance tuning to bring down your average page latency. Or maybe some programmer time to integrate memcached so every single page doesn’t have to trudge all the way to the database.
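
To compare the three scenarios side by side, a short Python sketch can search for the two thresholds directly. The helper names `erlang_c` and `threads_needed` are hypothetical, and the 1% target is the same queueing threshold used in the discussion above.

```python
def erlang_c(servers, offered_load):
    """Erlang-C queueing probability; None if the pool cannot keep up."""
    if servers <= offered_load:
        return None
    b = 1.0
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)   # Erlang-B recurrence
    return servers * b / (servers - offered_load * (1.0 - b))


def threads_needed(offered_load, max_queue_probability):
    """Smallest pool size whose queueing probability is below the target."""
    threads = int(offered_load) + 1   # smallest size that can keep up at all
    while erlang_c(threads, offered_load) >= max_queue_probability:
        threads += 1
    return threads


ARRIVALS_PER_MS = 1_000_000 / 3_600_000   # one million requests per hour

for service_ms in (150, 250, 350):
    load = ARRIVALS_PER_MS * service_ms
    keep_up = int(load) + 1
    one_percent = threads_needed(load, 0.01)
    print(f"{service_ms} ms: keep up at {keep_up} threads, "
          f"under 1% queueing at {one_percent}")
```

The 250 ms and 350 ms rows can be checked against the two tables; the 150 ms row is the calculation left to the reader above.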

Summary and Extension

Obviously, there’s a lot more to performance analysis for web servers than this. Over time, I’ll be mixing more analytic pieces with the pragmatic, hands-on posts that I usually make. It’ll take some time. For one thing, I have to go back and learn about stochastic processes and Markov chains. Pattern recognition and signal processing I’ve got. Advanced probability and statistics I don’t got.

In fact, I’ll offer a free copy of Release It to the first commenter who can show me how to derive an Erlang-like model that accounts for a) garbage collection times (bimodal processing time distribution), b) multiple coupled wait states during processing, c) non-equilibrium system states, and d) processing time that varies as a function of system utilization.

Constraint, Chaos, Collapse

Patrick Mueller has an interesting post about being brainwashed into believing that the outrageous is normal. It’s a good read. (Hat tip to Reddit, whence many good things.) As often happens, I wrote such a long comment to his post that I felt it worthwhile to repost here.

My comment revolves around this chart of the Dow Jones Industrial Average over the last eighty years. (For the record, I’m not disputing anything about the rest of Patrick’s post. In fact, I agree with most of what he says. This chart and my comments aren’t central to his discussion about web development.) Some of you know that I’ve worked in finance before, and most of you know I have an interest in dynamics and complex systems. It’s been an interesting year.

Here’s a snapshot of the chart in question. It’s from Yahoo! Finance, and the image links to the live chart.



Most of the chart looks like an exponential, which suggests the effect of compound growth. In a functioning capital-based system you’d expect exactly that. Capital invested produces more capital. Any time an output is also a required input, you get exponential growth. One of Patrick’s other commenters points out that it looks almost linear when plotted on a logarithmic scale… a dead giveaway of an exponential.

No real system can produce infinite growth. Instead, every system eventually hits a constraint. That could be a physical limitation on the available inputs. It could be a limit on the throughput of the system itself. In a sense, it almost doesn’t matter what the constraint itself happens to be. Rather, you should assume that a constraint exists.

In systems with a chaotic tendency, the system doesn’t slow down at all when approaching the constraint. In fact, it may be increasing at its greatest rate just before the constraint clamps down hardest. In such cases, you’ll either see a catastrophic collapse or a chaotic fluctuation.
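
One standard toy model of this behavior is the logistic map, x_next = r * x * (1 - x), where growth feeds on itself until a constraint (here, x = 1) pushes back. It is only an illustration chosen here, not a model of the stock market and not something from Patrick’s post. With a modest growth factor r the value settles smoothly; with r = 3.9, a value usually cited as being in the chaotic regime, it keeps fluctuating.

```python
def logistic_trajectory(r, x0=0.01, steps=300):
    """Iterate x -> r * x * (1 - x) and return every value visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


calm = logistic_trajectory(r=2.5)      # settles at the fixed point 1 - 1/r = 0.6
chaotic = logistic_trajectory(r=3.9)   # keeps bouncing instead of settling

for name, xs in (("r=2.5", calm), ("r=3.9", chaotic)):
    tail = xs[-50:]
    print(name, "range over last 50 steps:",
          round(min(tail), 3), "to", round(max(tail), 3))
```

Both runs start from the same small value and grow almost exponentially at first; only the growth factor decides whether the approach to the constraint is smooth or erratic.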

I don’t know what the true constraint was in the financial system. Plenty of other people believe they know, and I’m happy to let them believe what they like. Just from looking at the chart, though, you could make a strong case that we really hit the constraint in 1999 and the rest has been chaos since then.

Licensing for Windows on EC2

One thing I noticed when I fired up my first Windows instances on EC2 was that Windows never asked me for a license key.  From examining the registry, it appears that a valid license key is installed at boot time.  On two instances of image ami-b53cd8dc (ec2-public-windows-images/Server2003r2-i386-anon-v1.01 for i386) I got exactly the same key.

Likewise, on two different instances of ami-7b2bcf12 (ec2-public-windows-images/Server2003r2-x86_64-anon-v1.00 for x64), I got the same license key, though not the same key as the i386 image.

This tells me that the license key is probably baked into the image. It’s also possible that these particular license keys are unique to my account. If someone else wants to compare keys, it’d be an interesting experiment.

Either way, the extra 2.5 cents per hour on the small instance must go to Microsoft to pay for license rental.

Windows on EC2, From a Mac

It may be a bit perverse, but I wanted to hit a Windows EC2 instance from my Mac. After a little hitch getting started, I got it to work. There are a few quirks about accessing Windows instances, though.

First off, SSH is not enabled by default. You’ll need to use remote desktop to access your instance. Remote desktop uses port 3389, so the first step is to create a new security group for Windows desktop access:

$ ec2-add-group windows -d 'Windows remote desktop access'
GROUP    windows    Windows remote desktop access

Then, allow access to port 3389 from your desired origin. I’m allowing it from anywhere, which isn’t a great idea, but I’m on the road a lot. I never know what the hotel’s network origin will be.

$ ec2-authorize windows -p 3389 -P tcp
GROUP        windows    
PERMISSION        windows    ALLOWS    tcp    3389    3389    FROM    CIDR    0.0.0.0/0

Obviously, you could add that permission to any existing group that you already use.

There’s a bit of a song and dance to log in. Where Linux instances typically use SSH with public-key authentication, Windows server requires a typed password. Amazon has come up with a reasonable, but slightly convoluted, way to extract a randomized password.

You will need to start your instance in the new security group and with a keypair. The docs could be a little clearer, in that here you’re providing the name of the keypair as it was registered with EC2. The first few times I tried this, I was giving it the path of the file containing the keypair, which doesn’t work.

$ ec2-describe-keypairs
KEYPAIR    devkeypair    02:10:65:9e:51:73:7e:93:bd:30:e2:5d:91:03:d5:e1:d4:0e:c0:f4
$ ec2-run-instances ami-782bcf11 -g windows -k devkeypair
RESERVATION    r-82429ceb    001356815600    windows
INSTANCE    i-f172db98    ami-782bcf11            pending    devkeypair    0        m1.small    2008-10-23T20:01:36+0000    us-east-1a            windows

After all that, and waiting through a Windows boot cycle, you can access the Windows desktop through RDP.

What’s that? You don’t have an RDP client, because you’re a Mac user? I like CoRD for that. I also saw a lot of references to rdesktop, which is available through Darwin Ports. (For today, I wasn’t prepared to install Ports just to try out the Windows EC2 instance!)

Extract the public IP address of your instance:

$ ec2-describe-instances
RESERVATION    r-82429ceb    001356815600    windows
INSTANCE    i-f172db98    ami-782bcf11    ec2-75-101-252-238.compute-1.amazonaws.com    domU-12-31-39-02-48-31.compute-1.internal    running    devkeypair    0        m1.small    2008-10-23T20:01:36+0000    us-east-1a        windows

Fire up CoRD and paste the IP address into "Quick Connect".
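
If you script this step, the public hostname can be pulled out of the `ec2-describe-instances` output. The sketch below is a hypothetical Python helper, not part of ec2-api-tools. It assumes the whitespace-separated layout shown above, where the fourth field of a running INSTANCE line is the public DNS name, and it assumes the `ec2-a-b-c-d` prefix of that name spells out the public IP address.

```python
import re

# Shortened copy of the INSTANCE line from the output above.
SAMPLE_LINE = (
    "INSTANCE    i-f172db98    ami-782bcf11    "
    "ec2-75-101-252-238.compute-1.amazonaws.com    "
    "domU-12-31-39-02-48-31.compute-1.internal    running    devkeypair"
)


def public_dns(instance_line):
    """Return the fourth whitespace-separated field of an INSTANCE line.
    Only meaningful once the instance is running; a pending instance
    has no DNS name yet, so the fields shift."""
    fields = instance_line.split()
    if len(fields) < 4 or fields[0] != "INSTANCE":
        raise ValueError("not an INSTANCE line")
    return fields[3]


def ip_from_dns(hostname):
    """Recover a.b.c.d from a name of the form ec2-a-b-c-d.<domain>."""
    match = re.match(r"ec2-(\d+)-(\d+)-(\d+)-(\d+)\.", hostname)
    if match is None:
        raise ValueError("hostname does not follow the ec2-a-b-c-d pattern")
    return ".".join(match.groups())


host = public_dns(SAMPLE_LINE)
print(host, ip_from_dns(host))
```

Either the hostname or the derived address should work in an RDP client.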

Well, now what? Obviously, you’ll use "Administrator" as the username, but what’s the password? There’s a new command in the latest release of ec2-api-tools called "ec2-get-password".

$ ec2-get-password i-f172db98 -k keys/devkeypair.pem
edhnsNG1J5

Note that this time, I’m using the path of my keypair file. The tool uses this private key to decrypt the password from the instance’s console output. At boot time, Windows prints out the password, encrypted with the public key from the keypair you named when starting the instance.

Success at last: fully logged in to my virtual Windows server from my Mac desktop.

Don’t Break My Heart, EC2!

I’m a huge booster of AWS and EC2. I have two talks about cloud computing, and one that’s pretty specific to AWS, on the No Fluff, Just Stuff traveling symposium.

With today’s announcement about EC2 coming out of beta, and about Windows support, I wanted to try out a Windows server on EC2.

Heartbreak!

ec2-describe-images -a | grep windows
IMAGE    ami-782bcf11    ec2-public-windows-images/Server2003r2-i386-anon-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-792bcf10    ec2-public-windows-images/Server2003r2-i386-EntAuth-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-7b2bcf12    ec2-public-windows-images/Server2003r2-x86_64-anon-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-7a2bcf13    ec2-public-windows-images/Server2003r2-x86_64-EntAuth-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-3934d050    ec2-public-windows-images/SqlSvrExp2003r2-i386-Anon-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-0f34d066    ec2-public-windows-images/SqlSvrExp2003r2-i386-EntAuth-v1.00.manifest.xml    amazon    available    public        i386    machine        
IMAGE    ami-8135d1e8    ec2-public-windows-images/SqlSvrExp2003r2-x86_64-Anon-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-9835d1f1    ec2-public-windows-images/SqlSvrExp2003r2-x86_64-EntAuth-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-6834d001    ec2-public-windows-images/SqlSvrStd2003r2-x86_64-Anon-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-6b34d002    ec2-public-windows-images/SqlSvrStd2003r2-x86_64-EntAuth-v1.00.manifest.xml    amazon    available    public        x86_64    machine        
IMAGE    ami-cd8b6ea4    khaz_windows2003srvEE/image.manifest.xml    602961847481    available    public        i386    machine        

mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10 -z us-east-1a
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10 -z us-east-1b
Server.InsufficientInstanceCapacity: Insufficient capacity.
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10 -z us-east-1c
Server.InsufficientInstanceCapacity: Insufficient capacity.

Ack! Insufficient capacity?! That’s not supposed to happen. Wait a second… let me try my own image:

mtnygard@donk /var/tmp/nms $ ec2-describe-images
IMAGE    ami-8a0beee3    com.michaelnygard/nms-base-v1.manifest.xml    001356815600    available    private        i386    machine        
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-8a0beee3
RESERVATION    r-0c4a9465    001356815600    default
INSTANCE    i-8e79d0e7    ami-8a0beee3            pending        0        m1.small    2008-10-23T17:25:21+0000    us-east-1c        
mtnygard@donk /var/tmp/nms $ ec2-run-instances ami-792bcf10
Server.InsufficientInstanceCapacity: Insufficient capacity.

Very interesting. Looks like there’s enough capacity to run all the Linux-based images, but not enough for Windows?

Seems like there might be some contractual limit on how many Windows licenses Amazon is allowed to rent out. I would also infer some serious pent-up demand to eat them all up this quickly.

Or maybe it’s just a glitch. We’ll see.

Update [1:15 PM] I was just able to start five instances. Could be fluctuations in demand, or it could be clearing of a glitch. It’s always hard to tell what’s really happening inside the cloud.

Update [2:50 PM] My plaintive post in the AWS forums got a very quick response. The inscrutable wizard JeffW posted “we’re working on it” and “it’s fixed” messages just 3 minutes apart. We’ll probably never know quite what was going on.

Perfection Is Not Always Required

In my series on dirty data, I made the argument that sometimes incomplete, inaccurate, or inconsistent data is OK. In fact, not only is it OK, but it can be an advantage.

There’s a really slick Ruby library called WhatLanguage that illustrates this beautifully. The author also wrote a nice article introducing the library. WhatLanguage automatically determines the language that a piece of text is written in.

For example (from the article)

require 'whatlanguage'

"Je suis un homme".language # => :french

Very nice.

WhatLanguage works by comparing words in the input text to a data structure that can tell you whether a word exists in the corpus. There’s the catch, though. It can return a false positive! That would mean you get an incorrect "yes" sometimes for words that aren’t in the language in question. On the other hand, it’s guaranteed against false negatives.

You might imagine that there are pretty limited circumstances when you’d use a data structure that sometimes returns incorrect answers. (There is a calculable probability of a false positive. It never reaches zero.) It works for WhatLanguage, though.

You see, each word contributes to a histogram binned by possible language. Ultimately, one language "wins", based on whichever has the most entries in the histogram. False positives may contribute an extra point to incorrect languages, but the correct language will pretty much always emerge from the noise, provided there’s enough source text to work from.
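
The structure described here, with possible false positives and no false negatives, behaves like a Bloom filter. The sketch below shows the idea in Python rather than Ruby. It is not WhatLanguage’s code: the `BloomFilter` class, the tiny word lists, and the filter sizes are all made up for illustration.

```python
import hashlib


class BloomFilter:
    """Set-membership test that may say "yes" wrongly but never "no" wrongly."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, word):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def __contains__(self, word):
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


# Hypothetical, deliberately tiny corpora; a real one holds a whole dictionary.
CORPORA = {
    "english": ["i", "am", "a", "man", "the", "is", "and", "of"],
    "french": ["je", "suis", "un", "homme", "le", "est", "et", "de"],
}

FILTERS = {}
for language, words in CORPORA.items():
    FILTERS[language] = BloomFilter()
    for w in words:
        FILTERS[language].add(w)


def detect_language(text):
    """Each word votes for every language whose filter claims to contain it."""
    votes = {language: 0 for language in FILTERS}
    for word in text.lower().split():
        for language, bloom in FILTERS.items():
            if word in bloom:
                votes[language] += 1
    return max(votes, key=votes.get)


print(detect_language("Je suis un homme"))
```

A stray false positive would add one vote to the wrong language, which is why longer input makes the result more dependable.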

So, there’s another example of information emerging from noisy inputs, just as long as there’s enough of it.

Arrival at JAOO

Considering that it’s 7:30 AM local time—where "local" means Aarhus, Denmark—and I’m awake and online, it looks like I’ve successfully reset my internal clock.  Of course, my approach consisted of staying awake for 28 hours continuously then having three excellent beers with dinner.  There are probably easier ways, and there may be repercussions later.

I’ve always heard good things about JAOO, so it was an honor and a delight to be invited. So far, just hanging around the hotel has been interesting. Waiting to check in yesterday evening, I encountered Richard Gabriel and one of the guys who designed Windows PowerShell. (He still calls it Monad, which I think was a much better name than "PowerShell". Also, I wish I’d gotten his name, but I was too distracted by the problem with my reservation.)

After dinner, I started chatting with some ThoughtWorkers over a game of ZombieFluxx. Two observations: first, ZombieFluxx is the kind of game that only a computer programmer or a lawyer could love. The deck of cards includes many cards that change the rules of the game itself. Gameplay changes from turn to turn based on the current state of the rule cards showing. There’s even a card that requires you to groan like the undead whenever you turn over a new "zombie" card. Very meta.  Second, it seems that TW people make up half of every conference I go to. They must have a fantastic training budget, because they are disproportionately represented relative to their much larger competitors like Accenture, Deloitte, and that crowd. Woe to the conference industry if ThoughtWorks falls on hard times.

My primary goal for today was to get over jetlag. Having accomplished that before 8 AM, I’ll now see about straightening out my hotel situation. It’s hard to think much about software when you may not have a roof over your head come nightfall.

Update: Got my hotel issues resolved. Now at a thoroughly modern, thoroughly Danish hotel called the "Best Western Oasia". Funny, but I always think of "Best Western" as the cruddy, mildewed cheap hotels off the Interstate in places like west Texas and Birmingham, Alabama. This hotel may cause me to reevaluate that image! It’s nice, in a kind of "living inside Ikea" way.

(And, yes, I know Ikea is Swedish, not Danish. It’s the bare wood, spare furnishings, and black lacquer I’m talking about.)

The Infamous Seinfeld-Gates Ad

The Seinfeld/Gates ad is so laughably bad that people are already building indexes of the negative reactions, less than 24 hours after it launched.

I have my own take on it.

Gates is the most recognizable geek on the planet. For most non-techies, he is the archetype of geekhood.

What kind of name recognition does Steve Ballmer have?  Outside of developers, developers, developers, and developers.  Would a silver-haired manager ever use him for a cheesy business analogy in a meeting?  Nope. Blank looks all around.  Tiger Woods and Bill Gates make good metaphors. Steve Ballmer doesn’t.

Ray Ozzie? Not a chance. Even most techies don’t know who Ozzie is.

This commercial wasn’t about churros, The Conquistador, or briefs riding up. It was all about one line.

"Brain meld".

It slipped by fast, but that was it. That was the line where billg@microsoft.com began the public torch-passing ceremony.

A couple more spots, and we’ll see either Ballmer or Ozzie entering the plot. Then we get the handoff, where John Q. Public is now meant to understand, "OK, Bill Gates has retired, but he’s passed his wireframe glasses and nervous tics on to this guy."

Seriously, it’s torch-passing.  Don’t believe me? You will when you see Ballmer air-running past a giant BSOD in the final ad.

In Korean

"Release It" has now been translated into Korean. I just received three copies of a work that’s hauntingly familiar, but totally opaque to me.

I kind of wonder how the pop-culture jokes came through.  I bet C3PO and R2D2 made it OK, but I wonder whether "dodge, duck, dip, dive, and dodge" made it past the Korean copy editor.  (For that matter, I’m faintly surprised it made it past the English copy editor.)