Wide Awake Developers


Flash Mobs and TCP/IP Connections

In Release It!, I talk about users and the harm they do to our systems.  One of the toughest types of user to deal with is the flash mob.  A flash mob often results from Attacks of Self-Denial, like when you suddenly offer a $3000 laptop for $300 by mistake.

When a flash mob starts to arrive, you will suddenly see a surge of TCP/IP connection requests at your load-distribution layer.  If the mob arrives slowly enough (fewer than 1,000 connections per second) then the app servers will be hurt the most.  For a really fast mob, like when your site hits the top spot on digg.com, you can get way more than 1,000 connections per second.  This puts the hurt on your web servers.

As the TCP/IP connection requests arrive, the OS queues them for servicing by the application.  On most modern stacks the kernel completes the handshake itself (SYN, SYN/ACK, then the client's final ACK) and parks the established connection in the listen queue.  When the application gets around to calling "accept" on the server socket, it takes the next established connection off that queue.  At that point, the server hands the connection off to a worker thread to process the request.  Meanwhile, the thread that accepted the connection goes back to accept the next one.
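
Here is a minimal sketch of that accept-and-dispatch loop in Java. It is not how Tomcat or any other container implements it; the class name AcceptLoop, the canned "hello" reply, and the constructor parameters are made up for illustration. The backlog argument is only a hint to the OS about how many pending connections to queue.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical example class: one acceptor thread, a fixed pool of workers.
final class AcceptLoop implements AutoCloseable {
    private final ServerSocket server;
    private final ExecutorService workers;

    AcceptLoop(int port, int backlog, int workerCount) throws IOException {
        // backlog is a hint; the OS may cap or adjust the real queue length.
        this.server = new ServerSocket(port, backlog, InetAddress.getLoopbackAddress());
        this.workers = Executors.newFixedThreadPool(workerCount);
        Thread acceptor = new Thread(this::acceptForever, "acceptor");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return server.getLocalPort();
    }

    // The acceptor does as little as possible: accept, hand off, go back for the next one.
    private void acceptForever() {
        while (!server.isClosed()) {
            try {
                Socket connection = server.accept();
                try {
                    workers.execute(() -> handle(connection));
                } catch (RejectedExecutionException e) {
                    connection.close(); // pool already shut down; drop the connection
                }
            } catch (IOException e) {
                // accept() throws once the server socket is closed; the loop condition ends us.
            }
        }
    }

    // Stand-in for real request processing: write one line and close.
    private void handle(Socket connection) {
        try (Socket c = connection; OutputStream out = c.getOutputStream()) {
            out.write("hello\n".getBytes(StandardCharsets.US_ASCII));
        } catch (IOException ignored) {
            // The client went away; nothing useful to do in this sketch.
        }
    }

    @Override
    public void close() throws IOException {
        server.close();
        workers.shutdown();
    }
}
```

Something like new AcceptLoop(8080, 128, 16) would listen on port 8080 with a requested backlog of 128 and 16 workers. The point of the design is that the acceptor thread never does request work itself, so the only thing limiting the accept rate is how fast it can loop.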

When a flash mob arrives, the connection requests come in faster than the application can accept and dispatch them.  The TCP/IP stack protects itself by limiting the number of pending connection requests, so the queue grows until it hits that limit and the stack has to start refusing connection requests.  At that point, your server will be returning intermittent errors and you're already failing.

The solution is much easier said than done: accept and dispatch connections faster than they arrive.

Filip Hanik compares some popular open-source servlet containers to see how well they stand up to floods of connection requests.  In particular, he demonstrates the value of Tomcat 6's new NIO connector.  Thanks to some very careful coding, this connector can accept 4,000 connections in 4 seconds on one server.  Ultimately, he gets it to accept 16,000 concurrent connections on a single server.  (Not surprisingly, RAM becomes the limiting factor.)

It's not clear that these connections can actually be serviced at that point, but that's a story for another day.

Comments

I love the term "Attacks of Self-Denial". I've been in a lot of places that are excellent at this, and I've seen it both in the web tier (e.g. sending out advertising e-mails to all customers at once) and in batch processes (e.g. starting up processes that take 1 hr+ to finish with 10 minutes between them). This is one of the areas where the communication breakdown between business and technical people is huge.

More on topic, there is a point made on that page which is really interesting to this discussion -- namely, why bother accepting connections that you won't have the power to service? Sure, you can connect to them at the stack level, but if you can't do anything with the connection, who cares? You could probably expand the definition of "serviceable connections" by adding in a filter at the app server layer which keeps track of the rate of incoming connections and redirects to a "We're Overloaded!" page, and that would at least provide a slightly better customer experience -- but at the end of the day, given enough bulls I can trample any runner.

See you at OTUG, BTW -- I'd like to grab you there to chat a bit.

Robert,

I coined the term "Attack of Self-Denial" for my book. It strongly resembles a Distributed Denial of Service (DDoS) attack, except the traffic is legitimate and you instigate it yourself.

I totally agree that you're better off directing connections you can't serve to some polite refusal page. It has to be done way before the app server, though. The most likely place to do it is in a hardware load balancer. If you use Akamai, you can even rig up a dynamic control system that lets your app handshake with Akamai to tell Akamai to divert users.

I also think this kind of self-awareness is vitally important for service-oriented architectures. No matter how big you build your service, there will come a time when it cannot respond within an acceptable window. At that point, you can take one of two approaches: the caller can time out and treat it as a system error--which makes the user wait for the timeout period--or you can have the service provider monitor its own response time and provide fast errors rather than slow timeouts. (And, by the way, having the caller time out still puts more load on the service provider, exactly when it least needs additional load. Worse yet, the caller will end up abandoning the connection, thus totally wasting the work.) This combines the "Handshaking" and "Fail Fast" stability patterns from "Release It!".
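
As a rough sketch of the second approach, here is a Java guard that tracks a smoothed response time and refuses new work immediately once that average exceeds a limit. Everything here is hypothetical: the names FailFastGuard and OverloadedException, the 80/20 smoothing weights, and the crude recovery rule (each rejection decays the average by 10% so that traffic is eventually let back in) are assumptions for illustration, not code from the book.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class: the provider watches its own latency and fails fast.
final class FailFastGuard {
    static final class OverloadedException extends RuntimeException {
        OverloadedException(String message) {
            super(message);
        }
    }

    private final long maxAcceptableNanos;
    private final AtomicLong smoothedNanos = new AtomicLong(0);

    FailFastGuard(long maxAcceptableMillis) {
        this.maxAcceptableNanos = TimeUnit.MILLISECONDS.toNanos(maxAcceptableMillis);
    }

    // Runs the work unless recent response times say we are already too slow.
    <T> T call(Callable<T> work) throws Exception {
        if (smoothedNanos.get() > maxAcceptableNanos) {
            // Assumed recovery rule: decay the average a little on every rejection.
            smoothedNanos.updateAndGet(old -> old * 9 / 10);
            throw new OverloadedException("responding too slowly; try again later");
        }
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    // Exponential moving average: 80% old value, 20% new sample.
    void record(long elapsedNanos) {
        smoothedNanos.updateAndGet(old -> old == 0 ? elapsedNanos : (old * 4 + elapsedNanos) / 5);
    }
}
```

A caller that gets OverloadedException knows right away that the service is in trouble, instead of holding a connection open until its own timeout fires.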
