Wide Awake Developers

Flash Mobs and TCP/IP Connections

In Release It!, I talk about users and the harm they do to our systems.  One of the toughest types of user to deal with is the flash mob.  A flash mob often results from Attacks of Self-Denial, like when you suddenly offer a $3000 laptop for $300 by mistake.

When a flash mob starts to arrive, you will suddenly see a surge of TCP/IP connection requests at your load-distribution layer.  If the mob arrives slowly enough (fewer than 1,000 connections per second) then the app servers will be hurt the most.  For a really fast mob, like when your site hits the top spot on digg.com, you can get way more than 1,000 connections per second.  This puts the hurt on your web servers.

As the TCP/IP connection requests arrive, the OS queues them for servicing by the application.  On most modern stacks, the server’s TCP/IP stack answers each SYN with a SYN/ACK on its own, and the connection is established once the client’s final ACK arrives.  The established connection then waits in the listen queue until the application gets around to calling "accept" on the server socket.  At that point, the server hands the established connection off to a worker thread to process the request.  Meanwhile, the thread that accepted the connection goes back to accept the next one.
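Here is a minimal sketch of that accept-and-dispatch loop, in Python rather than any particular servlet container. The names `handle`, `serve`, and `start_server` are made up for illustration, and the "protocol" is just an upper-casing echo. The point is the shape: one thread does nothing but accept and hand off, and a small pool of workers does the actual request processing.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(conn: socket.socket) -> None:
    """Worker: read one request, echo it back in upper case, then close."""
    with conn:
        data = conn.recv(1024)
        if data:
            conn.sendall(data.upper())


def serve(listener: socket.socket, workers: ThreadPoolExecutor) -> None:
    """Accept loop: take a connection, hand it to a worker, go back for the next."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            return  # the listener was closed, so stop accepting
        workers.submit(handle, conn)


def start_server(backlog: int = 128, pool_size: int = 8):
    """Start the accept loop on a background thread (hypothetical helper)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(backlog)         # backlog bounds the pending-connection queue
    workers = ThreadPoolExecutor(max_workers=pool_size)
    threading.Thread(target=serve, args=(listener, workers), daemon=True).start()
    return listener, workers
```

The accept thread never touches request data, so its loop stays as short as possible. That is the part that has to keep up when the mob shows up.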

Well, when a flash mob arrives, the connection requests arrive faster than the application can accept and dispatch them.  The TCP/IP stack protects itself by limiting the number of pending connection requests, so the queue grows until it hits that limit and the stack has to start refusing connection requests.  At that point, your server will be returning intermittent errors and you’re already failing.

The solution is much easier said than done: accept and dispatch connections faster than they arrive.
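To see why the accept rate matters more than the queue size, here is a toy simulation of a bounded pending-connection queue. It is a sketch under simplifying assumptions: time moves in discrete ticks, arrivals land all at once at the start of each tick, and the function name and parameters are invented for this example. It does not model any real TCP/IP stack.

```python
from typing import Tuple


def simulate_flash_mob(arrivals_per_tick: int, accepts_per_tick: int,
                       backlog: int, ticks: int) -> Tuple[int, int]:
    """Return (accepted, refused) for a bounded pending-connection queue."""
    pending = 0
    accepted = 0
    refused = 0
    for _ in range(ticks):
        # New connection requests join the queue until it is full.
        room = backlog - pending
        queued = min(arrivals_per_tick, room)
        refused += arrivals_per_tick - queued
        pending += queued
        # The application accepts as many as it can during this tick.
        drained = min(accepts_per_tick, pending)
        pending -= drained
        accepted += drained
    return accepted, refused


print(simulate_flash_mob(100, 150, 500, 10))  # (1000, 0)
print(simulate_flash_mob(200, 150, 500, 10))  # (1500, 150)
```

In the second run the queue absorbs the overload for seven ticks, then refusals start on the eighth. A bigger backlog only delays that moment. As long as arrivals outpace accepts, the queue fills eventually.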

Filip Hanik compares some popular open-source servlet containers to see how well they stand up to floods of connection requests.  In particular, he demonstrates the value of Tomcat 6’s new NIO connector.  Thanks to some very careful coding, this connector can accept 4,000 connections in 4 seconds on one server.  Ultimately, he gets it to accept 16,000 concurrent connections on a single server.  (Not surprisingly, RAM becomes the limiting factor.)

It’s not clear that these connections can actually be serviced at that point, but that’s a story for another day.

Release It! Is Released!

"Release It!" has been officially announced in this press release.  Andy Hunt, my editor, also posted announcements to several mailing lists.

It’s been a long road, so I’m thrilled to see this release.

When you release a new software system, that’s not the end of the process, but just the beginning of the system’s life.  It is the same thing here.  Though it’s taken me two years to get this book done and on the market, this is not the end of the book’s creation, but the beginning of its life.


Self-Inflicted Wounds

My friend and colleague Paul Lord said, "Good marketing can kill you at any time."

He was describing a failure mode that I discuss in Release It!: Design and Deploy Production-Ready Software as "Attacks of Self-Denial".  These have all the characteristics of a distributed denial-of-service attack (DDoS), except that a company asks for it.  No, I’m not blaming the victim for electronic vandalism… I mean, they actually ask for the attack.

The anti-pattern goes something like this: marketing conceives of a brilliant promotion, which they send to 10,000 customers.  Some of those 10,000 pass the offer along to their friends.  Some of them post it to sites like FatWallet or TechBargains.  On the appointed day, hour, and minute, the site has a date with destiny as a million or more potential customers hit the deep link that marketing sent around in the email.  You know, the one that bypasses the content distribution network, embeds a session ID in the URL, and uses SSL?

Nearly every retailer I know has done this to themselves at one point.  Two holidays ago, one of my clients did it to themselves when they announced that Xbox 360 preorders would begin at a certain day and time.  Between actual customers and the amateur shop-bots that the tech-savvy segment cobbled together, the site got crushed.  (Yes, this was one where marketing sent the deep link that bypassed all the caching and bot-traps.)

Last holiday, Amazon did it to themselves when they discounted the Xbox 360 by $300.  (What is it about the Xbox 360?)  They offered a thousand units at the discounted price and got ten million shoppers.  All of Amazon was inaccessible for at least 20 minutes.  (It may not sound like much, but some estimates say Amazon generates $1,000,000 per hour during the holiday season, so that 20-minute outage, a third of an hour, probably cost them more than $300,000!)

In Release It!, I discuss some non-technical ways to mitigate this behavior, as well as some design and architecture patterns you can apply to minimize damage when one of these Attacks of Self-Denial occurs.

Design Patterns in Real Life

I’ve seen walking cliches before.  There was this one time in the Skyway that I actually saw a guy with a white cane being led by a woman with huge dark sunglasses and a guide dog.  Today, though, I realized I was watching a design pattern played out with people instead of objects.

I’ve used the Reactor pattern in my software before.  It’s particularly helpful when you combine it with non-blocking multiplexed I/O, such as Java’s NIO package.

Consider a server application such as a web server or mail transfer agent.  A client connects to a socket on the server to send a request. The server and client talk back and forth a little bit, then the server either processes or denies the client’s request.

If the server just used one thread, then it could only handle a single client at a time.  That’s not likely to make a winning product. Instead, the server uses multiple threads to handle many client connections.

The obvious approach is to have one thread handle each connection.  In other words, the server keeps a pool of threads that are ready and waiting for a request.  Each time through its main loop, the server gets a thread from the pool and, on that thread, calls the socket "accept" method.  If there’s already a client connection request waiting, then "accept" returns right away.  If not, the thread blocks until a client connects.  Either way, once "accept" returns, the server’s thread has an open connection to a client.

At that point, the thread goes on to read from the socket (which blocks again) and, depending on the protocol, may write a response or exchange more protocol handshaking.  Eventually, the demands of protocol satisfied, the client and server say goodbye and each end closes the socket.  The worker thread pulls a Gordon Freeman and disappears into the pool until it gets called up for duty again.

It’s a simple, obvious model.  It’s also really inefficient.  Any given thread spends most of its life doing nothing.  It’s either blocked in the pool, waiting for work, or it’s blocked on a socket "accept", "read", or "write" call.

If you think about it, you’ll also see that the naive server can handle only as many connections as it has threads.  To handle more connections, it must fork more threads.  Forking threads is expensive in two ways.  First, starting the thread itself is slow.  Second, each thread requires a certain amount of scheduling overhead.  Modern JVMs scale well to large numbers of threads, but sooner or later, you’ll still hit the ceiling.

I won’t go into all the details of non-blocking I/O here.  (I can point you to a decent article on the subject, though.)  Its greatest benefit is you do not need to dedicate a thread to each connection.  Instead, a much smaller pool of threads can be allocated, as needed, to handle individual steps of the protocol.  In other words, thread 13 doesn’t necessarily handle the whole conversation. Instead, thread 4 might accept the connection, thread 29 reads the initial request, thread 17 starts writing the response and thread 99 finishes sending the response.

This model employs threads much more efficiently.  It also scales to many more concurrent requests.  Bookkeeping becomes a hassle, though. Keeping track of the state of the protocol when each thread only does a little bit with the conversation becomes a challenge.  Finally, the (hideously broken) multithreading restrictions in Java’s "selector" API make fully multiplexed threads impossible.

The Reactor pattern predates Java’s NIO, but works very well here.  It uses a single thread, called the Acceptor, to await incoming "events". This one thread sleeps until any of the connections needs service, whether because of an incoming connection request, a socket ready to read, or a socket ready to write.  As soon as one of these events occurs, the Acceptor hands the event off to a dispatcher (worker) thread that then processes the event.

You can visualize this by sitting in a TGI Friday’s or Chili’s restaurant.  (I’m fond of the crowded little ones inside airports. You know, the ones with a third of the regular menu and a line stretching out the door.  Like a home away from home for me lately.) The "greeter" accepts incoming connections (people) and hands them off to a "worker" (server).  The greeter is then ready for the next incoming request.  (The line out the door is the listen queue, in case you’re keeping score.)  When the kitchen delivers the food, it doesn’t wait for the original worker thread.  Instead, a different worker thread (a runner) brings the food out to the table.
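For the code-minded, here is a rough sketch of a Reactor-style event loop, using Python's standard `selectors` module rather than Java's NIO. The `Reactor` class and its method names are invented for this example. To keep it short, it simplifies in one important way: the loop calls each handler directly on its own thread instead of handing the event to a separate worker thread. It also assumes replies are small enough to fit the socket's send buffer.

```python
import selectors
import socket


class Reactor:
    """One thread waits for ready sockets and dispatches each to its handler."""

    def __init__(self) -> None:
        self.selector = selectors.DefaultSelector()
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind(("127.0.0.1", 0))  # port 0: OS picks a free port
        self.listener.listen()
        self.listener.setblocking(False)
        # The handler travels with the registration as the key's data.
        self.selector.register(self.listener, selectors.EVENT_READ, self.on_accept)

    @property
    def port(self) -> int:
        return self.listener.getsockname()[1]

    def on_accept(self, sock: socket.socket) -> None:
        """The 'greeter': take the new connection and register it for reads."""
        conn, _addr = sock.accept()
        conn.setblocking(False)
        self.selector.register(conn, selectors.EVENT_READ, self.on_read)

    def on_read(self, conn: socket.socket) -> None:
        """Handle one readable event: echo in upper case, or clean up on EOF."""
        data = conn.recv(1024)
        if data:
            conn.sendall(data.upper())
        else:
            self.selector.unregister(conn)
            conn.close()

    def run_once(self, timeout: float = 1.0) -> int:
        """Dispatch every event that is ready; return how many were handled."""
        events = self.selector.select(timeout)
        for key, _mask in events:
            key.data(key.fileobj)
        return len(events)

    def close(self) -> None:
        self.selector.unregister(self.listener)
        self.listener.close()
        self.selector.close()
```

Calling `run_once()` in a loop gives you the whole server. No thread is parked on any one connection. A socket only gets attention when it has something to say, much like a table only gets a runner when the food is up.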

I’ll keep my eyes open for other examples of object-oriented design patterns in real life–though I don’t expect to see many based on polymorphism.

Another Path to a Killer Product

Give individuals powers once reserved for masses

Here’s a common trajectory:

1. Something is so expensive that groups (or even an entire government) have to share it.  Think about mainframe computers in the Sixties.

2. The price comes down until a committed individual can own one.  Think homebrew computers in the Seventies.  The "average" person  wouldn’t own one, but the dedicated geek-hobbyist would.

3. The price comes down until the average individual can own one.  Think PCs in the Eighties.

4. The price comes down until the average person owns dozens.  PCs, game consoles, MP3 players, GPS navigators, laptops, embedded processors in toasters and cars.  An average person may have half a dozen devices that once were considered computers.

Along the way, the product first gains broader and broader functionality, then becomes more specific and dedicated.

Telephones, radios and televisions all followed the same trajectory.  You would probably call these moderately successful products.

So: find something so expensive that groups have to purchase and share it.  Make it cheap enough for a private individual.

Quantum Manipulations

I work in information technology, but my first love is science.  Particularly the hard sciences of physics and cosmology.

There has been a series of experiments over the last few years that have demonstrated quantum manipulations of light and matter that approach the macroscopic realm.

A recent result from Harvard (HT to Dion Stewart for the link) has gotten a lot of (incorrect) play.  It involves absorbing photons with a Bose-Einstein condensate, then reproducing identical photons at some distance in time and space.  I’ve been reading about these experiments with a lot of interest, along with the experiments going the "other" direction: supraluminal group phase travel.

I wish the science writers would find a new metaphor, though.  They all talk in terms of "stopping light" or "speeding up light".  None of these have to do with changing the speed of light, either up or down.  This is about photons, not the speed of light.

In fact, this latest one is even more interesting when you view it in terms of the "computational universe" theory of Seth Lloyd.  What they’ve done is capture the complete quantum state of the photons, somehow ‘imprinted’ on the atoms in the condensate, then recreate the photons from that quantum state.

This isn’t mere matter-energy conversion as the headlines have said.  It’s something much more.

The Bose-Einstein condensate can be described as a phase of matter colder than a solid.  It’s much weirder than that, though.  In the condensate, all the particles in all the atoms achieve a single wavefunction.  You can describe the entire collection of protons, neutrons and electrons as if it were one big particle with its own wavefunction.

This experiment with the photons shows that the photons’ wavefunctions can be superposed with the wavefunction of the condensate, then later extracted to separate the photons from the condensate.

The articles somewhat misrepresent this as being about converting light (energy) to matter, but it’s really about converting the photon particles to pure information, then using that information to recreate identical particles elsewhere.  Yikes!

A Path to a Product

Here’s a "can’t lose" way to identify a new product: Enable people to plan ahead less. 

Take cell phones.  In the old days, you had to know where you were going before you left.  You had to make reservations from home.  You had to arrange a time and place to meet your kids at Disney World.

Now, you can call "information" to get the number of a restaurant, so you don’t have to decide where you’re going until the last possible minute.  You can call the restaurant for reservations from your car while you’re already on your way.

With cell phones, your family can split up at a theme park without pre-arranging a meeting place or time.

Cell phones let you improvise with success.  Huge hit.

GPS navigation in cars is another great example.  No more calling AAA weeks before your trip to get "TripTik" maps.  No more planning your route on a road atlas.  Just get in your car, pick a destination and start driving.  You don’t even have to know where to get gas or food along the way.

Credit and debit cards let you go places without planning ahead and carrying enough cash, gold, or jewels to pay your way.

The Web is the ultimate "preparation avoidance" tool.  No matter what you’re doing, if you have an always-on ‘Net connection, you can improvise your way through meetings, debates, social engagements, and work situations.

Find another product that lets procrastinators succeed, and you’ve got a sure winner.  There’s nothing that people love more than the personal liberation of not planning ahead.


How to Become an “Architect”

Over at The Server Side, there’s a discussion about how to become an "architect".  Though TSS comments often turn into a cesspool, I couldn’t resist adding my own two cents.

I should also add that the title "architect" is vastly overused.  It’s tossed around like a job grade on the technical ladder: associate developer, developer, senior developer, architect.  If you talk to a consulting firm, it goes more like: senior consultant (1 - 2 years experience), architect (3 - 5 years experience), senior technical architect (5+ years experience).  Then again, I may just be too cynical.

The architecture of a system should have several qualities.  It should be:

  1. Shared.  All developers on the team should have more or less the same vision of the structure and shape of the overall system.
  2. Incremental.  Grand architecture projects lead only to grand failures.
  3. Adaptable. Successful architectures can be used for purposes beyond their designers’ original intentions.  (Examples: Unix pipes, HTTP, Smalltalk)
  4. Visible.  The "sacred, invisible architecture" will fall into disuse and disrepair.  It will not outlive its creator’s tenure or interest.

Is the designated "architect" the only one who can produce these qualities?  Certainly not.  He/she should be the steward of the system, however, leading the team toward these qualities, along with the other -ilities, of course.

Finally, I think the most important qualification of an architect should be: someone who has created more than one system and lived with it in production.  Note that automatically implies that the architect must have at least delivered systems into production.  I’ve run into "architects" who’ve never had a project actually make it into production, or if they have, they’ve rolled off the project—again with the consultants—just as Release 1.0 went out the door.

In other words, architects should have scars. 

Planning to Support Operations

In 2005, I was on a team doing application development for a system that would be deployed to 600 locations. About half of those locations would not have network connections. We knew right away that deploying our application would be key, particularly since it is a "rich-client" application. (What we used to call a "fat client", before they became cool again.) Deployment had to be done by store associates, not IT. It had to be safe, so that a failed deployment could be rolled back before the store opened for business the next day. We spent nearly half of an iteration setting up the installation scripts and configuration. We set our continuous build server up to create the "setup.exe" files on every build. We did hundreds of test installations in our test environment.

Operations said that our software was "the easiest installation we’ve ever had." Still, that wasn’t the end of it. After the first update went out, we asked operations what could be done to improve the upgrade process. Over the next three releases, we made numerous improvements to the installers:

  • Make one "setup.exe" that can install either a server or a client, and have the installer itself figure out which one to do.
  • Abort the install if the application is still running. This turned out to be particularly important on the server.
  • Don’t allow the user to launch the application twice. Very hard to implement in Java. We were fortunate to find an installer package that made this a check-box feature in the build configuration file!
  • Don’t show a blank Windows command prompt window. (An artifact of our original .cmd scripts that were launching the application.)
  • Create separate installation discs for the two different store brands.
  • When spawning a secondary application, force its window to the front, avoiding the appearance of a hang if the user accidentally gives focus to the original window.

These changes reduced support call volume by nearly 50%.
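For the curious, the "don’t launch twice" check can be approximated with an exclusive lock file. This is a sketch in Python, not what our installer package did (I don’t know its internals), and the `SingleInstance` and `AlreadyRunning` names are invented for this example. The same file could also tell an installer that the application is still running.

```python
import os


class AlreadyRunning(Exception):
    """Raised when another instance already holds the lock file."""


class SingleInstance:
    """Hold an exclusive lock file for as long as the application runs."""

    def __init__(self, lock_path: str) -> None:
        self.lock_path = lock_path
        self._fd = None

    def acquire(self) -> None:
        try:
            # O_EXCL makes creation fail if the file already exists.
            self._fd = os.open(self.lock_path,
                               os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise AlreadyRunning(self.lock_path) from None
        # Record the owner's process ID to help diagnose a stale lock.
        os.write(self._fd, str(os.getpid()).encode())

    def release(self) -> None:
        if self._fd is not None:
            os.close(self._fd)
            os.remove(self.lock_path)
            self._fd = None
```

The weak spot is a crash: if the process dies without calling `release()`, the stale file blocks the next launch until something cleans it up. A production version would need a recovery rule for that case, such as checking whether the recorded process ID is still alive.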

My point is not to brag about what a great job we did. (Though we did a great job.) To keep improving our support for operations, we deliberately set aside a portion of our team capacity each iteration. Operations had an open invitation to our iteration planning meetings, where they could prioritize and select story cards the same as our other stakeholders. In this manner, we explicitly included Operations as a stakeholder in application construction. They consistently brought us ideas and requests that we, as developers, would not have come up with.

Furthermore, we forged a strong bond with Operations. When issues arose—as they always will—we avoided all of the usual finger-pointing. We reacted as one team, instead of two disparate teams trying to avoid responsibility for the problems. I attribute that partly to the high level of professionalism in both development and operations, and partly to the strong relationship we created through the entire development cycle.

"Us" and "Them"

As a consultant, I’ve joined a lot of projects, usually not right when the team is forming. Over the years, I’ve developed a few heuristics that tell me a lot about the psychological health of the team. Who lunches together? When someone says "whole team meeting," who is invited? Listen for the "us and them" language. How inclusive is the "us" and who is relegated to "them?" These simple observations speak volumes about the perception of the development team. You can see who they consider their stakeholders, their allies, and their opponents.

Ten years ago, for example, the users were always "them." Testing and QA was always "them." Today, particularly on agile teams, testers and users often get "us" status. (As an aside, this may be why startups show such great productivity in the early days. The company isn’t big enough to allow "us" and "them" thinking to set in. Of course, the converse is true as well: us and them thinking in a startup might be a failure indicator to watch out for!) Watch out if an "us" suddenly becomes "them." Trouble is brewing!

Any conversation can create a "happy accident": some understanding that obviates a requirement, avoids a potential bug, reduces cost, or improves the outcome in some other way. Conversations prevented thanks to an armed-camp mentality are opportunities lost.

One of the most persistent and perplexing "us" and "them" divisions I see is between development and operations. Maybe it’s due to the high org-chart distance (OCD) between development groups and operations groups. Maybe it’s because development doesn’t tend to plan as far ahead as operations does. Maybe it’s just due to a long-term dynamic of requests and refusals that sets each new conversation up for conflict. Whatever the cause, two groups that should absolutely be working as partners often end up in conflict, or worse, barely speaking at all.

This has serious consequences. People in the "us" tent get their requests built very quickly and accurately. People in the "them" tent get told to write specifications. Specifications have their place. Specifications are great for the fourth or fifth iteration of a well-defined process. During development, though, ideas need to be explored, not specified. If a developer has a vague idea about using the storage area network to rapidly move large volumes of data from the content management system into production, but he doesn’t know how to write the request, the idea will wither on the vine.

The development-operations divide virtually ensures that applications will not be transitioned to operations as effectively as possible. Some vital bits of knowledge just don’t fit into a document template. For example, developers have knowledge about the internals of the application that can help diagnose and recover from system failures. (Developer: "Oh, when you see all the request handling threads blocked inside the K2 client library, just bounce the search servers. The app will come right back." Operations: "Roger that. What’s a thread?") These gaps in knowledge degrade uptime, either by extending outages or preventing operations from intervening. If the company culture is at all political, one or two incidents of downtime will be enough to start the finger-pointing between development and operations. Once that corrosive dynamic gets started, nothing short of changing the personnel or the leadership will stop it.