Wide Awake Developers


Circuit Breaker in Scala

FaKod (I think that translates as "The Fatalistic Coder"?) has written a nice Scala implementation of the Circuit Breaker pattern, and even better, has made it available on GitHub.

Check out http://github.com/FaKod/Circuit-Breaker-for-Scala for the code.

The Circuit Breaker can be mixed in to any type. See http://wiki.github.com/FaKod/Circuit-Breaker-for-Scala/ for an example of usage.
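
For readers who haven't met the pattern, here is a minimal sketch of the idea in Python. It is not FaKod's Scala API: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration. The breaker counts consecutive failures, fails fast while it is open, and lets one trial call through once the timeout has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # Hypothetical names and defaults; not taken from any published library.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the troubled dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping an integration point then looks like `breaker.call(fetch_inventory, sku)`, where `fetch_inventory` stands in for whatever remote call you want to protect.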

GMail Outage Was a Chain Reaction

Google has published an explanation of the widespread GMail outage from September 1st. In this explanation, they trace the root cause to a layer of "request routers":

...a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded.

This perfectly describes the "Chain Reaction" stability antipattern from Release It!
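
The dynamics in that quote can be reproduced with a toy simulation. This is a sketch under one big assumption: load is spread evenly across whichever routers are still accepting traffic. Google's explanation doesn't describe the real balancing policy, and the function name and the numbers are made up.

```python
def chain_reaction(total_load, capacities):
    """Return (routers that dropped out, in order; routers still alive).

    Assumes load is split evenly among the surviving routers.
    """
    alive = dict(enumerate(capacities))
    failed = []
    while alive:
        share = total_load / len(alive)
        overloaded = [i for i, cap in alive.items() if share > cap]
        if not overloaded:
            break  # everyone left can carry their share
        for i in overloaded:
            del alive[i]
            failed.append(i)
    return failed, sorted(alive)


# Ten routers, two of them slightly weaker than the rest.
capacities = [85, 85] + [100] * 8

# At a load of 800 each router carries 80, and nothing fails.
print(chain_reaction(800, capacities))

# At 900 the two weak routers tip over, the other eight inherit
# 112.5 each, and the whole layer goes down.
print(chain_reaction(900, capacities))
```

The gap between "fine" and "everything is down" is a little over 10% of total load, which is what makes this antipattern so unpleasant.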

Minireview: Beginning Scala

As you can probably tell from my recent posts, I've been learning Scala. I recently dug into another Scala book, Beginning Scala by David Pollak.

Beginning Scala is a nice, gentle introduction to this language. It takes a gradual, example-driven approach that emphasizes running code early. This makes it a good intro for people who want to use the language for applications first, then worry about creating frameworks later.

Don't let that fool you, though. Pollak gets to the sophisticated parts soon enough. I particularly like an example of creating a new "control structure" to execute stuff in the context of a JDBC connection. This puts some meat on the argument that Scala is a "scalable language." Where other languages either implement this as a keyword (as in Groovy's "with") or a framework (Spring's "templates"), here it can be added with one page of example code.

Beginning Scala also has a very thorough discussion of actors. I appreciate this, because actors were my main motivation for learning Scala in the first place.

Pollak separates the act of consuming a library from that of creating a library. He advises us to worry most about types, traits, co- and contravariance, etc. mainly when we are creating libraries. True to this notion, chapter 7 is called "Traits and Types and Gnarly Stuff for Architects". It doesn't sound like much fun, but it is important material. I find that Scala makes me think more about the type system than other languages. It's strongly, and statically, typed. (So much so, in fact, that it makes me realize just how loose Java's own type system is.) As such, it pays to have a firm understanding of how code turns into types. Scala has a rich set of tools for building an expressive type system, but there is also complexity there. Checking in at 60 pages, this chapter covers Scala's tools along with guidance on good styles and idioms.

Interestingly, although there is a Lift logo on the cover, there's nothing about Lift in the book itself. Considering that Pollak is the creator of Lift, it's curious that this book doesn't deal with it. Perhaps that's being left for another title.

Overall, I endorse Beginning Scala.

Beautiful Architecture

O'Reilly has released "Beautiful Architecture," a compilation of essays by software and system architects. I'm happy to announce that I have a chapter in this book. The finished book is shipping now, and available through Safari. I think the whole thing has turned out amazingly well, both instructive and interesting.

One of the editors, Diomidis Spinellis, has posted an excellent description and summary.

In Korean

"Release It" has now been translated into Korean. I just received three copies of a work that's hauntingly familiar, but totally opaque to me.

I kind of wonder how the pop-culture jokes came through.  I bet C3PO and R2D2 made it OK, but I wonder whether "dodge, duck, dip, dive, and dodge" made it past the Korean copy editor.  (For that matter, I'm faintly surprised it made it past the English copy editor.)

97 Things Every Software Architect Should Know

O'Reilly is creating a new line of "community-authored" books. One of them is called "97 Things Every Software Architect Should Know".

All of the "97 Things" books will be created by wiki, with the best entries being selected from all the wiki contributions.

I've contributed several axioms that have been selected for the book:

Long-time readers of this blog may recognize some of these themes.

You can see the whole wiki here.


Word Cloud Bandwagon

Wordle has been meming its way around the 'Net lately.  Figured I'd join the crowd by doing a word cloud for Release It.  This is from the preface.


Considering that this is just from fairly simple text analysis, I'm surprised at how accurately it represents the key concerns. "Software" and "money" have roughly equal prominence. "Life" appears near the middle, along with "excitement", "revenue", "production" and "systems". Not bad for an algorithm.

Release It has won a Jolt Productivity award

It's an honor and a thrill for me to report that Release It received a Jolt Productivity award!


Software Failure Takes Down Blackberry Services

Anyone who's addicted to a Blackberry already knows about Monday's four-hour outage. For some of us, the Blackberry isn't just an electronic leash, it's part of our business operations.

Like cell phones, Blackberries have a huge, hidden infrastructure behind them. Corporate Blackberry Enterprise Servers (BES) relay email, calendar, and contact information through RIM's infrastructure, out through the wireless carriers. It was RIM's own infrastructure that suffered from intermittent failures during the outage.

Data Center Knowledge reports that the outage was caused by a failed software upgrade.

Releases are risky. We use testing and QA to reduce the risk, but every line of new or modified code represents an unknown.

How can we reduce the risk of an upgrade? One way is to roll it out slowly. Companies with widely distributed point-of-sale (POS) systems know this. They never push a release out to every store at once. They start with one or two. If that works, they go up to a larger handful, maybe four to eight. After a couple of days, they'll roll it out to an entire district. It can take a week or more to roll the release out everywhere.

In the interim, there are plenty of checkpoints where the release can be rolled back.

I strongly recommend approaching Web site releases the same way. Roll the new release out to one or two servers in your farm. Let a fraction of your customers into the new release. Watch for performance regressions, capacity problems, and functional errors. Absolutely ensure that you can roll it back if you need to. Once it's "baked" for a while in production, then roll it to the remaining app servers.
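
The staged rollout just described can be sketched in a few lines. Everything here is hypothetical: `deploy`, `healthy`, and `rollback` are callbacks you would supply, and `healthy` stands in for the hours or days of watching production that a real "bake" period involves.

```python
def staged_rollout(servers, deploy, healthy, rollback, stage_sizes=(1, 2)):
    """Deploy in growing stages; undo everything at the first unhealthy check.

    Returns True if every server ends up on the new release.
    """
    remaining = list(servers)
    updated = []
    # After the small early stages, the final stage takes whatever is left.
    sizes = list(stage_sizes) + [len(remaining)]
    for size in sizes:
        stage, remaining = remaining[:size], remaining[size:]
        if not stage:
            break
        for server in stage:
            deploy(server)
            updated.append(server)
        # Checkpoint: every server touched so far must still look good.
        if not all(healthy(server) for server in updated):
            for server in reversed(updated):
                rollback(server)
            return False
    return True
```

The important design choice is that the check after each stage covers every server updated so far, not just the newest ones, so a slow-burning problem on the first server can still stop the rollout.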

This approach demands a few corollaries. First, your database updates have to be structured in a forward-compatible way, and they must always allow for rollback. There can be no irrevocable updates. Second, two versions of your software will be operating simultaneously. That means your integration protocols and static assets have to be able to accommodate both versions. I discuss specific strategies for each of these aspects in Release It.
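
As a small illustration of a forward-compatible database change, here is a sketch using Python's built-in `sqlite3` module. The `customers` table and the two "versions" of the application are invented for the example; the point is only that an additive, nullable column lets old and new code share one schema, and that rolling the application back requires no schema change.

```python
import sqlite3


def old_version_insert(conn, name):
    # The old release names its columns explicitly, so extra columns don't break it.
    conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))


def new_version_insert(conn, name, email):
    conn.execute(
        "INSERT INTO customers (name, email) VALUES (?, ?)", (name, email)
    )


def migrate_additively(conn):
    # Additive and nullable: the old code can ignore the column entirely.
    conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
migrate_additively(conn)

# Both versions run side by side against the migrated schema.
old_version_insert(conn, "Ada")
new_version_insert(conn, "Grace", "grace@example.com")
rows = conn.execute("SELECT name, email FROM customers ORDER BY id").fetchall()
print(rows)
```

Renaming or dropping a column in the same release would be the irrevocable kind of update; those changes can wait until no running version still depends on the old shape.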

Finally, an aside: RIM's statement about the outage isn't reflected anywhere on their site. Once again, if what you want is the latest true information about a company, the very last place to find it is the company's own web site. 

Tim Ross' C# Circuit Breaker

Tim Ross has published his implementation of the Circuit Breaker pattern from Release It, complete with unit tests.

I barely speak C#, so I'm not in any position to review his implementation, but I'm delighted to see it!


"Release It" is a Jolt Award Finalist

The Jolt Awards have been described as "the Oscar's of our industry". (Really. It's on the front page of the site.)  The list of past book winners reads like an essential library for the software practitioner. Even the finalists and runners-up are essential reading.

Release It has now joined the company of finalists. The competition is very tough... I've read "Beautiful Code" and "Manage It!", and both are excellent. I'll be on pins and needles until the awards ceremony on March 5th.  Honestly, though, I'm just thrilled to be in such good company.

Releasing a free SingleLineFormatter

A number of readers have asked me for reference implementations of the stability and capacity patterns.

I've begun to create some free implementations to go along with Release It. As of today, it just includes a drop-in formatter that you can use in place of the java.util.logging default (which is horrible).

This formatter keeps all the fields lined up in columns, including truncating the logger name and method name if necessary. A columnar format is much easier for the human eye to scan. We all have great pattern-matching machinery in our heads. I can't for the life of me understand why so many vendors work so hard to defeat it. The one thing that doesn't get stuffed into a column is a stack trace. It's good for a stack trace to interrupt the flow of the log file... that's something that you really want to pop out when scanning the file.
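
To show the columnar idea without reproducing the Java class, here is a rough analogue built on Python's standard `logging` module. It is not the SingleLineFormatter itself: the column widths, the timestamp format, and the choice to keep the tail of an over-long name are all assumptions.

```python
import logging


class ColumnFormatter(logging.Formatter):
    """Fixed-width columns for time, level, logger, and function; then the message."""

    NAME_WIDTH = 20  # assumed widths, chosen for illustration
    FUNC_WIDTH = 16

    @staticmethod
    def _fit(text, width):
        # Truncate from the left (the tail of a dotted name is usually the
        # informative part), then pad to the column width.
        return (text or "")[-width:].ljust(width)

    def format(self, record):
        line = "%s %-8s %s %s %s" % (
            self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            record.levelname,
            self._fit(record.name, self.NAME_WIDTH),
            self._fit(record.funcName, self.FUNC_WIDTH),
            record.getMessage(),
        )
        if record.exc_info:
            # Stack traces deliberately break out of the columns.
            line += "\n" + self.formatException(record.exc_info)
        return line
```

Attaching it is one line per handler, for example `handler.setFormatter(ColumnFormatter())`.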

It only takes a minute to plug in the SingleLineFormatter. Your admins will thank you for it.

Read about the library.

Download it as .zip or .tgz.

Normal Accidents

While I was writing Release It!, I was influenced by James R. Chiles's book Inviting Disaster. One of Chiles's sources is Normal Accidents, by Charles Perrow. I've just started reading, and even the first two pages offer great insight.

Normal Accidents describes systems that are inherently unstable, to the point that system failures are inevitable and should be expected.  These "normal" accidents result from systems that exhibit the characteristics of high "interactive complexity" and "tight coupling".

Interactive complexity refers to internal linkages, hidden from the view of operators. These invisible relations between components or subsystems produce multiple effects from a single cause.  They can also produce outcomes that do not seem to relate to their inputs.

In software systems, interactive complexity is endemic. Any time two programs share a server or database, they are linked. Any time a system contains a feedback loop, it inherently has higher interactive complexity. Feedback loops aren't always obvious.  For example, suppose a new software release consumes a fraction more CPU per transaction than before. That small increment might push the server from a non-contending regime into a contending one.  Once in contention, the added CPU usage creates more latency. That latency, and the increase in task-switching overhead, produces more latency. Positive feedback.
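
A back-of-the-envelope model shows how little extra CPU it takes. The sketch below uses the textbook single-server queue formula (M/M/1), where response time is service time divided by one minus utilization. That model is an assumption brought in for illustration; it ignores the task-switching overhead mentioned above, which would make matters worse, and the numbers are invented.

```python
def response_time(service_time, arrival_rate):
    """Mean response time of an M/M/1 queue, in the same unit as service_time."""
    utilization = service_time * arrival_rate
    if utilization >= 1.0:
        return float("inf")  # the queue grows without bound
    return service_time / (1.0 - utilization)


# 90 transactions per second, 10 ms of CPU each: about 0.1 s per transaction.
print(response_time(0.010, 90))

# The new release costs 5% more CPU per transaction: about 0.19 s.
print(response_time(0.0105, 90))

# At 20% more CPU the server can no longer keep up at all.
print(response_time(0.012, 90))
```

A 5% increase in CPU per transaction nearly doubles latency here, because the server was already running close to saturation.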

High interactive complexity leads operators to misunderstand the system and its warning signs. Thus misinformed, they act in ways that do not avert the crisis and may actually precipitate it. 

When processes happen very fast, and there is no way to isolate one part of the system from another, the system is tightly coupled.  Tight coupling allows small incidents to spread into large-scale failures.

Classic "web architecture" exhibits both high interactive complexity and tight coupling. Hence, we should expect "normal" accidents.  Uptime will be dominated by the occurrence of these accidents, rather than the individual probability of failure in each component.

The first section of Release It! deals exclusively with system stability.  It shows how to reduce coupling and diminish interactive complexity. 

Y B Slow?

I've long been a fan of the Firebug extension for Firefox.  It gives you great visibility into the ebb and flow of browser traffic.  It sure beats rolling your own SOCKS proxy to stick between your browser and the destination site.

Now, I have to also endorse YSlow from Yahoo.  YSlow adds interpretation and recommendations to Firebug's raw data.

For example, when I point YSlow at www.google.com, here's how it "grades" Google's performance:

Google gets an A for performance

Not bad.  On the other hand, www.target.com doesn't fare as well.

Target gets an F for performance

Along with the high-level recommendations, YSlow will also tally up the page weight, including a nice breakdown of cached versus non-cached requests and download size.

Cache stats for Target.com

There are so many good reasons to use this tool. In Release It, I spend a lot of time talking about the money companies waste on bloated HTML and unnecessary page requests.  Fat pages hurt users and they hurt companies.  Users don't want to wait for all your extra whitespace, table-formatting, and shims to download.  Companies shouldn't have to pay for all the added, useless bandwidth.  YSlow is a great tool to help eliminate the bloat, speed up page delivery, and make happy users.

The 5 A.M. Production Problem

I've got a new piece up at InfoQ.com, discussing the limits of unit and functional testing: 

"Functional testing falls short, however, when you want to build software to survive the real world. Functional testing can only tell you what happens when all parts of the system are behaving within specification. True, you can coerce a system or subsystem into returning an error response, but that error will still be within the protocol! If you're calling a method on a remote EJB that either returns "true" or "false" or it throws an exception, that's all it will do. No amount of functional testing will make that method return "purple". Nor will any functional test case force that method to hang forever, or return one byte per second.

One of my recurring themes in Release It is that every call to another system, without exception, will someday try to kill your application. It usually comes from behavior outside the specification. When that happens, you must be able to peel back the layers of abstraction, tear apart the constructed fictions of "concurrent users", "sessions", and even "connections", and get at what's really happening."
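
One way to exercise that out-of-specification behavior is a test harness that misbehaves on purpose. The sketch below, in Python, starts a local server that accepts connections and then never answers, and a client call that refuses to wait forever. The function names, the `ping` payload, and the timeout value are all hypothetical.

```python
import socket
import threading


def start_silent_server():
    """Listen on a free localhost port, accept connections, and never reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(5)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # the listener was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener, listener.getsockname()[1]


def call_remote(port, timeout):
    """A client call with a timeout; returns None if the server never answers."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as sock:
        sock.sendall(b"ping\n")
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None  # the caller decides how to degrade


listener, port = start_silent_server()
print(call_remote(port, timeout=0.5))
```

Remove the timeout and the same call hangs indefinitely, which is exactly the failure that a functional test against a well-behaved stub never shows you.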

Release It holding strong at Amazon

Well, Release It continues to hold the #1 spot in Amazon's "Hot New Releases" list for Design Tools and Techniques.  I've even got a couple of five-star reviews... and they weren't written by friends or family.

Release It! is shipping

Release It is now shipping!  People who ordered directly from The Pragmatic Programmers are receiving their hardcopies now.  It will take Amazon and Barnes and Noble a few days or a week to work the inventory through their supply chain, but they should be shipping soon, too!


Flash Mobs and TCP/IP Connections

In Release It, I talk about users and the harm they do to our systems.  One of the toughest types of user to deal with is the flash mob.  A flash mob often results from Attacks of Self-Denial, like when you suddenly offer a $3000 laptop for $300 by mistake.

When a flash mob starts to arrive, you will suddenly see a surge of TCP/IP connection requests at your load-distribution layer.  If the mob arrives slowly enough (less than 1,000 connections per second) then the app servers will be hurt the most.  For a really fast mob, like when your site hits the top spot on digg.com, you can get way more than 1,000 connections per second.  This puts the hurt on your web servers.

As the TCP/IP connection requests arrive, the OS queues them for servicing by the application.  The server's TCP/IP stack sends back the SYN/ACK packet and the connection is established.  (There's a third step, but we can skip it for the moment.)  On most modern stacks the kernel does this on its own, and the established connection waits in the queue until the application gets around to calling "accept" on the server socket.  Once "accept" returns, the server hands the established connection off to a worker thread to process the request.  Meanwhile, the thread that accepted the connection goes back to accept the next one.

Well, when a flash mob arrives, the connection requests arrive faster than the application can accept and dispatch them.   The TCP/IP stack protects itself by limiting the number of pending connection requests, so if the requests arrive faster than the application can accept them, the queue will grow until the stack has to start refusing connection requests.  At that point, your server will be returning intermittent errors and you're already failing.

The solution is much easier said than done: accept and dispatch connections faster than they arrive.
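
A toy simulation makes the arithmetic of the pending-connection queue concrete. It is a sketch, not a model of any particular TCP/IP stack: time advances in fixed ticks, the rates are constant, and the backlog limit of 128 is just a plausible-looking number.

```python
def simulate_backlog(arrivals_per_tick, accepts_per_tick, backlog_limit, ticks):
    """Return (connections still queued, connections refused) after `ticks`."""
    queued = 0
    refused = 0
    for _ in range(ticks):
        queued += arrivals_per_tick
        if queued > backlog_limit:
            # The stack protects itself by turning away the overflow.
            refused += queued - backlog_limit
            queued = backlog_limit
        # The application accepts and dispatches as many as it can.
        queued -= min(queued, accepts_per_tick)
    return queued, refused


# Keeping pace with arrivals: nothing waits, nothing is refused.
print(simulate_backlog(5, 5, 128, 1000))  # (0, 0)

# Falling behind by one connection per tick: the queue fills after
# 124 ticks, and from then on every tick refuses a connection.
print(simulate_backlog(5, 4, 128, 1000))  # (124, 876)
```

A bigger backlog only delays the first refusal. The long-run refusal rate is set entirely by the gap between the arrival rate and the accept rate.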

Filip Hanik compares some popular open-source servlet containers to see how well they stand up to floods of connection requests.  In particular, he demonstrates the value of Tomcat 6's new NIO connector.  Thanks to some very careful coding, this connector can accept 4,000 connections in 4 seconds on one server.  Ultimately, he gets it to accept 16,000 concurrent connections on a single server.  (Not surprisingly, RAM becomes the limiting factor.)

It's not clear that these connections can actually be serviced at that point, but that's a story for another day.

Release It! is released!

"Release It!" has been officially announced in this press release.  Andy Hunt, my editor, also posted announcements to several mailing lists.

It's been a long road, so I'm thrilled to see this release.

When you release a new software system, that's not the end of the process, but just the beginning of the system's life.  It is the same thing here.  Though it's taken me two years to get this book done and on the market, this is not the end of the book's creation, but the beginning of its life.


Self-Inflicted Wounds

My friend and colleague Paul Lord said, "Good marketing can kill you at any time."

He was describing a failure mode that I discuss in Release It!: Design and Deploy Production-Ready Software as "Attacks of Self-Denial".  These have all the characteristics of a distributed denial-of-service attack (DDoS), except that a company asks for it.  No, I'm not blaming the victim for electronic vandalism... I mean, they actually ask for the attack.

The anti-pattern goes something like this: marketing conceives of a brilliant promotion, which they send to 10,000 customers.  Some of those 10,000 pass the offer along to their friends.  Some of them post it to sites like FatWallet or TechBargains.  On the appointed day, hour, and minute, the site has a date with destiny as a million or more potential customers hit the deep link that marketing sent around in the email.  You know, the one that bypasses the content distribution network, embeds a session ID in the URL, and uses SSL?

Nearly every retailer I know has done this to themselves at one point.  Two holidays ago, one of my clients did it to themselves, when they announced that XBox 360 preorders would begin at a certain day and time.  Between actual customers and the amateur shop-bots that the tech-savvy segment cobbled together, the site got crushed.  (Yes, this was one where marketing sent the deep link that bypassed all the caching and bot-traps.)

Last holiday, Amazon did it to themselves when they discounted the XBox 360 by $300.  (What is it about the XBox 360?)  They offered a thousand units at the discounted price and got ten million shoppers.  All of Amazon was inaccessible for at least 20 minutes.  (It may not sound like much, but some estimates say Amazon generates $1,000,000 per hour during the holiday season, so that 20 minute outage probably cost them around $330,000!)

In Release It!, I discuss some non-technical ways to mitigate this behavior, as well as some design and architecture patterns you can apply to minimize damage when one of these Attacks of Self-Denial occurs.

How to become an "architect"

Over at The Server Side, there's a discussion about how to become an "architect".  Though TSS comments often turn into a cesspool, I couldn't resist adding my own two cents.

I should also add that the title "architect" is vastly overused.  It's tossed around like a job grade on the technical ladder: associate developer, developer, senior developer, architect.  If you talk to a consulting firm, it goes more like: senior consultant (1 - 2 years experience), architect (3 - 5 years experience), senior technical architect (5+ years experience).  Then again, I may just be too cynical.

There are several qualities that the architecture of a system should have:

  1. Shared.  All developers on the team should have more or less the same vision of the structure and shape of the overall system.
  2. Incremental.  Grand architecture projects lead only to grand failures.
  3. Adaptable. Successful architectures can be used for purposes beyond their designers' original intentions.  (Examples: Unix pipes, HTTP, Smalltalk)
  4. Visible.  The "sacred, invisible architecture" will fall into disuse and disrepair.  It will not outlive its creator's tenure or interest.

Is the designated "architect" the only one who can produce these qualities?  Certainly not.  He/she should be the steward of the system, however, leading the team toward these qualities, along with the other -ilities, of course.

Finally, I think the most important qualification of an architect should be: someone who has created more than one system and lived with it in production.  Note that automatically implies that the architect must have at least delivered systems into production.  I've run into "architects" who've never had a project actually make it into production, or if they have, they've rolled off the project---again with the consultants---just as Release 1.0 went out the door.

In other words, architects should have scars. 

Planning to Support Operations

In 2005, I was on a team doing application development for a system that would be deployed to 600 locations. About half of those locations would not have network connections. We knew right away that deploying our application would be key, particularly since it is a "rich-client" application. (What we used to call a "fat client", before they became cool again.) Deployment had to be done by store associates, not IT. It had to be safe, so that a failed deployment could be rolled back before the store opened for business the next day. We spent nearly half of an iteration setting up the installation scripts and configuration. We set our continuous build server up to create the "setup.exe" files on every build. We did hundreds of test installations in our test environment.

Operations said that our software was "the easiest installation we've ever had." Still, that wasn't the end of it. After the first update went out, we asked operations what could be done to improve the upgrade process. Over the next three releases, we made numerous improvements to the installers:

  • Make one "setup.exe" that can install either a server or a client, and have the installer itself figure out which one to do.
  • Abort the install if the application is still running. This turned out to be particularly important on the server.
  • Don't allow the user to launch the application twice. Very hard to implement in Java. We were fortunate to find an installer package that made this a check-box feature in the build configuration file!
  • Don't show a blank Windows command prompt window. (An artifact of our original .cmd scripts that were launching the application.)
  • Create separate installation discs for the two different store brands.
  • When spawning a secondary application, force its window to the front, avoiding the appearance of a hang if the user accidentally gives focus to the original window.

These changes reduced support call volume by nearly 50%.

My point is not to brag about what a great job we did. (Though we did a great job.) To keep improving our support for operations, we deliberately set aside a portion of our team capacity each iteration. Operations had an open invitation to our iteration planning meetings, where they could prioritize and select story cards the same as our other stakeholders. In this manner, we explicitly included Operations as a stakeholder in application construction. They consistently brought us ideas and requests that we, as developers, would not have come up with.

Furthermore, we forged a strong bond with Operations. When issues arose---as they always will---we avoided all of the usual finger-pointing. We reacted as one team, instead of two disparate teams trying to avoid responsibility for the problems. I attribute that partly to the high level of professionalism in both development and operations, and partly to the strong relationship we created through the entire development cycle.

Inviting Domestic Disaster

We had a minor domestic disaster this morning. It's not unusual. With four children, there's always some kind of crisis. Today, I followed a trail of water along the floor to my youngest daughter. She was shaking her "sippy cup" upside down, depositing a full cup of water on the carpet... and on my new digital grand piano. 

Since the entire purpose of the "sippy cup" is to contain the water, not to spread it around this house, this was perplexing.

On investigation, I found that this failure in function actually mimicked common dynamics of major disasters. In Inviting Disaster, James R. Chiles describes numerous mechanical and industrial disasters, each with a terrible cost in lives. In Release It, I discuss software failures that cost millions of dollars---though, thankfully, no lives. None of these failures come as a bolt from the blue. Rather, each one has precursor incidents: small issues whose significance is only obvious in retrospect. Most of these chains of events also involve humans and human interaction with the technological environment.

The proximate cause of this morning's problem was inside the sippy cup itself. The removable valve was inserted into the lid backwards, completely negating its purpose. A few weeks earlier, I had pulled a sippy cup from the cupboard with a similarly backward valve. I knew it had been assembled by my oldest, who has the job of emptying the dishwasher, so I made a mental note to provide some additional instruction. Of course, mental notes are only worth the paper they're written on. I never did get around to speaking with her about it.

Today, my wonderful mother-in-law, who is visiting for the holidays, filled the cup and gave it to my youngest child. My mother-in-law, not having dealt with thousands of sippy cup fillings, as I have, did not notice the reversed valve, or did not catch its significance.

My small-scale mess was much easier to clean up than the disasters in "Release It!" or "Inviting Disaster". It shared some similar features, though. The individual with experience and knowledge to avert the problem--me--was not present at the crucial moment. The preconditions were created by someone who did not recognize the potential significance of her actions. The last person who could have stopped the chain of events did not have the experience to catch and stop the problem. Change any one of those factors and the crisis would not have occurred.

Book Completed

I'm thrilled to report that my book is now out of my hands and into the hands of copy editors and layout artists.

It's been a long trip. At the beginning, I had no idea just how much work was needed to write an entire book. I started this project 18 months ago, with a sample chapter, a table of contents, and a proposal. That was a few hundred pages, three titles, and a thousand hours ago.

Now "Release It! Design and Deploy Production-Ready Software" is close to print. Even in these days of the permanent ephemerality of electronic speech, there's still something incomparably electric about seeing your name in print.

Along with publication of the book, I will be making some changes to this blog. First, it's time to find a real home. That means a new host, but it should be transparent to everyone but me. Second, I will be adding non-blog content: excerpts from the book, articles, and related content. (I have some thoughts about capacity management that need a home.) Third, if there is interest, I will start a discussion group or mailing list for conversation about survivable software.